RAG Evaluation · Data Alchemy

Hi everyone,

I'm sure many of you are deeply involved in experimenting with or even implementing various RAG (Retrieval-Augmented Generation) systems. You've likely noticed how challenging it is to compare different systems or to make definitive statements about their effectiveness.

I'm curious to know what evaluation methods you've tried and which ones you prefer.

Personally, I've developed a small framework based on the approach suggested by Databricks (https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG). It grades each answer on correctness, completeness, and readability, then combines the three grades into a single score.
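
To make the idea concrete, here is a minimal sketch of how such a combined score could be computed. It is an illustration only, not the actual framework: the weights, the 0 to 3 grading scale, and the names `Grades`, `combined_score`, and `mean_score` are all assumptions made for this example, and the per-criterion grades are assumed to come from an LLM judge or a human rater.

```python
from dataclasses import dataclass

# Illustrative weights (an assumption for this sketch); they must sum to 1.
WEIGHTS = {"correctness": 0.6, "completeness": 0.2, "readability": 0.2}
MAX_GRADE = 3  # assumed grading scale: integers 0..3 per criterion


@dataclass
class Grades:
    """Per-answer grades, e.g. produced by an LLM judge (hypothetical structure)."""
    correctness: int
    completeness: int
    readability: int


def combined_score(grades: Grades) -> float:
    """Weighted average of the three grades, normalised to the range 0..1."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(grades, name)
        if not 0 <= value <= MAX_GRADE:
            raise ValueError(f"{name} must be between 0 and {MAX_GRADE}, got {value}")
        total += weight * value
    return total / MAX_GRADE


def mean_score(all_grades: list[Grades]) -> float:
    """Average combined score over a whole evaluation set."""
    if not all_grades:
        raise ValueError("need at least one graded answer")
    return sum(combined_score(g) for g in all_grades) / len(all_grades)


if __name__ == "__main__":
    # Two RAG systems graded on the same two questions (made-up grades).
    system_a = [Grades(3, 2, 3), Grades(2, 2, 3)]
    system_b = [Grades(1, 3, 3), Grades(2, 1, 2)]
    print("system A:", round(mean_score(system_a), 3))
    print("system B:", round(mean_score(system_b), 3))
```

Weighting correctness most heavily reflects the view that a fluent but wrong answer should not score well; the exact split is a judgment call and worth tuning for your use case.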

I also tried the RAGAS project (https://github.com/explodinggradients/ragas), but I found it difficult to integrate with LLM APIs other than the official OpenAI GPT ones.

I'm looking forward to hearing your recommendations!
