icon: LiWrench
Title: RAG Evaluation
✦ With so many ways to tune a RAG pipeline, how do we know which changes actually lead to better performance?
✦ Ragas is one of the frameworks designed to assess RAG-based applications.
✦ What’s interesting about Ragas is that it started out as a framework for “reference-free” evaluation. That means that instead of relying on human-annotated ground truth labels in the evaluation dataset, Ragas leverages LLMs under the hood to conduct the evaluations.
✦ To evaluate the RAG pipeline, Ragas expects the following information:
question: The user query that is the input of the RAG pipeline.
answer: The generated answer from the RAG pipeline.
contexts: The contexts retrieved from the external knowledge source that were used to answer the question.
ground_truths: The ground truth answer to the question. This is the only human-annotated information, and it is only required for some of the metrics.
✦ Leveraging LLMs for reference-free evaluation is an active research topic.
✦ Note that the framework has since expanded to provide metrics and paradigms that require ground truth labels (e.g., context_recall and answer_correctness).
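As a toy illustration of these fields, a single evaluation record can be expressed as a plain Python dictionary (the values here are invented for illustration; note that newer Ragas releases name the human-annotated column ground_truth, as in the quickstart at the end of this section):

# One record with the four pieces of information Ragas expects.
sample = {
    "question": "When was the first super bowl?",
    "answer": "The first superbowl was held on Jan 15, 1967",
    "contexts": [
        "The First AFL–NFL World Championship Game was played on January 15, 1967 ...",
    ],
    # Human-annotated reference answer; only needed for some metrics.
    "ground_truths": ["The first superbowl was held on January 15, 1967"],
}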
✦ Additionally, the framework provides you with tooling for automatic test data generation.
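The test data generation API has changed between Ragas releases, so treat the following as a rough sketch rather than the definitive interface; it assumes the 0.1.x-style TestsetGenerator and a list of LangChain documents loaded from your own knowledge source:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# `documents` is assumed to be a list of LangChain Document objects
# loaded from your own knowledge source.
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,  # number of synthetic question/answer samples to create
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
testset.to_pandas()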
Ragas provides you with a few metrics to evaluate a RAG pipeline component-wise as well as end to end.
On a component level, Ragas provides metrics to evaluate the retrieval component (context_relevancy and context_recall) and the generative component (faithfulness and answer_relevancy) separately; a component-wise evaluation sketch follows the quickstart below.
Most (if not all) metrics are scaled to the range between 0 and 1, with higher values indicating better performance.
Ragas also provides metrics to evaluate the RAG pipeline end to end, such as answer semantic similarity and answer correctness.
pip install ragas

from datasets import Dataset
import os

from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Toy evaluation data: questions, generated answers, retrieved contexts,
# and human-annotated ground truth answers.
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967',
               'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967',
                     'The New England Patriots have won the Super Bowl a record six times'],
}

# Wrap the samples in a Hugging Face Dataset and run the evaluation.
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
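Building on the quickstart above, the component-level metrics mentioned earlier can be passed to the same evaluate() call. This is a minimal sketch assuming the metric names exposed by ragas.metrics (in newer releases context_relevancy has been superseded by context_precision, which is used here); context_recall relies on the ground_truth column being present:

from ragas.metrics import (
    faithfulness,       # generation: is the answer grounded in the retrieved contexts?
    answer_relevancy,   # generation: does the answer actually address the question?
    context_precision,  # retrieval: are the retrieved contexts relevant to the question?
    context_recall,     # retrieval: do the contexts cover the ground truth answer?
)

# Reuses `dataset` from the quickstart above.
component_scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
component_scores.to_pandas()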
Visit the Ragas documentation for more details: https://docs.ragas.io/