Title: RAG Evaluation

  • Deep Dive into RAG
  • Improving Pre-Retrieval Processes
  • Improving Retrieval Processes
  • Improving Post-Retrieval Processes
  • RAG Evaluation
  • Further Reading: WOG RAG Playbook



Intro

  • ✦ With so many ways to tune our RAG pipelines, how do we know which changes actually lead to better performance?

  • ✦ Ragas is one of the frameworks designed to assess RAG-based applications.

    • It is a framework that provides the necessary ingredients to evaluate our RAG pipeline at the component level.
    • Ragas provides you with tools, based on the latest research on evaluating LLM-generated text, to give you insights into your RAG pipeline.

Evaluation framework for RAG

  • ✦ What’s interesting about Ragas is that it started out as a framework for “reference-free” evaluation: instead of relying on human-annotated ground truth labels in the evaluation dataset, Ragas leverages LLMs under the hood to conduct the evaluations.

    • ✦ To evaluate the RAG pipeline, Ragas expects the following information:

      • question: The user query that is the input of the RAG pipeline.
      • answer: The generated answer from the RAG pipeline.
      • contexts: The contexts retrieved from the external knowledge source used to answer the question.
      • ground_truths: The ground truth answer to the question. This is the only human-annotated information, and it is only required for some of the metrics.
    • ✦ Leveraging LLMs for reference-free evaluation is an active research topic.

      • While using as little human-annotated data as possible makes it a cheaper and faster evaluation method, there is still some discussion about its shortcomings, such as bias.
      • However, some papers have already shown promising results. If you are interested, you can read more in the “Related Work” section of this Ragas paper.
  • ✦ Note that the framework has since expanded to provide metrics and paradigms that require ground truth labels (e.g., context_recall and answer_correctness).

  • ✦ Additionally, the framework provides you with tooling for automatic test data generation.
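
As a rough illustration of that test data generation tooling, the sketch below follows the Ragas 0.1.x API (TestsetGenerator.from_langchain, generate_with_langchain_docs, and the simple/reasoning/multi_context question evolutions). The module paths, signatures, file path, and model names here are assumptions and vary across releases, so check the documentation for the version you have installed.

from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Load the documents behind your RAG pipeline (path is illustrative).
documents = TextLoader("data/knowledge_base.txt").load()

# One LLM drafts candidate questions, a second one critiques and filters them.
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-3.5-turbo"),
    critic_llm=ChatOpenAI(model="gpt-4"),
    embeddings=OpenAIEmbeddings(),
)

# Generate a small synthetic test set with a mix of question types.
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
testset.to_pandas()  # question / contexts / ground_truth columns, ready for evaluation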




Evaluation Metrics

Ragas provides you with a number of metrics to evaluate a RAG pipeline both component-wise and end-to-end.

On a component level, Ragas provides you with metrics to evaluate the retrieval component (context_relevancy and context_recall) and the generative component (faithfulness and answer_relevancy) separately.

Most (if not all) metrics are scaled to the range of 0 to 1, with higher values indicating better performance.

Ragas also provides you with metrics to evaluate the RAG pipeline end-to-end, such as answer semantic similarity and answer correctness.
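
Tying these levels together, the sketch below passes both component-level and end-to-end metrics to a single evaluate() call. It assumes a Ragas version that still exposes context_relevancy (newer releases favour context_precision) and that OPENAI_API_KEY is set, as in the Quick Start further down; the single-sample dataset is only there to make the snippet self-contained.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_relevancy,   # retrieval: is the retrieved context relevant to the question?
    context_recall,      # retrieval: does the retrieved context cover the ground-truth answer?
    faithfulness,        # generation: is the answer grounded in the retrieved context?
    answer_relevancy,    # generation: does the answer actually address the question?
    answer_similarity,   # end-to-end: semantic similarity to the ground-truth answer
    answer_correctness,  # end-to-end: factual and semantic agreement with the ground truth
)

# Minimal single-sample dataset; in practice this comes from running your RAG
# pipeline over an evaluation set (see the Quick Start below for the columns).
dataset = Dataset.from_dict({
    'question': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'contexts': [['The First AFL–NFL World Championship Game was played on January 15, 1967.']],
    'ground_truth': ['The first superbowl was held on January 15, 1967'],
})

result = evaluate(
    dataset,
    metrics=[
        context_relevancy, context_recall,      # retrieval component
        faithfulness, answer_relevancy,         # generative component
        answer_similarity, answer_correctness,  # end-to-end
    ],
)
result.to_pandas()  # one row per sample, one column per metric, scores scaled to [0, 1]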


Installation

pip install ragas

Quick Start


from datasets import Dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# A toy evaluation set: each sample needs the question, the generated answer,
# the retrieved contexts, and (for answer_correctness) a ground-truth answer.
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

dataset = Dataset.from_dict(data_samples)

# Evaluate the generated answers with two metrics and inspect per-sample scores.
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
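
For orientation on the output (assuming the Ragas versions targeted above): evaluate() returns a dict-like result holding the aggregate score per metric, while to_pandas() expands it into one row per sample with a column per metric, which makes it easy to spot the questions that drag a score down.

print(score)            # aggregate scores per metric, e.g. {'faithfulness': ..., 'answer_correctness': ...}
df = score.to_pandas()  # per-sample breakdown: dataset columns plus one column per metric
print(df.head())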

Visit the documentation here: Introduction | Ragas