icon: LiWrench
Title: RAG Evaluation
✦ With so many ways to tune a RAG pipeline, how do we know which changes actually lead to better performance?
✦ Ragas is one of the frameworks designed to assess RAG-based applications.
✦ What’s interesting about Ragas is that it started out as a framework for “reference-free” evaluation. That means that instead of relying on human-annotated ground truth labels in the evaluation dataset, Ragas leverages LLMs under the hood to conduct the evaluations.
✦ To evaluate the RAG pipeline, Ragas expects the following information:
question: The user query that is the input of the RAG pipeline.
answer: The generated answer from the RAG pipeline.
contexts: The contexts retrieved from the external knowledge source that were used to answer the question.
ground_truths: The ground truth answer to the question. This is the only human-annotated information, and it is only required for some of the metrics.
✦ Leveraging LLMs for reference-free evaluation is an active research topic.
✦ Note that the framework has since expanded to provide metrics and paradigms that require ground truth labels (e.g., context_recall and answer_correctness).
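As a toy illustration of these fields, a single evaluation record can be expressed as a plain Python dictionary (the values here are invented for illustration; note that newer Ragas releases name the human-annotated column ground_truth, as in the quickstart at the end of this section):

# One record with the four pieces of information Ragas expects.
sample = {
    "question": "When was the first super bowl?",
    "answer": "The first superbowl was held on Jan 15, 1967",
    "contexts": [
        "The First AFL–NFL World Championship Game was played on January 15, 1967 ...",
    ],
    # Human-annotated reference answer; only needed for some metrics.
    "ground_truths": ["The first superbowl was held on January 15, 1967"],
}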
✦ Additionally, the framework provides you with tooling for automatic test data generation.
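The test data generation API has changed between Ragas releases, so treat the following as a rough sketch rather than the definitive interface; it assumes the 0.1.x-style TestsetGenerator and a list of LangChain documents loaded from your own knowledge source:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# `documents` is assumed to be a list of LangChain Document objects
# loaded from your own knowledge source.
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,  # number of synthetic question/answer samples to create
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
testset.to_pandas()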
Ragas provides you with a few metrics to evaluate a RAG pipeline component-wise as well as end to end.
On a component level, Ragas provides metrics to evaluate the retrieval component (context_relevancy and context_recall) and the generative component (faithfulness and answer_relevancy) separately; a component-wise evaluation sketch follows the quickstart below.
Most (if not all) metrics are scaled to the range between 0 and 1, with higher values indicating better performance.
Ragas also provides metrics to evaluate the RAG pipeline end to end, such as answer semantic similarity and answer correctness.
pip install ragas

from datasets import Dataset
import os

from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Toy evaluation data: questions, generated answers, retrieved contexts,
# and human-annotated ground truth answers.
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967',
               'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967',
                     'The New England Patriots have won the Super Bowl a record six times'],
}

# Wrap the samples in a Hugging Face Dataset and run the evaluation.
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
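Building on the quickstart above, the component-level metrics mentioned earlier can be passed to the same evaluate() call. This is a minimal sketch assuming the metric names exposed by ragas.metrics (in newer releases context_relevancy has been superseded by context_precision, which is used here); context_recall relies on the ground_truth column being present:

from ragas.metrics import (
    faithfulness,       # generation: is the answer grounded in the retrieved contexts?
    answer_relevancy,   # generation: does the answer actually address the question?
    context_precision,  # retrieval: are the retrieved contexts relevant to the question?
    context_recall,     # retrieval: do the contexts cover the ground truth answer?
)

# Reuses `dataset` from the quickstart above.
component_scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
component_scores.to_pandas()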
Visit the Ragas documentation for more details: https://docs.ragas.io/