Title: Overview - Plan of Attack

  • Deep Dive into RAG
  • Improving Pre-Retrieval Processes
  • Improving Retrieval Processes
  • Improving Post-Retrieval Processes
  • RAG Evaluation
  • Further Reading: WOG RAG Playbook




1 Overview

  • ✦ Retrieval Augmented Generation (RAG) is emerging as a crucial framework for industries and GenAI practitioners or enthusiasts to develop applications powered by Large Language Models (LLMs).
    • It makes it possible to use LLMs efficiently in GenAI applications such as chatbots and search engines, as we saw in the previous topic.
    • RAG enables the dynamic integration of external knowledge sources, enhancing the accuracy and relevance of responses generated by these applications.

1.1 Why RAG

  • ✦ Pre-trained foundational Large Language Models (LLMs) are developed using general-purpose data, enabling them to produce accurate and relevant responses to broad queries.
    • However, when it comes to domain-specific, external, and the most current data, these models may fall short.
    • In such instances, LLMs might generate incorrect or misleading information due to their reliance on outdated or irrelevant data sources.
    • To address these challenges, Retrieval Augmented Generation (RAG) techniques have been explored.

1.2 Quick Recaps on the Basics of RAG

  • ✦ Retrieval Augmented Generation (RAG) is a framework that enhances the capabilities of LLMs by providing them with additional, relevant contextual information alongside the original query in the form of a prompt.
    • RAG involves searching across a vast corpus of private data and retrieving results that are most similar to the query asked by the end user so that it can be passed on to the LLM as context.
    • This approach enables the LLM to better understand the user's query and its context, leading to more accurate and pertinent responses.
    • The process is akin to an open-book exam, where the model first retrieves relevant contextual data before generating an answer based on this information.

  • ✦ There are 5 main steps in RAG:
    • Document Loading
      • In this initial step, relevant documents are ingested and prepared for further processing. This process typically occurs offline.
    • Splitting & Chunking
      • The text from the documents is split into smaller chunks or segments.
      • These chunks serve as the building blocks for subsequent stages.
    • Storage
      • The embeddings (vector representations) of these chunks are created and stored in a vector store.
      • These embeddings capture the semantic meaning of the text.
    • Retrieval
      • When an online query arrives, the system retrieves relevant chunks from the vector store based on the query.
      • This retrieval step ensures that the system identifies the most pertinent information.
    • Generate Output
      • Finally, the retrieved chunks are passed to the LLM together with the original query, and the LLM uses them to generate a coherent response.
      • This output can be in the form of natural language text, summaries, or other relevant content.
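
The five steps above can be sketched end to end in a few lines of Python. This is only a toy illustration under stated assumptions: the documents are hypothetical in-memory strings, the "embedding" is a simple bag-of-words count rather than a real embedding model, the vector store is a plain list, and the final step only builds the prompt that would be sent to an LLM. All function names are made up for this sketch.

```python
import math
import re
from collections import Counter

def load_documents():
    # Step 1: Document Loading (hypothetical in-memory documents stand in for files)
    return [
        "RAG retrieves relevant chunks from a vector store. The chunks are passed to the LLM as context.",
        "Embeddings capture the semantic meaning of text. Similar texts have similar embeddings.",
    ]

def split_into_chunks(document):
    # Step 2: Splitting & Chunking (one sentence per chunk, for simplicity)
    return [s.strip() for s in re.split(r"(?<=\.)\s+", document) if s.strip()]

def embed(text):
    # Toy "embedding": a bag-of-words count vector, NOT a real embedding model
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_vector_store(documents):
    # Step 3: Storage (a plain list of (chunk, vector) pairs stands in for a vector store)
    store = []
    for doc in documents:
        for chunk in split_into_chunks(doc):
            store.append((chunk, embed(chunk)))
    return store

def retrieve(store, query, k=2):
    # Step 4: Retrieval (rank stored chunks by similarity to the query, keep the top k)
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, chunks):
    # Step 5: Generate Output (here we only build the prompt an LLM would receive)
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

store = build_vector_store(load_documents())
top_chunks = retrieve(store, "What do embeddings capture?", k=1)
prompt = build_prompt("What do embeddings capture?", top_chunks)
print(prompt)
```

In a real pipeline, `embed` would call an embedding model, the list would be replaced by a vector store, and the prompt would be sent to an LLM.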



2 Problems with Naive RAG or Vanilla RAG

The basic or "Vanilla" RAG, also known as Naive RAG, exhibits several limitations, particularly when applied to complex use cases or in the development of production-ready applications.

 
As we saw in the previous topic, building a RAG prototype is relatively easy: investing around 20% of the effort yields an application with 80% performance. However, achieving the remaining 20% of performance requires the other 80% of the effort.

 
Below are some key reasons why Naive RAG may not always deliver the most effective and optimized outcomes.


2.1 Contextual Limitations

  • ✦ One of the primary issues with Naive RAG is its handling of context.
  • ✦ In this framework, a single chunk retrieved from a vector store is expected to provide the necessary context for the LLM to generate a response.
  • ✦ However, these chunks, being mere subunits of a larger document, often contain incomplete context.
  • ✦ This partial context can result in responses that lack crucial information or, conversely, include irrelevant details, thereby diminishing the overall relevance and accuracy of the output.
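
A small sketch can make the incomplete-context problem concrete. The document text below is hypothetical, and the sentence-level splitter is a deliberately simple assumption. The `with_neighbors` helper is an illustrative name: it shows one possible mitigation (returning the neighboring chunks along with the hit) and is not a prescribed method.

```python
import re

document = (
    "Plan A costs $10 per month. It includes 5 GB of storage. "
    "Plan B costs $25 per month. It includes 50 GB of storage."
)

# Naive sentence-level chunking: each chunk is cut off from its neighbors
chunks = re.split(r"(?<=\.)\s+", document)

def with_neighbors(chunks, index, window=1):
    # Return the chunk at `index` together with `window` chunks on each side
    start = max(0, index - window)
    end = min(len(chunks), index + window + 1)
    return " ".join(chunks[start:end])

# Suppose retrieval returned only this chunk for "How much storage is in Plan B?"
hit = 3
print(chunks[hit])                  # the chunk alone never says which plan "It" refers to
print(with_neighbors(chunks, hit))  # the preceding chunk restores the missing subject
```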

2.2 Relevance vs. Similarity

  • ✦ Another significant challenge is the distinction between relevance and similarity.
  • ✦ Naive RAG operates on the principle that a high similarity score between a query and a retrieved chunk indicates a high degree of relevance.
  • ✦ Unfortunately, this is not always the case.
  • ✦ A chunk may share many keywords or concepts with the query and still fail to address the user's actual intent, leading to responses that, while technically similar, are practically unhelpful.
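
The gap between similarity and relevance can be illustrated with a toy example. The query and chunks below are hypothetical, and the bag-of-words "embedding" is an assumption made to keep the sketch self-contained. Real embedding models capture far more meaning than word counts, but the same kind of mismatch can still occur.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector, NOT a real embedding model
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "How do I cancel my subscription?"
# Shares most of its words with the query but answers a different question
similar_but_unhelpful = "How do I upgrade my subscription plan?"
# Answers the user's intent but shares no words with the query
relevant_but_dissimilar = "To end your membership, open Settings and choose Close Account."

q = embed(query)
score_unhelpful = cosine_similarity(q, embed(similar_but_unhelpful))
score_relevant = cosine_similarity(q, embed(relevant_but_dissimilar))

print(f"similar but unhelpful:   {score_unhelpful:.2f}")
print(f"relevant but dissimilar: {score_relevant:.2f}")
```

Ranking purely by this score would put the unhelpful chunk first.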

2.3 Query Complexity

  • ✦ The effectiveness of Naive RAG is also compromised by the complexity and structure of user queries.
  • ✦ Queries that are poorly phrased, overly complex, or contain multiple questions pose a particular challenge.
  • ✦ In such instances, the LLM may struggle to generate accurate responses on the first attempt, if at all.
  • ✦ The limitation of retrieving a single chunk exacerbates this issue, as it is unlikely to provide sufficient context for all aspects of a multifaceted query.
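
As a rough illustration of why multi-question queries are hard, the sketch below splits a compound query into sub-queries that could each be retrieved for separately. The rule-based splitter and its name `split_questions` are assumptions made for this example. It is a stand-in only: it handles nothing beyond question marks.

```python
import re

def split_questions(query):
    # Naive decomposition: treat each question-mark-terminated span as its own sub-query
    parts = re.findall(r"[^?]+\?", query)
    # Fall back to the whole query when it contains no question mark
    return [p.strip() for p in parts] or [query.strip()]

query = "What is the leave policy? How do I submit a claim?"
sub_queries = split_questions(query)

for sub_query in sub_queries:
    # Each sub-query would be sent to the retriever on its own
    print(sub_query)
```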

2.4 Impact of Chunk Ordering

  • ✦ The order in which chunks are presented to the LLM for response generation plays a critical role in the quality of the output.
  • ✦ Naive RAG does not adequately address the significance of chunk ordering, often leading to suboptimal response generation.
  • ✦ The lack of a sophisticated mechanism for determining the most effective sequence of chunks can result in responses that are disjointed or fail to build coherently on the provided context.
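
One ordering heuristic that is sometimes used is to place the most relevant chunks at the beginning and end of the context and the least relevant ones in the middle. The sketch below assumes the chunks have already been ranked from most to least relevant. The function name `reorder_for_context` is hypothetical, and this is only one possible ordering strategy.

```python
def reorder_for_context(chunks_by_relevance):
    # Input is sorted from most relevant (index 0) to least relevant.
    # Even-ranked chunks fill the front, odd-ranked chunks fill the back in reverse,
    # so the weakest chunks end up in the middle of the context.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
reordered = reorder_for_context(ranked)
print(reordered)
```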


Key Points

To overcome these limitations of Naive RAG, two things are essential:

    1. improving on the Naive RAG architecture
    2. evaluating the RAG pipeline to understand how each modification affects performance



3 Improvements over Naive RAG

RAG is only as good as the relevance and quality of the retrieved documents. Fortunately, an emerging set of techniques can be employed to design and improve RAG systems.

 
Improving RAG is not just a matter of incremental updates, such as installing a newer Python package or calling functions out of the box. Many of the improvements involve a comprehensive rethinking of its architecture and processes.

We can group the various improvements under 3 major categories:

  • ✦ Pre-Retrieval Processes
  • ✦ Retrieval Process
  • ✦ Post-Retrieval Process

Each of these will be discussed in the next 3 notes.

You might also be interested in the GovTech playbook included in 6. Further Readings - WOG RAG Playbook, which reports the results of experiments with different techniques on two specific use cases. The playbook can serve as a general reference point for starting your own experiments, particularly for the techniques that showed the greatest improvement in the accuracy and capability of the RAG pipeline.




4 Evaluation of RAG

  • Evaluation of RAG systems is essential to benchmark the overall performance of RAG output.

  • To evaluate RAG we can use metrics like:

    • answer relevancy and faithfulness for generation,
    • context recall and context precision for retrieval.
  • These metrics provide a structured way to assess the quality of the generated answers and the relevance of the information retrieved by the system.

    • However, the complexity and variability of RAG systems necessitate a more comprehensive and nuanced approach to evaluation.
  • Enter RAGAS, a framework specifically designed for this purpose.

    • RAGAS offers a suite of tools and metrics tailored to evaluate RAG-based applications at a component level, giving developers and enthusiasts a clear way to assess their applications systematically and fine-tune them with more confidence.
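
To give a feel for the retrieval-side metrics, here is a simplified, label-based sketch of precision and recall over retrieved chunks. It assumes a hand-labeled set of relevant chunk IDs for a query, and all IDs are hypothetical. These functions are not the RAGAS implementations, which are computed differently, but the underlying idea of comparing what was retrieved against what should have been retrieved is similar.

```python
def context_precision(retrieved, relevant):
    # Fraction of retrieved chunks that are actually relevant
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    # Fraction of relevant chunks that were retrieved
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

# Hypothetical chunk IDs for a single query
retrieved = ["chunk_1", "chunk_4", "chunk_7"]
relevant = {"chunk_1", "chunk_2", "chunk_7", "chunk_9"}

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the three retrieved chunks are relevant, and two of the four relevant chunks were retrieved.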

We will go into the details of RAG evaluation in 5. RAG Evaluation.