Title: Retrieval Augmented Generation (RAG)

Now that we understand how embeddings can be used to retrieve semantically related texts, it's time to explore probably the most popular and pragmatic application of embeddings: Retrieval Augmented Generation (RAG).

A Retrieval-Augmented Generation (RAG) system is a framework that enhances the accuracy and reliability of generative AI models by incorporating information from external sources.

Why Context Augmentation?

✦ LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on vast amounts of publicly available data, such as Wikipedia, mailing lists, textbooks, source code, and more.
✦ However, while LLMs are trained on a vast amount of data, they are not trained on your data, which may be private or specific to the problem you’re trying to solve. This data could be behind APIs, in SQL databases, or trapped in PDFs and slide decks.
✦ You might choose to fine-tune an LLM with your data, but:
- Training an LLM is expensive.
- Due to the cost of training, it's difficult to update an LLM with the latest information.
- Observability is lacking. When you ask an LLM a question, it's not clear how the LLM arrived at its answer.
✦ Instead of fine-tuning, you can use a context augmentation pattern called Retrieval-Augmented Generation (RAG) to obtain more accurate text generation relevant to your specific data.
- RAG involves the following high-level steps:
  1. Retrieve information from your data sources first.
  2. Add it to your question as context.
  3. Ask the LLM to answer based on the enriched prompt.
✦ By doing so, RAG overcomes all three weaknesses of the fine-tuning approach:
- There's no training involved, so it's inexpensive.
- Data is fetched only when you request it, ensuring it's always up-to-date.
- It's more explainable, as most RAG frameworks allow you to display the retrieved documents, making it more trustworthy.
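The "add it to your question as context" step can be sketched in a few lines. This is only an illustration under stated assumptions: the function name build_augmented_prompt, the prompt wording, and the sample chunks are all hypothetical, and the retrieval step is replaced by a hard-coded list.

```python
def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved text with the user's question into one enriched prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks, standing in for the output of a real retrieval step
chunks = [
    "Annual leave requests must be submitted 3 days in advance.",
    "Medical leave requires a certificate from a registered doctor.",
]

prompt = build_augmented_prompt("How early must I apply for annual leave?", chunks)
print(prompt)  # this enriched prompt is what would be sent to the LLM
```

Because the retrieved chunks are plain strings at this point, they can also be shown to the user alongside the answer, which is where the explainability benefit comes from.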

Using LangChain for RAG

LangChain provides a robust framework for building LLM applications. The framework includes many components to support common LLM operations such as prompt chaining, chat memory management, and, of course, RAG.

We recommend using LangChain or equivalent frameworks for implementing RAG, instead of writing your code from scratch. These frameworks often offer the following benefits:

Ready-to-use Components

✦ Components are modules and functions that handle many of the common operations in RAG, so we don't have to write the code from scratch.
- For example, LangChain provides components to easily read PDF or PowerPoint files, connect to databases, or get the transcript of a YouTube video.
✦ Many of these components are based on contributions from large communities and on research that has been shown to work effectively.
- For example, LangChain has a rich set of advanced techniques for retrieving relevant documents, such as Contextual Compression, Self Query, and Parent Document. Without the framework, someone would have to understand these techniques from research papers or code repositories and then translate them into Python code.
✦ Using a framework like LangChain allows us to focus on the business and application logic, so we can efficiently build and evaluate our proof-of-concept prototypes.

Community Support:

✦ Popular frameworks like LangChain have active communities that provide tutorials, examples, and documentation.
✦ Whether you're a beginner or an experienced developer, you'll find resources to guide you.

However, packages like LangChain are not without their shortcomings:

✦ Expect a learning curve to get familiar with the framework
- While LangChain provides powerful tools for RAG, it does require some initial learning. Developers need to understand the components, syntax, and best practices.
✦ They are still in active development and may break your code
- Updates may introduce changes, deprecate features, or even break backward compatibility.
- The documentation may also lag behind updates, and the available tutorials may be based on older versions of the framework.
- We suggest not changing the version of the installed package unless it's necessary and you're ready to fix any broken code.
✦ Less flexibility compared to writing your own code
- While LangChain streamlines RAG pipelines, it imposes certain constraints.
- Customization beyond the provided components may be limited or challenging.
- However, unless we are building something very unique, the components still serve as very useful building blocks for many common operations in LLM applications.

Overview of Steps in RAG

There are 5 main steps in a typical RAG pipeline:

1. Document Loading
- In this initial step, relevant documents are ingested and prepared for further processing.
2. Splitting & Chunking
- The text from the documents is split into smaller chunks or segments.
- These chunks serve as the building blocks for subsequent stages.
3. Storage
- The embeddings (vector representations) of these chunks are created and stored in a vector store.
- These embeddings capture the semantic meaning of the text.
4. Retrieval
- When an online query arrives, the system retrieves relevant chunks from the vector store based on the query.
- This retrieval step ensures that the system identifies the most pertinent information.
5. Output
- Finally, the retrieved chunks are used to generate a coherent response.
- This output can be in the form of natural language text, summaries, or other relevant content.

1. Document Loading

✦ Use document loaders to load data from a source as Document objects.
- A Document is a piece of text and associated metadata.
- For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.
✦ See the official documentation on LangChain's Document Loaders for the different kinds of loaders available for different sources.
✦ In this particular example, we are using one of the PDF loaders from LangChain to load the Prompt Engineering Playbook.

# Document loaders live in the langchain_community package in recent LangChain versions
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://www.developer.tech.gov.sg/products/collections/data-science-and-artificial-intelligence/playbooks/prompt-engineering-playbook-beta-v3.pdf")
pages = loader.load()

✦ The loader loads each page of the PDF file as a separate Document object, so pages[0] is the first page of the PDF.
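To make the "text plus metadata" idea concrete, here is a minimal sketch that does not need LangChain or a PDF file. The Document class below is a simplified stand-in that mirrors the page_content and metadata attributes of LangChain's Document; the load_pages helper, the page texts, and the file name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Simplified stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_pages(page_texts: list[str], source: str) -> list[Document]:
    """Hypothetical loader: wrap each page's text in its own Document."""
    return [
        Document(page_content=text, metadata={"source": source, "page": i})
        for i, text in enumerate(page_texts)
    ]


pages = load_pages(
    ["Cover page", "Chapter 1: Basics of prompting"],
    source="playbook.pdf",
)

first_page = pages[0]  # index 0 is the first page
print(first_page.page_content)
print(first_page.metadata)
```

The metadata travels with the text through the rest of the pipeline, which is what later makes it possible to show which file and page an answer came from.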

2. Splits

Once we have loaded the documents, we'll often want to transform them to better suit our application.

✦ The simplest example is splitting a long document into smaller chunks that can fit into your model's context window.
✦ LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.
✦ At a high level, text splitters work as follows:
1. Split the text up into small, semantically meaningful chunks (often sentences).
2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
✦ For example, the RecursiveCharacterTextSplitter from LangChain can split a given text, some_text, into chunks with a maximum size of 26 characters and an overlap of 4 characters between adjacent chunks.

The key parameters of a splitter are the following:

chunk_size:
- The chunk_size parameter determines the maximum length (in characters) of each chunk or segment into which the document is split.
- A smaller chunk_size results in more fine-grained segments, while a larger value creates larger chunks.
- Adjusting this parameter affects the granularity of the split text.
chunk_overlap:
- The chunk_overlap parameter specifies the number of characters that overlap between adjacent chunks.
- It controls how much context is shared between neighboring segments.
- A higher chunk_overlap value repeats more text across neighboring chunks, which helps preserve context that straddles a chunk boundary.
- Conversely, a lower value reduces the repeated text, leading to more distinct segments but a higher risk of cutting a thought in half.
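The effect of the two parameters can be seen with a deliberately simple sliding-window splitter. This is not LangChain's algorithm (RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences); split_with_overlap is a hypothetical helper that only illustrates how chunk_size and chunk_overlap interact.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


some_text = "abcdefghijklmnopqrstuvwxyz0123456789"
chunks = split_with_overlap(some_text, chunk_size=26, chunk_overlap=4)
print(chunks)
# ['abcdefghijklmnopqrstuvwxyz', 'wxyz0123456789']
```

The last four characters of the first chunk ("wxyz") reappear at the start of the second chunk, which is the overlap at work.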

3. Storage

✦ Under the hood, two operations happen at this step:

1. Get the embeddings of the text.
2. Store the embeddings in a storage system (a vector store or a vector database).

✦ However, in frameworks such as LangChain, these two operations are often completed by a single method.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Embedding model used to convert each chunk into a vector
embeddings_model = OpenAIEmbeddings(model='text-embedding-3-small')

# splitted_documents is the list of Document chunks produced in the splitting step.
# from_documents embeds every chunk and stores the vectors in a single call.
db = Chroma.from_documents(splitted_documents, embeddings_model, persist_directory="./chroma_db")

✦ The last line creates a Chroma database from a collection of split documents.
- The database is built using the specified embeddings model and is stored in the directory "./chroma_db".
- Chroma: Chroma is an open-source vector database designed for efficient indexing and similarity search of vectors such as text embeddings.
- from_documents: This method constructs a database from a list of Document objects (LangChain's object).
- persist_directory: Specifies the directory where the database will be stored for future use.
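To show what "embed, then store" means without calling an embedding API, here is a toy in-memory store. Everything in it is an assumption made for illustration: ToyVectorStore is hypothetical, and the word-count "embedding" is a crude stand-in for a real embedding model, which would capture meaning rather than exact word matches.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical store that keeps (vector, text) pairs in memory."""

    def __init__(self):
        self._entries = []

    @classmethod
    def from_texts(cls, texts):
        # Both operations happen here: embed each text, then store the result
        store = cls()
        for text in texts:
            store._entries.append((embed(text), text))
        return store

    def similarity_search(self, query, k=1):
        query_vector = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(query_vector, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore.from_texts([
    "prompt engineering improves model answers",
    "vector stores hold embeddings for retrieval",
])
top = store.similarity_search("where are embeddings stored for retrieval", k=1)
print(top)
```

The from_texts class method plays the same role as from_documents in the LangChain snippet: the caller never handles the vectors directly.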

Differences between Vector Store and Vector Database

Vector Store:

✦ A Vector Store is a simple data structure or storage system designed specifically to hold vectors (n-dimensional numerical representations of data points).
✦ It focuses on efficient storage and retrieval of vectors without additional features.
✦ Purpose: Primarily used for vector indexing and retrieval, especially in scenarios where the primary goal is similarity search.

Vector Database:

✦ A Vector Database is a more sophisticated system that not only stores vectors but also provides additional functionalities and optimizations.
✦ It is purpose-built for handling high-dimensional vectors efficiently.
✦ Features:
- Indexing: Vector databases create indexes to speed up similarity searches.
- Scalability: They can handle large-scale vector data.
- Query Optimization: Vector databases optimize queries for similarity search.
- Machine Learning Integration: Some vector databases integrate with ML frameworks for training and inference.
✦ Examples: Pinecone, Milvus, and Weaviate are popular vector databases.

In short, while a Vector Store is minimalistic and focused on storage, a Vector Database provides additional features and optimizations for efficient vector handling, making it suitable for applications like semantic search, recommendation systems, and retrieval-augmented generation (RAG).

4. Retrieval

For the Retrieval stage, LangChain provides a variety of retrievers, each of which is an interface that returns documents given an unstructured query.

✦ Retrievers are more general than vector stores.
✦ A retriever does not need to be able to store documents, only to return (or retrieve) them.
✦ Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
✦ Retrievers accept a string query as input and return a list of Document objects as output.
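The point that a retriever only needs to return documents, not store vectors, can be sketched as follows. This is an illustrative interface, not LangChain's actual base class: the Retriever protocol, KeywordRetriever, and fetch_sources are hypothetical names, and plain strings are used in place of Document objects to keep the example short.

```python
from typing import Protocol


class Retriever(Protocol):
    """Anything that maps a string query to a list of documents."""

    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Returns documents sharing at least one word with the query; uses no vectors."""

    def __init__(self, documents: list[str]):
        self.documents = documents

    def invoke(self, query: str) -> list[str]:
        query_words = set(query.lower().split())
        return [
            doc for doc in self.documents
            if query_words & set(doc.lower().split())
        ]


def fetch_sources(retriever: Retriever, query: str) -> list[str]:
    # Downstream code depends only on the interface, not on how retrieval works
    return retriever.invoke(query)


retriever = KeywordRetriever([
    "LLMs can hallucinate facts",
    "Chunking splits long text",
])
results = fetch_sources(retriever, "why hallucinate")
print(results)
```

A vector-store-backed retriever could be swapped in behind the same invoke method without changing fetch_sources.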

Method 1: Basic Retrieval Using Vector Store directly

This is a low-level implementation that is useful if you want more flexibility in customizing or developing your own retriever.

For example, if you want to retrieve only the documents whose relevance score is above a specific threshold, this method allows you to access those scores, so you can write your own code to filter or perform other computations before producing the final list of documents.
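The threshold filtering described above might look like the sketch below. It assumes the vector store has already returned (document, relevance score) pairs, which is the shape LangChain's similarity_search_with_relevance_scores method returns; the filter_by_score helper, the sample chunks, and the scores are made up for illustration.

```python
def filter_by_score(scored_docs, threshold):
    """Keep documents scoring at or above the threshold, best match first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]


# Hypothetical search results: (document, relevance score) pairs
scored = [
    ("chunk about hallucination", 0.82),
    ("chunk about pricing", 0.31),
    ("chunk about grounding", 0.67),
]

relevant = filter_by_score(scored, threshold=0.5)
print(relevant)
# ['chunk about hallucination', 'chunk about grounding']
```

A suitable threshold depends on the embedding model and the data, so it is usually tuned by inspecting scores on real queries.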

Method 2: Using the `retriever` object

This is a much more common approach, where we rely on the retriever component from LangChain to retrieve the relevant documents.

# A very basic retriever that returns at most the 10 most relevant documents.
# db is the Chroma vector store created in the Storage step.
retriever_basic = db.as_retriever(search_kwargs={"k": 10})

5. Output

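In this final step, the retrieved chunks are passed to the LLM together with the question to generate the answer.

from langchain.chains import RetrievalQA
from langchain_openai import AzureChatOpenAI

# Assumes the Azure OpenAI endpoint, API version, and key are already configured
# (for example, through environment variables)
qa_chain = RetrievalQA.from_chain_type(
    AzureChatOpenAI(model='gpt-3.5-turbo'),
    retriever=retriever_basic
)

qa_chain.invoke("Why do LLMs hallucinate?")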

Alternatively, we can write our own custom Q&A prompt for generating the answer:

from langchain.prompts import PromptTemplate

# Build prompt: {context} receives the retrieved chunks, {question} the user's query
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.

{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Build the chain with the custom prompt
qa_chain = RetrievalQA.from_chain_type(
    AzureChatOpenAI(model='gpt-3.5-turbo'),
    retriever=retriever_basic,
    return_source_documents=True,  # Make inspection of the retrieved documents possible
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

# Run chain
result = qa_chain.invoke("Why do LLMs hallucinate?")
print(result["result"])            # the generated answer
print(result["source_documents"])  # the chunks the answer was based on