icon: LiWrench
Title: Retrieval Augmented Generation (RAG)
Now that we understand how embeddings can be used to retrieve semantically related texts, it's time to explore probably the most popular and pragmatic application of embeddings: Retrieval Augmented Generation (RAG).
A Retrieval-Augmented Generation (RAG) system is a framework that enhances the accuracy and reliability of generative AI models by incorporating information from external sources.
- LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on vast amounts of publicly available data, such as Wikipedia, mailing lists, textbooks, source code, and more.
- However, while LLMs are trained on a vast amount of data, they are not trained on your data, which may be private or specific to the problem you're trying to solve. This data could be behind APIs, in SQL databases, or trapped in PDFs and slide decks.
- You might choose to fine-tune an LLM with your data, but:
  - Training an LLM is expensive.
  - Because of that cost, it is hard to keep the model updated with the latest information.
  - Observability is lacking: when the model answers, it is not obvious how it arrived at that answer.
- Instead of fine-tuning, you can use a context augmentation pattern called Retrieval-Augmented Generation (RAG) to obtain more accurate text generation relevant to your specific data.
- By doing so, RAG overcomes all three weaknesses of the fine-tuning approach: no training is needed, the underlying data can be updated at any time, and the retrieved documents can be shown to the user as evidence for the answer.
LangChain provides a robust framework for building LLM applications. The framework includes many components to support common LLM operations such as prompt chaining, chat memory management, and, of course, RAG.
We recommend using LangChain or equivalent frameworks for implementing RAG, instead of writing your code from scratch. These frameworks often offer the following benefits:
- Ready-to-use Components: components are various modules/functions that we can use to handle many of the common operations in RAG, without having to write the code from scratch. This includes advanced techniques such as Contextual Compression, Self Query, and Parent Document, which someone would otherwise have to understand from research papers or code repositories and then translate into Python code.
- Community Support: an active community and ecosystem mean that documentation, examples, and answers to common problems are easy to find.
However, packages like LangChain are not without their shortcomings:
- Expect a learning curve to get familiar with the framework.
- They are still in active development and may break your code.
- They offer less flexibility compared to writing your own code.
There are 5 main steps in a typical RAG pipeline:
- Loading: ingest data from its source as documents.
- Splitting: transform long documents into smaller chunks.
- Storage: embed the chunks and store the vectors in a vector store.
- Retrieval: fetch the chunks most relevant to a given query.
- Generation: pass the retrieved chunks to an LLM to generate the answer.
- Use document loaders to load data from a source as Document objects.
- See the official documentation on LangChain's Document Loaders for the different kinds of loaders for different sources.
- In this particular example, we are using one of the PDF loaders from LangChain to load the Prompt Engineering Playbook.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://www.developer.tech.gov.sg/products/collections/data-science-and-artificial-intelligence/playbooks/prompt-engineering-playbook-beta-v3.pdf")
pages = loader.load()
- The loader loads each page of the PDF file as a separate Document object. The code below shows the first page of the PDF, using index 0.
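As an illustration of what indexing into the loaded pages looks like, here is a sketch. The Document class below is a simplified stand-in for LangChain's Document class, used so the example runs without downloading the PDF; the real objects returned by the loader have the same page_content and metadata attributes.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Simplified stand-in for LangChain's Document class.
    page_content: str
    metadata: dict = field(default_factory=dict)

# `pages = loader.load()` returns one Document per PDF page, roughly like this:
pages = [
    Document("Page 1 text of the Prompt Engineering Playbook ...", {"page": 0}),
    Document("Page 2 text ...", {"page": 1}),
]

# Index 0 gives the first page; each Document carries its text and metadata.
print(pages[0].page_content)
print(pages[0].metadata)
```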
Once we have loaded documents, we'll often want to transform them to better suit our application.
- The simplest example is that you may want to split a long document into smaller chunks that can fit into your model's context window.
- LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.
- At a high level, text splitters work as follows: split the text into small, semantically meaningful pieces (often sentences); combine these pieces into a chunk until a target size is reached; then emit that chunk and start a new one, keeping some overlap between adjacent chunks to preserve context.
- In the example, we are using the RecursiveCharacterTextSplitter from LangChain to split the given some_text into chunks. The resulting segments will have a maximum size of 26 characters, with an overlap of 4 characters between adjacent chunks.
The key parameters that we often see in splitters are the following:
- chunk_size: determines the maximum length (in characters) of each chunk or segment into which the document is split. A smaller chunk_size results in more fine-grained segments, while a larger value creates larger chunks.
- chunk_overlap: specifies the number of characters that overlap between adjacent chunks. A larger chunk_overlap value increases the overlap, allowing for smoother transitions between chunks.
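The effect of these two parameters can be sketched with a naive character-based splitter in plain Python. This is illustrative only: the real RecursiveCharacterTextSplitter additionally tries to split on separators such as paragraph and sentence boundaries before falling back to raw characters.

```python
def naive_split(text: str, chunk_size: int = 26, chunk_overlap: int = 4) -> list:
    # Advance by (chunk_size - chunk_overlap) characters each step, so every
    # new chunk repeats the last `chunk_overlap` characters of the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

some_text = "abcdefghijklmnopqrstuvwxyzabcdefg"
chunks = naive_split(some_text)
print(chunks)
# Every chunk is at most 26 characters; adjacent chunks share 4 characters.
```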
Next, we embed the split documents and store them in a Chroma vector store using the from_documents method.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(model='text-embedding-3-small')
db = Chroma.from_documents(splitted_documents, embeddings_model, persist_directory="./chroma_db")
The code above embeds the split documents and persists the resulting vector store to the local directory "./chroma_db". Note that Chroma.from_documents takes in Document objects (LangChain's object), not raw strings.

- Vector Store: a minimalistic component focused on storing vectors (with their associated texts) and performing similarity search over them.
- Vector Database: a full-fledged database system built around vectors, with additional features such as indexing, metadata filtering, persistence, and scaling.
In short, while a Vector Store is minimalistic and focused on storage, a Vector Database provides additional features and optimizations for efficient vector handling, making it suitable for applications like semantic search, recommendation systems, and retrieval-augmented generation (RAG).
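To make the "minimalistic storage plus similarity search" idea concrete, here is a bare-bones in-memory vector store sketched in plain Python. It uses toy 2-dimensional vectors and cosine similarity; real stores like Chroma pair this with an embeddings model and optimized indexes.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class MiniVectorStore:
    """Bare-bones vector store: keeps (vector, text) pairs and returns the
    k texts whose vectors are most similar to a query vector."""
    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def similarity_search(self, query_vector, k=2):
        ranked = sorted(self._items, key=lambda it: cosine(it[0], query_vector),
                        reverse=True)
        return [text for _, text in ranked[:k]]

store = MiniVectorStore()
store.add([1.0, 0.0], "doc about cats")
store.add([0.9, 0.1], "doc about kittens")
store.add([0.0, 1.0], "doc about finance")
print(store.similarity_search([1.0, 0.05], k=2))
```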
For the Retrieval stage, LangChain provides a variety of retrievers, each of which is an interface that returns documents given an unstructured query.
Using the vector store directly: the vector store's search methods take a query as input and return Document objects as output. This is a low-level implementation that is useful if you want more flexibility in customizing or developing your own retriever. For example, if you want to retrieve only the documents whose relevance_score is above a specific threshold, this approach allows you to access such scores, so you can write your own code to do the filtering or other computations before getting the final list of documents to retrieve.
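Such threshold filtering can be sketched in plain Python. The sketch assumes you already have (document, score) pairs, for example from a vector store method such as similarity_search_with_relevance_scores, where scores are typically normalized to [0, 1]; the document strings below are made-up placeholders.

```python
def filter_by_score(scored_docs, threshold=0.8):
    # scored_docs: list of (document, relevance_score) pairs.
    # Keep only the documents scoring at or above the threshold.
    return [doc for doc, score in scored_docs if score >= threshold]

scored = [
    ("chunk about hallucination", 0.91),
    ("chunk about prompting", 0.83),
    ("unrelated chunk", 0.42),
]
print(filter_by_score(scored, threshold=0.8))
```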
Using a retriever object: this is a much more common approach, where we rely on the retriever component from LangChain to retrieve the relevant documents.
# This is a very basic retriever that returns at most the 10 most relevant documents
retriever_basic = vectorstore.as_retriever(search_kwargs={"k": 10})
from langchain.chains import RetrievalQA
from langchain_openai import AzureChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    AzureChatOpenAI(model='gpt-3.5-turbo'),
    retriever=retriever_basic
)
qa_chain.invoke("Why do LLMs hallucinate?")
Alternatively, we can also easily write our own custom Q&A prompt for generating the answer:
from langchain.prompts import PromptTemplate
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    AzureChatOpenAI(model='gpt-3.5-turbo'),
    retriever=retriever_basic,
    return_source_documents=True,  # Make inspection of source documents possible
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
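Under the hood, the prompt template is plain string interpolation: the retrieved chunks are joined into {context} and the user's query fills {question}. A plain-Python sketch of that step (the retrieved chunks below are made-up placeholders):

```python
# Same template text as the QA prompt above.
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

retrieved_chunks = [
    "LLMs predict likely tokens, not verified facts.",
    "Gaps in training data can lead to fabricated details.",
]
# The chain fills the template with the retrieved context and the question,
# then sends the resulting prompt to the LLM.
prompt = template.format(context="\n".join(retrieved_chunks),
                         question="Why do LLMs hallucinate?")
print(prompt)
```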