Improving Pre-Retrieval Processes
✦ These techniques are called pre-retrieval because the "construction" or "enhancement" of the query happens before the retrieval process. ✦ As we have already seen in Naive RAG, chunks are small parts of the whole document, and indexing is the vector representation of these chunks, which we store in a Vector DB.
✦ We quote a paragraph from the GovTech RAG Playbook that perfectly sums up the challenge of finding the right balance between chunk size and the accuracy of the RAG pipeline. We have included the RAG Playbook under "Further Readings" for Topic 5.
While it is possible to obtain an embedding for a document as long as it fits into the embedding model’s context length, embedding an entire document is not always an optimal strategy. It is common to segment documents into chunks and to specify an overlap size between chunks.
Both of these parameters can help to facilitate the flow of context from one chunk to another, and the optimal chunk and overlap size to use is corpus specific. Embedding a single sentence focuses on its specific meaning but forgoes the broader context in the surrounding text. Embedding an entire body of text focuses on the overall meaning but may dilute the significance of individual sentences or phrases.
Generally, longer and more complex queries benefit from smaller chunk sizes while shorter and simpler queries may not require chunking.
Source: GovTech RAG Playbook
✦ While fixed-size chunking offers a straightforward approach, it often leads to context fragmentation, hindering the retrieval of accurate information.
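To make the trade-off concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. The character-based sizes and the example text are illustrative; in practice you would use a ready-made splitter from a library such as Langchain:

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Retrieval-Augmented Generation grounds an LLM's answers in retrieved documents."
chunks = fixed_size_chunks(text, chunk_size=30, overlap=10)
# Each chunk repeats the last 10 characters of the previous one,
# so some context "flows" across chunk boundaries.
```

Notice that chunks can cut words and sentences mid-way; this is exactly the context fragmentation mentioned above, and why overlap alone is not a complete fix.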
✦ Content-based chunking, also known as recursive structure-aware chunking, preserves the context and format of the text for specific file types, such as HTML, PDF, Markdown, and JSON.
✦ Simply put, using the document splitter method that suits the use case helps us derive chunks tailored to the specific file formats we are dealing with.
✦ For example, a structure-aware splitter for HTML can split on elements such as headers (<h1>), paragraphs (<p>), and tables (<table>), enabling custom processing based on element types, something a plain CharacterTextSplitter cannot do.
✦ Langchain supports many of the commonly used file types. Refer to the table below for the text splitters offered by Langchain:
Name | Splits On | Description |
---|---|---|
Recursive | A list of user defined characters | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
HTML | HTML specific characters | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
Markdown | Markdown specific characters | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
Code | Code (Python, JS) specific characters | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
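To build intuition for the "Recursive" strategy in the table, here is a simplified plain-Python sketch: try the coarsest separator first (paragraphs), and only fall back to finer ones (lines, then words) when a piece is still too large. The separator list and chunk size are illustrative, and the real RecursiveCharacterTextSplitter in Langchain is considerably more involved:

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=100):
    """Recursively split on coarser separators first, keeping related text together."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # Piece itself is too big: recurse with the finer separators.
                chunks.extend(recursive_split(piece, rest, chunk_size))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are tried before lines and words, sentences that belong together tend to end up in the same chunk, which is why this is the recommended default.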
✦ You can visualize how different chunking strategies split a given text using the Chunkviz utility.
✦ Semantic chunking is one of the more sophisticated chunking methods: instead of splitting on a fixed size or on specific characters, it splits where the semantic (embedding) similarity between adjacent pieces of text drops.
✦ The easiest way to take advantage of this cutting-edge chunking approach is to use Langchain's experimental module:
!pip install --quiet langchain_experimental langchain_openai
# Load Example Data
# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
# Create Text Splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# That's it. It is this simple.
text_splitter = SemanticChunker(OpenAIEmbeddings())
# Split Text
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
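To see the idea behind SemanticChunker without making any API calls, here is a toy sketch: compare each sentence with the previous one and start a new chunk wherever similarity drops below a threshold. A hypothetical word-overlap (Jaccard) similarity stands in for real embedding similarity, and the 0.2 threshold is arbitrary:

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in for embedding cosine similarity: word-set overlap."""
    wa = set(w.strip(".,").lower() for w in a.split())
    wb = set(w.strip(".,").lower() for w in b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Group sentences, splitting where similarity to the previous sentence drops."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if jaccard(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Chunking splits documents into pieces.",
    "Good chunking keeps related pieces together.",
    "Paris is the capital of France.",
]
```

Running `semantic_chunks(sentences)` keeps the two chunking sentences together and isolates the unrelated Paris sentence, which is the behaviour a real embedding-based chunker gives you at document scale.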
✦ Query transformation improves retrieval quality by restructuring the user query before it is sent to the retriever.
✦ One such technique is query rewriting, where an LLM rewrites the user query into a better search query:
# The main part is a rewriter to rewrite the query
prompt = """Provide a better search query for \
web search engine to answer the given question.
Question: {user_query}
"""
✦ If the query is complex and touches multiple contexts, retrieval with a single query may not be a good approach, as it may fail to return the output you want.
✦ In Langchain, we can use the MultiQueryRetriever to implement this technique. The MultiQueryRetriever automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query.
✦ By retrieving documents for each generated query and taking the unique union of the results, the MultiQueryRetriever might be able to overcome some of the limitations of distance-based retrieval and get a richer set of results.
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)
# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)
# This is the Core Part of the Code
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)
# Retrieve the unique union of documents across the generated queries
unique_docs = retriever_from_llm.invoke(question)
template="""You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines.
Original question: {question}"""
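The final step, retrieving for each generated query variant and taking the unique union of documents, can be sketched as follows; the hypothetical `retrieve` callable stands in for the vector-store retriever:

```python
def multi_query_retrieve(queries: list[str], retrieve) -> list[str]:
    """Retrieve for every query variant and deduplicate, preserving order."""
    seen, results = set(), []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results
```

Deduplication matters because the query variants are deliberately similar, so their result sets overlap heavily.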
✦ When we have multiple vector stores / databases, or various actions to perform on the user query depending on its context, routing the query in the right direction is very important for relevant retrieval and further generation.
✦ Using a specific prompt and output parser, we can make an LLM call to decide which action to perform or where to route the user query.
✦ If you're keen to use a framework, you can use prompt chaining or custom Agents to implement query routing in Langchain or LlamaIndex.
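A minimal routing sketch: pick a route name for the query, then dispatch it to the matching retriever. Here a keyword-based stub stands in for the LLM call with an output parser, and the route names, keywords, and retrievers are all illustrative:

```python
ROUTES = {
    "code_docs": ["python", "error", "function", "api"],
    "hr_policies": ["leave", "payroll", "benefits"],
}

def route_query(query: str) -> str:
    """Pick a route by keyword match; in practice an LLM call would decide this."""
    q = query.lower()
    for route, keywords in ROUTES.items():
        if any(k in q for k in keywords):
            return route
    return "general"

def answer(query: str, retrievers: dict):
    """Dispatch the query to the retriever for its route."""
    route = route_query(query)
    return retrievers.get(route, retrievers["general"])(query)
```

Swapping the keyword stub for an LLM classifier with a structured output parser gives you the prompt-based routing described above, without changing the dispatch logic.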
This note is not intended to exhaustively cover all techniques or methods available for improving Retrieval-Augmented Generation (RAG) processes.