Optimizing Retrieval Augmented Generation (RAG)

Parag Shah
4 min read · May 16, 2024


It was over a year ago that I wrote a post about the RAG pattern to ground LLM inference in organizational data for Q&A.

It was about the basic RAG implementation as shown here:

Plain “Vanilla” RAG implementation.

Advanced RAG

Since writing that article, I have had a chance to implement a few RAG use cases and optimize them as shown here:

Improved RAG with Optimizations

In the beginning of 2023, calling the Azure OpenAI APIs frequently resulted in "Capacity Full" errors. Microsoft's suggestion was to retry with Tenacity; a minimal sketch follows the LiteLLM example below. An API routing and load-balancing layer such as LiteLLM is a required layer in the RAG stack: it offers a single API for all of its supported models and provides excellent observability and cost management.

pip install litellm

from litellm import completion, completion_cost
import os

## set ENV variables
os.environ["OPENAI_API_KEY"] = "your-api-key"
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
# os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

response = completion(
    model="gpt-3.5-turbo",
    # model="claude-2",
    # model="ollama/llama2",  # (for local)
    # model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
)

# Print the cost of this call
print(completion_cost(completion_response=response))
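
For the transient "Capacity Full" errors mentioned above, a minimal Tenacity sketch could wrap the same completion call (the backoff values here are illustrative, not a recommendation):

from tenacity import retry, stop_after_attempt, wait_random_exponential

# Retry with exponential backoff when the endpoint is at capacity
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return completion(**kwargs)

response = completion_with_backoff(
    model="gpt-3.5-turbo",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
)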

Optimized Chunking

One of the RAG use cases is Q&A over a set of PDFs that are 100 to 300 pages long. As with most long-form text, most of the proper nouns appear in the initial part of the document; in later parts they are often referred to as "it", "the company", "he/him", "she/her", etc. Chunks taken from those later sections therefore lose the context needed for good retrieval.

Summary + Chunk Approach (MultiVector Retriever)

The process is simple: create a summary for each sub-document and store it alongside the chunks, giving the retriever richer context to match against.

LlamaIndex already has a Llama Pack, SubDocSummaryPack, that implements this multi-vector retriever optimization.

from llama_index.core.llama_pack import download_llama_pack
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# download and install dependencies
SubDocSummaryPack = download_llama_pack(
    "SubDocSummaryPack", "./subdoc_summary_pack"
)

# You can use any llama-hub loader to get documents!
subdoc_summary_pack = SubDocSummaryPack(
    documents,
    parent_chunk_size=8192,  # default
    child_chunk_size=512,  # default
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbedding(),
)

response = subdoc_summary_pack.run("<query>", similarity_top_k=2)

Query Optimization

The next optimization is the user query itself. Logs have revealed that users search within a RAG application just as they would search for a keyword on Google, often without much context. This results in poor retrieval and therefore an unsatisfactory response from the LLM.

A couple of ways this can be fixed:

  1. Rewriting the user query: run the raw query through an LLM with a rewrite prompt before using it for retrieval, as in the sketch below.
user_question = "Tom Cruise Age?"

template = """Provide a better search query for
web search engine to answer the given question, end
the queries with '**'. Question: {user_question}
Answer: """

# Rewrite the query, reusing the litellm completion call from above
rewrite = completion(model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": template.format(user_question=user_question)}])
rewritten_query = rewrite.choices[0].message.content

2. Hypothetical Document Embeddings (HyDE): when searching the high-dimensional vector space, the embedding of a hypothetical answer lands in closer proximity to the embeddings of the actual documents than the embedding of the question does; so let the LLM draft an answer first and retrieve with that. A sketch follows this list.

A limitation of the HyDE approach is when the LLM cannot generate a good hypothetical answer because the topic is unknown to it.
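
As a minimal HyDE sketch with LlamaIndex, assuming an existing VectorStoreIndex named index built over the same documents:

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Let the LLM write a hypothetical answer, then retrieve with its embedding
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(index.as_query_engine(), hyde)
response = hyde_query_engine.query("<query>")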

Retrieval Optimization

Maximal Marginal Relevance (MMR): MMR tries to reduce redundancy in the results while maintaining their relevance to the query for already ranked documents/phrases. It works by finding similar but diverse chunks: embeddings that are close to the input but not too close to the embeddings already selected.
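
Here is a minimal sketch with LangChain, assuming an existing vector store named vectorstore; the k, fetch_k, and lambda_mult values are illustrative:

# MMR: fetch 20 candidates, return the 5 most relevant-but-diverse chunks
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5},
)
docs = mmr_retriever.invoke("<query>")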

Re-ranking Retrieved Chunks: retrieved chunks are not necessarily ordered by relevance. Adding an interim stage to rerank them, for example with Cohere's Rerank endpoint, helps the LLM produce a better response.

As an example, consider summarizing Section 2 itself versus all the scattered references to Section 2 within the 300-page document.

from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere

llm = Cohere(temperature=0)
compressor = CohereRerank()
# retriever is assumed to be an existing base retriever over the document's vector store
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "Summarize Section 2 of this document"
)

If you enjoyed reading about my experience with RAG, please reach out to me on LinkedIn.

References

https://arxiv.org/abs/2212.10496

https://medium.com/tech-that-works/maximal-marginal-relevance-to-rerank-results-in-unsupervised-keyphrase-extraction-22d95015c7c5

https://docs.cohere.com/docs/reranking-best-practices


Parag Shah

I live in Vancouver, Canada. I am an AI Engineer and Azure Solutions Architect. I enjoy good coffee and the outdoors. LinkedIn: https://bit.ly/3cbD9gW