Improve RAG performance on custom vocabulary

Teemu Sormunen · Published in DataDrivenInvestor · 14 min read · Feb 8, 2024

Bad retrieval systems cause chaos, frustration, and hallucinations.

New embedding models are stronger than ever. We evaluate them thoroughly on benchmarks like MTEB. But why do the models still fail miserably when we have custom data consisting of words that are not found online, such as internal product names and IDs? In this article, we will find out the exact reasons and propose multiple solutions.

TL;DR

Out-of-the-box semantic search models fail when custom vocabulary or terminology is introduced, or when a poor chunking strategy or data model is used. Keyword search, query expansion, or dynamic weight adjustment with Reciprocal Rank Fusion (RRF) can mitigate errors caused by custom vocabulary. Fine-tuning embedding models requires careful data collection and design, and should only be attempted after the previously mentioned methods have been applied.

A billion-dollar motivation for improving retrieval

Copilot for Microsoft 365 is an example of a successful RAG application. Copilot leverages the RAG pattern with a heavy emphasis on the R(etrieval). The real workhorse, the data modeling and retrieval layer, has been under construction for eight years: the Microsoft Graph.

What is RAG, really?

Retrieval Augmented Generation (RAG) consists of Retrieval, Augmentation, and Generation. First, a user question is converted to a search query. Second, relevant text data is pulled from various sources through APIs. Third, the relevant text data is inserted into the LLM (Large Language Model) prompt along with the user question. Fourth, the LLM generates a response to the user question based on the relevant text data. Finally, the answer is displayed to the user with data sources that are relevant, such that the user can easily verify the chatbot’s answer.

Basic RAG workflow consists of Retrieval, Augmentation, and Generation.
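To make the flow concrete, here is a minimal sketch of that loop in Python. The retrieve and generate callables are hypothetical placeholders standing in for whatever IR system and LLM you plug in; they are not functions from the example notebook.

from typing import Callable, List, Tuple

def rag_answer(
    question: str,
    retrieve: Callable[[str], List[Tuple[str, str]]],  # query -> [(source_id, passage), ...]
    generate: Callable[[str, List[str]], str],          # (question, passages) -> answer
) -> dict:
    """Hypothetical skeleton of the Retrieval-Augmentation-Generation loop described above."""
    # 1. Convert the user question into a search query (here we use it as-is).
    query = question
    # 2. Retrieve relevant passages from the knowledge base or APIs.
    hits = retrieve(query)
    passages = [passage for _, passage in hits]
    # 3. + 4. Augment the LLM prompt with the passages and generate an answer.
    answer = generate(question, passages)
    # 5. Return the answer together with its sources so the user can verify it.
    return {"answer": answer, "sources": [source for source, _ in hits]}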

This process allows for transparency and scalability. The final step of showing the chatbot's reasoning behind the answer is crucial for transparency. The modularity of the design allows each component of the RAG architecture to be scaled independently, such that we can switch the LLM, the UI, and the Information Retrieval (IR) system or modify the knowledge base sources.

These benefits are often not easy to achieve with machine learning or black-box LLMs, making the fundamental RAG pattern likely to stand the test of time. The LLM can be switched to some other algorithm that pools the text data into an answer, the UI can be switched to Augmented Reality glasses, and the IR system could consist of thousands of different APIs to call (check out Gorilla). For this example, ChatGPT is a convenient LLM for Q&A based on text, and more advanced IR systems can be left for later.

Now that we agree on what the RAG pattern is, let’s investigate the pitfalls of IR systems, specifically within RAG.

Where vector search fails, and how to fix it

If you are unfamiliar with semantic search and chunking, you can read my previous article on generating long-term memory for a chatbot.

As an example, let’s imagine we have built a RAG customer support chatbot based on the Tesla Model 3 manual, utilizing semantic vector search. To build the vector index, we chunk the manual page-by-page.
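As a sketch of what page-level chunking could look like (the file name and the use of pypdf are assumptions for illustration; the notebook may do this differently):

# One chunk per page of the manual PDF.
from pypdf import PdfReader

reader = PdfReader("tesla_model_3_manual.pdf")          # hypothetical file name
texts = [page.extract_text() for page in reader.pages]  # page-level chunks
print(f"Number of page chunks: {len(texts)}")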

You may follow up on the example via this Google Colab Notebook. The notebook, fine-tuned model weights, training data, and training and validation losses can all be found on Google Drive. Remember to upload all the necessary files to the Google Colab filesystem from Google Drive, so you can access them in the notebook.

The problem statement

A user enters the chat. The user has encountered an error on his Tesla car dashboard: APP_w304. The user sends a message to the Tesla chatbot.

“I see code APP_w304 on my dashboard what to do?”

query = "I see code app_w304 on my dashboard what to do?"
relevant_page, page_no, X = find_relevant_page_from_document(query)
answer = get_openai_rag_response(query, relevant_page)
print(answer)

Our Tesla chatbot first finds the most relevant page to the query based on semantic similarity, returns it, and formulates a response to the question. The semantic retriever found the wrong page, and as a result, the chatbot hallucinates.

The chatbot could not find the correct page and hallucinates.

The semantic search returned index 0, which corresponds to page 1, but the real answer to the question is on page 217. Page 1 is the cover page:

Page 1: The most relevant page related to user query according to the semantic search.

The cover page has little text or other relevant information. Note: when we display indices found by the chatbot, we start from index 0. Therefore index 0 corresponds to page 1, index 100 to page 101, and so on.

Why did the semantic search fail?

We can break page 1 into sentences and find the sentences most similar to the user query. The embedding model we are using requires prepending short queries with the instruction “Represent this sentence for searching relevant passages: ”. The model was trained to produce high-quality embeddings for short-query-to-long-passage (s2p) matching, and this prefix tells the model that we are searching for a long passage with a short query. We use the dot product to measure text similarity.
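As a rough sketch of what this could look like under the hood, assuming a sentence-transformers wrapper around bge-base-en-v1.5 (the notebook hides this inside get_opensource_embeddings, so the exact implementation may differ):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
s2p_prefix = "Represent this sentence for searching relevant passages: "

# Queries get the s2p instruction prefix; passages are embedded as-is.
query_embedding = model.encode(s2p_prefix + "I see code app_w304 on my dashboard what to do?")
passage_embedding = model.encode("Software version: 2023.44.30")
print(query_embedding @ passage_embedding)  # dot-product similarity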

We extract the sentences by splitting on newlines \n .

page_texts = np.array([x for x in relevant_page.split('\n') if len(x) > 1])
page_ems = get_opensource_embeddings(page_texts)
q = get_opensource_embeddings(query, is_s2p=True).flatten()
v = q @ page_ems.T  # dot-product similarity between the query and every sentence

top_matches = page_texts[np.argsort(v)[::-1]]
for i, m in enumerate(top_matches[:5]):
    print(f"Match #{i}: {m}")
The most similar sentences to the user query.

The issue comes down to tokenization

We wanted to find the page relevant to the error code APP_w304 , but we end up matching a page with numbers and keywords that do not seem related to our query. The digits 3, 0, and 4 repeat in the most similar sentences, and this is the root of the issue: tokenization.

Tokenization first breaks text into words and sub-words, then converts them to numbers. These numbers are processed by the embedding model. The goal of tokenization is to find naturally occurring words and stems in English. Our error code APP_w304 is not a proper English word, so the embedding model matches tokens rather than the natural words we intuitively think in. You can try tokenizing text with the OpenAI tokenizer in their online demo. We are using the open-source embedding model bge-base-en-v1.5, which uses a different tokenizer than OpenAI’s.

Let’s see what kind of tokens our error code and the most similar sentence consist of.

# `tokenizer` is the embedding model's own tokenizer; for bge-base-en-v1.5 it could be
# loaded with, e.g., AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
query_tokens = tokenizer.encode('APP_w304')
best_match_tokens = tokenizer.encode('Software version: 2023.44.30')
print_output(f"APP_w304 tokenized: {[tokenizer.decode(token) for token in query_tokens]}")
print_output(f"Software version: 2023.44.30: {[tokenizer.decode(token) for token in best_match_tokens]}")
Tokenized user query and the most similar sentence from the manual.

Now we see the text like our embedding model does: as a tokenized sequence. We have the token app in the user query and the token software in the matched sentence. app relates to the word application, which is semantically similar to the word software. The numbers are also close to each other, such as ##3 and ##30 . The ## prefix simply indicates that the token continues the preceding word, so the number 3 occurs as part of some word rather than standing on its own.

The embedding model does not understand custom vocabulary, abbreviations, or acronyms; those special words are tokenized into subwords that do not convey meaning on their own. Some may say that the sum of the tokens is greater than the individual tokens and that inspecting individual tokens doesn’t represent the real semantics. While this is true, investigating single tokens demonstrates the fundamental working principles under the hood.

Complementing the semantic search: Keyword search, BM25

BM25 (Best Matching 25) can help us solve the issue with custom vocabulary. It breaks down sentences into separate words and forms a search index that emphasizes the important words.

When fitting BM25 to our manual, BM25 assigns higher weights to rare words like APP_w304. This makes finding this specific code easier among numerous words. However, BM25 only looks for exact matches and doesn’t consider synonyms or overall text meaning.

As BM25 considers exact text matches, a tokenizer must be defined for BM25. Here the tokenizer takes in a piece of text and breaks it into individual words on whitespace “ ” and newlines “\n”. For BM25, we will be using the rank-bm25 Python package.

from rank_bm25 import BM25Okapi

def bm25_tokenizer(sentence):
    list_split_by_space = sentence.split(' ')
    list_of_lists_by_newline = [token.split('\n') for token in list_split_by_space]
    corpus = [word for word_list in list_of_lists_by_newline for word in word_list]
    # Remove empty strings, assuming word must have at least 3 characters
    corpus = [word.lower() for word in corpus if len(word) > 2]
    return corpus

tokenized_corpus = [bm25_tokenizer(doc) for doc in texts]
tokenized_query = bm25_tokenizer(query)

bm25 = BM25Okapi(tokenized_corpus)

After defining the tokenizer, we can search the most relevant page and print the chatbot answer.

most_similar_page = bm25.get_top_n(tokenized_query, texts, n=1)[0]
answer = get_openai_rag_response(query, most_similar_page)
print(answer)
The chatbot found the correct page and gave a sensible answer.

The correct page is found, and the chatbot can respond.

Pre-processing is crucial when using keyword-based models, as individual keywords become the search terms. Even with our minimal pre-processing, BM25 outperformed semantic search on error codes. Pre-processing, however, requires a lot of work, and it can’t be extended to cover all synonyms and semantic meaning.

What if we could get the best from keyword and semantic search?

Combining semantic and keyword search

Semantic search and keyword search pay attention to different aspects of the text. We can combine the results of any search algorithms with Reciprocal Rank Fusion (RRF). RRF works by evaluating search scores from multiple ranked result lists and merging them into a unified set. The details of RRF are explained in the code below, but they’re not crucial to understand; the formal explanation is given in the original 2-page research paper.

def rrf(all_rankings: list[list[int]]):
    """Takes in a list of rankings produced by multiple retrieval algorithms,
    and returns a newly ranked and scored list of items."""
    scores = {}  # key is the index and value is the score of that index
    # 1. Take every retrieval algorithm ranking
    for algorithm_ranks in all_rankings:
        # 2. For each ranking, take the index and the ranked position
        for rank, idx in enumerate(algorithm_ranks):
            # 3. Calculate the score and add it to the index
            if idx in scores:
                scores[idx] += 1 / (60 + rank)
            else:
                scores[idx] = 1 / (60 + rank)

    # 4. Sort the indices based on accumulated scores
    sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return sorted_scores

By inserting our retrieval algorithm rankings into the RRF, we get a fused ranking:

new_ranks = rrf([semantic_top_5_matches_idx, keyword_top_5_matches_idx])
print_output(new_ranks)
The page and the ranking score are computed by RRF.

The semantic and keyword retrieval algorithms do not agree on ANYTHING. RRF assumes overlap between the retrieval algorithm results to work. If that assumption does not hold, we end up simply taking the top results from both algorithms, making the algorithm effectively useless. We could weigh different retrieval algorithms, but it’s not a generalized approach to solving the fundamental issue.

Our last option is to either fine-tune our retrieval algorithms so that they align at least partly, or to try re-ranking. The basic idea of re-ranking is to train a model that re-orders the final result set according to some criteria. This would introduce a third model, but we want to tackle the underlying issue with our existing models, not bring in more models and complexity.

Let us then fine-tune our embedding model.

The AI is confused, unable to solve the problem, and lost in the desert. Could fine-tuning help our AI?

Embedding model fine-tuning on custom data

With fine-tuning, we wish to produce semantic meaning in the code APP_w304, relating it to issues with the camera.

As the code APP_w304 is related to Camera blocked or blinded, we should try to move the code closer to the word camera and other related words such as blinded. Right now APP_w304 is similar to words application and software, and as you saw, it didn’t work out well for our vector search.

Individual words in the vector (embedding) space. Words close together are defined as similar words. APP_w304 should be moved closer to the red box. Illustration.

In technical terms, we want to move the embedding of APP_w304 closer to the word camera’s embedding and further away from the word application’s embedding. To achieve this, we need training data consisting of:

  • query: a sentence including one of our error codes
  • positive passage: a text similar to our error code
  • negative passage: a text dissimilar to our error code

We should not only consider a single error code, APP_w304 , as there are tens of error codes in the manual. We can either collect data manually, which is laborious, or try to generate data automatically.

Being data scientists, let’s explore the automatic data generation options.

Generating the training data

We need to gather queries, and passages that are similar and dissimilar to each query. An example triplet (query, positive, negative) could be (“What’s APP_w218 error?”, “Autosteer speed limit exceeded”, “seatbelts must be fastened”). The positive passage must be closely related to the query, and the negative passage must be either a plain negative or a hard negative. Hard negatives are sentences that are “somewhat similar” to the positive passage, with some minor differences. In this case, a hard negative could be “autosteer connection error”. Hard negatives are harder to generate, so we will proceed with generating the “easy” negatives.

The page where the example error code APP_w218 appears.

We chunked our PDF document at the page level. As our semantic search considers each page as a passage, we must take full pages as positive and negative passages. In the image above, we can see that a single troubleshooting page may contain multiple error codes. Optimally, we would have a single chunk per error code, such that each chunk captures only a single topic. That would require re-defining the chunking strategy and is out of scope for this tutorial.

Let’s gather all error codes from the document and generate realistic questions users might have, such as

  • ‘I see error code $errorCode. What should I do?’
  • ‘There’s code $errorCode on my dashboard. Help!’

We can take the page of the query as a positive passage, and a random page before the troubleshooting pages as a negative passage.

LLMs such as GPT-4 are often used to generate synthetic data. We will iterate through each troubleshooting page, feeding them to GPT-4 to produce queries related to the page’s error codes. The exact GPT-4 prompt can be found in the Google Colab example notebook.
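A simplified sketch of such a generation loop is shown below. The error-code regex, the troubleshooting page range, and the prompt are illustrative assumptions rather than the notebook's actual values; texts is the list of page chunks from earlier.

import random
import re
from openai import OpenAI

client = OpenAI()
TROUBLESHOOTING_PAGES = range(212, 230)  # hypothetical page range of the troubleshooting section

triplets = []
for page_no in TROUBLESHOOTING_PAGES:
    page_text = texts[page_no]
    error_codes = re.findall(r"APP_w\d+", page_text)  # illustrative regex for this code family
    if not error_codes:
        continue
    # Ask GPT-4 to write one realistic user question per error code on the page.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write one short user question for each of these Tesla error codes: "
                       f"{', '.join(error_codes)}. Return one question per line.",
        }],
    )
    queries = response.choices[0].message.content.splitlines()
    # Positive passage: the page itself. Negative passage: a random page before troubleshooting.
    negative_page = texts[random.randint(0, min(TROUBLESHOOTING_PAGES) - 1)]
    for q in queries:
        triplets.append({"query": q, "positive": page_text, "negative": negative_page})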

An example of a single training data point can be seen in the image below

An example of a triplet containing query, positive and negative passage.

We still need to define our loss function before proceeding to training. We want to penalize the model when the embeddings of a negative passage and the query are close together, and reward it when a positive passage and the query are close. We will be using CosineEmbeddingLoss, which, loosely speaking, does exactly this.
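As a small illustration of how CosineEmbeddingLoss scores a single triplet, here is a toy example with random tensors standing in for the embeddings produced by the bge model; the real training loop lives in the notebook.

import torch

loss_fn = torch.nn.CosineEmbeddingLoss()

query_emb = torch.randn(1, 768, requires_grad=True)     # stand-in for the embedded query
positive_emb = torch.randn(1, 768, requires_grad=True)  # stand-in for the embedded positive passage
negative_emb = torch.randn(1, 768, requires_grad=True)  # stand-in for the embedded negative passage

# Target +1: pull query and positive together. Target -1: push query and negative apart.
loss = (loss_fn(query_emb, positive_emb, torch.tensor([1.0]))
        + loss_fn(query_emb, negative_emb, torch.tensor([-1.0])))
loss.backward()  # in real fine-tuning the gradients flow back into the embedding model
print(loss.item())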

Now we can start training our model, tracking our validation loss every epoch to make sure we are not overfitting.

Validation loss is constantly below training loss, meaning that we are not overfitting. We stop training when the loss starts flattening, at the “elbow point”.

We can see the training and validation loss going down nicely in a controlled manner.

Can we now find the correct page with the semantic search? We did not include the error code APP_w304 in our training set, to see if the model is capable of generalizing to novel error codes.

chunks, outs = get_opensource_embeddings(texts)
relevant_page, page_no, X = find_relevant_page_from_document(USER_QUERY, passage_chunks=chunks)
print_output(USER_QUERY)
print_output(str(page_no))
The most relevant page when utilizing semantic search is page 214. Page 217 would be the correct page.

We missed the correct page by one. This is no surprise, as there are no APP_w3** error codes in the training dataset. For example, APP_w222 stands for Cruise control unavailable due to camera visibility and APP_w224 for Cruise control unavailable due to lack of calibration. Without having seen APP_w223 in the training data, an algorithm could theoretically interpolate APP_w223 to be something between the calibration and camera-visibility issues. However, we do not have any other codes starting with APP_w3** , so we only get rough generalizability.

Our goal was to fine-tune our embedding model to produce semantic meaning for the error codes, such that the semantic search results would overlap with the BM25 results and RRF could be used.

Did we succeed in aligning the semantic search?

Combining fine-tuned semantic and keyword search

Now we can run the RRF again, and see the top hits.

Top matches computed for BM25, semantic search, and combined with RRF. Correct page 217 is found.

Our semantic search finds matches around troubleshooting pages (215, 216, 213, 214, 249), and now we have an overlap between keyword search and semantic search results. The output of RRF finds index 216 as the best match with a score of 0.034, which maps to the correct page 217.

We were able to teach our embedding model the semantic meaning of the error codes. A Great Success.

The AI has tackled most of the question marks and is moving towards a better place. There are still some question marks floating around, but now we have a foundation to tackle them.

Designing an IR system from the beginning

We managed to fine-tune our embedding model for custom vocabulary. The thing is, we shouldn’t have needed to do this in the first place. In general, we can find easier solutions by breaking the problem into its fundamentals.

How could we improve designing the retrieval system?

  1. Better chunking — We didn’t define chunks on the topical level, and there were multiple topics in each chunk (each page).
  2. Access to structural data — With IR systems, our goal is to reduce the need for generating data with black-box models and maximize the number of deterministic building blocks that we can trust.
  3. Inefficient utilization of different IR systems — We know that keyword search is helpful with exact terms and semantic search is helpful with semantics and meaning. We should find a way to dynamically utilize each system at appropriate times.

We treated our problem as if we didn’t have access to the data source based on which the PDF was generated. In real use cases, we should try to gain access to the structural data source from which the PDF was generated. First, this would allow us to chunk based on the topics themselves. If we had a mapping {$errorCode : $meaning}, we could also simply augment our query instead of fine-tuning the embedding model. Example:

“I see code APP_w304 (Camera blocked or blinded) on my dashboard what to do?”

By “augmenting the query”, we could bring the semantic meaning of the error code into the query, removing the need for vocabulary adaptation.
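A minimal sketch of such query augmentation, assuming we had the {$errorCode : $meaning} mapping available (the two-entry mapping below is a hypothetical excerpt; in practice it would come from the structured data source):

import re

ERROR_CODE_MEANINGS = {
    "APP_w304": "Camera blocked or blinded",
    "APP_w218": "Autosteer speed limit exceeded",
}

def augment_query(query: str) -> str:
    """Append the plain-language meaning after every known error code in the query."""
    def add_meaning(match: re.Match) -> str:
        code = match.group(0)
        meaning = ERROR_CODE_MEANINGS.get(code.upper().replace("APP_W", "APP_w"))
        return f"{code} ({meaning})" if meaning else code
    return re.sub(r"APP_w\d+", add_meaning, query, flags=re.IGNORECASE)

print(augment_query("I see code app_w304 on my dashboard what to do?"))
# -> "I see code app_w304 (Camera blocked or blinded) on my dashboard what to do?"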

If we could not augment the query, another approach would be to analyze the incoming query sentence and dynamically adjust the weights of the different IR systems. If the query consists of custom vocabulary, we would assign a higher weight to keyword search and vice-versa.
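A sketch of what this dynamic weighting could look like, reusing the rrf idea from earlier; the error-code regex, the weights, and the way custom vocabulary is detected are assumptions for illustration, not a method from the article.

import re

def weighted_rrf(all_rankings: list[list[int]], weights: list[float], k: int = 60):
    """Like rrf() above, but each retriever's contribution is scaled by its weight."""
    scores = {}
    for algorithm_ranks, weight in zip(all_rankings, weights):
        for rank, idx in enumerate(algorithm_ranks):
            scores[idx] = scores.get(idx, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def choose_weights(query: str) -> list[float]:
    """Favor keyword search when the query contains custom vocabulary such as error codes."""
    has_custom_vocab = re.search(r"APP_w\d+", query, flags=re.IGNORECASE) is not None
    return [0.3, 0.7] if has_custom_vocab else [0.7, 0.3]  # [semantic, keyword]

weights = choose_weights("I see code app_w304 on my dashboard what to do?")
# new_ranks = weighted_rrf([semantic_top_5_matches_idx, keyword_top_5_matches_idx], weights)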

Conclusions

Hooray! We connected complex topics with real-world examples, and now you are prepared to face the difficult challenges with private datasets.

Once again it turned out to be a longer article than I anticipated, and I had to cut corners. For full code examples check the Google Colab Notebook. For the next article, I plan on either diving deeper into retrieval algorithms or introducing some basic real-world issues with hallucinations and how to solve them.

Lastly, feel free to network with me on LinkedIn and follow me here if you wish to read similar articles :)

Linkedin: @sormunenteemu
Google Drive with data and notebook: RAG Beyond the obvious — Retrievers


A data scientist obsessed with working on the newest AI. Latest project: https://frendi.tech