The 10 Minute Guide to Reliable RAG Systems Using Patronus AI, MongoDB Atlas, and LlamaIndex
Today, pretty much everyone wants to use LLMs to answer questions about their documents. For example, financial analysts routinely research public company filings like SEC 10-Ks to answer questions like:
- What is the quantity of restructuring costs directly outlined in McDonald's income statements for FY2022?
- What drove operating margin change as of the FY2022 for McDonald's?
Using Retrieval-Augmented Generation (RAG) systems can be a powerful way to answer questions like these. However, these systems are prone to failure. In our research, we found that state-of-the-art retrieval systems frequently hallucinate, incorrectly answering or refusing to answer up to 81% of financial analysts’ questions! Prior research has shown that LLMs struggle with hallucinations, reasoning, and numerical calculations.
So how do we test these systems in a scalable way to identify hallucinations?
Enter: Patronus AI + MongoDB Atlas 🚀
Patronus AI is the leading automated AI evaluation and security company. Our platform enables engineers to score and benchmark LLM performance on real world scenarios, generate adversarial test cases at scale, monitor hallucinations and other unexpected and unsafe behavior, and more. Customers use Patronus AI to detect LLM mistakes at scale and deploy AI products safely and confidently.
MongoDB Atlas is a developer data platform providing a suite of cloud database and data services that accelerate and simplify how you build with data. MongoDB Atlas users benefit from a flexible and intuitive document model, as well as a host of robust features, including access to automated deployments, simple configuration changes and continuous feature improvements.
Patronus AI Evaluation Framework
Patronus runs end-to-end evaluations of your AI system on a range of criteria. Our platform supports a wide range of evaluation criteria, including correctness, relevance, PII, enterprise PII, toxicity, and custom user-defined criteria. In this tutorial, we focus on hallucination and answer relevance for RAG applications 😀
We score model outputs on each criterion using our proprietary evaluators and return results for every model response to your test inputs, along with explanations of why each score turned out the way it did.
Let’s query the “McDonald’s SEC 10-K filing” filed in Q4 2022, a PDF document about 100 pages long.
Here are various queries you can ask about the PDF. These questions and the PDF document are taken from FinanceBench, an evaluation dataset we announced a few weeks ago! An open-source sample is available on HuggingFace.
In this tutorial, we use LlamaIndex in conjunction with MongoDB’s Atlas Vector Search. LlamaIndex is a data framework that provides an interface for ingesting and indexing datasets, and it integrates well with MongoDB’s Atlas data store. We can use MongoDB’s UI to view the document stores we create. For the purposes of demonstration, we only use the McDonald’s SEC 10-K filing, but you can use the same approach for a larger set of documents!
Configuring the Document Store
First, you need to download the PDF. Then, you can parse it with the PDFReader library.
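As a sketch, assuming the LlamaIndex PDF file reader (import paths changed around llama-index 0.10; shown here in the newer `llama_index.readers.file` style) and a placeholder filename for the downloaded filing:

```python
from pathlib import Path

# Placeholder filename -- point this at the 10-K PDF you downloaded.
PDF_PATH = Path("./mcdonalds_10k_2022.pdf")


def load_10k(pdf_path: Path):
    """Parse the 10-K into LlamaIndex Document objects, roughly one per page."""
    # Imported lazily so this sketch loads even before llama-index is installed.
    from llama_index.readers.file import PDFReader

    return PDFReader().load_data(file=pdf_path)


# documents = load_10k(PDF_PATH)
```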
Assuming you have created a MongoDB cluster, you can create a MongoDB document store by connecting to your MongoDB instance. This creates a default database called “db_docstore” (you can rename this).
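A sketch of that step, assuming the `MongoDocumentStore` integration (the import path varies by llama-index version) and a placeholder Atlas connection string:

```python
def build_docstore(mongo_uri: str, db_name: str = "db_docstore"):
    """Create a MongoDB-backed document store for the parsed documents."""
    # Import path for llama-index >= 0.10; older versions expose
    # `from llama_index.storage.docstore import MongoDocumentStore`.
    from llama_index.storage.docstore.mongodb import MongoDocumentStore

    return MongoDocumentStore.from_uri(uri=mongo_uri, db_name=db_name)


# docstore = build_docstore("mongodb+srv://<user>:<password>@<cluster>/...")
# docstore.add_documents(documents)
```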
You can view the stored document in the MongoDB Atlas UI.
There are a few ways you can construct the index for Vector Search in LlamaIndex. We construct a VectorStoreIndex, which is the most common type of index for vector DBs. The VectorStoreIndex essentially splits your documents into nodes (chunks of text with some metadata), then creates vector embeddings of each node’s text. The vector embeddings can then be persisted in MongoDB. The full system overview is shown below 👇
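A sketch of the index construction, assuming llama-index ≥ 0.10 import style; the database and collection names are placeholders, and the keyword for the Atlas Search index name has varied across versions, so check the `MongoDBAtlasVectorSearch` integration docs:

```python
def build_vector_index(documents, mongo_uri: str):
    """Split documents into nodes, embed them, and persist the vectors in Atlas."""
    import pymongo
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

    client = pymongo.MongoClient(mongo_uri)
    vector_store = MongoDBAtlasVectorSearch(
        client,
        db_name="rag_demo",          # placeholder names
        collection_name="mcd_10k",
        index_name="vector_index",   # must match your Atlas Vector Search index
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # Chunks the documents into nodes, embeds each node, writes vectors to Atlas.
    return VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```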
Now we are ready to query this document and evaluate the responses!
Querying the Retrieval System
Let’s try our user queries with our newly constructed RAG system.
First, you need to load a vector store from MongoDB documents:
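A sketch of that step (placeholder database and collection names): because the vectors are already persisted in Atlas, the index can be reconstructed without re-embedding anything.

```python
def load_index(mongo_uri: str):
    """Rebuild a queryable index from vectors already stored in Atlas."""
    import pymongo
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

    client = pymongo.MongoClient(mongo_uri)
    vector_store = MongoDBAtlasVectorSearch(
        client, db_name="rag_demo", collection_name="mcd_10k"
    )
    return VectorStoreIndex.from_vector_store(vector_store)
```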
Then, you can provide the user queries and view responses:
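For example, using the two FinanceBench questions from earlier (`similarity_top_k` is an illustrative setting, not a recommendation):

```python
QUERIES = [
    "What is the quantity of restructuring costs directly outlined in "
    "McDonald's income statements for FY2022?",
    "What drove operating margin change as of the FY2022 for McDonald's?",
]


def answer_queries(index, queries):
    """Run each query through the index, collecting the answer and the context."""
    query_engine = index.as_query_engine(similarity_top_k=3)
    results = []
    for query in queries:
        response = query_engine.query(query)
        results.append({
            "query": query,
            "answer": str(response),
            # The retrieved chunks are what we will later hand to the evaluator.
            "contexts": [node.get_content() for node in response.source_nodes],
        })
    return results
```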
Here are some examples of the RAG system’s outputs:
Now that we have our system up and running, the next question is: How do we know whether the outputs are good? We can use the Patronus API to evaluate the quality of RAG outputs ✅
Reach out to us at firstname.lastname@example.org to get an API key!
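With a key in hand, an evaluation call might look like the following sketch. The endpoint URL, evaluator names, and request fields here are assumptions for illustration only — confirm the exact request shape against the Patronus API documentation.

```python
import json
import urllib.request

# Assumed endpoint -- verify against the Patronus API docs.
PATRONUS_API_URL = "https://api.patronus.ai/v1/evaluate"


def build_payload(model_input: str, model_output: str, retrieved_context: list):
    """Assemble an evaluation request for hallucination and answer relevance."""
    return {
        # Evaluator and field names are illustrative placeholders.
        "evaluators": [{"evaluator": "hallucination"},
                       {"evaluator": "answer-relevance"}],
        "evaluated_model_input": model_input,
        "evaluated_model_output": model_output,
        "evaluated_model_retrieved_context": retrieved_context,
    }


def evaluate_rag_output(api_key: str, model_input, model_output, retrieved_context):
    """POST one RAG response to Patronus and return the parsed JSON scores."""
    request = urllib.request.Request(
        PATRONUS_API_URL,
        data=json.dumps(
            build_payload(model_input, model_output, retrieved_context)
        ).encode(),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```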
Below, we show a sample output. Note that the retrieved context is shortened.
In this case, the response was incorrect because there was no information in the context to support the claim that McDonald’s is capital intensive. However, the response was relevant to the input. In both cases, Patronus correctly scored the model output, reducing time spent on human evaluation.
The other neat part about Patronus’ API is that it works lightning fast, allowing you to quickly identify hallucinations at scale 🙂 Although we describe just a single example above, you can use the Patronus API to iteratively test as you experiment with new RAG system design choices, new data, different base prompts, and more.
OK. I found hallucinations. How do I fix them now?
There are a number of ways to improve the performance of your RAG system. Here are a few options:
- Explore different indexers. Although similarity-based search is common, keyword-based search might be a better strategy depending on the application you’re building. Experimenting with different indexers can help you understand whether the context you typically retrieve is actually relevant.
- Think through your chunking strategy. Smaller chunk sizes can improve retrieval precision but can hurt LLM generations, since relevant context may be cut off mid-chunk. You can experiment with a variety of chunk sizes and loop over your evaluation dataset to compare performance.
- Experiment with your base prompt. Even if the retrieved context is accurate, your LLM might have problems generating accurate outputs because the original prompt you used might not be good enough. Experimenting with this can help you achieve better results with your LLM outputs.
- Finetune your embedding model. Although this approach is more technically complex and expensive, finetuning your embedding model so that it is more contextually appropriate to your use case can give you more accurate retrieved context.
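To make the chunking experiment concrete, here is a minimal, self-contained sketch in plain Python: a naive word-based splitter stands in for a real splitter (such as LlamaIndex’s SentenceSplitter), and a toy keyword-overlap score stands in for your actual evaluation loop. The sample text and numbers are made up for illustration.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into chunks of `chunk_size` words, optionally overlapping."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split())) / len(query_words)


# Sweep several chunk sizes and see which best isolates the answer span.
document = (
    "Revenues increased in FY2022. Restructuring charges of 100 million were "
    "recorded in the income statement. Operating margin was impacted by "
    "currency translation and higher selling costs."
)
query = "restructuring charges income statement"

for size in (8, 16, 32):
    best = max(chunk_words(document, size, overlap=2),
               key=lambda chunk: overlap_score(query, chunk))
    print(f"chunk_size={size:>2}  "
          f"best score={overlap_score(query, best):.2f}  chunk={best[:60]!r}")
```

Swapping the toy score for Patronus evaluation calls over your dataset turns this loop into a real chunk-size comparison.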
Regardless of which approach you take to debug and fix hallucinations, it’s always important to continuously test your RAG system to make sure performance improves. Of course, you can use the Patronus API iteratively to confirm! 🙂
Building and testing RAG systems is simple with Patronus AI, MongoDB Atlas, and LlamaIndex 🚀
Reach out to us at email@example.com to get an API key and learn more!