RAG Evaluation Metrics: Best Practices for Evaluating RAG Systems

With the advent of LLMs, retrieval-augmented generation systems (RAG systems) have become a common architectural pattern within enterprises. RAG systems help organizations apply an LLM’s natural language understanding ability to the organization’s own data. However, while this solves the problem of embedding organizations' own data into an LLM’s decision process, the results can often vary according to use cases.

RAG evaluation metrics help measure RAG systems' effectiveness and benchmark them against target values. This article explains popular RAG evaluation metrics, how to implement them, and best practices for taking RAG systems to production.

Summary of key concepts related to RAG evaluation metrics

Concept	Description
RAG evaluation	RAG evaluation involves measuring the effectiveness of two key components: context retrieval and the generated response.
RAG evaluation metrics	Five key metrics are used to evaluate RAG performance: context relevance, context sufficiency, answer relevance, answer correctness and answer hallucination.
Context retriever evaluation	Context relevance and context sufficiency are used to evaluate context retrievers. Context relevance measures the extent to which the fetched context is relevant to the user query. Context sufficiency evaluates whether the fetched context contains enough information to answer the user query correctly.
Generator (e.g. LLM) evaluation	Answer relevance, answer correctness, and answer hallucination are key metrics used to evaluate the generator's performance.
Popular RAG evaluation frameworks	Open-source frameworks like RAGAS, Deep Eval, and Truelens and cloud-based managed services like Patronus AI provide commonly used RAG evaluation frameworks.
Best practices for RAG evaluation	These are some of the relevant key practices: Establish a gold standard early in the lifecycle. Choose accuracy metrics based on relevant use cases. Establish security metrics. Set up automated testing patterns. Continuously evolve gold references. Establish thresholds for detecting drift.

Understanding RAG evaluation

Retrieval augmented generation is an architectural pattern that enables internal organizational data to be fed into LLMs. It helps LLMs prioritize the organization's own data and use it as context rather than relying on its own trained data. It separates an LLM's natural language understanding ability from trained knowledge and responds to user queries using fact snippets fed as the context. The RAG pattern works well for unstructured data sources like long-form text.

The RAG pattern consists of three main components.

A vector database that holds organizations' factual knowledge. The vector database consists of knowledge assets converted into embeddings using a text-to-embedding model. The vector database represents the retriever part of the RAG pattern.
A set of prompts that combines relevant facts fetched from a vector database with the instructions for the LLM.
The LLM that receives the prompt along with the facts to respond appropriately.

The basic RAG flow starts by converting user queries into an embedding using a text-to-embedding model. The generated embedding is then searched against the vector database to fetch documents that contain text with similar meaning. This step is called context retrieval. The fetched documents are then bundled with the user query and the application-specific prompt. This bundle is sent to the LLM API to fetch the results for the user query. This second stage is called response generation.

The following figure explains a RAG workflow and typical reasons why a RAG pattern responds incorrectly. The context retrieval and generator errors will be explained in more detail in the RAG evaluation metrics section.

RAG Evaluation Metrics: Best Practices for Evaluating RAG Systems — Example of the RAG workflow with areas for evaluation and optimization

Challenges in evaluating RAG Systems

Any machine learning model is evaluated by comparing its responses with a gold standard dataset. This comparison is challenging in the case of LLMs for several reasons.

First, since LLMs are great at natural language understanding and generation, they can frame the response in several ways without altering the meaning. It is not feasible to have all possible combinations of correct responses in a gold dataset.

Second, LLMs often exhibit inconsistencies. The same input and context combination can result in several output variations when executed repeatedly.

Methodologies for evaluating LLMs

Traditionally, language models were evaluated based on statistical metrics that compared the expected and actual outputs based on the extent of the match between them. Statistical metrics do not consider the meaning of the outputs and use partial word matches to arrive at a value. Hence, they do not work well for LLM-generated text.

Some of the commonly used statistical metrics are described below.

BLEU And ROGUE

Bilingual Evaluation Understudy (BLEU) measures the precision of n-grams in output compared to test data. N-grams here refer to substrings with n number of words. BLEU also includes a brevity penalty term that penalizes output that is shorter than the gold response. Recall Oriented Understudy For Gisting (ROGUE) measures the recall of n-grams compared to the gold response.

Perplexity

Perplexity considers the probability of a word being the next word output by the model, given a sequence of words. It is computed against the words in the gold standard response.

Embeddings-based comparison and LLM as a judge

As evident from the definitions above, these metrics search for exact word matches between the gold standard and actual output. This is not a good strategy when evaluating LLMs with very superior vocabulary, so such methods are not recommended anymore. This is where embedding-based similarity and LLM-as-a-judge approaches come in.

Embeddings are vector numeric representations of words and sentences generated by embedding models. Such models are trained on millions of sentences to understand the nuances of the meanings of words. The idea is that the distance between numerical representations for similar words is lower than for completely different words. Measures like cosine similarity are used to measure the distance between embeddings.

LLM-as-a-judge approaches use an LLM itself to compare the actual response and the gold response, which allows for more granular comparisons based on predefined metrics that prioritize the use of case-specific nuances. The well-known G Eval paper presents a framework for evaluating RAG systems using an LLM and chain of thought prompts.

The LLM-as-a-judge approach is not limited to using generic LLMs like Open AI or LLAMA3; there are LLMs explicitly trained to act as judges. These LLMs do a better job of detecting hallucinations and context relevance. Lynx and Glider are examples of LLMs trained for evaluation.

RAG evaluation metrics

Now that we have looked at what can go wrong with RAG output let’s examine the metrics that represent the effectiveness of RAG evaluation. Note that this article focuses on the effectiveness of the retriever and generator rather than the evaluation of text embedding quality.

Context effectiveness

RAG architecture retrieves context or facts by converting the user query into an embedding and fetching documents that are closest to it. At times, the fetched context may not be relevant or sufficient to answer the actual context for reasons like low-quality embeddings, improper chunking strategies, or insufficient data. Context effectiveness metrics measure the quality of the retrieved context.

RAG implementations usually limit the number of context entries fetched from the vector databases based on a constant number. This is done to minimize the amount of data that is bundled along with the prompt for the LLM to process. Context effectiveness metrics are computed specific to this limiting parameter. For example, if the implementation limits the number of context entries from the vector database to a number K, the metrics are defined in terms of top K results.

Context relevance: This metric represents the extent to which correct context information is present in the overall context fetched from the vector database. In academic terms, this is also known as context precision. This is calculated through the below formula.

The evaluation frameworks usually extract all the statements in retrieved context using an LLM and then classify each of them as relevant or irrelevant using another LLM call to calculate this metric.

Context sufficiency: This metric represents the extent to which the fetched context includes information to respond to user queries correctly. In academic terms, this is similar to context recall and requires comparison with a gold answer. Mathematically, this is computed using the below formula.

Here, the number of attributable statements is calculated by counting the number of statements in gold reference answers that can be directly matched with a statement in the retrieved context.

‍

Optimizing chunking strategies and playing around with chunk overlaps can help improve these metrics. For example, one may be using a fixed chunking strategy of 500 tokens per chunk. For a large document, this would mean individual chunks have very little information, and fetched chunks simply may not have enough information to answer the user query.

To optimize the retrieval process, one can use more involved chunking strategies like recursive chunking or semantic chunking. Recursive chunking recursively scans words until the specified chunk size or a token separator—like a full stop or new line—is found. Semantic chunking considers the meaning of the chunks to group adjacent sentences with related meanings together. Another approach is to introduce a small overlap between the chunks to ensure that chunks also have some information from neighboring chunks.

Using a hybrid querying method that combines any available metadata information while querying the vector database is also an alternative to improve context retrieval.

Generator effectiveness

LLMs often exhibit poor consistency and faithfulness even when the prompts contain the correct context. For example, consider the following table showing a user query, retrieved context, and final response.

User query	Retrieved Context	Final Response
Who is David Beckham?	David Beckham is a former English soccer player who played for Manchester United and Real Madrid.	David Beckham is a soccer player who played for Real Madrid.

Here, the LLM left out key information about the current status of the player and the clubs he played for, even though the context included information to answer the question completely.

The effectiveness of the LLM generator can be measured in terms of these metrics:

Answer relevance: This metric represents whether the model answer is relevant to the user input. Answer relevance is measured using the below formula.

Answer correctness: This represents whether the answer is factually correct and includes all the information compared to the gold standard response. In other words, it measures whether the model output aligns with the gold answer.
Answer hallucination: This measures whether the answer is faithful to the retrieved context. In other words, it measures if the answer contains information not present in the fetched context or is misrepresented compared to the fetched context. This is also known as faithfulness. This metric is calculated using the below formula.

Evaluation frameworks use LLMs to identify the individual claims made in the output and assess if the claims are supported by the fetched context.

Security considerations

RAG system evaluation is not limited to context and generator effectiveness. Enterprises need to evaluate RAG systems for any possible compliance or security violations. Here are some of the critical security checks for an LLM:

Prompt injection: Prompt injection refers to a scenario where user input contains a prompt, and the LLM erroneously executes it, bypassing the system prompts set up by the application. This can lead to a user taking control of the application and making the LLM provide harmful responses or perform harmful actions. Testing this requires one to query the LLM using inputs that may trigger prompt injection and validate if an adverse LLM behavior is triggered. Patronus AI’s prompt injection dataset is a good example of such a dataset.
Sensitive data leakage: Organizations fine-tune LLMs to customize them for their requirements using internal data. This data may include confidential information, and the LLM may include it in a response. An improper context retrieval implementation may also lead to the LLM responding with confidential information. Testing this requires one to query the LLM using inputs that may trigger leaking sensitive data and validate the responses for sensitive data. Patronus AI’s data leakage prompt dataset is a good starting point.

OWASP provides a comprehensive list of LLM vulnerabilities that one must test before deploying an RAG in production.

Implementing RAG evaluation

Several open-source and cloud-based managed services exist in the RAG evaluation space. These frameworks have built-in functions and models for evaluating LLMs based on the abovementioned metrics.

RAGAS

RAGAS is an open-source library that provides tools to evaluate large language model-based applications. RAGAS supports all the popular evaluation metrics and can integrate with several LLM frameworks like Langchain, LLamaIndex, Haystack, etc. It also supports generating synthetic data for evaluation. RAGAS can also help with evaluating mult-turn conversations that result in a binary outcome.

RAGAS is a programmatic framework that developers can add to their application code and use the built-in functions to configure evaluations. Under the hood, it uses recursive LLM calls to compare the responses. At times, RAGAS fails to extract statements from the RAG responses resulting in wrong computations.

Lynx is an open-source hallucination detection model that outperforms RAGAS on hallucination, especially in long context cases. See the results of a HaluBench hallucination evaluation benchmark here that includes RAGAS Faithfulness.

DeepEval

DeepEval is an open-source LLM evaluation framework that can help with all activities related to RAG evaluation. It can help synthesize a golden dataset, use custom-rule-based evaluation criteria, and support well-known metrics like context precision, context sufficiency, faithfulness, etc. DeepEval also allows the user to choose an LLM to execute its evaluation.

At its core, DeepEval is a programmatic framework that one can use to run the evaluations. While the framework contains many built-in functions, it is one with a steep learning curve and requires quite a bit of engineering skillset to execute properly.

DeepEval internally calculates the metrics using recursive calls to LLMs. Since the LLM calls are not greatly optimized, it is common to hit throttling limits imposed by LLM providers like Open AI while using DeepEval. This also leads to cost spikes if one is using cloud-based LLMs.

TruLens

TruLens is an open-source framework for systematically evaluating and tracking LLM experiments. It is tightly integrated with open-source RAG frameworks like Langchain and Llamaindex. TruLens also helps create guardrails and context filters based on the valuation of RAG pipelines, which improves evaluation metrics.

A TruLens code snippet to evaluate an RAG output using an answer relevance metric is given below.

from trulens.apps.virtual import VirtualRecord
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.virtual import TruVirtual
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

#Setting up data 
import pandas as pd

data = {
    "query": ["Where is Hungary?"],
    "response": ["Hungary is in America"],
    "contexts": [
        ["Hungary is a country located in Europe."]
    ],
}
df = pd.DataFrame(data)
df.head()

from trulens.apps.virtual import VirtualApp

virtual_app = VirtualApp()

from trulens.core import Feedback
from trulens.providers.litellm import LiteLLM

# Initialize provider class with Ollama
provider = LiteLLM(
    model_engine="ollama/llama3.2:1b", api_base="http://localhost:11434"
)

# Select context to be used in feedback.
context = VirtualApp.select_context()

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(context)
)

# Define a groundedness feedback function
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(context.collect())
    .on_output()
)

# Creating session and resetting current databases. 
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
from trulens.dashboard import stop_dashboard

session = TruSession()
session.reset_database()
stop_dashboard(session)
run_dashboard(session)

from trulens.apps.virtual import TruVirtual

virtual_recorder = TruVirtual(
    app_name="MyRAG",
    app_version="v0.1",
    app=virtual_app,
    feedbacks=[f_context_relevance, f_groundedness],
)
virtual_records = virtual_recorder.add_dataframe(df)

# View results. 
session.get_leaderboard()

‍

The snippet initializes the Trulens framework and defines a data frame with input, output, and context data. It computes context relevance and groundedness metrics. The result will look like the output below.

‍

Since the LLM failed to provide a correct answer according to our data set, it assigned a score of 0.0 for groundedness or faithfulness.

Patronus AI

Patronus AI is an automated testing and evaluation platform for generative AI applications. It provides an experimentation framework, a real-time monitoring facility, and visualization dashboards. Patronus AI also comes with built-in models fine-tuned for LLM evaluation and can help generate data sets for LLM evaluation.

Going beyond the common RAG evaluation metrics, Patronus AI comes with several off-the-shelf evaluators for detecting features like conciseness, politeness, age bias, gender bias, racial bias, and more.

Patronus AI comes with several built-in functions for evaluating RAG applications. It can help measure answer relevance, context relevance, context sufficiency and hallucination. For example, to measure answer relevance through UI, one can just select the ‘Evaluators’ tab from the dashboard and add the model input and output details.

‍

You can find many other built functions relevant to RAG evaluation in Patronus AI.

Patronus AI also comes with a powerful built-in model: Lynx 2.0, for detecting hallucinations. It also comes with support for detecting security vulnerabilities like prompt injection, data leakage, insecure outputs, and insecure agency. Patronus AI is not limited to UI based evaluations. It can also be used programmatically.

Installing Patronus AI is as simple as typing the command below into your console.

pip install patronus

‍

Patronus supports detecting eight kinds of hallucinations.

Hallucination detection can be implemented using Patronus AI using the code snippet below.

import patronus
from patronus.evals import RemoteEvaluator

patronus.init(api_key=<api_key>)

check_hallucination = RemoteEvaluator("lynx", "patronus:hallucination")

resp = check_hallucination.evaluate(
task_input="What is the car insurance policy qualification criteria?",
task_context="To qualify for our car insurance policy, you need a way to show competence in driving which can be accomplished through a valid driver's license. You must have multiple years of experience and cannot be graduating from driving school before or on 2028.",
task_output="To even qualify for our car insurance policy, you need to have a valid driver's license that expires later than 2028."
)

‍

This code snippet initializes a dataset and defines the lynx-small model as the judge of hallucination. The term task_input is the user query, and task_output is the actual response from the model. The term task_context represents the context retrieved.

‍

The output from code execution can be visualized in the Patronus AI dashboard.

‍

The record shows the pass status as False and provides the reasoning for the status. In this case, the reason is that the answer contained information that is not present in the context, which means the LLM was hallucinating.

Best practices for evaluating RAG Systems

Frameworks like DeepEval, TruLens, Patronus AI, etc, help streamline the process of RAG evaluation, but they are not a replacement for systematic processes to evaluate RAG Systems. Here are some of the best practices that can help you get the most out of your RAG evaluation endeavor.

Establish a gold standard early in the lifecycle

Gold datasets are critical in evaluating RAG Systems, and it is important to create them specifically for your use case early in the development cycle. You can use LLMs themselves to create the gold reference data sets. Evaluation frameworks like DeepEval can also generate gold-standard datasets.

Another option is to rely on open-source data sets. For example, Patronus AI provides a curated dataset for evaluating the performance of LLMs in question-answering use cases related to finance. While many frameworks help synthesize gold data, you should always manually verify it to ensure its integrity. This is where the organization’s domain experts must be closely involved while evaluating RAG Systems.

Choose accuracy metrics based on relevant use cases

Typical metrics like context relevancy, context sufficiency, answer relevance, hallucination, etc., provide valuable information while evaluating RAG Systems. Consider augmenting these metrics with the use of case-specific metrics to get the best out of the RAG evaluation. For example, politeness or apologetic behavior may be a preferred trait for a customer service chatbot. Similarly, a RAG operating in the finance or pharma sector may have specific requirements. LLM-as-a-judge frameworks like Patronus AI can be used to create such metrics.

Establish security metrics

Because of their universal accessibility, chatbots are vulnerable to several security attacks once deployed. RAG evaluation must include tests to detect prompt injection vulnerability, sensitive data leakage, and insecure outputs.

Set up automated testing pipelines

RAG implementations go through continuous changes even after they are deployed. This can be because of the dynamic nature of LLMs, the addition of further knowledge assets, or even the availability of improved LLMs. Playing catchup with such frequent changes will only be possible if there is a continuous integration set up with automated testing pipelines.

Continuously evolve gold references

Gold sets cannot be left alone after creation during the start of the development cycle. They need to be continuously improved while new features are added. Gold references must be properly versioned, and accuracy results must always be tied to a specific version of the gold standard dataset. Frameworks like Patronus AI help streamline gold reference creation and maintenance.

Establish thresholds for detecting drift

RAG implementations undergo continuous change because of the frequently changing requirements and evolution of LLMs. To safeguard against a gradual decline in output, it is important to set thresholds for detecting drift. These thresholds must be part of the automated evaluation and need to trigger alerts if breached.

{{banner-dark-small-1="/banners"}}‍

Last thoughts

RAG evaluation involves measuring the effectiveness of context retrieval and response generation. Evaluating RAG Systems is particularly difficult because of the LLM's ability to craft different responses with the same meaning and because of overall inconsistencies.

Measuring against a gold standard with an LLM-as-a-judge approach is the best method for evaluating LLMs. Frameworks like DeepEval, TruLens, and Patronus AI provide functions to streamline this process. Beyond the typical metrics related to context, generator, and hallucinations, one must also consider security tests to check against prompt injection, insecure outputs, and sensitive data leakage.

‍