Powerful AI Evaluation and Optimization

The best way to ship top-tier AI products. Based on industry-leading AI research and tools.

Trusted by leading companies around the world

Score and Optimize your LLM System in Seconds

Use the Patronus API in any stack

curl -X POST "https://api.patronus.ai/v1/evaluate" \
  -H "X-API-KEY: <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluators": [
      {
        "evaluator": "lynx",
        "criteria": "patronus:hallucination",
        "explain_strategy": "always"
      }
    ],
    "evaluated_model_retrieved_context": ["The capital of France is Paris"],
    "evaluated_model_input": "What is the capital of France?",
    "evaluated_model_output": "The capital of France is Dublin, which is located on the Seine River."
  }'

from patronus import init
from patronus.evals import RemoteEvaluator

init(api_key="<token>")

# Lynx evaluator configured with the hallucination criteria
hallucination = RemoteEvaluator("lynx", "patronus:hallucination")

result = hallucination.evaluate(
    task_input="What is the capital of France?",
    task_output="The capital of France is Paris, which is located on the Seine River.",
    task_context="The capital of France is Paris.",
)

fetch("https://api.patronus.ai/v1/evaluate", {
  // fetch defaults to GET, which cannot carry a body, so POST must be explicit
  method: "POST",
  headers: {
    "X-API-KEY": "<token>",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "evaluators": [
      {
        "evaluator": "lynx",
        "criteria": "patronus:hallucination",
        "explain_strategy": "always"
      }
    ],
    "evaluated_model_system_prompt": "",
    "evaluated_model_retrieved_context": ["The capital of France is Paris."],
    "evaluated_model_input": "What is the capital of France?",
    "evaluated_model_output": "The capital of France is Dublin, which is located on the Seine River.",
    "evaluated_model_gold_answer": ""
  })
})
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error(error));
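
For stacks without the SDK, the same request can be assembled with the Python standard library. The sketch below is a minimal illustration, not an official client: the endpoint, header names, and payload fields are copied from the snippets above, while the helper name `build_evaluate_request` is hypothetical. It only builds the request; the response format is not described on this page, so nothing is parsed here.

```python
import json
import urllib.request

# Endpoint taken from the curl and fetch snippets on this page.
API_URL = "https://api.patronus.ai/v1/evaluate"


def build_evaluate_request(
    api_key: str, question: str, answer: str, context: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper for illustration; not part of the Patronus SDK.
    """
    payload = {
        "evaluators": [
            {
                "evaluator": "lynx",
                "criteria": "patronus:hallucination",
                "explain_strategy": "always",
            }
        ],
        "evaluated_model_retrieved_context": context,
        "evaluated_model_input": question,
        "evaluated_model_output": answer,
    }
    # json.dumps always emits valid JSON, which avoids hand-written
    # mistakes such as a trailing comma in the request body.
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_evaluate_request(
    api_key="<token>",
    question="What is the capital of France?",
    answer="The capital of France is Dublin, which is located on the Seine River.",
    context=["The capital of France is Paris."],
)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a real API key.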

Product Capabilities

Start with Patronus on Day 0 and never look back.

Patronus Evaluators

Access industry-leading evaluation models designed to score RAG hallucinations, image relevance, context quality, and more across a variety of use cases

Patronus Experiments

Measure and automatically optimize AI product performance against evaluation datasets

Patronus Logs

Continuously capture evals and auto-generated natural-language explanations, with failures proactively highlighted in production

Patronus Comparisons

Compare, visualize, and benchmark LLMs, RAG systems, and agents side by side across experiments

Patronus Datasets

Leverage industry-standard datasets and benchmarks such as FinanceBench, EnterprisePII, and SimpleSafetyTests, each designed for a specific domain

Patronus Traces

Automatically detect agent failures across 15 error modes, chat with your traces, and autogenerate trace summaries

Why Us

We take a research-first approach

The team at Patronus has been testing LLMs since before the GenAI boom

Our approach is state-of-the-art: 18% better at detecting hallucinations than OpenAI LLM-based evaluators*

*benchmarks available upon request

We offer production-ready LLM evaluators for general, custom, and RAG-enabled use cases

Our off-the-shelf evaluators cover your bases (e.g., toxicity, PII leakage) while our custom evaluators cover the rest (e.g., brand alignment)
We support real-time evaluation with fast API response times (as low as 100ms)
You can start using the Patronus API with a single line of code

We offer flexible hosting options with enterprise-grade security

No need to worry about managing servers with our Cloud Hosted solution 
Our On-Premise offering is also available for customers with the strictest data privacy needs 
You can rest assured that your proprietary data will never be shared outside our organization 
We are vetted annually by third-party security companies

We are trusted by a strong array of customers and partners

Patronus is the only company to provide an SLA guarantee of 90% alignment between our evaluators and human evaluators
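
This page does not define how alignment is measured. One plausible reading, assumed here, is the share of examples on which an evaluator's pass/fail verdict matches a human annotator's. The sketch below computes that agreement rate on made-up labels; the function name and the data are illustrative only.

```python
def agreement_rate(evaluator_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the evaluator and the human agree.

    Assumes "alignment" means simple percent agreement, which the
    source page does not confirm.
    """
    if len(evaluator_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not evaluator_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return matches / len(evaluator_labels)


# Made-up verdicts for ten examples (True = pass, False = fail).
evaluator_verdicts = [True, True, False, True, False, True, True, False, True, True]
human_verdicts = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(evaluator_verdicts, human_verdicts)
print(f"Agreement: {rate:.0%}")  # 9 of the 10 verdicts match
```

Under this reading, a 90% threshold would be met when at most one verdict in ten differs from the human label.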

Our customers include OpenAI, HP, and Pearson

Our partners include AWS, Databricks, and MongoDB

Industry-Leading AI Research

Our AI research team is behind cutting-edge AI evaluation agents, evaluation models, and evaluation benchmarks, which are now used by hundreds of thousands of organizations and developers around the world.

Lynx

An Open Source Hallucination Evaluation Model

FinanceBench

A New Benchmark for Financial Question Answering

SimpleSafetyTests

A Test Suite for Identifying Critical Safety Risks in Large Language Models

GLIDER

Grading LLM Interactions and Decisions using Explainable Ranking

What they say about us

As scientists and AI researchers, we spend significant time on model evaluation. The Patronus team is full of experts in this space, and brings a novel research-first approach to the problem. We're thrilled to see the increased investment in this area.

Jonathan Frankle

Chief AI Scientist at Databricks

"Evaluating LLMs is multifaceted and complex. LLM developers and users alike will benefit from the unbiased, independent perspective Patronus provides."

Max Bartolo

Command Modeling Lead at Cohere

"Testing LLMs is in its infancy. The best methods today rely on outdated academic benchmarks and noisy human evaluations -- equivalent to sticking your finger in water to get its temperature. Patronus is leading with an innovative approach."

Andriy Mulyar

Co-founder and CTO of Nomic AI

"Engineers spend a ton of time manually creating tests and grading outputs. Patronus assists with all of this and identifies exactly where LLMs break in real world scenarios."

Linus Lee

AI Whisperer

Patronus AI doesn’t just help you build trust in your generative AI products, they make sure your own users trust your products too. They always go one step further to make sure you succeed with your AI use case in production.

Azadeh Moghtaderi

Vice President of Data

The Patronus team is taking a holistic and most innovative approach to finding vulnerabilities in LLM systems. Every company that wants to build LLM-based products will need to solve for it and the Patronus team is the most thoughtful group tackling this problem.

Barkha Saxena

CDO at Chime

One of the standout features of Patronus is its customizability. I can bring my own evaluations or set up my own Custom Evaluator in 30 seconds, and then do everything else from there within the platform.

Chen Peng

VP, Head of Data & ML of Faire

Patronus AI is at the forefront of multilingual AI evaluation. DefineX is excited to be using Patronus’ proprietary technology to safeguard generative AI risks in the Turkey & Middle East region and beyond.

Emre Hayretci

Co-founder and Managing Director at DefineX

Patronus and their straightforward API makes it really easy to reliably evaluate issues with LLMs and mitigate problems like content toxicity, PII leakage, and more. We're excited to partner with Patronus to combine their evaluation capabilities with Radiant's production reliability platform to help customers build great GenAI products.

Nitish Kulkarni

Co-founder and CEO of Radiant AI

I love that Patronus supports both offline and online workflows. It’s a game changer when an engineering team has to do no extra work in making their offline evaluation setup work in real-time settings. This is because their API is really easy to use, and is framework-agnostic and platform-agnostic.

Lior Solomon

VP of Data at Drata

In our mission to bring the AI stack close to enterprise data and offering best in class tools to train and deploy AI solutions, we are thrilled to partner with Patronus AI. Our combined platform will help in training, finetuning, rigorously testing, and monitoring LLM systems in a scalable way.

Mouli Narayanan

Founder and CEO of Zeblok

AI won’t take your job but it will change your job description. Safety in the workplace and security in the workspace is the only way to be AI-ready. That’s only possible with Patronus.

Gabriel Paunescu

Co-founder and CEO of Naologic

One of the neat things about the Patronus experience is the part that comes after catching LLM mistakes - insights with natural language explanations, failure mode identification, and semantic clustering.

Dave Burgess

VP of Data

"Patronus helped our AI team find signals and patterns of error in our datasets. Their LLM Judges enabled us to triage errors and optimize our AI outputs in production settings. Patronus up-leveled our evaluation process and was an invaluable part of our workflow."

Jon Noronha

Co-Founder of Gamma

The Most Powerful AI Evaluation & Optimization Platform.

Built on Leading AI Research.

View Our Partners

Ready to level up your AI evaluation approach?

Book a call

Our latest update

Introducing the Patronus API

The most reliable way to score your LLM system in development and production.

Meet the Patronus Evaluators.

State-of-the-art evaluation models at your fingertips. Designed to help AI engineers iterate on AI-native workflows like RAG systems and agents at scale.

Patronus Evaluation Capabilities

System Performance

Hallucinations

Context relevance

Answer relevance

Context Sufficiency

Answer Correctness

Security

Prompt injections

Sensitive data leakage

Bias

Toxicity

OWASP risks

Alignment

Off topic

Conciseness

Brand alignment

Tone of voice

Style

Bring Your Own Evaluator

Use the SDK to configure custom evaluators for function calling, tool use, and more

import patronus
from patronus.evals import evaluator, EvaluationResult
from patronus.experiments import run_experiment
from openinference.instrumentation.openai import OpenAIInstrumentor

patronus.init(api_key="<token>")


@evaluator()
def exact_match(row):
    # Compare case-insensitively so capitalization alone never causes a failure
    gold_answer = row.evaluated_model_gold_answer.lower()
    model_output = row.evaluated_model_output.lower()

    result = gold_answer == model_output
    return EvaluationResult(
        pass_=result,
        score=1 if result else 0,
        explanation="Exact match detected" if result else "No exact match detected",
    )


dataset = [
    {
        "evaluated_model_input": "What was the company's revenue in Q4 2023?",
        "evaluated_model_output": "According to the financial report, the company's revenue in Q4 2023 was $5.2 billion, showing a 15% increase from the previous quarter.",
        "evaluated_model_gold_answer": "$5.2 billion",
    },
    {
        "evaluated_model_input": "How many employees were hired in 2023?",
        "evaluated_model_output": "Based on the annual report, the company hired 1,200 new employees globally in 2023, with 60% of hires in engineering roles.",
        "evaluated_model_gold_answer": "1,200 new employees",
    },
]

experiment = run_experiment(
    dataset=dataset,
    evaluators=[exact_match],
    tags={"dataset": "finance_dataset", "model": "gpt-4o-mini"},
    integrations=[OpenAIInstrumentor()],
    project_name="Finance Dataset",
    experiment_name="GPT 4o mini",
)
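
Note that a strict equality check fails on both sample rows above, because each model output is a full sentence while the gold answer is a short phrase. The sketch below, in plain Python with no SDK calls, contrasts exact matching with a looser containment check on the same two rows. The helper names `normalize`, `exact_match`, and `contains_match` are hypothetical and written only for this illustration; whether containment is the right criterion depends on the use case.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match(model_output: str, gold_answer: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(model_output) == normalize(gold_answer)


def contains_match(model_output: str, gold_answer: str) -> bool:
    """True when the normalized gold answer appears inside the output."""
    return normalize(gold_answer) in normalize(model_output)


# (model output, gold answer) pairs copied from the dataset above.
samples = [
    (
        "According to the financial report, the company's revenue in Q4 2023 "
        "was $5.2 billion, showing a 15% increase from the previous quarter.",
        "$5.2 billion",
    ),
    (
        "Based on the annual report, the company hired 1,200 new employees "
        "globally in 2023, with 60% of hires in engineering roles.",
        "1,200 new employees",
    ),
]

for output, gold in samples:
    print(
        f"{gold!r}: exact={exact_match(output, gold)}, "
        f"contains={contains_match(output, gold)}"
    )
```

The same comparison could stand in for `gold_answer == model_output` inside the decorated evaluator if containment better reflects what counts as a correct answer.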

Get in touch!
