Patronus AI launches FinanceBench, the industry’s first benchmark for LLM performance on financial questions
NEW YORK, NY—THURSDAY, NOVEMBER 16—Today, Patronus AI launched “FinanceBench”, the industry’s first benchmark for testing how LLMs perform on financial questions.
Developed by AI researchers at Patronus AI and 15 financial industry domain experts, FinanceBench is a high-quality, large-scale set of 10,000 question-and-answer pairs based on publicly available financial documents such as SEC 10-Ks, SEC 10-Qs, SEC 8-Ks, earnings reports, and earnings call transcripts. It is presented as a first line of evaluation for LLMs on financial questions, with more advanced tests to be released in the future.
Initial analysis by Patronus AI shows that state-of-the-art LLM retrieval systems fail spectacularly on a sample set of questions from FinanceBench.
- GPT-4 Turbo with a retrieval system fails 81% of the time
- Llama 2 with a retrieval system fails 81% of the time
Patronus AI also evaluated LLMs with long context windows, noting that they perform better but are less practical to use in a production setting. In particular:
- GPT-4 Turbo with long context fails 21% of the time
- Anthropic’s Claude-2 with long context fails 24% of the time
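For readers who want to reproduce this kind of headline figure on their own test set, the arithmetic behind a failure rate can be sketched as follows. This is a minimal illustration, not Patronus AI's grading pipeline: the label names ("correct", "incorrect", "refusal") and the rule that anything other than "correct" counts as a failure are assumptions made for the example, and the sample labels are invented.

```python
from collections import Counter


def failure_rate(labels):
    """Return the share of graded answers that are not labeled "correct".

    `labels` is a list of per-question grades. The label vocabulary is a
    hypothetical one chosen for this sketch, not taken from FinanceBench.
    """
    if not labels:
        raise ValueError("at least one graded answer is required")
    counts = Counter(labels)
    # Assumption: every grade other than "correct" counts as a failure.
    failures = len(labels) - counts["correct"]
    return failures / len(labels)


# Invented grades for four questions, for illustration only.
sample_grades = ["correct", "incorrect", "refusal", "incorrect"]
print(f"{failure_rate(sample_grades):.0%}")
```

Whether refusals count as failures is a design choice; a stricter or looser rule would change the reported percentage.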
Patronus AI notes that LLM retrieval systems are commonly used by enterprises today for multiple reasons. LLMs with long context windows are not only much slower and more expensive to use, but the context windows are still not large enough to support long documents typically used by analysts.
“While LLMs show promise in analyzing mass volumes of financial data, most models out in the market need a lot of refinement and steering to work properly,” said Anand Kannappan, CEO and co-founder, Patronus AI. “And based on our evaluation of GPT-4 Turbo and other models, the margin of error is just too big for financial applications.”
“Analysts are spending valuable time creating prompt test sets to evaluate LLM retrieval systems and manually inspecting outputs to identify hallucinations,” said Rebecca Qian, CTO and co-founder, Patronus AI. “And there exist no benchmarks that can help identify exactly where LLMs fail in real world financial use cases. This is precisely why we developed FinanceBench.”
The new benchmark spans several LLM capabilities in finance:
- Numerical reasoning: Finance metrics requiring numerical calculations, e.g. EBITDA, PE ratio, CAGR.
- Information retrieval: Specific details extracted directly from the documents.
- Logical reasoning: Questions involving financial recommendations, which require interpretation and a degree of subjectivity.
- World knowledge: Basic accounting and finance questions that analysts are familiar with.
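To make the numerical reasoning category concrete, the three metrics named above can be computed with the standard textbook formulas sketched below. The function names and sample figures are illustrative assumptions and are not drawn from FinanceBench. EBITDA in particular has several common variants; this sketch uses the add-back form that starts from net income.

```python
def ebitda(net_income, interest, taxes, depreciation, amortization):
    """EBITDA via the add-back form: net income plus interest, taxes, D&A."""
    return net_income + interest + taxes + depreciation + amortization


def pe_ratio(price_per_share, earnings_per_share):
    """Price-to-earnings ratio; undefined for zero or negative earnings."""
    if earnings_per_share <= 0:
        raise ValueError("P/E is not meaningful for non-positive EPS")
    return price_per_share / earnings_per_share


def cagr(begin_value, end_value, years):
    """Compound annual growth rate over `years` periods."""
    if begin_value <= 0 or years <= 0:
        raise ValueError("begin_value and years must be positive")
    return (end_value / begin_value) ** (1 / years) - 1


# Invented figures, for illustration only.
print(ebitda(net_income=10, interest=2, taxes=3, depreciation=4, amortization=1))
print(pe_ratio(price_per_share=50.0, earnings_per_share=2.5))
print(f"{cagr(begin_value=100.0, end_value=121.0, years=2):.1%}")
```

A question in this category typically requires a model to locate the right inputs in a filing and then apply a formula of this kind correctly, so an error at either step produces a wrong answer.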
As part of this release, customers can now evaluate their LLM systems against FinanceBench on the Patronus AI platform. The platform can also detect hallucinations and other unexpected LLM behavior on financial questions at scale. Several financial services companies will pilot Patronus AI in the coming months. For more information about FinanceBench, reach out to Patronus AI at email@example.com.
Read the full research paper: https://patronus.ai/financebench.pdf
Download the FinanceBench data sample on HuggingFace: https://huggingface.co/datasets/PatronusAI/financebench
Download the FinanceBench data sample on Github: https://github.com/patronus-ai/financebench
About Patronus AI
Patronus AI is the first automated AI evaluation and security platform for enterprise. The platform enables enterprise development teams to score LLM performance, generate adversarial test cases, benchmark LLMs, and more. Customers use Patronus AI to detect LLM mistakes at scale and deploy AI products safely and confidently. For more information, visit https://www.patronus.ai/.