Why Manually Testing LLMs is Hard
The industry needs products like Patronus AI for model testing, evaluation data sources, and security.
- Suja Chandrasekaran, former CIO/CTO of Walmart, Kimberly Clark, Timberland, and Nestle
LLMs are great, but they are unpredictable
It’s easy to see headline statistics that suggest a glimmer of “General Intelligence” and think that LLMs are all-knowing oracles. After all, GPT-4 scores in the 90th percentile for the SAT in both Math and English. That’s enough to get you into a top-tier university. However, just because an LLM demonstrates great potential does not mean that it can be used for any task.
Unlike traditional software systems, LLMs are giant probabilistic machines. They do not follow a pre-programmed set of rules, nor do they just “remember” a correct response or index an internal database (unless explicitly told to). Instead, LLMs are truly coming up with their responses “in-context”. The same reason LLMs are powerful and creative is also why they are unpredictable, and produce failures that cannot be explained.
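To make the probabilistic point concrete, here is a minimal sketch of temperature-based token sampling, the mechanism behind most LLM decoding. The logits are made-up toy values, not from any real model; the point is that the exact same input produces different outputs across runs.

```python
import math
import random

def softmax(logits):
    # Convert raw scores into a probability distribution.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    # Higher temperature flattens the distribution, making
    # unlikely tokens more probable; lower temperature sharpens it.
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy "next token" scores for one fixed prompt.
logits = [2.0, 1.0, 0.5, 0.1]
samples = [sample_token(logits) for _ in range(1000)]
# `samples` contains several distinct token ids, even though the
# input never changed: the model is sampling, not looking up answers.
```

This is why two identical API calls can return different completions, and why a test suite that checks one output per prompt can miss failure modes entirely.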
So how can we prevent and catch these failures as they occur?
Current State: Testing By Inspection
After conversations with hundreds of engineers using LLMs, we learned that the main way most are testing LLMs today is by manual inspection - i.e., writing a couple of prompts and having a human inspect the outputs. The prompts are often written on the fly, different engineers apply different judgments, and people miss errors that look plausibly correct (good LLMs are always confident, even when wrong 😉).
Of course, “gut checking” model performance like this is fine for simple, low-risk scenarios like generating ideas for a birthday party or having a fun philosophical discussion about whether AI will take over humankind. In these scenarios, model failures might even be amusing.
But what about critical, high-stakes activities? As an enterprise, can you trust an LLM to…
- Answer financial questions on earnings documents, where an error in calculation can result in millions of dollars lost in a single transaction?
- Review a legal document, where an incorrect assessment can lead to high-profile litigation years down the road?
Testing a few examples is equivalent to testing in production. This is simply not good enough for critical or complex use cases. Enterprises seeking to use LLMs need better “test coverage” and performance guarantees for their LLM-powered applications.
The Brute Force Solution
The real challenge with evaluating LLMs is the size of the problem space. Your keyboard has about 100 distinct values. With a string of just 10 characters, you end up with 100^10 = 10^20 possible inputs, over a hundred quintillion! Checking every possible input to an LLM is like brute-forcing a 128-bit password: computationally infeasible.
Even if you manually checked 10,000 outputs, which is a huge amount of work to organize and review, that only covers 0.00000000000001% of all possible inputs. The problem gets even bigger (and worse) with multimodal models, which have an even larger surface to work across. It's simply not possible to check everything.
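The arithmetic above is easy to verify with a quick back-of-the-envelope calculation:

```python
# Size of the input space for strings of length 10 over a ~100-symbol keyboard.
alphabet = 100
length = 10
total_inputs = alphabet ** length          # 100**10 == 10**20

# Fraction of that space covered by a heroic 10,000-example manual review.
checked = 10_000
coverage_pct = checked / total_inputs * 100

print(f"{total_inputs:e} possible inputs")   # 1.000000e+20
print(f"{coverage_pct:.0e}% covered")        # 1e-14 %
```

Scaling the review effort 10x or even 100x barely moves the exponent, which is why sampling-based manual inspection can never provide real coverage guarantees.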
Now, human evaluation of LLM outputs definitely has its place — for instance, big labs like OpenAI and Meta have experts red team their LLMs to identify weaknesses and blind spots. This works if you’re building a foundation model with lots of money, have access to subject matter experts, and are running a whole bank of tests. But it’s costly, time-consuming, and simply not feasible for most companies.
Enter Patronus AI
What if you could leverage automated techniques to make evaluation more scalable and effective? With such techniques, you can generate challenging test cases that exploit a model’s weaknesses and push it to its limits. And, of course, you can use automated techniques to evaluate the results as well.
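As a sketch of what automated evaluation can look like in its simplest form, here is a hypothetical harness that runs a batch of prompts through a model and applies a programmatic pass/fail check to each output. Both `call_model` and the test cases are placeholders for illustration, not Patronus AI's actual API:

```python
def call_model(prompt: str) -> str:
    # Stub so the sketch runs end to end; swap in a real LLM client here.
    return "Net revenue was $1.2B, up 8% year over year."

# Each case pairs a prompt with a simple automated check on the output.
test_cases = [
    {"prompt": "Summarize Q3 net revenue.", "must_contain": "$1.2B"},
    {"prompt": "Summarize Q3 net revenue in one sentence.", "must_contain": "revenue"},
]

results = []
for case in test_cases:
    output = call_model(case["prompt"])
    passed = case["must_contain"].lower() in output.lower()
    results.append({"prompt": case["prompt"], "passed": passed})

failures = [r for r in results if not r["passed"]]
print(f"{len(results) - len(failures)}/{len(results)} checks passed")  # prints "2/2 checks passed"
```

Even a crude substring check like this scales to thousands of cases and runs on every model or prompt change, which manual review cannot; real evaluators are of course far more sophisticated than string matching.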
Right now, the process still involves a lot of careful human oversight and control to make sure that tests are valid, comprehensive, and difficult. At Patronus AI, we pioneer new ways of evaluating AI through our targeted evaluations, custom datasets, and explanations. We make the automated techniques described above incredibly easy to use. Our work is already enabling enterprises to scalably evaluate and test LLMs to get back results they can trust.