Nova AI: Using Patronus AI's Percival to Auto-Optimize AI Agents for Code Generation

Introduction 

Nova AI is building one of the most technically challenging and ambitious agent applications we have seen: an AI-powered platform for end-to-end SAP custom code modernization that enables businesses to transition their SAP technical stack at half the cost. Read on to see how Nova uses Patronus to automatically catch agent errors and fix prompts, reducing manual debugging time and increasing agent accuracy.

Hear it from Nova’s AI team:

“FIX MY PROMPTS! I’m tired of being a vague and unspecific prompt engineer” – Paul Modderman, founding engineer

“Automated prompt fixes are awesome—it plugs straight into our revision cycle. Eventually, I’ll probably just feed the suggested prompt edits into a ‘revise my prompt’ prompt. It’s like infinite prompt recursion, and I kind of love it.” – Paul Modderman, founding engineer

“The agent analysis summaries are critical to my development workflow. When I review the prompt output before it enters the agent cycle, I usually catch some issues—but the summaries consistently surface things I overlook. They’ve become an essential second layer of quality control.” – Mark Klein, lead AI engineer 

Key Results 

  • 60x productivity boost by reducing agent debugging time from 1 hour to 1 minute
  • Automated prompt suggestions fixed 3 agent failures in 1 week
  • Increased agent accuracy by 60% on SAP tool dataset through experimentation

Scaling Multi-Agent Evaluation Challenges

Nova’s mission is ambitious: to build the world’s first SAP migration agent for the world’s largest enterprises. In order to achieve this, Nova’s AI team has built a fleet of AI agents that learn to navigate the complex landscape of SAP APIs, which involve hundreds of custom API endpoints and versions.

Nova’s AI team approached Patronus in search of an automated solution for the evaluation of distributed, multi-agent workflows. Each Nova agent run is long and complex—taking 20–30 minutes and involving hundreds of LLM calls across multiple sub-agents and tools. To support production workloads at scale, Nova AI needs to systematically improve accuracy. But today, debugging and evaluation are bottlenecks for AI teams:

  • It takes ~1 hour to label a single agent trace, due to the need to manually inspect long logs and cross-reference SAP-specific documentation.
  • Agent behavior is hard to analyze, often requiring deep domain knowledge about SAP tools, documentation and codebase familiarity to assess why a generated specification is wrong.
  • There’s no scalable evaluation framework. Prompt changes are tested informally (“vibe checks”) without meaningful benchmarking or iteration velocity.

The lack of agent trace analysis tooling means Nova can't efficiently run experiments, measure improvements, or guide the agent to do better.

Deep Dive: SAP RAP Agent Evaluation Workflow

The ABAP RESTful Application Programming Model (RAP) is SAP’s framework for building cloud-ready, RESTful applications on the SAP S/4HANA platform using ABAP (Advanced Business Application Programming). Nova’s RAP agent builds on this interface to create enterprise-grade applications.

The Nova AI team used Patronus to set up an end-to-end RAP agent evaluation workflow that achieves the following:

  • Trace and observe all agent reasoning and tool usage steps, from start to finish.
  • Automatically score and analyze Nova agent outputs across 20+ error categories, including reasoning errors, system execution failures, planning and coordination issues, and domain-specific failures.
  • Categorize failures by error category in a leaderboard. To quote a Nova developer, “It’s valuable to be able to categorize tool use responses into success vs. error and then group the errors by type, so we can see the most frequent ones and fix those first.”
  • Provide actionable prompt fixes and recommendations for building more performant AI workflows. Percival’s prompt suggestions can be copied and inserted into the agent’s prompts.
  • Structure all traces to be searchable, making it easy to slice performance by prompt version, task type, or sub-agent tags (see the tracing sketch after this list).
  • Run repeated experiments for prompt and tool fixes. Nova AI’s team created a dataset to measure the accuracy of an internal tool call; over the course of a week, they increased that tool’s accuracy by 60% through rapid experimentation.
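
As a rough illustration of what this instrumentation can look like, here is a minimal Python sketch that wraps one agent step in a trace span, assuming an OpenTelemetry-compatible collector. The endpoint, attribute names, and placeholder tool call are our own illustrative assumptions, not Nova’s actual code.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans to an OTLP collector; this endpoint is a placeholder.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rap-agent")

def create_rap_object(name: str, prompt_version: str) -> bool:
    # One span per agent step; the attributes are what make traces
    # sliceable later by prompt version, task type, or sub-agent tag.
    with tracer.start_as_current_span("rap.create_object") as span:
        span.set_attribute("prompt.version", prompt_version)
        span.set_attribute("subagent.tag", "bdef-builder")
        activated = True  # placeholder for the real SAP tool call
        span.set_attribute("activation.successful", activated)
        return activated
```

With every LLM call and tool invocation wrapped this way, a week of experiments becomes a queryable dataset rather than a pile of logs.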

With Patronus, Nova AI turned evaluation from a bottleneck into a strategic advantage—enabling faster iterations, clearer insights, and higher-quality SAP agents. Here’s how they achieved it 👇

Automated Prompt Fixes with Percival

For Nova AI, Percival automatically caught and fixed multiple domain-specific issues that are difficult for a human to identify, given the long contexts and domain knowledge involved.

  1. Activation Status Validation: Percival correctly identified that there was no validation of successful object creation for the BDEF/BDOs components and classes, which are used to build transactional applications in SAP S/4HANA. In this case, the agent was originally instructed to run activation and checks for each object before proceeding to the next object, but did not execute these instructions. This is a type of domain-specific error that impacts the instruction adherence scores. 

Percival suggested a revised prompt that instructs the agent to explicitly validate activation status before proceeding to create the next object.

For each object you create, validate that its activation status is successful after it is created. Log the status and any errors, and only proceed to the next object if activation and checks are successful. After activating all objects, you MUST execute API tests using the rap_api_testing_tool. Report the results, and if tests fail, address the issues, reactivate, and retest UNTIL all tests pass successfully. DO NOT SKIP THIS STEP.

This updated prompt reduced the error rate in object creation and improved the reliability of the SAP RAP agent.
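
As an illustration of the control flow this revised prompt demands, here is a hedged Python sketch that enforces the same discipline deterministically. The tool wrappers passed in are hypothetical stand-ins for Nova’s SAP tooling; only rap_api_testing_tool is named in the prompt itself.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Status:
    ok: bool
    error: str = ""

def activate_and_test(
    objects: Iterable[str],
    activate: Callable[[str], Status],           # hypothetical tool wrapper
    run_checks: Callable[[str], Status],         # hypothetical tool wrapper
    rap_api_testing_tool: Callable[[], Status],  # named in the prompt above
    fix_issues: Callable[[str], None],           # hypothetical repair step
    max_retries: int = 3,
) -> None:
    # Validate activation status for each object before proceeding,
    # exactly as the revised prompt instructs.
    for obj in objects:
        status = activate(obj)
        print(f"{obj}: activation {'ok' if status.ok else 'FAILED: ' + status.error}")
        if not status.ok or not run_checks(obj).ok:
            raise RuntimeError(f"{obj} failed activation or checks; stopping")
    # Then run API tests, fixing and retesting until they pass
    # (bounded here so the loop cannot run forever).
    for _ in range(max_retries):
        result = rap_api_testing_tool()
        if result.ok:
            return
        fix_issues(result.error)
    raise RuntimeError("API tests still failing after retries")
```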

  2. Annotation value length: Nova’s sub-agents sometimes generated objects with warnings about annotation value length, which violated one of the SAP API specifications. Catching this error manually would’ve required an engineer to inspect tool and LLM output spans for error messages, and search for the relevant SAP policy in the documentation.

Percival automatically detected the error as well as the location (span) where the exception occurred. This is an example of an output generation error that impacts the agent’s reliability and instruction following scores.

In response, Percival suggested the following prompt fix:

Ensure that any annotation value that is created has a length that does not exceed the limit of the data type it has been assigned to. For example, ensure any String(40) value is at most 40 characters long; if it exceeds the limit, truncate it and output a warning message to the user.

This prompt fix ensured that future annotations generated by the RAP agent were within the acceptable length.
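
The same rule can also be checked deterministically outside the prompt. The sketch below, our own illustration rather than Nova’s code, parses a length-bounded type such as String(40) and truncates with a warning, mirroring what the prompt fix asks the agent to do.

```python
import re
import warnings

def enforce_annotation_length(value: str, declared_type: str) -> str:
    # Parse a length bound out of a declared type like "String(40)".
    match = re.fullmatch(r"\w+\((\d+)\)", declared_type)
    if match is None:
        return value  # no explicit length bound to enforce
    limit = int(match.group(1))
    if len(value) <= limit:
        return value
    warnings.warn(
        f"Annotation value exceeds {declared_type} limit "
        f"({len(value)} > {limit}); truncating."
    )
    return value[:limit]

# e.g. enforce_annotation_length("A very long end-user label for a field", "String(40)")
```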

Agent Errors Leaderboard

While debugging individual traces is critical to prompt engineering workflows, AI engineering teams want to understand different failure modes across their application. Hearing this feedback, we built an agent error leaderboard that categorizes different kinds of agent failures.

This allows engineers like Mark and Paul to group the agent errors they encounter and see agent performance in aggregate. For example, they can identify when a specific PR introduced a regression and caused new error types, or when a prompt fix eliminated an error category.
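
Conceptually, the leaderboard is a frequency ranking over labeled failures. Here is a minimal sketch, assuming each scored trace carries an error_category field (an illustrative name, not the actual schema):

```python
from collections import Counter

def error_leaderboard(traces: list[dict]) -> list[tuple[str, int]]:
    # Count failures per category (e.g. "reasoning", "system_execution",
    # "planning_and_coordination", "domain_specific") and rank them.
    counts = Counter(t["error_category"] for t in traces if t.get("error_category"))
    return counts.most_common()
```

Sorting by count front-loads the highest-impact fixes, which is exactly the “group the errors by type and fix the most frequent ones first” workflow quoted above.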

Conclusion

Through repeated experimentation, Nova AI has increased the accuracy of its RAP agent on an internal SAP tool-calling dataset by 60%. Nova AI demonstrates how an elite AI engineering team operates at a rapid pace to solve open technical challenges. We are excited about the future of AI engineering: automated error identification, analysis, and prompt optimization to supercharge AI teams.