Nova AI is building one of the most technically challenging and ambitious agent applications we have seen: an AI-powered platform for end-to-end SAP custom code modernization that enables businesses to transition their SAP technical stack at half the cost. Read on to see how Nova uses Patronus to automatically catch agent errors and fix prompts, reducing manual debugging time and increasing agent accuracy.
Hear it from Nova’s AI team:
“FIX MY PROMPTS! I’m tired of being a vague and unspecific prompt engineer” – Paul Modderman, founding engineer
“Automated prompt fixes are awesome—it plugs straight into our revision cycle. Eventually, I’ll probably just feed the suggested prompt edits into a ‘revise my prompt’ prompt. It’s like infinite prompt recursion, and I kind of love it.” – Paul Modderman, founding engineer
“The agent analysis summaries are critical to my development workflow. When I review the prompt output before it enters the agent cycle, I usually catch some issues—but the summaries consistently surface things I overlook. They’ve become an essential second layer of quality control.” – Mark Klein, lead AI engineer
Key Results
Nova’s mission is ambitious: to build the world’s first SAP migration agent for the world’s largest enterprises. To achieve this, Nova’s AI team has built a fleet of AI agents that learn to navigate the complex landscape of SAP APIs, which spans hundreds of custom API endpoints and versions.
Nova’s AI team approached Patronus in search of an automated solution for evaluating distributed, multi-agent workflows. Each Nova agent run is long and complex, taking 20–30 minutes and involving hundreds of LLM calls across multiple sub-agents and tools. To support production workloads at scale, Nova AI needs to systematically improve accuracy. But today, debugging and evaluation are bottlenecks for AI teams: without agent trace analysis tooling, Nova can’t efficiently run experiments, measure improvements, or guide the agent toward better behavior.
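Making runs like these analyzable starts with recording every LLM call and tool invocation as a span in a trace. Below is a minimal sketch of that kind of instrumentation using OpenTelemetry; the span names, attributes, and the call_llm helper are illustrative assumptions, not Nova’s actual pipeline.

```python
# Minimal sketch: instrumenting a multi-agent run with OpenTelemetry spans
# so every LLM call and tool invocation is captured for later trace analysis.
# Span names, attributes, and call_llm are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())  # swap in an OTLP exporter in practice
)
tracer = trace.get_tracer("sap-rap-agent")

def call_llm(prompt: str) -> str:
    """Placeholder for the real model call."""
    return "generated ABAP artifact"

def run_agent(task: str) -> str:
    # One root span per agent run; child spans per LLM call and tool use.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.task", task)
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.prompt_chars", len(task))
            output = call_llm(task)
        with tracer.start_as_current_span("tool.activate_object") as tool_span:
            tool_span.set_attribute("tool.name", "rap_activation")
        return output

run_agent("Create a RAP business object for sales orders")
```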
Deep Dive: SAP RAP Agent Evaluation Workflow
The ABAP RESTful Application Programming Model (RAP) is SAP’s framework for building cloud-ready, RESTful applications on the SAP S/4HANA platform using ABAP (Advanced Business Application Programming). Nova’s RAP agent builds on this interface to create enterprise-grade applications.
The Nova AI team used Patronus to set up an end-to-end RAP agent evaluation workflow. With Patronus, Nova AI turned evaluation from a bottleneck into a strategic advantage, enabling faster iterations, clearer insights, and higher-quality SAP agents. Here’s how they achieved it 👇
For Nova AI, Percival automatically caught and fixed multiple domain-specific issues that are difficult for a human to identify due to long contexts and the SAP domain knowledge required.
Percival suggested a revised prompt that instructs the agent to explicitly validate activation status before proceeding to create the next object.
For each object you create, validate that its activation status is successful after it is created. Log the status and any errors, and only proceed to the next object if activation and checks are successful. After activating all objects, you MUST execute API tests using the rap_api_testing_tool. Report the results, and if tests fail, address the issues, reactivate, and retest UNTIL all tests pass successfully. DO NOT SKIP THIS STEP.
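In agent code, that instruction corresponds to a validate-then-proceed loop roughly like the sketch below; all of the helpers here are hypothetical stand-ins for the agent’s actual SAP tools, not Nova’s real implementation.

```python
# Sketch of the validate-then-proceed loop the revised prompt enforces.
# All helpers are hypothetical stand-ins for the agent's actual SAP tools.
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rap-agent")

@dataclass
class Status:
    ok: bool
    errors: list = field(default_factory=list)

def create_object(spec: str) -> str:            # stand-in for the creation tool
    return f"obj:{spec}"

def get_activation_status(obj: str) -> Status:  # stand-in for the activation check
    return Status(ok=True)

def run_rap_api_tests() -> Status:              # stand-in for rap_api_testing_tool
    return Status(ok=True)

def create_and_activate_all(object_specs, max_retries=3):
    for spec in object_specs:
        obj = create_object(spec)
        status = get_activation_status(obj)
        log.info("object=%s activated=%s errors=%s", obj, status.ok, status.errors)
        if not status.ok:
            # Only proceed to the next object once this one activates cleanly.
            raise RuntimeError(f"Activation failed for {obj}: {status.errors}")

    # After all objects activate, the API tests must pass before the run completes.
    for attempt in range(1, max_retries + 1):
        result = run_rap_api_tests()
        if result.ok:
            return result
        log.warning("API tests failed (attempt %d): %s", attempt, result.errors)
    raise RuntimeError("API tests still failing after retries")

create_and_activate_all(["SalesOrder", "SalesOrderItem"])
```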
This updated prompt reduced the error rate in object creation and improved the reliability of the SAP RAP agent.
Percival automatically detected a second error, as well as the location (span) where the exception occurred: the agent had generated an annotation value that exceeded the length limit of its assigned data type. This is an example of an output generation error that impacts the agent’s reliability and instruction following scores.
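Conceptually, span-level localization means scanning a trace’s spans for the one that recorded the exception. Here is a toy illustration over a JSON-like trace; the structure is an assumption for illustration, not Percival’s internal format.

```python
# Toy illustration of span-level error localization: find the span in a trace
# that recorded an exception. The trace structure is assumed for illustration,
# not Percival's internal format.
trace = {
    "spans": [
        {"name": "llm.call", "status": "OK"},
        {"name": "tool.create_annotation", "status": "ERROR",
         "exception": "value exceeds String(40) length limit"},
        {"name": "tool.activate_object", "status": "OK"},
    ]
}

def find_failing_spans(trace):
    return [s for s in trace["spans"] if s["status"] == "ERROR"]

for span in find_failing_spans(trace):
    print(f"error in span '{span['name']}': {span['exception']}")
```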
In response, Percival suggested the following prompt fix:
Ensure that any annotation value that is created has a length that does not exceed the limit of the data type it has been assigned to. For example, ensure any String(40) value is less than or equal to 40 characters; if it exceeds the limit, truncate it and output a warning message to the user.
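As a concrete reading of what the fix enforces, here is a minimal Python sketch of the length guard, assuming annotation types use a simple String(N) notation; parse_length and clamp_annotation are hypothetical names, not part of Nova’s codebase.

```python
# Sketch of the length guard the prompt fix describes, assuming annotation
# types use a simple "String(N)" notation. Function names are hypothetical.
import re
import warnings

def parse_length(type_name: str) -> int | None:
    """Extract N from a type like 'String(40)'; None if no length limit."""
    match = re.fullmatch(r"String\((\d+)\)", type_name)
    return int(match.group(1)) if match else None

def clamp_annotation(value: str, type_name: str) -> str:
    """Truncate the annotation value to its type's limit, warning if needed."""
    limit = parse_length(type_name)
    if limit is not None and len(value) > limit:
        warnings.warn(
            f"Annotation value exceeds {type_name} limit "
            f"({len(value)} > {limit}); truncating."
        )
        return value[:limit]
    return value

print(clamp_annotation("A very long generated annotation label " * 3, "String(40)"))
```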
This prompt fix ensured that future annotations generated by the RAP agent were within the acceptable length.
While debugging individual traces is critical to prompt engineering workflows, AI engineering teams want to understand different failure modes across their application. Hearing this feedback, we built an agent error leaderboard that categorizes different kinds of agent failures.
This allows engineers like Mark and Paul to group the agent errors they encounter and see agent performance in aggregate. For example, they can identify when a specific PR introduced a regression and caused new error types, or when a prompt fix eliminated an error category.
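Conceptually, the leaderboard boils down to counting categorized errors per run and diffing those counts across runs. A toy sketch of that aggregation is below; the error categories and run data are made up for illustration.

```python
# Toy sketch of error-category aggregation and run-over-run diffing, the kind
# of view an error leaderboard provides. Categories and data are illustrative.
from collections import Counter

def leaderboard(traces):
    """Count error categories across all traces in a run."""
    return Counter(err["category"] for trace in traces for err in trace["errors"])

def diff(before: Counter, after: Counter):
    """Per-category change between two runs; positive means a regression."""
    return {cat: after[cat] - before[cat] for cat in before.keys() | after.keys()}

run_a = [{"errors": [{"category": "output_generation"}, {"category": "tool_misuse"}]}]
run_b = [{"errors": [{"category": "tool_misuse"}, {"category": "tool_misuse"}]}]

before, after = leaderboard(run_a), leaderboard(run_b)
print(before)               # e.g. Counter({'output_generation': 1, 'tool_misuse': 1})
print(diff(before, after))  # e.g. {'tool_misuse': 1, 'output_generation': -1}
```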
Through repeated experimentation, Nova AI has increased the accuracy of the RAP agent on an internal SAP tool-calling dataset by 60%. Nova AI demonstrates how an elite AI engineering team operates at a rapid pace to solve open technical challenges. We are excited about the future of AI engineering: automated error identification, analysis, and prompt optimization to supercharge AI teams.