Between Gut Feeling and Evaluation: When Building AI Becomes Easier Than Measuring
It is easy to test whether a car drives. It is also possible to verify whether a software program works correctly. In computer science and engineering, we often deal with systems designed by humans themselves. Systems with clearly defined functions, interfaces, and relatively unambiguous criteria for success or failure. A clear signal.
Highly complex open systems are fundamentally different. In fields such as biology, psychology, medicine, or nutrition science, verification is far more difficult, indirect, and usually only possible statistically.
Society handles this complexity in very different ways.
Medicine, for example, is heavily regulated. A drug is approved only after years of studies confirming that it truly works. Nutrition science is regulated as well, but it is much harder to define clear boundaries there. Scientific findings, lifestyle trends, personal experiences, and commercial interests are deeply intertwined. As a result, half-truths, exaggerated health claims, and misunderstood studies can spread especially easily.
In many of these complex systems, there is also a strong asymmetry between creation, assertion, and verification. Placing a glass of water on the table and claiming it cures hair loss is easy. Properly proving whether that statement is true or false is extremely difficult.
We have now spent around three and a half years in the era of AI language models. These systems were complex from the very beginning, but during the first years they were still relatively easy to evaluate. A few carefully selected questions were often enough to get a good understanding of their capabilities. Even during the first half-year of coding agents in 2025, the first few lines of generated code often revealed obvious failure modes. Developers had to explicitly instruct agents about things that seemed completely obvious to humans.
Those times now appear to be over. We are entering the territory of highly complex open systems here as well. And the consequences are impossible to ignore.

The longer the time horizon in which agents operate, the more difficult their evaluation becomes. The error bars of new models illustrate how challenging reliable measurement has become.
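To make the error-bar problem concrete, here is a minimal sketch of how wide the uncertainty around a benchmark pass rate can be. It uses the Wilson score interval, a standard way to put a confidence interval around a proportion. The task count and pass numbers are hypothetical and chosen only for illustration; they are not results from any real model.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical numbers: two agent versions evaluated on 500 tasks each.
low_a, high_a = wilson_interval(350, 500)  # version A: 70% of tasks solved
low_b, high_b = wilson_interval(365, 500)  # version B: 73% of tasks solved

# Overlapping intervals are only a rough indicator, not a formal test,
# but they show that a three-point gap can sit well inside the noise.
intervals_overlap = low_b <= high_a
print(f"A: {low_a:.3f}-{high_a:.3f}  B: {low_b:.3f}-{high_b:.3f}  overlap: {intervals_overlap}")
```

With a few hundred tasks, each interval spans several percentage points, so a small improvement between two versions is hard to tell apart from chance without more tasks or repeated runs.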
The Rise and Fall of the Coding Benchmark SWE-Bench
The best example of this is SWE-Bench, the defining benchmark for coding agents in 2025: an automated benchmark in which models must solve real GitHub issues within real software repositories. An evaluation designed to be as realistic as possible.
But the original dataset itself was already highly error-prone, underspecified, and technically fragile. It required extensive manual cleanup through the SWE-Bench Verified initiative. Yet even after that effort, the real work only began. Further research showed that even the cleaned-up version still contained problematic tasks, insufficient tests, and questionable success criteria. At the same time, it became increasingly difficult to determine whether high benchmark scores reflected genuine generalization or whether models had simply memorized parts of this widely discussed benchmark. The fact that even OpenAI and Anthropic no longer consider the benchmark fully trustworthy is therefore far more than a side note.
Community-Driven Evaluation of Coding Agents
Even evaluating coding agents as an individual developer is becoming increasingly difficult.
Last November, OpenAI and Anthropic both released updates to their coding agents. As usual, benchmark scores improved slightly, and at first nothing seemed particularly remarkable. But it took several weeks before the community noticed that something had fundamentally changed in terms of quality. It all culminated in this statement by Andrej Karpathy:
never felt this much behind as a programmer
Coding agents had suddenly started to work. No benchmark had captured that shift. And to this day, there is still no objective metric explaining exactly what changed.
A few weeks ago, the exact opposite happened. Negative feedback around Anthropic’s Claude Code began spreading through the community. Users claimed the agent had become worse. Initially, these were only subjective complaints. No proof existed. Eventually, however, measurable evidence emerged, and Anthropic admitted there had indeed been issues. One cause was surprisingly small: Anthropic had adjusted Claude Code’s system prompt to reduce verbosity. Opus tended to generate longer responses, which could actually help with difficult tasks but also consumed more output tokens. So Anthropic added an instruction roughly equivalent to: "Keep text between tool calls to 25 words or fewer, and keep final responses under 100 words unless more detail is required."
At first glance, this sounded entirely reasonable. After all, most users of coding agents do not want to read novels. But this single additional sentence in the system prompt produced measurable side effects. Internal testing had shown no obvious regressions, so the change was rolled out into production. Only later, through targeted ablation testing where individual lines of the system prompt were selectively removed, did researchers discover that this length restriction had reduced coding quality by around three percent in broader evaluations.
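The idea behind such a line-by-line ablation can be sketched in a few lines. This is a minimal illustration of the general technique, not Anthropic's actual tooling: the `evaluate` callback, the toy scoring function, and the prompt lines are all hypothetical stand-ins. In a real setup, `evaluate` would run a full evaluation suite, which is exactly the expensive part.

```python
from typing import Callable


def ablate_prompt_lines(
    prompt_lines: list[str],
    evaluate: Callable[[str], float],
) -> list[tuple[str, float]]:
    """Score the prompt with each line removed in turn.

    Returns (removed_line, score_delta) pairs, sorted so that the lines whose
    removal helps most come first. A positive delta means the score went up
    without that line, i.e. the line was hurting.
    """
    baseline = evaluate("\n".join(prompt_lines))
    deltas = []
    for i, line in enumerate(prompt_lines):
        reduced = prompt_lines[:i] + prompt_lines[i + 1:]
        deltas.append((line, evaluate("\n".join(reduced)) - baseline))
    return sorted(deltas, key=lambda item: item[1], reverse=True)


def toy_evaluate(prompt: str) -> float:
    # Hypothetical stand-in for a real eval suite: it simply pretends that
    # the length limit costs three points of task success.
    return 0.70 if "under 100 words" in prompt else 0.73


# Invented prompt lines, for illustration only.
lines = [
    "You are a coding agent.",
    "Keep final responses under 100 words.",
    "Run the tests before finishing.",
]
ranking = ablate_prompt_lines(lines, toy_evaluate)
worst_line, delta = ranking[0]
print(f"Most harmful line: {worst_line!r} (removing it changes the score by {delta:+.2f})")
```

Note that the cost grows with the number of prompt lines: each line requires one more full evaluation run, and each of those runs carries its own measurement noise.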
And that is exactly the point. Not a major architectural redesign, not a new model, not an obvious inference bug. A single sentence in a system prompt was enough to noticeably shift the quality of a coding agent. Even more remarkable: the change was not absurd. It was plausible. It had a clear objective. It had been tested. And still, the impact only became visible later.
The list could easily continue from here. OpenAI, for example, struggled with Gremlins in their agents, another issue that first surfaced in the community and required weeks of analysis to understand.
Building Has Become Cheap, Evaluation Has Not
We are dealing with highly complex, nonlinear systems whose behavior is often difficult to predict. Every modification, even a single sentence, can introduce unintended side effects. Evaluation can easily miss these effects because sufficiently detailed testing is often impossible. Automated evaluation only works when “good” and “bad” can actually be defined. And even then, the evaluation process itself may become so demanding that dedicated teams are needed just to build and maintain it.
This is a relatively new symptom in agent development. New versions are becoming increasingly difficult for humans to measure because the “signal” required for straightforward evaluation is no longer clearly visible. Perhaps it only emerges after weeks. Perhaps not through individual users, but only through collective community experience. New model versions or updated agent systems may soon become almost indistinguishable to humans in this regard.
At first glance, this may look like a crisis in computer science. But it is not. It is simply a fact about this new world, and we must learn how to deal with it. Once you begin building agent systems yourself, or creating tooling around agents to improve workflows, you are no longer working primarily with quickly measurable outcomes. Instead, you operate based on plausible assumptions, indicators, and hypotheses.
Developers of non-trivial agents, currently mostly software engineers, will often no longer be able to perform fully automated evaluations. At least not with realistic levels of effort. Comprehensive and reliable evaluations increasingly require enormous resources, conditions that only frontier labs can realistically afford.
An older survey about production agents outside of coding already revealed this trend: 75% of agents enter production without automated benchmarks. Instead, the real evaluation happens continuously through user interaction during operation. That is still acceptable because, according to the same survey, there is currently no production system where the user does not retain the final decision.
One approach involves A/B testing: presenting users with two possible solutions and asking them to choose the better one. Or simply using thumbs up/down feedback systems together with feedback forms.
These approaches all follow the same principle:
I know it when I see it.
You may not be able to formally define what is good or bad, but you recognize poor outcomes once you encounter them or when you can compare alternatives.
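Even this kind of preference feedback needs some care before it counts as evidence. As a minimal sketch, assuming users were shown two variants side by side and each picked one, an exact sign test can indicate whether the observed preference is distinguishable from a coin flip. The vote counts below are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test against a 50/50 split (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction,
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical A/B votes: the same 60/40 ratio at two sample sizes.
p_small = sign_test_p_value(12, 8)
p_large = sign_test_p_value(60, 40)
print(f"12 vs 8: p = {p_small:.3f}   60 vs 40: p = {p_large:.3f}")
```

The same 60/40 preference is far from conclusive with 20 votes and still borderline with 100, which is one reason this kind of evaluation only pays off over longer periods of real usage.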
This information must then flow back into the system to improve the agent. That such feedback loops can even work automatically is already demonstrated by prompt optimizers with feedback capabilities, such as those found in DSPy.
And entirely new approaches are emerging. The transcript of the chat history is always available. Every human correction is logged. Using this information over time is precisely the purpose of the “memory” features implemented by some agents. In the simplest case, they take notes about the user, but also about recurring corrections. And it is precisely this information that could later be collected and analyzed.
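As a very simple illustration of what analyzing logged corrections could look like, the sketch below counts corrections that users repeat. The log format (one free-text correction per entry) and the example entries are assumptions made for this example; real agents store such information in their own ways.

```python
from collections import Counter


def recurring_corrections(corrections: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Return corrections that occur at least min_count times, most frequent first."""
    # Normalize case and whitespace so trivially different phrasings match.
    normalized = (" ".join(c.lower().split()) for c in corrections)
    counts = Counter(c for c in normalized if c)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# Hypothetical correction log collected over several sessions.
log = [
    "Use pytest, not unittest",
    "use pytest,  not unittest",
    "Add type hints",
    "Use pytest, not unittest",
]
repeated = recurring_corrections(log)
print(repeated)
```

Exact string matching is of course crude. In practice, similar corrections would need to be grouped by meaning, but even a rough count can point to instructions the agent keeps missing.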
Agents Mature Through Usage, Not Inside the Lab
The solution for developers of agent systems therefore looks highly iterative and unfolds over long periods of time. And it will only work for those willing to persist. Users must participate as well. They need to provide feedback and avoid giving up simply because systems perform poorly during the first few months.
For developers, this requires a major shift in mindset. The barriers are far higher than in traditional software development. Timelines become difficult to estimate. But this may simply be a necessary consequence of operating in a world where nobody fully knows what good or bad actually means.
The good news is this: iterative approaches almost always lead to progress. Time works in your favor. The models improve. Problems get identified step by step. Users adapt. And perhaps even the process itself evolves once the benefits become clear. That is why giving up on a poorly performing proof of concept too early is the wrong approach. Especially considering the potential that agent systems can unlock.
For the broader community currently working with agent systems, I would like to see more transparency and honesty. Every claim that promises a “best way” for building agents, and every new AI tool, should really come with a disclaimer:
“I have not evaluated this consistently. My arguments are based entirely on indicators. In my case, within my project, with my workflow, and using this particular model, the tool subjectively led to better results. A reliable evaluation for your own project would probably take several weeks.”
That would not be a weakness, but a step forward. It would highlight the tools and projects that have actually gone through the long and difficult process and can offer more than just a good gut feeling. And it would make visible where an entirely new market may emerge in the future: not only for agents, frameworks, and tools, but also for their evaluation.
Because this is where the real asymmetry lies: building something for the AI world has become dramatically easier. Knowing whether it is actually better has not.
Written by
Sebastian Macke
is Lead Software Engineer at QAware, where he supports and advises projects with a particular focus on the design of AI-driven applications, sustainable software architectures, and strategic [...]