
23 June 2026
The Eval Paradox: what makes language models useful is what makes them hard to test
by Nicholas Holden
The better part of two decades ago, as a PhD student working on swarm intelligence and machine learning, I spoke at a conference to an industry affiliate who was preparing a particle swarm optimiser (a classically probabilistic algorithm) for production. His entire strategy for making the thing reliable was to run it fifty times and keep the best result. I found this faintly scandalous. Particle swarm optimisation is an elegant, efficient way to search a difficult space; running it fifty times and taking the winner collapses that elegance into something close to brute-force Monte Carlo, throwing away much of the efficiency that justified the method in the first place. And yet it shipped, and it worked. The lesson has stayed with me ever since: getting a probabilistic algorithm into production sometimes demands drastic, inelegant measures, and that is an acceptable price.
I have been thinking about that fifty-times trick a great deal lately, because the technology industry keeps relearning the lesson behind it. A large language model is the same kind of object: an elegant, powerful probabilistic engine that demos beautifully and then resists every effort to make it behave dependably. The gap between the impressive best case and something you can actually rely on is most of the work, and almost none of it is elegant: it is brute force, statistical scaffolding and defensive engineering wrapped around a tool that was never meant to need any of it. That is simply what putting a probabilistic system into production costs. We have spent a decade and a vast sum of money to arrive back where that engineer stood: a powerful, probabilistic tool, and a crude wrapper to make it behave. The interesting question is why, and whether there is anything better.
The paradox at the centre
The defining feature of a large language model is that it handles inputs too varied to enumerate. That open-endedness is the entire value proposition: you reach for a model precisely when you cannot specify in advance every input the system will meet and every output it should give. But that same property is what defeats conventional software testing. You cannot write all the test cases up front, because if you could fully specify every input and its correct output, you would simply write the deterministic function instead and dispense with the model altogether.
This is the paradox that ought to frame any serious discussion of AI evaluation. The usefulness of a language model and the ease of testing it are inversely related. The places where the technology earns its keep (summarising messy documents, fielding open-ended questions, navigating ambiguous instructions) are exactly the places where no finite suite of tests can certify that it is right. Practitioner surveys now put evaluation at well over half of the development effort on serious AI products, most of it spent not writing checks but trying to understand failures. The bottleneck has moved from capability to assurance.
You might conclude from this that such systems simply cannot be trusted in production. That conclusion is too quick, and the financial industry is the standing refutation. Having spent most of my career building software inside investment banks and asset managers, I can attest that probabilistic systems have run in production there for decades: Monte Carlo risk engines, statistical fraud scoring, machine-learning credit and pricing models. They are validated not by exhaustive test cases but by confidence intervals, monitoring, tolerances and governance. Probabilistic behaviour, in other words, has never been the obstacle. The obstacle is measurement you can trust.
One caveat is due before the comparison is pushed too far. The financial precedent is flattering in a way, because those systems are largely self-contained, each model bounded and monitored on its own terms. The newer breed of multi-agent AI, in which one model's output becomes another's input, multiplies the points of interaction and compounds the danger. Errors do not merely accumulate, they amplify, as each agent treats the noise it inherits as signal and launders an upstream misjudgement into apparent fact, and intent drifts a little further from the original task at every hop. Whatever measurement regime works for a single model must here contend with a system feeding on its own output.
Let me be blunt about something the model vendors would rather you did not dwell on: these are, at bottom, probabilistic systems. A single token landing differently, a quirk of sampling or a rounding in the probabilities, can swing an output from sensible to absurd. The marketing favours the language of reasoning and reliability; the underlying machinery remains a probability distribution over the next word. Whether that is a problem, however, is a more interesting question than it first appears. Our expectations of risk have always been calibrated to fallible agents. A human analyst can have an off day, misread a figure or draw the wrong conclusion. Such mistakes are hardly excused (careers have ended over less), but no bank has ever relied on its people being infallible. It relies instead on controls: limits, sign-offs, reconciliation, the many-eyes check, all designed to catch and contain the bad day before it becomes a bad quarter. The pertinent question for a language model is not whether it is probabilistic — it is, unavoidably — nor even whether it can be made to reproduce its own outputs, but whether it is so sensitive to small changes in input that a trivial rephrasing flips a sensible answer into a wrong one, and whether we can surround it with controls that bound its worst day as well as we already bound a person's.
What the car industry already learned
There is a sharper precedent still, and it comes from an unlikely quarter. For decades, safety-critical software in cars ran on single-core processors, and even where more capable hardware was available the most critical tasks were deliberately confined to a single core. The reason was determinism. Running tasks in parallel introduces timing interference between them, which makes it extraordinarily hard to analyse the worst-case execution time and, therefore, to prove that you have covered every possible ordering of events. Because the industry could not test all the possibilities, it largely refused the more powerful approach. Multi-core processors arrived in vehicles only gradually, forced by the end of chip-frequency gains rather than by anyone solving the verification problem. Aviation followed the same arc more formally. Fly-by-wire flight computers long sidestepped the issue with replicated single cores and dissimilar redundancy, and multi-core processors became certifiable only once dedicated airworthiness guidance laid out how to do it. That guidance, known as CAST-32A, has now been absorbed into the US Federal Aviation Administration's AC 20-193; the first certification under CAST-32A itself arrived around 2021.
This is the part the AI debate tends to miss. The problem was never solved. It was managed. Neither industry learned to test every possibility. They engineered the non-determinism back out of the safety-critical path, pinning tasks to cores, partitioning memory and time, identifying each channel of interference and proving a bound on it, until behaviour was predictable enough to certify. The intellectual move was to stop trying to prove the system correct in all cases and instead prove a tolerable worst-case bound: whatever happens, the system stays inside an envelope you can live with.
That distinction, bounding the worst case rather than enumerating every case, is the most useful idea available to anyone evaluating AI, provided one is candid about where the analogy breaks. Cars and aircraft regained testability by shrinking their systems back towards determinism. That escape hatch is closed to the language model. Constrain a model until its inputs are enumerable and you have destroyed the open-endedness that was the whole point. You cannot retreat to a fully specified space without throwing away the technology.
What does transfer is the worst-case instinct, redirected. If you cannot bound the behaviour, bound the failure. Engineer the system so that when it gets something wrong, and sooner or later it will, it fails gracefully and within a tolerable envelope: answers that hedge or abstain when confidence is low, thresholds that route hard cases to a human, guardrails that cap the blast radius, deterministic checks on the parts that can be checked so that a bad generation cannot pass silently. The goal shifts from "is every output correct", which is unanswerable, to "is every failure survivable", which is both achievable and the right question to ask.
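To make the fail-gracefully envelope concrete, here is a minimal Python sketch of the routing idea. Everything in it is an assumption for illustration: the `Generation` and `Decision` types, the `route` function, the notion of a single confidence score between 0 and 1, and the two thresholds are hypothetical, not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Generation:
    text: str
    confidence: float  # hypothetical score in [0, 1], however the system derives it


@dataclass
class Decision:
    action: str            # "answer", "escalate" or "abstain"
    text: Optional[str]    # only populated when the answer is released


def route(gen: Generation, passes_checks: bool,
          answer_at: float = 0.8, escalate_at: float = 0.5) -> Decision:
    """Decide what happens to one generation; thresholds are illustrative."""
    # A failed deterministic check blocks the answer whatever the confidence,
    # so a bad generation cannot pass silently.
    if not passes_checks:
        return Decision("escalate", None)
    if gen.confidence >= answer_at:
        return Decision("answer", gen.text)
    # Middling confidence: a hard case, hand it to a human.
    if gen.confidence >= escalate_at:
        return Decision("escalate", None)
    # Low confidence: decline rather than guess.
    return Decision("abstain", None)
```

The point of the shape is that no branch releases text the system has reason to doubt: the worst outcome is an unanswered question or a queue for a person, not a confident wrong answer.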
Enforce what you can; judge only what you must
This reframing has immediate, practical consequences, and the first is to stop testing things that should never have been tests. A great deal of what passes for "evaluating the model" is misplaced effort. You do not need a test suite to confirm that a model can produce valid JSON; you make invalid JSON impossible, or recoverable, by construction. Mature frameworks now enforce structure through constrained decoding, schema validation and typed function-calling, with automatic repair or rejection when output strays from the contract. This is resilience by design: the property is made structurally true rather than empirically hoped for. It is the same fail-gracefully envelope the car engineers built, applied at the input-output boundary, the deterministic scaffolding that lets you tolerate a probabilistic core.
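A small sketch of the "repair or reject" half of that idea, using only Python's standard library. It is validation after generation, not constrained decoding, and the contract, field names and helper functions are hypothetical examples rather than any framework's API.

```python
import json
from typing import Optional

# Hypothetical contract: field name -> required Python type
CONTRACT = {"ticker": str, "quantity": int}

FENCE = "`" * 3  # Markdown code fence that models often wrap around JSON


def strip_fences(raw: str) -> str:
    """Repair step: drop a surrounding Markdown code fence if present."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)
    return text


def parse_or_reject(raw: str) -> Optional[dict]:
    """Return the validated object, or None if it breaks the contract."""
    try:
        data = json.loads(strip_fences(raw))
    except json.JSONDecodeError:
        return None
    # Exactly the contracted fields, no more and no fewer
    if not isinstance(data, dict) or set(data) != set(CONTRACT):
        return None
    for key, expected in CONTRACT.items():
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return data
```

A caller would treat `None` as a signal to retry, fall back or escalate; nothing downstream ever sees output that has not met the contract.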
Strip away everything that can be enforced or deterministically checked (did the agent call the right tool, in the right order, with valid arguments; did the citation resolve to a real document; did the output parse) and you are left with a much smaller, genuinely subjective residue. Is the summary good? Is the tone right? Is the answer complete? Only here is human-style judgement unavoidable, and only here does the industry's most fashionable technique earn its place.
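The parenthetical checks above need no judge at all. As a rough sketch, assuming a trace is simply an ordered list of tool calls (the `ToolCall` type, the `check_trace` function and the tool names in the usage note are invented for illustration):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def check_trace(trace: List[ToolCall],
                expected_order: List[str],
                required_args: Dict[str, Set[str]]) -> List[str]:
    """Return a list of violations; an empty list means the trace passes."""
    problems = []
    # Right tools, in the right order
    names = [call.name for call in trace]
    if names != expected_order:
        problems.append(f"tool order {names} != expected {expected_order}")
    # Every call carries the arguments its tool requires
    for call in trace:
        missing = required_args.get(call.name, set()) - set(call.args)
        if missing:
            problems.append(f"{call.name} missing args {sorted(missing)}")
    return problems
```

For a hypothetical agent expected to call `search` and then `fetch`, `check_trace(trace, ["search", "fetch"], {"search": {"query"}, "fetch": {"doc_id"}})` either returns an empty list or names exactly what went wrong, with none of the ambiguity of a score.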
The trouble with asking a model to mark its own homework
That technique is "LLM-as-judge": using one language model to score the output of another. Its appeal is obvious. It is cheap, fast and scales to volumes no human review could touch, and proponents cite agreement with human raters of 80 to 90 per cent. Its danger is equally obvious on reflection: you are stacking a second probabilistic system on top of the first, and inheriting all the uncertainty of both.
The evidence that these scores are shakier than they look is mounting. Composo, a London evaluation start-up, has shown that the same hallucinated passage can score 60 per cent on a one-to-five scale, 30 per cent on a one-to-ten scale and 85 per cent on a one-to-hundred scale, with the scale itself, rather than the quality of the text, driving the result. Even with the model's randomness turned to its minimum, repeated runs on identical input scatter; a single score tells you little about the true mean. The firm's broadside is titled, memorably, "LLMs: Great Witnesses, Terrible Judges", the argument being that models are good at observing what happened but bad at putting a calibrated number on it.
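The point about a single score and the true mean can be shown with a few lines of arithmetic. This sketch assumes you have re-run the same judge on the same input several times and kept the scores; the function name is illustrative, and the interval uses a plain normal approximation, which is somewhat too narrow for very small numbers of runs.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_interval(scores: Sequence[float],
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Mean of repeated judge scores with an approximate 95% interval.

    Returns (mean, lower, upper) using mean +/- z * standard error.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.fmean(scores)
    # Standard error of the mean: sample standard deviation over sqrt(n)
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem
```

With eight made-up scores on a one-to-five scale, `[3, 4, 2, 4, 3, 5, 3, 4]`, the mean is 3.5 and the interval runs from roughly 2.9 to 4.1. Any one of those runs, reported alone, would have looked like a firm number.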
Some of this is self-inflicted, and the better practitioners have learned to design it out. Much of the instability comes from asking for an absolute number at all, so the fashionable techniques now avoid the open-ended scale: pairwise comparison, where the judge picks the better of two responses rather than scoring one in a vacuum; binary pass or fail against an explicit, written criterion; and reference-based grading, where the output is checked against a known-good answer. Each narrows the room the judge has to wander. None of them closes the gap with an ordinary software test, though, and it is worth being clear about why. A conventional assertion either passes or it does not; a model judge can hallucinate its verdict, confidently failing a correct answer or passing a wrong one, in a way no deterministic check ever will. You can reduce the noise. You cannot make the judge incapable of being wrong.
Nor is averaging a stack of noisy runs the only answer, and here the financial precedent returns with force. The discipline that tamed Monte Carlo risk engines and statistical credit models, not exhaustive testing but confidence intervals, error bounds and governance around an estimate known to be imprecise, has now arrived for language-model judges. Recent work treats the judge as exactly that, a noisy instrument, and carries its error rates through to the conclusion, so that a passing or failing eval arrives with a quantified false-positive and false-negative risk rather than a bare number (arXiv). It is the same move finance made decades ago: stop pretending the estimate is exact, and bound how wrong it can be. The frontier of evaluation is not a better score. It is a calibrated one.
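One standard way to carry a judge's error rates through to the conclusion is the misclassification correction known in epidemiology as the Rogan-Gladen estimator. The sketch below is offered as an illustration of the idea, not as the method of the work cited above, and it assumes the judge's sensitivity and specificity have already been measured against a human-labelled sample.

```python
def corrected_pass_rate(observed_pass_rate: float,
                        sensitivity: float,
                        specificity: float) -> float:
    """Estimate the true pass rate from a noisy judge's observed pass rate.

    sensitivity: P(judge passes | output is truly good)
    specificity: P(judge fails  | output is truly bad)

    The observed rate mixes true passes with false ones:
        observed = sensitivity * p + (1 - specificity) * (1 - p)
    Solving for p gives the estimate returned here.
    """
    informativeness = sensitivity + specificity - 1.0
    if informativeness <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed_pass_rate + specificity - 1.0) / informativeness
    # Sampling noise can push the raw estimate outside [0, 1]; clamp it
    return min(1.0, max(0.0, estimate))
```

As a worked example with invented figures: a judge that passes 80 per cent of outputs, with a sensitivity of 0.90 and a specificity of 0.85, implies a true pass rate of about 87 per cent. The bare number understated quality here; with a different error profile it could just as easily have flattered it.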
For most teams the pragmatic path runs through neither purity nor faith. The open-source platform Langfuse, acquired by the database company ClickHouse in January, has become a default precisely because its centre of gravity is capturing what models actually do in production: traces, tool calls, retrieval steps. That matters because the data problem is the real one. Since you cannot specify the input distribution in advance, evaluation has to shift from writing test cases to harvesting them, mining real traffic, clustering the failures and building eval sets from the messy reality rather than an imagined version of it. You cannot simulate all the possibilities; if you could, you would not need the model. So you instrument the system, watch it meet the world, and let the world write your test set.
The realistic bar
The current discourse, much of it conducted on LinkedIn by the founders selling the tools, oscillates between two unhelpful poles: that evaluation is a solved engineering problem awaiting the right platform, or that language models are fundamentally untrustworthy and should be kept away from anything that matters. The car and aircraft engineers suggest a third position. They never tested their way to certainty. They bounded the worst case, contained the failure, and shipped: conservatively, with monitoring, and with a clear-eyed sense of what could go wrong and how badly.
That is the bar for AI, and it is both more modest and more attainable than the one the industry keeps setting itself. The question to ask is not whether the model is correct, but whether the system stays resilient when the model is wrong. Enforce what you can guarantee. Deterministically check what you can check. Reserve fallible judgement for the genuinely subjective remainder, and treat its scores as the noisy estimates they are. And design, always, for graceful failure, because the one thing you can be certain of with a probabilistic machine is that, sooner or later, it will surprise you. The engineer I met at that conference understood this. He could not guarantee the perfect answer every time, so he engineered for a strong answer, delivered dependably, and got the thing into production. Two decades on, with far grander machines, we are still learning the same lesson.
Editor's note: Composo performance figures are the company's own and are attributed as such. Vendor and product details current as of June 2026.



