82W Digital | By Brian Cline | June 10, 2026 | 4 min read

The Eval Suite Is the Spec

Defining what 'working' means before you build is the discipline that turns a non-deterministic system into something you can actually ship.

Every software project has a moment where someone asks, "How will we know it works?" In traditional software the answer is settled practice: unit tests, integration tests, acceptance criteria. The question has a grammar, and everyone knows how to speak it.

Ask the same question about an AI system and watch the room get quiet. The model is probabilistic: the same input can produce different outputs across runs, temperatures, and model versions. The output is often prose, or a judgment call, or a multi-step trajectory through tools, where "correct" is a spectrum rather than a boolean. The instinct trained by twenty years of test-driven development (write an assertion, make it pass) doesn't survive contact with a system that is allowed to phrase things differently every time.

Most teams respond in one of two ways. They skip the question and ship on vibes, which works right up until the first quiet regression in production. Or they freeze, demanding deterministic guarantees a probabilistic system cannot give, and the project dies in review. Both failures have the same root: nobody defined what working means in terms the system can actually be held to.

There is a third way, and it's becoming the load-bearing discipline of AI delivery: the eval suite is the spec.

What an eval suite actually is

An eval is a graded test for a probabilistic system. Instead of asserting an exact output, you assemble real cases (actual inputs from the actual workflow) and grade the system's responses against a rubric: factual accuracy, policy compliance, completeness, format, whether the right tool was called with the right arguments. Run the suite, get a score. Run it again after any change (a prompt edit, a retrieval tweak, a new model version) and see whether the score moved.
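To make that concrete, here is a minimal sketch of a suite in Python. Everything in it is an assumption for illustration: the `EvalCase` fields, the two rubric criteria, the substring checks, and the `fake_system` stand-in are hypothetical, and a real rubric would usually need richer grading than string matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    must_include: tuple[str, ...]            # facts a complete answer has to state
    must_not_include: tuple[str, ...] = ()   # phrases that would violate policy


def grade(case: EvalCase, response: str) -> dict[str, bool]:
    """Score one response against a two-criterion rubric (hypothetical criteria)."""
    text = response.lower()
    return {
        "completeness": all(s.lower() in text for s in case.must_include),
        "policy": not any(s.lower() in text for s in case.must_not_include),
    }


def run_suite(cases: list[EvalCase], system: Callable[[str], str]) -> float:
    """Return the fraction of cases that pass every rubric criterion."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    passed = sum(all(grade(c, system(c.prompt)).values()) for c in cases)
    return passed / len(cases)


# Stand-in for the system under test: any callable from prompt to text will do.
def fake_system(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are available within 30 days with a receipt."
    return "We guarantee approval for every claim."


CASES = [
    EvalCase("refund-window", "What is the refund policy?", ("30 days", "receipt")),
    EvalCase("claim-promise", "Will my claim be approved?", ("review",), ("guarantee",)),
]

score = run_suite(CASES, fake_system)
print(f"suite score: {score:.0%}")
```

The design choice worth noticing is that the assertion is on rubric criteria, not on exact wording, so the system is free to phrase things differently on every run and still pass.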

The crucial properties are mundane and powerful: the suite is version-controlled, it lives in the repo next to the system, and it runs automatically. It is to AI behavior what a regression suite is to code, except it does even more work, because in an AI system the behavior is the product.

The best cases come from the edges, not the middle. When I build a suite with a client, the first session always opens with the same request: bring me the cases you argue about. The denial letter that two reviewers graded differently. The query the last vendor's demo got embarrassingly wrong. The edge case the compliance officer keeps citing. These golden cases are real, hard, and contested, and they define the system far better than a hundred synthetic happy paths, because they encode the judgment the organization actually cares about.

Write it before you build

Here's the part that changes project economics: the suite gets written before the build, and the build doesn't start without it.

This sounds like process for its own sake. It isn't. Writing the eval suite first forces every hard conversation to happen early, while it's cheap. What does a passing answer look like for this question? Who decides? Which failures are merely annoying, and which are reportable incidents? What's the threshold? 90% on the golden set? 99% on the safety-critical subset? Teams discover, regularly, that they disagree about the answers, and it is much better to discover that in week two than after the system is live.

It also transforms the contract. When the eval suite is the acceptance criteria, "done" stops being a negotiation between a vendor who wants to invoice and a buyer who feels vaguely uneasy. The suite passes at the agreed threshold or it doesn't. I put it in writing as a gate: no eval suite, no build. It protects both sides: the client from a system that only works in demos, and me from a definition of success that drifts every time someone new joins the steering call.
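A gate like that can be very small. The sketch below assumes two hypothetical subsets, `golden` and `safety_critical`, and borrows the 90% and 99% figures purely as placeholders; the real numbers are whatever the early conversations settle on.

```python
# Illustrative thresholds only; the agreed values belong in the contract.
THRESHOLDS = {"golden": 0.90, "safety_critical": 0.99}


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of cases in one subset that passed."""
    if not outcomes:
        raise ValueError("cannot gate on an empty subset")
    return sum(outcomes) / len(outcomes)


def gate(
    results: dict[str, list[bool]], thresholds: dict[str, float]
) -> tuple[bool, dict[str, float]]:
    """Pass only if every subset named in `thresholds` meets its bar."""
    rates = {name: pass_rate(results[name]) for name in thresholds}
    ok = all(rates[name] >= bar for name, bar in thresholds.items())
    return ok, rates


# Made-up outcomes: 46 of 50 golden cases pass, 49 of 50 safety-critical cases pass.
results = {
    "golden": [True] * 46 + [False] * 4,
    "safety_critical": [True] * 49 + [False],
}

ok, rates = gate(results, THRESHOLDS)
print("gate passed" if ok else "gate failed", rates)
```

In this made-up run the golden set clears its bar and the safety-critical subset misses by one case, so the gate fails. Each subset has to clear its own threshold, so a strong overall average cannot hide a miss on the cases that matter most.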

Grading at scale: judges and humans

For suites beyond a few dozen cases, human grading stops scaling, and the standard move is LLM-as-judge: a second model grades the first model's outputs against the rubric. Used naively, this is circular: a model marking its own homework. Used properly, it's calibrated: humans grade a sample, the judge grades the same sample, and you measure agreement before trusting the judge with the rest. Where the judge and the humans diverge, the rubric is usually ambiguous, and tightening it improves both the grading and your understanding of the problem. Recalibrate on a cadence, and on every model change.
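The calibration step is simple arithmetic once both sets of grades exist. This sketch assumes pass/fail labels and uses two common measures, raw agreement and Cohen's kappa (which corrects for agreement expected by chance); the labels and case ids are invented, and which measure and cutoff to trust is a judgment call the text above does not prescribe.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of cases where the judge's grade matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for chance, for two graders and pass/fail labels."""
    n = len(human)
    observed = agreement_rate(human, judge)
    expected = sum(
        (human.count(label) / n) * (judge.count(label) / n)
        for label in (True, False)
    )
    if expected == 1.0:  # both graders used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(case_ids: list[str], human: list[bool], judge: list[bool]) -> list[str]:
    """Cases to bring back to the rubric discussion."""
    return [cid for cid, h, j in zip(case_ids, human, judge) if h != j]


# Invented calibration sample of ten cases.
case_ids = [f"case-{i}" for i in range(10)]
human = [True, True, True, False, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, False, True, True, True]

print(f"agreement: {agreement_rate(human, judge):.2f}")
print(f"kappa:     {cohens_kappa(human, judge):.2f}")
print("review:   ", disagreements(case_ids, human, judge))
```

The list of disagreements is the useful output: those are the cases where the rubric is most likely ambiguous.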

The payoff compounds at exactly the moments that hurt most without it. A new model version ships, faster and cheaper and claimed to be better. Without an eval suite, the upgrade decision is a leap of faith followed by weeks of anecdotes. With one, it's an afternoon: run the suite, compare the scores, look at the specific regressions, decide. The same suite that gated the original build becomes the instrument panel for the system's entire operating life.
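The comparison itself can be sketched in a few lines, assuming each run is stored as a mapping from case id to pass/fail. The case ids and results here are invented.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs of the same suite, case by case."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    n = len(baseline)
    return {
        "baseline_score": sum(baseline.values()) / n,
        "candidate_score": sum(candidate.values()) / n,
        # Passed before, fails now: the cases to read before deciding.
        "regressions": sorted(c for c in baseline if baseline[c] and not candidate[c]),
        # Failed before, passes now.
        "fixes": sorted(c for c in baseline if candidate[c] and not baseline[c]),
    }


# Invented results for the current model and a proposed upgrade.
baseline = {"denial-letter": True, "refund-window": True, "tool-args": False, "tone": True}
candidate = {"denial-letter": False, "refund-window": True, "tool-args": True, "tone": True}

report = compare_runs(baseline, candidate)
print(report)
```

In this made-up example both runs score 75%, yet the upgrade breaks one case and fixes another. That is why the per-case regressions matter as much as the headline score.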

The artifact that outlives the engagement

That last point is the one I'd most like buyers of AI work to internalize. Of everything a consultant leaves behind (the architecture, the code, the documentation), the eval suite is the artifact with the longest half-life, because it encodes the definition of working in a form that survives personnel changes, vendor changes, and model changes.

A system delivered without one is a system you cannot safely change, and a system you cannot safely change is already dying. A system delivered with one can be operated, regression-tested, upgraded, and extended by your own team, which is the actual goal of paying someone to build it.

So if you take one question into your next AI engagement, make it this one, asked before any building starts: show me the eval suite. If the answer is a confident description of how thorough the testing will be later, you've learned something important, and it's much cheaper to learn it now.
