82W Digital | By Brian Cline | June 10, 2026 | 5 min read

The Models Work. The Deployments Don't.

The constraint on enterprise AI has moved from model capability to delivery capacity, and the gap is organizational, not technical.

Somewhere in your organization right now, there is probably an AI pilot that impressed everyone in the demo and has not shipped. It is not alone. Industry surveys put the share of enterprises running generative AI pilots somewhere near four in five, and the share of those pilots reaching production somewhere under one in five. The precise figures are contested; the direction is not. The pattern is so consistent across companies, industries, and vendors that it deserves a name: the delivery gap.

Here is the part that took the industry a surprisingly long time to accept. The gap is not the model's fault. The models work. They have worked, for most enterprise-relevant tasks, for a couple of years now. Each new release makes the demo better, and the demo was never the problem.

Where pilots actually die

When you do post-mortems on stalled AI initiatives (and I have done a number of them), the same five causes of death come up over and over, in different costumes:

Legacy integration. The agent has to act against systems that were never designed to be acted upon. The pilot ran against a clean export; production means the real EHR, the real ERP, the real approval chain, with their authentication quirks and undocumented edge cases. Nobody scoped this, because the demo didn't need it.

Quality at volume. The behavior that looked flawless across fifty hand-picked examples degrades across fifty thousand real ones, across model version updates, across the long tail of inputs nobody thought to test. Without a way to measure quality, "it got worse" is a vibe, and vibes don't survive a budget review.

No observability. Something fails in production and nobody can see why. Traditional software has stack traces; an unobserved AI system has a shrug. If you can't see why it's failing, you can't fix it, and a system that can't be fixed gets turned off.

No owner. The pilot belonged to an innovation team, a vendor, or an enthusiastic individual. Production systems need someone whose job it is when the thing misbehaves at 2am. "Unclear organizational ownership" sounds like consulting filler until you watch a working system die of it.

Generic model, specific work. The model knows everything in general and nothing about how your claims process actually adjudicates the weird cases. Closing that gap takes curated domain data and context, which is ongoing work, not a one-time setup.

Notice what's missing from that list: "the model wasn't smart enough." None of these five is solved by a better model. All of them are solved by competent delivery with a method. That single observation should reorganize how you buy, build, and staff AI work.

Why this breaks the traditional playbook

It's tempting to conclude that this is just systems integration with a new coat of paint: call the usual integrators, staff the usual bench. But generative AI delivery breaks several assumptions that traditional delivery models were built on.

Traditional software is deterministic: same input, same output, testable with unit tests and regression suites. A probabilistic system varies by run, by model version, by context. The entire validation toolchain has to be rebuilt around evals (graded test suites of real cases) rather than assertions. Code, once written, is stable; prompts and context pipelines are living artifacts that need continuous tuning as models and data shift. Users adapt to traditional software; users distrust AI, which means adoption is a process-redesign problem, not a training-video problem. And the security surface is genuinely novel: prompt injection, data leakage through model outputs, hallucination as a compliance risk.
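To make the difference between an assertion and an eval concrete, here is a minimal sketch in Python. Everything in it is illustrative: `EvalCase`, `run_eval`, `passes_gate`, the 0.95 threshold, and the claims-flavored cases are hypothetical names and values, and `stub_model` stands in for a real model call. The point is the shape: each case is run several times, graded, and the release decision rests on a pass rate rather than on a single exact-match check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One graded case: a real-world input plus a rule for judging the output."""
    case_id: str
    prompt: str
    grader: Callable[[str], bool]  # True if the output is acceptable


def run_eval(model: Callable[[str], str],
             cases: list[EvalCase],
             runs_per_case: int = 3) -> float:
    """Return the fraction of (case, run) pairs that the graders accept.

    Each case runs several times because a probabilistic system
    may answer differently from one run to the next.
    """
    passed = 0
    total = 0
    for case in cases:
        for _ in range(runs_per_case):
            output = model(case.prompt)
            passed += int(case.grader(output))
            total += 1
    return passed / total if total else 0.0


def passes_gate(pass_rate: float, threshold: float = 0.95) -> bool:
    """Release decision: a pass rate against a threshold, not one assertion."""
    return pass_rate >= threshold


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    return "ESCALATE" if "fraud" in prompt.lower() else "APPROVE"


# Illustrative cases; a real suite would be drawn from production traffic.
CASES = [
    EvalCase("routine-claim", "Routine windshield claim, $300.",
             lambda out: out == "APPROVE"),
    EvalCase("fraud-flag", "Claim flagged for possible fraud.",
             lambda out: out == "ESCALATE"),
    EvalCase("duplicate-claim", "Second claim for the same incident.",
             lambda out: out == "ESCALATE"),
]

if __name__ == "__main__":
    rate = run_eval(stub_model, CASES)
    print(f"pass rate: {rate:.2f}, ship: {passes_gate(rate)}")
```

With this stub, the duplicate-claim case fails on every run, so the pass rate is two thirds and the gate stays closed. Kept in version control, a suite like this can be rerun unchanged against each new model version, which is what turns "it got worse" from a vibe into a number.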

Each of these is manageable. Together, they explain why a deep bench of excellent traditional engineers does not automatically convert into AI delivery capacity, and why demand in the talent market runs at a multiple of qualified supply. The limiting discipline isn't writing code. It's things like context engineering (deciding, by design, what information reaches the model and what noise is filtered out) and eval construction, which are still young enough that almost nobody has them as a settled practice.
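As a rough illustration of what "deciding, by design" can mean, the sketch below ranks candidate snippets by word overlap with the request, drops anything with no overlap, and stops at a size budget. The names (`Snippet`, `relevance`, `select_context`) and the overlap scoring are assumptions chosen for brevity; a production system would more likely use retrieval over embeddings plus domain rules. The decision being made is the same either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A candidate piece of context and where it came from."""
    source: str
    text: str


def relevance(snippet: Snippet, query: str) -> int:
    """Naive score: the number of distinct query words that appear in the snippet."""
    query_terms = set(query.lower().split())
    snippet_terms = set(snippet.text.lower().split())
    return len(query_terms & snippet_terms)


def select_context(snippets: list[Snippet],
                   query: str,
                   max_chars: int = 500) -> list[Snippet]:
    """Choose what reaches the model: relevant snippets only, within a budget."""
    scored = [(relevance(s, query), s) for s in snippets]
    # Filter out noise (zero overlap), then rank the rest by score.
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: pair[0], reverse=True)
    selected: list[Snippet] = []
    used = 0
    for _, snippet in ranked:
        if used + len(snippet.text) > max_chars:
            continue  # skip anything that would exceed the budget
        selected.append(snippet)
        used += len(snippet.text)
    return selected


# Illustrative data only.
SNIPPETS = [
    Snippet("policy", "water damage claims require an adjuster visit"),
    Snippet("newsletter", "the office picnic is on friday"),
    Snippet("faq", "claims over the limit require manager approval"),
]

if __name__ == "__main__":
    for s in select_context(SNIPPETS, "water damage claims"):
        print(s.source, "->", s.text)
```

Here the newsletter never reaches the model, and tightening `max_chars` drops the weaker match before the stronger one. The design choice worth noticing is that the filtering is explicit code someone can review and test, not a side effect of whatever happened to be pasted into the prompt.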

The forward-deployed paradox

The industry does have one proven answer to the delivery gap: embed great engineers directly in the customer's operation until the thing works. Palantir validated this model over more than a decade; every AI lab has since copied it, because it genuinely crosses the gap between the brief and reality.

But the embedded model carries a paradox. It works. It doesn't scale, and it doesn't stick.

It doesn't scale because it's linear: one scarce, expensive engineer, one customer, for months. You cannot hire your way across a structural talent shortage.

It doesn't stick because of an uncomfortable incentive. One analyst projection, directionally credible even if you discount the number, suggests most enterprises will abandon vendor-embedded agentic systems once the vendor support ends, because the cost never came down and the skills never transferred in-house. Billed by the hour, embedded engineering is economically rewarded for not transferring capability. Every runbook the consultant doesn't write, every eval suite that lives in their head instead of your repo, is another billable month. Few people behave cynically on purpose; the incentive does the work quietly.

What a method looks like

If the gap is organizational, the fix is organizational: a delivery discipline that treats the five failure modes as first-class engineering problems, and an engagement structure that is architected to end.

In practice, that means a few non-negotiables. Validation before anything else: reconciling the stated need against the actual data and systems, because most doomed projects are doomed before the first line of code. A version-controlled eval suite written before the build, so "working" has a definition that survives arguments, model upgrades, and personnel changes. Observability and ownership treated as exit criteria, not afterthoughts: a named person on the client's team who has already operated the system before the engagement closes. And transfer as a deliverable in its own right: the eval suite, the architecture, the runbooks, and a rehearsed handoff, priced as the most valuable phase rather than tacked on as a courtesy.

None of this is glamorous. That's rather the point. The glamorous part of AI, the model, is done, commoditized, and getting cheaper by the quarter. What's scarce is the unglamorous discipline that turns a capable model into a system someone owns. The organizations that figure this out won't have better AI than their competitors. They'll have AI that's actually running.
