How we work
Evals before build.
Transfer at the end.
Why a method
The gap is organizational, not technical.
Nearly every enterprise is piloting AI; only a small fraction of those pilots ever reach production. The failures repeat the same way each time. It's never a model that wasn't smart enough. It's an integration nobody scoped, quality that couldn't be measured, observability that didn't exist, and a system nobody owned after the consultants left.
The industry's default answer is embedded engineering: talented people parachuted in to make it work by hand. It does work. Then it leaves, and most of those systems are abandoned within a year, because the knowledge that kept them alive was never transferred. Billed hourly, that model is economically incentivized never to transfer it.
A method fixes what a better model can't: define working before building, gate every phase on a written exit condition, and architect the engagement to end.
~4 in 5
enterprises piloting generative AI
<1 in 5
of those pilots reaching production
most
vendor-embedded builds abandoned after the engagement ends
- Survey figures: directionally consistent, precisely contested
The discipline
Five things done every single time.
The archetypes vary, from RAG systems and agents to copilots and enrichment pipelines, but they fail the same way and they're saved the same way. This is the work that does the saving.
Evals-as-engineering
Passing behavior is defined as a version-controlled eval suite before the build starts: golden cases from your hardest real examples, graded automatically, regressed on every change and every model version. The suite is the spec, the acceptance criteria, and the artifact you keep.
Context engineering
Most of an AI system's quality lives in what the model sees: retrieval, tool design, prompts as versioned artifacts, the shape of the data at inference time. This is engineering work with engineering rigor, not prompt tinkering.
Integration into legacy reality
The system has to live next to the systems you already have: the EHR, the data warehouse, the ticketing queue, the approval chain. Integration scope gets validated in phase zero, before anyone falls in love with a demo.
Governance and guardrails
Constitutional rules, escalation paths, audit trails, and human review loops wired into the runtime. These are operational controls a regulator can inspect, not a policy PDF. In HIPAA and education settings, this is the difference between shipped and shelved.
Adoption and ownership
A system nobody owns is a pilot with extra steps. Every engagement names an owner on your team early, and the closing phases exist to make that ownership real: monitoring they watch, runbooks they wrote alongside me, changes they ship themselves.
The shape of an engagement
Five phases. Hard gates.
Every phase ends with a written exit condition. No gate, no next phase. The two phases everyone else skips are the first and the last.
00
Validation
1–2 weeks
Every failed AI project I've seen was failing before the build started: a use case that didn't match the data, an integration nobody scoped, an owner nobody named. Validation is one to two weeks of reconciling the brief against reality. That means reading the data, tracing the systems it has to touch, and naming who operates the thing when it ships. You leave with a written brief and a decision: move forward, or don't. Either answer is a good outcome.
Gate: the brief survives contact with your data and systems
01
Eval Definition
1–2 weeks
Probabilistic systems don't have a 'works on my machine.' The only honest definition of done is an eval suite: real cases, golden answers, graded automatically, version-controlled next to the code. We write it together before the build, drawing from your hardest real examples rather than synthetic ones, and it becomes the acceptance criteria in the contract. When a new model version ships next quarter, you re-run the suite instead of re-arguing the project.
Gate: no eval suite, no build
02
Build
3–6 weeks
The build is the most conventional phase, and the one that benefits most from the two before it. The eval suite is failing on day one and the work is making it pass: context engineering, tool design, guardrails wired into the runtime rather than described in a slide. Working software reaches real users inside the phase, because the feedback that matters doesn't come from demos. Exit is mechanical. The suite passes at the agreed threshold.
Gate: the eval suite passes at threshold
03
Production Hardening
2–3 weeks
Most pilots die in the gap between 'the demo worked' and 'someone owns this at 2am.' Hardening closes that gap: observability on every model call, alerting tied to the eval metrics, the integration load-tested against production traffic shapes, runbooks written for the failure modes we found in build. The exit condition is organizational as much as technical. There is a named owner, on your team, who has already operated it.
Gate: monitoring live, ownership assigned
04
Transfer
1–2 weeks
Transfer is priced like the most valuable phase because it is. Embedded engineering that never leaves is a rental; the industry abandons most of those systems within a year of the consultants walking out. The exit gate here is a rehearsal, not a handshake: your team ships a change, runs the evals, and operates the system while I watch. You keep the suite, the reference architecture, the integration code, and the runbooks. Those are the artifacts that keep it alive.
Gate: your team operates and extends it without me
What you keep
The engagement ends. The system doesn't.
Transfer is priced like the most valuable phase because it is. These artifacts are what keep the system alive after I'm gone.
The eval suite
Version-controlled, runnable by your team, the permanent definition of what working means. Re-run it when the next model version ships.
Reference architecture
The system as built, documented as built: boundaries, data flows, failure modes, and the reasoning behind the load-bearing decisions.
Integration code
The connectors and glue between the AI system and your actual stack, written to your conventions, owned by you.
Runbooks and monitoring
Dashboards tied to the eval metrics, alerts for the failure modes we found in build, and the 2am instructions for each one.
Teach-back
Transfer ends with your team operating the system while I watch: shipping a change, running the evals, handling a failure. Rehearsed, not promised.
Engagement shapes
Two ways in. One honest constraint.
If your AI spend is already committed
Value-realization sprint
An audit of the committed spend and stalled initiatives in week one, then one use case taken through all five gates, ending in a value memo your leadership can take to the board. Proof, not another pilot.
If the problem is new
Prototype engagement
Validation and eval definition up front, a working system in real hands inside weeks, and a hard decision gate before any production investment. Fixed fee through the early phases, and either answer at the gate is a good outcome.
Capacity
Three to four projects a year
82W is one senior practitioner, deliberately. Every engagement gets the same hands from first call to teach-back, which means fit gets decided honestly on the first conversation.
Start a conversation
Got a problem worth
shipping an answer to?
One email. Describe the problem, the stakes, and the timeline. We reply to everything that isn't a sales pitch.
