ESSAY

Pilot Purgatory.

For Heads of AI watching their second pilot stall.

88% of AI pilots don't reach wide deployment. Only 4 of every 33 launched go live. This is not a technology problem. It is an operating problem — and there are specific moves that close the gap.

The pattern.

It starts the same way every time. A team identifies a use case — document summarization, customer query routing, contract review — and builds a proof of concept in a few weeks. The demo is compelling. The room claps. Someone from the business side asks when it will be in production.

Then nothing happens for six months.

Not because anyone decided to stop. Not because the technology failed. Because nobody built the four things the pilot was missing: an eval harness to prove quality at scale, a cost model to prove viability at volume, a failure-mode catalog to prove resilience against real inputs, and a production owner who will still be watching it in three months.

This is pilot purgatory. It is not a graveyard — the work isn't dead. It is suspended between demo and production, indefinitely, while the team runs another pilot somewhere else.

“They mistook Proof of Concept activity for progress. Many PoCs were driven by peer pressure and tooling excitement, not by a clearly defined business problem, value stream, or operating model.”

The share of enterprises abandoning most of their AI initiatives jumped from 17% in 2024 to 42% in 2025. That number is not a commentary on whether AI works. It is a commentary on whether the organizations running pilots have the operating infrastructure to graduate them.

Most don't. And the gap is not closing on its own — because the same teams running failed pilots are already running the next one. The pattern doesn't break until the operating model changes.

The four failure modes.

Every pilot that dies in purgatory dies for one or more of these reasons. They are not bad luck. They are architectural choices — made by omission rather than commission.

01

The naive demo loop

The pilot succeeds by the only measure anyone defined for it: the demo runs, the stakeholders nod, the meeting ends with applause. Nobody asked what happens when the first real user sends an ambiguous query, or what the system does when the retrieval step returns nothing useful. The pilot has no failure-mode catalog because nobody built one. When production arrives, the failure modes do too — and now they're live.
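
A failure-mode catalog can start as a short list of known-bad situations, each paired with the behavior you consider acceptable. The sketch below is a minimal illustration, not a prescribed format: the `answer` stub, the `FALLBACK` message, and the two catalog entries are all hypothetical stand-ins for a real pipeline and real cases.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical fallback message; a real system would define its own.
FALLBACK = "I could not find a reliable answer; routing to a human."


def answer(query: str, retrieved: list[str]) -> str:
    """Stand-in for the pilot's pipeline, with two failure modes handled explicitly."""
    if not query.strip():
        return FALLBACK          # blank or whitespace-only query
    if not retrieved:
        return FALLBACK          # retrieval returned nothing useful
    return f"Based on {len(retrieved)} document(s): {retrieved[0]}"


@dataclass(frozen=True)
class FailureMode:
    name: str
    query: str
    retrieved: list[str]
    acceptable: Callable[[str], bool]   # what "handled" means for this case


CATALOG = [
    FailureMode("empty retrieval", "What is our refund policy?", [],
                lambda out: out == FALLBACK),
    FailureMode("blank query", "   ", ["Refunds within 30 days."],
                lambda out: out == FALLBACK),
]


def unhandled(catalog: list[FailureMode]) -> list[str]:
    """Return the names of catalog entries the pipeline does not handle acceptably."""
    return [fm.name for fm in catalog
            if not fm.acceptable(answer(fm.query, fm.retrieved))]


print(unhandled(CATALOG))
```

Each new incident adds an entry, so the list of unhandled names becomes a standing check, not something reconstructed after the fact.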

02

Cost runaway

A demo that costs $40 in inference during a pilot can cost $40,000 per month at production volume. Multiply by retries, by long-context queries, and by the two or three model calls you'll probably need per user-facing answer (retrieval, generation, validation), and the inference line item alone can exceed the salary of the engineer who built the system. Most pilots have no cost model. They have a credit card and a demo.
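
A cost model does not need to be elaborate to be useful. The sketch below shows one way to estimate monthly inference spend from request volume, calls per request, retry rate, and token counts. Every number in it is a hypothetical placeholder, not vendor pricing and not the figures quoted above; the point is the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class CostAssumptions:
    """All values are illustrative assumptions, not real prices or volumes."""
    requests_per_month: int
    calls_per_request: float        # e.g. retrieval + generation + validation
    retry_rate: float               # fraction of calls retried once
    input_tokens_per_call: int
    output_tokens_per_call: int
    usd_per_1k_input_tokens: float
    usd_per_1k_output_tokens: float


def monthly_inference_cost(a: CostAssumptions) -> float:
    """Estimate monthly inference spend in USD."""
    cost_per_call = (
        a.input_tokens_per_call / 1000 * a.usd_per_1k_input_tokens
        + a.output_tokens_per_call / 1000 * a.usd_per_1k_output_tokens
    )
    # Retries add extra calls on top of the planned calls per request.
    effective_calls = a.requests_per_month * a.calls_per_request * (1 + a.retry_rate)
    return effective_calls * cost_per_call


pilot = CostAssumptions(
    requests_per_month=500, calls_per_request=1, retry_rate=0.0,
    input_tokens_per_call=2000, output_tokens_per_call=500,
    usd_per_1k_input_tokens=0.01, usd_per_1k_output_tokens=0.03,
)
production = CostAssumptions(
    requests_per_month=200_000, calls_per_request=3, retry_rate=0.1,
    input_tokens_per_call=2000, output_tokens_per_call=500,
    usd_per_1k_input_tokens=0.01, usd_per_1k_output_tokens=0.03,
)

print(f"pilot:      ${monthly_inference_cost(pilot):,.2f} per month")
print(f"production: ${monthly_inference_cost(production):,.2f} per month")
```

Under these made-up inputs the per-call price never changes; the growth comes entirely from volume, calls per answer, and retries, which is why the number surprises teams that only ever watched the pilot bill.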

03

Missing eval harness

The pilot runs, the output looks good, and the team ships with confidence. Three months later the system is quietly producing wrong answers on a class of inputs nobody tested for. The eval set showed 95% quality, but it was written in a conference room, not sourced from production logs. An eval harness built against real usage data would have caught this in week two. The pilot had neither the harness nor the data.

04

No production owner

The pilot was run by a small team under deadline pressure with the implicit understanding that the business team would take ownership once it worked. The business team has no idea how the system works. The engineering team has moved on to the next pilot. When the model provider releases an update and behavior shifts, nobody is watching. The system degrades silently until a user complains loudly enough to escalate.
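
Part of what a production owner does is notice when behavior shifts. One simple way to make that concrete, sketched below under assumptions, is to keep baseline pass rates per input slice and compare them against a fresh run after any model or prompt change. The slice names, the rates, and the 0.02 tolerance are hypothetical.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return slices whose pass rate dropped by more than `tolerance`.

    A slice missing from `current` is treated as a pass rate of 0.0,
    so a silently dropped slice is flagged, not ignored.
    """
    drops = {}
    for slice_name, base_rate in baseline.items():
        drop = base_rate - current.get(slice_name, 0.0)
        if drop > tolerance:
            drops[slice_name] = round(drop, 4)
    return drops


# Hypothetical pass rates before and after a provider update.
baseline = {"billing": 0.96, "contracts": 0.91}
after_update = {"billing": 0.95, "contracts": 0.82}

print(regressions(baseline, after_update))
```

The check is only useful if its output lands in front of a named person, which is the part no script can supply.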

“The teams shipping AI agents successfully in 2026 aren't the ones with the best models — they're the ones with the best evaluation infrastructure. Evaluation is differentiation.”

Three operating moves.

These are the changes I see in every pilot that actually graduates. They are operating decisions, not technology decisions. You don't need a better model to make them.

01

Define success before architecture

Not 'the demo works.' Not 'the team is excited.' What specific user action, at what quality level, measured how, at what inference cost, constitutes production-ready? Write this down before you choose a framework. The teams that skip this step run pilots until they get tired of running pilots. The teams that don't skip it build toward a gate they can actually pass.
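
Writing the criteria down can be as literal as putting them in a small data structure that a measurement either passes or fails. This is a minimal sketch; the field names, the example use case, and the thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    user_action: str             # what the user is trying to get done
    quality_metric: str          # how quality is measured
    min_quality: float           # the bar on that metric
    max_usd_per_request: float   # the inference cost ceiling


@dataclass(frozen=True)
class Measurement:
    quality: float
    usd_per_request: float


def unmet(criteria: SuccessCriteria, measured: Measurement) -> list[str]:
    """Return the criteria that are not met; an empty list means the gate passes."""
    failures = []
    if measured.quality < criteria.min_quality:
        failures.append(
            f"{criteria.quality_metric} {measured.quality:.2f} "
            f"is below the bar of {criteria.min_quality:.2f}"
        )
    if measured.usd_per_request > criteria.max_usd_per_request:
        failures.append(
            f"cost ${measured.usd_per_request:.3f} per request "
            f"exceeds the ceiling of ${criteria.max_usd_per_request:.3f}"
        )
    return failures


# Hypothetical gate for a ticket-routing use case.
criteria = SuccessCriteria(
    user_action="route an inbound support ticket to the right queue",
    quality_metric="routing accuracy on logged tickets",
    min_quality=0.92,
    max_usd_per_request=0.02,
)

print(unmet(criteria, Measurement(quality=0.95, usd_per_request=0.035)))
```

A gate like this turns "is it ready?" into a list that is either empty or names exactly what is still missing.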

02

Build the eval harness against real production logs

Not against the inputs you imagine. Not against the inputs from the demo. Against the inputs real users sent before you had AI in the loop — support tickets, chat transcripts, edge cases that made your team wince. Build the eval harness before you build the feature. The harness tells you whether the thing you built is actually working. Without it, you are shipping on vibes.
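
At its smallest, an eval harness is a loop over logged inputs with a known-good answer for each. The sketch below assumes a ticket-routing task; the `LoggedCase` records, the `keyword_router` stub standing in for the system under test, and the exact-match scorer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LoggedCase:
    case_id: str
    user_input: str   # taken from real logs, not invented for the demo
    expected: str     # label assigned by a human reviewer


def exact_match(got: str, want: str) -> bool:
    return got.strip().lower() == want.strip().lower()


def run_eval(
    cases: list[LoggedCase],
    system: Callable[[str], str],
    matches: Callable[[str, str], bool] = exact_match,
) -> tuple[float, list[tuple[str, str]]]:
    """Run every logged case through `system`; return (pass_rate, failures)."""
    failures = []
    for case in cases:
        got = system(case.user_input)
        if not matches(got, case.expected):
            failures.append((case.case_id, got))
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures


def keyword_router(text: str) -> str:
    """Stand-in for the real system under test."""
    lowered = text.lower()
    if "charged" in lowered or "refund" in lowered:
        return "billing"
    if "log in" in lowered:
        return "account"
    return "general"


cases = [
    LoggedCase("t1", "I was charged twice this month", "billing"),
    LoggedCase("t2", "I can't log in after the update", "account"),
    LoggedCase("t3", "Please refund my last order", "billing"),
    LoggedCase("t4", "The app crashes when I upload a file", "technical"),
]

pass_rate, failures = run_eval(cases, keyword_router)
print(pass_rate, failures)
```

The failures list matters more than the pass rate: it names the inputs the system gets wrong, which is what a conference-room eval set tends to miss.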

03

Route by cost, not by capability

The most capable model is not the right model for every task. Classification tasks don't need Claude Opus. Summaries don't need GPT-4. A routing layer that picks the cheapest model that can hit the bar for each task type is not premature optimization — it's the difference between a cost model that works and an inference bill that kills the project. Vertical wins first: prove the unit economics in one domain before generalizing.
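
A routing layer of this kind can be sketched in a few lines: for each task type, pick the cheapest option whose measured pass rate clears the bar. The tier names, prices, and pass rates below are hypothetical placeholders, not real products or benchmarks, and in practice the pass rates would come from your own eval harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelOption:
    name: str                              # hypothetical tier name
    usd_per_1k_tokens: float               # illustrative price, not vendor pricing
    pass_rate_by_task: dict[str, float]    # measured on your own eval set


def pick_model(
    task_type: str, quality_bar: float, options: list[ModelOption]
) -> Optional[ModelOption]:
    """Return the cheapest option that meets the bar for this task, or None."""
    eligible = [
        m for m in options
        if m.pass_rate_by_task.get(task_type, 0.0) >= quality_bar
    ]
    if not eligible:
        return None   # nothing clears the bar; escalate instead of guessing
    return min(eligible, key=lambda m: m.usd_per_1k_tokens)


OPTIONS = [
    ModelOption("small", 0.0005, {"classification": 0.97, "contract_review": 0.71}),
    ModelOption("medium", 0.003, {"classification": 0.98, "contract_review": 0.88}),
    ModelOption("large", 0.015, {"classification": 0.99, "contract_review": 0.95}),
]

for task, bar in [("classification", 0.95), ("contract_review", 0.90)]:
    choice = pick_model(task, bar, OPTIONS)
    print(task, "->", choice.name if choice else "no model meets the bar")
```

Returning `None` when nothing qualifies is a deliberate choice in this sketch: it forces a decision about that task instead of silently defaulting to the most expensive model.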

The model is rarely the problem.

In the pilots I've worked through that stalled, the bottleneck was almost never the model. It was the harness engineering — the infrastructure around the model that the team skipped because the demo didn't need it. Production does.

What operator-grade looks like.

Operator-grade is not a technology tier. It is a decision-making posture. It means you defined success before you chose the architecture. It means you built the eval harness before you built the feature. It means you have a cost model, a failure-mode catalog, and a named person who will still be watching the system in six months.

Most teams building AI pilots don't have any of these things. They have a clever system prompt and a deadline. That is a recipe for purgatory.

The gap I work in is specifically this: between the pilot that ran successfully and the production system that nobody built. I come in after the demo, before the project gets quietly cancelled, and I do the operating work — the eval harness, the cost model, the failure-mode catalog, the production handoff plan — that the pilot team didn't have the mandate or the capacity to do.

Behavior design matters here too. The most damaging assumption of the last two years was that once the tools are good enough, adoption will follow. Leaders underestimated how much behavior follows incentives and manager reinforcement rather than tool quality. An operator-grade engagement accounts for this. It doesn't just ship the system; it answers the question of whether the people it was built for will actually use it.

“After three years of shoveling millions into AI, most organizations haven't actually changed — and this isn't a tech failure, it's a failure of leadership to adapt.”

The role of the CAIO is evolving from symbolic appointment to operational accountability. Boards that spent 2024 asking “what's your AI strategy?” are now asking “what did it cost, what did it return, and how do you know?”

That shift is uncomfortable for teams that have been running pilots without production-readiness criteria. It is clarifying for teams that have the eval harness, the cost model, and the production owner in place. The operating work is not overhead — it is the answer to the board question.

If your pilot is stuck.

I work with Heads of AI at Series A through pre-IPO companies whose pilots have stalled before production. The engagement starts with a working session — not a discovery call. We look at the actual system, identify the specific gap, and come out with a concrete plan.
