From Pilot to Production

The wall.

You have a working demo. Investors liked it. The pilot user was impressed. You have a development team: maybe three engineers, maybe offshore, maybe you and a co-founder building at 2am. And now you have to ship this to real users, at real volume, within your real run rate.

The wall is the gap between what the demo proves — the model can do the thing — and what production requires: the model can do the thing reliably, measurably, at a cost that fits your burn, without failing silently on the inputs real users send.

Most founders hit this wall alone. They have engineers who can build, but no one who has been through this specific gap before. Every architecture decision they make right now — model choice, RAG vs. long context, eval strategy, latency targets — will hold or break them over the next 12 months.

“Anyone can build a demo. Shipping AI to production is a completely different sport.”
— Replify engineering team. GeekWire, “A reality check on AI engineering”

Token economics surprise everyone. The evals you wrote in a conference room don't catch what happens when an angry user sends an ambiguous request at 11pm. The model you chose in month two of the build might be wrong — and migration will cost you two sprints you don't have.

This is not a technology problem. It is a decision problem — and the decisions compound faster than the code.

What's actually missing.

Every pilot that dies before production is missing one or more of these four things. They are not glamorous. They are not what you demo to investors. But without them, you don't have a product — you have a prototype dressed up as one.

Eval harness

A systematic way to measure whether the AI output is actually good — built against real production inputs, not inputs you imagined in a conference room. Without it, you're shipping on vibes. With it, you can catch regressions before users do.
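To make that concrete, here is a minimal sketch of what an eval harness can look like. It is an illustration, not a framework: `EvalCase`, `run_evals`, and `fake_model` are hypothetical names, the two cases are invented, and in a real harness the prompts would come from your production logs and `fake_model` would be your actual model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str                    # ideally copied from a real production log
    check: Callable[[str], bool]   # returns True when the output is acceptable


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the model and report which ones failed."""
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not case.check(output):
            failures.append(case.name)
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "REFUND" if "refund" in prompt.lower() else "UNKNOWN"


cases = [
    EvalCase("clear refund request",
             "I want a refund for order 123",
             lambda out: out == "REFUND"),
    EvalCase("ambiguous angry message",
             "this is ridiculous, fix it",
             lambda out: out == "CLARIFY"),
]

report = run_evals(fake_model, cases)
print(report)
```

Run it on every prompt or model change. A case that passed yesterday and fails today is a regression you caught before a user did.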

Cost model

A demo that costs $40 in inference can cost $40,000 per month at production volume. Multiply by retries, long-context queries, and multi-step calls and the inference line item alone can outpace an engineer's salary. The cost model isn't optional — it's the gate between pilot and production.
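The arithmetic behind that jump is simple enough to sketch. Everything below is an assumption for illustration: the function name, the token counts, the request volumes, and the per-million-token prices are placeholders, not any provider's real rates. Swap in your own numbers.

```python
def monthly_inference_cost(
    requests_per_month: float,
    input_tokens: float,          # average prompt tokens per model call
    output_tokens: float,         # average completion tokens per model call
    usd_per_1m_input: float,      # placeholder price, not a real rate
    usd_per_1m_output: float,     # placeholder price, not a real rate
    calls_per_request: float = 1.0,  # multi-step pipelines make several calls
    retry_rate: float = 0.0,         # fraction of calls that get retried once
) -> float:
    """Estimate monthly inference spend in USD."""
    cost_per_call = (
        input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output
    ) / 1_000_000
    effective_calls = requests_per_month * calls_per_request * (1 + retry_rate)
    return effective_calls * cost_per_call


# Demo traffic: 1,000 single-call requests a month.
demo = monthly_inference_cost(1_000, 2_000, 500, 3.0, 15.0)

# Production traffic: 500,000 requests, three calls each, 10% retried.
prod = monthly_inference_cost(
    500_000, 2_000, 500, 3.0, 15.0, calls_per_request=3, retry_rate=0.1
)

print(f"demo: ${demo:,.2f}/month, production: ${prod:,.2f}/month")
```

With these made-up inputs the same system goes from about $13.50 a month to about $22,275 a month. The price per token never changed. Volume, call count, and retries did.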

Failure-mode catalog

What happens when a user sends an ambiguous query? When the retrieval step returns nothing? When the model hallucinates a number that someone will act on? The catalog is the explicit list of known failure modes and the handling logic for each. Three months into production you'll find new ones, and you need an existing catalog to add them to.
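A catalog like this can start as a plain list in code. The sketch below covers the three questions above. All names, result fields (`intent_confidence`, `retrieved_docs`, `answer`), and the 0.5 threshold are hypothetical, and the number check is deliberately crude: it only asks whether each number in the answer appears somewhere in the retrieved text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FailureMode:
    name: str
    detect: Callable[[dict], bool]  # inspects one pipeline result
    handling: str                   # what the system does instead of answering


def _unsupported_number(result: dict) -> bool:
    # Crude substring check: flags any number in the answer that does not
    # appear anywhere in the retrieved source text.
    source_text = " ".join(result.get("retrieved_docs", []))
    numbers = re.findall(r"\d+(?:\.\d+)?", result.get("answer", ""))
    return any(n not in source_text for n in numbers)


CATALOG = [
    FailureMode("ambiguous_query",
                lambda r: r.get("intent_confidence", 1.0) < 0.5,
                "ask a clarifying question"),
    FailureMode("empty_retrieval",
                lambda r: len(r.get("retrieved_docs", [])) == 0,
                "say no source was found; do not answer from memory"),
    FailureMode("unsupported_number",
                _unsupported_number,
                "withhold the figure and route to human review"),
]


def triage(result: dict) -> Optional[FailureMode]:
    """Return the first catalogued failure mode that matches, if any."""
    for mode in CATALOG:
        if mode.detect(result):
            return mode
    return None


bad = {"intent_confidence": 0.9,
       "retrieved_docs": ["Revenue was 12 million."],
       "answer": "Revenue was 15 million."}
hit = triage(bad)
print(hit.name if hit else "no known failure mode")
```

When production surfaces a new failure, it becomes one more entry in `CATALOG` with its own detection rule and handling.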

Production owner

The named person who will still be watching the system in three months. When the model provider releases an update and behavior shifts, who notices? When a user escalates an error, who investigates? A production system without an owner is not a production system — it's a pilot with better infrastructure.

“Define evals based on the failure states that you find rather than sitting in a room and coming up with a list of evals that you think are going to be right. Otherwise, all of your evals are passing, but customers still aren't happy.”
— Productboard. AI Evals for Product Managers

What we'd do together.

This is a 6–8 week engagement scoped to a startup. Not a strategy phase. Not a discovery workshop. We start with the actual system and we end with it in production.

Architecture review — one week

I look at what you built, not what you planned to build. We identify the specific gaps between your current system and a production-ready one: the eval harness you're missing, the cost model that was never built, the failure modes nobody cataloged. You get a written assessment with a ranked list of what to fix first.

Build the harness — two to four weeks

I work embedded with your team to build the missing infrastructure. Eval harness built against your real production logs. Cost model with actual numbers. Failure-mode catalog sourced from the edge cases your system has already hit. The goal is not a beautiful system — it's a shippable one.

Production handoff — one week

Runbook for the team that will own this after I leave. Monitoring setup. Regression test suite they can run without me. Named owner, defined escalation path, scheduled eval review. You leave with a production system, not a dependency on an advisor.

What I don't do.

I don't write strategy decks. I don't run discovery workshops that produce a slide. I don't take equity in exchange for advice that isn't accountable to a shipped system. If the agentic loop isn't running in production at the end of the engagement, the engagement isn't done.

Let's talk.

I take on two to three engagements at a time. When there's capacity, I prioritize founders who are closest to the production wall — the ones where the next 60 days determine whether this company ships or pivots again.

Current availability: limited. Applications reviewed within 48 hours.

Let's talk →
How I work with founders →
