The Reality of Using AI in Production (Not the Demo)

The Demo Is Not the Job

Almost everything you read about AI in software development is set in a perfect world.

A clean repository. A small, well-defined problem. A dataset that behaves. A prompt, a few seconds of generation, and then a screenshot of something that works on the first try. The demo ends, the audience applauds, and everyone walks away believing the hard part is solved.

I build and ship real products for a living, and I can tell you the demo is not the job.

The job starts later. It starts when that generated code is sitting in a product with thousands of users, touching real data, under real load, at two in the morning when something breaks and a customer is waiting. That is where the difference between "code that was produced" and "engineering that was done" becomes painfully clear.

This article is about that gap. Not to discourage anyone from using AI, because I use it every single day, but to describe what it is actually like once you cross from the demo into production.

AI Makes Feature Development Incredibly Fast

Let me start with the good news, because it is genuinely good.

AI has changed the speed at which my team ships features. What used to take days of boilerplate, wiring, and scaffolding now takes hours. A new endpoint, a form, a data transformation, a first draft of a component: the AI gets us 80 percent of the way there almost instantly.

For a small team, this is enormous leverage. We can explore three approaches in the time it used to take to try one. We can prototype a feature in the morning and have a working version by the afternoon. The velocity is real, and anyone telling you otherwise has not seriously used these tools.

But velocity is not the same as reliability. And production does not reward velocity. Production rewards reliability.

Code That Looks Right but Is Not

Here is the first hard lesson.

AI is exceptionally good at producing code that looks correct. The structure is clean. The variable names are sensible. The logic reads well. In a code review, it passes the eye test almost every time.

The problem is that "looks correct" and "is correct" are different things, and the gap between them is exactly where production incidents live.

Dario Amodei makes a useful distinction: model capability may be scaling exponentially, but economic diffusion and production adoption are not instant. That gap is exactly where engineering discipline, evaluation, reliability, and product judgment matter.

The code the AI writes is usually correct for the happy path: the normal input, the expected user, the data shaped the way the example assumed. What it quietly skips are the edge cases. The empty array. The user in a timezone you did not think about. The null that should never happen but does. The currency with no decimal places. The string that is technically valid but breaks an assumption three functions deep.

These are not exotic bugs. They are the ordinary reality of real data and real users. And they are precisely the cases that a clean demo never exercises, which means they are precisely the cases the AI was never pushed to handle.
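To make the gap concrete, here is a minimal sketch of the "currency with no decimal places" case. The function names and the small lookup table are hypothetical, chosen only for illustration; the three decimal-place values match ISO 4217, but a real system would load the full table from a maintained source. The first function is the kind of happy-path code that reads fine in review, the second handles the cases listed above.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Hypothetical lookup of decimal places per currency (illustrative subset).
MINOR_UNIT_DIGITS = {"USD": 2, "JPY": 0, "KWD": 3}


def to_minor_units_naive(amount: str, currency: str) -> int:
    """Happy-path version: silently assumes every currency has two decimals."""
    return int(Decimal(amount) * 100)


def to_minor_units(amount: Optional[str], currency: str) -> int:
    """Edge-aware version: rejects missing input and unknown currencies."""
    if amount is None or amount.strip() == "":
        raise ValueError("amount is required")
    if currency not in MINOR_UNIT_DIGITS:
        raise ValueError(f"unknown currency: {currency!r}")
    digits = MINOR_UNIT_DIGITS[currency]
    scaled = Decimal(amount) * (Decimal(10) ** digits)
    # Round explicitly instead of truncating silently.
    return int(scaled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(to_minor_units_naive("500", "JPY"))  # 50000: looks plausible, is wrong
print(to_minor_units("500", "JPY"))        # 500
```

Both versions agree on "19.99" in USD, which is why the naive one survives a demo. They only diverge on the inputs nobody typed into the prompt.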

If you want to understand why this happens at a deeper level, it helps to know how these models actually work. An LLM predicts the most likely next token based on patterns it has seen; it does not reason about your specific system and its invariants. Aziz wrote about this in detail in "Vibe Coding Without Going Blind: Why You Need to Understand the Math Behind LLMs."

The short version is this: the model produces the most statistically plausible code, and the most plausible code is the code that handles the common case. Your edge cases are, by definition, not common. So they get left out, confidently and silently.

The Risks That Scale Brings

The second category of problems does not show up in development at all. It only appears at scale.

A generated database query can be perfectly correct and still bring your system to its knees when it runs against a real table with millions of rows instead of the twenty rows in your test database. The AI has no idea how big your data is. It does not know that this innocent-looking query will trigger a full table scan, or that this loop will make a network call per item, or that this join will explode in cost as the data grows.
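One cheap way to check a generated query before it meets real data is to ask the database how it plans to run it. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's standard `sqlite3` module. The `orders` table, the index name, and the `query_plan` helper are assumptions made up for this example, and other databases expose their own variant of `EXPLAIN` with different output.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan for a query as one string (detail column only)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

sql = "SELECT id, total FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the plan is a scan of the whole table.
plan_before = query_plan(conn, sql, (42,))

# With an index, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The query text is identical in both runs and returns the same rows. Only the plan changes, which is exactly the kind of difference that twenty test rows will never reveal. The exact wording of the plan varies between SQLite versions, so treat the printed text as a hint to read, not a format to parse.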

The same goes for dependencies. AI will happily reach for a library to solve a problem. Sometimes that library is heavy, unmaintained, or pulls in a tree of transitive dependencies you now have to own. In a demo, none of this matters. In production, every dependency is a liability you carry: a security surface, a performance cost, a thing that can break during an upgrade.

These risks share a common trait. They are invisible in the small and dangerous in the large. The code passes every test you wrote, runs fine on your machine, sails through the demo, and then behaves completely differently the moment it meets production scale. That is not a failure of the AI. It is a failure of expecting a tool that has never seen your traffic to reason about your traffic.

Debugging in Production: Where AI Genuinely Helps

Now let me talk about the part of production work where AI has become a real asset for me: debugging.

When something goes wrong in production, the first challenge is almost always volume. You are staring at thousands or millions of log lines, trying to find the one pattern that matters. This is where AI shines. It can read through enormous amounts of log data, cluster similar errors, surface anomalies, and point you toward the moment things started to go wrong, far faster than you could by scrolling manually.

It is genuinely good at the first phase of debugging: narrowing the search space. Feed it a stack of logs and a description of the symptom, and it will often hand you three plausible hypotheses in seconds. That is a real acceleration of the investigation.
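The "cluster similar errors" step does not need a model to be useful, and a rough version of it is a good way to shrink the logs before handing them to one. This is a minimal sketch, assuming plain-text log lines: it masks the variable parts (UUIDs, numbers, durations) so that lines with the same shape collapse into one signature. The masking patterns and the sample lines are invented for illustration and would need tuning for a real log format.

```python
import re
from collections import Counter

# Order matters: mask the most specific pattern first.
_MASKS = [
    (
        re.compile(
            r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
            re.IGNORECASE,
        ),
        "<uuid>",
    ),
    (re.compile(r"\b\d+(?:\.\d+)?(?:ms|s)?\b"), "<num>"),
]


def signature(line: str) -> str:
    """Replace variable fragments so lines with the same shape compare equal."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def top_signatures(lines, n=3):
    """Return the n most frequent line shapes with their counts."""
    return Counter(signature(line) for line in lines).most_common(n)


# Made-up sample lines, standing in for a real log export.
logs = [
    "timeout after 5012ms calling payments for user 812",
    "timeout after 4977ms calling payments for user 33",
    "cache miss for key 9f1c2b3a-1111-4222-8333-abcdefabcdef",
    "timeout after 5120ms calling payments for user 4410",
]

for shape, count in top_signatures(logs):
    print(count, shape)
```

Counting shapes like this turns millions of lines into a short ranked list, which is a far better input for the hypothesis step than the raw stream.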

I have written about both sides of this experience, the cases where AI debugging works brilliantly and the cases where it falls apart, in [AI in Debugging: Best and Worst Case Scenarios](/blog/ai-debugging-best-worst-scenarios). If you spend a lot of time fixing production issues, that piece pairs directly with what I am describing here.

Where AI Stops and Engineering Begins

But here is the line, and it is an important one.

AI is excellent at analyzing the logs in front of it. It is not good at understanding the system those logs came from.

When a production problem depends on the real context of your system (the actual behavior of your users, the history of architectural decisions, the subtle interaction between three services that were each built by different people at different times), the AI is working blind. It can describe what the logs say. It cannot tell you that this error only happens for users who signed up before a certain migration, or that the root cause is a race condition that only appears under a specific load pattern you happen to know about because you have lived with this system for two years.

That final diagnosis, the part where you connect a symptom to a cause that is not written anywhere in the logs, still requires engineering experience. It requires someone who holds the mental model of the whole system in their head and can reason about it. The AI can get you to the doorway of the answer remarkably fast. Walking through it is still on you.

This is not a temporary limitation that better prompting will fix. It is structural. The most important context about your production system has never been written down. It lives in the heads of the people who built it.

Accelerator, Not Replacement

So where does this leave us?

I think the honest framing is this: AI is a powerful accelerator for technical teams, and production is where you discover the limits of that acceleration.

In development, AI multiplies your speed. In production, it multiplies your speed only up to the boundary of what it can understand, and then engineering judgment takes over. The teams that get this right are not the ones that use AI the most or the least. They are the ones that know exactly where the handoff happens.

Here is how I think about that handoff in practice:

Use AI to generate, then engineer the edges. Let it write the happy path fast, then spend your human attention on the edge cases, the failure modes, and the inputs it never considered.
Treat generated queries and dependencies as suspects, not solutions. Ask what happens at 1000x the data. Ask what this dependency costs you to own.
Let AI triage logs, but you diagnose the system. Use it to narrow the search and form hypotheses. Reserve the final root-cause call for the person who understands the architecture.
Review for correctness, not appearance. "It looks right" is the most dangerous sentence in an AI-assisted codebase. Looking right is the model's specialty. Being right is your responsibility.
Keep a human who holds the whole model. Someone on the team has to understand the system end to end, because that understanding is exactly what the AI does not have.
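The "1000x the data" question from the list above can be asked with a counter instead of a load test. Here is a minimal sketch: `FakeUserService` is a hypothetical stand-in for any remote API or database client, and the only thing it measures is how many round trips each approach makes as the input grows. The batch size of 500 is an arbitrary assumption.

```python
class FakeUserService:
    """Hypothetical stand-in for a remote API; it only counts round trips."""

    def __init__(self, users):
        self._users = users
        self.round_trips = 0

    def get_user(self, user_id):
        self.round_trips += 1
        return self._users.get(user_id)

    def get_users(self, user_ids):
        self.round_trips += 1
        return {uid: self._users[uid] for uid in user_ids if uid in self._users}


def names_per_item(service, user_ids):
    # One call per id: round trips grow linearly with the input.
    names = []
    for uid in user_ids:
        user = service.get_user(uid)
        if user is not None:
            names.append(user["name"])
    return names


def names_batched(service, user_ids, batch_size=500):
    # One call per batch: round trips grow with len(user_ids) / batch_size.
    names = []
    for start in range(0, len(user_ids), batch_size):
        chunk = user_ids[start:start + batch_size]
        found = service.get_users(chunk)
        names.extend(found[uid]["name"] for uid in chunk if uid in found)
    return names


def count_round_trips(fetch, n):
    """Run a fetch function against n fake users and report the call count."""
    service = FakeUserService({i: {"name": f"user-{i}"} for i in range(n)})
    fetch(service, list(range(n)))
    return service.round_trips


for n in (20, 20_000):
    print(n, count_round_trips(names_per_item, n), count_round_trips(names_batched, n))
```

At twenty items the two versions are indistinguishable in practice. At twenty thousand, one of them makes twenty thousand calls and the other makes forty, and nothing in the generated code itself tells you which one you were handed.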

Daniela Amodei’s point on trust and reliability captures the core reality of production AI: the value is not in a model that looks impressive in a demo, but in a system that remains safe, predictable, and dependable when real users and real business workflows rely on it.

The Difference Production Reveals

The clean demo will always look more impressive than the messy reality. That is the nature of demos. They are designed to show what is possible under ideal conditions, and AI under ideal conditions is genuinely remarkable.

But software is not built under ideal conditions. It is built under real load, with real data, for real people, and maintained over real time. Production is the environment that does not care how good your demo looked.

That is also why I am not worried about engineering becoming obsolete. AI has made me dramatically faster at producing code. It has not made me less necessary for shipping reliable systems. If anything, it has sharpened the distinction. Generating code is now cheap and fast and largely solved. Engineering, the discipline of making something work reliably in the unforgiving conditions of production, is exactly as valuable as it ever was.

The demo is where AI shows you what it can do. Production is where you find out what you still have to do. Both are real. Only one of them is the job.
