The output isn't the answer. It's the diagnosis.
Every time AI gets it wrong, it's showing you exactly where your thinking is incomplete. Most developers fix the prompt and miss the lesson.
🤖 Building with AI · For developers rethinking how they build
This week’s challenge: take a recent AI output that disappointed you and refine the thinking behind it instead of the prompt.
My Slack was a context-switching tax I couldn’t stop paying. Direct messages from people waiting on answers. Public channels I should have been following. Threads I’d dropped into and lost the plot of. Every time I opened the app to reply to one thing, I had to reconstruct what that thing was about, who was involved, what had already been said. By the time I had the context back, half an hour was gone and the reply was two sentences. The cost was real: decisions delayed, things slipping through, the slow weight of knowing I wasn’t keeping up.
So I spent one week building a tool to fix it. A context retrieval agent that would scan my channels and surface what mattered. The spec was clean: related messages, link summaries, timeline. No synthesis. Synthesis was a separate concern, deliberately kept out of scope.
The AI executed exactly to spec. I ran the first real test on my actual inbox. The output was completely useless.
Reading it took about as long as reading the originals would have. The structure was clean, the format was clear, and I sat there looking at days of work, feeling the exact same weight as before. Same backlog. Same triage waiting. The tool had solved the problem on paper. The actual problem hadn’t moved.
The reflex, in that moment, is to fix the prompt. Add structure. Constrain the response. Most developers I watch do exactly this:
Output misses.
Prompt gets tweaked.
Retry.
The loop runs until the output looks acceptable, and they move on. They’ve built nothing except a slightly better prompt.
I almost did the same thing.
Then I made myself stop and read what the output was actually telling me. The spec was built on a principle I’d never tested: that separating retrieval from synthesis protected clarity. The output had just demonstrated the opposite. Synthesis was the work that made retrieval useful, and I had specified it out of scope. I rewrote the spec to fold the two together, ran it again, and the output became something I could act on.
The AI hadn’t failed. It had executed perfectly against thinking that was incomplete in a way I couldn’t see until I read the result.
That’s what AI does. It compiles your thinking and shows you what you actually thought.
The diagnostic loop
Before the diagnostic loop makes sense, you need to recognize the anti-pattern it replaces. Call it the prompt chase: tweaking the request until the AI produces something acceptable, then moving on. Alter, execute, evaluate. Alter, execute, evaluate. The chase ends when the output looks good enough, not when the thinking behind the request is right.
Each lap of the chase costs almost nothing. That’s the trap. Two hundred small alterations later, the developer has something usable and has learned nothing about how they think. The diagnosis was available in every output and went unread.
The diagnostic loop replaces the chase with a different movement. The trigger is the same, an output that missed, but the next five steps run one layer up from the prompt:
Stop. Don’t touch the prompt. Don’t try again. The pause is where the loop happens.
Name the gap. Write one sentence describing what the output delivered. Write another describing what you needed. The difference between the two is the diagnosis.
Locate the layer. The gap lives in one of three places: the spec (you asked for the wrong thing), the category (you used a word that carried an untested assumption), or the context (you knew something you never articulated).
Refine that layer. Write down what was implicit and is now explicit. The tested assumption. The corrected category. The missing context. This is the work. It takes longer than rewriting a prompt and produces something a prompt never will.
Rebuild the brief. Now the prompt comes back to the table, but informed by refined thinking instead of by guessing what phrasing might work better.
The opening was the easy version of this loop, with the gap sitting on the surface of a spec I could just reread. Most gaps don’t sit on the surface. They hide inside the words you used or in the things you knew so well you never said them out loud.
Categories that hide in plain sight
The retrieval example was a specification gap. The next gap I hit was deeper, and the diagnostic was harder to read.
After the tool was working, I started inventorying its reusable parts, with one eye on something bigger. The Outcoders program needed building blocks that other developers could pick up and run, not just code I wrote for myself. Six components inside the tool were generic enough to live independently of the Slack-specific orchestration: a context retriever, a voice processor, a deduplication pattern, a few others. The intuitive move, the one any developer would reach for, was to extract them into shared libraries. Reusable code, single source of truth, standard pattern.
I asked AI to help me think through the extraction. The output came back: clean library structure, clear interfaces, sensible naming. Exactly what I’d asked for.
I almost moved on. The output was good. It matched the request. Then something started bothering me, an intuition that this was going to go wrong. The kind of low-grade discomfort you get when you’re about to ship something that’s correct on the surface and going to hurt you in six months. I made myself sit with it long enough to find out what.
The diagnostic question that surfaced was unexpectedly mundane: if I extracted these as libraries, what would the next person who wanted to use one actually do? The answer was uncomfortable. They’d import the library. They’d inherit its dependencies. Any change would force a coordinated release across every project that imported it. I’d be handing every Outcoder a tight coupling I was supposed to be helping them avoid. The reuse I was optimizing for would create exactly the kind of drag that kills momentum in a small program.
The output was correct against the wrong category.
What I had treated as utilities were capabilities. A library shares code at build time; a capability is something you call at runtime. A library couples its consumers together; a capability stays independent. The word “library” had been carrying assumptions I hadn’t examined, and the AI had executed against those assumptions perfectly.
Refining the prompt would have produced a better-organized library, and I would have shipped the wrong foundation to a program I was just starting. Refining the category produced an entirely different architecture: independent agents, composed at runtime, no shared imports. The implementation work that followed took a week. The decision itself took an afternoon, once the category was correct.
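To make the category shift concrete, here’s a minimal sketch of the two shapes it separates. This is not the code I actually wrote; the package name and the local endpoint are invented for illustration.

```python
import requests

# Library shape: build-time coupling. Each consumer imports the code,
# inherits its dependencies, and any change forces a coordinated
# release across every project that imports it.
# (Hypothetical package, shown only for contrast.)
#
#   from outcoders_toolkit.context import ContextRetriever
#   results = ContextRetriever(token=SLACK_TOKEN).fetch(channel="support")

# Capability shape: runtime composition. The retriever runs as its own
# agent behind a boundary; consumers call it, they never import it, so
# nothing couples them together. (Hypothetical local endpoint.)
results = requests.post(
    "http://localhost:8700/context/fetch",
    json={"channel": "support", "since": "7d"},
    timeout=30,
).json()
```

The second shape is what “independent agents, composed at runtime, no shared imports” looks like at the call site.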
This is the part of the diagnostic loop that’s easiest to miss. The gap isn’t always in what you specified. Sometimes it’s in the words you used to specify it. The output reveals the assumption inside the word, but only if you’re listening for it.
The context you didn’t articulate
The third gap is the one I see most often in other developers’ work, and it took me longest to recognize in my own.
Someone in Slack asked a question. I drafted a reply with AI. The response came back: technically accurate, well-structured, complete. I read it and almost sent it. Then I noticed it sounded condescending. Not by much. Just enough that the person reading it would feel slightly talked down to. The information was right. The tone was wrong in a way that would matter.
The first instinct, again, was to adjust the prompt. Tell the AI to be warmer, less formal, more collaborative. That instinct produces an output that’s better in a generic way and still wrong for the specific situation.
The diagnostic question that worked was different: what does the AI not know that’s making this miss? The answer became obvious as soon as I asked. The AI didn’t know the person was non-technical. It didn’t know we were in a public channel where other people would read the exchange. It didn’t know that the question, which sounded innocent, came from someone who lacked the vocabulary to ask for what they actually needed. None of that was in the brief, because none of that was articulated in my own thinking. I knew it implicitly. I hadn’t externalized it.
What needed refining was the model of what makes a reply work, not the prompt. Audience metadata. Channel context. The gap between the literal question and the underlying need. Once those were explicit pieces of the brief, the output stopped missing.
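For what it’s worth, the refined brief ended up looking roughly like this. The field names are mine, invented to show the shape, not an exact format.

```python
from dataclasses import dataclass

# A sketch of the context I'd been holding implicitly. Each field is now
# an explicit input to the brief instead of something I silently knew
# about the situation.
@dataclass
class ReplyBrief:
    literal_question: str  # the question as actually asked
    underlying_need: str   # what the asker is really trying to get done
    audience: str          # e.g. "non-technical, new to the codebase"
    channel: str           # public channel vs DM changes the register
    other_readers: bool    # will people beyond the asker read the exchange?
    reply_goal: str        # what a good reply does for this specific person
```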
The AI made visible something I’d been holding implicitly for years: the work of replying well lives in modeling the situation correctly, long before any words get written. I’d been doing the modeling unconsciously, which meant I couldn’t teach it, couldn’t delegate it, and couldn’t improve it. The diagnostic loop forced the model out into the open, where it could be refined.
What the loop actually builds
The previous editions were about making thinking visible: first to other people, then to yourself. This edition extends that practice to the third audience most developers interact with every day but never think of as an audience. AI consumes the same externalized thinking your colleagues do, and it consumes it more literally. When your thinking is implicit, a human colleague fills in the gaps from shared context. AI doesn’t. It executes what’s there. That’s what makes it a diagnostic.
The diagnostic loop is the simplest version of that practice. It’s available every time you work with AI, which for most developers now means several times a day. Each output is a chance to read the diagnosis. Most developers skip it. The ones who don’t skip it are building a different kind of capability with every interaction, while everyone else is running the prompt chase.
Every cycle through the diagnostic loop leaves a residue: a tested assumption, a refined category, a piece of context you’d been carrying implicitly that’s now explicit. The residue accumulates. Over months, that accumulation is what people call structured thinking. It isn’t a thing you have before you start. It’s the by-product of a loop you ran enough times.
A better prompt has a short shelf life, because the ground underneath it keeps moving: models get retrained, interfaces shift, and the exact phrasing that worked yesterday needs adjustment tomorrow. Refined thinking sits on more stable ground. A category you understood correctly last month will still be correct next month, an assumption you tested last quarter doesn’t need testing again, and a model of what makes a reply actually work isn’t going to expire when the next version of the AI ships.
The developers running the diagnostic loop are quietly building an asset that never shows up in any output. The ones running the prompt chase are stuck producing deliverables and nothing else.
Try This
Pick one AI output from the last week that disappointed you. Not a catastrophic failure. A mild miss. Something that was technically fine but didn’t actually solve your problem.
Run the diagnostic loop on it. The five steps are above. No shortcuts, no skipping to step five. The value lives in step one, the pause, and in step three, naming which layer the gap lives in. Most people rush through those two and end up refining the wrong thing.
When you reach step five and rebuild the brief, use the new brief with the AI and compare the outputs. The interesting part isn’t whether the second output is better. The interesting part is what you had to articulate to get there. That articulation is the residue. Save it. Over enough loops, those pieces of articulated thinking are the structured thinking everyone says you should have.
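If you want somewhere to put the residue that’s more durable than a scratch note, a tiny log entry per loop is enough. This is a sketch, not a prescribed format; the names are mine.

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    SPEC = "spec"          # you asked for the wrong thing
    CATEGORY = "category"  # a word carried an untested assumption
    CONTEXT = "context"    # you knew something you never articulated

@dataclass
class LoopEntry:
    delivered: str   # one sentence: what the output gave you
    needed: str      # one sentence: what you actually needed
    layer: Layer     # where the gap lived
    refinement: str  # the assumption, category, or context now made explicit

# Example, using the retrieval tool from the top of this edition:
entry = LoopEntry(
    delivered="A clean index of related messages, link summaries, and a timeline.",
    needed="Something I could act on without re-reading the originals.",
    layer=Layer.SPEC,
    refinement="Synthesis is what makes retrieval useful; it belongs in scope.",
)
```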
The point of the exercise isn’t to fix one output. It’s to run the loop once with full attention, so the shape of it becomes yours. After that, you’ll start catching yourself running the prompt chase and have a choice you didn’t have before.
The Deeper Cut
The hardest part of the diagnostic loop is the pause itself. The chase runs at the speed of reflex, and the loop runs at the speed of attention. Bridging that gap reliably, every time, has been the real practice. So I’m building something to help.
Most AI tools aimed at developers do the opposite of what this edition argues for. They optimize the prompt. They suggest better phrasing, restructure the request, add missing constraints. They make the chase faster. The agent I’m building does the inverse: when an output disappoints, it walks me through the diagnostic loop instead of helping me write a better prompt. It asks where the gap lives. It pushes back when I try to skip step one. It makes the pause structural instead of optional.
It’s early. The first version is rough, useful enough for me, not yet ready for anyone else. But it’s the next building block in the program, and like every other artifact, paid subscribers will get access as it matures. The next few editions will follow the build: what the agent does, what I had to refine in my own thinking to make it work, where it surprised me, what it still gets wrong. The same diagnostic loop this edition describes, applied to building the thing that helps run it.


