How to evaluate what AI gives you (because most devs don't)
AI writes code that compiles, runs, and passes tests. That's exactly why it's dangerous.
🤖 Building with AI · For developers rethinking how they work
This week’s challenge: take one piece of AI-generated output you accepted this week and find the assumption it made that you didn’t check.
Last month, my Slack agent had 269 passing tests. Every function worked. Every edge case I’d written a test for behaved correctly. The architecture was clean: dependency injection, separation of concerns, proper error handling. By every standard measure, the code was solid.
Then I used it for a week.
On the third day, I almost missed a critical conversation. A team member posted a question in a channel. The agent evaluated it, decided it didn’t need my attention, and moved on. Four hours later, three people had replied in that thread, one of them explicitly asking for my input. The agent never saw it. It had already dismissed the parent message, and it had no mechanism to re-evaluate a thread that grew after the initial assessment.
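A stripped-down sketch makes the gap concrete. This is hypothetical code, not the agent's actual implementation: `relevance_score`, `triage`, and `recheck_dismissed` are illustrative names, and the scoring is a toy heuristic.

```python
# Sketch of the failure mode: triage() scores a message once and the
# verdict is final. Nothing ever looks at the thread again.

def relevance_score(message):
    # Toy heuristic: direct mentions matter, everything else scores low.
    return 1.0 if "@me" in message["text"] else 0.2

def triage(message, threshold=0.5):
    return {"ts": message["ts"], "notify": relevance_score(message) >= threshold}

def recheck_dismissed(verdicts, fetch_replies, threshold=0.5):
    # The missing mechanism: re-evaluate dismissed messages whose
    # threads grew after the initial assessment.
    for v in verdicts:
        if not v["notify"]:
            for reply in fetch_replies(v["ts"]):
                if relevance_score(reply) >= threshold:
                    v["notify"] = True
                    break
    return verdicts

# A dismissed parent whose thread later gains an explicit ask:
parent = {"ts": "1", "text": "anyone know why the deploy is slow?"}
verdicts = [triage(parent)]                          # notify == False
replies = {"1": [{"ts": "2", "text": "@me can you weigh in here?"}]}
verdicts = recheck_dismissed(verdicts, lambda ts: replies.get(ts, []))
print(verdicts[0]["notify"])                         # True after the re-check
```

Every function here passes its unit test in isolation. The bug lives in what never gets called: without something like `recheck_dismissed`, a thread that grows after the first assessment is invisible.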
The code was correct. The tests passed. The tool still failed at its actual job: telling me what mattered.
That gap, between technically correct and functionally reliable, is where most developers stop paying attention. And it’s the gap where AI-generated work is most dangerous, because AI is exceptionally good at producing output that looks right.
The confidence problem
When a junior developer writes bad code, you can usually tell. The naming is off. The structure feels uncertain. There are signs: hesitation marks in the logic, inconsistent patterns, the kind of roughness that signals someone is still learning.
AI doesn’t do this. AI-generated code arrives with the confidence of a senior developer and the context awareness of someone who just walked into the room. It uses the right design patterns, follows consistent naming conventions, handles error cases, and produces something that reads like it was written by someone who knew exactly what they were doing.
This is what makes evaluation hard. The surface quality is high enough to bypass the instinct that tells you to look more closely. You read it, it makes sense, the tests pass, and you move on. The failure modes are hiding underneath, in assumptions that were never stated because the AI doesn’t know they exist.
I’ve seen this play out the same way multiple times now. I asked AI to review a piece of my code, and it came back with an analysis I couldn’t fault technically. Bottlenecks identified with precision, race conditions flagged with clear explanations of why they were dangerous, each one accompanied by a solution I would have been impressed to see in a senior engineer’s review.
Then I asked a simple question: under what conditions would these bottlenecks actually occur? The answer, once I pressed, was almost never. The race condition required a concurrency pattern my application doesn’t use. The bottleneck would surface under loads I’ll never see.
If I had accepted the review at face value, my code would have become more complex, harder to maintain, and more difficult for the next developer (or AI) to read, all to prevent problems that had almost no chance of happening.
That’s the pattern my Slack agent exposed too. The failure wasn’t in any function. It was in a model of reality the AI had no reason to doubt.
Three layers of evaluation
The developers who use AI well have learned to evaluate at three levels, and most stop at the first one.
Does it work?
This is where almost everyone starts and stops. Run it, test it, check the output. If it compiles and produces the expected result, move on.
This layer catches syntax errors, logic bugs, and obvious failures. It's necessary but insufficient. My agent passed this layer completely. Every function did what it was supposed to do. The problem wasn't in any individual function. It was in the architecture's model of reality.
Does it fit?
This is the layer most developers skip. The question isn’t whether the code works in isolation but whether it works in the system it’s joining. Does it match the existing patterns? Does it make the same assumptions the rest of the codebase makes? Does it introduce a dependency that creates problems elsewhere?
AI-generated code frequently passes the first test and fails the second. It solves the problem you described while quietly ignoring the constraints you didn’t mention. You asked for a caching layer, and it built one with an in-memory store, which works perfectly until your application runs across multiple servers. You asked for input validation, and it added thorough checks that happen to duplicate validation already handled by the middleware. Nothing broke. Everything is subtly wrong.
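The cache example is worth seeing in code. This is a hypothetical sketch of the kind of thing AI produces, not any specific library: correct in isolation, wrong in a multi-server deployment, and failing without ever raising an error.

```python
# Illustration of the hidden assumption: an in-memory cache that works
# perfectly on one server, because the dict lives in this process only.
import time

class InMemoryCache:
    def __init__(self, ttl_seconds=60):
        self._store = {}           # process-local: invisible to other servers
        self._ttl = ttl_seconds

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self._ttl)

# Two "servers" are two cache instances. An update on server A never
# reaches server B — the code is correct, the assumption is wrong.
server_a, server_b = InMemoryCache(), InMemoryCache()
server_a.set("user:42", {"name": "old"})
server_b.set("user:42", {"name": "old"})
server_a.set("user:42", {"name": "new"})       # update lands on A only
print(server_a.get("user:42")["name"])         # new
print(server_b.get("user:42")["name"])         # old — stale, and no error raised
```

Every line of this would pass review on its own merits. The question that catches it is contextual, not technical: how many processes will hold a copy of `_store`?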
The fix isn’t to prompt better. The fix is to evaluate differently. Before accepting any substantial piece of AI-generated code, ask: what does this assume about the context it’s operating in? Then check whether those assumptions hold.
Does it survive real use?
This is the layer that only shows up over time, and it’s the one where the most consequential failures hide. Real use introduces conditions that no spec anticipated because the people writing the spec didn’t know those conditions existed.
My keyword-based search is a good example. When the agent gathered context for a trigger message, it searched other channels by extracting the longest words from the topic summary and running text matches. This worked for obvious connections. If someone mentioned “deploy pipeline” in two channels, the keyword search found both. But when someone in the engineering channel said “the deploy pipeline is stuck” and someone in the incidents channel reported “CI/CD timeout affecting production,” the search missed the connection entirely. Same issue, different words. The most valuable cross-references, the ones that actually saved me time, were exactly the ones it couldn’t find.
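A toy reconstruction shows why the miss was structural, not a tuning problem. The function names are illustrative, not the agent's real code, but the approach is the one described above: extract the longest words, run text matches.

```python
# Longest-word keyword matching: finds lexical overlap, blind to meaning.
import re

def longest_keywords(summary, n=3):
    words = re.findall(r"[a-z]+", summary.lower())
    return sorted(set(words), key=len, reverse=True)[:n]

def keyword_matches(summary, other_messages):
    keys = longest_keywords(summary)
    return [m for m in other_messages if any(k in m.lower() for k in keys)]

# Same incident, different vocabulary: zero overlap, zero matches.
incidents = ["CI/CD timeout affecting production"]
hits = keyword_matches("the deploy pipeline is stuck", incidents)
print(hits)    # [] — the most valuable cross-reference is exactly the one it misses
```

The failure is inherent to the technique: lexical matching can only find connections people happened to phrase the same way, and the connections worth surfacing are usually the ones they didn't.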
No amount of testing would have caught this. It required using the tool on real conversations over real days and paying attention to what it wasn’t surfacing.
The evaluation habit
Developers I mentor complain about this constantly. Reading AI output is exhausting. Not the act of reading itself, but the mental effort of understanding what was generated, interpreting the decisions behind it, and thinking critically about whether it actually makes sense. It demands real cognitive energy, and they didn’t sign up for that. They wanted AI to make their work easier, not to add a new layer of intellectual labor on top of it.
I get it. But here's the thing: we used to write the code ourselves, and it took days, sometimes weeks. Now AI writes it in minutes, and what's left for us is to read it and think critically about what we're reading. The intellectual part is still ours. So the choice is yours: delegate everything to AI and contribute nothing, or do the intellectual work, get recognized for it, and be more productive than you've ever been.
The first option sounds comfortable, but nobody will pay well for it. There's no career path in being the person who clicks "accept" without reading. Your best option is to learn how to think and how to use AI together. And that starts with how you evaluate.
So here’s what the practice actually looks like.
The first pass is immediate: does this output match what I asked for? Read it carefully, not to verify that it runs, but to understand what it’s actually doing. Look for the decisions the AI made that you didn’t ask for: the data structure it chose, the error handling strategy, the assumptions about inputs.
The second pass is contextual: does this belong here? You’re looking at a diff, in git, in a PR, in whatever tool you use to review what AI changed. The code difference is right there. But understanding whether that change makes sense requires context the diff doesn’t show: who is this feature for, what problem does it solve, what did the team decide three weeks ago that shaped how this part of the system works. The AI wrote code that addresses your prompt. Whether it addresses the actual need behind the prompt is a question only you can answer, and only if you understand the context well enough to ask it. This is the pass where experience matters most, and where junior developers need to be most deliberate.
The third pass is temporal: does this hold up in the real world? Most developers never do this one, because most developers stop thinking about their code the moment it’s merged. What happens in production is someone else’s problem. That mindset was already limiting before AI. Now it’s dangerous, because the volume of code you ship with AI is higher than ever, and each piece carries assumptions you may not have examined.
This pass requires a product mindset. You’re no longer asking whether the code works. You’re asking whether it solves the problem for the person using it, under the conditions they actually face. And that means doing something most developers never do: go back and check.
A week after you ship, look at how the feature is being used. Talk to the people using it. Read the support tickets. Check whether the assumptions you accepted during the code review actually held up in production. If you built a search feature, are people finding what they need? If you built a notification system, are the notifications reaching the right people at the right time, or are they creating noise?
Most developers do the first pass automatically. The second is where the real skill lives. The third is the one that turns a developer into someone who builds products, not just features.
What this means for your career
The developers who will be most valuable in the coming years are the ones who can evaluate AI output at all three layers. Code generation is becoming a commodity. The ability to determine whether that code actually solves the right problem, in the right context, under real conditions, is not. That's a judgment skill, and it compounds with every project you apply it to.
This connects directly to what we covered in Edition #4: the skills that don’t expire are the ones AI can’t perform for you. Evaluation is one of them. AI can generate the code. AI can even generate tests for the code. But AI cannot determine whether the code’s model of reality matches your reality. That requires understanding the domain, the users, the constraints, and the history. In other words, it requires everything you accumulate by paying attention to your work over time.
The irony is that AI makes this skill both more important and harder to practice. When the output arrives looking polished and professional, the temptation to accept it without deep evaluation is strong. Every time you resist that temptation and look harder, you’re training the judgment that makes you irreplaceable.
Try This
If you’ve been following the previous editions, you’ve built a set of artifacts: a delegation map (Edition #1), a judgment-versus-execution breakdown (Edition #2), a problem statement and learning target (Edition #3), and a decision analysis (Edition #4). Each one built on the last.
This week, the exercise is different. Instead of building a new artifact, go back to something you built recently: a piece of code, a tool, a workflow where AI contributed substantially to the output.
Run it through the three layers from this edition.
Layer 1: Does it work? You probably already checked this. Confirm it anyway.
Layer 2: Does it fit? Look at the assumptions the AI made about the context. What conventions does your codebase follow that the AI didn’t know about? What constraints exist in your system that the AI wasn’t told about? Write down at least two assumptions the AI made that you never specified.
Layer 3: Does it survive real use? If the code has been running for a while, think about the edge cases you’ve encountered. If it hasn’t, imagine the real-world conditions it will face. What happens when the data isn’t clean? When the user behaves differently than expected? When the load changes? Write down one scenario the AI couldn’t have anticipated.
By the end, you should have a short document with two hidden assumptions and one untested scenario. That document is the beginning of an evaluation practice. Run this on every significant piece of AI output, and over time you’ll develop the instinct to spot these gaps before they become problems.
The exercises from previous editions gave you a map of your work, a target for automation, and a framework for decisions. This week adds the quality layer: the skill that determines whether what you build actually holds up when it matters.
The Deeper Cut
There’s a pattern I notice in every developer I mentor who starts building with AI seriously. At first, they evaluate too little: they accept output at face value because it looks professional. Then they overcorrect and evaluate too much, spending more time reviewing AI output than it would have taken to write it themselves. Both extremes miss the point.
The calibration happens through volume. The more AI output you evaluate critically, the faster your instinct develops for where the failures tend to hide. After enough reps, you stop reading every line with equal suspicion and start knowing where to look, which is usually in the assumptions, not the syntax.
Paid subscribers get the evaluation checklist: a structured tool for running the three-layer evaluation on any AI output, with specific prompts for each layer and a format for capturing what you find. It’s the same process I run on my own tools. It turns the thinking from this edition into a repeatable practice.


