Author: Baran Cezayirli, Technologist

With 20+ years in tech, product innovation, and system design, I scale startups and build robust software, always pushing the boundaries of possibility.
There is a story being told about vibe coding right now that I find partially true and partially dangerous. The true part: you can build working software faster than ever before by letting an AI carry the implementation while you steer. The dangerous part: the implication that the steering requires less than it used to. My experience was the opposite. The AI matched my depth exactly. Where I brought sharp judgment, the output was sharp. Where I was loose, the output had a bug.
This is not a complaint about the tools. It is an observation about how they work, drawn from a small, deliberate experiment I ran — building DSForms, a self-hosted form endpoint, with Claude as a planning partner and Claude Code as the implementation engine. Small project, intentional scope. The right size to actually see what was happening rather than be overwhelmed by it.
One bug shipped. I want to start there, because that bug is the most honest thing in this post.
What the Experiment Was
I was using web3forms for form submissions on a static site. It worked fine except for one thing: the submission table truncated content. You could see that someone submitted something, but not what they actually wrote. For a SaaS dependency I was paying nothing for, complaining felt wrong — so I decided to build my own version. A single Go binary. SQLite. Docker Compose deployment. Self-hosted, email notifications, an inbox-style UI where I could actually read submissions.
The constraint I set for myself was to not open a code editor until the entire project was planned. No architecture-as-you-go. No "I'll figure out the schema when I get there." Before touching Go, I wanted to sit with Claude and think through the whole thing — the architecture, the database schema, the security model, the UI, the deployment — and produce documents that could serve as Claude Code's memory and specification across every development session.
This is not how I normally build things. I work with much larger systems day to day, where planning is distributed across teams and time. This was a deliberate sandbox: a small project where I could observe the dynamic clearly, without the noise.
Where Engineering Judgment Showed Up
The planning conversation with Claude lasted long enough to produce five documents. What surprised me was how much of the real work in that conversation was not prompting — it was pushing back.
The backup feature is the clearest example. The initial proposal was substantial: S3 uploads, SigV4 signing implemented from scratch, a cron parser, a scheduler, local file pruning logic. Technically coherent. Also completely wrong for what I was building. I said, plainly, that I was going to deploy this on Dokploy and I didn't understand the local backup idea. Two sentences. That collapsed the entire spec into two buttons — one that runs VACUUM INTO and streams a file download, one that validates and atomically swaps the database. The right answer was obvious once the deployment context was clear. But Claude did not have the deployment context. I did.
The security section followed a similar pattern. I raised the concern: this is open source, which means anyone can read the code and look for attack surface. Claude structured the response — token bucket rate limiting, login brute-force lockout, request size limits, security headers, all without external libraries. The intuition that something needed to be there came from me. The structure of what it became came from Claude. Neither half works without the other.
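A token bucket is the kind of thing that genuinely needs no external library. As a minimal sketch (not DSForms' actual implementation; the names and numbers are illustrative): each bucket holds up to a cap of tokens, refills at a fixed rate, and a request is allowed only if a token is available to spend.

```go
package main

import (
	"sync"
	"time"
)

// tokenBucket is a minimal rate limiter: tokens refill continuously at a
// fixed rate up to a capacity, and each allowed request spends one token.
type tokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	refill   float64 // tokens added per second
	last     time.Time
}

func newTokenBucket(capacity, refillPerSec float64) *tokenBucket {
	return &tokenBucket{
		tokens:   capacity,
		capacity: capacity,
		refill:   refillPerSec,
		last:     time.Now(),
	}
}

// allow reports whether a request may proceed, refilling the bucket based
// on the time elapsed since the last call.
func (b *tokenBucket) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.refill
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}
```

In a real server you would keep one bucket per client IP (and a stricter one per login endpoint for brute-force lockout), but the mechanism is the same few dozen lines.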
Even the UI decisions followed this pattern. Claude presented four visual directions as mockups. I picked one, rejected another by asking "why gold?", and the final design came from that back-and-forth. Taste is a form of engineering judgment. It shapes what gets built.
The consistent pattern across the entire planning conversation: Claude brought structure, options, and implication thinking. I brought constraints, reality checks, and decisions. When I was precise about what I knew and what I needed, the output was precise. When I was vague, Claude was thorough but not necessarily right.
The Approval Gates Were Engineering Decisions
Claude Code followed a session-based workflow where every session had two gates: approve the plan before any code is written, approve the merge before it lands on main. Thirteen sessions, two gates each.
In practice, these gates were the moments where engineering judgment had to show up or the process would drift. Approving a session plan is not rubber-stamping — it means reading the list of files to be created, the test names and what they assert, the order of operations, and deciding whether that matches what you actually want built. Approving a merge means reading what was produced and comparing it against what was planned.
Neither gate is heavy. Both are real. The gates kept me inside the process rather than just watching it from outside. That distinction mattered more than I expected. Merging AI-generated code without those review points is exactly where the brownfield problems start: the code looks right, passes tests, and embeds assumptions you never examined.
The Bug
When everything was merged and deployed, I found that you could see the submissions table but could not open any individual submission to read the full content. The link was wired wrong. Claude Code had produced structurally correct code with an incorrect connection between the table row and the detail view — confident, clean, and broken.
This is the honest data point. I reviewed the merge. I approved it. The bug was there and I missed it. And what I missed was a UI wiring detail: exactly the kind of thing that looks correct when you read the template code and only reveals itself when you actually click the link in a running application.
The hard truth about AI and production code is that it can be confidently wrong in ways that are structurally indistinguishable from correct. This was a small example of that at the UI layer. In a larger, higher-stakes system, the same dynamic applies at every layer. The review has to be proportionate to the risk, and my review at that gate was not.
The bug got fixed. It was a small project. But I would be telling an incomplete story if I left it out, because it is precisely where the thesis lives: the output matched my level of attention. Where I was careful, things were right. Where I was loose, a bug slipped through.
What the Experiment Taught Me
The experiment worked well at the scope I set for it. One binary, thirteen sessions, one bug. The email reader, the feature that solved the original frustration, works exactly as designed. Submissions arrive, notifications fire, the inbox view shows everything. The thing I actually wanted to own, I own.
But the more durable lesson is about the nature of the tool rather than the output of this particular project. Vibe coding is not a skill bypass. It is a leverage multiplier. If your engineering judgment is good, it gets amplified. If it is absent or lazy at a particular moment, the amplification works in the other direction — you get more output, faster, with the flawed assumption baked in throughout.
The planning conversation required me to know what I was building well enough to push back when the proposal was wrong. The approval gates required me to read carefully enough to catch problems before they merged. The review that missed the bug was a reminder that the process only holds where the human holds it.
This was a small experiment, not a production system. I am not drawing conclusions about what this means for teams, or for larger codebases, or for developers with less experience. I am reporting what I observed in a controlled, low-stakes setting. The observation is this: the AI went exactly as deep as I did. No deeper, and no shallower. That is not a limitation of the tool. That is what the tool is.
If that is true in general — and I suspect it is — then the bottleneck in AI-assisted development is not the model. It is the engineering judgment you bring to the conversation.