I built 14,000 lines of code before talking to a single user. Here's what I learned.

10 weeks ago I was convinced I was building something nobody else had.

The idea: a GitHub App that reads every changed Python function in a PR, infers what it should always do — "the total should never be negative", "the result should never be None" — then attacks it with adversarial inputs in a hardened Docker sandbox. Only posts a comment when it has concrete proof something broke. No guessing. No noise. Just:

```
LogoMesh found 1 issue
Negative quantity bypasses checkout validation

Property: Order total should always be ≥ 0
I called: checkout(item_id=1, qty=-5)
Got: Order created with total -$49.95
```
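
To make the mechanism concrete, here's a rough sketch of the kind of adversarial check the engine tries to generate, written here with Hypothesis. The `checkout()` function, the `shop.orders` module, and the strategy bounds are hypothetical examples, not LogoMesh's actual output.

```python
# Hypothetical sketch, not LogoMesh's generated code: a property-based check
# for the inferred invariant "order total should always be >= 0".
from hypothesis import given, strategies as st

from shop.orders import checkout  # hypothetical function under test


@given(
    item_id=st.integers(min_value=1, max_value=1000),
    qty=st.integers(min_value=-1000, max_value=1000),  # includes adversarial negative quantities
)
def test_order_total_is_never_negative(item_id, qty):
    order = checkout(item_id=item_id, qty=qty)
    # If checkout() accepts qty=-5 and produces a negative total,
    # this assertion fails and the finding gets reported.
    assert order.total >= 0
```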

207 unit tests passing. Docker sandbox with nobody user, --network=none, memory caps, randomized filenames. Time-Travel Trace that captures exact variable state at the crash frame. 14,000 lines of code.
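
For reference, a minimal sketch of what that kind of hardened sandbox invocation can look like. The image name, mount path, and memory limit below are assumptions for illustration, not the actual implementation.

```python
# Minimal sketch of a hardened sandbox run (assumed values, not the author's code):
# unprivileged user, no network, memory cap, randomized filename inside the container.
import secrets
import subprocess


def run_in_sandbox(image: str, test_path: str, timeout_s: int = 120) -> subprocess.CompletedProcess:
    target = f"/tmp/{secrets.token_hex(8)}.py"  # randomized filename for the generated test
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--user", "nobody",        # run as an unprivileged user
            "--network=none",          # no network access inside the sandbox
            "--memory", "512m",        # hard memory cap (value is an assumption)
            "-v", f"{test_path}:{target}:ro",
            image,
            "python", "-m", "pytest", target, "-q",
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```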

Zero customer conversations.

Then I looked at the real numbers. 90% silence rate on real PRs. Of the findings it did post, most were sandbox artifacts — the tool generating tests that hit its own scaffolding, not real bugs. A validator that drops 86% of raw findings just to keep the noise low. p95 latency of 394 seconds against a 60-second target.

Classic mistake. Built first, validated never.

So I stopped and started talking to people. What I kept hearing wasn't "I need adversarial testing on my PRs." It was something simpler:

"When a prod bug hits I spend 30-45 minutes just writing the repro test before I can even start fixing."

Read the Sentry trace. Reconstruct the state. Write a test. Run it. Wrong inputs. Fix the test. Run it again. By the time it reproduces, you've lost the whole debugging flow.

That's the actual pain. Not catching future bugs — dealing with the ones already in production.

So now I'm building something much simpler. Paste a Sentry URL, get a failing pytest that reproduces the exact crash, runs against your current branch, tells you "still reproduces" or "your branch fixed it." One command. Under a minute.
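
To illustrate the output I'm aiming for, here's a hedged sketch of the kind of repro test it would generate. The module, function, issue number, and reconstructed locals are made up for the example; the real values would come from the Sentry event's stack trace and frame locals.

```python
# Hypothetical example of a generated repro test. Everything below
# (billing.invoices, apply_discount, the reconstructed locals) is illustrative.


def test_repro_sentry_issue_12345():
    from billing.invoices import apply_discount  # hypothetical crashing function

    # State reconstructed from the crash frame's locals in the Sentry event.
    invoice = {"subtotal": 0, "currency": "USD"}
    coupon = {"percent_off": 100}

    # On a branch where the bug is still present, this call raises the same
    # ZeroDivisionError seen in production and the test fails ("still reproduces").
    # On a branch that fixes it, the call succeeds and the test passes.
    result = apply_discount(invoice, coupon)
    assert result["total"] >= 0
```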

The engine I spent four weeks on isn't wasted - the Docker sandbox, the frame locals injection, the binary verdict reliability - those are exactly what make this work. The positioning just changed completely.

The question I'm still trying to answer:

Is the 30-45 minute manual repro step a universal pain or just my workflow being slow? Do you write a reproducing test before fixing a prod bug, or do you just read the trace, push the fix, and monitor?

Genuinely trying to figure out if I'm solving a real problem this time before writing another 14,000 lines nobody asked for.

This article was originally published by DEV Community and written by Oleksandr Voievodin.