The kind of bug your AI PR reviewer will never see

A lot of people are pointing their AI coding agents at pull requests right now and asking them to find bugs in the diff. That is a pretty small return on token investment compared to opening up the live app and acting like a real user. The bugs your customers hit when they actually use the thing are not the kind of bugs that show up in a diff, and almost nobody is pointing their agents at that work yet. That's where I think the real 10X is hiding right now, and I want to walk through how I do it and how I packaged the whole workflow as an open source Skill file you can use today.

How I learned this

I spent 12 years at Apple working across many different parts of the system, and one thing I came away with from that whole tenure is that you find the bugs your users would hit first by following a repeatable workflow. You don't need to be a 10X engineer to do this. You just need the cadence.

I have been refining that cadence on my own projects ever since, and I have now encoded a version of it as a Skill file so anyone with an AI agent can run that same cadence on their own work.

Why real-user QA is where the bugs actually hide

The bugs your customers actually hit are a different shape from the bugs your linter or your AI PR reviewer catches. A linter is great at finding a missing null check or a typo or some naming that doesn't match the rest of the file. An AI agent reading your diff is going to be even better at that, and a lot of teams are getting real value from pointing an agent at code review. But the bugs that end up costing you in support tickets and lost customer trust look different. One example is a stale invite code sitting in sessionStorage from a flow somebody abandoned yesterday, which then silently joins them to the wrong group when they come back today and try to sign up normally. I actually shipped this exact bug on Mokuhoe, my outrigger paddler scheduling app, and it is part of what motivated me to build this Skill file in the first place. Other bugs in the same family include a mobile keyboard popping up and hiding the submit button so the user thinks the form is broken, or a race condition between auth and routing where the user lands on the right URL but with the wrong session state in their browser.

None of those bugs show up in a diff because the code technically works. They only show up when a real customer hits the app while the system is in a real-world state, and they are the bugs that almost always escape to production because nothing in your CI pipeline is actually behaving like a customer.
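To make the stale invite code example concrete, here is a minimal sketch of that class of bug and one possible guard. It is not the Mokuhoe code. It models the browser's sessionStorage with a plain dict, and the names save_invite, invite_for_signup, and INVITE_TTL_SECONDS, as well as the 15 minute expiry window, are assumptions made up for illustration.

```python
from typing import Optional

# Assumed expiry window for a saved invite; not taken from the original app.
INVITE_TTL_SECONDS = 15 * 60


def save_invite(storage: dict, code: str, now: float) -> None:
    """Remember an invite code along with the time it was saved."""
    storage["invite"] = {"code": code, "saved_at": now}


def invite_for_signup(storage: dict, now: float) -> Optional[str]:
    """Return the invite code to honor, or None for a normal signup."""
    invite = storage.get("invite")
    if invite is None:
        return None
    if now - invite["saved_at"] > INVITE_TTL_SECONDS:
        # Stale invite from an abandoned flow: drop it instead of
        # silently joining the user to the wrong group.
        del storage["invite"]
        return None
    return invite["code"]
```

Without the age check, the lookup still "works" in every unit test that saves and reads the code in one go, which is why a diff review has nothing to flag. The problem only appears when the storage already holds leftovers from an earlier visit.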

The cost-benefit on this is also pretty lopsided. A bug you catch in a QA pass costs almost nothing to fix, and a bug that ships to customers costs you ten to a hundred times more in support time, lost trust, and engineering hours spent chasing it down weeks later. So the work of acting like a real user is where the real value is sitting right now, and the reason this matters today is that AI agents can finally drive browsers. We have cursor-ide-browser and browser-use and Playwright MCPs and a bunch of other tools that let an agent open the live app and click around the way a real customer would. The work that used to require a dedicated QA team, plus somebody bridging Engineering and QA and Product, can now be run by one agent in 30 to 60 minutes per phase.

The Bug Review Board

The Skill file is called running-bug-review-board, and the Bug Review Board part, or BRB, is the cadence that turns the QA work from a pile of feedback nobody acts on into a system that produces shippable software.

Every QA pass wears three hats at the same time, and the agent runs all three in one pass. The Product Manager hat asks whether the build actually delivers the user-promise documented in the spec or the phase doc. The QA hat executes the test scenarios from a real user's perspective on the primary supported viewport, capturing evidence as it goes. The Engineer hat watches for invalidated assumptions, like a renamed component or a schema field that quietly appeared in the UI but didn't make it into the spec. What used to require multiple humans coordinating across a sprint can now be run by one agent in a single pass, and finding the gaps is the whole point.

The flow looks like this:


1. Discover  → read the spec, the phase doc, and the open bugs
2. Plan      → derive scenarios from the spec and the phase gates
3. Execute   → drive the browser like a real user, capture evidence
4. File bugs → P0 / P1 / P2 with full reproduction steps
5. Verdict   → YES or NO sign-off, plus a handoff prompt for the next agent

The priority system is the part that turns a list of bugs into a shipping decision. A P0 means the bug blocks the core flow or causes data loss or bypasses auth, and that halts the phase until somebody fixes it. A P1 means a feature is broken or wrong but a workaround exists, and that blocks sign-off on the current phase. A P2 means a cosmetic or accessibility issue or some dev console noise, and that gets deferred to a polish phase or a hardening release. The bug review process is actually the most important part of this whole workflow, because severity combined with the timeline to ship is what gives the team clarity on what to focus on and what to defer to the next phase of work.
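One way to picture the priority rules is as a small lookup table. This is only a sketch of the three levels as described here, not code from the Skill file. The names Priority, ACTIONS, blocks_sign_off, and triage are hypothetical.

```python
from enum import Enum
from typing import List


class Priority(Enum):
    P0 = "P0"  # blocks the core flow, causes data loss, or bypasses auth
    P1 = "P1"  # feature is broken or wrong, but a workaround exists
    P2 = "P2"  # cosmetic, accessibility, or dev console noise


# Hypothetical mapping from each priority to the action it triggers.
ACTIONS = {
    Priority.P0: "halt the phase until fixed",
    Priority.P1: "block sign-off on the current phase",
    Priority.P2: "defer to a polish phase or hardening release",
}


def blocks_sign_off(priority: Priority) -> bool:
    """P0 and P1 both prevent a YES verdict; P2 does not."""
    return priority in (Priority.P0, Priority.P1)


def triage(open_bugs: List[Priority]) -> str:
    """Return the action for the most severe open bug."""
    for level in (Priority.P0, Priority.P1, Priority.P2):
        if level in open_bugs:
            return ACTIONS[level]
    return "nothing open"
```

For example, triage([Priority.P2, Priority.P1]) returns the P1 action, because the most severe open bug is what drives the decision.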

The cadence runs once per phase or sprint or build. The Skill file scaffolds the folder structure for you on first run if your repo doesn't already have one, so all the bug reports flow into docs/qa/bug-reports/ with a clean naming pattern and a status workflow that tracks each bug from open to in-progress to fixed to verified. And while engineering is fixing the open P0 and P1 bugs from this pass, the QA agent can already be running the next pass on the next phase, so the pipeline does not block waiting for one stream to finish before the other can move.
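The status workflow is simple enough to sketch as a tiny state machine. The four statuses and the docs/qa/bug-reports/ folder come from the description above. Everything else is an assumption for illustration, in particular the BUG-007-P1-slug.md file naming pattern, which may not match what the Skill file actually generates.

```python
# Allowed forward transitions: open -> in-progress -> fixed -> verified.
NEXT_STATUS = {
    "open": "in-progress",
    "in-progress": "fixed",
    "fixed": "verified",
}


def advance(status: str) -> str:
    """Move a bug one step forward; raise if it cannot advance."""
    if status not in NEXT_STATUS:
        raise ValueError(f"cannot advance from status {status!r}")
    return NEXT_STATUS[status]


def bug_report_path(bug_id: int, priority: str, slug: str) -> str:
    """Build a report path using a hypothetical naming pattern."""
    return f"docs/qa/bug-reports/BUG-{bug_id:03d}-{priority}-{slug}.md"
```

Keeping the transitions in one table means a bug cannot jump from open straight to verified without somebody recording the fix in between.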

The output that stakeholders actually read is the verdict line at the top of the coordinator merge doc, and it reads either "Phase N ready? YES, all gates green, no open P0 or P1" or "Phase N ready? NO, here are the open blockers and here is a paste-ready prompt for the next QA agent." That verdict is what the team actually uses to decide whether the phase ships.
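The verdict rule can be sketched as one small function. The YES wording follows the verdict line quoted above. The function name, its parameters, and the exact wording of the NO line are assumptions, and the paste-ready handoff prompt is left out.

```python
from typing import List


def verdict(phase: int, open_priorities: List[str], gates_green: bool) -> str:
    """Render the verdict line from open bug priorities and gate status."""
    # Only P0 and P1 block sign-off; P2 bugs are deferred, not blocking.
    blockers = sorted(p for p in open_priorities if p in ("P0", "P1"))
    if gates_green and not blockers:
        return f"Phase {phase} ready? YES, all gates green, no open P0 or P1"
    reasons = []
    if blockers:
        reasons.append(f"{len(blockers)} open blocker(s): {', '.join(blockers)}")
    if not gates_green:
        reasons.append("gates not green")
    return f"Phase {phase} ready? NO, " + "; ".join(reasons)
```

A phase with only P2 bugs open and all gates green still gets a YES, which matches the idea that polish is deferred rather than blocking.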

Why this ends up being 10X

The first reason this is a 10X multiplier in practice is that you start shipping with evidence behind every check. The question of whether you are ready to ship stops being something you guess at on a Thursday afternoon and turns into a verdict line backed by a bug index and a gates checklist, and stakeholders can read that verdict and trust it because there is evidence behind every checked box.

The second reason is that engineering and QA can run in parallel. A lot of teams treat QA as a phase that comes after development, which means the pipeline blocks every time the QA team gets behind. The AI agent running the BRB cadence changes that, because engineering can be fixing the open P0 and P1 bugs from this pass while the QA agent is already running the next pass on the next phase. Both streams move at the same time, and that is where the actual productivity multiplier shows up. A lot of people may not realize they could be working this way today.

The third reason is that deferring becomes something the team agrees to out loud. P2 bugs go into a known queue with a known status, the team commits to deferring polish until the polish phase, and you ship the things that matter on the timeline you committed to. Nothing is hidden in someone's head and nothing surprises the team a quarter from now.

The real magic underneath all of this comes from understanding the intent of what the app should do and how it is going to serve the customer. By acting like an end user, you find the most painful bugs before your users see them. Most engineers miss those bugs because they are too focused on testing the code they wrote. Now an AI agent can run that cadence for you on any repo and any web app you point it at.

Try it

The Skill file is open source under the Apache 2.0 license, and it works with Claude Code, Factory Droid, OpenAI Codex CLI, Cursor, and claude.ai. The cleanest way to install it on Claude Code is two commands:


/plugin marketplace add RayFernando1337/rayfernando-skills
/plugin install running-bug-review-board@rayfernando-skills

Factory Droid and Codex CLI 0.126+ have similar marketplace install commands, and the README has the full per-vendor matrix at github.com/RayFernando1337/rayfernando-skills. If your tool doesn't have a native plugin command yet, you can install cross-vendor with npx skills add from anywhere. And for claude.ai, you can download the zip from the v0.1.0 release and upload it via Settings → Features → Skills.

Once it's installed, you can just say something like:

"QA this app. Run a manual test plan and tell me what's broken."

The Skill file will scaffold docs/qa/ if you don't already have one, walk through the discover-plan-execute-file-verdict flow, and hand you back a coordinator merge doc with the YES or NO verdict right at the top.

The repo is at github.com/RayFernando1337/rayfernando-skills, and I would love to hear from you if you run it on your own project and it catches something interesting.