AI engineering metrics: stop counting pull requests
AI engineering metrics should show whether changes stick, break, or reach users with bugs. Track accepted change rate, rollback rate, and escaped defects.

Why pull request counts stop helping
A coding assistant can turn one feature into six tidy patches before lunch. One pull request adds tests, another renames functions, another fixes lint, another updates prompts, and two more handle edge cases. The count climbs fast, but users still get one small change.
That makes pull request metrics a weak stand-in for progress. In teams using AI tools, the tool often shapes the work. It's easy to produce lots of clean, reviewable diffs without shipping much that matters.
The problem gets worse when managers reward volume. People start slicing work into smaller PRs because the scoreboard favors activity. A week with 18 merges can look better than a week with 4, even if the second week shipped the stronger release and caused fewer incidents.
Small PRs are not the problem. They often make review easier. The problem starts when the count becomes the target.
A few distortions show up fast:
- One task turns into several PRs because the AI tool suggests separate cleanup passes.
- People merge low-impact edits to keep numbers moving.
- Review time goes to repetitive diffs instead of risky changes.
- Release quality drops because the team chases merges instead of stable releases.
Reviewers usually feel this first. Every extra PR needs a title, context, comments, checks, and a decision. After the fifth similar diff, people skim. That's when a real issue slips through: a weak test, a hidden dependency, or a migration that looked harmless.
The pattern is easy to miss on a dashboard. A developer uses AI to update an onboarding flow, and the tool produces eight PRs in two days. Six are cleanup and prompt tweaks. One fixes a test. One changes the actual flow. The chart says output jumped. Customers notice almost nothing until the final PR ships.
That's why AI engineering metrics should focus on accepted change rate, rollback rate, and escaped defects. Those numbers tie work to shipped results and stability, not just motion in the repo.
What to measure instead
Measure work that survives contact with production. In teams using AI tools, code can appear quickly and in large amounts. The better question is simple: did the change stay, did the release hold up, and did users run into problems?
Accepted change rate answers the first part. It shows how much shipped work still remains in use after a set period, such as 7, 14, or 30 days. If a team ships lots of AI-assisted changes but many get removed, replaced, or quietly disabled soon after, the output looked strong while the progress was weak.
Rollback rate answers the second part. It tracks how often a release causes enough trouble that the team reverts it, turns it off, or rushes out a fix. A busy release calendar means little if every third release needs emergency cleanup.
Escaped defects answer the third part. These are bugs users, support, or production monitoring catch after release. They show what slipped through review, tests, and internal checks. Users never care how many pull requests you closed.
You need all three numbers together. Accepted change rate shows whether shipped work sticks. Rollback rate shows whether releases create trouble for the team. Escaped defects show whether that trouble reaches users.
Any single number can mislead you. A team may have a high accepted change rate because it avoids risk and ships tiny edits. Another team may have a low rollback rate because it barely releases anything. Escaped defects add the pressure test: what happened once real people touched the product?
Picture a team that ships 80 changes to production in a month. After two weeks, 62 still remain live. Two releases get rolled back. Five user-facing bugs appear after launch. Now you have something you can act on. You can inspect why 18 changes did not stick, what caused the rollbacks, and where tests missed real behavior.
That tells you far more than a report that says the team merged 200 pull requests.
How to define accepted change rate
Accepted change rate tells you how much shipped work actually sticks. In teams using AI tools, that matters more than raw pull request metrics. A team can merge a lot of code and still create churn if half of it gets rolled back, patched, or replaced a day later.
Keep the definition plain. Count a change as accepted only if it reaches production and stays there for a fixed period without being reverted or quickly overwritten. If you skip the waiting period, the metric turns into another merge count with better branding.
Pick the time window before you publish the number, then leave it alone. For many product teams, 7 to 14 days is a reasonable start. Seven days works when you deploy often. Fourteen days is safer when releases move more slowly or defects show up later.
A workable rule is simple:
- Start with merged changes that reached production.
- Wait for the agreed window to pass.
- Remove anything the team reverted, hotfixed away, or replaced because it failed.
- Divide what remains by the total production changes in that period.
If 100 changes reached production in a month and 82 were still live after 14 days, the accepted change rate is 82%.
Be strict about what counts as "quickly replaced." Use one rule, not weekly judgment calls. If another change removes or rewrites the shipped behavior within the same window because the first version did not hold up, treat the original as not accepted. Cosmetic follow-up work is different. Write the rule down once and keep it boring.
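The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Change` record, its field names, and the sample dates are all hypothetical, and it assumes you already know when each change reached production and when (if ever) it was reverted, hotfixed away, or replaced because it failed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Change:
    change_id: str
    deployed_on: date                 # day the change reached production
    undone_on: Optional[date] = None  # day it was reverted, hotfixed away, or replaced


def accepted_change_rate(changes, window_days, as_of):
    """Share of production changes that stayed live for the full window.

    Changes whose window has not elapsed by `as_of` are skipped, so the
    number cannot quietly turn back into a merge count.
    """
    window = timedelta(days=window_days)
    matured = [c for c in changes if c.deployed_on + window <= as_of]
    if not matured:
        return None  # nothing is old enough to judge yet
    accepted = [
        c for c in matured
        if c.undone_on is None or c.undone_on - c.deployed_on > window
    ]
    return len(accepted) / len(matured)


# Hypothetical data: 100 production changes, 18 undone three days after deploy.
start = date(2024, 3, 1)
changes = [Change(f"c{i}", start) for i in range(82)]
changes += [Change(f"r{i}", start, undone_on=start + timedelta(days=3)) for i in range(18)]

rate = accepted_change_rate(changes, window_days=14, as_of=date(2024, 3, 31))
print(f"{rate:.0%}")  # 82%
```

Cosmetic follow-up work should not set `undone_on`. Whoever logs the data applies the written rule; the code only does the counting.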
Use the same definition across every repo and team. That includes app code, backend services, internal tools, and automation around AI workflows. If one team uses a 7-day window and another uses 30 days, comparison stops being useful. Consistency matters more than a perfect formula.
This metric becomes useful when leaders treat it as a quality signal, not a score to chase. Teams should feel safe shipping less if the work that lands actually stays put.
How to track rollback rate
Start with a rule the team can follow every time. A rollback is any shipped change that you had to remove, undo, or switch off because it caused harm in production. If people argue about edge cases every week, the number will drift and the metric will turn into noise.
Most teams should count these as rollback events:
- a code revert after release
- a hotfix that removes or disables the new behavior
- a feature flag shutoff for a bad release
- a config change that restores the old behavior
- a partial undo of one part of a larger deploy
Use one denominator and keep it stable. The cleanest version is rollback events divided by changes that reached production in the same period. Do not divide by accepted changes: a rolled-back change is not accepted by definition, and the acceptance window may not have closed yet. If you review it weekly, use the same weekly window every time. If one deploy contains twenty AI-assisted edits, do not count that as twenty rollbacks unless you actually reversed twenty shipped changes.
It also helps to split full rollbacks from partial ones. A full rollback means the team backed out the whole change. A partial rollback means some of it stayed live and the broken part got turned off. That distinction matters. Partial rollbacks often point to weak release boundaries, mixed batches, or feature flags that arrived too late.
For each rollback, log one short reason. Keep the list small so people use it the same way: wrong business logic, test gap, bad prompt or generated code, integration issue, or performance and reliability problems.
That reason log matters more than it looks. A rising rollback rate gets your attention, but the cause tells you what to fix. If most rollbacks come from generated code touching old payment logic, you need tighter review there, not a lecture about developer speed.
Watch for spikes after rushed batches of AI-assisted changes. A small team using AI tools can ship a week's work in a day, then spend the next two days turning off flags and restoring old behavior. Count those events, mark the cause, and review them together. That's when rollback rate becomes a real signal instead of a vanity number.
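A rough sketch of that bookkeeping, assuming the denominator is the number of changes that reached production in the period. The `Rollback` record, the release IDs, and the exact reason labels are illustrative names, not a standard.

```python
from collections import Counter
from dataclasses import dataclass

# Short, fixed reason list so everyone tags rollbacks the same way.
REASONS = {
    "wrong business logic",
    "test gap",
    "bad prompt or generated code",
    "integration issue",
    "performance or reliability",
}


@dataclass
class Rollback:
    release_id: str
    kind: str    # "full" or "partial"
    reason: str  # one entry from REASONS


def rollback_summary(rollbacks, production_changes):
    """Rollback rate plus counts by kind and by reason for one period."""
    if production_changes <= 0:
        raise ValueError("production_changes must be positive")
    for r in rollbacks:
        if r.kind not in ("full", "partial"):
            raise ValueError(f"unknown rollback kind: {r.kind}")
        if r.reason not in REASONS:
            raise ValueError(f"unknown rollback reason: {r.reason}")
    return {
        "rate": len(rollbacks) / production_changes,
        "by_kind": Counter(r.kind for r in rollbacks),
        "by_reason": Counter(r.reason for r in rollbacks),
    }


# Hypothetical week: 40 changes reached production, 3 rollback events.
rollbacks = [
    Rollback("r-101", "full", "test gap"),
    Rollback("r-104", "partial", "bad prompt or generated code"),
    Rollback("r-107", "partial", "bad prompt or generated code"),
]
summary = rollback_summary(rollbacks, production_changes=40)
print(f"{summary['rate']:.1%}")           # 7.5%
print(summary["by_reason"].most_common(1))
```

Rejecting unknown reasons is a deliberate choice here: a free-text reason field tends to grow until nobody can group by it.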
How to track escaped defects
Escaped defects are bugs found after release. This number keeps teams honest. AI can help a team produce more code, but customers still judge the product by what breaks in production.
Count only defects found after release. Include reports from users, support, sales, and account teams, plus issues that production monitoring catches. If an engineer catches the bug before release, leave it out. Mixing pre-release and post-release bugs makes the number noisy.
A single count is not enough. Group defects by severity so small annoyances do not bury serious failures:
- Severity 1: data loss, security trouble, payments failing, or the product cannot complete a main task
- Severity 2: a feature works badly, but people can still finish the task with effort or a workaround
- Severity 3: minor UI bugs, wrong text, cosmetic issues, or small annoyances
This keeps a bad week from looking harmless because the dashboard says "12 bugs" without context. One payment failure should never look equal to eleven typo fixes.
When you log a defect, link it to the release, deploy window, ticket, or change set that most likely introduced it. Perfect attribution is rare, especially if you ship many times a day. Consistency matters more. If the team can usually trace a production bug back to one release or commit range, patterns show up quickly.
Set one rule for duplicates and stick to it. Count the defect once, then record how many reports came in. Six customers hitting the same billing bug is still one escaped defect, but the report count tells you how wide the damage spread. Do the same for false alarms. If support reports a bug and the team confirms expected behavior, close it as a false alarm and keep it out of the metric.
A simple log with six fields is enough: date found, severity, source of report, report count, suspected release, and status. That gives you a clean weekly view. Over a month, you will see more than how many bugs happened. You will see which releases create trouble, which issues reach customers, and whether the team is fixing the right problems.
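As an illustration, the log and the duplicate and false-alarm rules might look like the sketch below. The `DefectReport` record, the `defect_key` field used to tie duplicate reports together, and the status values are assumptions made for the example. So is the choice to keep the worst severity when duplicate reports disagree.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class DefectReport:
    defect_key: str        # hypothetical ID shared by duplicate reports
    date_found: date
    severity: int          # 1 (worst) to 3 (minor)
    source: str            # e.g. "user", "support", "monitoring"
    suspected_release: str
    status: str            # e.g. "open", "fixed", "false_alarm"


def summarize_escaped_defects(reports):
    """Count each defect once, track report volume, and drop false alarms."""
    defects = {}
    for r in reports:
        if r.status == "false_alarm":
            continue  # confirmed expected behavior stays out of the metric
        entry = defects.setdefault(
            r.defect_key,
            {"severity": r.severity, "release": r.suspected_release, "reports": 0},
        )
        entry["reports"] += 1
        # Assumption: if duplicates disagree, keep the most severe rating.
        entry["severity"] = min(entry["severity"], r.severity)
    by_severity = Counter(d["severity"] for d in defects.values())
    return defects, by_severity


# Hypothetical week: six reports of one billing bug, one typo, one false alarm.
day = date(2024, 3, 6)
reports = [DefectReport("billing-total", day, 1, "user", "rel-42", "open") for _ in range(6)]
reports.append(DefectReport("settings-typo", day, 3, "support", "rel-42", "fixed"))
reports.append(DefectReport("export-order", day, 2, "support", "rel-41", "false_alarm"))

defects, by_severity = summarize_escaped_defects(reports)
print(len(defects))                         # 2
print(defects["billing-total"]["reports"])  # 6
```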
A simple setup you can start this month
You do not need a large dashboard project to get useful numbers. Pick one team, one product area, and one short weekly report. That is enough to test whether your metrics help decisions or just create more debate.
Write the rules before anyone touches a chart. If you skip that step, people will argue about edge cases every week and stop trusting the report. Keep the rules short, plain, and stable for at least one month.
A practical first setup looks like this:
- Pull change data from your version control system.
- Pull deploy and rollback events from deployment logs.
- Pull escaped defects from your bug tracker or support queue.
- Put the numbers in one shared sheet or short report.
- Review them once a week for 15 minutes.
For many teams, version control, deploy records, and bug tickets already cover most of the work. If your stack already includes GitLab, CI logs, and Sentry, you probably have enough raw data to start without buying anything new.
Keep the report small. One page is enough. Track accepted change rate, rollback rate, and escaped defects, plus a short note for anything unusual such as a risky release, a holiday week, or a major refactor.
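If the numbers already live in a sheet or script, the one-page report can be plain text. The sketch below is one possible layout. The function name, its parameters, and the sample values are hypothetical.

```python
def weekly_report(week, accepted_rate, rollback_rate, escaped_by_severity, note=""):
    """Format the three metrics and an optional note as a short text report."""
    defects = ", ".join(
        f"S{sev}: {count}" for sev, count in sorted(escaped_by_severity.items())
    ) or "none"
    lines = [
        f"Week of {week}",
        f"Accepted change rate: {accepted_rate:.0%}",
        f"Rollback rate: {rollback_rate:.0%}",
        f"Escaped defects: {defects}",
    ]
    if note:
        lines.append(f"Note: {note}")
    return "\n".join(lines)


# Hypothetical values for one week.
print(weekly_report("2024-03-04", 0.82, 0.04, {1: 1, 3: 4}, note="holiday week"))
```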
Keep the meeting short
A weekly review should not turn into a blame session. Look for movement, not excuses. If rollback rate jumps for two weeks, ask what changed in release practice, test coverage, or the review of AI-generated code.
Do not overreact to one ugly week. Teams using AI tools often ship in bursts, and that can skew any single week's numbers. Trends across six to twelve weeks tell a much cleaner story than any single spike.
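One simple way to look at the trend instead of the spike is a rolling average over the weekly values. This is a sketch with made-up weekly counts. A six-week window is an assumption you can widen toward twelve.

```python
def rolling_mean(values, window):
    """Average of each run of `window` consecutive values, oldest first."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical escaped defects per week; week three is the ugly one.
weekly_defects = [5, 4, 14, 6, 5, 6, 4, 5]
trend = rolling_mean(weekly_defects, window=6)
print([round(x, 1) for x in trend])  # [6.7, 6.5, 6.7]
```

The spike of 14 still shows up in the raw data, but the six-week average barely moves, which is the signal to investigate that week rather than rewrite the process.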
A simple example makes the point. Say a five-person team ships more code after adding AI coding tools, but accepted change rate drops and escaped defects rise. That usually means the team is producing faster than it can review, test, or stage releases. You can act on that. A raw pull request count would miss it.
Start plain, keep the definitions fixed, and give the trend a month or two to settle. Boring measurement beats flashy charts every time.
A realistic example from a small team using AI tools
A five-person product team added AI coding tools to its daily work. Two engineers used code generation for routine tickets, one used it for test scaffolding, and the team started shipping much faster. In the first full month, they merged 94 pull requests instead of their usual 48.
At first, that looked like a win. The dashboard showed more activity, more merged work, and shorter cycle time. Then the weeks after those releases got messy.
Support saw more customer complaints. A settings page broke for some users. A billing fix needed two patches. One release introduced a bug that forced the team to roll back part of the deployment the same day. By the end of the month, the team had shipped almost twice as many pull requests, but rollback rate had climbed from 3% to 11%.
The bigger problem showed up a few weeks later. Many of those fast changes did not last. Engineers kept revisiting them, replacing them, or backing them out after edge cases appeared. Their accepted change rate over 30 days dropped from 79% to 62%.
Escaped defects moved the wrong way too. The team usually saw about 5 production issues per month that testing missed. After the push to ship more, that number rose to 14. Support felt it first. Engineers felt it next, because they spent more time fixing live problems and less time building new work.
They changed course without dropping AI tools. They slowed down merges, used smaller batches, and required human review for code that touched payments, auth, and data deletion. They also tightened tests around the parts that had broken most often.
Six weeks later, the picture looked better. Pull requests settled at 63 for the month, rollback rate fell to 4%, accepted change rate recovered to 81%, and escaped defects dropped to 6.
That team did not need fewer tools. It needed better AI engineering metrics. Once it stopped praising raw pull request counts, it could see which changes actually stayed live and which ones only created more work.
Mistakes that distort the picture
These metrics help only when you keep them boring and consistent. Most reporting problems come from how teams count, compare, and react to the numbers.
Changing the definition in the middle of a quarter ruins the trend. If accepted change rate means "merged to main" in April and "shipped to production and stayed" in May, the chart stops telling a clear story. Freeze the definition for the whole reporting window. If you need a new definition, start a new baseline and mark the break.
Bad comparisons create the next mess. A team that ships ten small updates a day will look very different from a team that releases once a week in larger batches. Their rollback rate and escaped defects do not mean the same thing, even if the labels match. Compare each team to its own recent history first. Compare teams to each other only when release style, product risk, and change size are close enough.
Treating every defect as equal bends the picture too. A broken checkout flow and a typo in a settings screen should not carry the same weight. Count severity levels separately, or add a simple impact score based on customer pain, support effort, and time to fix.
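A simple impact score could look like the sketch below. The 1-to-5 scales and the weights are invented for illustration, so treat them as a starting point to argue about, not a recommended formula.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    title: str
    customer_pain: int   # 1 (barely noticed) to 5 (blocks a main task)
    support_effort: int  # 1 (no tickets) to 5 (support overwhelmed)
    time_to_fix: int     # 1 (minutes) to 5 (days)


def impact_score(defect):
    """Weighted sum of the three ratings; customer pain counts triple (assumed weight)."""
    ratings = (defect.customer_pain, defect.support_effort, defect.time_to_fix)
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    return 3 * defect.customer_pain + defect.support_effort + defect.time_to_fix


checkout = Defect("Checkout flow broken", customer_pain=5, support_effort=4, time_to_fix=3)
typo = Defect("Typo on settings screen", customer_pain=1, support_effort=1, time_to_fix=1)
print(impact_score(checkout), impact_score(typo))  # 22 5
```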
Using these numbers to rank individual engineers poisons them fast. People start slicing work into odd shapes, avoiding risky but necessary changes, or pushing defects into someone else's queue. AI engineering metrics work better when they describe team health and delivery quality, not personal worth.
Context matters. If support tickets jump after a launch, or the product team ships a large workflow rewrite, the numbers will move. That does not always mean the team got worse. A metric without product and support context is just a number on a slide.
When the graph changes, add a short note beside it. Say what changed in the product, release process, customer load, or support volume. That simple habit prevents a lot of bad decisions.
Quick checks before you report upward
Bad metrics get worse as they move upward. By the time a weekly report reaches a founder, CTO, or board, a vague number can turn into a wrong judgment.
Start with plain language. If one manager says accepted change rate means "merged code" and another says it means "code that shipped and stayed," the metric is already broken. Each number should fit in one short sentence that any team lead can repeat without footnotes.
Dates need to line up too. If a release went out on Tuesday, users reported a bug on Wednesday, and the team rolled back on Wednesday night, the dashboard should show the same story. When deploy dates, rollback dates, and bug dates drift apart across tools, people argue about the timeline instead of fixing the process.
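One way to keep the timeline consistent is to convert every timestamp to UTC before comparing events from different tools. The sketch below assumes each tool can export ISO 8601 timestamps with an explicit offset. The source names and sample events are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {timestamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(events):
    """Sort (source, label, timestamp) events from different tools by UTC time."""
    return sorted((to_utc(ts), source, label) for source, label, ts in events)


# Hypothetical events exported from three tools in three time zones.
events = [
    ("deploy log", "rollback", "2024-03-06T22:30:00+01:00"),
    ("support queue", "bug reported", "2024-03-06T09:15:00-05:00"),
    ("deploy log", "release", "2024-03-05T16:00:00+00:00"),
]
for when, source, label in build_timeline(events):
    print(when.isoformat(), source, label)
```

Rejecting timestamps without an offset is intentional: a naive local time is exactly how a Wednesday-night rollback ends up looking like it happened before the bug report.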
A good report should help you spot a bad release in seconds. You should not need to read thirty pull requests to notice that accepted change rate dropped, rollback rate jumped, and escaped defects rose after one deployment window. If the signal appears only after manual digging, the dashboard is too weak.
The numbers also need a short feedback loop. Teams should review them and change something within one sprint. That change might be smaller releases, tighter test gates, or a pause on generated code in one risky area. If nobody acts, the report is just decoration.
One more check matters: do not turn this into a scorecard for individual developers. Teams using AI tools can produce a lot of code very quickly, and that makes personal output metrics even more misleading. These measures describe release health, not who worked harder.
A single engineer with strong prompts or agents can open many pull requests and still create a mess for operations. Another engineer can ship fewer changes, avoid rollbacks, and keep defects low. The second person usually helps the business more.
If your report shows what shipped, what held up in production, what escaped, and what the team will change next sprint, it is ready to send.
What to do next
Start with one product area, not the whole company. Pick a flow that ships often and gets real user traffic, then collect 30 days of data before you change anything. That baseline matters because most teams argue from memory, and memory is usually wrong.
Keep the setup plain. You do not need a new dashboard project or a long policy document. You need a shared way to count accepted change rate, rollback rate, and escaped defects the same way every week.
A simple first pass works well:
- Choose one product area with frequent releases.
- Record the three metrics for 30 days with the current process.
- Review the numbers in one meeting with engineering, product, and support.
- Change one habit at a time, then watch the next 30 days.
That review meeting matters more than most teams expect. Engineering can explain where generated changes sped things up or created cleanup work. Product can say whether accepted changes actually moved the roadmap forward. Support can show where escaped defects hit real users, not just internal QA.
Once patterns appear, change habits based on those patterns. If accepted change rate is low, tighten review prompts, narrow task size, or ask for smaller merges. If rollback rate climbs after fast releases, shrink the batch size or add one extra check before deploy. If escaped defects keep slipping through, fix the test gap instead of blaming the model.
That is where AI engineering metrics start pulling their weight. They stop being a scoreboard and start acting like feedback. A team that ships fewer pull requests but lands more accepted changes with fewer rollbacks is doing better work.
If your team is too close to the problem, an outside review can help. Oleg Sotnikov at oleg.is works with startups and smaller companies as a Fractional CTO and advisor, including AI-augmented development workflows and lean delivery systems. In practice, this kind of cleanup usually starts with clearer definitions and calmer release habits, not more process.
Frequently Asked Questions
Why is pull request count a bad progress metric?
Pull request count measures repo motion, not shipped value. AI tools can split one task into many neat diffs, so the number rises even when users get one small change. When teams chase that number, they merge low-impact edits and waste review time on noise.
Are small pull requests still a good idea?
Yes, small PRs still help. They make review easier and lower risk. Trouble starts when people optimize for the count instead of the release, because then they split work for the scoreboard rather than for clarity.
What does accepted change rate mean?
Accepted change rate is the share of production changes that are still live after a fixed window. If a change gets reverted, hotfixed away, or replaced because it failed, do not count it as accepted.
What time window should we use for accepted change rate?
Start with 7 days if you deploy often. Use 14 days if problems show up later or releases move more slowly. Pick one window before you report the metric, then keep it the same across teams and repos.
What counts as a rollback?
Count any shipped change you had to undo or switch off because it caused harm in production. That includes a revert, a feature flag shutoff, or a config change that restores the old behavior.
How should we track escaped defects?
Count bugs that people find after release, not bugs the team catches before launch. Group them by severity and log the source, suspected release, and status so you can spot patterns without arguing over every ticket.
Do we need new tools or a big dashboard project?
Usually no. Most teams already have enough data in version control, deploy logs, bug tickets, and monitoring. Put the numbers in one shared report, review them weekly, and keep the rules fixed.
Should we use these metrics to rank engineers or compare teams?
No. Use these numbers to judge team release health, not personal worth. Different teams ship with different batch sizes, risk levels, and release styles, so blunt comparisons often push people toward the wrong behavior.
What if PRs go up but accepted changes go down?
That usually means the team ships faster than it reviews, tests, or stages releases. Slow the merge pace a bit, cut batch size, and add human review around risky areas like payments, auth, or data deletion.
What is the simplest way to start this month?
Pick one product area that ships often and gets real user traffic. Track accepted change rate, rollback rate, and escaped defects for 30 days, review the trend with engineering, product, and support, then change one habit and watch the next month.