Nov 25, 2025·8 min read

AI delivery forecasts that match review and test reality

AI delivery forecasts fail when teams ignore review queues, flaky tests, and retry churn. Learn how to estimate with real production drag.

Why estimates look fast and ship slow

A task can look finished the moment an AI tool produces a working draft. That is the first gap. Code written is not code shipped.

Teams often feel fast on day one, then lose time in review comments, failed checks, rewrites, and another round of approvals. The code exists, but the team still has to prove it belongs in production.

AI changes where the time goes. It can shrink drafting time sharply. A feature that once took two days to sketch might appear in two hours. The later steps usually do not shrink much. Senior engineers still review risky changes. Tests still fail for strange reasons. Someone still decides whether the code is safe, readable, and worth merging.

That is why estimates drift. People count the part they can see on screen and skip the queues around it. They estimate coding time and forget the waiting time between attempts.

The missing time usually comes from a few places:

review backlog when one or two people approve most changes
flaky tests that fail without a real product issue
retry churn when a fix breaks something else, or the next AI draft creates new review work
release checks, staging checks, and cleanup work that never make it into the estimate

A normal sprint makes this easy to spot. One engineer asks an AI assistant for a new endpoint, gets a usable draft before lunch, and thinks the task is almost done. By Friday, the team has spent more time on comments, reruns, and retries than on the draft itself. On paper, the work looked faster. In production, it took the same time or even longer.

A good forecast does not ask how fast code appears. It counts the full path from draft to release, including the friction that shows up after the code compiles. When a forecast includes review load, test noise, and retry loops, it starts to match the way software actually ships.

Where the extra time hides

When AI helps a team write code faster, the calendar usually does not shrink at the same speed. A first draft may appear in minutes, but shipping still depends on reviewers, test runs, deployment gates, and approvals.

Most teams count build time and coding time. They miss the waiting time between steps. A pull request can sit for four hours before anyone opens it. Then the reviewer asks for changes, the author updates the code, and the same review starts again.

Review work is not one block of time. It usually has three parts: time in the queue, time spent reading and commenting, and time spent fixing what the reviewer found. If AI lets developers produce more changes each day, review load often grows faster than expected. The team wrote code faster, but the reviewer still reads at human speed.

Flaky tests hide another delay. A failing check that passes on the second run still costs time. Someone has to inspect the failure, decide whether it is real, rerun the pipeline, and wait for the result. One unstable test can turn a 15 minute check into 40 minutes of elapsed time without changing a single line of product code.

The same pattern shows up later on the path to production:

CI jobs fail for reasons no one trusts, so people rerun them
deployments time out and need another attempt
approval steps pause because the right person is busy
small reviewer comments trigger another commit and another full pipeline

Each delay looks minor on its own. Together, they can add half a day to a small change and a full day to a larger one.

Good AI delivery forecasts treat this hidden work as normal work, not as bad luck. If a team usually needs one review cycle, one rerun of unstable tests, and one approval nudge, the estimate should include all three. Otherwise the plan says two days, the team ships in four, and everyone acts surprised.

A simple rule helps: measure elapsed time, not only direct work. A task may need 90 minutes of focused effort, but if it spends six more hours waiting in review, CI, and approval queues, the forecast should reflect the full path to done.
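The rule above can be written down as a tiny sketch. The `TaskTiming` class and its field names are hypothetical, chosen only for illustration; the numbers are the 90 minutes of focused effort and six hours of waiting from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Hypothetical record: hands-on work vs. time spent in queues."""
    focused_minutes: float   # time someone actively works on the task
    waiting_minutes: float   # time in review, CI, and approval queues

    @property
    def elapsed_minutes(self) -> float:
        # The forecast should use this number, not focused_minutes alone.
        return self.focused_minutes + self.waiting_minutes


task = TaskTiming(focused_minutes=90, waiting_minutes=6 * 60)
print(task.elapsed_minutes / 60)  # 7.5 hours of elapsed time
```

Keeping the two numbers separate also shows where the time went: here most of the elapsed time is waiting, not work.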

How to count review load

Review time is part of delivery time. Teams often ignore it because nobody writes code during that gap, but the work stops there all the same.

This gets worse when AI writes the first draft fast. A change can look 'done' by lunch, then sit in a pull request queue until the next day.

Start with recent pull requests, not memory. Look at the last 20 to 30 normal changes from your team. Skip odd cases like emergency hotfixes, giant rewrites, or work blocked by vacations.

For each pull request, note four numbers:

time from opening the pull request to the first real review
number of review rounds before approval
time spent on small fixes after comments
time spent on larger changes after design or architecture feedback

The first number matters more than many teams expect. If the average wait before first review is 18 hours, that delay belongs in the forecast even if the actual review takes 12 minutes.

Then count comment rounds. One round is common. Two is still normal. Three or more usually means the team missed something earlier, or the change is too large. Use your recent average instead of an ideal target.

Keep small fixes separate from broader feedback. Small fixes are things like a naming change, one missing test, or a quick refactor. Broader feedback asks for a new approach, a split into smaller parts, or a change in data flow. Those comments can add half a day or several days, so they should not sit in the same bucket.

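A simple forecast looks like this: coding time + average wait for first review + average small fix loop + (design rework rate × typical design rework time). If only one in five pull requests gets broader design feedback, count that as partial risk, not a full delay on every change. For example, if that kind of rework typically costs one day, a one in five rate adds about a fifth of a day to each estimate.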
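One way to sketch that forecast in code, assuming the design rework term means the share of pull requests that get design feedback multiplied by the typical cost when it happens. The `PullRequest` fields and the sample values are hypothetical and only illustrate the shape of the calculation; real inputs would come from your own pull request history.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PullRequest:
    """Hypothetical per-PR measurements, all in hours."""
    hours_to_first_review: float
    small_fix_hours: float
    design_rework_hours: float  # 0.0 when there was no design feedback


def review_overhead_hours(prs: list[PullRequest]) -> float:
    avg_wait = mean(pr.hours_to_first_review for pr in prs)
    avg_small_fix = mean(pr.small_fix_hours for pr in prs)
    reworked = [pr.design_rework_hours for pr in prs if pr.design_rework_hours > 0]
    rework_rate = len(reworked) / len(prs)           # e.g. one in five
    avg_rework = mean(reworked) if reworked else 0.0
    # Design rework counts as partial risk: rate times typical cost.
    return avg_wait + avg_small_fix + rework_rate * avg_rework


def forecast_hours(coding_hours: float, prs: list[PullRequest]) -> float:
    return coding_hours + review_overhead_hours(prs)


# Illustrative sample: five PRs, one of which needed design rework.
sample = [
    PullRequest(18.0, 1.0, 0.0),
    PullRequest(6.0, 0.5, 0.0),
    PullRequest(20.0, 1.5, 8.0),
    PullRequest(4.0, 1.0, 0.0),
    PullRequest(12.0, 1.0, 0.0),
]
print(round(forecast_hours(3.0, sample), 1))
```

With this sample the wait for first review dominates the overhead, which is the pattern the section describes.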

This is where paper speed meets production reality. Fast code generation helps, but review load still depends on reviewer time, trust in the change, and how much the team asks reviewers to judge in one pass.

If the numbers feel surprisingly high, they are probably closer to the truth than the first estimate was.

How to count flaky tests and retry churn

A test suite can look healthy on a dashboard and still waste hours every week. Teams feel this when a branch goes red, nobody changed the related code, and the same pipeline passes on the second or third run.

That lost time belongs in the estimate. If you leave it buried inside CI reports, the work looks faster on paper than it feels to the people shipping it.

What to measure

Start with one simple rule: count failures that happen without code changes. When the same commit fails, then passes on rerun, you likely found test flakiness, not a product bug.

Track four numbers for each team, each week:

how many pipelines needed at least one rerun
how many total reruns the team triggered
how long one rerun usually takes, including waiting time
how many failures turned out to be flaky instead of real defects

Keep flaky failures separate from real bugs. A real bug needs debugging and code changes. A flaky test usually needs reruns, log checks, and a short discussion about whether the failure is safe to ignore. Those are different kinds of delay.
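A minimal sketch of that separation, assuming you can export pipeline runs as pairs of commit identifier and pass/fail result. The function name, labels, and input shape are hypothetical and not tied to any particular CI system. A commit with both failed and passed runs is only a flaky suspect; someone still has to confirm it.

```python
from collections import defaultdict


def classify_commits(runs):
    """runs: iterable of (commit_sha, passed) tuples, one per pipeline run."""
    outcomes_by_commit = defaultdict(list)
    for sha, passed in runs:
        outcomes_by_commit[sha].append(passed)

    labels = {}
    for sha, outcomes in outcomes_by_commit.items():
        if all(outcomes):
            labels[sha] = "clean"            # every run passed
        elif any(outcomes):
            labels[sha] = "flaky_suspect"    # same code both failed and passed
        else:
            labels[sha] = "failing"          # never passed: likely a real defect
    return labels


runs = [
    ("a1", True),
    ("b2", False), ("b2", True),    # failed, then passed with no code change
    ("c3", False), ("c3", False),   # failed on every attempt
]
print(classify_commits(runs))
```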

A small example makes the math clear. Say a team runs 40 pipelines in a week. Twelve need reruns. The team triggers 22 reruns total, and each one costs 9 minutes between queue time, test time, and checking the result. That is 198 minutes, or more than 3 hours, gone before anyone fixes a real issue.
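The same arithmetic as a short sketch, using only the numbers from the example above. The variable names are illustrative.

```python
pipelines = 40              # pipelines run in the week
pipelines_with_rerun = 12   # pipelines that needed at least one rerun
total_reruns = 22           # reruns triggered in total
minutes_per_rerun = 9       # queue time + test time + checking the result

weekly_retry_minutes = total_reruns * minutes_per_rerun
hours, minutes = divmod(weekly_retry_minutes, 60)
rerun_rate = pipelines_with_rerun / pipelines

print(weekly_retry_minutes)   # 198
print(hours, minutes)         # 3 18
print(rerun_rate)             # 0.3
```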

Put retry time into the feature

Now push that time back into delivery planning. If a sprint includes five similar features, do not hide those 3 hours inside a generic CI number. Spread the retry cost across the planned work.

If flaky tests hit about 30 percent of changes, add that delay to the feature estimate. For example, if a change usually needs one review cycle and has a one in three chance of a 9 minute rerun, the estimate should carry that extra time. Forecasts get more honest when they include this friction instead of assuming every generated change moves through tests once.
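A hedged sketch of two ways to price that retry time into planned work, using the figures from the example week. Both function names are hypothetical. The first spreads the measured total across the sprint's features, assuming the sprint sees roughly the same rerun load as the sample week. The second computes an expected cost per pipeline from the flake rate.

```python
def retry_minutes_per_feature(total_retry_minutes: float, features: int) -> float:
    """Spread measured retry time evenly across similar planned features."""
    return total_retry_minutes / features


def expected_rerun_minutes(flake_probability: float,
                           reruns_when_flaky: float,
                           minutes_per_rerun: float) -> float:
    """Expected retry cost of one pipeline: chance of a flake times its cost."""
    return flake_probability * reruns_when_flaky * minutes_per_rerun


# 198 minutes of reruns spread over five similar features.
per_feature = retry_minutes_per_feature(198, 5)

# 12 of 40 pipelines flaked, with 22 reruns among them, at 9 minutes each.
per_pipeline = expected_rerun_minutes(12 / 40, 22 / 12, 9)

print(round(per_feature, 1), round(per_pipeline, 2))
```

The per-pipeline figure looks small, but a feature usually triggers several pipelines across commits and review rounds, which is why the per-feature number is larger.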

This also helps teams avoid the wrong fix. If rerun time keeps growing, the problem is not 'slow developers.' The problem is test instability. Count it, separate it, and price it into the work until the team removes the flaky tests.

How to build a forecast step by step

Good AI delivery forecasts start with the task as human work first. If a change would usually take two days without AI help, write that down before you subtract anything. That keeps the estimate tied to real scope instead of tool optimism.

Then measure only the part AI actually speeds up. For many teams, that is the first draft of code, tests, or docs. It rarely removes review, debugging, waiting, or release friction.

1. Write the baseline size for the task with no AI help.
2. Estimate draft time with the tools your team uses today, not the tools you wish you had.
3. Add review time from recent team history, including comments, rework, and second passes.
4. Add time for test reruns, flaky failures, fixes, and deploy retries.
5. Record two outcomes: best case and likely case.

The review step matters more than most teams admit. If one pull request usually gets two rounds of comments and sits half a day before someone checks it, that delay belongs in the forecast. AI can write code fast, but it cannot make a teammate review it sooner or make a risky change easier to approve.

Treat test instability the same way. If your CI often fails for reasons unrelated to the change, count that drag. A team that reruns tests twice per merge and spends 20 minutes checking false failures should add that time every time. The same goes for deploy retries, rollback checks, and small fix commits after staging.

A simple formula works well: take the baseline scope, swap the baseline drafting time for the reduced draft time, then add review time and retry time. Then split it into two lines. Best case assumes reviews land fast and tests behave. Likely case uses your normal review pace and your normal failure rate.

That second number is usually the one you should promise. It looks less exciting on paper, but it is much closer to what ships.
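The steps above can be sketched as a small function. This is one possible reading of the formula, assuming the baseline covers hands-on work only and that review and retry time are added on top. The `ForecastInputs` fields are hypothetical, and the example values are made up to show the shape of the output, not taken from real data.

```python
from dataclasses import dataclass


@dataclass
class ForecastInputs:
    """Hypothetical inputs, all in hours."""
    baseline_hours: float        # task size with no AI help
    baseline_draft_hours: float  # part of the baseline spent drafting
    ai_draft_hours: float        # drafting time with today's tools
    review_hours_best: float     # reviews land fast
    review_hours_likely: float   # normal review pace from team history
    retry_hours_best: float      # tests behave
    retry_hours_likely: float    # normal rerun and deploy retry rate


def forecast(inputs: ForecastInputs) -> dict[str, float]:
    # AI only replaces the drafting part; the rest of the baseline stays.
    core = inputs.baseline_hours - inputs.baseline_draft_hours + inputs.ai_draft_hours
    return {
        "best": core + inputs.review_hours_best + inputs.retry_hours_best,
        "likely": core + inputs.review_hours_likely + inputs.retry_hours_likely,
    }


# Illustrative values only: a two day (16 hour) task with 10 hours of drafting.
example = ForecastInputs(
    baseline_hours=16, baseline_draft_hours=10, ai_draft_hours=2,
    review_hours_best=2, review_hours_likely=8,
    retry_hours_best=0, retry_hours_likely=1,
)
print(forecast(example))  # {'best': 10, 'likely': 17}
```

Recording both lines side by side makes it harder to quietly promise the best case.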

A simple example from a normal sprint

A small feature often looks cheap on paper. Say the team adds one API change to return a new delivery_window field, then adds a small UI update that shows that window on the order page.

An AI coding tool can make the first draft feel almost done. It writes the API handler, updates the schema, adds a React component, and even suggests tests in about 40 minutes. After a quick cleanup by the developer, the work looks like a one day task.

The paper estimate might look like this: 3 hours for coding, 1 hour for tests, 1 hour for review, and 1 hour for merge and deploy. That is 6 hours total.

Then normal team friction shows up.

The first review comes back with one real issue and one style request. The reviewer asks for a permission check on the API field and wants the UI label changed because support uses a different term. The code change is small, but the review cycle still costs about 75 minutes between reading comments, fixing the code, pushing again, and waiting for the next pass.

CI then fails twice on a flaky browser test that has nothing to do with the feature. Each rerun takes about 15 minutes of wall time, plus a few minutes to inspect the logs and decide that the failure is noise. That is roughly 40 minutes of retry churn, and none of it moves the feature forward.

One more issue appears after the second review. On a narrow screen, the new delivery badge wraps badly and pushes the action button down. The fix takes 30 minutes, then the team runs the UI tests again.

Now the timeline looks different:

paper estimate (coding, tests, review, merge and deploy): 6 hours
review cycle and rework: 1 hour 15 minutes
two flaky test reruns: 40 minutes
small UI rework: 30 minutes

The task still looked like a 6 hour job. The friction added 2 hours 25 minutes, so the shipped timeline was closer to 8 hours 25 minutes, and that assumes nobody switched context while waiting on CI or review. That gap is why forecasts need to count review load, flaky test impact, and retry churn instead of draft speed alone.
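The totals in this example can be checked with a few lines, using the paper estimate and the three friction items described above. The dictionary keys are illustrative labels.

```python
# Paper estimate from the example, in minutes.
paper_estimate = {"coding": 180, "tests": 60, "review": 60, "merge_and_deploy": 60}

# Friction that showed up after the draft, in minutes.
friction = {"review_cycle_and_rework": 75, "flaky_reruns": 40, "ui_rework": 30}

paper_total = sum(paper_estimate.values())
friction_total = sum(friction.values())
shipped_total = paper_total + friction_total

print(divmod(paper_total, 60))     # (6, 0)
print(divmod(friction_total, 60))  # (2, 25)
print(divmod(shipped_total, 60))   # (8, 25)
```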

Mistakes that break the forecast

Bad forecasts usually fail for a simple reason: teams count code generation as progress, but they do not count the work that follows. AI can produce a first draft in minutes. Review, fixes, test reruns, and cleanup still take hours or days.

That gap gets worse when people treat AI output as finished work. A generated pull request may look 80 percent done, yet the last 20 percent can hold most of the delay. One unclear query, one wrong edge case, or one security concern can send the task back through review again.

Another common mistake is using one unusually smooth week as the baseline. Maybe the team had small tickets, a quiet review queue, and no broken tests. That week feels great, but it does not describe normal delivery. Forecasts need a median week, not a lucky one.

Test instability often gets buried inside a vague 'engineering overhead' bucket. That hides the real cost. If a suite fails for random reasons twice a day, the team burns time checking logs, rerunning jobs, and rebuilding trust in the signal. Count flaky test impact on its own, because it behaves differently from normal build time.

Review capacity breaks forecasts just as often. A small team can write five tasks in parallel, but if one senior engineer reviews everything, the queue becomes the schedule. Promising dates before that queue clears is a quick way to miss them.

Mixed task data causes another quiet failure. Bug fixes, new features, infrastructure changes, and prompt tuning do not move at the same speed. If you average them together, the forecast looks neat and means very little.
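A small sketch of why mixed averages mislead, assuming each finished task is recorded as a pair of task type and days to ship. The function name and the sample data are made up for illustration. It uses a median per type, which also keeps one lucky or unlucky task from setting the baseline.

```python
from collections import defaultdict
from statistics import mean, median


def median_days_by_type(tasks):
    """tasks: iterable of (task_type, days_to_ship) tuples."""
    grouped = defaultdict(list)
    for task_type, days in tasks:
        grouped[task_type].append(days)
    return {task_type: median(days) for task_type, days in grouped.items()}


# Illustrative history, not real data.
history = [
    ("bug", 1), ("bug", 2), ("bug", 1.5),
    ("feature", 4), ("feature", 6), ("feature", 5),
    ("infra", 8),
]

print(median_days_by_type(history))                 # per-type medians
print(round(mean(days for _, days in history), 2))  # one blended average
```

In this sample the blended average lands near four days, which describes neither a typical bug fix nor a typical feature.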

Warning signs usually appear early:

tasks look done in development but sit in review for days
the team reruns the same test job several times per ticket
estimates assume the same pace for bugs and brand new features
sprint plans depend on one reviewer staying fully available

A normal example is a two day feature that AI drafts in half a day. The team then spends a day on review comments, half a day rerunning flaky tests, and a few more hours on merge conflicts because the branch sat too long. On paper, the work looked faster. On the calendar, it took closer to three days than two.

Quick checks before you commit

A forecast can look fine in a ticket and still miss the sprint once reviews, reruns, and cleanup start piling up. Before you share a date, pause for five minutes and check the work around the work.

Start with review capacity. If the task needs one senior reviewer and that person already has six open pull requests, the estimate is already in trouble. Count reviewer time the same way you count coding time. A small change can wait two days if the right reviewer is busy.

Then check the test suite this task will touch. Do not ask whether tests usually pass. Ask for the flake rate on this part of the suite. If the API tests fail one run out of eight for no product reason, that noise will eat time and confidence.

A short checklist helps:

check who can review the change this sprint, and how many reviews they already owe
check the recent flake rate for the exact tests this work will trigger
check who owns reruns, log digging, and retry cleanup when tests fail
check whether the change touches risky code such as billing, auth, migrations, or deployment scripts
check whether the likely case, not the best case, still fits the sprint
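The checklist above could be turned into a quick script along these lines. Every name and threshold here is a hypothetical placeholder, including the limits of five open pull requests and a 10 percent flake rate; each team would pick its own values from its history.

```python
from typing import Optional


def forecast_warnings(
    reviewer_open_prs: int,
    flake_rate: float,
    rerun_owner: Optional[str],
    touches_risky_code: bool,
    likely_case_days: float,
    sprint_days_left: float,
    max_open_prs: int = 5,         # assumed threshold, tune per team
    max_flake_rate: float = 0.10,  # assumed threshold, tune per team
) -> list[str]:
    """Return a list of reasons the estimate may not hold."""
    warnings = []
    if reviewer_open_prs > max_open_prs:
        warnings.append("reviewer already has a long queue")
    if flake_rate > max_flake_rate:
        warnings.append("tests this change triggers are flaky")
    if rerun_owner is None:
        warnings.append("nobody owns reruns and log checks")
    if touches_risky_code:
        warnings.append("risky area: add buffer for extra review")
    if likely_case_days > sprint_days_left:
        warnings.append("likely case does not fit the sprint")
    return warnings


# Example: six open PRs, one flaky run in eight, no rerun owner, billing change.
issues = forecast_warnings(6, 1 / 8, None, True,
                           likely_case_days=4, sprint_days_left=3)
for issue in issues:
    print(issue)
```

An empty list does not prove the date is safe. It only means none of the known warning signs fired.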

Ownership matters more than many teams admit. When nobody owns reruns, everyone loses small chunks of time. One engineer retries CI, another scans logs, then the author comes back later to clean up a messy branch. Fifteen minutes here and twenty there can turn a one day task into three.

Risky code needs a wider buffer. A change near login, payments, or infrastructure often pulls in extra review, manual checks, and cautious rollout steps. Oleg Sotnikov often makes this point when he talks about software architecture and delivery systems: the cost is rarely the code alone, but the queue around it.

If the likely case no longer fits the sprint, say so early. A smaller promise that ships on time is better than a fast estimate that only worked in a clean imaginary run.

What to do next

Start with one team and one month of recent delivery data. Use work that actually shipped, not work that people planned on Monday and dropped by Friday. Count review rounds, time waiting for approval, failed test runs, reruns, and reopened pull requests.

That small sample is usually enough to make forecasts more honest. You do not need a long history before the pattern shows up. Most teams see the same problem quickly: coding looks fast, but review load and test noise keep stretching the path to release.

Keep the model small enough that someone updates it every week without a meeting. If the forecast needs a dashboard, custom scripts, and a long explanation, the team will stop using it. A plain table is enough for the first version:

original estimate
review rounds per task
failed CI runs
retry time
actual ship date

Use the numbers to fix the work, not only to defend dates. If reviews add two extra days every sprint, change who reviews what and how much work enters review at once. If flaky tests keep forcing reruns, clean the unstable suite before you ask people to move faster. If retry churn rises after more AI assisted coding, tighten the prompts, add better checks before review, or break work into smaller pieces.

A rough model that the team updates every week beats a perfect model that nobody touches. One team may learn that a single reviewer creates a queue every Thursday. Another may find that one unstable integration test burns three hours a week. Those are plain findings, but they affect delivery more than most planning rituals.

If you need outside help, keep it practical. Oleg Sotnikov shares this kind of AI first engineering and Fractional CTO work on oleg.is, with a strong focus on delivery systems, infrastructure, and automation that teams can actually run. The useful outcome is simple: forecasts that match real shipping work instead of draft speed alone.

Frequently Asked Questions

Why do AI estimates look fast but still ship late?

Because AI shortens draft time, not the full path to release. Review queues, test reruns, approvals, and cleanup still take the same human time unless your team fixes those bottlenecks too.

What should I include in an AI delivery estimate?

Start with the full elapsed path: draft time, wait for first review, review rounds, rework, CI reruns, staging checks, merge, and deploy. If your team hits those delays often, put them in the estimate by default.

How do I measure review load without guessing?

Pull 20 to 30 recent normal pull requests and measure real review behavior. Note how long each PR waited for first review, how many rounds it needed, and how much time the author spent fixing comments.

How do I account for flaky tests in the forecast?

Count pipelines that fail and then pass on a rerun of the same commit, with no code changes in between. Track how often that happens and how much wall time it burns, then spread that cost across similar work in your sprint plan.

Should I give a best-case estimate or a likely-case estimate?

Use two numbers. Best case assumes quick reviews and clean test runs, while likely case uses your normal review pace and normal failure rate. Promise the likely case if you want dates that hold up.

Can a small feature still take much longer than the AI draft?

Yes. Small changes still hit the same queue: review, CI, approval, and deploy. A tiny feature can lose hours if one reviewer is busy or one flaky test keeps turning the branch red.

What warning signs show that our forecast is wrong?

Watch for tasks that look done in development but sit in review, tickets that trigger repeated reruns, and plans that depend on one reviewer staying free all sprint. Those patterns usually mean your estimate missed queue time.

How much past data do we need before this becomes useful?

One month of shipped work often gives enough signal. You do not need a big system at first; a simple table with estimate, review rounds, failed CI runs, retry time, and ship date already shows where time goes.

How should I estimate work in risky parts of the codebase?

Add more buffer when the change touches auth, billing, migrations, or deploy scripts. Those areas pull in extra review, manual checks, and slower approval because one mistake can cause bigger damage.

What should I do if the likely case no longer fits the sprint?

Say it early and cut scope before the sprint starts. A smaller promise that clears review and tests beats a bigger promise that stalls halfway through and forces the team to rush at the end.