Automation project KPIs to track after your first pilot
Automation project KPIs show whether a pilot cuts delays, exceptions, and manual work so you can judge real operational impact.

Why pilot results can fool you
A pilot usually looks better than the work it is meant to replace. People watch it closely. Test cases are cleaner. Everyone wants the first run to succeed. That makes it easy to mistake a smooth demo for real operating impact.
Most pilots skip the messy parts of daily work. A task may move quickly once someone opens it, but that ignores time spent waiting in a queue, getting sent back for missing details, or sitting with another team for approval. If you only measure the happy path, the result will look faster than reality.
Early testers also protect the pilot without trying to. They pick simpler cases, know the process better than the average user, and stay patient when something breaks. Live work is less forgiving. Hard cases show up fast, inputs are unclear, and odd customer requests start eating time.
Small volume hides problems too. Ten items can move through a workflow with no visible stress. A hundred items can expose slow handoffs, rate limits, or a step that still needs a person to clean things up. Most operations do not run in a neat, even flow. Work arrives in clumps.
Timing matters just as much. One quiet Tuesday tells you almost nothing about month-end, payroll week, or the Monday after a holiday. If the pilot only ran during a calm stretch, it did not test normal pressure.
That is why post-pilot metrics should come from day-to-day operations, not a vendor demo or a single good run. Measure the process as it actually happens, including delays, rework, and human fixes. Those are the numbers you can defend when it is time to expand the project.
The numbers that matter most
After a pilot, teams often focus on success rates and demo screenshots. Those can look great while the real process still feels slow and messy. The numbers that matter are simpler: cycle time, exception rate, and manual touch minutes.
These stay close to the actual work instead of the tool screen. If staff still chase approvals, fix broken records, or re-enter data, the pilot removed less work than it first seemed.
Cycle time is the full time from trigger to done. Pick the real start point and the real finish point, then keep them fixed. If an invoice arrives at 9:00 a.m. and the approved record lands in the finance system at 3:00 p.m., the cycle time is six hours. Do not start the clock when someone opens the automation dashboard. That hides waiting time.
Exception rate shows how often work leaves the standard path. A case becomes an exception when someone has to step in, ask for missing information, correct data, retry a failed step, or move the work to email or chat. If 200 cases enter the process this week and 30 need special handling, the exception rate is 15%.
Manual touch minutes show how much human effort still sits inside the flow. Count every minute people spend across handoffs, not just the time in the first team. If one person checks a field for 2 minutes, a manager spends 1 minute approving, and finance spends 3 minutes fixing a mismatch, that case used 6 manual touch minutes.
These three numbers work as a set. Cycle time shows speed. Exception rate shows how often the flow breaks. Manual touch minutes show how much labor remains.
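As a minimal sketch, the three definitions above can be computed from a list of case records. The `Case` layout and the sample values are assumptions made up for illustration, not data from a real pilot.

```python
from dataclasses import dataclass, field
from datetime import datetime
from statistics import median


@dataclass
class Case:
    # Hypothetical record layout; adapt the fields to whatever your systems log.
    started: datetime            # the fixed trigger event
    finished: datetime           # the fixed done state
    is_exception: bool = False   # left the standard path at least once
    touch_minutes: list = field(default_factory=list)  # one entry per human action


def cycle_hours(case):
    """Full elapsed time from trigger to done, in hours."""
    return (case.finished - case.started).total_seconds() / 3600


def summarize(cases):
    """Return the three KPIs for a batch of cases."""
    return {
        "median_cycle_hours": median(cycle_hours(c) for c in cases),
        "exception_rate": sum(c.is_exception for c in cases) / len(cases),
        "touch_minutes_per_case": sum(sum(c.touch_minutes) for c in cases) / len(cases),
    }


# Made-up sample: the first case mirrors the 9:00 to 3:00 invoice and the
# 2 + 1 + 3 touch minutes described above.
cases = [
    Case(datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 15, 0), False, [2, 1, 3]),
    Case(datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 12, 0), False, [1]),
    Case(datetime(2024, 3, 4, 11, 0), datetime(2024, 3, 5, 11, 0), True, [4, 5, 3]),
    Case(datetime(2024, 3, 5, 9, 0), datetime(2024, 3, 5, 13, 0), False, []),
]
kpis = summarize(cases)
```

Keeping all three in one summary makes it harder to report a faster cycle time while the exception rate or the touch minutes quietly get worse.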
Write the definitions down and use the same ones every week. Keep the same trigger, the same done state, and the same rule for what counts as an exception or a touch. If those move around, the trend stops meaning much.
Set your baseline before you compare
A useful metric starts with a plain baseline. Do not compare a polished pilot to a messy month of normal work. Compare the same process, handled by the same kind of team, over a clear date range.
Start narrow. Pick one process, not a bundle of related tasks. If three teams handle it in different ways, begin with one team so the numbers come from a single routine instead of a mixed bag.
The date range matters more than most teams expect. Measure the old workflow for at least two full weeks before you compare it with the automated version. A few days rarely tell the truth. Mondays behave differently from Fridays, and month-end can distort everything.
Make the old workflow measurable
Write the baseline in plain language. Count how many items came in, how many staff hours went into them, and how many business hours passed before each item finished. If the process only runs during office hours, use that clock instead of 24/7 elapsed time.
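The difference between the two clocks is easy to underestimate. The sketch below counts only the hours that fall inside office hours. The Monday to Friday, 09:00 to 17:00 window is an assumption, and holidays are ignored.

```python
from datetime import datetime, time, timedelta

# Assumed office hours: Monday to Friday, 09:00 to 17:00. Holidays are ignored.
OPEN, CLOSE = time(9, 0), time(17, 0)


def business_hours_between(start, end):
    """Hours between start and end that fall inside office hours."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # 0 to 4 are Monday to Friday
            window_start = max(start, datetime.combine(day, OPEN))
            window_end = min(end, datetime.combine(day, CLOSE))
            if window_end > window_start:
                total += window_end - window_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


# An item that arrives Friday at 16:00 and finishes Monday at 10:00.
arrived = datetime(2024, 3, 8, 16, 0)
finished = datetime(2024, 3, 11, 10, 0)
wall_clock_hours = (finished - arrived).total_seconds() / 3600  # 66.0
office_hours = business_hours_between(arrived, finished)        # 2.0
```

Whichever clock you pick, use the same one for the baseline and for the automated version.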
Teams skip this step all the time. Then they claim the pilot cut cycle time when the real change was lower volume or a quieter week.
You also need a fixed definition of done. Be specific. A task is not done when someone clicks "approve" if finance still has to correct fields by hand later. It is done when the full workflow reaches the point the business actually cares about.
A short baseline note should answer five questions:
- Which process did you measure?
- Which team handled it?
- What were the exact start and end dates?
- What counted as finished?
- How did you record volume and staff time?
Keep the method simple enough that another manager could repeat it and get nearly the same result. That is the standard worth aiming for.
If you skip this step, the whole review turns into opinion. If you do it well, later numbers like cycle time, exception rate, and manual touch minutes start to mean something.
Measure cycle time step by step
Cycle time only helps if everyone measures the same thing. Pick one clear start event and write it in plain language. For an invoice flow, that might be "the invoice arrives in the finance inbox." For support, it could be "the customer request enters the ticket queue."
Do the same for the end event. Use it every time. "Approved" and "completed" sound close, but they can describe different moments in a real process. If one person stops the clock at approval and another stops it when the record reaches the ERP, the numbers lose value fast.
Use timestamps from the systems people already work in. Email, ticketing tools, CRMs, ERPs, and approval systems already log when work enters, changes state, and finishes. That is usually safer than asking staff to track time by hand.
A simple setup is enough. Define one start status and one end status. Pull timestamps from the same tool each week. Exclude records with missing or edited timestamps. Tag blocked cases separately from normal flow.
Look at median time before average time. Averages get pulled around by a few ugly cases. If three approvals finish in 20 minutes and one gets stuck for two days, the average makes the whole process look slower than it usually feels. The median gives a more honest view of the normal case.
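The example in the paragraph above works out like this, using Python's standard `statistics` module:

```python
from statistics import mean, median

# Three approvals at 20 minutes and one stuck for two days (2,880 minutes).
cycle_minutes = [20, 20, 20, 2 * 24 * 60]

average_minutes = mean(cycle_minutes)    # 735, a little over 12 hours
typical_minutes = median(cycle_minutes)  # 20.0
```

One stuck case moves the average from 20 minutes to more than 12 hours, while the median stays at 20.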
Blocked cases still matter, but they should not hide what the automation changed. Put supplier errors, missing data, compliance holds, and manager vacations in a separate bucket. Then compare two views: normal throughput and total elapsed time.
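A rough sketch of the two views, assuming a weekly export with one row per case. The field names and values here are hypothetical.

```python
from statistics import median

# Hypothetical weekly export: minutes from start status to end status,
# plus a blocked reason when the case waited on something outside the flow.
records = [
    {"id": "A1", "minutes": 18, "blocked_reason": None},
    {"id": "A2", "minutes": 22, "blocked_reason": None},
    {"id": "A3", "minutes": None, "blocked_reason": None},  # missing timestamp
    {"id": "A4", "minutes": 2900, "blocked_reason": "missing data"},
    {"id": "A5", "minutes": 25, "blocked_reason": None},
    {"id": "A6", "minutes": 1500, "blocked_reason": "compliance hold"},
]

# Drop records whose timestamps are missing, but keep count of them.
usable = [r for r in records if r["minutes"] is not None]
excluded = len(records) - len(usable)

normal = [r["minutes"] for r in usable if r["blocked_reason"] is None]
everything = [r["minutes"] for r in usable]

normal_median = median(normal)      # routine flow only
total_median = median(everything)   # routine plus blocked cases
blocked_count = len(everything) - len(normal)
```

Report the excluded and blocked counts next to the two medians, so nobody reads the cleaner number as the whole story.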
This is where teams often go wrong. They celebrate a faster demo, then measure live work with fuzzy start points, mixed end points, and no clear split between routine work and exceptions. Tight definitions fix most of that.
Track exceptions and manual touch minutes
A pilot can look fast on clean test cases and still create extra work in daily operations. That is why exception rate and manual touch minutes matter so much. They show where the process falls out of the standard path and how much staff time it still takes to rescue it.
Start by writing down every reason a case leaves the normal flow. Be literal. Do not use vague labels like "system issue" or "needs review" if several different problems sit inside them. If someone fixes a broken field mapping, chases a missing document, approves a risky case, or re-enters data from email, log those as separate reasons at first.
After a week or two, group the repeats under simple labels such as missing input, failed validation, duplicate record, approval needed, or data re-entry. The labels do not need to be clever. They need to be consistent.
Then count each human action. If one order needs two approvals and one data fix, that is three touches, not one. Teams often undercount because they track which cases went wrong, but not how many times a person had to step in.
Time each action on its own. Do not measure the whole wait from inbox to resolution, because much of that time is idle. Measure the actual work: 40 seconds to correct a field, 2 minutes to verify a value, 90 seconds to approve. That is what makes manual touch minutes useful. Ten two-minute interventions a day quietly add up to more than an hour and a half a week.
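One way to keep the counting honest is to log one row per human action and add up from there. This is a sketch with a made-up log. The case IDs, reasons, and durations are illustrative only.

```python
from collections import defaultdict

# Hypothetical touch log: one row per human action, timed in seconds.
touch_log = [
    ("order-17", "approval needed", 90),
    ("order-17", "approval needed", 90),
    ("order-17", "failed validation", 40),
    ("order-18", "missing input", 120),
    ("order-19", "failed validation", 40),
]

touches_per_case = defaultdict(int)
seconds_per_case = defaultdict(int)
seconds_per_reason = defaultdict(int)

for case_id, reason, seconds in touch_log:
    touches_per_case[case_id] += 1          # two approvals and one fix count as three
    seconds_per_case[case_id] += seconds
    seconds_per_reason[reason] += seconds

total_touch_minutes = sum(seconds_per_case.values()) / 60
```

The per-reason totals show which label costs the most staff time, which is usually a better place to start than the label that appears most often.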
A small hand check every week keeps the numbers honest. Pick 20 to 30 cases, read the logs, and compare them with what people actually did. This is usually where teams find missed touches, vague labels, or work handled in chat and never recorded in the main system.
These metrics are hard to fake. Vendor demos rarely show messy cases. Your operations data will.
Turn the numbers into a weekly scorecard
A weekly scorecard keeps the conversation grounded. Pilot wins often look better than daily reality, so put the same few numbers in front of people every week and make them easy to compare.
Keep it to one page. If someone needs three dashboards to understand what changed, the scorecard is too busy.
Show the baseline and the current week side by side. Put transaction volume next to each metric or the numbers can mislead you. A drop in cycle time means less if volume also dropped by 40%.
| Metric | Baseline | This week | Volume | Note |
|---|---|---|---|---|
| Cycle time | 18 min | 11 min | 1,240 items | One approval rule removed |
| Exception rate | 7.5% | 9.2% | 1,240 items | Supplier format issue on Tuesday |
| Manual touch minutes | 320 min/day | 140 min/day | 1,240 items | No staffing change |
This layout does two jobs at once. It shows whether the process improved, and it shows whether the workload changed. Without that context, people can mistake a quiet week for a better system.
Add a short note when something unusual happened. An outage, a policy change, a staffing gap, or a new approval rule can move the numbers quickly. One plain sentence is enough.
Do not overreact to tiny swings. If exception rate moves from 3.1% to 3.3%, that is usually noise. If it jumps from 3% to 8% in one week, someone should check the process that day.
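If you want a rough rule for what counts as noise, one option is to compare the swing with the variation you would expect from volume alone. The sketch below uses a standard two-proportion comparison. The cutoff of 2 and the weekly volume of 1,000 cases are assumptions for illustration, and this is a rule of thumb, not a substitute for asking operations what happened.

```python
from math import sqrt


def swing_is_noise(prev_exceptions, prev_volume, cur_exceptions, cur_volume, z_cutoff=2.0):
    """Return True when a change in exception rate is small relative to volume.

    z_cutoff is an assumed threshold; raise it to flag fewer weeks.
    """
    p1 = prev_exceptions / prev_volume
    p2 = cur_exceptions / cur_volume
    pooled = (prev_exceptions + cur_exceptions) / (prev_volume + cur_volume)
    std_error = sqrt(pooled * (1 - pooled) * (1 / prev_volume + 1 / cur_volume))
    if std_error == 0:
        return p1 == p2
    return abs(p2 - p1) / std_error < z_cutoff


# Assumed volume of 1,000 cases in each week.
small_move = swing_is_noise(31, 1000, 33, 1000)   # 3.1% to 3.3%
big_jump = swing_is_noise(30, 1000, 80, 1000)     # 3% to 8%
```

At low volume the same percentage swing is more likely to be noise, which is another reason to keep volume on the scorecard.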
Share the scorecard with operations first. They live with the process and can usually explain what changed faster than anyone else.
A simple example from invoice approvals
A finance team received supplier invoices by email, opened each file, and typed the amount, date, supplier name, and PO number into the finance system by hand. The pilot looked impressive because the tool could read most invoices in seconds. On a demo screen, it felt as if the hard part was solved.
Daily work told a different story. Clean invoices moved much faster, but messy ones slowed the team down. If a supplier sent a complete invoice in the usual format, cycle time fell from about 18 minutes to 6. If the invoice missed a PO number or had unclear totals, the tool stopped, raised an exception, and sent the job back to finance for review.
After the first week, the picture looked more like this:
- Clean invoices: cycle time 6 minutes, manual touch 3 minutes
- Exception invoices: cycle time 24 minutes, manual touch 12 minutes
- Overall exception rate: high enough to wipe out much of the time saved on easy cases
Manual touch minutes stayed high for one simple reason. Finance staff still checked totals before posting the invoice, even when the tool extracted the fields correctly. Early caution like that is common, and often sensible.
That is why live metrics matter more than a polished demo. A fast result on half the invoices does not mean the full process is faster. If exceptions pile up, total cycle time can even get worse.
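The arithmetic behind that claim is a weighted average. The sketch below uses the figures from this example (18 minutes before, 6 minutes for clean invoices, 24 minutes for exceptions). The exception rates passed in are hypothetical inputs, not results from the team.

```python
def blended_cycle_minutes(exception_rate, clean_minutes=6, exception_minutes=24):
    """Volume-weighted average cycle time across clean and exception invoices."""
    return (1 - exception_rate) * clean_minutes + exception_rate * exception_minutes


def break_even_exception_rate(baseline_minutes=18, clean_minutes=6, exception_minutes=24):
    """Exception rate at which the blended time equals the old baseline."""
    return (baseline_minutes - clean_minutes) / (exception_minutes - clean_minutes)


half_exceptions = blended_cycle_minutes(0.5)   # 15.0 minutes, still under 18
mostly_exceptions = blended_cycle_minutes(0.8) # about 20.4 minutes, worse than before
break_even = break_even_exception_rate()       # about 0.67
```

With these figures the pilot stops saving time once roughly two in three invoices raise an exception. The same calculation works for manual touch minutes if you have a baseline for them.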
The team avoided a bad rollout by fixing the input rules first. They tightened supplier forms, made required fields truly required, and cleaned up the cases that caused most exceptions. Only then did they expand the pilot to more suppliers.
Mistakes that hide the real impact
The fastest way to misread results is to count only the cases that finish cleanly. Teams report a high automation rate while awkward cases still eat most of the staff time. If a person jumps in to fix data, chase an approval, or rerun a task, that work belongs in the result.
Single summary numbers cause trouble too. A process can look fast on paper because the typical item finishes in 5 minutes while a smaller batch sits for 2 days in review. The team does not feel the typical case. They feel the long waits and the angry follow-up messages. Report the slow tail next to the median so it stays visible.
Pilot numbers also get distorted when fresh automated work is mixed with old backlog. This happens often in invoicing, support queues, and internal approvals. Old items were already late before the pilot started, but they still drag down the report. The opposite can happen too: a team clears backlog by hand, and the pilot gets credit for the catch-up.
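A simple way to keep the two apart is to split items by creation date before you summarize them. The pilot start date and the items below are made up for illustration.

```python
from datetime import date
from statistics import median

PILOT_START = date(2024, 3, 4)  # assumed first day of pilot traffic

items = [
    {"id": 1, "created": date(2024, 2, 20), "cycle_days": 16},
    {"id": 2, "created": date(2024, 2, 27), "cycle_days": 9},
    {"id": 3, "created": date(2024, 3, 4), "cycle_days": 1},
    {"id": 4, "created": date(2024, 3, 5), "cycle_days": 2},
    {"id": 5, "created": date(2024, 3, 6), "cycle_days": 1},
]

pilot = [i["cycle_days"] for i in items if i["created"] >= PILOT_START]
backlog = [i["cycle_days"] for i in items if i["created"] < PILOT_START]

mixed_median = median(pilot + backlog)  # what a combined report would show
pilot_median = median(pilot)            # new traffic only
backlog_median = median(backlog)        # old items, reported separately
```

In this made-up sample the combined report shows a median of 2 days, while new pilot traffic alone sits at 1 day and the backlog at 12.5.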
Another blind spot sits outside the tool. People patch records in chat, confirm details by email, or track exceptions in a spreadsheet because the new flow does not cover every case yet. On the dashboard, cycle time drops. In reality, the work just moved somewhere no one measures.
One more mistake is changing the process and the KPI definition at the same time. If you remove an approval step, rewrite the intake form, and start measuring a new version of cycle time in the same week, you cannot tell what caused the gain.
A short checklist helps keep the numbers honest:
- Count every case, including failures and assisted runs.
- Separate new pilot traffic from backlog.
- Track manual work outside the main system.
- Keep KPI definitions fixed while you test changes.
If the dashboard says things are faster but the team still feels buried, believe the team first. That mismatch usually points to work your measurement missed.
Quick checks before you trust the numbers
A clean dashboard can still tell the wrong story. Before you treat the metrics as proof, check whether people measure the same thing the same way. If one person starts the clock when a request arrives and another starts it when the bot picks it up, the cycle time number is already broken.
Write down the start point and end point in plain language. Then ask two people from different teams to label five real cases on their own. If they disagree, fix the definition before you compare weeks or tools.
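The comparison itself takes a few lines. The reviewer labels below are hypothetical.

```python
# Hypothetical labels from two reviewers working independently on five cases.
reviewer_a = {"c1": "standard", "c2": "exception", "c3": "standard",
              "c4": "exception", "c5": "standard"}
reviewer_b = {"c1": "standard", "c2": "exception", "c3": "exception",
              "c4": "exception", "c5": "standard"}

# Cases where the two reviewers applied the definition differently.
disagreements = sorted(c for c in reviewer_a if reviewer_a[c] != reviewer_b[c])
agreement = 1 - len(disagreements) / len(reviewer_a)
```

The list of disagreements matters more than the percentage. Each one points at a sentence in the definition that two people read differently.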
Manual work often hides outside the automation tool. Someone fixes a bad record in email, updates a spreadsheet, or pushes a task through chat, and none of that appears in the logs. Keep a simple way for staff to record manual fixes, even if it is just a shared note or a short form.
The sample also needs normal pressure. A quiet Tuesday and a month-end Friday do not behave the same way. Pull cases from busy days, quiet days, and at least one period when the team felt overloaded.
Spikes need plain answers. When cycle time doubles or manual touch minutes jump, someone should explain the change in one sentence. "A supplier changed the invoice format" is useful. "The system had issues" is not. If nobody can explain the spike, check the data before blaming the process.
Last, show the numbers to the people who run the work every day. Operations leaders usually spot bad metrics quickly because they know when the data feels off. If the scorecard says manual effort fell by 60% but the team still stays late to clear the queue, trust the queue, not the chart.
Good metrics are boring in the best way. People define them the same way, capture hidden work, cover normal demand swings, and match what the team sees on the floor.
What to do after the first 30 days
After 30 days, you usually know enough to make a clear decision. Keep the pilot running if cycle time drops, exception rate stays under control or falls, and manual touch minutes go down. When all three move in the right direction, the change is helping the team instead of just looking good in a demo.
Do not rush into a wider rollout if one number improves and the others do not. A faster process that still creates the same number of exceptions often just pushes the work onto people later in the flow. The same is true when manual touch minutes barely change. If staff still spend the same time fixing, checking, or re-entering data, the automation is not doing enough yet.
Pausing is often cheaper than scaling a weak setup. If exceptions stay flat or manual touches barely move, stop adding volume. Pick one root cause, fix it, and measure again for another week or two.
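The rule can be written down so the decision does not drift from week to week. This is a sketch, not a policy. The `max_exception_increase` tolerance is an assumed way to express "under control", and the sample values reuse the figures from the scorecard table earlier in this article.

```python
def rollout_decision(baseline, current, max_exception_increase=0.0):
    """Return "scale" only when all three KPIs move the right way.

    baseline and current are dicts with the keys "cycle_minutes",
    "exception_pct", and "touch_minutes". max_exception_increase is an
    assumed tolerance, in percentage points, for "under control".
    """
    cycle_ok = current["cycle_minutes"] < baseline["cycle_minutes"]
    exceptions_ok = (
        current["exception_pct"] - baseline["exception_pct"] <= max_exception_increase
    )
    touch_ok = current["touch_minutes"] < baseline["touch_minutes"]
    if cycle_ok and exceptions_ok and touch_ok:
        return "scale"
    return "pause and fix one root cause"


baseline = {"cycle_minutes": 18, "exception_pct": 7.5, "touch_minutes": 320}
this_week = {"cycle_minutes": 11, "exception_pct": 9.2, "touch_minutes": 140}
decision = rollout_decision(baseline, this_week)
```

With those scorecard figures the result is a pause, because the exception rate rose even though the other two numbers improved.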
Most root causes are not mysterious. Bad input data, unclear approval rules, and edge cases no one mapped before the pilot can all distort the numbers. Fix one issue at a time so you can see what changed.
Use the weekly scorecard to choose the next step. Maybe the right move is a similar workflow with the same pattern. Maybe it is a smaller task with fewer exceptions. Maybe the process still needs better rules before any rollout.
If the review cuts across operations, product, and engineering, an outside technical lead can help. Oleg Sotnikov at oleg.is works as a fractional CTO and startup advisor on AI adoption, automation, and infrastructure, which fits this kind of post-pilot review.
Frequently Asked Questions
Why isn’t a strong pilot success rate enough?
Because pilots usually run under cleaner conditions than daily work. People watch them closely, testers pick easier cases, and small volume hides queue delays, rework, and approval bottlenecks.
Which KPIs matter most after the first pilot?
Start with cycle time, exception rate, and manual touch minutes. Those three numbers show speed, how often the flow breaks, and how much staff work still remains.
How should I define cycle time?
Pick one real start event and one real finish event, then keep both fixed every week. Start the clock when work actually enters the process, not when someone opens the automation tool.
Should I use average or median cycle time?
Use the median first because it shows the normal case better. A few ugly delays can pull the average up and make the process look slower than it feels most days.
What counts as an exception?
Count a case as an exception when it leaves the standard flow because someone has to fix data, ask for missing details, retry a failed step, or move the work to email or chat. If a person has to step in, log it.
How do I track manual touch minutes?
Time each human action inside the flow, even if it only takes a minute or less. Include every approval, data fix, verification, and re-entry across teams so you capture the real labor left in the process.
How long should I track a baseline before I compare results?
Measure the old process for at least two full weeks before you compare it with the automated version. That usually gives you enough busy days, quiet days, and routine variation to avoid a misleading comparison.
How do I stop backlog from skewing the numbers?
Separate new pilot traffic from old backlog from day one. If you mix them, late old items can drag the report down, or a manual cleanup can make the pilot look better than it is.
What should a weekly automation scorecard include?
Keep the scorecard simple and show baseline versus current week side by side with transaction volume. Add a short note for unusual events like an outage, a policy change, or a supplier format change.
When should I roll the automation out wider?
Scale when cycle time drops, exceptions stay under control or fall, and manual touch minutes go down. Pause when only one number improves, fix the main cause, and measure again before you add more volume.