Mar 15, 2025·7 min read

AI adoption business case: why finance should join early

An AI adoption business case gets clearer when finance joins early. Track model spend, review time, and error cost before you scale.

Why teams miss the real cost

Most teams start with the easiest number to find: the model bill. A prompt costs a few cents, so the project looks cheap. That is where the math usually breaks.

The model price is only one line in the business case. People still need to read outputs, fix bad answers, handle edge cases, and answer customers when something goes wrong. If a task needs human review 40% of the time, labor cost can overtake model spend very quickly.

A simple support workflow shows the problem. Say an AI assistant drafts 1,000 replies. The model cost may look too small to worry about. But if staff spend two minutes checking each draft, or ten minutes correcting the ones that break policy, payroll becomes the bigger expense. Teams that skip this step approve a budget that looks neat in a spreadsheet and feels expensive in real life.

Errors add another layer. One wrong answer rarely stays contained. It can trigger refunds, extra support tickets, rework by operations or engineering, and manual follow-up with the customer. That chain reaction is easy to miss because no single department owns all of it. Product sees the feature, support sees the complaint, finance sees the refund, and nobody pulls the full cost together.

This is why finance should join early. Finance asks harder questions. What share of outputs will need review? What error rate can the business afford? What happens to margin if usage doubles next quarter? Those questions push the team past hopeful assumptions.

This is also why experienced technical advisors tend to focus on architecture and operating cost, not just model choice. Oleg Sotnikov, for example, often frames AI projects around total workflow cost rather than API price alone. Cheap AI is not the same as affordable AI.

When finance joins before rollout, weak assumptions show up sooner. That saves money, and it saves teams from launching something they will spend months babysitting.

The numbers that shape the case

A useful business case has three cost buckets: model spend, review labor, and error cost. Most teams look at the first number and stop.

Model spend is the easiest part to see, but teams still undercount it. They price one clean request and forget retries, failed calls, test runs, and long outputs that use more tokens than expected. If a task costs 4 cents on paper, a 15% retry rate alone lifts it to about 4.6 cents, and a habit of returning long answers pushes the real cost higher still.

Review labor is often larger than model spend. Every person who checks, edits, approves, or sends AI output adds cost. That includes managers who spot-check work, support leads who fix wrong replies, and specialists who step in when the model's output is unclear. If a worker spends even 45 seconds per result, 20,000 results take 250 hours of review.

Error cost starts the moment bad output creates more work. A wrong summary forces a second review. A bad customer reply creates a follow-up ticket. A wrong extraction from a document can lead to a refund, a delay, or a compliance issue. These costs do not appear on the model bill, but the business still pays them.

Volume changes all three numbers at once. More requests mean a bigger API bill, more review time, and more chances for mistakes. That sounds obvious, but many teams still test on 200 tasks and assume the same economics will hold at 20,000.

A simple check helps. Estimate monthly volume, then run three cases: expected, busy month, and messy month. In the messy month, raise retries, review time, and error rate together. That is closer to real operations than a clean average.
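
The three-case check can be sketched in a few lines of Python. Every number below (volumes, the 4-cent call, retry and error rates, the $30 loaded rate, the $15 average cleanup cost) is an illustrative assumption, not a benchmark. The sketch also assumes reviewers check each task once, regardless of retries.

```python
from dataclasses import dataclass

HOURLY_RATE = 30.0      # assumed loaded labor cost, dollars per hour
COST_PER_ERROR = 15.0   # assumed average cleanup cost per error, dollars


@dataclass
class Scenario:
    name: str
    monthly_tasks: int
    model_cost_per_call: float  # dollars per call, illustrative
    retry_rate: float           # share of calls that get rerun
    review_seconds: float       # average human review time per task
    error_rate: float           # share of tasks that end in an error


def monthly_cost(s: Scenario) -> dict:
    """Split one scenario into the three cost buckets plus a total."""
    model = s.monthly_tasks * (1 + s.retry_rate) * s.model_cost_per_call
    review = s.monthly_tasks * s.review_seconds / 3600 * HOURLY_RATE
    errors = s.monthly_tasks * s.error_rate * COST_PER_ERROR
    return {"model": model, "review": review, "errors": errors,
            "total": model + review + errors}


scenarios = [
    Scenario("expected", 20_000, 0.04, 0.10, 45, 0.01),
    Scenario("busy", 30_000, 0.04, 0.10, 45, 0.01),
    # Messy month: retries, review time, and error rate all rise together.
    Scenario("messy", 30_000, 0.04, 0.25, 75, 0.03),
]

for s in scenarios:
    c = monthly_cost(s)
    print(f"{s.name:9s} model ${c['model']:>9,.0f}  review ${c['review']:>9,.0f}  "
          f"errors ${c['errors']:>9,.0f}  total ${c['total']:>9,.0f}")
```

With these made-up inputs, the model bill is the smallest bucket in every case, and the messy month costs roughly twice the busy month at the same volume.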

Lean AI workflows reduce all three costs at once. The best ones use shorter prompts, tighter outputs, automatic checks, and clear routing rules so only risky cases reach a human. That is when the business case starts to hold up.

How to estimate model spend

Model spend looks tiny in a demo. It gets real when people use the tool all day.

Do not start with a monthly budget guess. Start with task volume. Count how many times each workflow will call a model in a normal week. A support team that drafts 2,000 replies a week has a very different cost profile from a sales team that writes 80 follow-up emails.

Then measure four inputs for each task:

  • weekly request count
  • average prompt size
  • average output size
  • retry rate

Retry rate is easy to miss. People rerun prompts when the answer is weak, too long, blocked, or in the wrong format. Even a 10% to 20% retry rate can move spend more than most teams expect.

A rough estimate is enough at first. If one task runs 2,000 times a week, uses 1,200 input tokens and 400 output tokens per call, and gets retried 15% of the time, the rough total is 2,000 × 1,600 × 1.15, or about 3.68 million tokens for that week. That number is much more useful than saying "we will use AI a lot."
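
The same estimate as a small Python sketch. The request count, token sizes, and retry rate come from the example above. The two prices are hypothetical placeholders, so substitute your vendor's actual rates. The sketch assumes each retry resends the full prompt and produces a full-length answer.

```python
# Hypothetical prices in dollars per million tokens; replace with real vendor rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def weekly_usage(requests, input_tokens, output_tokens, retry_rate):
    """Return (input, output) tokens per week, counting retried calls."""
    calls = requests * (1 + retry_rate)
    return calls * input_tokens, calls * output_tokens


def weekly_cost(input_total, output_total):
    """Price input and output tokens separately, since vendors often do."""
    return (input_total * INPUT_PRICE_PER_M
            + output_total * OUTPUT_PRICE_PER_M) / 1_000_000


input_total, output_total = weekly_usage(2_000, 1_200, 400, 0.15)
total_tokens = input_total + output_total

print(f"tokens per week: {total_tokens:,.0f}")  # about 3.68 million
print(f"model spend per week: ${weekly_cost(input_total, output_total):.2f}")
```

Keeping input and output tokens separate matters when output tokens are priced higher than input tokens, because long answers then cost more than the raw token count suggests.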

Price normal traffic and busy periods separately. Most teams have peaks: product launches, billing cycles, holiday support, month-end reporting. If traffic doubles for one week each month, your model bill will not match the neat average in your spreadsheet. If you use a premium model as a fallback during peaks, split that traffic out too.

Test usage needs its own line as well. Teams spend tokens before launch on prompt tuning, QA, evaluations, failure checks, and internal trials. Early on, test traffic can rival live traffic. Later it usually settles down, but it never drops to zero if the team keeps improving prompts and workflows.

A clean estimate usually has three buckets:

  • test and setup usage
  • normal live usage
  • peak live usage

That gives finance a range instead of a guess. It also gives the team a way to compare model choices before they commit.
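
A minimal sketch of the three buckets for one month, assuming a four-week month in which traffic doubles for one week. The blended token price, the test volume, and the higher retry rate during testing are hypothetical values chosen only to show the shape of the calculation.

```python
PRICE_PER_MILLION_TOKENS = 5.00  # hypothetical blended price, dollars


def bucket_cost(requests, tokens_per_request, retry_rate):
    """Weekly dollar cost for one traffic bucket."""
    tokens = requests * (1 + retry_rate) * tokens_per_request
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


normal_week = bucket_cost(2_000, 1_600, 0.15)
peak_week = bucket_cost(4_000, 1_600, 0.15)   # traffic doubles for one week
test_week = bucket_cost(500, 1_600, 0.30)     # assumed prompt tuning and QA traffic

monthly = {
    "test and setup usage": 4 * test_week,
    "normal live usage": 3 * normal_week,
    "peak live usage": 1 * peak_week,
}

for bucket, cost in monthly.items():
    print(f"{bucket:22s} ${cost:7.2f}")
print(f"{'total':22s} ${sum(monthly.values()):7.2f}")
```

Running the same buckets with a second price constant is a quick way to compare two candidate models before committing.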

How to price review labor

Human review often costs more than the model call. Thirty seconds sounds small until you multiply it by hundreds of tasks a day and add the people who step in for edits.

Start by mapping every role that touches the output. Do not stop at the first reviewer. Many teams count the agent or analyst, then forget the team lead, compliance check, or manager approval that appears when a case looks risky. In practice, the review path often includes a front-line reviewer, a specialist who fixes unclear output, a manager who signs off on sensitive cases, and sometimes legal or compliance staff for regulated work.

Use real timings, not guesses from a planning meeting. Pick 20 to 50 recent tasks, run the AI-assisted flow, and measure the time. Include reading, checking sources, editing, and logging the result. Small timing errors matter. If you guess 45 seconds and the real average is 2 minutes, the math is wrong before the project starts.

It helps to split review into two buckets. Quick approval work is cheap. Full rewrite work is not. If 80% of outputs need a 20-second check, but 20% need a 6-minute rewrite, use both numbers. A single average hides the part that eats the budget.

Use loaded labor cost, not base salary alone. If a reviewer costs the company $45 an hour, a 2-minute review costs about $1.50. If a manager adds another minute for 10% of cases, that adds about 7.5 cents per task at the same hourly rate. Those small add-ons matter when you scale.
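
The two-bucket review math and the manager add-on can be checked with a short sketch. The 80/20 split, the 20-second and 6-minute timings, and the $45 rate all come from the examples above. Applying the same rate to every role is a simplification.

```python
def cost_of_time(seconds, hourly_rate):
    """Dollar cost of a block of review time at a loaded hourly rate."""
    return seconds / 3600 * hourly_rate


RATE = 45.0  # loaded cost per hour, from the example above

quick_check = cost_of_time(20, RATE)        # 20-second approval
full_rewrite = cost_of_time(6 * 60, RATE)   # 6-minute rewrite

# Blend the two buckets by how often each one happens.
blended = 0.80 * quick_check + 0.20 * full_rewrite
rewrite_share = 0.20 * full_rewrite / blended

# Manager adds one minute on 10% of cases.
manager_addon = 0.10 * cost_of_time(60, RATE)

print(f"quick check:   ${quick_check:.2f}")
print(f"full rewrite:  ${full_rewrite:.2f}")
print(f"blended:       ${blended:.2f} per task")
print(f"rewrite share: {rewrite_share:.0%} of review cost")
print(f"manager addon: ${manager_addon:.3f} per task")
```

In this example the rewrites are only a fifth of the volume but about four-fifths of the review cost, which is the part a single average hides.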

This step matters even more for lean teams. A small team saves money only when review stays light. If senior staff keep rewriting outputs, the tool may still help quality, but the economics change.

How to count error cost

Most teams underprice mistakes. They count the model bill, then forget the hours people spend fixing bad output, calming customers, and cleaning up records later.

Error cost often decides whether the numbers hold up. A cheap model can still be expensive if staff keep correcting its work.

Start by sorting errors into simple groups tied to real cleanup work:

  • minor errors that slow someone down but do little damage
  • serious errors that create rework, escalations, or customer friction
  • expensive errors that hit money fast, such as duplicate payments, bad invoices, chargebacks, or broken promises to customers

Then put dollars on the cleanup. Do not stop at the visible refund. Count the full chain. If one bad answer creates a support ticket, add the ticket cost. If an employee spends 12 minutes fixing the record, add labor. If finance issues a credit, add the credit. If a manager reviews edge cases, add that time too.

A simple estimate often works:

error cost = refunds or credits + rework labor + support handling + manual fixes

Say an AI workflow handles incoming invoices. A minor error takes 5 minutes of clerk time. At $24 an hour, that is about $2. A serious error may need 20 minutes from a clerk and 10 from finance, which lands near $14 to $18 depending on the finance rate. An expensive error could trigger a duplicate payment or vendor dispute, and that can jump to $150 or much more.

Use a range when impact changes by case. One mistake may need a quick correction. Another may create three emails, an approval delay, and a refund. Best case, usual case, and worst case are enough for most teams.

If you know the error rate, multiply it by volume. Even a 1% failure rate across 50,000 tasks means 500 errors, and that adds up quickly when each serious error costs real staff time. This is another reason to involve finance in AI projects: finance already knows where small mistakes turn into real expense.
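
Here is a rough sketch of that multiplication for the invoice example. The clerk rate, the timings, and the $150 figure come from the text. The $48 finance rate and the split of errors across the three groups are assumptions made up for illustration.

```python
def labor_cost(minutes, hourly_rate):
    """Dollar cost of cleanup time at a given hourly rate."""
    return minutes / 60 * hourly_rate


CLERK_RATE = 24.0    # dollars per hour, from the invoice example
FINANCE_RATE = 48.0  # assumption: the text does not state a finance rate

cost_per_error = {
    "minor": labor_cost(5, CLERK_RATE),
    "serious": labor_cost(20, CLERK_RATE) + labor_cost(10, FINANCE_RATE),
    "expensive": 150.0,  # lower bound from the example; real cases can cost more
}

# Assumed mix: what share of all errors falls into each group.
error_mix = {"minor": 0.80, "serious": 0.18, "expensive": 0.02}


def expected_error_cost(tasks, error_rate, mix, costs):
    """Total cleanup cost across all error groups for a given volume."""
    errors = tasks * error_rate
    return sum(errors * share * costs[kind] for kind, share in mix.items())


total = expected_error_cost(50_000, 0.01, error_mix, cost_per_error)
print(f"expected error cost: ${total:,.0f}")
```

Under this assumed mix, expensive errors are 2% of the mistakes but about 40% of the cleanup cost, so the rare group deserves its own line in the estimate.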

A customer support example

A support team handles 1,000 customer replies a week. They use AI to draft every reply, but agents still check each one before it goes out.

That setup sounds cheap until you count all three costs together.

Simple weekly math

Assume the team pays agents a loaded rate of $28 an hour. Before AI, an agent writes each reply in about 2.5 minutes, so 1,000 replies take roughly 41.7 hours. That is about $1,167 a week in labor.

With AI, the work changes instead of disappearing. In one realistic week:

  • 820 replies look good and an agent approves them in 40 seconds
  • 140 replies need a rewrite and take 3 minutes each
  • 40 replies are risky or unclear, so the agent escalates them and spends 10 minutes each
  • model spend for drafting and a few checks adds another $18 for the week

That review work adds up to about 22.8 hours. At $28 an hour, review labor costs about $638. Add the $18 in model spend, and the AI workflow costs about $656 for the week.

On paper, the team saves about $510 a week compared with manual writing. That looks good. But finance should ask one more question: how often does the system create a costly mistake?

Where savings disappear

Imagine the AI drafts one refund promise that breaks policy. The agent misses it during a busy shift, and the company honors a $600 credit to avoid a larger complaint. That single error wipes out the week's savings.

The same thing can happen with a wrong shipping promise, a bad return exception, or a reply that pushes an angry customer to leave. Even if those cases are rare, they belong in the math.

Approval speed alone does not prove the case. If most replies pass in under a minute, the numbers can look strong. But if a small share needs heavy rewrites, or one bad answer costs real money, the margin gets thin fast.
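
The weekly math in this example, and the point where costly errors erase the savings, can be reproduced with a short sketch. All inputs come from the example. The code works with unrounded figures, so its results can differ from the rounded numbers in the text by a dollar or two.

```python
HOURLY_RATE = 28.0   # loaded agent rate, dollars per hour
MODEL_SPEND = 18.0   # weekly model spend, dollars
CREDIT = 600.0       # cost of one policy-breaking refund promise


def hours(count, seconds_each):
    """Total hours spent on `count` items at `seconds_each` apiece."""
    return count * seconds_each / 3600


# Before AI: 1,000 replies at 2.5 minutes each.
manual_cost = hours(1_000, 150) * HOURLY_RATE

# With AI: quick approvals, rewrites, and escalations.
review_hours = hours(820, 40) + hours(140, 180) + hours(40, 600)
ai_cost = review_hours * HOURLY_RATE + MODEL_SPEND

weekly_savings = manual_cost - ai_cost
break_even_errors = weekly_savings / CREDIT

print(f"manual cost:    ${manual_cost:,.2f}")
print(f"review hours:   {review_hours:.1f}")
print(f"AI workflow:    ${ai_cost:,.2f}")
print(f"weekly savings: ${weekly_savings:,.2f}")
print(f"break-even:     {break_even_errors:.2f} costly errors per week")
```

The break-even value is below one, which means a single $600 credit in a week of 1,000 replies is enough to turn the savings negative.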

Common forecasting mistakes

Vendor pricing is a starting point, not a budget. Teams often look at the price per million tokens, multiply by rough usage, and stop there. That misses the extra calls around the main task: retries, safety checks, tool calls, summaries, logging, and fallback models when the first answer fails.

A support bot is a common example. On paper, one customer chat may look like one cheap model request. In practice, that chat can turn into six or seven requests once the system rewrites the prompt, fetches account data, asks another model to review tone, and retries after a timeout.

Review labor also gets cut too aggressively in early forecasts. People assume the model will handle almost everything after a short setup period. It rarely works that way. Someone still reads outputs, fixes edge cases, checks policy risk, and handles conversations the model could not finish.

That review time may shrink later, but it usually does not fall anywhere near zero. If a human spends even 45 seconds checking each output, labor can be larger than the model bill.

Long conversations break weak estimates too. A ten-turn exchange is not ten equal requests. Each new turn often carries earlier context, so token use grows as the conversation gets longer. Add retries, clarifications, and tool errors, and the cost per session climbs quickly.
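
A small sketch shows why a ten-turn exchange is not ten equal requests. It assumes the full history is resent on every turn, with no truncation, summarization, or prompt caching. The per-turn token sizes are made up.

```python
def conversation_tokens(turns, user_tokens, reply_tokens):
    """Return (input, output) tokens for a chat that resends its full history."""
    total_input = 0
    total_output = 0
    history = 0  # tokens carried over from earlier turns
    for _ in range(turns):
        prompt = history + user_tokens   # earlier context plus the new message
        total_input += prompt
        total_output += reply_tokens
        history = prompt + reply_tokens  # next turn carries this reply too
    return total_input, total_output


input_total, output_total = conversation_tokens(10, 100, 200)
naive_total = 10 * (100 + 200)  # pricing the chat as ten independent requests

print(f"input tokens:  {input_total:,}")
print(f"output tokens: {output_total:,}")
print(f"naive estimate: {naive_total:,} tokens")
print(f"actual / naive: {(input_total + output_total) / naive_total:.1f}x")
```

Under these assumptions the session uses 16,500 tokens against a naive estimate of 3,000, and that is before any retries or tool errors.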

Where forecasts usually fail

The worst forecasts use best-case accuracy as the default. Teams test on easy prompts, get a strong result, and project that rate across real work. Real work is messier. Users write unclear requests. Internal data has gaps. Rules change. The model makes small mistakes that humans must catch.

Error cost is the part people avoid because it feels awkward to price. It still belongs in the business case. A bad answer can trigger a refund, a compliance check, a lost sale, or ten extra minutes of staff time. If the process touches money, contracts, or customer trust, even a small error rate can wipe out the savings.

Teams that run lean operations usually estimate the ugly version first: average accuracy, real review time, longer chats, and some failure. If the project still works under that pressure, the case is probably real.

Checks before you commit

Many teams approve an AI project with rough averages and hope the details sort themselves out later. That usually makes the math look better than it is. A cleaner business case starts with four checks you can run before anyone signs off.

First, pin down the current cost per task. Do not use a monthly team budget and divide it by a guess. Pick one real task, count the minutes, include software and vendor costs, and work from there. If a support agent handles a ticket in 6 minutes at a loaded cost of $30 per hour, that task costs about $3 before AI enters the picture.

Then measure review work. Someone will check the output, and that time is not free. Name the reviewer, measure how long they spend, and note when senior staff step in. Five minutes of review by a team lead can erase the savings from a cheap model very quickly.

You also need error data, even if it feels messy. Do not stop at "the model makes mistakes." Separate the errors into simple buckets, such as a wrong answer caught quickly, a wrong answer that needs rework, a wrong action sent to a customer, or an issue that creates a refund, credit, or lost sale. Those buckets help you price recovery instead of treating every error as the same.

Last, make finance and operations use the same assumptions. Teams often model different numbers for volume, review rate, or wage cost. Operations may assume reviewers check 20% of outputs. Finance may model 100% review for the first quarter. Both can sound reasonable, but they lead to very different results.

A one-page check is often enough:

  • current cost per task
  • expected model cost per task
  • review time and reviewer pay rate
  • error rate by type
  • recovery cost for each error type

If any line is still a guess, mark it clearly. That does not kill the project. It tells you where to run a small pilot first.
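
The one-page check can be turned into a per-task comparison. In this sketch, the $3 current cost is the support-ticket example from above. The model cost, review time, error rates, and recovery costs are placeholder values, and the `measured` flags are a hypothetical way to mark which lines are still guesses.

```python
HOURLY_RATE = 30.0           # loaded reviewer rate, dollars per hour
CURRENT_COST_PER_TASK = 3.00 # 6 minutes at $30 per hour

model_cost_per_task = 0.04   # placeholder
review_seconds = 60          # placeholder

# error type -> (rate per task, recovery cost in dollars, measured?)
errors = {
    "caught quickly":   (0.050, 0.25, True),
    "needs rework":     (0.020, 3.00, True),
    "sent to customer": (0.005, 20.00, False),
    "refund or credit": (0.001, 150.00, False),
}

review_cost = review_seconds / 3600 * HOURLY_RATE
error_cost = sum(rate * recovery for rate, recovery, _ in errors.values())
ai_cost_per_task = model_cost_per_task + review_cost + error_cost
guesses = [name for name, (_, _, measured) in errors.items() if not measured]

print(f"current cost per task: ${CURRENT_COST_PER_TASK:.2f}")
print(f"AI cost per task:      ${ai_cost_per_task:.4f}")
print(f"  model  ${model_cost_per_task:.4f}")
print(f"  review ${review_cost:.4f}")
print(f"  errors ${error_cost:.4f}")
print(f"still a guess, pilot these first: {', '.join(guesses)}")
```

The comparison is only as good as its weakest line, so the unmeasured error types are the ones worth testing in a pilot.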

What to do next

Do not roll this out across five teams at once. Pick one workflow with clear volume, a clear owner, and a simple before-and-after measure. Customer support, invoice checks, lead qualification, and internal document summaries work well because teams already know how often the work happens and who reviews it.

Bring finance in from day one. One person from product or operations should own the pilot, and one person from finance should check the numbers with them. That keeps the test grounded in real cost instead of guesswork.

Run a short pilot for two or three weeks and log the same fields every day:

  • model spend
  • review time
  • errors, rework, refunds, or missed tasks
  • task volume and turnaround time

Keep the setup boring on purpose. Use the same prompts, the same review rule, and the same success measure for the full pilot. If the team changes the process every other day, the math will not tell you much.

When the pilot ends, rebuild the case from the data. Compare the new process to the old one on a per-task basis. A support team might find that each AI-assisted reply adds only a few cents in model spend, cuts handling time by a minute, and still needs 30 seconds of human review. That is enough to make a clean decision. If error cost wipes out the labor savings, stop and fix the workflow before you expand it.
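
Rebuilding the case from pilot data can be as simple as summing the daily log and dividing by task volume. The field names, the three sample days, the $30 rate, and the $3 baseline below are hypothetical. A real pilot would load two or three weeks of records.

```python
HOURLY_RATE = 30.0             # assumed loaded reviewer rate, dollars per hour
BASELINE_COST_PER_TASK = 3.00  # assumed cost of the old process per task

# One record per pilot day; values are invented for illustration.
pilot_log = [
    {"tasks": 180, "model_spend": 3.10, "review_minutes": 95, "error_cost": 0.0},
    {"tasks": 210, "model_spend": 3.60, "review_minutes": 120, "error_cost": 45.0},
    {"tasks": 195, "model_spend": 3.30, "review_minutes": 100, "error_cost": 12.0},
]


def per_task_cost(log, hourly_rate):
    """Average all-in cost per task across the logged pilot days."""
    tasks = sum(day["tasks"] for day in log)
    model = sum(day["model_spend"] for day in log)
    review = sum(day["review_minutes"] for day in log) / 60 * hourly_rate
    errors = sum(day["error_cost"] for day in log)
    return (model + review + errors) / tasks


ai_cost = per_task_cost(pilot_log, HOURLY_RATE)
print(f"AI-assisted cost per task: ${ai_cost:.2f}")
print(f"baseline cost per task:    ${BASELINE_COST_PER_TASK:.2f}")
print("expand" if ai_cost < BASELINE_COST_PER_TASK else "stop and fix the workflow")
```

Logging error cost in dollars each day, rather than only counting errors, keeps the final comparison on a single per-task number.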

At that point, the discussion gets much simpler. You are no longer debating hopes or vendor claims. You are looking at your own volume, your own review labor cost, and your own error cost.

If a rollout touches product, infrastructure, and finance at the same time, outside help can speed up the decision. Oleg Sotnikov at oleg.is works with companies on AI-first development, lean infrastructure, and Fractional CTO support, so this kind of cost review fits naturally into a pilot before a wider launch.

Frequently Asked Questions

Why is model price alone a bad way to judge an AI project?

Because the API bill is only one part of the cost. People still spend time checking answers, fixing weak drafts, and cleaning up mistakes, and that labor often grows faster than model spend.

What costs should I include in an AI business case?

Start with three buckets: model spend, review labor, and error cost. Then add real volume, retries, test usage, and peak periods so the estimate matches daily work instead of a neat demo.

How do I estimate model spend without guessing?

Begin with task volume, not a monthly budget guess. Count how many requests each workflow makes, measure average prompt and output size, add retry rate, and split test traffic from normal and peak live traffic.

How much human review is too much?

It becomes a problem as soon as humans review almost every result or senior staff rewrite too many of them. Even 45 seconds per task adds up fast at scale, and a small share of 5- to 10-minute fixes can wipe out the savings.

How should I put a dollar value on AI mistakes?

Sort mistakes by cleanup cost, then price the work each one creates. Add refunds or credits, support time, manual fixes, rework, and any manager or finance time that follows the error.

When should finance join the project?

Bring finance in before rollout, and ideally before the pilot starts. Finance pushes the team to test real assumptions on review rate, error cost, and margin under heavier usage instead of trusting best-case numbers.

What is the best first workflow to test?

Pick one workflow with clear volume, a clear owner, and an easy before-and-after measure. Customer support, invoice checks, lead qualification, or internal summaries usually work well because teams already track the work.

How long should an AI pilot run?

Two or three weeks usually gives you enough data if you keep the setup stable. Use the same prompts, the same review rule, and the same success measure so you can compare results cleanly.

When does AI actually save money in customer support?

The math works when most drafts need only a quick check, few cases need full rewrites, and costly errors stay rare. If agents or managers keep stepping in, the tool may still help, but the savings will shrink fast.

What forecasting mistakes do teams make most often?

Teams often assume one task means one cheap model call, use best-case accuracy, ignore retries and long conversations, and forget test usage. They also cut review time too hard and leave error cost out because it feels messy.