Jun 22, 2025·8 min read

How to estimate AI feature costs before pricing gets messy

Learn how to estimate AI feature costs by mapping usage, retries, peak load, review time, and margin risk before finance asks hard questions.

Why pricing gets messy fast

A demo hides the parts that drive real cost. One person asks one clean question, the model answers once, and the result looks cheap.

Real usage is messier. People retry, rephrase, paste rough input, refresh the page, and ask follow-up questions. One neat interaction turns into several model calls.

That is why cost estimates should start with behavior, not token prices. A model can look cheap on paper and still burn through budget once traffic grows. Even a small retry rate changes the math. If 3 out of 100 requests need another pass, that sounds minor. At scale, those extra calls pile up every day, and they often happen during the busiest hours.

Average traffic can hide the worst part. Many teams build a forecast around daily volume, then get surprised when most requests land in a short window. A support tool may sit quiet for hours and then get flooded right after a product update or billing issue. That peak does not just raise infrastructure needs. It also causes more timeouts, more retries, and more fallback calls to another model.

People costs are easy to miss too. Teams often budget for tokens and maybe some storage, but forget the time spent reviewing outputs. If a person checks replies, fixes edge cases, or handles the cases the model got wrong, that labor is part of the feature cost. In a lot of products, review and support follow-up cost more than inference.

Take a support drafting feature. In a pilot with 500 tickets, the numbers may look fine. Move the same feature to 20,000 tickets a month, add repeat prompts, moderation checks, and a review step for risky answers, and the margin changes fast. By the time finance asks for numbers, the expensive part is rarely the first demo call. It is the repeat work around it.

What to count before you open a spreadsheet

Start with user actions, not model prices. Cost estimates get shaky when you begin with tokens and skip the product behavior that creates them.

Write down every action that can trigger the feature. That might be a user sending a prompt, uploading a file, clicking "rewrite," asking for a summary, or running the same task again after an edit. If the feature runs in the background, count those triggers too. Auto-tagging, nightly summaries, and moderation checks still create cost, even when users never see them.

Map each action to the actual work

One action often creates more than one model call. A support reply draft might classify the ticket, fetch past context, generate the draft, and then run a safety check. If you count that as one call, your estimate will miss the real bill.

For each step, note the rough input and output size. Count prompt text, system instructions, retrieved documents, chat history, and any file content that gets parsed. Then add storage. Conversation history, embeddings, logs, screenshots, and audit records may look cheap per item, but they build up over time.

A simple worksheet usually needs four columns:

the user or system action
the calls created behind the scenes
the data volume for each call and anything stored after it
the people time needed when automation stops short

Do not leave people costs for later. If staff review low-confidence outputs, handle escalations, answer user complaints, or test prompt changes, that belongs in the forecast. Even 2 to 5 minutes of review on a small share of requests can change margins fast.

QA time matters too. Teams often retest prompts, edge cases, and output quality after every update. If the feature touches customer messages, contracts, or internal knowledge, someone will spend time checking whether the result is safe and usable.

The spreadsheet is the easy part. The hard part is counting everything that happens before, during, and after one user action.

Turn product usage into request volume

Total signups distort forecasts. Most people will not use the feature regularly, and some will try it once and disappear. Use active users instead: daily active users for something built into a daily workflow, or weekly active users for a feature people open less often.

Then define the exact action that creates cost. It might be a chat prompt, a summary, a rewrite, or an agent run. Count that unit, not page views or logins. A product with 2,000 active users can stay cheap if each person sends one request a week. A product with 200 active users can get expensive fast if each person runs 20 requests a day.

One average hides the real pattern. Split users by behavior so the math looks like real usage, not a clean spreadsheet.

Light users try the feature now and then.
Regular users depend on it during normal work.
Heavy users build it into a repeated task and send many more requests.

Now give each group a share of your active users and a rough action rate. Even simple numbers help. If 60% are light, 30% are regular, and 10% are heavy, you can multiply group size by actions per person and add the totals. That gives you a much better forecast than a single blended average.

After that, write three demand cases. Finance will ask for them anyway.

Low case: slower adoption and fewer repeat actions.
Base case: expected adoption and normal habits.
High case: a strong launch, one big customer, or a workflow that catches on faster than planned.

Here is a simple example. Say you have 500 weekly active users. If 300 light users make 2 requests a week, 150 regular users make 10, and 50 heavy users make 40, you get 4,100 requests a week before retries, fallbacks, and human review. That number is not perfect, but it is solid enough to test pricing.
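The same arithmetic can be kept as a small script, which makes it easy to change one group and see the total move. This is a minimal sketch that uses the numbers from the example above; the dictionary layout and the function name `weekly_requests` are illustrative choices, not part of any tool.

```python
# Weekly request volume by user segment (numbers from the example above).
segments = {
    "light": {"users": 300, "requests_per_week": 2},
    "regular": {"users": 150, "requests_per_week": 10},
    "heavy": {"users": 50, "requests_per_week": 40},
}


def weekly_requests(segments):
    """Sum users x requests per week across all segments."""
    return sum(s["users"] * s["requests_per_week"] for s in segments.values())


total = weekly_requests(segments)
print(total)  # 4100

# Show how much of the volume each group creates.
for name, s in segments.items():
    share = s["users"] * s["requests_per_week"] / total
    print(f"{name}: {share:.0%} of requests")
```

In this example the heavy group is 10% of users but creates roughly half of the requests, which is exactly what a single blended average hides.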

Add retries, fallbacks, and repeat work

Most cost forecasts miss the same thing: they count the request that worked, not the extra work around it. Real products trigger more than one model call per user action, and the gap gets expensive fast.

Start with failed calls that users repeat themselves. If a response times out, looks wrong, or feels too slow, many users hit send again. Even a modest repeat rate changes the math. A feature with 50,000 monthly prompts and a 12% rerun rate does not create 50,000 generations. It creates 56,000 before you count anything else.

Your app also retries on its own. Network errors, rate limits, and short provider outages can all trigger another call. Teams often add this logic early, then forget to include it in the forecast. If 4% of requests retry once, that is 2,000 extra calls on top of those 50,000 base requests.

Fallbacks add another layer. You might send the request to a second model when the first one fails, hits a content filter, or returns weak output. That protects the user experience, but it also means one customer action can turn into two billed calls. If 3% of traffic falls back, count those calls separately because the second model may have a very different price.

Repeat work is easy to miss when one prompt triggers several steps. A rerun can repeat retrieval, classification, moderation, or post-processing, not just the final generation. If your flow rebuilds the full pipeline on every retry, costs rise across the stack.

A simple worksheet helps:

base user requests
user reruns after weak or failed output
automatic retries from the app
fallback calls to another model
extra pipeline steps repeated on each rerun

Add a buffer here, even if your logs look clean. Retries and reruns rarely stay flat once real users arrive, and they usually climb during busy hours, when errors hurt the most.
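The worksheet above can be sketched in a few lines. The rates are the ones used in this section, and for simplicity every rate is applied to the base request count. The 10% buffer is a hypothetical value chosen for illustration, so replace it with whatever your own logs suggest.

```python
base_requests = 50_000

# Rates from the examples above, all applied to base requests for simplicity.
user_rerun_rate = 0.12
app_retry_rate = 0.04
fallback_rate = 0.03
buffer = 0.10  # hypothetical safety margin, not a figure from the article

user_reruns = round(base_requests * user_rerun_rate)    # 6,000
app_retries = round(base_requests * app_retry_rate)     # 2,000
fallback_calls = round(base_requests * fallback_rate)   # 1,500, priced separately

# Calls billed by the primary model, before and after the buffer.
primary_calls = base_requests + user_reruns + app_retries   # 58,000
primary_with_buffer = round(primary_calls * (1 + buffer))   # 63,800

print(primary_calls, fallback_calls, primary_with_buffer)
```

Fallback calls stay on their own line because the second model may have a different price. If each rerun repeats retrieval, classification, or moderation, multiply the rerun and retry counts by the number of steps your pipeline repeats.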

Plan for peak load, not average traffic

Daily averages make AI costs look smaller than they feel in production. A feature can look cheap on paper, then turn expensive during one busy hour when requests pile up, users retry, and response times slip.

Start with the busiest day you expect in a normal month. Then zoom in to the busiest hour inside that day. A support tool, for example, may stay quiet overnight and then get hit hard at 9 a.m. when agents log in, sync old tickets, and run summaries at the same time.

That hour matters more than the daily total. If you process 12,000 requests per day, finance may divide that by 24 and assume a smooth flow of 500 requests an hour. Real products rarely behave that way. Many users act in clusters: after a campaign, at the start of a shift, or right after a batch import finishes.

Concurrent requests drive both the bill and the stress on your system. Ten thousand requests spread across a day is manageable. Five hundred requests that overlap in one minute is a different problem. To estimate concurrency, look at how long one request stays active, then ask how many can overlap during a burst. If one AI call takes 8 seconds and users trigger 60 calls per minute, that is one new call per second, so about 8 requests are in flight at any moment.
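A rough way to turn that into a number is Little's law: average requests in flight equal the arrival rate multiplied by how long each request stays active. The sketch below applies it to the 8-second call from the text and compares the burst with the same 12,000 daily requests arriving evenly. The function name `avg_in_flight` is illustrative, and the result is an average, so real bursts will overshoot it.

```python
def avg_in_flight(calls_per_minute: float, seconds_per_call: float) -> float:
    """Little's law: average concurrency = arrival rate x time each call stays active."""
    return calls_per_minute / 60 * seconds_per_call


# Burst from the text: 60 calls per minute, 8 seconds per call.
burst = avg_in_flight(60, 8)

# Same 8-second call if 12,000 daily requests arrived evenly over 24 hours.
smooth = avg_in_flight(12_000 / (24 * 60), 8)

print(f"burst: {burst:.1f} in flight, smooth day: {smooth:.1f} in flight")
# burst: 8.0 in flight, smooth day: 1.1 in flight
```

The gap between those two numbers is the capacity a daily average never shows.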

Queues turn a traffic spike into extra cost. When users wait too long, some refresh the page, click again, or open a support ticket. That creates repeat work. Check how long a request can sit in queue before your app times out or the user gives up. Even a short delay gets expensive if it triggers retries, because each duplicate attempt is billed as a full extra call.

Price the extra capacity for those busy periods, even if you only need it for a few hours each week. If average load needs 4 workers but Monday morning needs 10, your forecast should include those 10 workers, the higher model throughput, and the wasted calls from timeouts or duplicate attempts.

If you skip this step, the forecast may look healthy until the first spike. After that, margins disappear fast.

Count human review and support time

Most teams undercount this part. Model costs are easy to spot. The hours people spend checking, fixing, and explaining bad answers are not.

List every moment when a person touches the result. Do not stop at formal approval. Count moderation, spot checks, manual edits, edge case handling, and the time someone spends deciding whether the output is safe to send.

A simple map usually includes steps like these:

a reviewer checks output before it reaches the user
a moderator handles flagged content
a specialist fixes wrong or incomplete answers
support replies when users get confused
an engineer looks at failures that repeat

Put minutes next to each step. Be honest. A "quick review" often turns into 3 to 5 minutes once you include context switching, reading the prompt, checking the answer, and leaving a note.

Support time belongs in the same forecast. Wrong answers create tickets. Confusing answers do too, even when the answer is technically correct. If 1 out of 150 AI responses leads to a support ticket, and each ticket takes 8 minutes, that small number adds up fast at scale. At 30,000 responses a month, for example, that is 200 tickets and almost 27 hours of support time.

Edge cases need their own line. Teams often bury them inside a general review bucket, then wonder why the forecast misses. If rare cases need a senior person for 15 minutes, finance should see that cost clearly instead of hiding it inside an average.

Use one hourly rate for each role involved:

reviewer
moderator
support agent
engineer or specialist

Then multiply time by frequency. For example, 5% of outputs may need review, 0.7% may create support work, and 0.1% may need specialist help. That gives finance a clean labor layer on top of token and infrastructure costs.
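Here is a minimal sketch of that labor layer. The shares are the ones from the example above, and the minutes come from earlier in this section. The monthly output count and every hourly rate are hypothetical placeholders, so swap in your own figures.

```python
from dataclasses import dataclass


@dataclass
class LaborStep:
    role: str
    share_of_outputs: float   # fraction of AI outputs that need this step
    minutes_per_case: float
    hourly_rate: float        # hypothetical rate in USD

    def monthly_cost(self, monthly_outputs: int) -> float:
        """Time x frequency x rate for one role."""
        hours = monthly_outputs * self.share_of_outputs * self.minutes_per_case / 60
        return hours * self.hourly_rate


monthly_outputs = 10_000  # hypothetical volume

steps = [
    LaborStep("reviewer", 0.05, 4, 30.0),
    LaborStep("support agent", 0.007, 8, 25.0),
    LaborStep("specialist", 0.001, 15, 80.0),
]

labor_total = sum(step.monthly_cost(monthly_outputs) for step in steps)
for step in steps:
    print(f"{step.role}: ${step.monthly_cost(monthly_outputs):,.2f}")
print(f"labor per output: ${labor_total / monthly_outputs:.4f}")
```

With these placeholder rates, labor adds about 14 cents per output, which is often far more than the model call it sits on top of.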

This is where many pricing guesses break. The model call may cost cents. The cleanup around it can cost much more.

Build the forecast step by step

Start with one user action, not a monthly total. Price one action such as "summarize a ticket" or "draft a reply," then build up from there. That keeps the math tied to real usage instead of guesswork.

Write down every model call behind that action. Include the obvious call that generates the answer, but also include routing, classification, moderation, retrieval, and any fallback call you make when the first result is weak or times out.

Then put a unit cost next to each call. Use the model price for input and output tokens, or the price per request if that is how the vendor bills. Convert it into the cost of one user action right away. A spreadsheet full of token counts is not enough. You need a dollar cost per action.

A simple forecast sheet usually has one line for each part of the flow:

what the call does
how many tokens or requests it uses
the unit price
the cost per action
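The sheet above can be sketched as one record per call. All token counts and prices below are hypothetical, picked only to show the conversion from tokens to a dollar cost per action. Check your vendor's current price list before using any of them.

```python
from dataclasses import dataclass


@dataclass
class CallLine:
    purpose: str
    input_tokens: int
    output_tokens: int
    input_price_per_m: float    # USD per million input tokens (hypothetical)
    output_price_per_m: float   # USD per million output tokens (hypothetical)

    def cost(self) -> float:
        """Dollar cost of one call at the given token counts and prices."""
        return (self.input_tokens * self.input_price_per_m
                + self.output_tokens * self.output_price_per_m) / 1_000_000


# One user action, e.g. "draft a reply", broken into the calls behind it.
action = [
    CallLine("classify ticket", 300, 10, 0.50, 1.50),
    CallLine("generate draft", 1_200, 250, 3.00, 15.00),
    CallLine("safety check", 400, 80, 0.50, 1.50),
]

cost_per_action = sum(line.cost() for line in action)
for line in action:
    print(f"{line.purpose}: ${line.cost():.6f}")
print(f"cost per action: ${cost_per_action:.6f}")
```

Keeping input and output prices separate matters because many vendors charge more for output tokens, so a longer reply can cost more than a longer prompt.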

Once you have that, multiply the per-action cost by three volume cases: low, base, and high. Low covers slow adoption. Base reflects the plan you expect. High should include more than extra users. Add retries, repeat prompts, and heavier use during peak hours.

Do not stop at model spend. Add review time, logging, monitoring, storage, and support time. Those costs look small when you view them alone, but together they can change the margin a lot. In support, sales, and compliance workflows, human review often costs more than the model call itself.

Finish with one comparison: total cost per action versus your planned price. If one AI-assisted task costs $0.18 and your pricing brings in only $0.12, the gap is already visible. You can shrink context, switch models, limit usage, or charge extra for heavy use before the economics get worse.
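That comparison is easy to script. The sketch below uses the $0.12 price and $0.18 cost from the paragraph above as the base case. The volumes and the low and high costs per action are hypothetical, and `price_for_margin` is an illustrative helper that shows what price a target margin would require.

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price. Negative means a loss per action."""
    return (price - cost) / price


def price_for_margin(cost: float, target_margin: float) -> float:
    """Price needed to reach a target gross margin at a given cost."""
    return cost / (1 - target_margin)


price = 0.12

# (monthly actions, cost per action). Only the base cost comes from the text.
cases = {"low": (5_000, 0.16), "base": (10_000, 0.18), "high": (20_000, 0.21)}

for name, (actions, cost) in cases.items():
    monthly = actions * (price - cost)
    print(f"{name}: margin {gross_margin(price, cost):.0%}, monthly result ${monthly:,.0f}")

print(f"price for a 60% margin at $0.18 cost: ${price_for_margin(0.18, 0.60):.2f}")
```

When the margin per action is negative, more volume makes the loss larger, so the high case is where the pricing gap hurts most.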

A simple example from a SaaS support tool

A support agent clicks "Draft reply" after a customer asks for a refund. The app sends the ticket text, order details, and company policy to one model that writes the first reply. Then it sends that draft to a second model that checks for risky claims, wrong refund terms, or a rude tone.

Say the product handles 20,000 support tickets a month, and 35% of them use AI drafting. That gives you 7,000 draft attempts before reruns or review time.

Now add the real usage pattern. If the average ticket plus policy text is 1,200 input tokens, and the draft reply is 250 output tokens, the first model handles about 1,450 tokens per request. The safety check might use 400 input tokens and 80 output tokens, so that adds 480 more. One clean pass is about 1,930 tokens total.

Users rarely accept the first draft every time. If 18% of agents click regenerate to make the reply shorter or more formal, those extra runs push volume up fast. Seven thousand draft attempts turn into 8,260 total draft cycles when you include reruns.

That matters because the second model often runs again too. If every regenerated reply also gets checked, your monthly token estimate is 8,260 multiplied by 1,930, or about 15.9 million tokens. That is the number finance will care about, not the neat first-pass estimate.

People costs often get added too late. If a team lead reviews 6% of those 8,260 drafts before they go out, that is about 496 reviews a month. At 90 seconds each, the team spends roughly 12.4 hours on review. If support leads cost $35 an hour, review adds about $434 a month before support agents spend time editing the draft.

A simple forecast for this feature needs four inputs:

monthly tickets
AI adoption rate per ticket
rerun rate per drafted reply
review rate and minutes per review
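Put together with the token counts from the example, the inputs listed above reproduce the whole calculation. This is a sketch of the example's own arithmetic, not a pricing tool. It stops at tokens and review cost because the example does not state a model price.

```python
monthly_tickets = 20_000
adoption_rate = 0.35        # share of tickets that use AI drafting
rerun_rate = 0.18           # share of drafts that get regenerated
review_rate = 0.06          # share of draft cycles a team lead reviews
review_seconds = 90
lead_hourly_rate = 35.0     # USD

# Draft call (1,200 in + 250 out) plus safety check (400 in + 80 out).
tokens_per_cycle = (1_200 + 250) + (400 + 80)   # 1,930

draft_attempts = round(monthly_tickets * adoption_rate)    # 7,000
draft_cycles = round(draft_attempts * (1 + rerun_rate))    # 8,260
monthly_tokens = draft_cycles * tokens_per_cycle           # 15,941,800

reviews = round(draft_cycles * review_rate)                # 496
review_hours = reviews * review_seconds / 3600             # 12.4
review_cost = review_hours * lead_hourly_rate              # about 434

print(f"{draft_cycles:,} cycles, {monthly_tokens:,} tokens")
print(f"{reviews} reviews, {review_hours:.1f} hours, ${review_cost:,.0f}")
```

Changing the rerun rate from 0.18 to 0.30 in this script is a quick way to see how much of the bill depends on one behavioral assumption.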

This example is simple, but it catches the parts teams usually miss. The draft model is only part of the bill. Safety checks, reruns, and human review often decide whether the feature makes money or quietly eats margin.

Mistakes that break the numbers

A lot of teams build the forecast on clean average behavior that never shows up in real life. Average traffic hides the hours when usage jumps 3x or 5x, queues grow, and slower responses trigger even more retries. If your feature runs fine at noon but struggles every Monday morning, the cheap forecast is fiction.

Another common mistake is counting only the first request. Real usage is messy. Users refresh a page, resend a prompt, edit the input, or leave halfway through a flow. Your app may retry after a timeout, call a second model when the first answer fails, or rerun moderation and logging behind the scenes. Each of those steps costs money, and together they can change unit economics fast.

Many teams also price only model calls and forget the people around the feature. Human review, support tickets, QA checks, prompt tuning, abuse handling, and failed-output investigations all add labor cost. In some products, staff time costs more than tokens. That is easy to miss when finance asks for a margin and the spreadsheet shows only API usage.

Customer behavior almost never spreads evenly. One group may use the feature once a week. Another may hammer it all day because it saves them hours. Enterprise accounts often behave very differently from trial users, and power users can distort the total more than a large number of casual users.

A safer forecast does four things:

uses p50 and p95 usage, not just one average
adds retry rates, timeout rates, and repeat submissions
includes staff time per 100 or 1,000 requests
splits users into segments instead of one blended number

A small support tool makes this obvious. If 1,000 agents each send 20 requests a day on average, the forecast looks neat. But if 150 of them send 120 requests each during peak shifts, 8% of requests retry, and 5% need human review, your cost per customer changes fast. That gap is where underpricing starts.
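To see how much one average can hide, here is a sketch with a made-up distribution. It assumes the 150 heavy agents are part of the 1,000 and that the daily total stays at 20,000 requests. How the remaining 2,000 requests spread across the other 850 agents is invented purely for illustration.

```python
from statistics import mean, median, quantiles

# Hypothetical distribution: 150 heavy agents at 120 requests a day,
# and 850 other agents sharing the remaining 2,000 requests.
daily_requests = [120] * 150 + [3] * 300 + [2] * 550

total = sum(daily_requests)                      # 20,000
avg = mean(daily_requests)                       # 20
p50 = median(daily_requests)                     # typical agent
p95 = quantiles(daily_requests, n=100)[94]       # 95th percentile
heavy_share = (120 * 150) / total                # share of volume from heavy agents

extra_calls = round(total * 0.08)                # 8% of requests retry
reviews = round(total * 0.05)                    # 5% need human review

print(f"average {avg}, p50 {p50}, p95 {p95}")
print(f"heavy agents create {heavy_share:.0%} of requests")
print(f"{extra_calls:,} retry calls and {reviews:,} reviews per day")
```

Under these assumptions the average is 20 requests per agent, but the typical agent sends 2 and the top 15% create 90% of the load. Pricing per seat from the average would undercharge exactly the accounts that cost the most.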

If the numbers still look surprisingly low, they probably are. AI costs usually break at the edges: spikes, retries, and people time.

Quick checks and next steps

A forecast works when finance, product, and engineering can all read the same page and reach the same answer. If someone needs a long walkthrough to understand your numbers, the forecast is still too loose.

Put the forecast into one table. Keep it plain. Show the scenario, expected usage, total monthly cost, expected revenue, and gross margin. Add a short note for the assumptions that matter most.

Scenario	Monthly demand	Total cost	Revenue	Gross margin	Notes
Low	Fewer requests, lower review rate	Lower	Lower	Stable or thin	Good for slow adoption
Base	Expected usage	Expected cost	Expected revenue	Target margin	Main pricing case
High	More requests, more retries, peak traffic	Higher	Higher	Often tighter	Stress case

Low, base, and high demand are not optional. Teams usually get surprised by the high case, not the average one. If your pricing only works when traffic is smooth and users behave exactly as planned, it will drift fast.

Mark the few inputs that move margin the most. In many products, that is not the raw token bill alone. Retry rate, fallback rate, average output length, human review share, and peak load can change the picture more than people expect. A small jump in any one of them can wipe out the buffer you thought you had.
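One way to find those inputs is a simple sensitivity check: raise each input by the same percentage and see how much the total moves. The sketch below does that with a deliberately small cost model. Every number in it is a hypothetical placeholder, and peak load is left out because it needs its own capacity model.

```python
# Hypothetical base inputs for one month.
BASE = {
    "requests": 10_000,
    "retry_rate": 0.10,
    "fallback_rate": 0.03,
    "input_tokens": 1_200,
    "output_tokens": 250,
    "review_share": 0.05,
}

# Hypothetical unit costs in USD.
INPUT_PRICE = 3.0 / 1_000_000     # per input token
OUTPUT_PRICE = 15.0 / 1_000_000   # per output token
FALLBACK_CALL_COST = 0.02         # per fallback call
REVIEW_COST = 2.00                # per reviewed output, e.g. 4 minutes at $30/hour


def monthly_cost(p: dict) -> float:
    """Model calls (including retries) + fallback calls + human review."""
    calls = p["requests"] * (1 + p["retry_rate"])
    call_cost = p["input_tokens"] * INPUT_PRICE + p["output_tokens"] * OUTPUT_PRICE
    fallback = p["requests"] * p["fallback_rate"] * FALLBACK_CALL_COST
    review = p["requests"] * p["review_share"] * REVIEW_COST
    return calls * call_cost + fallback + review


def sensitivity(base: dict, bump: float = 0.20) -> dict:
    """Relative change in total cost when each input alone rises by `bump`."""
    base_cost = monthly_cost(base)
    changes = {}
    for key, value in base.items():
        changed = dict(base, **{key: value * (1 + bump)})
        changes[key] = monthly_cost(changed) / base_cost - 1
    return dict(sorted(changes.items(), key=lambda kv: kv[1], reverse=True))


ranking = sensitivity(BASE)
for key, change in ranking.items():
    print(f"{key}: +{change:.1%} total cost")
```

With these placeholder numbers, review share moves the total almost as much as request volume, while token counts barely register. Your own ranking will differ, which is the reason to run the check with real inputs.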

Use a short final check:

Can one table explain the full forecast?
Did you test low, base, and high demand?
Did you label the inputs that change margin the most?
Did you include review time and support time, not only API spend?

That quick review is often where teams catch the biggest mistakes. If you want an outside pass before you lock pricing, Oleg Sotnikov at oleg.is works with startups and smaller companies on AI product architecture, infrastructure, and Fractional CTO support. A second review of usage assumptions, peak load, and human overhead can save a lot of cleanup later.

Frequently Asked Questions

Why can’t I use token prices alone to estimate AI cost?

Because users do more than send one clean prompt. They retry, edit, refresh, and trigger checks behind the scenes. Start with one user action and count every model call, retry, fallback, storage write, and review minute it creates.

What should I count before I open a spreadsheet?

Write down every action that can fire the feature, including background jobs. For each action, note input size, output size, stored data, and any staff time when the model needs help.

Should I forecast from signups or active users?

Use active users, not total signups. Then split them into light, regular, and heavy groups so one average does not hide the users who drive most of the cost.

How do I estimate retries and regenerations?

Look at logs from a similar flow if you have them. If you do not, add a rerun rate for user repeats, a separate rate for app retries, and another rate for fallback calls. Small percentages can move the monthly total a lot.

Do daily averages work for traffic planning?

No. Price the busiest hour, not only the average day. Bursts create queues, slower replies, duplicate clicks, and more failed calls, and all of that raises cost.

Do storage and monitoring costs really matter?

They do once usage grows. Chat history, embeddings, logs, screenshots, audit records, and error tracking may look cheap per item, but they keep growing every day.

How do I include human review in the forecast?

Map every touch point a person handles, such as review, moderation, support follow-up, and failure checks. Give each step a frequency, minutes per case, and hourly rate. In many products, labor costs more than the model call.

What’s the simplest way to price one AI action?

Pick one action, like drafting a reply or summarizing a ticket. Add the cost of each model call behind it, then add review time, storage, monitoring, and support work. That gives you a cost per action you can compare with your planned price.

What mistakes usually break AI cost forecasts?

Teams often count only the first successful request, use one blended average for all users, and skip people time. They also miss peak-hour behavior, fallback models, and QA after prompt changes.

When should I update the forecast?

Check the forecast after launch, after a model or prompt change, and whenever usage jumps. Watch rerun rate, output length, review share, and peak concurrency, because those numbers can erase margin fast.