AI startup due diligence questions for accelerator teams
AI startup due diligence questions help accelerator teams test demos, margins, review load, and delivery risk before they back a company.

Why strong AI demos still hide weak businesses
A smooth demo proves one thing: the team can control one moment. It does not prove they have a product people can rely on every day.
That gap matters when an accelerator or program team reviews an AI startup. Founders can show a fast result with clean data, a polished prompt, and a narrow use case. Real customers do none of that. They upload messy files, ask vague questions, skip instructions, and expect the product to recover.
The first job in technical review is to separate the demo from the product. A demo follows a happy path. A product has to survive odd edge cases, weak inputs, slow model responses, and users who click the wrong thing. If the startup looks good only when the team prepares everything first, the business may be much weaker than the pitch suggests.
Hidden manual work is a common reason for that gap. Teams sometimes clean the data before the meeting, rewrite prompts between steps, retry the model until they get the best answer, or fix the output before anyone sees it. The result can look impressive, but it also means the product may still depend on labor.
A few direct questions usually cut through the polish. What happens when a customer uploads bad data? Which steps still need a person? How often does the model need a second try? Who checks the result before it reaches the user? Those questions sound basic. That's why they work.
Margins can disappear just as fast. A demo might process 20 requests cheaply. A real customer may send 20,000. Then the economics change. Model calls, retries, human review, support time, logging, and error handling all start to matter.
A startup can look efficient on screen and still lose money on every active account. If the team needs people in the loop for most outputs, or if the model burns too much compute on routine work, growth makes the problem worse.
The best demos still hold up when you swap in messy inputs, remove off-screen help, and ask what one finished task costs at real volume.
Questions that expose the product behind the pitch
Founders can make almost any AI product look smooth for ten minutes. They know the right prompt, the clean sample data, and the one path that avoids edge cases. That proves they can present. It does not prove customers can use the product on a normal Tuesday.
Some of the best review questions are simple on purpose. Ask what the product does when the founder is not in the room. Then ask for a live run where another team member uses it with messy input and no coaching. If the product works only with founder guidance, you may be looking at a service hidden inside a slick demo.
Start with model dependence. Ask which model providers power the product today and which features depend on each one. Then ask what breaks first if an API price rises, a provider changes limits, or a model update hurts quality. Many teams say they can switch models anytime. Often they have never tested that claim.
Next, ask what customers actually return to do each week. A vague answer like "they use it every day" tells you almost nothing. You want the repeated task, the user who does it, and the reason it matters. If the team cannot describe the routine behavior clearly, they may not have real product use yet.
A small example makes the point. A startup shows an AI tool that writes sales follow-up emails. The demo looks great. The better question is whether sales reps open it every week, trust the draft, edit one or two lines, and send it inside their normal workflow. If reps copy the text into another app, rewrite half of it, and stop using it after a few days, the product is still thin.
Provider dependence matters for quality as much as cost. If one model change makes the main feature slower, weaker, or unusable, the startup does not control its own user experience yet. That is worth finding early.
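One way to test the "we can switch models anytime" claim is to run the same fixed cases through each provider and compare pass rates. The sketch below is only an illustration: `provider_a`, `provider_b`, and the sample cases are hypothetical stand-ins, not real provider APIs or real customer data.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for real provider clients. In a real review these
# would wrap whatever APIs the startup actually calls.
def provider_a(prompt: str) -> str:
    return prompt.upper()

def provider_b(prompt: str) -> str:
    # Simulates a provider that degrades on longer inputs.
    return prompt.upper() if len(prompt) < 12 else ""

def pass_rate(model: Callable[[str], str],
              cases: List[Tuple[str, str]]) -> float:
    """Share of cases where the model output matches the expected answer."""
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)

# Illustrative cases only: (input, expected output).
cases = [
    ("renewal", "RENEWAL"),
    ("net 30", "NET 30"),
    ("auto-renews yearly", "AUTO-RENEWS YEARLY"),
    ("late fee", "LATE FEE"),
]

results: Dict[str, float] = {
    "provider_a": pass_rate(provider_a, cases),
    "provider_b": pass_rate(provider_b, cases),
}
for name, rate in results.items():
    print(f"{name}: {rate:.0%} of cases passed")
```

A team that has really tested a switch should be able to show a comparison like this on its own cases, with its own definition of a passing answer.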
What happens when the model gets it wrong
Good technical review focuses on failure, not polish. A smooth demo tells you very little about how the product behaves on a bad day.
Ask founders to show real mistakes: wrong answers, unsafe outputs, slow responses, and cases where the model refused a task it should have handled. If they show only best-case prompts, take that as a warning. Strong teams keep a small library of failures and can explain what caused each one. Weak teams say the model is "usually accurate" and move on.
You also need to know who catches errors before customers do. Some startups rely on a human reviewer for every response. Others review only risky cases. Either approach can work, but you need the numbers. If one person checks half of all outputs, the business may break as volume grows.
Ask for a few concrete details. What bad or unsafe outputs have they seen in recent tests or production? At what exact step does someone or something check the result? What does the product do when the model gets slow or times out? Which logs does the team keep, and who gets alerts when failure rates rise?
Latency matters more than many founders admit. When response time jumps from 3 seconds to 20, users retry, abandon the task, or send work to support. Careful teams plan for that. They may switch to a simpler model, return a partial result, queue the task, or hand it to a person with a clear deadline.
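A minimal sketch of the "switch to a simpler model" option might look like the following. It assumes a deadline enforced with a thread pool; `primary_model` and `simpler_model` are hypothetical placeholders, and the sleep only imitates a slow provider.

```python
import concurrent.futures
import time

def primary_model(task: str) -> str:
    # Hypothetical slow call; the sleep stands in for a degraded provider.
    time.sleep(0.3)
    return f"full answer for {task}"

def simpler_model(task: str) -> str:
    # Hypothetical cheaper, faster fallback.
    return f"short answer for {task}"

def answer_with_fallback(task: str, timeout_s: float) -> dict:
    """Try the primary model; use the fallback if it misses the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary_model, task)
    try:
        return {"source": "primary", "text": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"source": "fallback", "text": simpler_model(task)}
    finally:
        # Do not block on the slow call; let it finish in the background.
        pool.shutdown(wait=False)

slow = answer_with_fallback("ticket-1", timeout_s=0.05)
fast = answer_with_fallback("ticket-2", timeout_s=2.0)
print(slow["source"], fast["source"])
```

The useful review question is which path the team has actually built: a fallback model, a partial result, a queue, or a handoff to a person.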
The logs should tell a plain story. You want prompt, model version, output, reviewer action if there is one, and final result. Alerts should go to a named owner, not a vague shared inbox.
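As a rough illustration, a log record with those fields could be as small as the one below. The class name, field names, and sample values are assumptions for the sketch, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass
class RunRecord:
    prompt: str
    model_version: str
    output: str
    reviewer_action: Optional[str]  # None when no human touched the output
    final_result: str

def missing_fields(record: dict) -> List[str]:
    """Return required fields that are absent or empty in a log record."""
    required = ["prompt", "model_version", "output", "final_result"]
    return [key for key in required if not record.get(key)]

# Illustrative values only.
record = RunRecord(
    prompt="Summarize the renewal terms.",
    model_version="example-model-v1",
    output="Renews yearly unless cancelled 30 days before term end.",
    reviewer_action="approved",
    final_result="Renews yearly unless cancelled 30 days before term end.",
)

print(json.dumps(asdict(record), indent=2))
print(missing_fields(asdict(record)))
```

During a review, asking the team to pull one real record and checking it against a list like this takes a few minutes and shows quickly whether the logs can explain a bad output.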
A support example shows why this matters. If a startup drafts replies to customer complaints, one wrong answer can trigger refunds or angry escalations. Ask how they catch that answer before it reaches the customer, and what happens if the review queue backs up on a busy day. The answer usually tells you more than the demo did.
Where the margins disappear
One question gets fuzzy answers fast: "What does one finished job cost?" A revenue slide will not tell you. Ask for cost per completed task, not monthly cloud spend and not a blended forecast.
That number should include every step needed to get a usable result. Model calls, retrieval, storage, logging, support time, and any human cleanup all count. If the team cannot break that down in plain numbers, they probably do not know their margin yet.
Demo economics almost always look better than customer economics. In a demo, the founder can hand-pick inputs, watch every run, and fix failures on the spot. Real customers send longer prompts, messy files, duplicate requests, and edge cases that trigger extra calls and more support.
Ask for gross margin at two points: current volume and realistic customer volume. The gap matters. A product that looks cheap at 50 jobs a day can turn ugly at 5,000 if latency forces retries or long inputs push the model into a more expensive tier.
Human review is where many products quietly lose money. Teams pay contractors or staff to rewrite outputs, verify facts, remove bad formatting, or approve risky cases before delivery. That is not automatically a flaw. It does change the business fast.
Say a startup charges $3 for a document summary. On stage, the model cost looks like $0.30, so the unit economics seem fine. After launch, 20% of documents need a second pass, 10% need a person to fix the output, and long files trigger another model call. The margin can shrink from acceptable to almost nothing.
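The arithmetic is worth running with the team's own numbers. The sketch below uses the $3 price, $0.30 model cost, 20% second-pass rate, and 10% human-fix rate from the example. Everything else is an assumption added for illustration: the share of long files, the minutes a person spends on a fix, and the hourly rate.

```python
def cost_per_task(model_cost: float, second_pass_rate: float,
                  human_fix_rate: float, human_fix_minutes: float,
                  hourly_rate: float, long_file_rate: float) -> float:
    """Expected cost of one finished summary, averaged over all documents."""
    second_pass = second_pass_rate * model_cost   # extra model call
    long_file = long_file_rate * model_cost       # extra model call
    human = human_fix_rate * (human_fix_minutes / 60) * hourly_rate
    return model_cost + second_pass + long_file + human

def gross_margin(price: float, cost: float) -> float:
    return (price - cost) / price

PRICE = 3.00
# Assumed: 25% long files, quick fixes at 5 minutes and $30/hour.
light = cost_per_task(0.30, 0.20, 0.10, human_fix_minutes=5,
                      hourly_rate=30, long_file_rate=0.25)
# Assumed: same mix, but fixes take 30 minutes at $45/hour.
heavy = cost_per_task(0.30, 0.20, 0.10, human_fix_minutes=30,
                      hourly_rate=45, long_file_rate=0.25)

for label, cost in [("light fixes", light), ("heavy fixes", heavy)]:
    print(f"{label}: cost ${cost:.3f}, margin {gross_margin(PRICE, cost):.1%}")
```

Under these assumed inputs the margin stays near 77% when fixes are quick and drops to roughly 10% when they are slow. The model cost barely moves the result; the human minutes do.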
Retries deserve their own question. One customer job may look like a single prompt, but the product may actually run a classifier, a retrieval step, a generation step, a validator, and then a fallback model when the first answer fails. Ask how often that happens and what the average job really uses.
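A quick way to see what the average job really uses is to list each step with its cost and how often it runs. The step costs and the 15% fallback rate below are hypothetical placeholders; the point is the shape of the calculation, not the numbers.

```python
# (step name, assumed cost per run in dollars, probability the step runs)
steps = [
    ("classifier", 0.002, 1.0),
    ("retrieval", 0.001, 1.0),
    ("generation", 0.020, 1.0),
    ("validator", 0.004, 1.0),
    ("fallback_model", 0.060, 0.15),  # runs only when the first answer fails
]

expected_cost = sum(cost * prob for _, cost, prob in steps)
expected_steps = sum(prob for _, _, prob in steps)
naive_cost = 0.020  # what the job costs if you count only the generation call

print(f"expected steps per job: {expected_steps:.2f}")
print(f"expected cost per job:  ${expected_cost:.3f}")
print(f"naive single-call cost: ${naive_cost:.3f}")
```

With these assumed figures the real job costs about 1.8 times the single visible prompt, and most of the gap comes from a fallback that fires on a minority of jobs.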
You want a small table, not a polished story: price per task, average model cost, average human minutes, retry rate, and gross margin at low and normal volume. That tells you more than a strong demo ever will.
How much human review the startup really needs
An AI product can look automatic and still depend on a lot of people behind the scenes. If a startup needs humans to check, fix, or approve most outputs, the demo may sell a story the business cannot scale.
Ask the team to map the full path from user input to final result. Count every manual step, including the small ones founders call temporary. One person may label data, another may approve risky outputs, and someone else may rewrite answers before customers see them.
One question works especially well: "Who touches the output before the customer uses it?" Keep going until you get names, time spent, and volume per day. If the answer stays vague, the startup probably does not track review load well.
Founders often do this work themselves early on. That is normal, but it hides the real cost. If the CEO and CTO still review most responses at night, the company has not built a repeatable process yet. During accelerator screening, that should lower confidence in the current margin and speed claims.
Watch the queue, not just the headcount. A startup may say it has only three reviewers, but that number means little without throughput. Ask how many items one reviewer clears in an hour, how often reviewers need to rewrite instead of approve, what share of outputs need a second look, and how long customers can wait when volume spikes.
The math gets ugly quickly. If sales double but review time per item stays flat, review work doubles too, and a team that was already busy falls behind. The queue can then grow faster than revenue, and the company has to hire reviewers before it earns enough to cover them.
Picture a startup that sells AI-generated support replies. Each reviewer handles 120 drafts a day, but 35% need edits and 10% need full rewrites. If new customers add 2,000 drafts a day next quarter and every draft still gets a human check, that is roughly 17 more reviewers, needed almost at once.
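The staffing math in that example fits in a few lines. The sketch uses the 120 drafts per reviewer and 2,000 new drafts a day from the example; the `review_share` parameter is an added assumption for teams that review only part of their output.

```python
import math

def reviewers_needed(drafts_per_day: float,
                     drafts_per_reviewer_per_day: float,
                     review_share: float = 1.0) -> int:
    """Reviewers required to clear one day's drafts without a backlog."""
    reviewed = drafts_per_day * review_share
    return math.ceil(reviewed / drafts_per_reviewer_per_day)

# Every new draft gets a human check.
full_review = reviewers_needed(2000, 120)
# Assumed alternative: only half of drafts are routed to a person.
half_review = reviewers_needed(2000, 120, review_share=0.5)

print(full_review, half_review)
```

Asking the team which `review_share` applies today, and what evidence supports lowering it, is usually more revealing than asking about headcount.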
Human review is not a flaw by itself. Some products should keep a person in the loop. The problem starts when the startup treats that labor like a temporary detail instead of a core cost.
A short screening flow for program teams
Most first-pass reviews get too abstract too fast. A better approach is simple: pick one promise the startup makes to customers, then inspect one real workflow that delivers it.
If a founder says, "We turn support tickets into ready-to-send replies in 30 seconds," stay with that claim. Do not jump to market size or long roadmaps yet. Ask them to walk through one ticket from raw input to final output.
A five-step screening flow works well:
- Start with the user action. What does the customer upload, type, or click, and what result do they expect at the end?
- Trace the full path. Ask what happens before the model call, during it, and after it, including rules, retrieval, formatting, and delivery.
- Count every model call and every human touch. One polished demo can hide three background prompts, a manual reviewer, and a founder fixing edge cases off screen.
- Pressure test cost, speed, and failure handling. Ask what happens if the model is slow, wrong, blocked by policy, or too expensive on a busy day.
- Write down one red flag and one follow-up question before the meeting ends. That forces the team to leave with a concrete concern, not a vague feeling.
This process sounds basic, but it exposes weak businesses fast. Thin margins show up when a startup needs several expensive calls to produce one cheap output. Review bottlenecks show up when a person must check most results before a customer can use them.
Imagine a startup that says it can draft investor updates automatically. In the review, you learn the system pulls data from five tools, makes four model calls, and still needs a staff member to fix tone and numbers before sending. The demo may look smooth, but the workflow is slow, costly, and hard to scale.
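Writing the workflow down as a flat list of steps makes the count hard to argue with. The sketch below mirrors the investor-update example: five data pulls, four model calls, one human fix. The tool and step names are invented for illustration.

```python
from collections import Counter

# (kind, description) for each step in one run; names are hypothetical.
trace = [
    ("tool", "pull revenue from billing"),
    ("tool", "pull pipeline from CRM"),
    ("tool", "pull headcount from HR system"),
    ("tool", "pull burn from accounting"),
    ("tool", "pull product metrics"),
    ("model_call", "summarize metrics"),
    ("model_call", "draft narrative"),
    ("model_call", "rewrite for tone"),
    ("model_call", "format update"),
    ("human", "staff member fixes tone and numbers"),
]

counts = Counter(kind for kind, _ in trace)
for kind in ("tool", "model_call", "human"):
    print(f"{kind}: {counts[kind]}")
```

Any step of kind "human" that runs on most jobs belongs in the cost per task, whatever the founders call it.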
By the end of this pass, program teams should know where the product can break, where money leaks out, and what they still need to verify in the next call.
A simple example from an accelerator review
A startup pitches an AI tool that gives sales teams instant contract summaries. In the demo, a rep uploads a clean PDF, waits a few seconds, and gets a neat summary with payment terms, renewal dates, and risks pulled out correctly.
That looks good in a meeting room. It tells you almost nothing about daily use.
The demo file set often includes polished documents the team picked in advance. Real customers upload phone scans, contracts with handwritten notes, pricing tables pasted from spreadsheets, screenshots inside email threads, and files with missing pages. Once reviewers try those messy inputs, the output changes fast. The model skips table footnotes, mixes up dates, or treats a redline comment like a final term.
The startup still ships decent results because two people sit behind the product and fix edge cases before anything reaches the customer. That manual layer is easy to miss during a live demo. Founders may call it quality control, but it is really part of the product cost.
At low volume, the numbers can still look fine. If the team handles 20 contracts a day, two reviewers may keep up. If usage jumps to 200 contracts a day after a pilot goes well, those same reviewers become the bottleneck. The company now has to add staff, accept slower delivery, or tolerate lower accuracy. Any of those can crush margins.
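A small backlog calculation shows how fast the bottleneck forms. It uses the 20 and 200 contracts a day and the two reviewers from the example. The throughput of 15 contracts per reviewer per day is an assumption, since the example does not state one.

```python
def backlog_after(days: int, arrivals_per_day: int,
                  reviewers: int, per_reviewer_per_day: int) -> int:
    """Unreviewed items left after `days`, starting from an empty queue."""
    capacity = reviewers * per_reviewer_per_day
    backlog = 0
    for _ in range(days):
        backlog = max(0, backlog + arrivals_per_day - capacity)
    return backlog

# Assumed throughput: 15 contracts per reviewer per day.
pilot = backlog_after(days=5, arrivals_per_day=20,
                      reviewers=2, per_reviewer_per_day=15)
scaled = backlog_after(days=5, arrivals_per_day=200,
                       reviewers=2, per_reviewer_per_day=15)

print(f"backlog after one week at 20/day:  {pilot}")
print(f"backlog after one week at 200/day: {scaled}")
```

Under that assumption the queue is empty at pilot volume and holds 850 contracts after five working days at 200 a day, which is the point where delivery time or accuracy has to give.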
Three follow-up checks usually expose the gap. Ask the team to run the product on random customer files, not the prepared demo set. Ask what share of outputs need human fixes before delivery. Then ask how review time changes when volume doubles or triples.
This example is simple, but it catches a common problem. A smooth demo can hide weak document handling, hidden labor, and a business model that works only while usage stays small.
Mistakes teams make when they judge AI startups
A smooth demo can hide a shaky business. Program teams often reward polish and speed, then miss the parts that hurt six months later: thin margins, constant review, and a support queue that keeps growing.
One early mistake is accepting vague cost answers. Founders should know what one customer task costs in model calls, infrastructure, and staff time. If they answer with guesses or say they will fix it later, the margins may already be too tight.
Another common mistake is treating prompt quality as product depth. A smart prompt can make a short demo feel impressive. That does not mean the company has hard-to-copy workflows, stable integrations, or a product people will keep paying for.
Manual work also disappears in polished meetings. The output looks instant, but someone may have cleaned the data, retried failed runs, corrected bad answers, or reviewed every result before the call. If the startup needs a person in the loop for most customer jobs, the product may be a service wearing software clothes.
Benchmarks create another trap. A team may report 90% accuracy on one internal test and still disappoint customers. Buyers care about whether the tool saves time, avoids costly mistakes, and works on messy inputs. A support copilot that scores well in a lab can still flood human agents with escalations after launch.
Support load deserves more attention than it gets. Many AI products look cheap to sell, then become expensive to maintain because users need help with wrong outputs, edge cases, and unclear settings. Founders should tell you who handles those issues, how often they happen, and how the rate changes as usage grows.
The plain questions are usually the best ones: unit costs, reviewer involvement, repeat use, customer retention, and ticket volume after launch. Those answers matter more than one perfect demo.
A quick checklist before you issue an offer
Before you send terms, run five checks. They catch most of the ugly surprises.
- Ask the founders to show what happens when the input is vague, incomplete, or plainly wrong. Good teams know their failure cases and can explain what the user sees when the model is unsure.
- Ask for cost per job today with real numbers, including model cost, human review time, and any expensive processing around the model.
- Check whether one operator can review more work over time. If every new customer needs the same amount of human checking, margins stay thin.
- Test how tied they are to one model. Teams that can switch models, or at least route different tasks to different models, have more room to manage price, speed, and uptime.
- Look for repeat use, not isolated trials. A dozen impressive pilot users mean less than a smaller group that comes back every week.
One small detail matters a lot: ask for evidence, not promises. A team does not need perfect dashboards at this stage, but it should have basic numbers, a few failure stories, and a clear sense of where the product still needs human help.
If the founders answer these points with specifics, you may have something real. If they answer with general confidence, wait.
Next steps after your first technical pass
After the first pass, push the team one step past the pitch. Ask for a short follow-up memo with real numbers, not a longer deck. You want figures you can compare across the batch.
A useful memo is brief. It should tell you what task the model handles today, how often a human checks the output, the error rate in live or pilot use, the gross margin after model, storage, and support costs, and what breaks when usage jumps 5x.
Numbers change the conversation fast. "We review 12% of cases and spend about 6 minutes on each review" is useful. "We mostly automate the workflow" is not.
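Numbers like those turn directly into a staffing estimate. The sketch below takes the 12% review rate and 6 minutes per review from the quoted answer; the 1,000 cases and the 5x multiplier are illustrative volumes, not figures from any real company.

```python
def review_hours(cases: int, review_rate: float,
                 minutes_per_review: float) -> float:
    """Reviewer hours needed for a given number of cases."""
    return cases * review_rate * minutes_per_review / 60

current = review_hours(1000, 0.12, 6)
at_5x = review_hours(5000, 0.12, 6)

print(f"hours per 1,000 cases: {current:.1f}")
print(f"hours per 5,000 cases: {at_5x:.1f}")
```

At these inputs, 1,000 cases need about 12 reviewer hours and 5,000 need about 60. A vague answer like "mostly automated" gives you nothing to put into the function.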
Before you finalize support, run a small technical review. Keep it narrow. One reviewer can inspect a sample workflow, trace where the model makes decisions, check what logs exist, and estimate whether the startup can keep quality steady without adding a large operations team.
Use the same question set for every company in the cohort. That keeps louder founders from winning on confidence alone. A startup with a weaker demo may still deserve support if its margins are cleaner, its review load is lower, and its failure cases are easier to control.
If the answers still feel thin, an outside review can help. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor on AI-first software and infrastructure, and this kind of review is exactly where experienced technical operators are useful. They can spot hidden manual work, fragile architecture, and cloud costs that will not hold up once usage grows.
Then match your support to the risk you found. Some teams are ready for funding. Some need credits and a short runway to prove unit economics. Some need hands-on help with architecture, review loops, or deployment before more money makes sense.
Do not treat every AI startup the same. Back the teams that can explain their numbers and survive contact with real users.
Frequently Asked Questions
What should I test first when an AI startup demo looks impressive?
Start with one customer promise and ask the team to show the full workflow behind it. A polished ten-minute path means very little unless the product also handles messy input, delays, and user mistakes.
How do I spot hidden manual work behind the demo?
Ask for a live run with random or messy customer data and no founder coaching. Then ask who edits the output, who retries failed runs, and who checks the result before a user sees it.
What unit economics numbers should founders know?
Get cost per finished task, not just monthly cloud spend. That number should include model calls, retries, storage, logging, support time, and any human review needed to deliver a usable result.
How can I tell if customers will actually keep using the product?
Look for a repeated weekly job with a clear user and a clear reason to come back. If the team can only say users "love it" or "use it every day," they probably do not understand real usage yet.
Why does model dependence matter so much?
A startup that leans on one model provider can lose quality, speed, or margin after a price change or model update. Ask what breaks first and whether the team has tested another model in the same workflow.
What should I ask about bad outputs and failures?
Ask the founders to show real failures, not just best-case prompts. You want to see wrong answers, unsafe outputs, slow responses, and refusals, plus what the product does next when those problems show up.
How much human review is too much?
There is no fixed percentage, but you should worry when people check or rewrite a large share of outputs and the team cannot say how long that work takes. Review load becomes a business problem fast when volume grows.
What does a healthy AI operations setup look like?
You want logs that show the prompt, model version, output, reviewer action if there is one, and the final result. You also want alerts to go to a named owner who can act when failure rates or latency rise.
Should I care more about accuracy scores or real workflow use?
Workflow fit usually matters more. A lab score can look strong while the product still fails on messy files, slows people down, or creates enough errors that support and review costs wipe out the value.
When should an accelerator ask for an outside technical review?
Bring in outside technical review when the demo looks smooth but the answers on cost, review load, failure handling, or provider risk stay vague. A good reviewer can trace the workflow, find hidden labor, and test whether the margins hold up at real volume.