Feb 05, 2025 · 8 min read

Synthetic test sets that product and ops teams can build

Learn how product and ops teams can create synthetic test sets from screenshots, emails, and customer tasks without code or heavy tools.

The problem with ad hoc AI testing

Most teams test AI the way they review a rushed draft. Someone remembers three customer requests, runs them once, and decides the result looks fine. That feels practical, but memory is a terrible way to sample real work. People remember the loud cases, the recent ones, and the ones that annoyed them most.

A strong demo makes this worse. If the model handles one polished example in a meeting, everyone relaxes. The quiet failures stay hidden: a messy screenshot, an email with missing context, or a customer request that mixes two problems in one message. One good answer can hide a long pattern of misses.

Product and ops work also changes from case to case. One order issue needs a simple reply. The next depends on an old policy, a vague screenshot, and a note buried in a long email thread. When teams test with a few handpicked examples, they end up testing the easy version of the job instead of the real one.

Then a second problem shows up. People judge quality from memory after release. Support says the assistant keeps missing edge cases. Product says the last demo looked solid. Ops says the output was fine for the tasks they tried. All three can be right, but they are talking about different cases.

That is when the discussion stalls. The team argues about taste instead of results. One person cares about speed. Another cares about tone. Another cares about policy. Without shared examples, nobody can show whether the latest prompt, model change, or workflow edit made things better or worse.

Synthetic test sets fix that. They give product and ops teams a fixed set of realistic cases built from the work they already see every week. When everyone scores the same cases against the same standard, weak spots stop hiding behind a polished demo.

What a useful test case looks like

A good test case starts with a real customer task, not an invented prompt. If a customer sends a screenshot, forwards an email, or asks for a status update, that is already useful source material. Real tasks include the small details that make systems trip up.

Keep each case narrow. Use one input and ask for one result. If you mix three jobs into one case, nobody knows why the answer failed. A test case should answer one simple question: given this input, did the system do the right thing?

Write the expected action the way one coworker would explain it to another. Plain language beats abstract labels. Instead of "demonstrates correct intent classification," write "send this to billing and ask one follow-up question about the invoice number."

Context matters because small facts often change the right decision. A late shipment email from a new customer may need a different reply than the same email from a long-term account with a pending renewal. Add only the facts that change the choice. Extra background makes scoring harder.

Most good cases have five parts:

  • the original input, such as an email, form, screenshot, or short task
  • the goal, written as one clear outcome
  • the context that changes the decision
  • the expected action in plain language
  • a short note on what counts as wrong

That last part prevents a lot of arguments. "Bad answer" should mean something specific. Maybe the system gives advice without asking for missing information, routes the task to the wrong team, ignores a policy exception, or uses the wrong tone.

A simple example shows the difference. Suppose the input is an email that says, "I was charged twice, and I need this fixed today." The goal is not "handle billing." The goal is "identify a duplicate charge complaint and send it to billing with urgency." If the customer also says they already opened ticket #4821, that context matters. A bad answer would tell them to open a new ticket, miss the urgency, or promise a refund the team cannot approve.
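
If your team ever wants that case in a more structured form, a plain dictionary is enough. Here is a minimal sketch of the duplicate charge example; the field names are just one way to hold the five parts, not a required format.

```python
# One test case kept as a plain dictionary. The field names here are only
# an illustration of the five parts described above, not a required schema.
case = {
    "id": "billing-001",
    "input": "I was charged twice, and I need this fixed today. "
             "I already opened ticket #4821.",
    "goal": "Identify a duplicate charge complaint and send it to billing with urgency.",
    "context": "Customer already has an open ticket (#4821).",
    "expected_action": "Route to billing as urgent and reference the existing ticket.",
    "wrong_if": [
        "tells the customer to open a new ticket",
        "misses the urgency",
        "promises a refund the team cannot approve",
    ],
}

# Everything a reviewer needs to judge an answer lives in this one record.
print(case["goal"])
```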

That is what makes these sets useful. Each case stays small, grounded, and easy for humans to judge the same way.

Collect source material without writing code

Good cases start with work your team already sees every day. You do not need code, a data pipeline, or a lab setup. You need a small batch of real material that shows what customers ask for, where the model gets confused, and what a correct answer looks like.

Screenshots from common workflows are a good place to start. Product teams can grab images from onboarding, checkout, form filling, account changes, or any screen where users often stop and ask for help. One screenshot with a short note like "user cannot find the export button" is often better than a made-up prompt.

Emails are another strong source. Save real customer messages after you remove names, account numbers, addresses, and anything else private. Keep the wording close to the original. If customers write messy, incomplete, or emotional messages, that is useful. Your cases should reflect real input, not cleaned-up examples that never happen in practice.

Tickets and internal docs often contain the missing context. A short note from a support ticket, QA report, sales handoff, or runbook can shape the case and make scoring easier later. "Customer wants to change billing owner but only has viewer access" already tells reviewers quite a lot.

Before you write prompts, group similar cases together. Put password resets in one group, billing changes in another, and account access problems in a third. That helps you avoid building ten versions of the same case while missing other work that matters.

A spreadsheet or shared folder is enough for a first pass. Keep the original source, the cleaned version, a short description of the task, and the expected outcome in the same place. When people disagree about a result, they can check the source instead of arguing from memory.
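
If you would rather keep that batch in a file than a shared spreadsheet, a small CSV works the same way. This is an optional sketch with made-up column names that mirror the fields above; rename them to match whatever your team already tracks.

```python
import csv

# Columns mirror what the shared sheet holds: the original source, the
# cleaned version, a short task description, and the expected outcome.
columns = ["case_id", "original_source", "cleaned_input",
           "task_description", "expected_outcome"]

rows = [
    {
        "case_id": "order-001",
        "original_source": "customer email, saved in the shared folder",
        "cleaned_input": "Customer A asks to move delivery to Friday and split the invoice.",
        "task_description": "Order change with two requests in one message",
        "expected_outcome": "Confirm the new delivery date and route the invoice split to billing.",
    },
]

with open("test_set_v1.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
```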

If your first batch has 20 to 30 cases, that is enough. Small sets built from real work usually teach more than large sets built from guesses.

Build your first set in six steps

Start with one workflow that people touch every day. Pick something boring and frequent, not a rare disaster case. If your team handles invoice approvals, order changes, or refund requests all week, start there.

First, keep the scope tight. One workflow is enough for a first set. When teams mix sales emails, support tickets, and internal requests together, they usually get messy results and end up debating rules instead of testing the AI.

Second, collect 20 to 30 recent examples. Use screenshots, emails, chat threads, forms, or short task notes. Real work beats invented prompts because it carries the details that trip systems up.

Third, rewrite each example so any teammate can read it quickly. Remove private names, account numbers, and anything sensitive, but keep the facts that matter. If an email says, "Please move delivery to Friday and split the invoice," the cleaned version should still show both requests clearly.

Fourth, label the correct outcome for each case. Write what a good answer or action looks like in plain language. Then mark the tricky parts. Maybe the request is missing a date. Maybe the customer asks for two things at once. Maybe policy says a human has to approve it.

Fifth, score simply. Pass, partial, or fail is enough for an early round. Add one short note for partial scores. "Understood the task but missed the approval rule" tells your team much more than a vague number.

Sixth, review the set with one domain expert before you run it. Pick the person who knows the workflow, not the person who likes testing tools. Ask where the wording feels unclear, where labels seem wrong, and which examples feel too easy.

That review matters more than fancy tooling. A small, clean set beats a huge pile of random prompts. Once the labels make sense to a human, you have a base the rest of the team can trust.
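
If you keep the scores next to the cases, a few lines can summarize a run. This is an optional sketch, assuming each row holds a case id, a pass, partial, or fail score, and the short note from step five; tallying by hand works just as well for 20 to 30 cases.

```python
from collections import Counter

# Each scored case: (case_id, score, short note). Scores are "pass",
# "partial", or "fail"; the note explains partial and failed results.
results = [
    ("invoice-001", "pass", ""),
    ("invoice-002", "partial", "understood the task but missed the approval rule"),
    ("invoice-003", "fail", "routed to the wrong team"),
]

tally = Counter(score for _, score, _ in results)
print(dict(tally))  # e.g. {'pass': 1, 'partial': 1, 'fail': 1}

# The notes on partial and failed cases are where the patterns show up.
for case_id, score, note in results:
    if score != "pass":
        print(f"{case_id}: {score} - {note}")
```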

Score outputs so humans agree

If two reviewers score the same answer very differently, the case is weak. Teams usually cause this by handing people a long policy document and hoping they read it the same way.

A short rubric works better. Keep each rule concrete and easy to score in under a minute.

Use a short rubric

For many cases, four checks are enough:

  • Facts: Did the answer use the right details from the case?
  • Action: Did it do the task asked, such as draft a reply, classify the issue, or suggest the next step?
  • Tone: Did it sound right for the customer and situation?
  • Completeness: Did it cover the whole request without skipping a needed part?

Use simple scoring. Pass or fail works for routine work. If you need more detail, use 0, 1, and 2, where 0 means missed, 1 means partly right, and 2 means clearly right.

Test the rubric on a small batch first. Ask two people to score the same 10 cases. If they disagree often, tighten the wording instead of debating each answer.
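
Checking that agreement does not need a tool either. Here is a sketch with invented case ids and scores; the only point is to count how often the two reviewers land on different answers.

```python
# Scores from two reviewers on the same ten cases. The ids and scores
# are invented for illustration.
reviewer_a = {"c01": "pass", "c02": "fail", "c03": "pass", "c04": "pass", "c05": "fail",
              "c06": "pass", "c07": "pass", "c08": "fail", "c09": "pass", "c10": "pass"}
reviewer_b = {"c01": "pass", "c02": "fail", "c03": "fail", "c04": "pass", "c05": "fail",
              "c06": "pass", "c07": "pass", "c08": "pass", "c09": "pass", "c10": "pass"}

disagreements = [case for case in reviewer_a if reviewer_a[case] != reviewer_b[case]]
agreement = 1 - len(disagreements) / len(reviewer_a)

print(f"agreement: {agreement:.0%}")        # e.g. 80%
print(f"cases to reword: {disagreements}")  # these rubric lines need tightening
```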

Split clean cases from messy ones

Routine cases and messy cases should not share the same bar. A password reset email with a clear account number is easy to judge. A complaint with missing facts, mixed emotions, and two possible resolutions is harder.

Keep those groups separate. Your pass rate stays honest, and reviewers stop fighting over edge cases that were never clean to begin with.

Do not force a score when the case itself is unclear. Mark it for human review if the source material is incomplete, the requested action clashes with policy, or two reviewers can defend different answers. Those cases still matter, but they belong in a review bucket, not in the summary chart.

Track one short note every time a case fails. Ten words can be enough: "wrong order status," "missed second question," "tone too cold," or "invented refund policy."

Those notes matter as much as the score. After 20 or 30 cases, patterns show up fast. Product and ops teams can then fix the prompt, the workflow, or the source material instead of guessing.
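
If you want to see the split and the notes side by side, a short script can do it. The data below is invented; it just assumes each case carries a clean or messy label, a score, and a short failure note.

```python
from collections import Counter, defaultdict

# (case_id, group, score, note) -- the group is "clean" or "messy",
# and the note holds the short failure reason. Example data only.
scored = [
    ("c01", "clean", "pass", ""),
    ("c02", "clean", "fail", "wrong order status"),
    ("c03", "messy", "fail", "missed second question"),
    ("c04", "messy", "fail", "missed second question"),
    ("c05", "messy", "pass", ""),
]

# Pass rate per group keeps routine and messy cases on separate bars.
by_group = defaultdict(list)
for _, group, score, _ in scored:
    by_group[group].append(score)
for group, scores in by_group.items():
    print(f"{group}: {scores.count('pass') / len(scores):.0%} pass of {len(scores)} cases")

# The most frequent failure notes point at what to fix first.
print(Counter(note for _, _, score, note in scored if score == "fail").most_common(3))
```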

A support inbox example

A good support eval starts with work your team already sees every week. Pick a small batch of refund requests, delivery delay complaints, and account access emails. Those categories force different kinds of judgment: policy checks, order lookups, and identity or account steps.

Add the message itself and the context a human agent would use. That usually means a screenshot of the order page, a screenshot of the account page, and any short note that explains the situation, such as "package marked delivered" or "subscription renewed yesterday." If the model will act inside a tool later, your eval should still focus on the decision first.

A small starter set

Keep the first set simple enough to review in one sitting. For example:

  • a refund email for an order that shipped late and arrived damaged
  • a delay complaint where tracking has not updated for four days
  • an account email from a user who cannot log in after changing their password

For each case, ask the model for the next reply or the next action. Do not ask for everything at once. "Draft the next support reply" is clear. "Choose the next action based on policy" is also clear. If you ask for reply, action, tone, escalation, and tags in one prompt, scoring gets messy fast.

Then score the result against the path your team expects. Did the model choose refund, replacement, wait, identity check, or escalation to a person? That decision path matters more than pretty wording. A reply can sound polite and still send the case the wrong way.

A simple score sheet usually works better than a long rubric. Check whether the model followed the right policy path, used the facts from the screenshots and email correctly, handed off to a human when needed, and avoided making things up.
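
That score sheet can stay as four yes or no columns per case. Here is a sketch of one scored row with made-up field names; the checks mirror the ones above.

```python
# One row of the support score sheet. Field names are illustrative;
# the four checks mirror the ones described above.
scored_case = {
    "case_id": "refund-late-damaged-001",
    "followed_policy_path": True,    # chose refund, per the late + damaged policy
    "used_facts_correctly": True,    # pulled the right order and delivery date
    "handed_off_when_needed": True,  # no human approval required for this one
    "no_made_up_details": False,     # invented a "48-hour refund guarantee"
}

checks = [k for k, v in scored_case.items() if isinstance(v, bool)]
failed_checks = [k for k in checks if not scored_case[k]]
print("pass" if not failed_checks else f"fail: {failed_checks}")
```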

Keep a few ugly examples in the set. One email might miss the order number. Another might mention two orders in the same thread. A screenshot might cut off the delivery date. Real support work is messy, and the test set gets better when it keeps that mess instead of filtering it out.

If a model does well on clean tickets but fails when one detail is missing, you learned something real. That kind of gap is fixable.

Mistakes that waste time

Most teams do not lose time because the model is terrible. They lose time because the test set is too clean, too vague, or too hard to judge.

A common mistake is copying only easy examples from smooth workflows. Clean screenshots, polite emails, and perfect handoffs make almost any system look decent. Real work has typos, missing details, conflicting notes, and rushed requests. If your cases skip that mess, your results will give you false confidence.

Teams also write cases that depend on hidden background knowledge. A prompt like "handle this refund the usual way" sounds normal inside a company, but a reviewer cannot score it unless they already know the policy, the exception list, and the customer history. Put the needed context inside the case itself. If a new teammate cannot read the case and judge the answer, the case is not ready.

Scoring rules often drift halfway through testing. Someone sees a few outputs, changes their mind about what counts as correct, and updates the rubric on the fly. Then old scores and new scores no longer match. You cannot compare runs if the target keeps moving.

Another time sink is stuffing several jobs into one prompt. If one case asks the model to summarize an email, tag urgency, draft a reply, and decide whether to escalate, nobody knows what failed. Split the work into separate cases unless the real task truly happens as one action.

Privacy mistakes can stop the whole effort. Real customer emails and screenshots often contain names, phone numbers, order details, or account data. Remove anything a reviewer does not need before you store or share the material. Keep the task shape, the tone, and the business context. Drop the identity.

A few habits prevent most of this:

  • sample messy cases, not just happy paths
  • put all scoring context inside the case file
  • freeze the rubric before the first run
  • test one task per case when possible
  • anonymize customer material early

Teams that follow these rules spend less time arguing about scores and more time finding the places where the model actually breaks.

Quick checks before you run the eval

A short review before the run saves hours of bad scoring. Good test sets usually feel a little plain at this stage, and that is fine. Clarity matters more than cleverness.

Start with a simple check: hand a few cases to a new teammate. They should be able to understand the task, the source material, and what a good output should do, without needing a meeting first. If they ask, "What am I supposed to look for?" the case is still too vague.

Keep each case focused on one decision or one action. Ask the model to classify a request, draft a reply, extract a field, or route a task. If one case asks for all four, reviewers will disagree even when the output looks decent.

Clean the inputs before anyone scores them. Remove names, email addresses, account numbers, and anything else that points to a real person or company. Replace them with plain labels like "Customer A" or "Order 1234" so the case stays realistic without exposing private data.
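
Even a basic find-and-replace pass helps with that cleanup, as long as a person still reads each case afterward. This sketch assumes a small hand-built mapping from real values to plain labels; real redaction usually needs more care than a script like this.

```python
# A small, hand-built mapping from real values to plain labels. This is a
# sketch: simple substitution catches the obvious identifiers, but someone
# should still read each case before it goes into the shared set.
replacements = {
    "Maria Jansen": "Customer A",
    "maria.jansen@example.com": "customer-a@example.com",
    "ORD-88421": "Order 1234",
}

raw = "Maria Jansen (maria.jansen@example.com) asked about ORD-88421 being charged twice."

cleaned = raw
for real, label in replacements.items():
    cleaned = cleaned.replace(real, label)

print(cleaned)
# Customer A (customer-a@example.com) asked about Order 1234 being charged twice.
```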

Reviewers also need the same scoring notes. A short rubric with a few pass and fail examples works better than a document nobody reads. If one reviewer cares most about tone and another cares most about accuracy, your scores will drift for no good reason.

Your set should look like normal work, not only problem cases. Include routine examples that happen every week, plus a smaller batch of messy ones with missing details, mixed signals, or awkward wording. That mix gives you a more honest read on how the model will behave in production.

A support team can see this quickly with one inbox example. A basic password reset case checks whether the system follows the steps. A messy billing email with two requests in one message checks whether the system can separate issues and still answer clearly.

If any case still feels fuzzy, cut it or rewrite it before the run. Ten clean cases will teach you more than fifty confusing ones.

What to do after the first run

Your first run gives you a map of where the system fails. Do not rush to add fifty more cases. Fix the ugliest failures first, especially the ones that would confuse a customer, create extra support work, or push a teammate into manual cleanup.

If the model keeps missing account details in emails, picking the wrong next step from a screenshot, or mixing up priority levels, solve those problems before you widen the set. Early volume can hide obvious defects. A smaller, cleaner set tells you more.

A simple rhythm works well:

  • group failures by type
  • fix the top one or two issues
  • rerun the same cases
  • check whether the score improved and whether new mistakes appeared

Keep one small frozen set that never changes. Use it as your before-and-after check. It should include the most common customer tasks and a few painful edge cases. When prompts, models, rules, or workflow steps change, run that same set again. Without it, teams fall back to memory, and memory is usually wrong.
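
Comparing two runs of that frozen set can be as simple as lining the scores up by case id. The results below are invented; the useful part is the list of cases that got worse, not just the overall rate.

```python
# Scores from the frozen set before and after a prompt change.
# Case ids and scores are invented for illustration.
before = {"f01": "pass", "f02": "fail", "f03": "pass", "f04": "pass"}
after  = {"f01": "pass", "f02": "pass", "f03": "fail", "f04": "pass"}

improved  = [c for c in before if before[c] == "fail" and after[c] == "pass"]
regressed = [c for c in before if before[c] == "pass" and after[c] == "fail"]

print(f"improved:  {improved}")    # f02 now passes
print(f"regressed: {regressed}")   # f03 broke -- look here before shipping
```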

Then grow the set slowly with fresh material from real work each week. Pull examples from support inboxes, internal handoff notes, screenshots, or recurring customer requests. Five new cases from last week's work are often better than thirty invented ones.

Review the results together. Product can spot feature gaps. Ops can see where the workflow breaks. Support can tell you whether the output would actually help a customer. A 30-minute shared review each week is enough for most teams, and it saves a lot of back-and-forth later.

If your team needs help shaping the eval process, choosing what to measure, or turning rough cases into something reliable, Oleg Sotnikov at oleg.is does this kind of work as a fractional CTO and startup advisor. That outside view can be useful when the team is too close to the problem and keeps changing too many things at once.

Frequently Asked Questions

What is a synthetic test set?

A synthetic test set gives your team a fixed batch of realistic cases to run again and again. You build it from real work like screenshots, emails, tickets, and task notes, then you write the expected outcome in plain language.

That lets product, ops, and support judge the same cases by the same standard instead of arguing from memory.

Why not just test with a few demo prompts?

Because a polished demo only shows one narrow path. Real work has missing details, messy screenshots, mixed requests, and emotional messages, and a few handpicked prompts rarely capture that.

When you test only the easy version of the job, weak spots stay hidden until customers hit them.

How many cases should we start with?

Start with 20 to 30 cases from one workflow your team sees every week. That is enough to spot patterns without drowning in cleanup and scoring work.

Pick something common and boring, like refunds, invoice changes, or account access, not a rare disaster case.

What should one good test case include?

Keep each case small. Use one real input, one clear goal, the few facts that change the decision, the expected action in plain language, and a short note on what counts as wrong.

If one case asks the model to do three jobs at once, you will not know what actually failed.

Do we need code or special tools to build the first set?

No. A spreadsheet or shared folder works fine for the first version. Save the source material, the cleaned version, the task description, and the expected outcome in one place.

You only need enough structure so reviewers can check the case and score it the same way.

How do we score outputs so reviewers agree?

Keep scoring short and concrete. Most teams do well with a simple rubric that checks facts, action, tone, and completeness.

Use pass or fail for routine work, or use 0, 1, and 2 if you want a little more detail. Then ask two people to score the same small batch and tighten the wording if they disagree often.

Should clean and messy cases stay in the same bucket?

No. Split clean cases from messy ones. Routine requests need one bar, while incomplete or mixed cases need another.

That keeps your pass rate honest and stops reviewers from fighting over cases that were unclear from the start.

How do we protect customer privacy when we use real emails and screenshots?

Remove names, email addresses, account numbers, phone numbers, and anything else a reviewer does not need. Keep the shape of the task, the tone, and the business context.

Simple labels like Customer A or Order 1234 usually preserve enough detail to make the case useful.

What mistakes make these evals a waste of time?

Teams usually waste time in three ways: they copy only easy examples, they hide needed context outside the case, or they change the scoring rules halfway through the run.

Another common problem shows up when one prompt asks for tagging, summarizing, replying, and escalation all at once. Split those jobs unless the real workflow truly combines them.

What should we do after the first run?

Group failures by type and fix the ugliest ones first. Then rerun the same cases and check whether the score improved or whether you created a new problem.

Keep one small frozen set for before-and-after checks, and add a few fresh cases from real work each week. That gives you a stable baseline without turning the set into a random pile.