Sep 28, 2025·8 min read

Write eval cases without Python from tickets and sheets

Write eval cases without Python by turning support tickets, spreadsheets, and screenshots into repeatable tests your product team can keep current.

Why good feedback never reaches tests

Support teams usually spot the problem first. A customer sends a ticket, adds a screenshot, and explains what went wrong in plain language. That report often contains the exact details a future test needs: the account type, the unusual input, the screen state, and the result the customer expected.

Then the issue moves upstream, and the detail starts to disappear. A messy real case gets compressed into a short note like "export failed for some users." That summary is easy to pass around, but it drops the conditions that made the bug happen.

Engineering teams then do what they are supposed to do. They fix the reported case, ship the patch, and move on. If nobody turns that ticket into a repeatable test, the team solves one instance and misses the pattern.

That is why the same issue shows up again a month later after a refactor, a UI update, or a change in data handling. The original customer report had enough signal to protect the product, but it never became part of the test set. People remember the bug for a while, then memory fades and the app slips back into the same mistake.

A spreadsheet import bug is a common example. Support may know the failure only happens when one column mixes dates and text, and only for accounts using a certain template. By the time the bug reaches an engineer, the story may be reduced to "import error on CSV." That is not enough to catch the next regression.

Teams that want domain experts to write test cases need to protect that raw evidence before it gets cleaned up too much. Tickets, notes, screenshots, and spreadsheets are not clutter. They are often the best source of realistic tests because they show how real people break the product in ways nobody planned for.

What a useful eval case looks like

A useful eval case is small, specific, and easy to judge. If two people read it, they should reach the same result. That matters more than fancy tooling.

Start with one user trying to do one thing. Do not bundle three problems into one case. If a customer asked, "Why did the summary leave out the refund date?" you already have a strong starting point.

Use input from real work. A support ticket, a spreadsheet row, a pasted chat, or a screenshot of a bad result works better than an invented example. Real inputs carry the odd wording, missing context, and messy details that usually break a product.

Describe the expected result in plain words. Write what a person should see, not how the system should do it. "The summary mentions the refund date and the final amount" is clearer than "The model extracts all relevant entities."

The pass rule should also stay simple. A product manager, support lead, or analyst should be able to check it in a few seconds. If the answer includes both required facts, it passes. If one is missing, it fails.

Most solid cases include four things:

  • a short name
  • the real input that caused trouble
  • the result you want in normal language
  • the exact reason it passes or fails

Take a case built from a customer complaint. The input is the original message: "I returned the order on March 3 and still have not received my $84.20 refund." The expected result says the reply or summary must mention the return date and the refund amount. The rule says it fails if either fact is missing.

That is enough. If the case needs a full meeting to explain, it is too vague.
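For whoever later connects the case list to an automated run, the refund case above could be sketched roughly like this. The `passes` helper and the dictionary layout are illustrative assumptions, not a required format, and the two replies are invented to show one pass and one fail.

```python
# Hypothetical helper: a case passes only if every required fact appears in the reply.
def passes(reply: str, required_facts: list[str]) -> bool:
    text = reply.lower()
    return all(fact.lower() in text for fact in required_facts)

case = {
    "name": "Refund reply mentions date and amount",
    "input": "I returned the order on March 3 and still have not received my $84.20 refund.",
    "required_facts": ["March 3", "$84.20"],
}

# Invented replies, used only to show both outcomes.
good_reply = "We received your return on March 3 and will send the $84.20 refund."
bad_reply = "We received your return and will process the refund shortly."

print(passes(good_reply, case["required_facts"]))  # True
print(passes(bad_reply, case["required_facts"]))   # False
```

A plain substring match is deliberately crude. It would also accept "March 30", so a person should still glance at borderline results.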

Where domain experts can find test material

Start with evidence your team already trusts. Good test material usually sits in the same places people use to report bugs, explain edge cases, and confirm what the product should do.

Customer tickets are often the best source. They give you the exact words people used, the steps they followed, and the result they expected. A cleaned-up product summary often removes the detail that made the issue real.

Spreadsheets are another strong source, especially when teams track imports, pricing rules, approvals, or messy account data there. One strange row with a blank field, a date in the wrong format, or a currency mismatch can become a strong repeatable test. The boring rows rarely teach you much. The ugly ones do.

Screenshots help when text leaves room for argument. A good screenshot can show the wrong label, the missing warning, the broken layout, or the number that clearly does not match the source record. Save the image next to the case so anyone can see what the user saw.

Policy notes, internal docs, and review comments also help. They are useful when the problem is less about a visible bug and more about a rule the product must follow every time.

Turn raw evidence into one repeatable test

Start with one recent issue, not a broad category. A real support ticket, a spreadsheet row from QA, or a screenshot from a customer chat works better than a made-up example. Recent cases are easier to trust because the team still remembers what happened and why it mattered.

Then shrink the evidence down to one user action and one expected result. If the original ticket says, "The export failed after I changed the date range and edited two filters," do not keep every side detail. Keep the exact step that seems to trigger the problem, and write the outcome the user expected.

Before you save anything, remove private details. Replace names, emails, account IDs, prices, and anything else that points to a real customer. If a screenshot helps, crop it or blur the sensitive parts. The test should preserve the behavior, not the identity of the person who reported it.
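A first redaction pass can be partly automated before a person reviews the text. The sketch below assumes account IDs look like `ACC-` followed by digits, which is a made-up format for illustration. It does not catch names, so it supports a manual check rather than replacing it.

```python
import re

# Simple email pattern; good enough for a first pass, not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumed account ID format (hypothetical): "ACC-" followed by digits.
ACCOUNT_ID = re.compile(r"\bACC-\d+\b")

def redact(text: str) -> str:
    """Replace emails and account IDs with neutral placeholders."""
    text = EMAIL.sub("[email]", text)
    text = ACCOUNT_ID.sub("[account-id]", text)
    return text

# Invented ticket text for illustration.
ticket = "Export failed for jane.doe@example.com on account ACC-48213 after changing the date range."
print(redact(ticket))
# Export failed for [email] on account [account-id] after changing the date range.
```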

A simple case usually answers four questions: where the evidence came from, what the user did, what result they expected, and what redactions or edits you made.

That source note matters more than many teams realize. Six weeks later, someone will ask, "Why do we have this test?" If the case includes "based on ticket 1842 from March billing review" or "taken from the onboarding issues sheet," the answer is already there.

Keep the wording simple enough that a support lead or product manager can update it without asking an engineer for help. That is how teams keep non-coder test cases useful over time.

A repeatable test should stay small on purpose: one issue, one action, one expectation, one source note. If a ticket contains three different failures, split it into three cases. Small cases are easier to review, rerun, and fix when the product changes.

Use a format non-coders can edit

If a new test needs a script file, a repo, and a code review, most domain experts will stop after the first try. A better setup is a sheet, database table, or board where each case lives in one row or one card.

The format should look like normal work, not like a programming task. A short template is usually enough:

  • scenario
  • input
  • expected result
  • check

These fields are simple on purpose. "Scenario" explains the situation in plain English. "Input" holds the customer ticket, prompt, form entry, or copied message. "Expected result" says what a good answer or action should do. "Check" turns that into a pass rule, such as "mentions the refund deadline" or "does not claim the order shipped."

Keep the rule simple: one case should answer one question. If a row tries to test tone, policy accuracy, edge cases, and formatting all at once, nobody knows why it failed.

Two extra fields help a lot over time: owner and review date. The owner tells the team who can confirm a change when the product, policy, or wording shifts. The review date keeps old cases from sitting around long after they stop matching real customer work.
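If the sheet is exported as CSV, a short script can confirm that every row has the four core fields filled in before anyone runs it. This is a minimal sketch: the column names mirror the template above, while the sample rows, owner names, and dates are invented for illustration.

```python
import csv
import io

REQUIRED = ["scenario", "input", "expected_result", "check"]

# Invented export of the shared sheet; the second row is missing its check.
sheet = """scenario,input,expected_result,check,owner,review_date
Refund status reply,"I returned the order on March 3 and still have not received my $84.20 refund.",Reply mentions the return date and refund amount,mentions March 3 and $84.20,support-lead,2025-12-01
Shipping claim,Where is my order?,Reply does not claim the order shipped,,support-lead,2025-12-01
"""

def missing_fields(row: dict) -> list[str]:
    """Return the required columns that are empty in this row."""
    return [field for field in REQUIRED if not (row.get(field) or "").strip()]

rows = list(csv.DictReader(io.StringIO(sheet)))
for row in rows:
    gaps = missing_fields(row)
    status = "ok" if not gaps else "missing: " + ", ".join(gaps)
    print(f"{row['scenario']}: {status}")
```

The domain expert keeps editing the sheet as usual. The script only points at rows that are not ready to score.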

Screenshots also deserve a place in the case instead of getting buried in chat or saved on one person's laptop. A screenshot of the original ticket, a marked-up UI state, or a pasted reply often clears up details that text alone misses. Store that evidence next to the row or card so the case still makes sense months later.

Keep the format a little boring. That usually means it will last. If a support lead or product manager can update a case in two minutes during a review meeting, the team will keep using it.

A simple example from a customer ticket

A support ticket often has everything you need for a useful eval case. Imagine a customer says their invoice total is wrong after a discount. They attach the order details and circle the total in a screenshot because that is the number that looked off.

The team does not need code to turn that into a test. They can copy the facts from the ticket into a small table or form: item price, quantity, discount, tax rule, and the total the system should show. The screenshot removes guesswork about which field failed.

A plain version of the case could look like this:

  • Item price: $120
  • Quantity: 2
  • Discount: 25%
  • Tax: 10%
  • Expected invoice total: $198

Now the rule is clear. The system should calculate the subtotal first ($120 × 2 = $240), apply the 25% discount ($180), then add 10% tax ($198). If the product shows $264 or $216, the case fails. If it shows $198, it passes.
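The arithmetic behind that expected total can be written down as a small check. This is a sketch of the rule as stated in the case (subtotal, then discount, then tax). The function name is hypothetical, and a real pricing engine may round at different steps.

```python
from decimal import Decimal

def invoice_total(price: str, quantity: int, discount_pct: str, tax_pct: str) -> Decimal:
    """Subtotal first, then discount, then tax, as the case describes."""
    subtotal = Decimal(price) * quantity
    discounted = subtotal * (1 - Decimal(discount_pct) / 100)
    return discounted * (1 + Decimal(tax_pct) / 100)

expected = Decimal("198")
total = invoice_total("120", 2, "25", "10")
print(total == expected)  # True

# Skipping the discount entirely reproduces one of the failing totals.
print(invoice_total("120", 2, "0", "10") == Decimal("264"))  # True
```

Using `Decimal` instead of floats keeps currency math exact, so the comparison against the expected total does not fail on rounding noise.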

That sounds simple because it is. It also fixes a common problem. Teams often store the ticket, discuss it once, patch the bug, and move on. A month later, the same pricing mistake returns in a different checkout flow. When domain experts can keep these checks alive, the team does not have to wait for an engineer to translate every issue into a test.

Why this example works

It uses real customer evidence instead of a made-up edge case. Anyone on the team can read it and see what should happen. Finance, support, product, and QA can all confirm the expected total without reading a test script.

It is also easy to maintain. If pricing rules change, the team updates one expected number or one formula note. The ticket becomes a repeatable test, and the screenshot gives future reviewers quick context when the numbers look odd again.

How screenshots make cases clearer

Screenshots stop a common argument: "What did the user actually see?" A ticket might say "the result looked wrong," but an image can show the exact screen state at that moment. You can see the empty field, the warning text, the wrong total, or the missing status label.

That helps when domain experts are writing and reviewing cases. They may not describe a bug in code terms, but they can point to a screen and say, "This message should appear here" or "This value should not be blank." That is enough to start a solid test.

A useful screenshot does not try to capture the whole page. It focuses attention on the one part that decides whether the case passed or failed. Mark the field, message, row, or output that matters. A small box, arrow, or note is usually enough.

Keep the written note next to the image short. It only needs to say what input or setup produced the screen, what part of the screen to check, and what text, value, or state should appear.

Without that note, the image turns into a memory test. People look at it later and guess what they were supposed to notice. That guesswork kills trust.

Images also fail on their own because they are easy to misread. Two screens can look almost the same while the actual output is wrong. A currency value may differ by one digit. A status might say "Pending" instead of "Paid." A hidden field may drive the wrong result even though the layout looks fine.

A short written expectation fixes that. For example: "Given invoice type B and a tax-exempt customer, the total shows $0 tax and the warning banner does not appear." The screenshot shows where to look. The sentence tells the team what must be true.

Mistakes that make teams stop using evals

Teams stop trusting evals when the test set feels messy, vague, or outdated. The problem usually is not the idea. It is how the cases get written and maintained.

A common mistake is packing too much into one case. Someone takes three customer complaints, mixes them together, and calls it one test. When that case fails, nobody knows what broke. Was it the wording, the logic, or the formatting? One case should check one thing.

Another problem is weak expected results. If the expected answer says "looks right" or "seems fine," the case cannot guide anyone. Two reviewers will read that line and make two different calls. Good evals use plain, concrete checks: a field must match, a label must appear, or a summary must include one fact and exclude another.

A few mistakes show up again and again:

  • One case tries to test several failures at once.
  • The expected result uses soft language instead of a clear pass rule.
  • The team keeps the test but loses the ticket, sheet row, or screenshot that started it.
  • Old cases stay active after the product changes, so the set punishes current behavior for not matching old rules.

The missing source problem hurts more than teams expect. Months later, someone asks, "Why do we even test this?" If nobody saved the original ticket or screenshot, the case turns into folklore. People argue about intent instead of checking evidence.

Stale cases are just as bad. A product team updates a workflow, changes a form, or rewrites copy, but the eval set stays frozen. Then the tests fail for the wrong reason. After a few rounds like that, people stop opening the results because they assume half the failures are noise.
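If each case carries a review date, spotting stale ones can be a simple filter. The sketch below assumes dates are stored as ISO strings (YYYY-MM-DD). The rows and owner names are invented for illustration.

```python
from datetime import date

# Hypothetical rows exported from the shared case sheet.
cases = [
    {"scenario": "Refund status reply", "owner": "support-lead", "review_date": "2025-06-01"},
    {"scenario": "Invoice total after discount", "owner": "finance", "review_date": "2025-12-01"},
]

def overdue(cases: list[dict], today: date) -> list[str]:
    """Return scenario names whose review date has already passed."""
    return [c["scenario"] for c in cases if date.fromisoformat(c["review_date"]) < today]

for name in overdue(cases, today=date(2025, 9, 28)):
    print(f"Needs review: {name}")
# Needs review: Refund status reply
```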

If you want people to keep using evals, keep each case narrow, write exact expected outcomes, save the source, and review old cases on a schedule. A smaller set that people trust beats a large set that everybody ignores.

Quick checks before a case goes live

Teams often get stuck on one simple problem: the case still needs a meeting to explain it. If a support lead, product manager, or ops person cannot read it quickly and score it with confidence, it is not ready.

A good case feels plain. It names the source, shows the input, states the correct result, and explains why that result matters. Someone new to the project should understand it in about a minute.

Use a short check before you add any case to your shared sheet or test library:

  • A non-engineer can read the case once and tell what is being tested.
  • The pass rule is specific enough that two reviewers would give the same score.
  • The case comes from a real ticket, real spreadsheet row, or a written business rule.
  • The team can run it again after every release without hunting for missing context.

The second point matters a lot. "Looks good" is not a rule. "Invoice total matches the source sheet, tax included, and the customer name stays unchanged" is a rule.
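A rough lint can catch the softest pass rules before a case goes live. The phrase list below is an assumption seeded from the examples in this article. It flags obvious cases only and cannot tell whether a rule is actually specific enough.

```python
# Assumed list of soft phrases; extend it with whatever your team tends to write.
VAGUE_PHRASES = ("looks good", "looks right", "seems fine")

def is_vague(check: str) -> bool:
    """Flag empty checks and checks that lean on soft wording."""
    text = check.strip().lower()
    return not text or any(phrase in text for phrase in VAGUE_PHRASES)

print(is_vague("Looks good"))  # True
print(is_vague("Invoice total matches the source sheet, tax included, "
               "and the customer name stays unchanged"))  # False
```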

Real origin matters too. Cases pulled from customer tickets catch pain that users already felt. Cases built from policy docs catch the quiet mistakes that trigger refunds, delays, or compliance problems. If a case matches neither, it usually turns into busywork.

Rerun speed is the last filter. If the case depends on one person remembering an old screenshot, a private chat message, or a hidden spreadsheet tab, it will die after two releases. Put the input, judgment rule, and source note in one place.

A case is ready when it is easy to read, easy to judge, and easy to rerun. That is what makes a small eval set survive real product work.

Next steps for a process people will keep

A process survives when it starts with real work people already trust. Pick 20 cases from recent customer tickets, not a giant backlog from six months ago. Recent tickets use the same words customers use, and the team still remembers what went wrong and what the correct result should be.

Give the case list one clear owner outside engineering. Support is often the best fit. Product can work too. That person does not need to code. They need to keep case descriptions clear, ask for missing evidence, and make sure old cases do not pile up forever.

A small review rhythm works better than a heavy process. Put a short meeting on the calendar every two to four weeks. Keep it simple. If the meeting feels boring, that is usually a good sign.

In each review, the team can add a few new cases from fresh tickets, remove duplicates and stale cases, fix expected results that sound vague, and flag the cases that should run before each release.

That is enough to keep the workflow useful. Most teams do not fail because the format is weak. They fail because nobody owns the list, nobody reviews it, and nobody cleans it up. A plain spreadsheet or shared table is fine if people actually update it.

If the list grows fast, split ownership by product area, but keep the same format across teams. A billing case, a search case, and a support workflow case should all look familiar on the page. That makes handoffs easier and cuts down on arguments about what a test is supposed to prove.

Some teams need help setting this up so support, product, and engineering can all maintain it. Oleg Sotnikov at oleg.is works with startups and small businesses on practical AI-augmented development processes, product architecture, and Fractional CTO support. If your team is stuck between customer tickets and repeatable tests, that kind of hands-on setup can save a lot of wasted time.

Frequently Asked Questions

What makes an eval case useful?

Use one real input, one expected result, and one clear pass rule. If two people read the case and score it the same way, the case is ready.

Can non-engineers write eval cases?

Yes. Support, product, QA, and ops can write strong cases if they use plain language and real evidence. They do not need Python if the case format stays small and easy to edit.

Where should we get test material from?

Start with customer tickets, spreadsheet rows, screenshots, policy notes, and review comments. Those sources show real failure patterns instead of neat examples nobody actually sees.

How much of the original ticket should we keep?

Keep the detail that triggers the problem and drop the noise around it. You want one user action and one expected outcome, not the full story of the whole ticket.

How do we protect private customer data in test cases?

Remove names, emails, account IDs, prices, and anything else that points to a real person or company. If you need a screenshot, crop or blur the sensitive parts and keep only the evidence that explains the behavior.

Why should we save screenshots with the case?

Screenshots show exactly what the user saw when words get fuzzy. Pair the image with a short note that says what to check, or people will guess and score the case differently.

What format works best for teams that do not code?

A shared sheet, table, or board works well because domain experts already use those tools. Give each case a few fields like scenario, input, expected result, check, owner, and review date.

How do we write a pass rule people can actually use?

Write the rule so anyone can make a quick yes or no call. "Mentions the refund date and amount" works better than "looks right" because it leaves less room for debate.

How often should we review eval cases?

Review the set every two to four weeks and clean it as you go. Add fresh cases, remove duplicates, and update anything that no longer matches the product or policy.

Why do teams stop using evals after a while?

Teams lose trust when cases get vague, packed with too many checks, or drift away from the current product. Save the source, keep each case narrow, and retire old ones before they turn into noise.