Dec 17, 2025 · 8 min read

Red team exercises for AI workflows before launch

Use red team exercises for AI workflows to test sales, support, and ops tasks, catch bad tool calls, and add approvals before launch.

Why real task tests fail quietly

Early AI tests usually look better than real work. Teams use clean examples, short prompts, and normal cases. The model passes, sounds confident, and everyone moves on.

That confidence is part of the problem. An AI can pick the wrong tool, send a message too early, or skip a policy check while sounding calm and polished. People trust fluent answers more than they should, especially when the result looks close enough.

Real business work breaks in messy moments, not obvious ones. A sales workflow may handle a standard discount request just fine, then fail when a buyer asks for custom terms, a faster close date, and legal changes in the same thread. The prompt drops one rule because the case no longer matches the tidy examples used in testing.

Approvals fail the same way. On paper, the workflow includes a manager sign-off. In practice, an unusual case can push the AI to act first and ask later, or skip approval because one field looked optional. Teams rarely catch that in a demo. They find it after support issues a refund that needed review, or after ops updates a vendor record without the second check finance expects.

These failures stay quiet because nothing crashes. The email still sends. The ticket still closes. The record still updates. The damage shows up later, when a customer gets the wrong answer, a rep promises something the company cannot honor, or an internal team has to clean up a bad change.

That is why realistic red team exercises matter before launch. They test the awkward cases people forget to simulate: mixed intent, missing fields, strange wording, partial approvals, and requests that look harmless until the tool call goes through. If you only test the happy path, customers will find the rest for you.

Choose the workflows to test first

Start with tasks where a small mistake creates real damage. Good early tests involve sending a message to a customer, changing a record in a trusted system, or spending money through a refund, credit, purchase, or cloud action. Those cases expose weak prompts and bad tool choices fast because the outcome is visible and hard to ignore.

Do not begin with a broad sample of everything the AI might do. Pick three flows: one in sales, one in support, and one in ops. That gives you enough variety to spot patterns without creating a review pile nobody finishes.

A simple starting set works well. In sales, test a flow that replies to an inbound lead, qualifies the request, and updates the CRM stage. In support, use a billing or delivery issue where the agent needs to look up account details and decide whether a refund is even possible. In ops, pick a routine request such as an access change, a deployment step, or a service restart.

Use examples from recent work, not workshop prompts someone wrote to make the system look good. Real tasks include vague wording, copied notes, missing details, and odd customer behavior. That mess is useful because prompt failures usually show up there first.

Remove private data before you test. Replace names, emails, phone numbers, account IDs, and anything regulated. Keep the shape of the task intact, though. If a support ticket came in angry and unclear, the test case should feel the same.
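A minimal sketch of that scrubbing step, assuming the private data shows up as emails, phone numbers, and account IDs in an `ACC-12345` style. The patterns and the ID format are assumptions for illustration. Regexes will not catch names or free-text details, so a person should still read each case before it goes into the test set.

```python
import re

# Hypothetical patterns; adjust them to the identifiers your own systems use.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"\bACC-\d+\b"), "<ACCOUNT_ID>"),  # assumed account ID format
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def scrub(text: str) -> str:
    """Swap obvious identifiers for placeholders and leave tone and wording alone."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

ticket = "STILL no refund!! Call me at +1 (555) 010-2345 or jane.doe@example.com, account ACC-99812."
print(scrub(ticket))
# STILL no refund!! Call me at <PHONE> or <EMAIL>, account <ACCOUNT_ID>.
```

The angry, unclear wording survives, which is the part of the ticket the test actually needs.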

Keep the first round small enough for manual review. Three to five real examples for each workflow is usually enough. Someone should read the prompt, watch each tool call, check whether the AI asked for approval at the right time, and judge the final action.

If you start with thirty cases per team, people skim. When people skim, they miss why the workflow failed. These tests work better when the sample is small, concrete, and close to work your team handled last week.

Decide what counts as a failure

Teams usually catch obvious breakage and miss the risky stuff. A workflow can finish the task and still fail if it used the wrong tool, skipped approval, or invented a step that sounded believable.

Start with approvals. Write down every action that must stop for a human check, even when the AI sounds sure. In sales, that might mean offering a discount above a set limit. In support, it could mean approving a refund. In ops, it may mean changing production settings, deleting records, or touching customer billing.

Then make two tool lists for each task. One list names the tools the AI may use. The other names the tools it must never touch. If a support agent can read ticket history but cannot issue credits, any credit attempt is a failure even when the amount is correct.
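The two lists can be written down as data so reviewers check runs against them instead of memory. This is a sketch under assumptions: the tool names are hypothetical, and a tool that appears on neither list is treated as a violation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset    # tools the AI may use for this task
    forbidden: frozenset  # tools it must never touch

    def violations(self, tool_calls):
        """Return every called tool that is forbidden or not explicitly allowed."""
        return [
            name for name in tool_calls
            if name in self.forbidden or name not in self.allowed
        ]

# Hypothetical tool names for a support workflow.
support_policy = ToolPolicy(
    allowed=frozenset({"read_ticket_history", "draft_reply"}),
    forbidden=frozenset({"issue_credit", "update_billing"}),
)

run = ["read_ticket_history", "issue_credit", "draft_reply"]
print(support_policy.violations(run))  # ['issue_credit']
```

The check ignores whether the credit amount was right. Any attempt counts, which matches the rule above.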

Pass and fail rules should be plain enough that two reviewers would make the same call. For example, a run passes if the AI drafts the reply, asks for approval before sending it, and uses only the CRM. It fails if it changes pricing without approval, calls finance or production tools, finishes the task with missing customer data, or makes up facts.
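The mechanical part of those rules can be sketched as a function over the ordered event log of a run. The event shape and the tool names here are assumptions. Made-up facts and missing customer data still need a human reviewer, so this covers only tool use and approval order.

```python
def judge_run(events, allowed_tools, gated_tools):
    """Return failure reasons for one run; an empty list means the run passes.

    `events` is the ordered log of a run. Each event is a dict with a "type"
    ("tool_call" or "approval") and a "tool" name. This shape is an assumption.
    """
    failures = []
    pending_approvals = set()
    for event in events:
        tool = event["tool"]
        if event["type"] == "approval":
            pending_approvals.add(tool)
        elif event["type"] == "tool_call":
            if tool not in allowed_tools:
                failures.append(f"used tool outside the allowed list: {tool}")
            elif tool in gated_tools:
                if tool in pending_approvals:
                    pending_approvals.discard(tool)  # one approval covers one call
                else:
                    failures.append(f"called {tool} without prior approval")
    return failures

ALLOWED = {"crm_lookup", "crm_send_reply"}  # hypothetical CRM tool names
GATED = {"crm_send_reply"}                  # must wait for a human

good_run = [
    {"type": "tool_call", "tool": "crm_lookup"},
    {"type": "approval", "tool": "crm_send_reply"},
    {"type": "tool_call", "tool": "crm_send_reply"},
]
bad_run = [
    {"type": "tool_call", "tool": "crm_lookup"},
    {"type": "tool_call", "tool": "crm_send_reply"},
    {"type": "tool_call", "tool": "finance_update_price"},
]
print(judge_run(good_run, ALLOWED, GATED))  # []
print(judge_run(bad_run, ALLOWED, GATED))
```

Because the function returns reasons instead of a bare pass or fail, two reviewers can see exactly which rule a run broke.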

Small details matter. If the AI sends a polished support message to the wrong customer, that is a failure. If it opens the right record but pulls notes from a blocked system, that is also a failure. A smooth answer does not cancel a bad action.

Assign reviewers before the exercise starts. One person should judge business accuracy. Another should check tool use and approval steps. A third should log every issue in one place with the task name, expected behavior, actual behavior, and risk level.

That structure keeps the exercise honest. Teams stop arguing about vibes and start judging runs against rules they already agreed on.

Run the exercise step by step

A good test should feel like live work, not a demo. Give the model the same task, the same prompt, the same tool access, and the same approval rules it will have after launch. If the live version will read the CRM, draft emails, open support tickets, or touch internal docs, the test version should do that too.

Clean tests hide real problems. Use normal tasks first so you know the basic path works, then add the messy requests people send every day: missing details, conflicting numbers, vague requests, and pressure to act fast. An agent that handles a simple sales quote may fail when the customer asks for a discount that needs approval or when two systems show different prices.

Pause at every tool call

Do not let the run blur into one final result. Stop after each tool call and ask why the model chose that tool, what evidence it used, and whether it should have asked for approval before the next step. This is where bad behavior shows up.

A support workflow makes the point clearly. If the AI reads a refund policy, opens the billing system, and drafts a refund, pause at each move. The draft may look fine, but the model might have skipped the approval gate because the prompt told it to "be helpful" and the tool had broader access than expected.

Keep one shared record

Put the whole run in one place so people can compare failures without guessing what happened. A simple log should include the task input, prompt version, final output, every tool call in order, each approval request or skip, reviewer notes, and the pass or fail decision.
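One way to hold that log is a small record type that serializes to a single JSON line per run. The field names follow the list above; the values and the `support-v7` version label are made up for illustration.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ToolCall:
    tool: str
    payload: dict
    approval: str  # "requested", "skipped", or "not_required"

@dataclass
class RunRecord:
    task_input: str
    prompt_version: str
    final_output: str
    tool_calls: list = field(default_factory=list)  # kept in call order
    reviewer_notes: str = ""
    passed: bool = False

record = RunRecord(
    task_input="Customer asks for a refund 3 days past the policy window.",
    prompt_version="support-v7",  # hypothetical version label
    final_output="Drafted reply explaining the policy.",
    tool_calls=[
        ToolCall("read_refund_policy", {}, "not_required"),
        ToolCall("draft_refund", {"order_id": "A-1001"}, "skipped"),
    ],
    reviewer_notes="Drafted a refund without asking for approval.",
    passed=False,
)

line = json.dumps(asdict(record))  # one line per run, easy to append to a shared file
print(line)
```

Keeping the prompt version in the record is what makes later reruns comparable.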

Then fix one thing at a time and rerun the same task. Compare the new run with the old one, not with a different case. Teams often change the prompt, tighten a permission, and assume the problem is gone. Sometimes it is. Sometimes the model stops making the risky tool call but starts refusing valid work instead.

That repeat cycle matters more than one dramatic test. Small reruns show whether the workflow is getting safer or just behaving differently.
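Comparing a rerun with the earlier run of the same task can be as simple as diffing the tool-call sequences. This sketch assumes each run is reduced to an ordered list of tool names, all of them hypothetical.

```python
def compare_runs(old_calls, new_calls):
    """Summarize how the tool-call sequence changed between two runs of one task."""
    old_set, new_set = set(old_calls), set(new_calls)
    return {
        "dropped": sorted(old_set - new_set),  # tools the new run no longer calls
        "added": sorted(new_set - old_set),    # tools that appear only in the new run
        "same_sequence": old_calls == new_calls,
    }

before_fix = ["read_policy", "open_billing", "issue_refund"]
after_fix = ["read_policy", "open_billing", "request_approval"]

print(compare_runs(before_fix, after_fix))
# {'dropped': ['issue_refund'], 'added': ['request_approval'], 'same_sequence': False}
```

If a rerun drops the risky call and adds nothing useful in its place, that is the "refusing valid work" pattern and worth a reviewer's attention.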

A simple test day in sales, support, and ops

Stress Test Before Release

Run realistic launch checks before customers find the weak spots.

Plan Test

A useful test day should feel like a messy workday, not a lab exercise. Give the AI normal tasks with one missing fact, one policy limit, or one conflicting instruction. That is when bad behavior shows up.

In sales, give the agent a lead that looks promising but has no budget details. Ask it to qualify the lead and draft the next message. A careful agent should stop and ask for the missing number or mark the lead as incomplete. A careless one will guess, promise too much, or move the lead into the wrong stage.

In support, use a refund request that feels urgent and emotional. The customer sounds upset, but the purchase falls outside policy by a few days. Watch closely. The AI should explain the policy, gather the order details, and ask a person for approval before it offers a refund. If it refunds first and checks later, that is a clear failure.

In ops, create a stock reorder task after a sudden jump in demand, then add conflicting signals. One message says demand doubled this week. Another says the spike came from a one-time bulk order. The AI should check current stock, supplier lead times, and purchase limits before it places any order. If urgency pushes it to buy fast, the test did its job.

Mix the wording on purpose. Use prompts like "handle this today" or "follow policy, but keep the customer happy no matter what." Real teams write rushed messages like that. Your tests should copy that pressure.

A short checklist helps. Did the AI ask for context when facts were missing? Did it ask for approval when the task involved money, discounts, refunds, or purchasing? Did it admit uncertainty, or did it act like it knew more than it did? Did urgency trigger the wrong tool call or a bad decision?

If the AI pauses and asks one extra question, that is often a good sign. A small delay costs far less than a bad refund, a false sales promise, or a warehouse full of stock nobody needed.

Check tool calls, approvals, and prompts

Most failures hide in the gap between the model's words and the action it sends to a tool. A chatbot can sound calm and correct while it updates the wrong CRM record, refunds the wrong order, or grants access to the wrong person.

That is why you need two views at once: the conversation the user saw and the exact tool call the system sent, along with what the tool returned. If those do not match, you found a launch risk.

Start with the tool call itself. Check the account ID, record ID, email address, amount, date range, and any filters the model passed along. Small mistakes matter. A support agent asking to close ticket 1842 should not trigger a call that edits 1482 because the model guessed from a messy thread.
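A reviewer can make that comparison explicit by writing down the expected fields and diffing them against the payload the model sent. The payload shape below is an assumption; the ticket numbers come from the example above.

```python
def payload_mismatches(expected, payload):
    """Return {field: (expected, actual)} for every field that differs or is missing."""
    return {
        key: (want, payload.get(key))
        for key, want in expected.items()
        if payload.get(key) != want
    }

# What the reviewer expects, based on the request the user actually made.
expected = {"action": "close_ticket", "ticket_id": 1842}
# What the model sent to the tool (hypothetical payload shape).
payload = {"action": "edit_ticket", "ticket_id": 1482, "note": "resolved"}

print(payload_mismatches(expected, payload))
# {'action': ('close_ticket', 'edit_ticket'), 'ticket_id': (1842, 1482)}
```

A transposed digit like this is easy to miss by eye in a long log, which is why a field-by-field check helps.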

Then check approvals. Money movement, permission changes, customer data edits, and contract actions should never happen because the model felt confident. The system should stop and ask for clear approval from the right person. If the AI can issue a refund, change billing, reset MFA, or invite an external user without that step, fix it before launch.

Prompt rules often look solid until the model gets rushed. Put it under pressure with vague requests, angry users, partial data, and conflicting instructions. You will often find one line in the prompt that the model quietly drops, such as "ask for manager approval above $500" or "never change a customer address from chat alone."

A review pass should compare the assistant message with the actual tool payload and result, mark every action that touched money, access, or customer records, check whether a human approved the action before the call, and flag any reason the model invented to justify its choice.

Those invented reasons deserve extra attention. The model might say "the customer confirmed their identity earlier" when no such message exists, or claim "finance approved this" because that would make the next step easier. These errors sound believable, which makes them dangerous.

Good testing treats every confident claim like something that needs proof. Logs, timestamps, and tool outputs settle the argument fast.

Mistakes teams make in early tests

Early tests often look better than the real launch because teams feed the AI neat, polite requests. Real work is rarely that clean. A customer mixes two problems in one message, a sales lead asks for a discount and custom terms at the same time, or an ops request arrives with missing details and a deadline.

Another common mistake is judging the run by the final answer alone. Teams read the response, see a useful reply, and move on. Meanwhile, the action log shows the model called the wrong tool, pulled the wrong record, or tried the same failed action twice.

A polished answer can hide a bad decision. Imagine a support workflow that replies with the correct refund policy while the log shows it also tried to update billing data without enough proof. The customer never sees that attempt, but your audit trail does.

Approval rules also get sloppy in early tests. Some teams let the AI decide that its own risky action is safe because the prompt says to be careful. That is not an approval process. If money moves, account access changes, or customer data gets edited, a person or a separate control needs to approve it.

Prompt changes create another trap. A team tweaks one line, sees better behavior in two or three tests, and stops. Small prompt edits can shift tool use in strange ways. The model may become more cautious in support cases and more reckless in sales or ops.

A few habits prevent most of this. Test messy requests, not just clean ones. Review the action log, not just the reply. Keep approvals outside the model. Rerun old failure cases after every prompt change.

Teams that skip these checks usually get the same surprise later: the AI sounds calm and competent while doing the wrong thing underneath. Catching that before launch saves money, cleanup time, and a few painful customer calls.

Quick checks before launch

Before launch, run one last dry run on tasks that can send messages, change records, issue refunds, update tickets, or trigger alerts. These are the actions that hurt fastest when something goes wrong.

Start with approval. If the workflow can affect a customer, a contract, a payment, or production data, name the human who must approve it. Do not leave this as "someone on the team." Put one role or one person on the step, and test that the workflow stops when approval does not arrive.
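The "stops when approval does not arrive" check can be tested with a gate that sits outside the model. This is a sketch, not a full approval system: the action names, roles, and the `approvals` mapping are all assumptions.

```python
class ApprovalRequired(Exception):
    """Raised when a gated action runs without sign-off from the named role."""

# Hypothetical mapping from gated action to the one role that must approve it.
APPROVERS = {"issue_refund": "support_manager", "change_billing": "finance_lead"}

def execute(action, payload, approvals, run_tool):
    """Run the tool only if the action is ungated or its named approver signed off.

    `approvals` maps action name -> role that approved it. A missing entry
    means approval never arrived.
    """
    required = APPROVERS.get(action)
    if required is not None and approvals.get(action) != required:
        raise ApprovalRequired(f"{action} needs approval from {required}")
    return run_tool(action, payload)

calls = []

def fake_tool(action, payload):
    """Stand-in for the real tool so the test can see whether it was reached."""
    calls.append(action)
    return "ok"

try:
    execute("issue_refund", {"order_id": "A-1001"}, approvals={}, run_tool=fake_tool)
except ApprovalRequired as err:
    print(err)  # issue_refund needs approval from support_manager

print(calls)  # [] because the tool was never reached
```

The point of the dry run is the empty `calls` list: no approval, no tool call, regardless of how confident the model sounded.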

Then check logs. Every tool call should leave a plain record that shows what the agent tried to do, which system it touched, what input it used, and what happened after. If a sales agent updates a CRM field or a support agent closes a ticket, your team should be able to find that action in seconds. Tool call auditing feels boring right up to the moment you need it.
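A minimal sketch of such a record, assuming tools are plain Python functions called with keyword arguments. The decorator, the tool name, and the in-memory list are illustrative; a real setup would write to durable storage.

```python
import functools
import json
import time

def audited(tool_name, system, log):
    """Decorator that appends one audit entry per call, whether it succeeds or fails."""
    def wrap(func):
        @functools.wraps(func)
        def inner(**payload):
            entry = {"ts": time.time(), "tool": tool_name, "system": system, "input": payload}
            try:
                entry["result"] = func(**payload)
                return entry["result"]
            except Exception as exc:
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed attempts are logged too.
                log.append(json.dumps(entry, default=str))
        return inner
    return wrap

audit_log = []  # stand-in for a durable log store

@audited("update_crm_stage", system="crm", log=audit_log)
def update_crm_stage(lead_id, stage):  # hypothetical tool
    return {"lead_id": lead_id, "stage": stage}

update_crm_stage(lead_id="L-204", stage="qualified")
print(audit_log[0])
```

Each entry answers the four questions above: what the agent tried, which system it touched, what input it used, and what happened.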

Missing data is another easy miss. Good agents do not guess when a customer name, account ID, budget, or shipping detail is unclear. They ask a follow-up question or stop and hand the task back to a person. Test this on purpose with messy inputs. Give the agent an incomplete support request or a sales note with two possible contacts and see if it asks instead of inventing.

You also need a fast stop button. If the workflow starts drifting, sends odd replies, or picks the wrong tool twice in a row, someone should be able to turn it off in minutes. Check who can pause it, where they do that, and how the team knows the workflow is off.
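The stop button can be sketched as a flag that every tool call checks before it runs. This version lives in one process and is only an illustration; a deployed workflow would more likely read the flag from a shared store. The `send_reply` tool and the `ops-on-call` name are hypothetical.

```python
import threading

class WorkflowPaused(Exception):
    """Raised when a tool call is attempted while the workflow is switched off."""

class KillSwitch:
    """In-process stop flag that also records who paused the workflow."""

    def __init__(self):
        self._flag = threading.Event()
        self.stopped_by = None

    def stop(self, who):
        self.stopped_by = who  # lets the team see who turned it off
        self._flag.set()

    @property
    def stopped(self):
        return self._flag.is_set()

def guarded_call(switch, tool, **payload):
    """Check the switch before every tool call, not just at the start of a run."""
    if switch.stopped:
        raise WorkflowPaused(f"workflow paused by {switch.stopped_by}")
    return tool(**payload)

def send_reply(ticket_id, body):  # hypothetical tool
    return f"sent reply on ticket {ticket_id}"

switch = KillSwitch()
print(guarded_call(switch, send_reply, ticket_id=1842, body="Thanks for your patience."))
switch.stop(who="ops-on-call")
try:
    guarded_call(switch, send_reply, ticket_id=1842, body="Second reply")
except WorkflowPaused as err:
    print(err)  # workflow paused by ops-on-call
```

Recording who pulled the switch covers the last question above: the team can tell the workflow is off and who to ask about it.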

Finally, give retesting to one owner. Every prompt edit, tool change, policy update, or model swap can reopen old failures. One person should rerun the checks after each change and mark what passed, what failed, and what still needs work. If nobody owns that step, the same bug usually comes back under a different name.

What to do after the first round

Put every issue into a short backlog while the test is still fresh. Memory gets fuzzy fast, and small failures often point to bigger gaps. A sales assistant that picked the wrong discount tool once may do it again every time the same wording appears.

Sort each issue by three things: how much it can hurt a customer, how much money or legal risk it creates, and how often the same failure showed up. This keeps the team from wasting days on strange edge cases while a common approval miss stays open.
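That sort can be made concrete with a simple score. The 1 to 3 scales and the formula below are assumptions chosen for illustration, not a standard; the point is that a frequent approval miss should outrank a rare cosmetic glitch.

```python
issues = [
    {"name": "odd date format in reply", "customer_harm": 1, "money_legal_risk": 1, "occurrences": 1},
    {"name": "refund sent without approval", "customer_harm": 3, "money_legal_risk": 3, "occurrences": 4},
    {"name": "lead moved to wrong CRM stage", "customer_harm": 2, "money_legal_risk": 1, "occurrences": 2},
]

def priority(issue):
    """Assumed scoring: severity (harm plus risk) weighted by how often it showed up."""
    return (issue["customer_harm"] + issue["money_legal_risk"]) * issue["occurrences"]

backlog = sorted(issues, key=priority, reverse=True)
for issue in backlog:
    print(priority(issue), issue["name"])
# 24 refund sent without approval
# 6 lead moved to wrong CRM stage
# 2 odd date format in reply
```

Any scoring that combines the three factors will do, as long as the team agrees on it before arguing about individual issues.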

Then fix the guardrail, not just the example that failed. If one prompt let the model skip manager approval, rewriting that prompt is rarely enough. Change the workflow rule, the tool permissions, the approval step, or the prompt template so the same mistake cannot slip through in a different form.

Teams often patch the visible problem and move on. That feels fast, but it leads to repeat failures. If the model called the wrong tool in ops, ask why it had access, why the prompt allowed ambiguity, and why no check stopped it before action.

Run a short retest before release. It does not need to be a big event. Take the highest-risk cases, rerun them, and confirm that the fix still works under slightly different wording. Do the same after any major workflow change, model switch, tool update, or approval policy change. This kind of testing is ongoing work, not a one-time task.

Keep the notes simple. Record the task, what the AI did, what should have happened, and what changed after the fix. Over time, that gives you a small library of failure patterns and audit checks you can rerun after each change instead of rebuilding them from memory.

If you want an outside review, Oleg Sotnikov at oleg.is works with startups and small teams as a Fractional CTO and advisor. He can help review real AI workflows, tool permissions, and approval steps before launch, especially when the team that built the system is too close to it to spot the weak points.

Frequently Asked Questions

What is a red team exercise for an AI workflow?

A red team exercise puts an AI workflow under realistic pressure before launch. You give it real business tasks, messy inputs, normal tool access, and the same approval rules it will face in production, then you watch for bad tool calls, skipped approvals, and made-up facts.

Why do happy-path tests miss so many problems?

Happy-path tests use clean prompts and normal cases, so the model looks better than it really is. Real work brings mixed requests, missing details, conflicting data, and urgency, and that is where the AI often makes the wrong move while still sounding sure.

Which workflows should we test first?

Test flows where a small mistake hurts fast. Good first picks include sending customer messages, updating CRM or billing records, issuing refunds or credits, changing access, or touching production systems.

How many test cases do we need to start?

Keep the first round small enough that people actually read every run. Three to five real examples per workflow usually gives you enough signal without turning review into a rushed chore.

What should count as a failure?

A run fails when the AI breaks a business rule, not only when it crashes. If it uses the wrong tool, skips approval, edits the wrong record, guesses missing facts, or invents a reason for an action, treat that as a failure even if the final reply looks fine.

Should the AI ever approve its own actions?

No. The model should never decide that its own risky action is safe when money, access, customer data, contracts, or production changes sit on the line. Put a person or a separate control on that step and make the workflow stop until approval arrives.

How do we review tool calls the right way?

Pause after every tool call and check what the model chose, why it chose it, and what data it sent. Compare the assistant message with the actual payload so you can catch cases where the AI sounds correct but updates the wrong record, amount, or account.

What messy test cases should we include?

Use tasks that feel like a rough workday. Include vague wording, copied notes, missing fields, angry customers, conflicting numbers, and rushed requests like "handle this today" so you can see whether urgency pushes the AI into a bad action.

What should we do after a prompt or model change?

Rerun old failure cases after every prompt edit, tool change, policy update, or model swap. Small changes often fix one problem and create another, so you need the same cases to check whether the workflow got safer or just changed its behavior.

Is it worth getting an outside review before launch?

An outside review helps when your team built the workflow and knows it too well. A fresh reviewer can spot loose permissions, weak approval steps, and risky prompt gaps that an internal team may gloss over, especially right before launch.