Jul 19, 2024·8 min read

Dry-run mode for AI workflows before they touch records

Dry-run mode for AI workflows lets teams test real-shaped inputs, catch bad updates, and review edge cases before any live record changes.

Why live records cause trouble

Live data breaks faster than most teams expect. One bad run rarely creates one bad record. It can change hundreds in minutes, especially when an AI workflow sorts tickets, updates CRM fields, or assigns work on its own.

Small mistakes spread fast. If a prompt mishandles an edge case, the workflow can tag the wrong customers, create duplicates, move people into the wrong pipeline stage, or hand accounts to the wrong owner. At first the damage looks small. Then someone opens the dashboard and sees the same problem everywhere.

The cleanup is the part teams underestimate. A test run might take 10 minutes. Undoing it can eat half a day, and even longer if other automations react to the bad data. Sales teams start calling the wrong leads. Support agents answer from the wrong queue. Managers stop trusting the reports because nobody knows which changes were real.

A support team testing a triage flow on live tickets can run into this fast. The model reads urgent refund requests as low priority, adds the wrong tag, and assigns 240 tickets to a general queue. Now someone has to find every affected ticket, restore the owners, fix the tags, and check whether any later rules already fired. The cleanup takes much longer than the original test.

This gets riskier in smaller companies, where one workflow often touches several systems at once. A single run can update the CRM, open tasks, send notifications, and change reporting totals. That is why teams need a safe path before live writes. A dry run lets people test with inputs that look like production data, catch odd cases, and stop bad changes before they land in real records.

What a dry run should do

A dry run is a rehearsal. The workflow reads input that looks real, runs the same logic, and shows what it would do next. The only difference is that nothing actually changes.

The system can still classify a ticket, extract fields from a form, choose a branch, and prepare updates as if it were live. You still see the decisions. You can check whether the prompt, rules, and model output make sense before any real record changes.

The important part is blocking side effects. A proper dry run does not write to the database, send emails or chat messages, create tasks, or push updates into outside tools. If the workflow normally calls a CRM, billing system, or help desk, dry run mode should stop the final update and keep it as a preview.

That matters because many failures are subtle. The model picks the wrong category. It fills the wrong field. It tries to notify the wrong person. In live mode, one mistake can spread fast. In dry run mode, the same mistake stays visible and harmless.

A good test still needs evidence. Teams need to see a full action log, a clear diff of the planned changes, the raw input, the model response, and the exact step where something failed. Without that detail, the run turns into guesswork. You might know the workflow failed, but not why. Worse, you might think it passed when it quietly skipped an edge case.
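The diff is often the smallest piece of that evidence to build. Here is a minimal sketch in Python, assuming the current record and the planned update are both plain dictionaries. The function name field_diff and the sample ticket are made up for illustration and do not come from any specific help desk or CRM API.

```python
def field_diff(current: dict, planned: dict) -> dict:
    """Map each field the planned update would change to (old value, new value)."""
    return {
        field: (current.get(field), new_value)
        for field, new_value in planned.items()
        if current.get(field) != new_value
    }


# Hypothetical ticket and the update the workflow wanted to make.
ticket = {"owner": "general", "priority": "low", "tags": ["refund"]}
planned = {"owner": "billing", "priority": "high", "tags": ["refund"]}

diff = field_diff(ticket, planned)
for field, (old, new) in diff.items():
    print(f"{field}: {old!r} -> {new!r}")
```

Unchanged fields stay out of the diff, so a reviewer only reads what would actually move.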

The best dry runs feel almost boring. They behave like production, use real-shaped test inputs, and block every write at the edge. That gives teams a clean way to test strange cases, fix the logic, and rerun the same input until the output looks right.

Where writes usually hide

Teams usually catch the obvious writes first. They block database inserts, skip updates, and turn off file uploads. That is a good start, but it is rarely enough.

A dry run fails when even one side effect slips through. The workflow may leave your main table alone and still create noise somewhere else. Database changes are only part of the picture. Writes can also show up in file storage, cache entries that trigger later jobs, search indexes, analytics tools, and audit logs that feed other systems.

The hidden writes cause the most trouble because they sit outside the main code path. A model finishes its task, then a helper service fires a webhook. A retry worker sees a temporary error and sends the request again. A queue job wakes up a minute later and creates a record even though the original run was only a test.

This happens all the time with messaging and customer systems. A dry run can still send an email, post to Slack, open a support ticket, create a CRM contact, add a note to a sales deal, or push an event into analytics. None of that looks like a database write when you glance at the workflow diagram, but it still changes real systems.

One missed side effect can ruin the whole test. Imagine a support triage flow that correctly blocks ticket updates in your app but still sends a webhook to the help desk. Now agents see fake cases, customers get replies they should never receive, and the team stops trusting the test environment.

Retries make this worse. If your code blocks the first write but leaves the retry path active, the system may attempt the same action again through another worker or service account. What looked like a clean test turns into a confusing bug hunt.

When you map writes, trace the full path after every model decision. Check direct writes, delayed jobs, outbound calls, notifications, and any service that reacts to events. If a run can change another system tomorrow, treat it as a write today.

How to build it

Start at the entry point, not in the middle. If a workflow can begin from an API call, a queue, or a scheduled job, each path should set the same dry_run flag before anything else happens. That one choice keeps the whole run honest.

Dry run mode usually breaks when the flag stops at the first layer. The model sees test data, but a tool call, helper script, or retry job still writes to the database. One missed handoff is enough to create real records.
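One way to keep the flag from stopping at the first layer is to create it once and make it read-only. The sketch below assumes three hypothetical entry points (an API handler, a queue consumer, and a scheduled job) and a made-up RunContext object. Defaulting to dry run when the flag is missing is a design assumption here, not something every system needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Carries the mode for one run. Frozen, so later steps cannot flip it."""
    run_id: str
    dry_run: bool


def start_run(run_id: str, dry_run: bool = True) -> RunContext:
    # Assumption: a missing flag means dry run, so a forgotten argument
    # cannot turn a test into a live write.
    return RunContext(run_id=run_id, dry_run=dry_run)


def handle_api_request(body: dict) -> RunContext:
    return start_run(body["run_id"], dry_run=body.get("dry_run", True))


def handle_queue_message(message: dict) -> RunContext:
    return start_run(message["run_id"], dry_run=message.get("dry_run", True))


def handle_schedule(run_id: str, dry_run: bool = True) -> RunContext:
    return start_run(run_id, dry_run=dry_run)
```

Every path goes through start_run, so there is one place to check when a run behaves differently from what the caller asked for.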

A practical build usually follows five steps:

  1. Add one dry run flag where the workflow starts.
  2. Pass that flag into prompts, tool calls, and async jobs.
  3. Route every write through one adapter layer.
  4. Swap real write methods for dummy or preview versions when the flag is on.
  5. Save one report that shows what the workflow tried to do.

That adapter layer matters more than people expect. If your code writes directly to a database in five different places, dry run mode becomes guesswork. Put record creation, updates, deletes, message sends, and webhook posts behind clear functions, then switch those functions based on the flag.
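As a rough sketch of that adapter layer, the Python below puts two kinds of side effects behind one interface and swaps the implementation based on the flag. The names RecordWriter, PreviewWriter, LiveWriter, and make_writer are hypothetical, and LiveWriter is left as a stub because the real client depends on your CRM or help desk.

```python
from typing import Protocol


class RecordWriter(Protocol):
    def update_record(self, record_id: str, fields: dict) -> None: ...
    def send_message(self, recipient: str, body: str) -> None: ...


class PreviewWriter:
    """Records what the workflow tried to do instead of doing it."""

    def __init__(self) -> None:
        self.planned = []

    def update_record(self, record_id: str, fields: dict) -> None:
        self.planned.append(
            {"action": "update_record", "record_id": record_id, "fields": fields}
        )

    def send_message(self, recipient: str, body: str) -> None:
        self.planned.append(
            {"action": "send_message", "recipient": recipient, "body": body}
        )


class LiveWriter:
    """Stub for the real integration. Wire your own client in here."""

    def update_record(self, record_id: str, fields: dict) -> None:
        raise NotImplementedError("connect the real record store here")

    def send_message(self, recipient: str, body: str) -> None:
        raise NotImplementedError("connect the real messaging client here")


def make_writer(dry_run: bool) -> RecordWriter:
    # The only place in the codebase that decides which writer is used.
    return PreviewWriter() if dry_run else LiveWriter()
```

Workflow code then calls writer.update_record(...) without knowing which mode it is in, and the planned list on the preview writer becomes the raw material for the report.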

Make the report useful

A dry run should leave evidence, not side effects. Good reports show planned actions, field diffs, validation failures, tool outputs, and any error that would have stopped a live run.

Keep the report easy to scan. A reviewer should understand in a minute or two that the workflow would have updated Customer A, skipped Customer B, and failed on Customer C because a required field was missing. If the report is messy, people stop reading it closely.
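A scannable report can be as plain as one line per record. This is a small sketch, assuming each result is a dictionary with a status of would_update, skipped, or failed. Those status names and the result shape are invented for the example.

```python
def summarize(results: list) -> list:
    """Turn per-record dry-run results into one readable line each."""
    lines = []
    for result in results:
        record = result["record"]
        if result["status"] == "would_update":
            changed = ", ".join(sorted(result["diff"]))
            lines.append(f"{record}: would update {changed}")
        elif result["status"] == "skipped":
            lines.append(f"{record}: skipped ({result['reason']})")
        else:
            lines.append(f"{record}: failed ({result['reason']})")
    return lines


# Hypothetical results from one dry run.
results = [
    {"record": "Customer A", "status": "would_update",
     "diff": {"priority": ("low", "high"), "owner": ("general", "billing")}},
    {"record": "Customer B", "status": "skipped", "reason": "no changes needed"},
    {"record": "Customer C", "status": "failed",
     "reason": "required field order_id is missing"},
]

for line in summarize(results):
    print(line)
```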

Background jobs need the same care. If the first step is safe but it schedules a later task without the flag, that second task can still write data hours later. Pass the flag through job payloads, event messages, and retries every time.
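For background jobs, the simplest pattern is to copy the flag from the parent payload into every child payload. The sketch below uses a plain Python list as a stand-in for a real queue. The functions enqueue_followup and run_job are hypothetical, and treating a missing flag as dry run is an assumption chosen to fail safe.

```python
def enqueue_followup(queue: list, parent: dict, task: str) -> dict:
    """Schedule a later task and carry the parent's run_id and dry_run flag."""
    payload = {
        "task": task,
        "run_id": parent["run_id"],
        "dry_run": parent["dry_run"],
    }
    queue.append(payload)
    return payload


def run_job(payload: dict, writes: list, previews: list) -> None:
    # Assumption: a payload without the flag is treated as a dry run.
    if payload.get("dry_run", True):
        previews.append(payload["task"])
    else:
        writes.append(payload["task"])


queue, writes, previews = [], [], []
enqueue_followup(queue, {"run_id": "r1", "dry_run": True}, "create_crm_note")
for job in queue:
    run_job(job, writes, previews)
```

The same copy step applies to retries: a retried payload should be the original payload, flag included, not a rebuilt one.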

Run a small batch before you scale the test. Ten or twenty realistic cases usually reveal enough: odd names, empty fields, duplicate records, and conflicting states. Then review the output with the people who know the workflow best.

That is usually where the hidden bugs show up. Someone notices that a note would be posted twice, a status would change too early, or a low-confidence answer would still trigger a handoff. Fix those first. Then test a larger batch.

Pick test inputs that look real

A dry run only tells the truth if the input looks like the data your team actually sees. If your workflow expects customer_email, ticket_body, and order_id, your test records should use those same field names and the same structure. Placeholder fields create weak tests because they skip the small mismatches that break real runs.

Clean the private parts, but keep the mess. Replace names, emails, phone numbers, and account numbers with safe values. Keep the odd spacing, broken capitalization, empty fields, pasted signatures, and mixed date formats. That is where AI workflow testing gets useful.
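Here is one hedged way to do that in Python: replace values in known private fields and swap email addresses inside free text, while leaving spacing, capitalization, dates, and empty fields alone. The field list in PII_FIELDS and the replacement values are assumptions for this example. Names typed into free text are not detected, so a person still needs to review the bodies.

```python
import re

# Simple email pattern for sanitizing test data, not a full address validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical private fields and the safe values that replace them.
PII_FIELDS = {
    "customer_name": "Test Customer",
    "customer_email": "user@example.com",
    "phone": "000-000-0000",
}


def sanitize(record: dict) -> dict:
    """Return a copy with private values replaced and the mess left intact."""
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS and value:
            clean[key] = PII_FIELDS[key]
        elif isinstance(value, str):
            # Free text: only email addresses are swapped out.
            clean[key] = EMAIL.sub("user@example.com", value)
        else:
            clean[key] = value
    return clean


record = {
    "customer_name": "jane DOE",
    "customer_email": "JANE.DOE@Acme.io",
    "ticket_body": "  REFUND pls!!   sent 3/4/24 and 2024-03-04\n-- \nJane <jane@acme.io>",
    "order_id": "",
}
clean = sanitize(record)
```

The odd spacing, the two date formats, the pasted signature, and the empty order_id all survive, which is the point.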

A short, tidy sample set is rarely enough. Real systems get long messages, half-filled forms, duplicate notes, and records that disagree with each other. One field says "refund approved" while another says "do not refund." That kind of conflict is exactly what you want to catch before the workflow can write anything back.

A good starter set usually includes normal records, records with missing fields, very long text, conflicting values across fields, and ugly formatting copied from emails, chats, or old systems.

Keep these inputs in a small saved library. When one case breaks the workflow, do not just fix the prompt and move on. Save that record, label what failed, and run it again after every change. Over time, that becomes your safety set.
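A saved library can start as a list of labeled cases and a loop. The sketch below assumes each case stores an input record and the expected decision. The toy_classify function is a deliberately naive stand-in for the real workflow step, so the example has something to fail.

```python
def run_safety_set(cases: list, classify) -> list:
    """Run every saved case and describe each one that no longer matches."""
    failures = []
    for case in cases:
        actual = classify(case["input"])
        if actual != case["expected"]:
            failures.append(
                f"{case['label']}: expected {case['expected']!r}, got {actual!r}"
            )
    return failures


def toy_classify(record: dict) -> str:
    # Stand-in for the real model call. It only looks at the ticket body.
    body = record.get("ticket_body") or ""
    return "billing" if "refund" in body.lower() else "general"


cases = [
    {"label": "missing body", "input": {"ticket_body": None}, "expected": "general"},
    {"label": "shouting refund", "input": {"ticket_body": "REFUND NOW"}, "expected": "billing"},
    {"label": "conflicting fields",
     "input": {"ticket_body": "refund approved", "status": "do not refund"},
     "expected": "needs_review"},
]

failures = run_safety_set(cases, toy_classify)
```

Here the conflicting-fields case fails because the stand-in ignores the status field, which is exactly the kind of miss the saved set exists to keep catching.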

One practical rule helps a lot: test for structure first, then content. Structure means the record has the same shape as production data. Content means it includes the strange cases people create by accident every day. You need both.

If you work with startup teams or internal operations, ask for five to ten recent records and sanitize them. That usually gives better coverage than inventing fifty fake ones. Real inputs carry the hidden quirks: old field names, optional values that are not really optional, and text that looks harmless until the model reads it the wrong way.

If the test set feels slightly ugly, it is probably closer to the truth.

A support triage example

A good test case is one messy support email, not a clean ticket. Imagine a customer writes, "I upgraded this morning, your system charged my card twice, I still cannot sign in to the new account, and the export button crashes when I try to download last month's data."

That single message mixes three issues that often land in different queues. It has an account problem, a billing problem, and a product bug. This is where a dry run pays off. The team can feed realistic input into the triage flow and inspect every decision while writes stay blocked.

Example output

The AI reads the message, classifies it, and drafts the next action. In a dry run, the system can still show the result it would send to your help desk or CRM:

  • Tags: billing, account access, export bug
  • Owner: product support
  • Priority: medium
  • Draft reply: "Thanks for reporting this. Please try signing out and back in. We are checking the export issue."

Nothing gets created. No ticket, no account note, no status change, and no customer reply.

What the test catches

That output looks fine for about two seconds, and then the problem becomes obvious. The owner is wrong. A duplicate charge needs a billing person fast, and the priority should be higher because the customer paid and still cannot log in.

Because writes are blocked, the team can fix the logic before it touches the CRM. They might add a rule that says any message with both "charged twice" and "cannot sign in" goes to billing first, with an internal note for product support about the export bug.
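A rule like that might look like the sketch below. It is only an illustration: the patterns, the route function, and the shape of the model decision are assumptions. One detail worth noticing is that the customer wrote "charged my card twice," so a literal match on "charged twice" would miss this very email, and the pattern allows words in between.

```python
import re

# "charged ... twice" within one sentence, and either spelling of "cannot".
DOUBLE_CHARGE = re.compile(r"\bcharged\b[^.]*\btwice\b")
LOCKED_OUT = re.compile(r"\b(?:cannot|can't) sign in\b")


def route(message: str, model_decision: dict) -> dict:
    """Override the model's owner and priority when both signals appear."""
    text = message.lower()
    if DOUBLE_CHARGE.search(text) and LOCKED_OUT.search(text):
        return {
            **model_decision,
            "owner": "billing operations",
            "priority": "high",
            "internal_note": "Copy product support on any product issue in this message.",
        }
    return model_decision


message = (
    "I upgraded this morning, your system charged my card twice, "
    "I still cannot sign in to the new account, and the export button "
    "crashes when I try to download last month's data."
)
model_decision = {
    "tags": ["billing", "account access", "export bug"],
    "owner": "product support",
    "priority": "medium",
}
decision = route(message, model_decision)
```

Because the rule sits after the model output, the dry-run report can show both the model's original choice and the override side by side.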

After that change, the next dry run should look different. The tags can stay the same, but the owner changes to billing operations, the priority moves to high, and the draft reply confirms the payment issue, asks for the invoice number, and tells the customer that the team already recorded the export problem for review.

This example is small, but it shows the mistake that hurts most in live support. The AI sounds helpful while sending the case to the wrong place. Good testing should catch that quiet error early, before a customer waits six hours and your team has to untangle the record trail.

Mistakes that ruin the test

A dry run fails when it looks safe but still changes something. Teams often block writes in the main app and forget the side paths around it. A worker, retry job, webhook, or scheduled task can still update records long after the first request ends.

That mistake is common because writes rarely live in one place. The model returns a label, but another job stores it, sends an email, updates a CRM field, or opens a ticket. If even one of those paths stays live, the test leaves a mess behind.

Another common failure is fake data that is too clean. Real users paste long messages, broken phone numbers, strange date formats, all-caps text, copied signatures, and half-finished forms. Tidy samples hide the exact formatting problems that break prompts, parsers, and routing rules.

Logs matter just as much. Without them, nobody can compare the expected result with the actual one. You want a clear record of the input, the prompt version, the model output, the blocked action, and the final decision the system would have made. Otherwise, people argue from memory, and memory is bad at edge cases.

One smooth run proves almost nothing. A workflow can pass once and still fail on retries, duplicate events, empty fields, or long inputs. Run the same cases more than once. Then run the ugly cases on purpose.

Third-party services are another trap. You may block writes to your own database and still call live systems by accident. That can send customer emails, create support tickets, burn API budget, or trigger follow-on automations outside your control.

A simple rule helps: if a test can change data, notify a person, or charge money, block it first and log what would have happened instead.

Checks before live mode

Put the mode switch in one place. One flag should decide everything: live writes on, or dry run only. If one job reads DRY_RUN and another checks a different setting, the test stops being trustworthy.

That single switch needs to reach every action that can change something outside the workflow. Database inserts are the obvious part, but they are rarely the only part. Email sends, ticket updates, webhook calls, CRM changes, file uploads, and background jobs all need to read the same mode before they act.

The rule should stay simple. In dry run mode, the system can read, classify, score, draft, and log, but it cannot create, update, send, or delete. Teams usually miss one side path, like a retry worker or a fallback script. That is often where the mess starts.
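That rule is small enough to write down as code. The sketch below is one possible shape, assuming every action in the workflow is tagged with one of the verbs from the rule. Raising an error on an unknown verb is an added assumption, meant to force new actions to be classified before they can run.

```python
READ_ONLY = frozenset({"read", "classify", "score", "draft", "log"})
WRITES = frozenset({"create", "update", "send", "delete"})


def is_allowed(action: str, dry_run: bool) -> bool:
    """Return True if the action may run in the current mode."""
    if action in READ_ONLY:
        return True
    if action in WRITES:
        return not dry_run
    # Assumption: unknown verbs must be classified before anything runs them.
    raise ValueError(f"unclassified action: {action!r}")
```

A retry worker or fallback script that calls is_allowed before acting gets the same answer as the main path, which is what closes the side paths.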

Before you turn on writes, review one full run from start to finish. Look at the prompt, the model output, the tool calls, the diff between input and planned action, and the error log. You want to see what the AI tried to do, what your code allowed, and where those two things did not match.

Use fresh edge cases, not the easy samples you started with. Rerun the strange ones from the last few weeks: missing fields, duplicate records, angry customer messages, partial uploads, and conflicting instructions. If the workflow handles those cleanly in dry run, you have a better reason to trust it live.

Keep the approval step boring and clear. One person should own the final yes. In a small company, that may be the CTO, founder, or the engineer who understands the full workflow end to end. The point is not status. The point is that one person checks the evidence and accepts the risk.

A simple final check works well: did one flag control the run, did every external action obey it, did the logs make sense, did recent edge cases pass, and did a named owner approve live mode? If any answer is no, wait a day and fix it.
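If it helps to make that final check explicit, it can live in a few lines of Python. The check names below are invented labels for the five questions, and anything other than an explicit True counts as a no.

```python
CHECKS = (
    "one_flag_controlled_run",
    "external_actions_obeyed_flag",
    "logs_made_sense",
    "recent_edge_cases_passed",
    "named_owner_approved",
)


def ready_for_live(answers: dict) -> tuple:
    """Return (ready, failed_checks). A missing answer counts as a failure."""
    failed = [name for name in CHECKS if answers.get(name) is not True]
    return (not failed, failed)


answers = {name: True for name in CHECKS}
answers["named_owner_approved"] = False
ready, failed = ready_for_live(answers)
```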

Start with one workflow and grow

Pick one workflow where a bad write hurts, but the scope is still small enough to watch closely. That gives your team a clean place to learn. If you try to wrap the whole stack in dry run mode at once, people miss hidden writes, skip edge cases, and stop trusting the test.

A good first target usually has clear inputs, a visible result, and a real chance of doing damage if it goes wrong. Support routing, refund approval, lead scoring, and ticket tagging all fit. Payroll changes and contract updates are riskier, and they usually need tighter controls before they are safe to test this way.

Before you add retries, background jobs, or extra automation around that workflow, add dry run mode first. You want one plain path that takes production-shaped input, runs the same logic, and stops before any record changes. If the base path is fuzzy, more automation only hides the problem.

A useful first choice usually has three traits: people already review the output by hand, the workflow touches important records, and the team can test it many times in a week. That keeps feedback fast and makes mistakes easier to spot.

When the same failure shows up twice, save it as a permanent test case. Do not treat edge cases as one-off annoyances. Save the exact input shape, the expected behavior, and the reason it failed. After a month, your dry run suite stops being a demo tool and starts acting like memory for the system.

Then grow in order. First prove the workflow can run safely without writes. After that, add logs and alerts. Only then add the surrounding automation. It is a boring rollout order, and that is exactly why it works.

If your team needs a careful rollout plan, Oleg Sotnikov at oleg.is can review write boundaries, failure paths, and release order as part of his Fractional CTO advisory work. That kind of outside check is often useful when several systems touch the same records.

Frequently Asked Questions

What is a dry run in an AI workflow?

A dry run lets your workflow read real-shaped input, run the same logic, and show what it would do without changing any real records. You can inspect tags, owners, field updates, and drafted replies before anything lands in your CRM, help desk, or database.

Why not test on live records with a tiny batch?

Small live tests still create cleanup work when the model makes one wrong decision and other automations react to it. Even ten bad updates can spread into wrong tags, wrong owners, noisy alerts, and broken reports.

What should a dry run block?

Block every side effect, not just database writes. Stop emails, chat messages, webhooks, task creation, CRM updates, file uploads, retries, and any background job that can change another system later.

Where do hidden writes usually sneak in?

Hidden writes usually slip through helper services, retry workers, queue jobs, fallback scripts, and third-party integrations. If one step can send data out or wake up another system, treat it like a write and route it through your dry run check.

What should I log during a dry run?

Log the raw input, prompt version, model output, tool calls, blocked actions, field diffs, and any validation or runtime error. That gives your team enough detail to see what the workflow tried to do and where the logic went off track.

How do I add dry run mode without rewriting the whole system?

Start at the workflow entry point and set one dry_run flag there. Then pass that flag into every tool call and job, and route all writes through one adapter layer so you can swap real actions for preview actions.

What kind of test data should I use?

Use sanitized records that keep the same field names, structure, and messy content your team sees in production. Keep odd spacing, long text, missing values, duplicate notes, and conflicting fields, because those cases usually break prompts and routing rules.

How many records should I test first?

Begin with ten to twenty realistic records. That is usually enough to catch bad assumptions. Then you can fix the logic and run a larger batch with more ugly cases.

How do I know a workflow is ready for live mode?

Check one full run from start to finish and make sure one flag controlled every external action. If the logs make sense, recent edge cases pass, and one named owner approves the release, you have a decent reason to turn on writes.

Which workflow should I start with?

Pick one workflow that touches real records and already gets human review, like support routing, refund approval, lead scoring, or ticket tagging. Start there, prove that dry run mode catches mistakes, and only then add more automation around it.