Mar 13, 2026 · 7 min read

Refactoring without tests using a safer assistant workflow

Refactoring without tests gets risky fast. Learn how traffic samples, snapshots, and small scopes help you change code with fewer surprises.

Why this gets risky fast

Old code rarely fails in obvious places. It breaks in the odd branch added for one customer, one file format, or one timeout that only appears under load. The code may look clumsy, but that awkward shape often hides rules nobody wrote down anywhere else.

Legacy systems also pick up habits over time. A function trims spaces because an old import sent broken values. A status field accepts two spellings because another service never got fixed. Someone reading the code today sees mess. Users and other systems may depend on that mess.

That is why refactoring without tests gets risky so quickly. You are not just cleaning up names and formatting. You are touching behavior tied to tiny details like field order, empty values, retry timing, or when the code returns early.

An AI assistant makes this faster, which helps and hurts at the same time. It can split a long method, rename vague variables, and remove duplicate blocks in minutes. It can also "improve" a condition, merge two cases that looked equal, or move a side effect just enough to change the result. The diff looks cleaner. The behavior still shifts.

Old code often hides rules in one-off branches for past customers, quiet fallbacks for malformed input, side effects inside logging or retries, and odd output quirks that another system now expects.

Missing tests leave you with no baseline. If an endpoint returns strange JSON, is that a bug or an unofficial contract? If a batch job skips bad rows without raising an error, is that wrong or required? Without checks, every cleanup pass turns into guesswork.

Large refactors make this worse because they mix cleanup with logic changes. If you rename variables, extract helpers, reorder conditions, and swap parsing code in one pass, you lose the trail. When something breaks, nobody knows which edit caused it.

On production systems where uptime matters, silent behavior drift costs more than ugly code. Messy code is annoying. Hidden rule changes are expensive.

Start with one narrow slice

Scope is your first safety check. Pick one thing people can point to and verify: a single API endpoint, one background job, or one screen with one clear task. Small beats tidy.

A good slice has a clear start and finish. "This endpoint takes a customer ID and returns the current invoice summary" is small enough. "Clean up the billing module" is how breakage spreads into places you never meant to touch.

A narrow slice also gives your assistant firm limits. Tell it to edit only the files needed for this endpoint, keep the same response shape, and ignore unrelated cleanup. That usually means fewer random renames, fewer helper rewrites, and fewer surprise side effects.

Good first slices include one GET or POST endpoint, one scheduled import or export job, one page with one form submission, or one report generator with fixed output.

Once you pick the slice, freeze its input and output before you change code. Save a few real requests and responses, or write down the exact screen state before and after one action. Keep one normal case and one awkward case. Legacy code usually breaks on weird input, not the happy path.

Then write a short note on what must stay the same. Keep it plain: status codes and error messages, field names and null behavior, rounding and sorting, date formatting, and side effects such as emails, logs, or created records.

That note keeps the work honest. It also gives the assistant a contract to follow.

Leave the rest alone for now. If you touch a shared helper and then start fixing every caller, the slice is gone. Resist the urge to rename half the folder, move files around, or clean up everything nearby. Code outside the slice can stay messy for one more day.

That restraint feels slow, but it usually saves time. One safe edit that preserves behavior is better than a broad cleanup that sends you digging through production logs at 11 p.m.

Capture real behavior before edits

Before you clean anything up, collect evidence. You need a small record of what the code does today, even if some of that behavior looks wrong.

Start with real traffic samples, not made-up examples. Production requests, support tickets, old logs, and saved payloads usually reveal the edge cases hand-written test data misses. Pick a small set and keep it readable.

Use three kinds of input every time: normal input, empty input, and messy input. Normal input shows the common path. Empty input shows what happens when fields are blank or missing. Messy input catches the stuff that breaks refactors: extra spaces, wrong casing, partial records, duplicate fields, odd date formats, and unexpected nulls.

For each sample, record the input, the returned value or response body, the status code or error type, and any side effects. That last part matters more than people expect. A function may return the same JSON after your edit but stop writing a log row or skip a billing event. If you do not capture side effects, you only protect half the behavior.

Store a snapshot before you touch the code. Keep the samples and current outputs in a simple folder, a fixture file, or a lightweight snapshot setup. Do not wait for a full test suite. A cheap snapshot is often enough to tell you that the assistant changed something you did not mean to change.

Say you are cleaning up a customer import function. Save five real CSV rows: one clean, one with an empty email, one with two phone numbers, one with broken spacing, and one with a duplicate customer ID. Then record what the system does now. Does it return 200 or 422? Does it trim names? Does it create one customer, two, or none? Does it write an error row for review?

That record becomes your safety net. You are not proving the code is correct. You are freezing current behavior so you can change structure without guessing. Once the refactor is stable, you can decide which old behavior to keep and which bugs to remove on purpose.
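The record described above can be sketched in a few lines of Python. This is a minimal illustration, not a finished tool: `legacy_import_row`, the `Sample` fields, and the event names are all hypothetical stand-ins for whatever your import code actually does.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Sample:
    """One frozen observation of current behavior."""
    name: str
    input: dict
    output: Optional[dict] = None
    error_type: Optional[str] = None
    side_effects: list = field(default_factory=list)


def legacy_import_row(row, events):
    """Hypothetical stand-in for the legacy import code under refactor."""
    email = row.get("email", "").strip()
    if not email:
        events.append("error_row_written")
        return {"status": 422, "created": 0}
    events.append("customer_created")
    return {"status": 200, "created": 1, "name": row["name"].strip()}


def capture(name, row):
    """Run the legacy code once and record everything observable."""
    events = []
    sample = Sample(name=name, input=row)
    try:
        sample.output = legacy_import_row(row, events)
    except Exception as exc:  # record the failure type instead of hiding it
        sample.error_type = type(exc).__name__
    sample.side_effects = events
    return sample


samples = [
    capture("clean", {"name": "Ada", "email": "ada@example.com"}),
    capture("empty_email", {"name": "Bob", "email": ""}),
    capture("broken_spacing", {"name": "  Ada  ", "email": " ada@example.com "}),
    capture("missing_name", {"email": "x@example.com"}),
]

# Sorted keys and fixed indentation keep the saved file stable and diffable.
snapshot_text = json.dumps([asdict(s) for s in samples], indent=2, sort_keys=True)
```

The `missing_name` sample is the interesting one: the stand-in records a side effect and then raises, so the snapshot holds both an error type and an event. That is the kind of detail a return-value-only check would miss.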

Build cheap checks first

Cheap checks beat perfect tests when you need to change old code. If the code has no safety net, build a small one from the behavior you already captured.

Start with snapshots around the parts users actually touch. Save the current response from one endpoint, one report, or one background job output. You do not need broad coverage yet. You need a fast way to notice that something changed before a bug reaches production.

Raw snapshots create noise. IDs change. Timestamps move every run. Random ordering can make identical results look different. Parse the output first, then compare only the fields that matter.

Suppose an old billing endpoint returns JSON with invoice_id, generated_at, customer totals, tax, and line items. Your check should keep the totals, tax rules, and item shape. It should ignore invoice_id and generated_at if those always change. That gives you a stable snapshot instead of a false-alarm machine.

Stable comparisons matter more than fancy tooling. A plain script that normalizes output and compares it to a saved file is often enough.
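A plain script of that kind might look like the sketch below. The field names come from the billing example above; the choice to treat `line_items` as unordered is an assumption made for illustration, so keep the original order if sorting is part of your contract.

```python
import json

NOISY_FIELDS = {"invoice_id", "generated_at"}  # assumed to change on every run
UNORDERED_LISTS = {"line_items"}               # assumption: item order is not part of the contract


def normalize(value, key=None):
    """Drop noisy fields and sort lists whose order is known to be random."""
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items() if k not in NOISY_FIELDS}
    if isinstance(value, list):
        items = [normalize(v) for v in value]
        if key in UNORDERED_LISTS:
            items.sort(key=lambda v: json.dumps(v, sort_keys=True))
        return items
    return value


def matches_snapshot(raw_response, saved_snapshot):
    """Compare two JSON strings after normalization."""
    return normalize(json.loads(raw_response)) == normalize(json.loads(saved_snapshot))


saved = json.dumps({
    "invoice_id": "inv_1", "generated_at": "2026-01-01T00:00:00Z",
    "total": "118.00", "tax": "18.00",
    "line_items": [{"sku": "A", "qty": 1}, {"sku": "B", "qty": 2}],
})
same_behavior = json.dumps({
    "invoice_id": "inv_2", "generated_at": "2026-02-02T09:30:00Z",
    "total": "118.00", "tax": "18.00",
    "line_items": [{"sku": "B", "qty": 2}, {"sku": "A", "qty": 1}],
})
drifted = json.dumps({
    "invoice_id": "inv_3", "generated_at": "2026-02-02T09:31:00Z",
    "total": "118.01", "tax": "18.01",
    "line_items": [{"sku": "A", "qty": 1}, {"sku": "B", "qty": 2}],
})

print(matches_snapshot(same_behavior, saved))  # only noisy fields and item order differ
print(matches_snapshot(drifted, saved))        # tax and total moved by a cent
```

One caveat: Python dict comparison ignores key order. If another system depends on field order in the raw JSON, compare the raw text for that sample as well.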

Slow dependencies can wreck this workflow. If every run waits on an external API, a queue, or a large database call, you will stop using the checks after the second or third edit. Stub those parts early. Return a fixed response for the payment provider. Use a tiny fixture instead of the full production dataset. Keep the run short enough that you use it without thinking.
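Stubbing can be as small as one class. The sketch below assumes the code under refactor receives its payment provider as a parameter; `FixedPaymentProvider`, `settle_invoice`, and the response fields are hypothetical names, not a real provider API.

```python
class FixedPaymentProvider:
    """Hypothetical stub: same response every time, no network call."""

    def __init__(self):
        self.calls = []

    def charge(self, customer_id, amount_cents):
        # Record the call so the check can also assert on the side effect.
        self.calls.append((customer_id, amount_cents))
        return {"status": "approved", "reference": "stub-ref"}


def settle_invoice(invoice, provider):
    """Hypothetical code under refactor; the provider is passed in."""
    result = provider.charge(invoice["customer_id"], invoice["total_cents"])
    return {"paid": result["status"] == "approved", "reference": result["reference"]}


provider = FixedPaymentProvider()
outcome = settle_invoice({"customer_id": "c-1", "total_cents": 11800}, provider)
```

Because the stub records its calls, the same check covers both the returned value and the fact that exactly one charge was attempted.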

A cheap check should be narrow, stable, fast, and easy to update when the expected output changes. If updating the snapshot feels like surgery, the check is too heavy.

You are not proving the whole system is correct. You are building a tripwire. For assistant-led refactoring, that is often the difference between a clean small change and an afternoon spent guessing what broke.

Use a boring refactoring loop

Boring beats clever here. You want a repeatable loop that catches drift early, before one "small cleanup" turns into a production bug.

Start by asking the assistant to explain the current code path in plain English. What inputs does it read? What output does it return? What side effects does it cause? Which parts look risky? If the explanation is fuzzy or misses obvious behavior, do not ask for edits yet.

Then ask for one small change only. Good requests are narrow enough that you can compare before and after behavior in a few minutes. Extract one repeated block into a helper without changing output. Rename one confusing variable inside a single function. Split one long condition into named checks and keep the same logic.
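Splitting a condition into named checks is one of the few edits you can verify exhaustively. The sketch below uses an invented free-shipping rule with made-up thresholds; the point is the comparison loop, which runs the old and new versions over boundary values and reports any disagreement.

```python
from itertools import product


def qualifies_before(order):
    """Original shape: one long condition (hypothetical rule)."""
    return (order["subtotal"] >= 50 and not order["oversized"]) or (
        order["member"] and order["subtotal"] >= 25
    )


def qualifies_after(order):
    """Same logic, split into named checks."""
    meets_standard_threshold = order["subtotal"] >= 50 and not order["oversized"]
    meets_member_threshold = order["member"] and order["subtotal"] >= 25
    return meets_standard_threshold or meets_member_threshold


def find_mismatches():
    """Try every flag combination around each boundary value."""
    subtotals = [0, 24.99, 25, 49.99, 50, 100]
    mismatches = []
    for subtotal, oversized, member in product(subtotals, [True, False], [True, False]):
        order = {"subtotal": subtotal, "oversized": oversized, "member": member}
        if qualifies_before(order) != qualifies_after(order):
            mismatches.append(order)
    return mismatches


print(find_mismatches())  # an empty list means the two versions agree on every case tried
```

If the assistant had quietly changed `>=` to `>` or merged the two thresholds, the boundary values would show up in the mismatch list.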

Avoid prompts like "refactor this file" or "clean up this module." Those invite taste-based changes you cannot verify.

After every edit, run the same traffic samples or snapshots you captured earlier. Check returned data, status codes, rendered text, logs, and visible side effects. If one sample changes and you did not expect it, stop there.

When behavior drifts, narrow the prompt instead of arguing with the result. Limit the assistant to one function, one branch, or one repeated block. Tell it to keep output byte-for-byte the same and avoid renames outside the touched area.

Small commits matter more than people think. Commit each safe step before asking for the next one, even if the change feels trivial. A stack of tiny commits gives you clean rollback points and makes review much easier.

A normal session might look like this: explain the handler, extract one helper, run snapshots, commit; rename two local variables, run snapshots, commit; remove one dead branch, run snapshots, commit. It feels slow for about ten minutes. Then it starts saving hours.

Five safe commits usually beat one ambitious rewrite.

A realistic example

Picture a checkout service with one messy function called calculateCheckout. It applies coupon codes, member discounts, tax rules, and a free-shipping threshold in one long block. Nobody trusts it, and nobody wants to touch it because there are no tests.

This kind of code tempts people to clean everything in one pass. That is where breakage starts. A smaller move is usually safer than a prettier rewrite.

Start by pulling ten real carts from recent traffic. Pick a mix that reflects what the service actually sees: one simple cart, one with a coupon, one with an expired coupon, one with tax-exempt items, one that barely qualifies for free shipping, and a few ugly edge cases. Strip out personal data, then save each cart payload so you can run it again.

For each sample, record what the system does today. Keep the snapshot plain: final total, tax amount, discount lines applied, and any error messages shown to the user.

Now choose one rule only. Maybe the function has a branch that says "apply 10% off when the cart contains category X and the subtotal is over $100." Ask your assistant to extract just that branch into a helper like applyCategoryDiscount(cart) and leave the rest alone. Do not rename half the file. Do not combine rules yet.

Run the ten saved carts again. Nine may match exactly. One might change because the helper now rounds a discount at a different step, which also changes tax by a few cents. That is exactly the kind of difference you want to catch before you keep cleaning.
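Here is a small worked version of that drift. Everything in it is invented for illustration: the 8.25% tax rate, the cart price, and the assumption that the legacy code kept the discount unrounded until tax was computed. `apply_category_discount` is the Python spelling of the helper named above.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
TAX_RATE = Decimal("0.0825")  # hypothetical rate, chosen for illustration


def cents(amount):
    """Round to whole cents, half up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def qualifies(cart, subtotal):
    return subtotal > 100 and any(i["category"] == "X" for i in cart["items"])


def checkout_before(cart):
    """Assumed legacy shape: the discount stays unrounded until tax is computed."""
    subtotal = sum((i["price"] for i in cart["items"]), Decimal("0"))
    discount = subtotal * Decimal("0.10") if qualifies(cart, subtotal) else Decimal("0")
    taxable = subtotal - discount
    tax = cents(taxable * TAX_RATE)
    return {"discount": cents(discount), "tax": tax, "total": cents(taxable + tax)}


def apply_category_discount(cart):
    """Extracted helper: rounds the discount to cents before returning it."""
    subtotal = sum((i["price"] for i in cart["items"]), Decimal("0"))
    if qualifies(cart, subtotal):
        return cents(subtotal * Decimal("0.10"))
    return Decimal("0")


def checkout_after(cart):
    subtotal = sum((i["price"] for i in cart["items"]), Decimal("0"))
    discount = apply_category_discount(cart)
    taxable = subtotal - discount
    tax = cents(taxable * TAX_RATE)
    return {"discount": discount, "tax": tax, "total": cents(taxable + tax)}


matching_cart = {"items": [{"category": "X", "price": Decimal("120.00")}]}
drifting_cart = {"items": [{"category": "X", "price": Decimal("101.35")}]}

print(checkout_before(matching_cart) == checkout_after(matching_cart))  # True
print(checkout_before(drifting_cart))  # discount 10.14, tax 7.53, total 98.75
print(checkout_after(drifting_cart))   # discount 10.14, tax 7.52, total 98.73
```

Both versions show the same discount line of 10.14. The drift only appears in tax and total, because the helper rounds 10.135 up before tax is calculated. A review of the diff alone would be unlikely to catch that; a saved cart does.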

If the outputs still match, you earned the next move. Extract one more rule, rerun the same carts, and keep going in short loops. If the outputs do not match, undo the change or fix that branch before you touch anything else.

This is also how teams working on lean systems with strict uptime requirements avoid careless regressions. Oleg often uses the same approach in larger production work: capture real behavior first, then change one narrow piece at a time.

Mistakes that cause avoidable breakage

The fastest way to break legacy code is to ask the assistant for a full cleanup and accept a giant diff. When tests are thin, a rewrite hides lots of behavior changes inside nicer names and shorter files.

Another common mistake is mixing style fixes with logic edits. Renaming variables, reordering functions, changing conditionals, and swapping APIs in one pass makes review much harder. If something fails later, nobody knows whether the bug came from formatting noise or real behavior drift.

Pretty code proves almost nothing. A formatter can make a messy file look calm while a tiny boolean change flips who gets charged, emailed, or locked out. Readable diffs help, but checks based on real behavior matter far more.

Snapshots help only if they stay clean. If they include timestamps, random IDs, shuffled order, or other noisy fields, people stop trusting them and click "update" without reading. Strip or freeze random values first so snapshot failures mean something.
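Freezing values can look like the sketch below. It assumes the code already accepts, or can cheaply be given, injectable sources for time and IDs; `build_receipt` and its parameters are hypothetical. If the legacy code cannot take parameters like these without a risky edit, stripping the noisy fields from the output after the fact is the safer option.

```python
import uuid
from datetime import datetime, timezone


def build_receipt(order, now=None, new_id=None):
    """Hypothetical code under refactor with injectable time and ID sources."""
    now = now or (lambda: datetime.now(timezone.utc))
    new_id = new_id or (lambda: str(uuid.uuid4()))
    return {
        "receipt_id": new_id(),
        "issued_at": now().isoformat(),
        "total": order["total"],
    }


FROZEN_TIME = datetime(2026, 1, 1, tzinfo=timezone.utc)


def frozen_receipt(order):
    """Same code path, but with fixed time and ID so snapshots stay stable."""
    return build_receipt(order, now=lambda: FROZEN_TIME, new_id=lambda: "receipt-0001")


first = frozen_receipt({"total": 42})
second = frozen_receipt({"total": 42})
live = build_receipt({"total": 42})  # production defaults still produce fresh values
```

With frozen inputs, two runs produce identical output, so any snapshot failure points at a real change.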

Teams also miss the period right after release. They merge, see green linting, and move on. Then logs fill with new errors, support tickets mention odd output, and real users hit the branch nobody sampled.

If you already have error tracking or server logs, use them on purpose for the first release after a refactor. Watch for spikes in exceptions, slower requests, and repeated complaints around one screen or action. Those signals often catch breakage faster than code review.
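If you have nothing but server logs, a rough spike check is still possible. The sketch below assumes a log format where the word `ERROR` is followed by an exception name; that format, the sample lines, and the doubling threshold are all assumptions you would adapt to your own logs.

```python
import re
from collections import Counter

ERROR_PATTERN = re.compile(r"\bERROR\s+(\w+)")  # assumed log format


def count_errors(log_lines):
    """Count ERROR lines by the exception name that follows the level."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def find_spikes(before, after, factor=2):
    """Return errors that are new or grew by more than `factor` times."""
    return {
        name: (before[name], count)
        for name, count in after.items()
        if count > factor * before[name]
    }


before_release = [
    "2026-03-12T10:00:00Z INFO request ok",
    "2026-03-12T10:00:05Z ERROR TimeoutError invoice=42",
]
after_release = [
    "2026-03-13T10:00:00Z ERROR TimeoutError invoice=43",
    "2026-03-13T10:00:02Z ERROR TimeoutError invoice=44",
    "2026-03-13T10:00:04Z ERROR TimeoutError invoice=45",
    "2026-03-13T10:00:06Z ERROR KeyError field=email",
]

spikes = find_spikes(count_errors(before_release), count_errors(after_release))
```

Compare windows of similar length and traffic, otherwise a busy hour will look like a regression.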

A safer pattern is simple: ask for one narrow change at a time, keep style-only edits separate from behavior edits, compare outputs against saved traffic samples or stable snapshots, and check logs and user feedback after the code ships.

That feels slower, but it usually saves time. One broad prompt can create a weekend of rollback work. Four small prompts, reviewed against real inputs, are easier to trust and much easier to undo.

Quick checks before you merge

The last review before merge matters more than people admit. A clean diff can still change behavior in small, expensive ways. Spend a few minutes checking the parts that usually break first.

Start with the evidence you collected before the edit. Run the same traffic samples, the same snapshot checks, and the same manual inputs. Do not ask whether the new code looks better. Ask whether each sample still produces the same business result.

A short merge check works well:

Compare outputs from before and after. Totals, status values, formatted fields, and side effects should still match where they are supposed to match.

Read the changed error paths carefully. Failures often drift during cleanup, especially around nil values, retries, timeouts, and validation.

Scan logs from the refactored slice. New warnings, extra retries, or noisy stack traces usually mean you changed more than you thought.

Make sure you can roll back fast. One small commit, one feature flag, or one isolated file change is much safer than a wide rewrite.

Finally, write down what you still did not check. Gaps are fine if you name them clearly and keep the scope small.

Error handling deserves extra attention because assistants often tidy it up in ways that look harmless. A helper may now swallow an error, return a different message, or skip a retry. Users may still reach the same screen, but billing, audit trails, or downstream jobs can change.

Logs help catch this. If the old code produced one warning and the new code produces twenty, that is not cosmetic noise. A branch now triggers more often, or a fallback path no longer works.
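You can catch this kind of drift before release by capturing log output while the saved samples run. The sketch below uses Python's standard `logging` module; the logger name and the two `load_price` functions are hypothetical. It shows a tidy-looking edit that returns the same value but drops a warning.

```python
import logging

log = logging.getLogger("checkout")  # hypothetical logger name


class CaptureHandler(logging.Handler):
    """Collect warnings and errors emitted while a sample runs."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


def run_with_log_capture(func, *args):
    """Run func and return its result together with the captured log messages."""
    handler = CaptureHandler()
    log.addHandler(handler)
    try:
        result = func(*args)
    finally:
        log.removeHandler(handler)
    return result, handler.messages


def load_price_before(prices, sku):
    try:
        return prices[sku]
    except KeyError:
        log.warning("missing price for %s, using fallback", sku)
        return 0


def load_price_after(prices, sku):
    # Tidier, same return value, but the warning is gone.
    return prices.get(sku, 0)


old_result, old_logs = run_with_log_capture(load_price_before, {"A": 10}, "B")
new_result, new_logs = run_with_log_capture(load_price_after, {"A": 10}, "B")
```

The return values match, so an output-only snapshot passes. The captured logs differ, which is the signal that someone watching for missing prices just lost their alert.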

Rollback speed matters too. If you cannot undo the slice in a minute or two, the slice is probably too large. Small reversals beat brave debugging.

Leave a short note with the merge. Mention which samples you reran, which error paths you checked, and what still lacks coverage. That note helps the next person decide where to be careful, and it helps you when the same area comes up again next week.

What to do next

After one safe cleanup, do not jump to a bigger rewrite. Pick the next small slice that looks similar to the one you just finished, and run the same process again.

That rhythm matters more than speed. Steady repeats beat one ambitious pass that touches ten files and changes behavior by accident.

A simple routine works well: choose one path with clear inputs and outputs, capture a few real traffic samples before editing, keep snapshots of current behavior, make the assistant change only that narrow area, and compare results before you merge.

When a snapshot keeps matching through several small changes, turn it into a real test. Start with the parts that cost the most when they break, like billing math, permission checks, import jobs, or customer-facing text formatting.
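Promoting a snapshot can be mostly mechanical: move the recorded cases into a test and give each one a name. The sketch below uses the standard `unittest` module; `member_discount` and the recorded values are invented stand-ins for a real billing rule and its captured outputs.

```python
import unittest
from decimal import Decimal


def member_discount(subtotal, is_member):
    """Hypothetical billing rule standing in for the real code."""
    if is_member and subtotal >= Decimal("50"):
        return (subtotal * Decimal("0.05")).quantize(Decimal("0.01"))
    return Decimal("0.00")


# Each row was a snapshot entry: label, inputs, and the output recorded earlier.
SNAPSHOT_CASES = [
    ("member over threshold", Decimal("80.00"), True, Decimal("4.00")),
    ("member under threshold", Decimal("49.99"), True, Decimal("0.00")),
    ("non-member", Decimal("80.00"), False, Decimal("0.00")),
]


class MemberDiscountTest(unittest.TestCase):
    def test_matches_recorded_behavior(self):
        for label, subtotal, is_member, expected in SNAPSHOT_CASES:
            with self.subTest(label):
                self.assertEqual(member_discount(subtotal, is_member), expected)


def run_suite():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MemberDiscountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Using `subTest` keeps one failing case from hiding the others, which matters when the cases came from real traffic and each one protects a different rule.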

Do not wait for a perfect test suite. A handful of focused tests becomes useful much faster than a grand plan that never gets finished.

If your team is doing this kind of legacy cleanup regularly, it helps to have someone set the rules for how assistants are used, what gets checked, and where refactors should stop. That is the kind of practical AI-first engineering and Fractional CTO work Oleg writes about on oleg.is, especially for small teams trying to modernize without breaking production.

The goal is simple: preserve behavior first, clean structure second, and build tests as you go. That is how old code gets safer instead of just prettier.

Frequently Asked Questions

Can I safely refactor old code if I have almost no tests?

Yes, but keep the scope small. Start with one endpoint, one job, or one screen, capture real inputs and outputs, and change one narrow part at a time. If you refactor a whole module at once, you turn every edit into guesswork.

What should I refactor first?

Pick the slice that people can verify fast. A single API route, one form submission, or one report works better than a broad area like billing or user management. Clear inputs and outputs make it much easier to spot drift.

How many traffic samples do I need before I start?

You do not need many at first. Five to ten real examples usually give you a useful baseline if you include one normal case, one empty case, and a few ugly cases from logs or support tickets. Real traffic beats invented examples almost every time.

What counts as messy input?

Messy input means the stuff old systems quietly learned to accept. Think extra spaces, mixed casing, missing fields, duplicate values, odd date formats, partial records, or null in places you did not expect. Those cases often break first during refactoring.

Are snapshots enough, or do I need full tests first?

Snapshots work well as a first tripwire. Save the parts that must stay the same, such as totals, field names, status codes, and error text, then ignore noisy values like timestamps or random IDs. Later, turn the stable ones into real tests.

How do I stop an assistant from changing behavior during a refactor?

Give the assistant tight limits. Ask it to explain the current behavior first, then request one small edit inside one function or branch. Tell it to keep output the same and avoid unrelated renames, file moves, or helper rewrites.

Should I mix formatting changes with logic changes?

No. Keep style edits and logic edits separate so you can review each change with less noise. If you rename variables, reorder code, and change logic in one diff, you make the real risk much harder to spot.

What should I check right before I merge?

Run the same samples you captured before the edit and compare business results, not just code style. Check returned data, status codes, error paths, logs, and side effects like emails, records, or retries. Also make sure you can roll back fast if something slips through.

What if the new code only changes the output a little?

Treat any output change as a real signal until you prove otherwise. A small shift in rounding, field order, or error handling can break billing, imports, or another service. Either fix the refactor to match the old behavior or mark the change as intentional and review it on purpose.

When should I turn snapshots into real tests?

Start as soon as one snapshot or sample keeps passing through a few small refactors. Good first targets include billing math, permission checks, import jobs, and customer-facing output. You do not need a grand test plan; one stable check at a time works fine.