Oct 23, 2025 · 7 min read

AI pair programming for legacy code without breakage

AI pair programming for legacy code works best when you fence risk, record current behavior, and keep each change small enough to trust.

Why old code feels dangerous

Old code feels risky because the code you see is rarely the whole story. A small edit in one file can change behavior in places that do not look connected.

Billing and account logic are classic examples. Someone renames a field or tweaks one condition, and the app still compiles. Then a report total shifts, an export changes format, and a customer email shows the wrong number. Nothing in the diff looked dramatic, but the behavior changed anyway.

Names make this worse. In older systems, variables and methods often carry business rules nobody wrote down. A method called applyFix, legacyMode, or specialCase might hide a discount from years ago, a workaround for one large customer, or a tax rule that only runs at month end. The name tells you almost nothing, but the code still decides real money, real dates, and real permissions.

Missing tests add another layer of fear. When nobody has captured current behavior, every change becomes guesswork. You can read the code, run the app, and still miss what users depend on every day. Even careful developers end up asking the same question after each edit: did I fix the bug, or did I just move it?

That is why AI pair programming helps and also creates risk. It can scan a messy module quickly, summarize branches, suggest test cases, and draft a small patch in minutes. It can also spread a bad assumption just as quickly. If it misreads one rule, it can repeat that mistake across helpers, tests, and refactors before anyone catches it.

Legacy code refactoring feels dangerous because the code is not just old. It is full of hidden promises to the business. Some of those promises live in comments. Some live in support tickets. Many live only in behavior. Until you pin that behavior down, even a clean-looking change can break something people quietly rely on.

Decide what you will not change

The fastest way to create risk with AI pair programming for legacy code is to give the model a vague goal like "clean this up." Old code needs a fence first. You want a small work area, not an open field.

Pick the exact problem, then name the code that stays off limits. That usually means whole files, helper functions, background jobs, and unrelated user flows. If you only need to fix an invoice rule, leave authentication, email templates, admin filters, and CSS alone.

Write that boundary down in plain language. A short note beats a clever prompt.

  • The change starts when the user opens the invoice edit screen.
  • It ends when the updated total is saved and shown.
  • PDF export does not change.
  • Tax logic for other countries does not change.

That kind of note keeps both you and the model focused. It also makes review easier because everyone can see what belongs in the diff and what does not.
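The boundary note can also be checked mechanically. The sketch below is one possible way to do it, assuming you can list the files a change touched. The paths and patterns in `ALLOWED` are hypothetical, not taken from any real project.

```python
from fnmatch import fnmatch

# Hypothetical fence for the invoice fix: the only paths the change may touch.
ALLOWED = ["billing/invoice_edit.py", "billing/totals.py", "tests/billing/*"]


def files_outside_fence(changed_files, allowed=ALLOWED):
    """Return the changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed)
    ]


# Example: the third file is PDF export, which the note says must not change.
changed = ["billing/totals.py", "tests/billing/test_totals.py", "export/pdf.py"]
print(files_outside_fence(changed))  # ['export/pdf.py']
```

An empty result does not prove the change is safe. It only shows the diff stayed inside the area you agreed on.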

Draw the line around one user path

Legacy systems usually break through side effects, not through the part you touched on purpose. Define the exact user path you need to fix. Name the screen, action, input, and expected result. If the path is too wide to describe in a few lines, it is probably too wide to change in one pass.

This is also the moment to reject broad cleanup. Leave the tempting stuff alone: renamed variables, moved files, prettier formatting, swapped patterns, and "while we are here" rewrites. Those edits feel harmless, but they make review noisy and hide the risky part inside cosmetic churn.

Keep the first pass boring

Boring is good here. It means you can compare before and after without guessing. Ask for the smallest code change that fixes the one path you defined. If the model suggests extra refactors, cut them out and park them for later.

Style work can wait. Mixed naming, stale comments, and ugly spacing are annoying, but they rarely hurt users today. Fix behavior first. When the bug is gone and the path is stable, then decide whether the code still needs cleanup.

Capture current behavior first

Old code often survives because it does something the business quietly depends on. The names are messy, the flow looks wrong, but invoices still close, exports still match, and one strange branch keeps a customer happy. If you skip behavior capture, you are guessing.

This step matters more than the prompt. The model can suggest cleaner code fast, but it cannot know which ugly result is the one your users expect.

Run the feature exactly as it works today. Feed it a handful of real inputs and keep the outputs. Use cases the team recognizes: one normal case, one with missing data, one that looks wrong but people accept anyway, and one that caused support questions before.

For each example, save the input, the exact output or screen result, some proof such as a log line, API response, or screenshot, and a short note on why the case matters.

Then turn those examples into tests around current behavior. Do not wait for pretty test helpers or a clean design. A blunt test with hardcoded fixtures is often better than a clever test you never finish. The goal is simple: if the result changes later, you want a failing test, not a surprise from a customer.

Odd cases deserve their own notes. Maybe the system rounds tax up for one old customer group. Maybe an empty middle name creates a double space in a PDF, and nobody wants to touch that before quarter close. Write those details down. Teams forget them fast.

Keep comparison material outside the test suite too when it helps. Screenshots catch layout drift. Logged responses catch field changes. Sample JSON files catch tiny format shifts that people miss in review.

A simple example makes this concrete. If an old invoice rule turns "net 30" into a due date that skips weekends only for one region, save three or four real invoices and their current dates before you touch the code. Then ask the model to refactor the calculation in tiny steps. You review the diff with something solid in hand instead of trusting that cleaner code must be correct.
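A blunt test for the due-date example might look like the sketch below. Everything in it is illustrative: `legacy_due_date` is a stand-in for the real calculation, the region code "EU" is an assumption, and the recorded dates are invented sample values rather than real invoices.

```python
from datetime import date, timedelta


def legacy_due_date(invoice_date, region):
    """Stand-in for the old rule: net 30, skipping weekends for one region only."""
    due = invoice_date + timedelta(days=30)
    if region == "EU":  # hypothetical region code
        while due.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            due += timedelta(days=1)
    return due


# Hardcoded fixtures: (invoice date, region, due date the system produces today).
RECORDED = [
    (date(2025, 3, 7), "EU", date(2025, 4, 7)),   # day 30 is a Sunday, moved to Monday
    (date(2025, 3, 7), "US", date(2025, 4, 6)),   # same invoice date, weekend kept
    (date(2025, 3, 10), "EU", date(2025, 4, 9)),  # day 30 is a weekday, no shift
]


def test_due_dates_match_recorded_behavior():
    for invoice_date, region, expected in RECORDED:
        assert legacy_due_date(invoice_date, region) == expected


test_due_dates_match_recorded_behavior()
```

The test asserts what the code does today, not what it should do. If a later refactor moves any of these dates, the failure shows up here first.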

Work with AI one step at a time

Legacy code gets risky when you ask AI for too much at once. A full rewrite looks efficient, but it hides bad guesses, missed edge cases, and changes nobody can review with confidence. In practice, small prompts beat ambitious ones almost every time.

Start with one narrow question. Ask the model to explain a single function, branch, or strange condition. If the answer sounds vague, that tells you something useful: the code is harder to reason about than it first looked, and you should shrink the scope again.

Before the model edits anything, make it describe what the code does now. Ask for plain-English behavior, likely inputs, and places where a change could leak into other parts of the system. That step often catches hidden assumptions early. It also gives you a baseline for review.

Then ask for tests that lock in current behavior, even if that behavior looks odd. Old code often contains rules that exist for a reason nobody remembers. A good prompt is less like "fix this module" and more like this:

  • Explain what this method does, line by line.
  • Suggest tests that lock in current behavior.
  • Change only the discount rule inside this method.
  • Show the diff and explain each edited line.

After that, make one code change and run checks right away. Run the tests, linting, type checks, and the smallest manual check you can do. If something fails, stop there. Do not stack another prompt on top of a shaky result.
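The "stop at the first failure" habit is easy to script. This is a minimal sketch, not a prescribed tool: the commands in `demo_checks` are stand-ins that run anywhere, and in a real project they would be whatever test runner, linter, and type checker the team already uses.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (name, command) in order; return the first failing name, or None."""
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return name  # stop here, do not run the remaining checks
    return None


# Stand-in commands so the sketch is runnable without any project setup.
demo_checks = [
    ("tests", [sys.executable, "-c", "print('tests ok')"]),
    ("lint", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("types", [sys.executable, "-c", "print('never reached')"]),
]

print(run_checks(demo_checks))  # lint
```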

Review each diff line by line before you ask for the next change. This matters more than the prompt itself. AI often writes one correct line next to one risky cleanup that nobody asked for. Reject side changes, rename less, and keep refactors small enough that one person can read the whole diff in a few minutes.

A boring rhythm works best: read, explain, test, change, check, review. It feels slower. Usually it saves time because you spend less of it untangling a clever edit that touched six things at once.

Keep each refactor narrow enough to review

Old code punishes ambition. The safest refactor is usually the dull one: one rename, one extracted function, one moved block. If a reviewer cannot explain the diff in under a minute, the change is too wide.

Renames are a good first pass because they clarify meaning without changing behavior. Rename one variable or one function, run the checks, and stop there. When you bundle renames with logic edits, reviewers have to track two moving targets at once.

A small billing example makes this obvious. Changing total to invoiceTotal is easy to review. Changing that name while also changing tax math and moving code to another file is where bugs hide.

When a function does too many jobs, split it before you move it. Pull out one small helper, keep the same inputs and outputs, and leave the call order alone. After tests pass, move that helper in a separate change if the new location still makes sense.

AI pair programming for legacy code works better when you ask for one kind of edit at a time. Ask the model to rename a method, or extract a pure helper, or remove dead comments. Do not ask it to clean up a whole file unless you want a diff full of noise.

A safe AI coding workflow is simple: rename one thing, run tests or compare current output, extract one small function without changing results, and move logic only after that extracted version passes. If you still want cleanup, open a separate change for it.

Stop when the diff gets hard to read. That point arrives quickly, especially when the model starts touching formatting, imports, comments, and logic in one pass. A long diff can still be correct, but it is harder to trust.

Cleanup deserves its own change. Reordering imports, deleting dead code, and fixing style issues may be good work, but those edits should not sit next to behavior changes.

A simple example: changing an old invoice rule

An old billing method can be 200 lines long, mix tax math with discounts, and still handle money every day. Few teams want to rewrite that just to add one exception.

Say a company needs a small change: invoices for one partner program should skip a local tax, while every other invoice should stay the same. The safest move is not a cleanup pass. It is to lock down what the code already does.

The team starts with five real invoices that accounting already approved. They pick examples that cover the messy corners of the method: a normal monthly invoice, one with a discount, a refund or credit note, a manual line-item adjustment, and a customer that already has a tax exemption.

Next, they use AI to write tests around those records. The prompt is plain: take these inputs, run the current billing method, and assert the exact subtotal, tax, total, and rounding. This is a good fit for AI pair programming because the setup is repetitive, while the team still checks that every expected number is right.

Once those tests pass on the untouched code, they add one more invoice for the new rule. This case belongs to the partner program, so the local tax should be zero. Everything else should keep working exactly as before.

The code change stays small. The team adds one condition near the tax branch and leaves the rest alone. They do not rename methods, move logic into new files, or tidy unrelated code. That kind of cleanup feels productive, but it makes review harder and riskier.

Before release, they compare old and new results. The five older invoices should produce the same totals they produced before. The new invoice should differ in one place only: the local tax line.
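One way to make that comparison explicit is to diff the line-level results field by field. The sketch below assumes each result is a flat dictionary. The field names and amounts are invented for illustration.

```python
def changed_fields(before, after):
    """Return the sorted field names whose values differ between two results."""
    keys = set(before) | set(after)
    return sorted(key for key in keys if before.get(key) != after.get(key))


# Hypothetical line-level results for the new partner-program invoice.
old_lines = {"subtotal": "100.00", "state_tax": "5.00", "local_tax": "2.00"}
new_lines = {"subtotal": "100.00", "state_tax": "5.00", "local_tax": "0.00"}

# Hypothetical results for one of the older invoices, before and after the change.
older_before = {"subtotal": "80.00", "state_tax": "4.00", "local_tax": "1.60"}
older_after = {"subtotal": "80.00", "state_tax": "4.00", "local_tax": "1.60"}

print(changed_fields(old_lines, new_lines))      # ['local_tax']
print(changed_fields(older_before, older_after))  # []
```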

This is why narrow refactors are easier to trust. The team does not rely on a gut feeling. They have proof that the old behavior still holds, and proof that the new rule changed only what it was meant to change.

Mistakes that create false confidence

The most dangerous moment in old code work is not when the diff looks messy. It is when everything looks clean, modern, and easy to approve. That feeling tricks teams into merging changes they do not really understand.

A common mistake starts with an oversized prompt. Someone asks AI to modernize a whole module, clean naming, simplify conditionals, and update tests in one go. The output may look neat, but the request is too broad to review with care. In legacy code refactoring, a pretty rewrite often hides a behavior change nobody noticed.

Another trap is trusting the diff because it reads well. Clean style does not mean correct logic. Old code often contains odd checks, repeated branches, or awkward variable names because somebody hit a production problem years ago and patched around it. If you delete the strange part before you know why it exists, you can remove the one thing that kept edge cases working.

Teams also lose track when they mix several kinds of change at once. A bug fix, a naming cleanup, and a small rewrite may each be reasonable on their own. Put them in one change set and review gets blurry fast. When a test fails, nobody knows which part caused it. When a bug slips through, rollback gets harder than it should be.

Generated tests can create false comfort too. AI often writes tests that only repeat the new code path. The test passes because it mirrors the same assumption the code now makes. That does not capture current behavior. It only proves the new version agrees with itself.

A small billing example makes this obvious. Suppose an old function keeps a weird rule for invoices created on the last day of the month. The model removes it because it "looks redundant" and writes tests for regular dates only. The suite stays green. Production breaks the next time that edge case appears.
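The sketch below shows how that green suite can happen. Both functions are hypothetical: the "weird rule" here is assumed to roll month-end invoices into the next billing period, which is only one example of what such a rule might do.

```python
import calendar
from datetime import date


def is_month_end(d):
    """True when d is the last calendar day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def legacy_billing_period(created):
    """Stand-in for the old code: month-end invoices roll into the next period."""
    if is_month_end(created):
        if created.month == 12:
            return (created.year + 1, 1)
        return (created.year, created.month + 1)
    return (created.year, created.month)


def cleaned_billing_period(created):
    """Stand-in for the 'simplified' version that dropped the odd branch."""
    return (created.year, created.month)


regular_dates = [date(2025, 1, 15), date(2025, 2, 10)]
edge_date = date(2025, 1, 31)

# Tests on regular dates pass for both versions, so the suite stays green.
assert all(legacy_billing_period(d) == cleaned_billing_period(d) for d in regular_dates)

# Only a test recorded from the old behavior exposes the removed rule.
assert legacy_billing_period(edge_date) == (2025, 2)
assert cleaned_billing_period(edge_date) != legacy_billing_period(edge_date)
```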

Trust grows when changes stay narrow, tests check old behavior first, and reviewers read logic line by line. If the update feels too easy to approve, pause and ask what you still have not explained.

Quick checks before you ship

A small change in old code can look harmless right up to the moment it hits real traffic. Before you merge anything, pause and make sure the edit is easy to prove, easy to review, and easy to undo.

Use a short release gate:

  • The tests tied to the change fail before the edit and pass after it.
  • Old sample inputs still return the same results unless you meant to change them.
  • You can explain the update in two or three plain sentences.
  • Another developer can read the diff in one sitting.
  • You can roll back by reverting a small set of files, not by untangling a pile of unrelated cleanup.

A quick example helps. Say you changed an old invoice rule so discounts stop applying to canceled orders. Good evidence looks like this: one new test fails on the old code because a canceled order still gets a discount, the same test passes on the new code, and three older invoice samples still produce the same totals they produced yesterday.
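In code, that evidence could be as small as the sketch below. The two functions are stand-ins for the old and new rule, and the order fields and amounts are assumptions made up for the example.

```python
from decimal import Decimal


def discount_old(order):
    """Stand-in for the old rule: the discount applies to every order."""
    return order["amount"] * order["discount_rate"]


def discount_new(order):
    """Stand-in for the fix: canceled orders get no discount."""
    if order["status"] == "canceled":
        return Decimal("0")
    return order["amount"] * order["discount_rate"]


canceled = {"amount": Decimal("200.00"), "discount_rate": Decimal("0.10"), "status": "canceled"}
older_samples = [
    {"amount": Decimal("100.00"), "discount_rate": Decimal("0.10"), "status": "paid"},
    {"amount": Decimal("59.90"), "discount_rate": Decimal("0.00"), "status": "paid"},
    {"amount": Decimal("310.00"), "discount_rate": Decimal("0.15"), "status": "open"},
]

# The new expectation fails on the old code and passes on the new code.
assert discount_old(canceled) != Decimal("0")
assert discount_new(canceled) == Decimal("0")

# The older samples must not move.
for order in older_samples:
    assert discount_old(order) == discount_new(order)
```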

Plain language is a strong check. If you cannot explain the change clearly, you probably changed too much.

What to do next if the code still fights back

When a module still feels hostile after a few careful edits, stop trying to fix everything in one pass. Old code usually fights back for a reason: too many side effects, hidden rules, weak tests, or business logic that nobody wrote down.

Keep a simple list of risky areas and return to it after each small win. Note which files or functions feel unsafe, what might break if they change, which behavior still has no test, and what the smallest next edit could be. Treat that list as a map, not a backlog you have to clear this month.

A good rule is boring and effective: every time you touch the module, add one more test. One test for a strange discount rule. One test for a null input. One test for the old branch that looks wrong but still matches production behavior. Over a few weeks, that adds up.

This is where AI helps most. Use it to read confusing files, explain branches in plain language, draft tests that capture current behavior, and suggest very narrow edits. Do not ask it to rewrite the whole module because it looks messy. Messy code with known behavior is often less risky than neat code with new bugs.

If your team keeps circling the same hot spots, an outside review can save time. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor, and this kind of review can help a startup or small business set practical guardrails before a larger cleanup.

Sometimes a short consultation is enough. A second set of eyes can stop the kind of rewrite that passes code review, misses one old edge case, and breaks production on a normal Tuesday.