AI coding tools in legacy codebases: boundaries first
AI coding tools in legacy codebases work better when teams set file limits, refactor rules, and blocked areas before any trial starts.

Why old code gets worse faster with AI
Old code carries habits no team would choose on purpose today. You see vague names, helper functions doing three jobs, copied conditions, and business rules hidden in odd places. An AI model reads those patterns as normal and produces more of them.
That is why AI can make a legacy system messier very quickly. The model cannot tell which strange pattern is a real rule and which one came from a rushed fix three years ago. It predicts the next likely code, so it copies the repo's habits, good and bad.
Speed makes this easy to miss. Fresh files often look neat at first glance, and reviewers feel progress because so much code appeared so fast. Then the problems show up: duplicate logic with slightly different names, vague comments, and one more utility function that almost matches five older ones.
The risk grows when a team lets the tool refactor shared code too early. A weak change in one common module rarely stays local. It spreads through imports, tests, error handling, feature flags, and small assumptions across the codebase. By the time someone notices, the team is sorting through a wide diff instead of fixing one bad edit.
Legacy systems also break in places teams forget to test. People check the happy path because it is quick and the demo still works. They miss the awkward cases that kept the system alive for years: partial updates, retry loops, old data formats, timezone quirks, and rules tied to one large customer or one old integration.
AI does not create that risk from nothing. It multiplies what is already there. If the repo has mixed styles, hidden business logic, and weak tests, the tool spreads those problems faster than a human usually can. It feels productive right up to the moment cleanup starts eating the time the team thought it saved.
Decide where AI can write code
A trial goes wrong when the tool can touch the whole repo. Draw the map first. Teams need clear write zones before the first prompt, not after the first bad merge.
Most teams should start in boring areas. That sounds limiting, but it saves time. If the model writes a weak test helper, you can fix it in minutes. If it edits billing rules or auth checks, you might spend a week tracing damage.
Good starting areas are usually test folders, narrow utility files, internal scripts, admin tools that do not affect customer flows, and generated adapters from stable schemas. Write these areas down by folder path, not by fuzzy category. "/tests/integration" is clear. "Helpers" is not.
Blocked areas need the same level of detail. In most legacy systems, that means billing and pricing logic, database migrations, authentication, permissions, public API contracts, and code that handles sensitive or regulated data.
Old systems also hide business rules in odd places. A file may look like a simple formatter but quietly decide tax, access level, or renewal dates. If the team is not sure, treat that area as blocked until someone traces the real behavior.
Each allowed area needs a human owner. One person should review prompts, inspect diffs, and decide whether the area stays open for AI edits. Shared ownership sounds fair, but it usually means nobody notices when weak code starts piling up.
A simple rule works well: one owner per folder, a short allowed list, and expansion only after a few clean review cycles. It feels strict at first. It is cheaper than cleanup.
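A rule like this is easier to enforce when it lives in a small, checkable form. The following is a minimal sketch, not a finished tool: the folder paths, owner names, and the `classify` function are all hypothetical, and paths are assumed to be relative to the repo root. It compares path parts rather than string prefixes, so a folder such as "billing_reports" does not accidentally match "billing".

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Hypothetical policy: folder paths and owner names are placeholders.
ALLOWED_ZONES = {
    "tests/integration": "alice",
    "scripts/internal": "bob",
}
BLOCKED_ZONES = ["billing", "auth", "db/migrations"]


def _is_under(path: str, folder: str) -> bool:
    """True when path sits inside folder, compared by path parts."""
    path_parts = PurePosixPath(path).parts
    folder_parts = PurePosixPath(folder).parts
    return path_parts[: len(folder_parts)] == folder_parts


def classify(path: str) -> Tuple[str, Optional[str]]:
    """Return ("blocked" | "allowed" | "unlisted", owner or None) for one changed file."""
    # Blocked zones are checked first so they always win over allowed ones.
    for folder in BLOCKED_ZONES:
        if _is_under(path, folder):
            return ("blocked", None)
    for folder, owner in ALLOWED_ZONES.items():
        if _is_under(path, folder):
            return ("allowed", owner)
    # Anything not written down stays closed until someone traces it.
    return ("unlisted", None)


for changed in ["tests/integration/test_export.py", "billing/tax.py", "src/formatters/date.py"]:
    print(changed, classify(changed))
```

Treating unlisted paths as closed mirrors the advice above: if the team is not sure what a file does, it is not in a write zone yet.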
Set refactor rules before the trial
AI tools love to tidy things up. They rename methods, split files, reorder imports, and rewrite code that was never part of the task. In a legacy codebase, that turns a small bug fix into a pull request nobody wants to review.
Start with a hard rule: the tool can make tiny refactors only after the existing tests pass. If tests are red, or if that area has no tests, keep the shape of the code as it is. Fix the bug or add the small feature first. Cleanup can wait.
A safe week-one policy is simple:
- Keep each pull request focused on one change.
- Leave files where they are.
- Skip broad renames, extracted modules, and import rewrites.
- Keep behavior the same unless the ticket asks for a behavior change.
These limits save time. Reviewers spot real changes faster when the diff is small. If a release breaks something, the team can roll back one pull request instead of untangling ten mixed edits.
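Part of this policy can be checked before a human opens the diff. The sketch below assumes the team can pass in the text that `git diff --name-status` prints for the pull request, with rename detection on, so renamed files appear as lines starting with `R`. The function name and the file limit are hypothetical; pick a limit your reviewers can actually read.

```python
from typing import List

MAX_FILES = 5  # hypothetical threshold, not a recommendation


def check_diff(name_status: str, max_files: int = MAX_FILES) -> List[str]:
    """Return a list of week-one policy violations for one pull request."""
    problems = []
    touched = 0
    for line in name_status.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0]
        touched += 1
        # Git reports a rename as R<score>, followed by the old and new paths.
        if status.startswith("R") and len(fields) >= 3:
            problems.append(f"file moved or renamed: {fields[1]} -> {fields[2]}")
    if touched > max_files:
        problems.append(f"{touched} files changed, limit is {max_files}")
    return problems


sample = "M\tsrc/export.py\nR100\tsrc/utils.py\tsrc/helpers/utils.py\n"
for problem in check_diff(sample):
    print(problem)
```

A check like this only catches the mechanical part of the policy. Whether behavior stayed the same is still a question for tests and a reviewer.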
Large renames and file moves deserve a hard ban in the first week. AI often treats them as harmless cleanup. They are not harmless in old systems. A rename can break logs, scripts, dashboards, handwritten SQL, or code that depends on string-based lookups. A file move can confuse build steps and ownership rules nobody documented.
Ask engineers to state the intent in plain words inside the pull request: "fix rounding bug in invoice total" or "add missing null check in customer export." That sounds basic, but it keeps both the AI and the reviewer on the same task. When the stated goal is narrow, extra refactors are easier to reject.
One more rule matters: preserve old behavior by default. Legacy systems often contain strange logic for a reason, even if that reason is ugly. If the team wants different behavior, write a separate ticket and review that change on its own. Mixing cleanup with behavior changes is how a pilot turns into a month of repair work.
Block risky domains on day one
Some parts of the system should stay off limits at the start. You do not need to ban the tool everywhere. You do need clear red zones where a human keeps full control.
Start with anything tied to money, access, or irreversible data changes. Legacy systems hide old rules in strange places, and generated code can miss them while sounding very confident.
A simple test helps: if a mistake can charge the wrong amount, grant the wrong access, or destroy data, block AI from editing that area unless a senior engineer gives direct approval.
Billing code needs special care. Old products often carry years of exceptions: discounts for one customer group, tax handling by region, contract terms that changed halfway through a quarter. AI can make that code look cleaner and still break the rule that pays your invoices.
Security deserves the same caution. A generated refactor might move an authorization check, remove a guard that looked repetitive, or widen access in a way that does not stand out in a quick review. Keep humans writing and approving those changes until the team knows exactly where the tool helps and where it drifts.
Deletion flows and schema edits need their own gate because they are hard to undo. If AI suggests dropping a column, changing a type, or rewriting a migration, stop and treat that as a separate decision. Do not hide it inside a larger cleanup branch.
For customer calculations, keep a small bank of sample cases. Use real examples from invoices, refunds, prorated plans, and rounding edge cases. Run them before merge, and make review depend on those results.
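A sample bank can be as plain as a table of inputs and expected results. The sketch below is illustrative only: `invoice_total` is a hypothetical stand-in for the real calculation, the half-up rounding rule is an assumption, and the numbers are made up. A real bank would copy its values from invoices the business has already issued.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Callable, List, Tuple


def invoice_total(subtotal: Decimal, tax_rate: Decimal) -> Decimal:
    """Hypothetical stand-in for the real calculation under test."""
    total = subtotal * (Decimal("1") + tax_rate)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Each case: (label, subtotal, tax_rate, expected total). Values are invented.
SAMPLE_CASES = [
    ("plain invoice", Decimal("100.00"), Decimal("0.20"), Decimal("120.00")),
    ("half-cent rounds up", Decimal("10.25"), Decimal("0.10"), Decimal("11.28")),
    ("zero tax", Decimal("49.99"), Decimal("0"), Decimal("49.99")),
]


def run_sample_cases(calc: Callable[[Decimal, Decimal], Decimal]) -> List[Tuple[str, Decimal, Decimal]]:
    """Run every sample case and return (label, expected, actual) for each mismatch."""
    failures = []
    for label, subtotal, rate, expected in SAMPLE_CASES:
        actual = calc(subtotal, rate)
        if actual != expected:
            failures.append((label, expected, actual))
    return failures


failures = run_sample_cases(invoice_total)
print("ok to review" if not failures else failures)
```

The labels matter as much as the numbers. When a case fails, the reviewer should see which business rule broke, not just which assertion.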
This sounds strict because it is. It still costs less than fixing one bad merge in a blocked area.
Run the first trial in small steps
A pilot works best when it feels almost boring. Pick one low-risk service, give it to one small team, and keep the scope tight enough that a reviewer can read every change without rushing.
Good starting points are internal tools, small background jobs, or a thin API with clear tests. Skip login, payments, security logic, data migrations, and anything that can break trust fast. The first win is control, not speed.
Put the rules on one page. If they do not fit on one page, the team will not use them. That page should answer four basic questions: which folders the tool may change, which files need human approval before merge, whether the tool may refactor old code or only add isolated changes, and what reviewers must reject on sight.
Keep the team small. Two to four engineers is enough. One reviewer should own the final call so standards do not drift after day three.
Track one simple number during the trial: how often reviewers reject AI-written changes. If half the drafts come back because the code is hard to read, names are sloppy, or the tool quietly touched unrelated files, that tells you more than raw output ever will.
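The number is easy to compute from a plain review log. This is a minimal sketch under stated assumptions: the `Review` record, its field names, and the sample entries are hypothetical, and a real team would likely pull the same data from its code review tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Review:
    """One reviewed pull request. Field names are hypothetical."""
    pr_id: str
    ai_written: bool
    rejected: bool
    reason: str = ""


def reject_rate(reviews: Iterable[Review]) -> float:
    """Share of AI-written changes that reviewers sent back (0.0 when there are none)."""
    ai_reviews = [r for r in reviews if r.ai_written]
    if not ai_reviews:
        return 0.0
    return sum(1 for r in ai_reviews if r.rejected) / len(ai_reviews)


def reject_reasons(reviews: Iterable[Review]) -> Counter:
    """Count rejection reasons for AI-written changes only."""
    return Counter(r.reason for r in reviews if r.ai_written and r.rejected)


trial = [
    Review("PR-1", ai_written=True, rejected=True, reason="touched unrelated files"),
    Review("PR-2", ai_written=True, rejected=False),
    Review("PR-3", ai_written=False, rejected=True, reason="missing test"),
    Review("PR-4", ai_written=True, rejected=True, reason="sloppy names"),
    Review("PR-5", ai_written=True, rejected=False),
]
print(reject_rate(trial))  # 0.5
print(reject_reasons(trial))
```

Keeping the reason next to each rejection is the useful part. The rate says whether there is a problem; the reasons say which rule to tighten.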
Set a hard stop after two weeks. Then inspect the merged changes by hand. Look for warning signs: repeated helper functions, diffs that grew wider than expected, comments that explain obvious code, and tests that pass without saying much.
An engineering lead or fractional CTO can spot the pattern quickly by reading ten merged changes in a row. If the code still looks like your code, the trial can grow. If every change needs cleanup, stop and tighten the rules first.
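A script can support that hand review, though it cannot replace it. The sketch below assumes the merged files are Python and uses the standard `ast` module to group functions whose arguments and bodies are structurally identical, whatever they are named. It finds exact copies only; helpers that "almost match" still need a human eye. The function name and sample sources are hypothetical.

```python
import ast
from collections import defaultdict
from typing import Dict, List


def find_duplicate_functions(sources: Dict[str, str]) -> List[List[str]]:
    """Group functions with identical arguments and bodies.

    sources maps a file name to its Python source text.
    Returns a list of groups, each a sorted list of "file:function" labels.
    """
    groups = defaultdict(list)
    for filename, text in sources.items():
        tree = ast.parse(text)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Fingerprint arguments and body but not the function name,
                # so a renamed copy still matches. ast.dump leaves out line
                # numbers by default.
                fingerprint = ast.dump(node.args) + "".join(
                    ast.dump(stmt) for stmt in node.body
                )
                groups[fingerprint].append(f"{filename}:{node.name}")
    return [sorted(labels) for labels in groups.values() if len(labels) > 1]


sources = {
    "a.py": "def fmt_date(d):\n    return d.strftime('%Y-%m-%d')\n",
    "b.py": "def format_day(d):\n    return d.strftime('%Y-%m-%d')\n\ndef other(d):\n    return d\n",
}
print(find_duplicate_functions(sources))
```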
A simple example from a billing flow
Billing is where AI can save time or create expensive mistakes.
One team started with an old invoice module that everyone avoided. It still worked, but nobody trusted it enough to change it quickly. Instead of letting the AI touch invoice logic, they gave it a narrow task: write tests around the module.
That made the trial useful without putting money movement at risk. The AI generated test cases for late fees, invoice totals, and odd date formats from past customer records. Some tests were rough, but reviewers could fix weak assertions in minutes.
During review, the team found something more important than the tests. The same date rules showed up in three different files. One file calculated due dates, another formatted reminder dates, and a third adjusted invoice dates for exports. Each version looked slightly different. That kind of duplication is easy for an AI tool to copy again, which is how a small mess becomes a bigger one.
So they drew a hard line. The AI could not edit tax rules or payment retry logic at all. Those areas carried too much business risk and too many hidden edge cases. A bad change there would not just break code. It could charge the wrong amount or retry a failed payment at the wrong time.
After that review, the team cleaned up date handling by hand. They pulled the shared logic into one helper, checked it against real invoices, and made the tests pass again. Only then did they widen the AI's scope, and even then the permissions stayed narrow: add or update tests, suggest changes to helper functions, and refactor only when reviewers approve the exact files first.
That policy is boring, and that is why it works. The team got faster in a part of the codebase nobody wanted to touch, but they did not let the tool wander into the parts that decide taxes, retries, or payment state.
Mistakes that turn a pilot into cleanup work
The worst pilots usually fail for boring reasons. A team gives the tool too much freedom, merges pretty diffs too quickly, and notices the damage only after support tickets start piling up.
The first mistake is scope. If AI can edit every folder on day one, it will wander into old scripts, abandoned helpers, and code nobody fully understands anymore. A small write area beats a wide one every time.
The next mess comes from large refactors. An AI tool can return a clean diff that looks smarter than the code it replaced. That does not make it safe. In old systems, a rename, file move, or shared helper rewrite can break parts of the product far from the original task.
Style cleanup makes this worse. Teams ask for a small fix, then accept a patch that also reformats files, renames variables, and reshapes functions. Review gets harder immediately. When behavior changes hide inside cosmetic edits, nobody can tell which line caused the bug.
Rollback plans are another weak spot. If a merge touches a sensitive path and the team has no quick way back, a pilot turns into late-night repair work. Before you merge risky output, decide who can revert it, how fast they can do it, and what signal tells them to stop the rollout.
Speed can fool people too. If a team measures success only by tickets closed or hours saved, it can miss the real cost. One fast week means very little if the next two weeks disappear into bug hunts, noisy diffs, and lost trust.
A useful pilot needs a few checks in place: limit edits to a small set of files, reject broad refactors during the test, separate behavior changes from cleanup, make rollback steps clear before merge, and track defect rate, review time, and rework instead of output alone.
This is where technical leadership matters. A good policy feels strict at first, but it saves far more time than it costs.
Quick checks before you expand
A pilot can look healthy and still leave quiet damage behind. Expansion should wait until the boring checks look clean, because that is where most trouble appears first.
The real test is not whether the tool produced code quickly. It is whether your team can review, test, and contain that code without extra confusion.
A simple go/no-go check works well:
- Reviewers understand each diff in a few minutes.
- Tests cover both the old weird cases and the new path that changed.
- The team tracks reverts every week.
- Blocked areas are checked against actual pull requests.
- One named owner approves every risky change.
The revert count matters more than people think. One rollback is normal. A pattern of rollbacks means the tool is creating work the team notices only later, after review fatigue sets in.
Diff size also tells the truth fast. When AI changes fourteen files to adjust one behavior, reviewers miss things. Small, narrow edits are a better sign than clever rewrites.
Tests deserve skepticism too. Green checks do not mean much if the suite skipped the odd cases your system still carries. A billing rule from 2017, a handmade date fix, or a customer-specific exception can break while the happy path still passes.
If two or more of these checks fail, keep the pilot small. Fix the process first, then expand.
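The rule is simple enough to write down as a few lines, which keeps the decision from drifting between meetings. This is a sketch, not a prescribed process: the check names are hypothetical labels for the checks in this section, and a check with no recorded answer is counted as failed, which is an assumption on the strict side.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for the go/no-go checks described in this section.
CHECKS = (
    "diffs_readable_in_minutes",
    "tests_cover_old_and_new_paths",
    "reverts_tracked_weekly",
    "blocked_areas_untouched",
    "risky_changes_have_named_owner",
)


def keep_pilot_small(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (keep_small, failed_checks). Two or more failures keep the pilot small."""
    # A missing answer counts as a failure rather than a pass.
    failed = [name for name in CHECKS if not results.get(name, False)]
    return (len(failed) >= 2, failed)


this_month = {
    "diffs_readable_in_minutes": True,
    "tests_cover_old_and_new_paths": False,
    "reverts_tracked_weekly": True,
    "blocked_areas_untouched": False,
    "risky_changes_have_named_owner": True,
}
print(keep_pilot_small(this_month))
```

Returning the failed checks alongside the decision gives the team a short list of what to fix before the next review.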
What to do next
After a short pilot, write the guardrails down. If the rules live in chat threads or in one manager's head, people start guessing. That is when small exceptions turn into a new mess.
A short team policy is enough. Name where AI may write code, which files stay off limits, what counts as an acceptable refactor, and when a human must take over. Keep it plain. If a new engineer cannot read it in five minutes, it is too long.
Most teams only need four parts in that policy:
- allowed file types and folders
- blocked domains such as billing, auth, and security checks
- review rules for AI-made diffs
- prompt rules, including examples of requests the team should not use
Reviewers need training too. Many teams focus on prompts and forget diff quality. A reviewer should reject vague requests like "clean this up" or "make it better" because they often produce broad changes with weak intent. They should also reject diffs that mix renames, refactors, logic changes, and formatting in one pass. Those changes waste hours in review and hide bugs.
Do not revisit blocked areas every day. That only creates pressure for exceptions. Put them on a monthly review instead. Look at incidents, review pain, test coverage, and rollback data. If one blocked area now has solid tests and clear ownership, open a narrow path. If it still causes surprises, keep it closed.
Some teams can set this up alone. Some need an outside review, especially when legacy code touches revenue, uptime, or customer data. Oleg Sotnikov at oleg.is works with startups and small to medium businesses as a fractional CTO, and this kind of boundary setting is exactly where experienced technical review pays off. A short review of write zones, blocked domains, and approval rules can prevent months of quiet cleanup later.
The next step is deliberately boring: write the policy, train reviewers to reject weak prompts and messy diffs, and put one date on the calendar for the first monthly boundary review.
Frequently Asked Questions
What should AI edit first in a legacy repo?
Start with low-risk areas your team can review fast, like tests, internal scripts, admin tools, and small utility files with clear behavior. Keep AI away from billing, auth, permissions, migrations, and any code that can charge money, expose data, or delete records.
Why can AI make old code worse so quickly?
AI copies the patterns it sees. If your repo has rushed fixes, vague names, and hidden business rules, the model will repeat them at speed. That gives you neat-looking diffs now and more cleanup later.
Should I allow AI refactors in the first week?
No. In the first trial, ask AI for narrow changes only. Let engineers handle renames, file moves, and shared module rewrites by hand until the team trusts the tests and review process.
Which parts of the system should stay off limits on day one?
Block anything tied to money, access, or irreversible data changes. That usually means billing and pricing rules, authentication, permissions, public API contracts, schema changes, and sensitive customer data flows.
How big should the first AI coding pilot be?
Keep it small enough that one reviewer can read every diff without rushing. One low-risk service, one small team, and a two-week window works well for most teams. Pick control over speed for the first run.
What should reviewers reject right away?
Reject vague requests like "clean this up" and diffs that mix logic changes with formatting, renames, or file moves. Also reject changes that touch unrelated files or hide the actual intent of the ticket.
How do I know the pilot is actually working?
Watch the reject rate, revert count, diff size, and cleanup work after merge. If reviewers understand each change in a few minutes and the team rarely rolls back AI edits, the process probably works.
Can AI help with billing without touching billing logic?
Yes. A safe way to use it there is to generate tests and sample cases around invoices, refunds, proration, rounding, and date handling. Let humans keep control of tax rules, retry logic, and payment state changes.
When should we expand AI write access?
Expand only after a few clean review cycles. Open one folder at a time, keep one named owner for that area, and make sure tests cover the old weird cases before you widen access.
When does it make sense to ask a fractional CTO for help?
Bring in outside help when the team cannot agree on boundaries, or when revenue, uptime, or customer data sit near the trial area. A short review from an experienced fractional CTO can tighten write zones, blocked domains, and approval rules before small mistakes spread.