Jan 18, 2026·8 min read

Evidence trail for AI code changes your team can trust

Build an evidence trail for AI code changes with repo snapshots, saved prompts, tool logs, and test results so teams can explain risky edits later.

Why assistant edits cause later confusion

A code diff shows what changed. It rarely shows why the assistant chose that path.

Two weeks later, a bug shows up in billing or auth. The team can see added lines and deleted lines, but not the prompt, the assumption, or the tradeoff behind them. Review slows down right away. Someone asks, "Did the assistant update this file because it found a dependency, or because the prompt named it?" Without that context, people guess. A five-minute review turns into a long thread and extra checking.

The problem gets worse when nobody knows what the assistant read before editing. If it looked at one migration file, three service files, and an old test helper, that matters. Reviewers need to know whether the change came from a broad read of the codebase or from a narrow slice that missed a dependency.

Test results can mislead teams too. A green run today only proves that the tests selected today passed. It does not tell you whether someone skipped an integration test yesterday, changed a fixture, or never ran the billing suite at all. The screen says "pass," but the missing history stays hidden.

Small teams feel this fast. One engineer merges an assistant-made refactor on Saturday. Another investigates a production issue on Monday. Nobody can retrace the exact path. Repo state, prompt, files read, commands run, and test output sit in chat windows and terminal scrollback.

When that trail disappears, every risky edit gets harder to defend, harder to review, and harder to roll back with confidence.

What an evidence trail should include

A good trail should answer two questions months later: what changed, and why the team accepted it.

If an assistant edits billing logic, auth rules, or deployment files, nobody should have to guess what the model saw or what it touched. Start with the repo state before the edit. Save the starting commit hash, the branch name, and the working tree status. That small snapshot matters more than people expect. If the repo already had local changes, a strange result later may have nothing to do with the assistant.

Next comes the instruction itself. Keep the exact prompt, task text, or ticket summary that guided the change. Paraphrases are weak evidence. A one-line rewrite like "fix billing bug" drops the limits, risks, and assumptions that shaped the edit.

You also need a record of actions, not just the final diff. In practice, that means:

  • the shell commands the assistant ran
  • tool calls such as search or code edits
  • the files it created, renamed, or deleted
  • the test output, warnings, and skipped checks

One human note matters just as much. It can be short: "We accepted this change because it removes duplicate tax logic and all billing tests still passed." That gives future reviewers context a log file cannot.

For risky work, keep the raw test output, not just "tests passed." Failing test names, warnings, and skipped suites show what the team did and did not verify. If a production issue appears later, that detail can cut hours from the search.

With that record in place, the edit stops looking like a mystery patch. It becomes a traceable decision with a starting state, a clear request, visible actions, proof of checks, and a human reason for merge.

Capture the repo state first

Start by taking a snapshot of the repository before the assistant touches anything. That one habit prevents a lot of arguments later. If a risky edit causes a bug, the team needs to know exactly what code the assistant saw when it began.

Record the current commit hash, the branch name, and the task ID from your tracker. A short note like "branch: billing-refactor, task: FIN-184, base commit: 7f3a..." is enough. It gives every later prompt, file edit, and test run a clear starting point.

Then check for local changes. Uncommitted files matter because they change the assistant's context, even if nobody plans to keep them. A half-finished migration, a tweaked config file, or a local feature flag can make an edit look sensible on one machine and wrong everywhere else.

Config changes deserve their own note. Teams often forget about local .env values, debug switches, or machine-only settings. If the assistant updated billing logic while a tax flag was enabled on one laptop only, reviewers need that fact in the trail.

Unrelated work is where the record usually gets messy. If the branch already contains a CSS cleanup, a logging tweak, and one backend fix, the assistant's changes blend into the noise. Stash the extra edits, move them to another branch, or commit them separately before you ask the assistant to do anything.

A small snapshot usually covers enough:

  • base commit hash
  • branch name and task ID
  • modified and untracked files
  • local config differences
  • unrelated work removed from the branch

Without that starting point, every later record is weaker because nobody can prove where the edit really began.
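
The snapshot described above can be captured with a few lines of Python. This is a minimal sketch, not a finished tool. It assumes `git` is on the PATH and that the script runs inside the repository. The function names and record fields are made up for illustration. It does not detect local config differences or unrelated work, so those two items still need a human note.

```python
import json
import subprocess
from datetime import datetime, timezone


def git(*args: str, cwd: str = ".") -> str:
    """Run a git command and return stdout without the trailing newline."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip("\n")


def parse_porcelain(status_text: str) -> dict:
    """Split `git status --porcelain` output into modified and untracked paths.

    Each line starts with a two-character status code and a space.
    Renames show up as "old -> new" and are kept as one string here.
    """
    modified, untracked = [], []
    for line in status_text.splitlines():
        if not line:
            continue
        code, path = line[:2], line[3:]
        if code == "??":
            untracked.append(path)
        else:
            modified.append(path)
    return {"modified": modified, "untracked": untracked}


def snapshot(task_id: str, cwd: str = ".") -> dict:
    """Collect the starting repo state before the assistant edits anything."""
    record = {
        "task_id": task_id,  # supplied by a human, e.g. the tracker ID
        "branch": git("rev-parse", "--abbrev-ref", "HEAD", cwd=cwd),
        "base_commit": git("rev-parse", "HEAD", cwd=cwd),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    record.update(parse_porcelain(git("status", "--porcelain", cwd=cwd)))
    return record


if __name__ == "__main__":
    # "FIN-184" is the example task ID used earlier in this article.
    print(json.dumps(snapshot("FIN-184"), indent=2))
```

Run it before the first prompt, and save the JSON with the rest of the trail. If the `modified` or `untracked` lists are not empty, that is the cue to stash or commit the extra work first.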

Save the prompt and the intent

After a risky edit, teams usually ask the same question: "Why did the assistant do that?" If nobody saved the original prompt, people fill the gaps from memory, and memory gets messy fast.

Keep the exact prompt text. Do not replace it with a cleaned-up summary later. Small wording changes matter. "Refactor billing for clarity" can lead to a very different result from "reduce duplicate billing logic without changing invoice totals or retry behavior."

Next to the prompt, save one plain sentence that states the goal. Keep it short enough that a reviewer can read it in five seconds. For example: "Remove duplicate billing checks in the invoice flow without changing charges, emails, or refund rules."

That goal sentence works best when it sits next to hard limits. Write down what the assistant must not touch, even if it feels obvious at the time. That can include files, folders, tables, public APIs, or user-facing text.

A short record usually covers enough:

  • the full prompt text
  • one sentence stating the goal
  • limits such as "do not edit migrations" or "leave refund logic alone"
  • every follow-up prompt that changed scope or direction

Those follow-up prompts matter more than most teams expect. The first prompt might ask for a small cleanup, then a later message says, "also merge the retry paths" or "update tests to match the new flow." That second instruction often explains the risky part of the change.

If the assistant needed three rounds to get there, save all three. Later, when a reviewer sees a surprising edit in a billing file, they can tell whether the assistant drifted on its own or someone asked for a wider change. That difference matters when you decide to merge, roll back, or tighten the process.
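
One way to keep the prompt, the goal, the limits, and every follow-up together is a small structured record. The sketch below is only an illustration. The `PromptRecord` class and its field names are assumptions, not a standard format, and a plain text file with the same four parts would serve just as well.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PromptRecord:
    """Exact instruction text plus the intent and limits around it."""

    prompt: str  # the full prompt, copied verbatim
    goal: str  # one plain sentence a reviewer can read in seconds
    limits: list[str] = field(default_factory=list)  # what must not be touched
    follow_ups: list[str] = field(default_factory=list)  # later prompts, in order

    def add_follow_up(self, text: str) -> None:
        """Append a follow-up prompt verbatim, preserving the order it was sent."""
        self.follow_ups.append(text)

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the change."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    record = PromptRecord(
        prompt="Reduce duplicate billing logic without changing invoice totals or retry behavior.",
        goal="Remove duplicate billing checks in the invoice flow without changing charges, emails, or refund rules.",
        limits=["do not edit migrations", "leave refund logic alone"],
    )
    record.add_follow_up("also merge the retry paths")
    print(record.to_json())
```

Because follow-ups are appended rather than merged into the first prompt, a reviewer can see exactly which message widened the scope.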

Log tool calls and file actions

A code diff shows the result. It does not show how the assistant got there, and that missing part matters when a change touches billing, auth, permissions, or data migration.

Two edits can look similar in Git and still carry very different risk. One assistant may read the right files, run a narrow search, update one function, and stop. Another may scan half the repo, rename files, create helpers, rerun tests, then patch errors by trial and error. Reviewers should know which path happened.

Keep a simple log next to the change record. It should show the exact commands the assistant ran, which files it opened before writing anything, any file it created, deleted, or renamed, and the order of those actions. Timestamps help because sequence matters.

If the assistant renamed billing_rules.ts, then edited tests, then created a migration, that sequence tells a reviewer far more than a final diff alone. If it wrote to a file before reading the related validation code, that is a warning sign worth noticing.

Manual follow-up edits need a label too. If a developer fixes one line after the assistant finishes, mark it clearly with a timestamp and name. Without that note, the record breaks and later blame lands on the wrong step.

This does not need fancy tooling. A plain text log or structured JSON is enough if it stays consistent. The point is simple: someone should be able to reconstruct the session without guessing.

A practical rule works well. Do not allow a write action without a matching read record, and do not merge without a time-ordered action log. It adds a few minutes now, but it can save hours when a risky edit needs an explanation three weeks later.
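
The "no write without a matching read" rule is easy to check by machine once the log is structured. The sketch below assumes a hypothetical log format: each entry has a timestamp, a kind, a target path, and an actor. It also assumes all timestamps are ISO 8601 strings in the same timezone and format, so that sorting them as text gives the real order. A file the assistant created in the same session counts as known, since there was nothing to read beforehand.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One entry in the time-ordered action log (hypothetical format)."""

    timestamp: str  # ISO 8601, same timezone and format for every entry
    kind: str  # "read", "write", "create", "rename", "delete", "command"
    target: str  # file path or command text
    actor: str = "assistant"  # use a developer's name for manual follow-up edits


def writes_without_read(actions: list[Action]) -> list[Action]:
    """Return write actions on files that were never read or created earlier."""
    known: set[str] = set()
    flagged: list[Action] = []
    for action in sorted(actions, key=lambda a: a.timestamp):
        if action.kind in ("read", "create"):
            known.add(action.target)
        elif action.kind == "write" and action.target not in known:
            flagged.append(action)
    return flagged


if __name__ == "__main__":
    # File names other than billing_rules.ts are invented for this example.
    log = [
        Action("2026-01-18T10:00:00Z", "read", "billing_rules.ts"),
        Action("2026-01-18T10:01:00Z", "write", "billing_rules.ts"),
        Action("2026-01-18T10:02:00Z", "write", "validation.ts"),
    ]
    for action in writes_without_read(log):
        print(f"{action.timestamp} {action.actor} wrote {action.target} without reading it")
```

A flagged entry is not proof of a bad edit. It is a prompt for the reviewer to look harder at that file. The same loop could be extended to renames and deletes if the team wants a stricter rule.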

Keep test output with the edit

A test result means little if nobody knows how you got it. Save the exact command, the branch or commit, and the environment that ran it. "npm test" is too vague when the real command was "CI=true pnpm test --filter payments". Note the runtime too: Node version, database image, feature flags, seed data, and whether the run happened locally or in CI.

Green runs help, but red runs often explain more. Keep the first failing output, not only the clean run after fixes. If a reviewer sees that a change first broke two checkout tests and then passed after a schema fix, the edit feels easier to trust. Without that record, the diff can look random.

Attach a short test record with the change. It should include:

  • the exact test, lint, type check, or build commands that ran
  • the first failure output, or at least the error lines and test names
  • any reruns for flaky tests
  • skipped suites with a plain reason
  • the final passing run tied to the same commit

Be honest about flakes. A flaky test does not disappear because the third run passed. If the assistant changed code that touches async jobs, network calls, or timing, reruns matter. They show whether the code is stable or whether luck got it through review.

Skipped suites need the same treatment. If you skip browser tests because the edit only changes a backend parser, say that. If you skip them because the test environment is broken, say that too. Reviewers can judge the risk when they know what was not checked.

Lint, type check, or build output can matter just as much. A small UI text change may not need a full build log. A refactor that touches shared types, generated code, or deployment files usually does. Short evidence beats a vague "tests passed" every time.
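
Keeping the raw output is easier when a small wrapper runs each check and records what happened. The following is a rough sketch under stated assumptions: the `run_check` and `summarize` names, the record fields, and the "local" or "ci" label are all invented for this example. It does not capture the runtime details mentioned above, such as database image or feature flags, so add those by hand or extend the record.

```python
import subprocess
import sys
from datetime import datetime, timezone


def run_check(command: list[str], commit: str, where: str = "local") -> dict:
    """Run one test, lint, type check, or build command and keep its raw output."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,  # the exact command, not a paraphrase
        "commit": commit,  # the commit the run is tied to
        "where": where,  # "local" or "ci"
        "started_at": started,
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
    }


def summarize(runs: list[dict]) -> dict:
    """Pick out the first failing run and the final result from ordered runs."""
    first_failure = next((r for r in runs if r["exit_code"] != 0), None)
    final = runs[-1] if runs else None
    return {
        "first_failure": first_failure,
        "final_passed": final is not None and final["exit_code"] == 0,
        "final_commit": final["commit"] if final else None,
        "reruns": max(len(runs) - 1, 0),
    }


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in the real test command.
    runs = [run_check([sys.executable, "-c", "print('ok')"], commit="<commit hash>")]
    print(summarize(runs)["final_passed"])
```

Appending every run to the list, including the red ones, keeps the first failure and the rerun count visible instead of leaving only the last green result.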

Set up the process step by step

A usable trail should feel boring and easy. If people need five apps and ten manual steps, they will skip it the first busy week.

Start with one small template for every assistant-driven edit. Keep it short: the goal of the change, the prompt, the risk level, the files touched, and the tests run. That is enough for most teams, and it gives reviewers context before they start guessing why a strange edit exists.

A simple routine works well:

  1. Write the prompt and one or two lines of intent in a shared note.
  2. Run a script that saves the current commit hash and git status.
  3. Store the assistant output, tool-call log, and test output in the same folder.
  4. Add that folder name or ID to the pull request.
  5. Ask reviewers to read the trail first, then inspect the diff.

The capture script matters because memory fails fast. A tiny shell script can record the branch name, commit hash, timestamp, and uncommitted files in a few seconds. That snapshot often explains why the assistant changed more than expected.

Keep everything in one agreed place. A folder inside the repo, a build artifact, or a shared review system can all work. What matters is consistency. When logs live in three places, nobody checks them.
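
If the agreed place is a folder, a short helper can write every part of the trail into it in one call. This sketch assumes a hypothetical layout with four files per trail. The file names, the `write_trail` function, and the trail ID format are illustrative choices, not a convention the article prescribes.

```python
import json
import tempfile
from pathlib import Path


def write_trail(
    root: str,
    trail_id: str,
    snapshot: dict,
    prompt: str,
    actions: list[dict],
    test_output: str,
) -> Path:
    """Store snapshot, prompt, action log, and test output in one folder."""
    folder = Path(root) / trail_id
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "snapshot.json").write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    (folder / "prompt.txt").write_text(prompt, encoding="utf-8")
    # One JSON object per line keeps the action log easy to append and diff.
    with (folder / "actions.jsonl").open("w", encoding="utf-8") as handle:
        for action in actions:
            handle.write(json.dumps(action) + "\n")
    (folder / "test_output.txt").write_text(test_output, encoding="utf-8")
    return folder


if __name__ == "__main__":
    # A temporary directory stands in for the real shared location.
    with tempfile.TemporaryDirectory() as tmp:
        folder = write_trail(
            tmp,
            "FIN-184-01",  # hypothetical trail ID to paste into the pull request
            {"branch": "billing-refactor"},
            "exact prompt text goes here",
            [{"kind": "read", "target": "billing_rules.ts"}],
            "raw test output goes here",
        )
        print(sorted(p.name for p in folder.iterdir()))
```

The returned folder name is the ID that goes into the pull request, so reviewers have a single place to open.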

Reviewers should treat the trail as part of the change, not as extra paperwork. For a risky billing or auth edit, prompt and test history can show whether the assistant followed a careful request or wandered into the wrong files.

If the process starts to feel heavy, cut steps fast. A smaller process that everyone uses beats a perfect record that people ignore.

A simple example: risky billing refactor

A team asks an assistant to rename billing fields from price to unit_amount across the checkout service, invoice service, and internal admin app. The saved prompt makes the intent clear: rename fields, keep behavior the same, and do not change billing rules.

The edit lands fast, and the pull request looks tidy. A week later, finance spots bad numbers in exported invoices. Some records use the new field name, but one export still reads the old one.

A good record makes the problem smaller. The repo snapshot from the moment the assistant started shows an older database migration still in progress. One table still had both columns, and the team had not finished the backfill. That matters because the rename touched several services at once.

Tool logs add the missing detail. They show the assistant opened the checkout mapper, the invoice API, and two billing jobs. It never opened the invoice formatter file that builds the CSV export. That tells reviewers something useful right away: the assistant changed upstream data names, but it never checked the last step where finance reads them.

The test output helps even more. Most billing tests passed, but the export test was skipped because a fixture was missing. That skip did not look serious during review. Later, it becomes the clearest warning sign in the whole change set.

Because the team kept the prompt, repo state, tool history, and test results together, they do not have to revert everything. The tool log shows the assistant never touched the formatter, so they restore the old field mapping that the export still reads and keep the safer rename work in the rest of the system. That saves time and avoids a wider rollback on code that was actually fine.

Mistakes that erase the trail

Teams rarely lose context because of one huge failure. They lose it through small habits that seem harmless while the work is fresh. A week later, nobody can explain why the assistant touched a sensitive file, what it saw before the edit, or whether the team accepted known risk.

The most common mistake is saving only the final diff. A diff shows what changed, but not what the repo looked like before the assistant started. If a branch already had local edits, generated files, or half-finished cleanup work, that missing context changes how reviewers read the change.

Another easy way to break the record is rewriting prompts after the fact. People do this when they want a cleaner note in the ticket. The problem is simple: the polished version is not the instruction the assistant actually used. If the edit later causes trouble, the team ends up investigating fiction.

Failed test output disappears even faster. Someone runs tests, sees two failures, fixes one, and keeps only the final green result. That hides useful facts. Early failures often show where the risky part of the change really was.

A single assistant session can also muddy the trail when it mixes two jobs. Maybe the original task was a billing fix, then someone adds a quick auth cleanup and a config tweak. Now the prompt, tool history, and test output describe several decisions at once. Review gets slower because nobody knows which evidence belongs to which edit.

Storage causes its own mess. If one person saves prompts in chat, another drops logs into a private note, and a third uploads screenshots to a ticket, the record is already fractured.

A clean process stays boring on purpose:

  • keep the repo snapshot with the change
  • save the exact prompt, not a rewritten summary
  • keep failed and passing test output
  • split unrelated tasks into separate assistant sessions
  • store logs in one agreed place every time

Boring records settle arguments fast. That matters most when the edit touched billing, auth, or production config.

Quick checks before merge

A reviewer should not need chat history, memory, or guesswork to understand an assistant edit. A good trail lets someone open the change and answer a few direct questions in about a minute.

  • Is the starting repo state attached, with the branch name, commit hash, and any local diff that existed before the assistant touched the code?
  • Can the reviewer read the exact prompt, plus a short human note explaining why the change was requested?
  • Does each edited file map to a tool action, terminal command, or a short note for any manual fix?
  • Do the records show which tests ran, which failed, and which never ran?
  • Can the author explain the risky part in one sentence?

That last check is easy to skip, but it often tells you the most. A sentence like "This billing refactor may change retry timing and create duplicate charges" gives the reviewer a clear place to focus. A vague note like "updated payment flow" does not.

If any answer is no, pause the merge and fill the gap. Review slows down fast when people have to hunt through messages, ask around for the prompt, or guess whether a file changed because of a tool run or a manual edit.
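
The five questions above can also run as an automated gate before merge. The sketch below assumes the trail has been loaded into a plain dictionary. The key names, the `merge_gaps` function, and the set of action kinds are assumptions made for this example. It checks only that each part exists, not that it is any good, so it supports a reviewer's judgment rather than replacing it.

```python
# Keys the trail must contain, mapped to a reviewer-friendly description.
REQUIRED = {
    "snapshot": "starting repo state (branch, commit, local diff)",
    "prompt": "exact prompt text",
    "intent_note": "short human note on why the change was requested",
    "actions": "time-ordered action log",
    "tests": "record of tests that ran, failed, or never ran",
    "risk_sentence": "one-sentence risk statement",
}

# Action kinds that explain why a file changed ("manual" marks a developer fix).
CHANGE_KINDS = {"write", "create", "rename", "delete", "manual"}


def merge_gaps(trail: dict, edited_files: list[str]) -> list[str]:
    """Return a list of missing pieces; an empty list means no gaps were found."""
    gaps = [label for key, label in REQUIRED.items() if not trail.get(key)]
    explained = {
        action["target"]
        for action in (trail.get("actions") or [])
        if action.get("kind") in CHANGE_KINDS
    }
    for path in edited_files:
        if path not in explained:
            gaps.append(f"no tool action or manual note for {path}")
    return gaps


if __name__ == "__main__":
    # File names here are invented for the example.
    trail = {
        "snapshot": {"branch": "billing-refactor"},
        "prompt": "exact prompt text",
        "intent_note": "Removes duplicate tax logic.",
        "actions": [{"kind": "write", "target": "billing.py"}],
        "tests": {"ran": ["billing"], "failed": [], "skipped": ["export"]},
        "risk_sentence": "",
    }
    for gap in merge_gaps(trail, ["billing.py", "export.py"]):
        print("missing:", gap)
```

The list of edited files can come from the pull request diff. Any file in the diff with no matching action or manual note is exactly the kind of gap that sends reviewers hunting through messages.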

Lean teams feel this even more. When a small group moves quickly with assistants, clean records replace hallway knowledge. The trail does not need to be fancy. It just needs to make the change explainable later, when a bug appears, a customer reports odd behavior, or someone new takes over the code.

Next steps for a small team

A small team can start with one risky change, not every tiny fix. Pick the next edit that touches billing, auth, data deletes, permissions, or a shared service. That is where confusion shows up later, and where a clear evidence trail pays off.

Keep the first version almost boring. If it takes more than five minutes, people will skip it when the day gets busy. A short template is enough: the repo snapshot before the assistant starts, the prompt and one sentence about the intended outcome, the tool calls or file actions that changed code, and the test output, even if one check failed.

That small record already answers most of the painful review questions: what changed, why it changed, what the assistant touched, and what the team checked before merge.

Run it once, then bring one saved trail to your next retro. Read it like someone new to the project would read it. If the intent is vague, tighten the prompt. If the test output is noisy, trim it to the parts that matter. If people forgot the repo state, make that field impossible to miss.

A realistic goal for the first month is simple: capture one trail cleanly from start to finish. After that, decide where it should live and who owns it.

If your team wants help shaping the process, Oleg Sotnikov at oleg.is advises startups and small teams on review flows, tooling, and lean CTO practices. That can help if you want AI-assisted development to move faster without losing control of risky edits.

Frequently Asked Questions

What should we save before the assistant edits code?

Save the commit hash, branch name, task ID, git status, untracked files, and any local config differences. That snapshot shows the exact starting point and stops later arguments about what the assistant actually saw.

Why should we keep the exact prompt instead of a short summary?

The exact prompt shows the limits and intent that shaped the edit. A cleaned-up summary often drops details like "do not touch migrations" or "keep refund behavior the same," and those details matter when you review a risky change.

Do we need to log which files the assistant read?

Yes. Read history tells reviewers whether the assistant checked the right code before it changed anything. If it edits billing logic but never reads the export formatter or validation code, you can spot that risk fast.

What belongs in the action log?

Record commands, searches, file reads, writes, renames, deletes, timestamps, and any manual fix a developer makes after the assistant stops. That gives you a clear sequence instead of a mystery diff.

How much test evidence should we keep?

Keep the exact test commands, the first failures, skipped suites with a reason, reruns, and the final passing run for the same commit. A bare "tests passed" hides too much when nobody knows what ran.

Should a small team do this for every assistant change?

Start with risky edits first, like billing, auth, permissions, deletes, migrations, or shared services. Once the habit feels easy, you can expand it. A small process people actually use beats a bigger one nobody keeps up with.

Where should we store the evidence trail?

Put the prompt, repo snapshot, action log, and test output in one agreed place, such as a repo folder, build artifact, or review system. Keep them together and attach the trail ID to the pull request so reviewers do not hunt across chat, tickets, and terminal scrollback.

What if a developer fixes something after the assistant finishes?

Mark the manual change with the person's name and a timestamp. That keeps credit and blame clear, and it stops later confusion about whether the assistant or the engineer changed a risky line.

Can an evidence trail help with rollback?

Yes. A clean trail shows which files the assistant touched, what it never checked, and which tests the team skipped. That makes it easier to roll back the bad part without throwing away safe work.

What is the fastest review check before merge?

Check five things fast: the starting repo state, the exact prompt, the action log, the test record, and one plain sentence about the biggest risk. If any part is missing, stop and fill the gap before merge.