Jul 28, 2025·7 min read

Spec-first coding with assistants for smaller, cleaner diffs

Spec-first coding with assistants starts from examples and failure cases, cuts noisy diffs, and makes code reviews shorter and less personal.

Why assistant-generated code often causes review fights

Most review fights start before anyone writes code. A vague prompt gives the assistant too much room, so it solves more than you asked for.

If someone writes, "clean up the signup flow and add validation," the assistant has to guess what "clean up" means. It might rename files, move logic, rewrite tests, change error text, and swap patterns that were never part of the request.

The author sees a feature that works. The reviewer sees a diff that touches eight places for a change that should have touched two. That gap creates tension fast.

Assistants are good at filling in blanks. In team code, that often causes trouble. When the prompt leaves out limits, the tool invents them. When the prompt skips team rules, the tool falls back to its own default choices.

Then the review stops being about behavior. People argue about naming, folder structure, helper functions, and whether the assistant overbuilt the fix. Those arguments drag on because nobody is reviewing the same target.

A simple example makes this obvious. Say the task is to reject invalid emails on a signup form. One reviewer expects the form to trim spaces first. Another expects it to keep the current message text. A third cares that the API contract stays unchanged. If none of that appears in the prompt, the assistant picks one version and moves on.

Missing failure cases make the diff worse after it lands. The happy path looks fine, so the change gets approved or almost approved. Then someone asks what happens with a duplicate account, a field that contains only spaces, or a server timeout. Now the team has follow-up commits, review churn, and a messy history for a small feature.

That is why a spec-first approach usually leads to calmer reviews. When behavior is clear before code starts, the assistant makes fewer guesses, the diff stays smaller, and reviewers can judge the change by what users will actually see.

What a spec-first start looks like

A good start does not begin with code. It begins with the user action you want to support. Write that action in one plain sentence, such as: "When a customer changes their delivery address during checkout, the order should use the new address before payment." That gives the assistant a target people can review without reading a line of code.

Then describe the result in simple words. Skip internal details at first. Say what the user sees, what gets saved, and what should happen if the action fails. If a reviewer can read the note in 30 seconds and picture the behavior, it is clear enough.

A short note often works better than a long brief. Cover the normal path, a couple of ways the feature can break, and any rule the assistant must not touch. That last part matters more than people think. Many review fights start because generated code changes logging, validation, naming, or unrelated screens while solving the main task.

A useful spec can be very small. It should name the user action, the expected result in plain language, one normal example, two or three failure cases, and a short "do not change" line.

Picture a refund form. A normal example might say that a support agent enters an amount lower than the original charge, clicks submit, and sees the refund recorded once. A failure case might say the form must reject negative amounts. Another might say a second click must not create two refunds while the first request is still running.
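Those three refund examples are small enough to sketch directly. The code below is only an illustration: `RefundForm`, `begin`, `complete`, and the error strings are hypothetical names, and the form is modeled in memory rather than against a real payment API.

```typescript
// Hypothetical in-memory model of the refund form; names are illustrative.
type RefundResult = { ok: true } | { ok: false; error: string };

class RefundForm {
  private pendingAmount: number | null = null;
  readonly recorded: number[] = [];

  // Called on each click of submit.
  begin(amount: number): RefundResult {
    // Failure case from the spec: negative amounts are rejected.
    if (amount < 0) {
      return { ok: false, error: "amount must not be negative" };
    }
    // Failure case from the spec: a second click while the first
    // request is still running must not start another refund.
    if (this.pendingAmount !== null) {
      return { ok: false, error: "refund already in progress" };
    }
    this.pendingAmount = amount;
    return { ok: true };
  }

  // Called when the first request finishes; records the refund once.
  complete(): void {
    if (this.pendingAmount === null) return;
    this.recorded.push(this.pendingAmount);
    this.pendingAmount = null;
  }
}
```

The sketch checks only what the examples name. Rules the spec does not mention, such as a zero amount or an amount above the original charge, are left out on purpose: if they matter, they belong in the spec first.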

You should also say what stays unchanged. Maybe the current API response shape must stay the same. Maybe the audit log format cannot change because another tool reads it. One line like that can cut a diff in half.

Keep the note short. If it turns into a page of background, the assistant will grab too many details and guess at the rest. A compact spec is easier to scan, easier to challenge, and much easier to compare against the final diff.

Write examples before you ask for code

When you ask for code too early, the assistant fills gaps with guesses. That is how a small change turns into a wide diff. A few concrete examples usually stop that fast.

Use real inputs and real expected outputs. "Valid email" or "bad value" is too vague. Write the exact text a user enters, the exact result you want, and the exact error you accept. In practice, behavior examples for code generation do more work than long instructions.

Start with one normal path from beginning to end. Name the exact place for the change: the file, the function, or the screen. That detail matters. If you do not name the target, the assistant may edit helpers, shared types, and UI copy you never meant to touch.

File: src/auth/signup.ts
Function: createAccount(form)

Normal case:
- Input: name = "Mia Chen", email = "[email protected]", password = "correct horse battery staple"
- Expected result: account is created, welcome screen opens, confirmation email is queued

Odd but valid case:
- Input: name = "Jean-Luc Picard", email = "[email protected]"
- Expected result: account is created, hyphen stays in the name, plus-address is accepted

Pass when:
- Two tests cover both cases
- Existing signup and login tests still pass
- Changes stay inside signup files and related tests

That odd but valid case matters. It catches mistakes assistants often make, like over-strict validation, unwanted trimming, or cleanup logic that changes user data.
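One way to hand those two cases to the assistant is as a small table of inputs and expected results. This is a sketch under assumptions: the body of `createAccount` is a stand-in because the real function is not shown here, the `example.com` addresses are placeholders, and the second password is invented for the example.

```typescript
// Stand-in types and function; the real createAccount in
// src/auth/signup.ts is assumed, not reproduced.
interface SignupForm {
  name: string;
  email: string;
  password: string;
}

interface Account {
  name: string;
  email: string;
}

function createAccount(form: SignupForm): Account {
  // Stores the name and email exactly as entered: no cleanup, no stripping.
  return { name: form.name, email: form.email };
}

// One row per spec example: exact input, exact expected result.
const signupCases: { label: string; form: SignupForm; expected: Account }[] = [
  {
    label: "normal case",
    form: {
      name: "Mia Chen",
      email: "mia@example.com", // placeholder address
      password: "correct horse battery staple",
    },
    expected: { name: "Mia Chen", email: "mia@example.com" },
  },
  {
    label: "odd but valid case",
    form: {
      name: "Jean-Luc Picard",
      email: "jean-luc+news@example.com", // placeholder plus-address
      password: "another long passphrase", // assumed, not in the spec
    },
    expected: { name: "Jean-Luc Picard", email: "jean-luc+news@example.com" },
  },
];

// Returns the labels of the cases that did not match the expected result.
function runSignupCases(): string[] {
  const failures: string[] = [];
  for (const c of signupCases) {
    const actual = createAccount(c.form);
    if (actual.name !== c.expected.name || actual.email !== c.expected.email) {
      failures.push(c.label);
    }
  }
  return failures;
}
```

A table like this doubles as the review checklist: each row is either covered by the diff or it is not.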

Be direct about what passing means. Say whether you want unit tests, one UI test, or both. Say what must stay unchanged too. For example, you might require the old login flow to keep working and the error text to match current copy.

This takes a few extra minutes, but it saves review time later. Reviewers can compare behavior against examples instead of arguing about style, guesses, or code the assistant changed for no clear reason.

Add failure cases up front

Assistants usually guess the happy path. That is where messy diffs start. If you want calmer reviews, tell the model what the code must reject before it writes a single branch.

Failure cases do more than prevent bugs. They also narrow the shape of the code. You get fewer surprise checks, fewer hidden assumptions, and less debate about what should happen in edge cases.

For each bad input, write the exact outcome. Do not stop at "handle errors gracefully." Say which error appears, which status code returns, whether the app retries, and whether the system saves anything at all.

A short list is enough:

- Empty data should fail with a clear message, not a generic server error.
- Missing required fields should stop the flow before any write happens.
- Duplicate records should follow a defined rule, such as "use existing record" or "reject with duplicate error."
- Wrong formats should say what was wrong, not just "invalid input."
- One known past bug should appear as a test case, written exactly as it happened.

That last point matters more than people expect. If a bug already burned your team once, put it in the prompt. Say, for example, that the old version created two invoices when a user clicked "submit" twice, and the new code must never double charge, double send, or create partial data.

"Must never" rules are useful because they stop quiet damage. Write them plainly. The code must never overwrite an existing account during import. It must never send a confirmation email if validation fails. It must never hide a parsing error and keep going.

Small details save review time. If email is missing, return 400 with "email is required." If a duplicate external ID arrives, keep the first record and log the second attempt. When you spell out those edges, the assistant has less room to improvise, and the first draft is much more likely to match the team's rules.
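Here is roughly what those two edges look like once they are written down as code. Treat it as a sketch: `AccountImporter`, its method names, and the 200 and 201 status codes are assumptions for illustration. Only the 400 with "email is required" and the keep-first-and-log rule come from the examples above.

```typescript
// Hypothetical importer; class and field names are illustrative.
interface ImportRecord {
  externalId: string;
  email?: string;
}

interface ImportResponse {
  status: number;
  message: string;
}

class AccountImporter {
  private readonly byExternalId = new Map<string, ImportRecord>();
  readonly log: string[] = [];

  importOne(record: ImportRecord): ImportResponse {
    // Missing email: return 400 and stop before any write happens.
    if (record.email === undefined || record.email.trim() === "") {
      return { status: 400, message: "email is required" };
    }
    // Duplicate external ID: keep the first record, log the second attempt.
    // The existing account is never overwritten.
    if (this.byExternalId.has(record.externalId)) {
      this.log.push(`duplicate externalId ignored: ${record.externalId}`);
      return { status: 200, message: "existing record kept" }; // status assumed
    }
    this.byExternalId.set(record.externalId, record);
    return { status: 201, message: "created" }; // status assumed
  }

  find(externalId: string): ImportRecord | undefined {
    return this.byExternalId.get(externalId);
  }
}
```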

Use a simple workflow

Start with a short note that says what the feature should do for a user. Keep it plain. If someone outside engineering cannot follow it, the note is still too loose.

A simple routine works better than a huge prompt full of background:

1. Write four to six lines of expected behavior in plain English. Focus on what the user does and what the system returns.
2. Add three to five examples. Use real inputs and expected outputs, not abstract rules.
3. Add failure cases before you ask for code. Include bad input, missing data, permission errors, or edge cases that often start review debates.
4. Ask for the smallest code change that satisfies only those cases. Say that you want a minimal diff and no unrelated cleanup.
5. Review the diff against the written behavior. If the code matches the spec, do not reject it just because you would have written it differently.

This changes the tone of review fast. People stop arguing about style and start checking whether the behavior is right.

A small example shows the pattern. Say you need a signup form to reject disposable email domains. Your note can say: accept company emails, reject known disposable domains, show a clear error, and keep the entered name so the user does not have to type it again. Then add examples, such as "[email protected]" passes and "[email protected]" fails. Add one failure case for when the domain-check service times out on an unknown domain. Now the assistant has a fence around the task.
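The fence from that note can be sketched as one small decision function. Everything named here is hypothetical: `DomainCheck`, `applyDomainCheck`, and the error strings are illustrative. The timeout behavior (ask the user to retry) is an assumption, because the note only says a timeout case must exist, not what it should do.

```typescript
// Result of asking a hypothetical domain-check service about an email domain.
type DomainCheck =
  | { kind: "ok"; disposable: boolean }
  | { kind: "timeout" };

interface SignupState {
  name: string; // kept as entered so the user does not retype it
  email: string;
  error: string | null;
}

function applyDomainCheck(
  name: string,
  email: string,
  check: DomainCheck,
): SignupState {
  if (check.kind === "timeout") {
    // Assumption: on timeout the form asks for a retry instead of guessing.
    return { name, email, error: "Could not verify the email domain. Please try again." };
  }
  if (check.disposable) {
    return { name, email, error: "Disposable email addresses are not allowed." };
  }
  return { name, email, error: null };
}
```

Whatever outcome the team picks for the timeout, the point is that it gets written into the spec before the assistant picks one for you.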

After the code lands, run tests and compare the result to the examples. If a test fails because the spec was vague, fix the spec first. Then ask for the next change. Teams that do this well keep changes small, reviews calmer, and follow-up edits rare.

A signup form is a good place to test this method because the expected behavior is easy to name. Small rules matter, and small mistakes annoy users fast.

Say a user types [email protected] with spaces before and after the address. The form should trim those spaces before it validates or saves anything. If you skip that detail in the prompt, the assistant may leave the bug in place or add a larger refactor than you asked for.

Now add another rule. If the email already exists, the form should show one clear error. One message, in one place, is enough. You do not want a toast, an inline warning, and a disabled button all firing at once.

The name field needs the same clarity. If the name is blank, the form blocks submit and tells the user why. That sounds obvious, but vague prompts often lead to mixed behavior where the button works in one state and fails silently in another.

The spec also needs a boundary: change validation only, not the page layout. That single line saves a lot of review time. Without it, the assistant may reorder fields, rename labels, or tidy the component structure even when none of that helps the bug.

A clean prompt for this form is short. It says the email input trims spaces, duplicate emails show one clear error, blank names block submit, and the layout stays untouched. That is enough to guide the code without inviting extra edits.

When the reviewer opens the diff, they can check each example against the change. Did the code trim the email before validation? Did duplicate email handling produce one error path? Did blank names stop submit every time? Did the assistant leave the layout alone?
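A diff that answers those questions can be very small. The sketch below shows one possible shape; `validateSignup`, the result type, and the message text are hypothetical, and treating a whitespace-only name as blank is an assumption the spec would need to confirm.

```typescript
interface SignupInput {
  name: string;
  email: string;
}

// Exactly one error at a time, tied to one field.
type SignupCheck =
  | { ok: true; email: string } // trimmed email, safe to save
  | { ok: false; field: "name" | "email"; error: string };

function validateSignup(
  input: SignupInput,
  existingEmails: ReadonlySet<string>,
): SignupCheck {
  // Blank names block submit. Whitespace-only counts as blank (assumption).
  if (input.name.trim() === "") {
    return { ok: false, field: "name", error: "Name is required." };
  }
  // Trim the email before validating or saving anything.
  const email = input.email.trim();
  // Duplicate emails produce one error, in one place.
  if (existingEmails.has(email)) {
    return {
      ok: false,
      field: "email",
      error: "An account with this email already exists.",
    };
  }
  return { ok: true, email };
}
```

Nothing in this function knows about layout, labels, or field order, which is the boundary the spec asked for.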

That review style changes the tone of the whole discussion. People stop arguing about taste and start checking behavior. The comments get more precise because the target is clear.

Mistakes that make diffs grow

Big diffs usually start with a loose request, missing examples, and a prompt that gives the assistant too much room to guess.

That guesswork shows up quickly in review. A small bug fix turns into renamed helpers, moved files, changed validation rules, and style debates that had nothing to do with the original problem.

A few habits cause this again and again.

Asking for a rewrite instead of a narrow change is the first one. If you say "improve this flow" or "clean up the form logic," the assistant will touch more code than you wanted. Ask for the smallest patch that changes one behavior.

Another mistake is mixing fresh ideas into the current fix. Teams often slip in a refactor, a copy update, and a new edge case while fixing one bug. That makes it hard to tell what broke and why.

Skipping failure cases is another common problem. They are obvious to you, not to the assistant. If you do not say what should happen on bad input, timeouts, duplicates, or empty states, the tool will invent behavior.

Leaving old behavior unstated causes the same trouble. When you do not say what must stay the same, the assistant may simplify parts that users already depend on. Reviewers then argue about whether the new behavior is a bug or cleanup.

And then there is naming. Naming matters, but it should not lead the change. If the logic still fails basic examples, a long discussion about function names just delays the real fix.

A small signup change shows the difference. If you ask an assistant to "improve signup validation," you may get new error text, refactored validators, a changed submit flow, and extra analytics events. If you ask it to "block disposable domains, keep all other rules and messages the same, and return the existing error state format," the diff stays much smaller.

That is where spec-first coding with assistants helps. A short behavior note, a few examples, and a small set of failure cases cut down most review fights before they start.

Small diffs are not about being timid. They make intent clear. When reviewers can see one behavior change and everything else staying put, they decide faster and argue less.

Quick checks before you merge

Before you merge, the spec should still work like a checklist. If one behavior example has no matching test, validation rule, or code change, the work is not done.

Use the diff and the spec side by side.

Every example in the spec should map to something concrete. That can be a test, a condition in the code, or a small UI change. If the spec says "trim spaces in email input," the reviewer should find that exact behavior without guessing.

The diff should touch only the files needed for that behavior. If the assistant changed helpers, renamed variables across the app, or reformatted unrelated files, ask for a tighter pass. Extra motion makes simple reviews slow.
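That file check is easy to do by eye on a small diff, and easy to sketch as code for larger ones. The helper below is a hypothetical example, not an existing tool: `findOutOfScopeFiles` and the sample paths are illustrative, and the list of changed files could come from something like `git diff --name-only`.

```typescript
// Returns the changed files that fall outside the paths the spec allows.
function findOutOfScopeFiles(
  changedFiles: string[],
  allowedPrefixes: string[],
): string[] {
  return changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix)),
  );
}

// Sample input: the spec allows signup files and their tests only.
const changedFiles = [
  "src/auth/signup.ts",
  "src/auth/signup.test.ts",
  "src/utils/format.ts",
];
const outOfScope = findOutOfScopeFiles(changedFiles, ["src/auth/signup"]);
// outOfScope is ["src/utils/format.ts"]
```

A non-empty result is not automatically wrong, but each file in it should have a reason the reviewer can find in the spec.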

Error paths should be easy to verify. Check invalid input, duplicate data, timeouts, and permission problems. Each one needs an expected result, such as a clear message, a status code, or no database write.

The spec should also name what stays the same. If the change is only for signup validation, say that login, password reset, and email templates must stay unchanged. That single line prevents accidental drift.

A reviewer should be able to say yes or no quickly. If they need a long call to figure out intent, the spec is too loose or the diff is too wide.

A small signup change shows the point well. If the spec only adds "reject disposable email domains," the merge should probably contain one validation rule, one or two tests, and maybe a short error message update. It should not rewrite the whole form or move code into a new pattern just because the assistant felt like it.

The last check is simple: does the patch prove the requested behavior, and only that behavior? If the answer is clear, the merge is ready.

What to do next with your team

Pick one small ticket this week and run it as a test. Do not start with a broad refactor or a risky feature. A plain form change, a simple API rule, or one validation fix is enough to see whether this approach gives you smaller diffs and calmer reviews.

Keep the experiment boring on purpose. One developer writes the expected behavior first, including a few examples and a few ways the feature can fail. Then the assistant gets that spec instead of a loose prompt like "build the signup logic."

A shared template helps more than long prompt advice. Save one short document that everyone can copy. Include the user action or input, two to four expected behavior examples, two to four failure cases, anything that is out of scope, and the exact request you give the assistant.
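If the team prefers something it can check automatically, the same template can live as a small typed object. This is one possible sketch, not a standard format: `SpecTemplate`, `findTemplateGaps`, and the field names are hypothetical, and the sample content reuses the disposable-domain example from earlier.

```typescript
// Hypothetical template shape; field names are illustrative.
interface SpecTemplate {
  userAction: string;
  examples: string[]; // two to four expected behavior examples
  failureCases: string[]; // two to four failure cases
  outOfScope: string[];
  request: string; // the exact request given to the assistant
}

// Returns a list of gaps; an empty list means the template is filled in.
function findTemplateGaps(t: SpecTemplate): string[] {
  const gaps: string[] = [];
  const inRange = (n: number): boolean => n >= 2 && n <= 4;
  if (t.userAction.trim() === "") gaps.push("user action is empty");
  if (!inRange(t.examples.length)) gaps.push("need two to four examples");
  if (!inRange(t.failureCases.length)) gaps.push("need two to four failure cases");
  if (t.outOfScope.length === 0) gaps.push("out-of-scope line is empty");
  if (t.request.trim() === "") gaps.push("request is empty");
  return gaps;
}

const disposableDomainSpec: SpecTemplate = {
  userAction: "A visitor submits the signup form.",
  examples: [
    "A company email address is accepted.",
    "A known disposable domain is rejected with a clear error.",
    "After an error, the entered name is still in the form.",
  ],
  failureCases: [
    "The email field contains only spaces.",
    "The domain check times out.",
  ],
  outOfScope: ["login", "password reset", "email templates"],
  request:
    "Block disposable domains, keep all other rules and messages the same, " +
    "and return the existing error state format.",
};
```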

Keep that template where the team already works. If it is hard to find, people will stop using it after a week.

Measure the change with simple numbers. Compare review time before and after. Also count how many review comments ask for behavior changes instead of small code fixes. If that number drops, the spec is doing its job.

A short retro after the ticket is enough. Ask what felt clearer, where the assistant still guessed, and which failure case you missed. Then update the template once. Do not keep rewriting the process every day.

If the test works, expand it slowly. Try it on five similar tickets before you make it a team rule. That gives you enough real examples to build habits without turning the process into homework.

Sometimes teams need help because the problem is not the prompt. It is the engineering process around the prompt. If reviews keep turning into arguments about scope, architecture, or missing edge cases, an experienced fractional CTO can tighten the workflow and keep it practical.

That is also the kind of process work Oleg Sotnikov talks about on oleg.is. His advisory work with startups and small teams focuses on making AI-assisted development usable in day-to-day engineering, with clearer specs, smaller diffs, and less back-and-forth in review.