AI coding benchmarks for real repo work that teams trust
AI coding benchmarks work better when they use bug fixes, refactors, and migrations from your own repo, scored with a simple, fair process.

Why public scores miss your real work
Public benchmark scores look neat because the tasks are neat. Most use short prompts, isolated files, and problems with one obvious answer. Real teams almost never work like that.
A model can rank well on toy tasks and still struggle in your repo. In day-to-day engineering, the hard part is usually not writing fresh code. It is reading old code, keeping behavior the same, and making a small fix without breaking three nearby systems.
Every codebase has baggage. It has naming habits from past teams, helpers nobody would write today, comments that only partly match reality, and rules that live in tests or in one senior engineer's head. Public benchmarks rarely include that mess, even though that mess often decides whether a model helps on Monday morning.
Small edits matter more than flashy demos. Many engineers spend their week changing a validation rule, patching an edge case, or updating one API field across several files. A model that likes broad rewrites can fail badly here. It may replace patterns your team relies on, touch too much code, or solve the wrong problem.
That is why public AI coding benchmarks can send you toward the wrong tool. If you want a result you can trust, test models on the kind of work your team already does.
Choose tasks from your own repo
The best benchmark tasks come from real work your team already finished. Closed tickets are ideal. They give you a real problem, a known fix, and a fair way to compare model output with what actually shipped.
Pull tasks from the last few months, not from a polished demo branch. Normal engineering work is uneven, and that is the point. One ticket fixes a broken validation rule. Another renames a service across five files. A third upgrades a package and breaks tests in places nobody expected.
A useful set usually includes:
- A bug fix with a clear failing case.
- A refactor that changes structure but not behavior.
- A dependency or framework upgrade that forces small edits across the repo.
- At least one messy ticket with unclear notes, old code, or thin test coverage.
That mix matters. If you only test shiny feature work, you miss the chores that eat most engineering time. Many teams spend more hours on patching, cleanup, and migration than on new features.
Difficulty matters too. Include a few easy tasks, a few medium ones, and at least one task that feels annoying to do by hand. Easy tasks show whether a model follows directions. Medium tasks show whether it can trace logic across several files. Messy tasks show whether it can recover from incomplete clues without making reckless changes.
Before you hand anything to a model, strip out secrets and customer data. Replace API keys, emails, account numbers, internal URLs, and private names with safe placeholders. Keep the shape of the problem intact, though. If the bug depended on a specific payload or log message, preserve that structure and wording.
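A minimal sketch of that scrubbing step, assuming Python and a few regular expressions. The patterns, the placeholder names, and the `.internal` hostname convention are illustrative assumptions, not a complete secret scanner. Adjust them to whatever your repo actually contains, and still read the result by hand before sharing it.

```python
import re

# Hypothetical patterns: extend them to match the secrets your repo really holds.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    (re.compile(r"\b(?:sk|pk|api)_[A-Za-z0-9_]{16,}\b"), "<API_KEY>"),
    # Assumes internal hosts end in ".internal"; change this to your own convention.
    (re.compile(r"https?://[A-Za-z0-9.-]*\.internal\b\S*"), "<INTERNAL_URL>"),
]


def scrub(text: str) -> str:
    """Replace sensitive values with placeholders and leave everything else as-is."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Because only the matched values change, the payload shape and the wording of the log message stay intact, which is what the model needs to reason about the bug.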
Do not remove the failing test, stack trace, or original error note. Those details often separate a smart fix from a guess. "Checkout returns 500 when coupon is empty" is fine. The exact failing test and error line are much better.
A product team can start with ten closed tickets and build a fair sample in one afternoon. That already gives you something more useful than a puzzle repo built for show.
Write briefs models can follow
A vague prompt ruins a fair test. If one model gets a clean task and another gets a messy note, you are grading your prompt, not the model.
Write each brief the way you would write a small ticket for an engineer who has never touched that part of the repo. Start with one sentence that says the job in plain English. Then point to the files or folders that matter, explain what looks wrong, and add any repo rules the model has to follow.
Keep the brief concrete. Say the goal in one line. Point to the likely files. Mention rules already used in the repo, such as style, tests, logging, or no new packages. Define done in a way another engineer could check.
The "done" line matters more than many teams think. "Refactor this service" is too loose. "Remove duplicate parsing logic, keep the same output, and update tests for the two affected endpoints" gives the model a finish line.
Context should stay short, but it still needs to cover hidden assumptions. If the task depends on a business rule buried in an old chat or ticket, put that rule in the brief. Do not expect the model to guess why one edge case matters more than another.
Scope matters too. Do not dump twenty files into the prompt when two files and one failing test explain the job. Extra noise makes weak answers look thoughtful and good answers look slow.
Keep the brief identical for every model. Do not give one model extra hints after it stalls. Do not paste a stack trace into one run and leave it out of another. If you want a second round with more help, treat that as a separate round and compare it on its own.
A brief like this is enough:
"Fix the retry bug in billing/webhook.ts. Duplicate invoices appear when the payment provider sends the same event twice. Follow the existing TypeScript style, use the current logger, and do not add dependencies. Done means one invoice per event, tests updated, and no change to the webhook response."
That is the standard you want. The model gets enough context to act, and your team gets a result you can compare across runs.
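One way to keep briefs consistent is to store them as structured data and render the prompt from it. The sketch below assumes Python. The `TaskBrief` class and its field names are hypothetical, and the values come from the webhook brief above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBrief:
    """One benchmark task, written once and reused for every model."""
    goal: str                # the job in one plain sentence
    files: tuple[str, ...]   # likely files or folders
    rules: tuple[str, ...]   # repo rules the model has to follow
    done: str                # a finish line another engineer could check

    def render(self) -> str:
        """Build the prompt text. Every model receives this same string."""
        lines = [
            f"Goal: {self.goal}",
            "Files: " + ", ".join(self.files),
            "Rules:",
            *[f"- {rule}" for rule in self.rules],
            f"Done means: {self.done}",
        ]
        return "\n".join(lines)


# Field values taken from the billing webhook brief above.
WEBHOOK_BRIEF = TaskBrief(
    goal="Fix the retry bug that creates duplicate invoices when the "
         "payment provider sends the same event twice.",
    files=("billing/webhook.ts",),
    rules=(
        "Follow the existing TypeScript style",
        "Use the current logger",
        "Do not add dependencies",
    ),
    done="one invoice per event, tests updated, and no change to the webhook response",
)
```

Rendering from a frozen record makes it harder to slip an extra hint into one run, because any change to the brief is a change to the stored task itself.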
Run the benchmark the same way every time
A fair test depends on boring consistency. If one model starts from a different commit, sees extra files, or gets more tools, you are measuring your setup, not the model.
Use the same repo state for every run. Create a fixed starting point, reset to that commit before each attempt, and verify that dependencies, config, and test data match. Tiny drift can change the result more than the model itself.
A simple process works well:
- Reset the repo to the exact same commit.
- Give the model one task with one written brief.
- Allow the same tools each time, such as terminal access, search, or test commands.
- Save the full prompt, the model output, the files changed, and the time spent.
- Ask an engineer to apply the patch exactly as produced, with no cleanup first.
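The reset step can be scripted so nobody does it from memory. This is a rough sketch, assuming Python, a git repo, and a throwaway clone used only for benchmarking. The function names are hypothetical. Note that `git clean -fdx` also deletes ignored files such as installed dependencies, so reinstall them the same way before each run.

```python
import subprocess


def reset_commands(commit: str) -> list[list[str]]:
    """Commands that return the working tree to a fixed starting point."""
    return [
        ["git", "reset", "--hard", commit],  # discard tracked changes, move to the commit
        ["git", "clean", "-fdx"],            # remove untracked and ignored files
    ]


def reset_repo(repo_dir: str, commit: str) -> None:
    """Run the reset inside repo_dir and stop on the first failing command."""
    for cmd in reset_commands(commit):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

A call such as `reset_repo("/path/to/benchmark-clone", "<starting commit>")` before every attempt keeps repo state out of the list of things that can differ between models.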
Running one task at a time matters. If you bundle a bug fix, a refactor, and a migration into one prompt, planning-heavy models get an edge and faster models can look worse than they really are. Split the chores so you can see where each model helps and where it slows people down.
Tool access needs the same discipline. If one run can inspect the whole repo, execute tests, and read migration history while another run only sees pasted files, the comparison falls apart. Pick a level of access that matches how your engineers actually work, then keep it fixed.
Save everything, including the messy parts. Keep the original brief, each follow-up prompt, raw output, terminal logs, test results, and total time. Later, when two models feel close, those records explain why one took 8 minutes and the other took 25.
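A sketch of what one saved record might look like, assuming Python and JSON files on disk. The `RunRecord` fields are hypothetical and should mirror whatever your team decides to keep. The fingerprint is a SHA-256 hash of the original brief, which gives a quick way to confirm that two runs really started from the same text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    task_id: str
    model: str
    commit: str
    prompts: list[str]        # original brief first, then any follow-up prompts
    raw_output: str
    files_changed: list[str]
    tests_passed: int
    tests_failed: int
    minutes: float

    def brief_fingerprint(self) -> str:
        """Hash of the original brief, for checking that runs got identical text."""
        return hashlib.sha256(self.prompts[0].encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        """Serialize the record, including the fingerprint, for later review."""
        data = asdict(self)
        data["brief_fingerprint"] = self.brief_fingerprint()
        return json.dumps(data, indent=2, sort_keys=True)
```

If two records for the same task show different fingerprints, one run received a different brief and the comparison should be redone.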
Patch review should start cold. An engineer should try to apply and run the change exactly as the model produced it. If they have to rename variables, fix imports, or rewrite half the migration before tests pass, count that effort. Teams often miss this and end up making weak outputs look better than they are.
Once you do that, the benchmark starts to measure real repo work instead of demo-day success.
Score the output in plain terms
A model does not win because it writes more code or finishes first. It wins if your team can merge the change without a long repair session. Start with the basic question: did the output actually solve the task?
For a bug fix, reproduce the bug, apply the patch, and confirm the bug is gone. For a refactor, compare behavior before and after. For a migration task, run the updated code in your repo and see if it still works with your current setup.
A small scorecard keeps the test grounded:
- Task solved: mark 1 if the change meets the brief and 0 if it misses.
- Tests: record how many existing tests pass and whether the model introduced new failures.
- Hand editing: note how much work your team did after the model stopped, such as minutes spent or number of extra commits.
- Scope control: count changes outside the target files or target problem.
- Readability and style: rate whether the code matches your naming, structure, and error handling patterns.
Give the first item the most weight. A neat patch that fails the task should lose every time.
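The scorecard above can be turned into one number, as long as the weighting keeps the first item dominant. The sketch below assumes Python. The point values and caps are invented for illustration and should be tuned by your team. The one property it is built to guarantee is that any solved task outscores any unsolved one.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    solved: bool              # does the change meet the brief?
    new_test_failures: int    # failures the model introduced
    edit_minutes: float       # hand editing after the model stopped
    out_of_scope_files: int   # changes outside the target files or problem
    style: int                # reviewer rating, 1 (poor fit) to 5 (matches the repo)


def total_score(card: Scorecard) -> float:
    """Weighted total. The quality part tops out at 80, below the 100 for solving."""
    quality = (
        max(0.0, 20 - 5 * card.new_test_failures)
        + max(0.0, 20 - 0.5 * card.edit_minutes)
        + max(0.0, 20 - 5 * card.out_of_scope_files)
        + 4 * card.style
    )
    return (100.0 if card.solved else 0.0) + quality
```

With these assumed weights, a tidy patch that misses the brief can reach at most 80 points, while even a rough patch that solves the task starts at 100.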
Hand editing tells you a lot. If one model gets close but a developer still spends 40 minutes fixing names, imports, edge cases, and tests, that cost is real. Another model may produce less code but need only two small edits. Most teams should pick the second one.
Watch for risky changes outside the target area. If the brief asked for a one-file fix and the model rewrote six helpers, changed configs, and updated unrelated tests, review risk rises fast. That kind of patch often looks smart at first and causes trouble later.
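Scope control is easy to count mechanically. A small sketch, assuming Python and a list of changed paths such as the output of `git diff --name-only`. The allowed patterns are hypothetical and would come from the files named in the brief.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]
```

For example, with `allowed_patterns=["billing/*"]`, a patch that also edits `config/app.yaml` would report that one path, and the reviewer can decide whether the extra change was justified.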
Readability matters because your team has to live with the code. Check naming, function size, comments, and whether the patch follows the repo's usual patterns. If your codebase prefers simple helpers and clear tests, a clever rewrite with extra layers is a poor fit.
Keep the rubric plain enough that two reviewers would reach almost the same result. If every score turns into a debate, the rubric is too vague to help you choose a model.
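One way to check that is to have two reviewers score the same outputs and measure how often they agree. The sketch below assumes Python and simple category labels, such as the 0 or 1 for "task solved". It reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Treating low agreement as a sign of a vague rubric is a rule of thumb here, not a fixed threshold.

```python
from collections import Counter


def agreement(a: list[int], b: list[int]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two reviewers' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("both reviewers must score the same non-empty task list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each reviewer's own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both reviewers gave the same single label to every task
    return observed, (observed - expected) / (1 - expected)
```

If reviewers disagree on several tasks, rewrite the rubric lines that caused the split before you compare models again.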
A simple example from one product team
One small SaaS team skipped public leaderboards and tested two models, Model A and Model B, on three tasks from one repo. The repo had an older login service, a React account page, and an API client that broke after a package upgrade. It felt like normal repo work, not a lab exercise.
The first task was a login bug. Users with older password hashes could not sign in after a change in the auth service. Both models found the bad branch, but they handled it very differently. Model A rewrote part of the flow, added a helper, and touched five files. The bug disappeared, but the diff was larger than the team wanted. Model B changed one condition, added a focused test, and left the rest of the service alone.
The second task was a small React refactor. Two account settings components repeated the same form logic. Model A built a new abstraction and moved state into a shared hook. It worked, but it changed the page more than necessary. Model B kept the component layout, pulled out one helper, and matched the naming and file style already used in the repo. Review went faster because nobody had to relearn the page.
The third task came from a package upgrade. A newer client library changed one request field, and an API call started failing. Model A patched the call but also changed error handling and updated unrelated imports. Model B fixed the field name, adjusted one test, and stopped there.
The team kept scoring simple. Did the change solve the problem? How many files did it touch? How long did review take? Did the code look like the rest of the repo?
Model A looked stronger if you only counted raw output. It wrote more code, moved faster, and suggested broader cleanup. Model B won because engineers merged its changes with less editing and fewer follow-up comments. That matters more than a flashy answer.
Common mistakes that skew the result
Most failed AI coding benchmarks do not fail because the models are close. They fail because the test setup is sloppy. A small mismatch in prompts, tools, or task choice can make one model look better than it is.
The most common mistake is giving one run extra help. A developer pastes a stack trace into one prompt, then forgets to include it for the next model. Or they answer a follow-up question for one model but not for the others. That turns a model test into a coaching test.
Difficulty drift causes another bad comparison. One task fixes a typo in a React page. The next asks for a database migration, test updates, and a risky refactor across six services. If the work is not in the same range, the score tells you very little.
Tool access matters just as much as prompt quality. If one model can search the repo, run tests, and inspect git history while another only gets pasted files, you are not comparing models. You are comparing environments.
A few habits keep the benchmark honest. Give every model the same brief, files, and follow-up rules. Group tasks by difficulty before you start. Match tool access as closely as you can. Score code quality, not just time to first answer. And include messy, older parts of the repo, not only the folders everyone likes to demo.
Speed can fool teams. One model replies in 20 seconds with code that passes one test and breaks three hidden assumptions. Another takes two minutes and leaves cleaner diffs, safer naming, and fewer side effects. The faster answer feels better in the moment, but engineers pay for the mess later.
Teams also cherry-pick modern code because it is easier to benchmark. That creates a false sense of fit. Real repo work includes old migrations, shell scripts, half-documented services, and strange naming choices from five years ago. If your benchmark ignores those areas, it will overrate models that only look good on tidy code.
If you want results you can trust, treat fairness as part of the benchmark itself. Tight rules beat fancy scoring.
Quick checks before you trust the winner
A model that wins once can still disappoint a week later. The goal is not to find a one-time champion. It is to find a tool your team can rely on.
Run a small retest on a different day. Use three or four tasks from the same benchmark and see if the model stays steady when prompts, timing, or minor repo changes shift a bit.
A second reviewer helps a lot. Ask that person to score the output without knowing which model wrote it. Blind scoring cuts down bias, especially when someone already has a favorite tool.
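Blind labels are simple to generate. A minimal sketch, assuming Python. The label format and the idea of a benchmark organizer holding the key are illustrative assumptions. The organizer keeps the returned mapping private and shows reviewers only the neutral labels.

```python
import random


def blind_labels(models: list[str], seed: int) -> dict[str, str]:
    """Map neutral labels to model names in a shuffled order."""
    shuffled = list(models)
    random.Random(seed).shuffle(shuffled)  # seeded so the key can be rebuilt later
    return {f"Output {i}": name for i, name in enumerate(shuffled, start=1)}
```

Use a different seed per task so that "Output 1" does not always point to the same model, and strip model names from file headers or commit messages before review.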
The results should also hold up in more than one part of the codebase. A model may look great in UI cleanup work and then struggle with backend bugs, tests, or migration tasks. If your product has two very different repo areas, sample both before you decide.
Keep the final check simple. Rerun a few tasks on a new day. Have another reviewer score outputs blind. Compare performance across at least two repo areas. Then ask whether the better result is worth the extra spend.
Cost deserves a direct look. If Model A saves 10 minutes on a refactor but costs three times more than Model B, that choice may not hold up across a full month of engineer use. Small quality gains can be real and still not worth the bill.
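The arithmetic is worth writing down once. The sketch below assumes Python and a deliberately crude model: time saved is valued at the engineer's hourly cost, and the pricier model adds a flat monthly amount. Every number in the usage note is hypothetical.

```python
def monthly_net_benefit(
    minutes_saved_per_task: float,
    tasks_per_month: int,
    engineer_hourly_cost: float,
    extra_model_cost_per_month: float,
) -> float:
    """Value of engineer time saved minus the added model spend, per month."""
    time_value = minutes_saved_per_task * tasks_per_month * engineer_hourly_cost / 60
    return time_value - extra_model_cost_per_month
```

As an illustration, saving 10 minutes on each of 30 tasks at an assumed 90 per engineer hour is worth 450 a month. If the more expensive model adds 600 a month, the net is -150, so the quality gain is real but does not pay for itself at that volume.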
Keep a short notes sheet while you review. One line per task is enough: missed project convention, edited the wrong file, broke tests, wrote a strong migration plan, got stuck in a loop. Those odd failures often tell you more than the final score.
Watch for repeat mistakes. If a model keeps ignoring repo structure or keeps rewriting more code than needed, believe that pattern. A clean average score can hide habits that frustrate engineers every day.
What to do after the first round
The first run gives you a snapshot, not a final answer. Teams get more from it when they turn the test into a small habit.
Save a compact task set you can run again every quarter. Keep it small enough that engineers will actually use it: a few bug fixes, one refactor, one migration task, and maybe one test-writing job. If you rerun the same set over time, you can spot whether a model improved, got worse, or drifted away from the kind of work your repo needs.
Do not force one winner onto every task. Many teams find that one model handles bug fixes well, while another does cleaner migration work or safer refactors. That split is normal. If a model is fast but sloppy with schema changes, keep it away from migrations. If another model reads existing patterns better, give it maintenance work.
A small routine is enough:
- Keep 8 to 12 reusable tasks in a private benchmark set.
- Rerun the set every quarter or before you renew a vendor contract.
- Replace tasks that no longer match your stack or coding style.
- Track results by task type as well as total score.
- Share examples of good and bad outputs with the team.
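Tracking by task type takes only a few lines. A sketch, assuming Python and results stored as `(model, task_type, solved)` tuples. That tuple layout and the task type names are hypothetical.

```python
from collections import defaultdict


def solve_rate_by_type(results):
    """results: iterable of (model, task_type, solved) tuples.

    Returns a dict mapping (model, task_type) to the share of tasks solved.
    """
    totals = defaultdict(lambda: [0, 0])  # (model, task_type) -> [solved, attempted]
    for model, task_type, solved in results:
        bucket = totals[(model, task_type)]
        bucket[0] += int(solved)
        bucket[1] += 1
    return {key: solved / attempted for key, (solved, attempted) in totals.items()}
```

A breakdown like this shows a model that does well on bug fixes but poorly on migrations, which a single total score would hide.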
Update the benchmark when your stack changes. A set built around a Node.js monolith will not tell you much after you move part of the product to Go services, add Terraform, or change test tooling. Repo testing only stays useful when it reflects the code people touch every week.
Share the results with engineers first. Managers may care most about cost and speed, but engineers see where a model wastes review time, breaks local conventions, or writes code nobody wants to maintain. A short review meeting with real examples usually teaches more than a scoreboard.
If your team wants a second opinion on the process, Oleg Sotnikov at oleg.is does this kind of Fractional CTO and AI-first advisory work. That kind of review is most useful after you already have one round of benchmark results and want to make the next round sharper.
Frequently Asked Questions
Why are public AI coding benchmarks not enough?
Because public scores usually measure short, clean tasks with one obvious answer. Your team works in old code, odd naming, partial comments, and small fixes where a model can do more harm than help.
What kinds of repo tasks should I test?
Start with real closed tickets from your own repo. Use a mix of bug fixes, refactors that keep behavior the same, upgrade or migration chores, and at least one messy task with thin context or old code.
How many tasks do I need for a useful first benchmark?
Ten closed tickets give most teams a solid first sample. That is enough to spot whether a model handles easy edits, medium tracing work, and annoying maintenance jobs without turning this into a week-long project.
What should I put in the prompt or task brief?
Write the brief like a small ticket for a new engineer. Say the goal in plain English, point to the files that matter, note repo rules such as no new packages, and define done so another person can check it.
Do I need to give every model the exact same setup?
Keep every run on the same commit with the same tools and repo access. If one model can search the repo or run tests and another cannot, you end up comparing setups instead of models.
Should I edit the model output before I judge it?
No. Apply the patch exactly as the model produced it first, then count any cleanup your team had to do. Rename fixes, import repairs, and test repairs all belong in the score.
How should I score the results?
Score task success first, then look at tests, edit time, scope control, and code fit with your repo. A small patch that solves the issue and needs two minutes of cleanup should beat a flashy diff that starts extra review work.
What mistakes make the benchmark unfair?
The usual problems are extra hints, uneven task difficulty, and mismatched tool access. Teams also skew results when they test only tidy modern code and skip the older parts engineers actually touch every week.
How can I tell if the winner is reliable?
Run a small retest on another day and ask a second reviewer to score outputs without model names. Also test at least two repo areas, because a model that looks good in UI cleanup may struggle with backend bugs or migrations.
How often should we rerun the benchmark?
Rerun a small private set every quarter or before you renew a vendor contract. Refresh tasks when your stack changes, and keep notes on repeat failure patterns so you can see whether a model still fits your real work.