Dec 27, 2025·8 min read

Test splitting in CI when your suite stops finishing fast

Test splitting in CI helps teams shorten waits by sharding by file, package, or past runtime while keeping flaky tests visible and fixable.

When the suite stops finishing on time

A test suite can feel fine until it suddenly doesn't. One month it finishes in 4 or 5 minutes. A few releases later, the same pull request sits in CI for 30 or 40 minutes, and every small change starts to feel expensive.

That delay changes how people work. Engineers batch more code into each commit because they don't want to wait again. Reviews slow down because feedback arrives late. Merges pile up. Small fixes that should take ten minutes can drag across half a day because every check blocks the next step.

The pain isn't just the total runtime. Long waits also make feedback less useful. If CI responds too slowly, developers move on to something else, then come back later to a red build with little context left in their head. Simple bugs take longer to understand.

Before changing anything, separate slow tests from flaky tests. They can look similar on a bad day, but they waste time in different ways. Slow tests pass consistently and burn minutes on every run. Flaky tests fail on and off, which creates retries, reruns, and extra investigation. If you mix the two together, you'll solve the wrong problem.

Reruns make this harder to see. A team may think the suite takes 18 minutes because the final green run says 18 minutes. In real life, people waited 35 minutes because two jobs failed, someone clicked rerun, and the next attempt passed. The dashboard looks better than the actual developer experience.

That is why test splitting in CI should start with a clean diagnosis, not with more parallel jobs. If one package is always slow, split for speed. If a few tests fail randomly, fix or isolate them first. Otherwise you spread unstable tests across more workers, hide the source of delay, and make the pipeline look busy without making it much faster.

A simple rule works well: measure the time people actually wait, count how often they rerun jobs, and name the flaky tests before deciding how to split the suite.
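As a rough sketch of that rule, the snippet below compares what a "final green run" dashboard would report with what a developer actually waited. The `Attempt` record and the 17-minute failed attempt are assumptions made up for illustration; only the 18 and 35 minute figures come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One CI attempt for a single commit (hypothetical record shape)."""
    minutes: float
    passed: bool


def wait_summary(attempts):
    """Return (reported_minutes, actual_minutes, reruns) for one commit.

    `attempts` is ordered oldest to newest and ends with the final run.
    """
    reported = attempts[-1].minutes             # what the dashboard shows
    actual = sum(a.minutes for a in attempts)   # what the developer sat through
    reruns = len(attempts) - 1
    return reported, actual, reruns


# A failed attempt followed by a green 18-minute run.
attempts = [Attempt(17, False), Attempt(18, True)]
reported, actual, reruns = wait_summary(attempts)
print(f"reported {reported} min, actual {actual} min, reruns {reruns}")
```

This ignores queue time between attempts, so the real gap is usually wider.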

Choose the unit you split on

Pick the split unit before adding more runners. That choice decides whether jobs finish at about the same time or whether one slow shard still holds everyone back.

File sharding is the simplest option. You take the full list of test files, split it into even groups, and run one group per job. It works well when most files take about the same time and each file can run on its own without strange setup rules. It is usually the fastest first step because most test tools already support running selected files.

Package sharding fits codebases that already have clear module boundaries. If tests under billing, auth, and search mostly stay inside those areas, split by package or folder and keep that structure in CI. This is easier to reason about than file sharding, and failures are often easier to place because the shard name matches part of the product.

Historical runtime splitting is usually the best balanced option. Instead of counting files or packages, you use past timing data and try to give each job roughly the same total runtime. If one file takes 8 minutes and ten others take 10 seconds each, runtime data keeps that heavy file from ruining one shard. This is the right fit when the suite has a few very slow tests mixed in with many quick ones.

The tradeoff is pretty plain. File sharding is easy to set up, but it gets uneven as file times drift apart. Package sharding is clean when the repo already groups tests well, but it falls apart if package sizes differ too much. Historical runtime splitting takes more setup and ongoing measurement, yet it usually gives the shortest wait time.

Upkeep matters more than most teams expect. File and package splits need cleanup as the suite changes. Runtime splits need stored timings and a fallback for new tests that have no history yet.

A good starting point is this: use file sharding if the suite is still fairly even, use package sharding if the repo already reflects product boundaries, and move to historical runtime splitting when one or two slow areas keep making shards lopsided. Whatever you choose, keep results visible per shard so flaky tests don't disappear inside a shorter pipeline.

Measure the suite before you change it

Start with a baseline. If you jump into CI test sharding without one, you can make the dashboard look faster while developers still wait just as long.

Pull recent runs from a normal week, not a single lucky day. Record four numbers for each run: total suite time, queue time before jobs start, the longest job, and the full wait from commit to result. That last number is what your team feels.

Then look for where the time actually goes. In many teams, a small share of files or packages eats most of the runtime. One shard finishes in 6 minutes, another drags on for 24, and the whole pipeline waits for that slow lane.

A useful baseline might look like this:

  • Median full wait time: 38 minutes
  • Median queue time: 7 minutes
  • Slowest test job: 21 minutes
  • Top 15 test files: 46% of runtime
  • Flaky failures and reruns per week: 18

That last line matters. Before you change anything, count flaky failures, timeouts, and reruns. If the number drops after sharding, check why. Better balance is good. Silent auto-reruns that hide unstable tests are not.
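A baseline like the one above can be computed from exported run data. This is a minimal sketch, assuming you can get per-run records and per-file timings out of your CI system; the field names and sample values are hypothetical.

```python
from statistics import median

# Hypothetical per-run records, all values in minutes.
runs = [
    {"queue": 6, "longest_job": 20, "full_wait": 36},
    {"queue": 7, "longest_job": 21, "full_wait": 38},
    {"queue": 9, "longest_job": 24, "full_wait": 41},
]


def baseline(runs):
    """Median of each tracked number across recent runs."""
    keys = ("queue", "longest_job", "full_wait")
    return {key: median(run[key] for run in runs) for key in keys}


def top_share(file_seconds, n):
    """Fraction of total runtime spent in the n slowest files."""
    times = sorted(file_seconds.values(), reverse=True)
    return sum(times[:n]) / sum(times)


print(baseline(runs))
print(top_share({"a.py": 60, "b.py": 30, "c.py": 10}, 1))
```

Medians are used here because one unusually bad run should not define the baseline.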

Set one clear target. "Cut median wait time from 38 minutes to 20 while keeping failure signals clear" is enough. "Make CI faster" is too vague. You need a target that tells you whether the change helped or whether it just moved time from one part of the pipeline to another.

A small example makes this obvious. If the full wait is 30 minutes, but 9 of those minutes come from queue time and 14 sit in one overloaded shard, adding more shards alone won't fix the real problem. You may need better balancing, less runner contention, or both.

Measure first, then split. CI is bad at forgiving guesses.

Set up file sharding step by step

File sharding is the easiest place to start. You take the full list of test files, divide it into a few groups, and run those groups at the same time. If your suite has 120 files, start with 4 shards of about 30 files each.

Keep the first version boring. Split files by count, not by intuition. That gives you a clean baseline and shows whether the suite is fairly even or badly skewed.

The rollout can stay simple. Collect the full list of test files in a stable order, divide that list into equal groups, run one shard per CI job with the same test command, and save the runtime for each shard after every run.
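Those steps can be sketched in a few lines. This is one possible count-based split, not the only one: it sorts the file list so every job sees the same order, then gives each shard every Nth file. The function name and the file paths are made up for illustration.

```python
def shard_by_count(files, total_shards, shard_index):
    """Return the test files for one shard, split by count only."""
    if total_shards < 1 or not 0 <= shard_index < total_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < total_shards")
    ordered = sorted(files)                     # same order on every CI job
    return ordered[shard_index::total_shards]   # every Nth file for this shard


# 120 hypothetical test files split across 4 shards.
files = [f"tests/test_{i:03d}.py" for i in range(120)]
for index in range(4):
    print(index, len(shard_by_count(files, 4, index)))
```

Each CI job would call this with its own shard index and pass the resulting list to the same test command. Contiguous chunks would work too; the stride form just keeps shard sizes within one file of each other without extra arithmetic.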

Make the environment match across shards. Use the same machine size, the same dependency install step, the same cache rules, and the same startup work. If shard 3 spends two extra minutes building test data or pulling a missing image, your numbers stop meaning much.

Watch the spread between the fastest and slowest shard for a week or two. If three shards finish in 6 minutes and one takes 14, the split works on paper but not in practice. The whole pipeline still waits for the slowest job.

That is when you move outlier files. A few large integration tests often cause the long tail. Shift one or two heavy files from the slow shard into a faster one, then measure again. Small moves work better than constant reshuffling.

Keep a record of per-shard timing after each run. You don't need anything fancy at first. Basic logs or CI job metrics are enough to show drift over time.

One warning matters here: don't hide flaky test visibility while you tune the split. If one file fails often, let that failure stay attached to the shard that ran it. Retries can help unblock merges later, but they blur the first signal when you're still learning where the pain is.

Most teams get a decent win from file sharding before they need anything smarter. It is simple, easy to explain, and gives you real timing data for the next round of cleanup.

Use package splits when the codebase already groups well

Package splits work when the codebase already has clean module boundaries. If the auth package mostly tests auth, and the billing package mostly tests billing, you can split by package with little setup and cut wait time without much risk.

This approach is often easier to reason about than file-level splitting. Engineers can see which shard maps to which part of the product, and when one shard slows down, the cause is usually obvious.
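A package split can be as small as grouping test paths by their leading folder. The sketch below assumes the package name is the first path component, which will not hold for every repo layout; `depth` and the sample paths are illustrative.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_package(files, depth=1):
    """Map each package (the first `depth` path parts) to its sorted test files."""
    groups = defaultdict(list)
    for path in files:
        parts = PurePosixPath(path).parts
        groups["/".join(parts[:depth])].append(path)
    return {package: sorted(paths) for package, paths in sorted(groups.items())}


files = [
    "billing/test_invoice.py",
    "auth/test_login.py",
    "billing/test_tax.py",
    "search/test_query.py",
]
for package, paths in group_by_package(files).items():
    print(package, paths)
```

Each key becomes one shard, so the shard name in CI matches a part of the product.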

The catch is structure. Package splits only stay useful if packages reflect real ownership and real runtime cost. A codebase with one huge "common" package and a dozen tiny packages won't balance well, even if the shard count looks tidy.

Keep shared test helpers out of an oversized package that every run has to pull in. Put fixtures, builders, mocks, and setup code in a separate test utility area so one package doesn't become a dumping ground for unrelated tests.

A common mistake is throwing all integration tests into one package because they are slower and feel different from unit tests. That usually creates one shard that crawls while the others finish early. It also hides where the time actually goes. If an integration test belongs to payments, orders, or search, keep it close to that module unless you have a strong reason not to.

A healthy package split usually follows a few simple rules:

  • group tests by product module, not by test type alone
  • move shared helpers into a neutral place
  • watch for one package that grows faster than the rest
  • split large packages before they become the new bottleneck

Review package balance every few weeks. Teams add tests gradually, so a split that looked even last month can become skewed without anyone noticing.

If one package starts taking half the suite time, treat that as a signal. Maybe the module needs a finer split. Maybe too many unrelated tests ended up there because it was convenient. Fixing that early keeps CI test sharding honest and keeps flaky test visibility intact instead of burying slow and unstable tests inside one giant bucket.
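That check is easy to automate. A minimal sketch, assuming you already record total seconds per package; the 0.5 threshold mirrors "half the suite time" and the package numbers are invented.

```python
def dominant_packages(package_seconds, threshold=0.5):
    """Return packages whose share of total runtime is at or above `threshold`."""
    total = sum(package_seconds.values())
    if total == 0:
        return []
    return sorted(
        name for name, seconds in package_seconds.items()
        if seconds / total >= threshold
    )


# Hypothetical per-package runtimes in seconds.
print(dominant_packages({"billing": 900, "auth": 200, "search": 300}))
```

Running something like this during the periodic review turns "one package feels slow" into a named package with a number attached.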

Use runtime data when the suite has uneven tests

When one test file finishes in 6 seconds and another takes 4 minutes, file counts stop telling the truth. You can give each shard the same number of files and still end up with one slow job that holds the whole pipeline open. In that case, test splitting in CI should use recent run times instead of raw file totals.

Historical runtime splitting works well once the suite has enough repeat runs to show a pattern. Most teams do fine with an average from the last 5 to 20 successful runs. That smooths out a single bad result from a busy runner, a cold cache, or a slow network call. If the suite changes a lot each week, use a smaller window. If it stays fairly stable, a larger window gives calmer splits.
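One way to build that average is shown below. It is a sketch under an assumed data shape: each run is a dict with a `passed` flag and a `timings` mapping of file to seconds, ordered oldest to newest. Your CI system will store this differently.

```python
from collections import defaultdict
from statistics import mean


def average_runtimes(runs, window=10):
    """Average per-file seconds over the most recent `window` successful runs."""
    recent = [run for run in runs if run["passed"]][-window:]
    samples = defaultdict(list)
    for run in recent:
        for path, seconds in run["timings"].items():
            samples[path].append(seconds)
    return {path: mean(values) for path, values in samples.items()}


runs = [
    {"passed": True, "timings": {"a.py": 10, "b.py": 100}},
    {"passed": False, "timings": {"a.py": 500, "b.py": 500}},  # ignored: failed run
    {"passed": True, "timings": {"a.py": 20, "b.py": 120}},
]
print(average_runtimes(runs, window=2))
```

Failed runs are skipped because a test that crashes early or hangs until timeout reports a runtime that says little about its normal cost.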

A simple rule helps: take the recent average runtime for each test file, sort the slowest files first, then place each one into the shard with the lowest total so far. That usually gets you much closer to even shard times than splitting by file count or package alone.
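The rule above is a greedy "slowest first, lightest shard next" placement. Here is a minimal sketch of it, assuming you already have a mapping of file to recent runtime in seconds; the function name and sample timings are illustrative.

```python
import heapq


def balance_by_runtime(file_seconds, total_shards):
    """Place the slowest files first, each into the currently lightest shard.

    Returns (shards, totals): the file list and predicted seconds per shard.
    """
    heap = [(0.0, index) for index in range(total_shards)]  # (load, shard index)
    heapq.heapify(heap)
    shards = [[] for _ in range(total_shards)]
    # Runtime descending; the path breaks ties so the layout is reproducible.
    ordered = sorted(file_seconds.items(), key=lambda item: (-item[1], item[0]))
    for path, seconds in ordered:
        load, index = heapq.heappop(heap)      # lightest shard so far
        shards[index].append(path)
        heapq.heappush(heap, (load + seconds, index))
    totals = [0.0] * total_shards
    for load, index in heap:
        totals[index] = load
    return shards, totals


timings = {"slow.py": 240, "a.py": 6, "b.py": 120, "c.py": 120}
shards, totals = balance_by_runtime(timings, 2)
print(shards)
print(totals)
```

The greedy approach does not guarantee a perfect split, but it is cheap, deterministic, and usually close enough that the slowest shard stops dominating.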

New and renamed tests need a fallback. If a file has no timing history yet, place it by a simple backup rule such as file count or package, then measure it on the next run. Don't guess too aggressively. One unknown test shouldn't rearrange the whole layout.
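A backup rule might look like the sketch below. It is one interpretation of "by package, then by file count" and not a standard algorithm: a new file goes to the shard that already holds the most files from the same folder, and ties or brand-new folders fall back to the shard with the fewest files. Existing assignments are left alone.

```python
from pathlib import PurePosixPath


def place_new_file(shards, path):
    """Pick a shard index for a file that has no timing history yet."""
    parent = PurePosixPath(path).parent

    def score(index):
        same_dir = sum(
            1 for existing in shards[index]
            if PurePosixPath(existing).parent == parent
        )
        # More same-folder files wins; then fewer files; then lower index.
        return (-same_dir, len(shards[index]), index)

    return min(range(len(shards)), key=score)


shards = [
    ["billing/test_a.py", "auth/test_x.py"],
    ["billing/test_b.py", "billing/test_c.py"],
    ["search/test_q.py"],
]
print(place_new_file(shards, "billing/test_new.py"))
print(place_new_file(shards, "reports/test_r.py"))
```

Once the new file has a few recorded runs, the next scheduled rebalance can place it by runtime like everything else.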

Keep rebalancing predictable

Rebalance on a schedule, not on every run. Daily or weekly is enough for most teams. If you rebuild shard assignments every time, small timing blips move tests around constantly. That makes failures harder to compare and can hide patterns in flaky tests.

Keep failure reporting separate from shard balancing. A flaky test is still flaky even if it moves to a different shard. Track which tests fail most often, which ones swing wildly in duration, and which shards still finish last. Faster test suites matter, but clear visibility matters just as much.

This method usually pays off once the suite grows past the point where "10 files per shard" sounds fair but feels slow.

A simple example from a growing team

A small product team started the year with a suite that finished in about 8 minutes. Twelve months later, the same pipeline took 45 minutes on every merge request. Nothing dramatic happened. The app grew, more browser coverage got added, and a handful of slow integration tests landed in the same CI jobs.

Their first fix was to split by package because the repo already had clean folders for API, web, and workers. That helped a little, but one package carried most of the expensive tests. The API shard finished in 7 minutes, workers in 5, and web still dragged on for about 24. Everyone else sat waiting for the slowest shard.

They then tried equal file counts inside the web tests. On paper, that looked fair. In practice, it wasn't. One shard got 110 small unit test files and finished in 6 minutes. Another got 108 files, but several of those files launched browsers, seeded data, and waited on async UI flows. That shard took 19 minutes. Same file count, very different cost.

After they measured per-file times for a week, they changed the split again. Instead of dividing tests by package or file count, they used historical runtime splitting. Each shard got a similar predicted runtime, even if the number of files was uneven. One job now ran 65 files, another 140, and that was fine because both finished in roughly 10 to 12 minutes. The full test stage dropped from 24 minutes to about 11.

They also made a smart choice with flaky browser tests. They didn't hide them behind blanket retries on every test job. That would have made the pipeline look healthier than it was. They retried clear infrastructure failures once, kept browser tests in a named group, and tracked which files failed more than once a week. People could still see the flakes, which meant people actually fixed them.

This pattern is common on lean CI setups. The split that works at 8 minutes often fails at 45. Once the suite gets uneven, runtime data usually beats neat folder boundaries.

Mistakes that make the numbers look better than they are

A split suite can look faster on paper while developers still wait just as long. That happens when the report changes more than the real bottleneck.

The most common trap is automatic reruns on every failure. A rerun can keep the pipeline green, but it also hides flaky tests and teaches the team to ignore noise. If a test fails once and passes on retry, track it. Keep flaky test visibility separate from pass rate, or the suite will slowly decay.
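If you do allow a retry, the retry itself is the signal worth recording. A small sketch, assuming you can collect each test's ordered pass/fail attempts within one pipeline run; the labels and test names are hypothetical.

```python
def classify(attempts):
    """Classify one test from its ordered attempts (True = pass) in a single run."""
    if not attempts:
        raise ValueError("need at least one attempt")
    if all(attempts):
        return "passed"
    if attempts[-1]:
        return "flaky"    # failed at least once, then passed on retry
    return "failed"


def flaky_tests(results):
    """Names of tests that only went green after a retry."""
    return sorted(
        name for name, attempts in results.items()
        if classify(attempts) == "flaky"
    )


results = {
    "test_checkout": [True],
    "test_search_suggest": [False, True],
    "test_export": [False, False],
}
print(flaky_tests(results))
```

The pipeline can still report green for a retried pass, as long as the "flaky" list is stored and reviewed separately from the pass rate.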

Another mistake is throwing very different tests into the same pool. Fast unit tests and slow integration flows don't behave the same way. Unit tests usually scale well across shards. Integration tests often fight over shared data, external services, ports, or long setup steps. If you mix them without labels, the split looks balanced until one shard gets three slow flows and blocks every merge.

Teams also forget the work around the tests. The test command might take 4 minutes, but the job may still spend 6 more on environment setup, cache misses, dependency install, container startup, database seed, or test data creation. If you only measure test runtime, you'll split the wrong thing. In some pipelines, cutting setup time saves more than adding more shards.

Averages can fool you too. If seven shards finish in 3 minutes and one takes 11, the pipeline still takes 11. The slowest shard sets the wait time, not the average. That is why a dashboard that celebrates mean runtime can hide a bad split.
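The arithmetic is short enough to check directly. This snippet uses the numbers from the paragraph above and nothing else.

```python
from statistics import mean

shard_minutes = [3] * 7 + [11]   # seven shards at 3 minutes, one at 11

average = mean(shard_minutes)    # what a mean-runtime dashboard reports
pipeline = max(shard_minutes)    # what developers actually wait for

print(f"mean shard time: {average} min, pipeline wait: {pipeline} min")
```

The mean comes out at 4 minutes while the wait stays at 11, which is the gap a mean-only dashboard hides.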

After rollout, keep watching a short list of numbers:

  • total pipeline time
  • longest shard time
  • setup time before tests start
  • flaky failures per run

One growing team hit this exact wall. They doubled the shard count and celebrated a lower average, but merges stayed slow because one shard held most of the browser tests. After they labeled test types, measured setup separately, and stopped silent reruns, the real problem was obvious. The suite wasn't short. Only the chart was.

A short rollout checklist

The first version of test splitting in CI usually cuts wait time fast, but the rollout still fails if people can't trust what they see. A good split is easy to read, easy to rerun, and easy to keep current.

Before calling the rollout done, check the plain details that keep daily work smooth.

Each shard should report failures in a way engineers can scan in seconds. Include the shard number, the files or package range it ran, and the exact failing test names in the job output.

Reruns should be simple too. One engineer should be able to rerun the same shard on a laptop or in CI with one clear command, without guessing which files landed there.

If you use historical runtime splitting, refresh timing data often enough to match the codebase. Busy teams often do well with a daily refresh. Smaller teams can refresh every few days, but once a month is too slow.

Review flaky test counts every week. Keep the count by shard and by test, or the team will start treating random failures as a sharding problem.

Readable failure output sounds minor, but it saves time every day. If shard 6 fails and nobody can tell what it ran, engineers dig through logs, search job config, and lose much of the benefit of the split.
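One way to make a shard both readable and reproducible is to print a header and the exact rerun command at the top of each job. The sketch below assumes a count-based split over a sorted file list and uses `pytest` with file paths as the example runner; swap in your own command and split logic.

```python
import shlex


def shard_files(files, total_shards, shard_index):
    """Files for one shard, using a stable count-based split."""
    return sorted(files)[shard_index::total_shards]


def shard_header(files, total_shards, shard_index):
    """One line that says which shard this is and what it ran."""
    selected = shard_files(files, total_shards, shard_index)
    return f"shard {shard_index + 1}/{total_shards}: {len(selected)} files"


def rerun_command(files, total_shards, shard_index, runner=("pytest",)):
    """Shell command that reruns exactly the files this shard ran."""
    selected = shard_files(files, total_shards, shard_index)
    return shlex.join([*runner, *selected])


files = ["tests/test_b.py", "tests/test_a.py", "tests/test_c.py", "tests/test_d.py"]
print(shard_header(files, 2, 1))
print(rerun_command(files, 2, 1))
```

With that in the job log, an engineer can copy one line to rerun the failing shard locally instead of reconstructing it from job config.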

Flaky tests also deserve their own review instead of getting buried inside general CI noise. If one test fails at random twice a week, fix it or quarantine it with a clear label. Don't let sharding hide it.

If one of these basics is missing, pause the rollout and fix it first. A split test suite should feel easier to use on day one, not harder.

What to do after the first rollout

Don't flip the whole suite at once. Start with one branch, one service, or one test group that hurts the most. That keeps the blast radius small and gives your team a clean before-and-after comparison instead of a messy change spread across every pipeline.

For the first two weeks, treat the new split like an experiment. Watch the numbers, but also watch team behavior. If engineers still wait just as long because reruns got worse, the dashboard may look better while the day-to-day experience does not.

Track a short set of measures: total feedback time from push to test result, rerun count per day or per merge request, false alarms such as timeouts or missing test output, and how often one shard finishes much later than the others.

A small win is enough at this stage. If a slow suite drops from 28 minutes to 16 and reruns stay flat, that is real progress. If it drops to 14 but people now spend 10 minutes figuring out which shard failed, you didn't gain much.

Talk to the engineers who touch the pipeline most. Ask one plain question: does this make failures easier to handle, harder to handle, or just different? Their answer usually tells you more than a build graph.

Be willing to remove the split if it isn't helping. Some teams keep a confusing setup because the change took effort and nobody wants to undo it. That usually makes things worse. If test splitting in CI adds noise, hides failures behind shard naming, or barely improves feedback time, roll it back and try a simpler split.

If the pipeline still drags after the first pass, the real problem may sit deeper than sharding. Test setup, container startup, database resets, and poor cache use often waste more time than the test code itself. If you need an outside review, Oleg Sotnikov at oleg.is works with startups and small teams on CI, infrastructure, and AI-augmented development workflows, and this sort of bottleneck is exactly the kind of issue worth fixing early.

Frequently Asked Questions

How do I know if I have a slow suite or flaky tests?

Look at repeat behavior. Slow tests pass and burn time on every run. Flaky tests fail on and off, trigger reruns, and waste time in bursts. Measure full wait time and rerun count, and name the tests that fail randomly before you change the split.

When should I start splitting tests in CI?

Start when feedback slows daily work, not when the suite just feels bigger. If pull requests wait 20 to 30 minutes, one job finishes far later than the rest, or engineers batch changes to avoid CI, sharding usually pays off. Take a baseline first so you can tell if the change helped.

Should I shard by file first?

Usually, yes. File sharding is easy to set up and easy to explain. It works best when most test files take roughly similar time and each file can run on its own. Start with equal file counts, then watch how far the slowest shard drifts from the fastest.

When does package sharding make more sense?

Choose package sharding when your repo already groups tests by real product areas like billing, auth, or search. Engineers can spot ownership faster, and shard names make sense in failures. Skip it if one package holds most of the heavy tests, because that shard will still hold the pipeline open.

When should I use historical runtime data?

Use it when file counts lie. If one file runs for minutes and many others finish in seconds, runtime data balances shards far better than raw counts. Take timings from the last few successful runs, place slow files first, and rebalance on a schedule instead of every run.

How many shards should I start with?

Keep the first rollout small. Four shards often give you enough data without turning the pipeline into a puzzle. If you start with too many jobs, setup time, queue time, and failure output can get worse even if test time drops.

What metrics should I watch after the rollout?

Track the numbers people actually feel: full wait from commit to result, queue time, longest shard time, rerun count, and flaky failures. Mean test time can look nice while one slow shard still blocks every merge. If you can only watch one number, pick median full wait time.

Should I turn on automatic retries for failing tests?

Use retries sparingly. A single retry for clear infrastructure trouble can unblock work, but blanket retries hide flaky tests and make the suite look healthier than it is. Keep first-failure visibility, and track which tests fail more than once a week.

What if sharding does not make the pipeline much faster?

Check the work around the tests. Dependency install, container startup, database seed, cache misses, and test data creation often eat more time than the test command itself. Fix the longest setup step or the busiest runner before you add more shards.

How often should I rebalance shard assignments?

Most teams do fine with a daily or weekly refresh. That keeps shard times close without moving tests around so often that failures get hard to compare. Give new or renamed tests a simple fallback rule until they build some timing history.