Aug 04, 2025·8 min read

Flaky end-to-end tests: sort, quarantine, and fix

Flaky end-to-end tests slow releases and drain trust. Learn how to sort failures by cause, quarantine carefully, and fix the tests that block work.

Why flaky tests keep coming back

A flaky test does more than turn one build red. It teaches the team not to trust the suite. After enough random failures, people stop asking "what broke?" and start asking "should I rerun it?" That sounds minor, but it changes how people work.

The first cost is time. Someone opens logs, reruns the job, waits another 10 minutes, and gets green on the second try. The build passes, but nobody learns much. When that happens a few times a week, the team loses hours in small pieces. Those hours never show up on a roadmap, yet they still slow delivery.

Patience disappears next. QA suspects the app. Developers suspect the test. Infra suspects the browser, the network, or the CI runner. Any of them might be right, but the team still gets stuck because nobody sorts failures the same way. The same argument comes back, and the same test fails again three days later.

The damage goes beyond rerun time. A red pipeline still delays a release, even when everyone thinks the failure is random. Alerts get ignored because too many of them lead nowhere. Once that habit sets in, a real bug can hide in the noise.

That is why these tests tend to stick around. They live between ownership lines. No one feels fully responsible, so the team treats them as background pain instead of a fixable problem.

Most teams do not need a giant cleanup project to start. They need a simple way to classify failures. Was it a product bug, a test bug, bad test data, a timing issue, or an environment problem? When everyone uses the same buckets, the next step gets much clearer.

One failure might need a code fix. Another might need better waits, stable data, or a short quarantine with an owner and a due date. That is how a red build stops being a debate and becomes a quick decision.

Sort failures by cause first

Teams lose time the moment a failed test turns into a fight about ownership. That usually happens too early. First sort the failure by cause. Ownership gets easier once the cause is clear.

For end-to-end tests, the first pass should answer one question: what kind of failure is this? Keep the labels plain and fixed:

  • app bug
  • test bug
  • data issue
  • environment issue
  • timing issue

These five buckets cover most failures without turning triage into detective work. An app bug means the product broke. A test bug means the script, selector, assertion, or setup logic is wrong. A data issue means the test data expired, changed shape, or never loaded. An environment issue means a service, network, browser, or runner had a bad day. A timing issue means the app and the test fell out of sync.

Use the same labels every week, even if they feel a little rough at first. Consistency matters more than perfect wording. After a month, patterns start to show up. Maybe half your failures come from timing. Maybe one shared dataset breaks every Friday. You will not spot that if each person describes failures in a different way.

Keep the first sort simple enough that anyone on the team can do it in a minute or two. They do not need to solve the whole problem yet. They only need enough evidence to place the failure in the most likely bucket.

Take a checkout test that fails because the "Submit" button stays disabled. If the logs show a 500 error from the backend, call it an app bug. If the page changed and the test still clicks an old selector, call it a test bug. If the account used by the test lost permission overnight, call it a data issue. Same failed test, different cause.
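A first-pass sort like this can be written down as a handful of rules. The sketch below is one possible way to do it, not a standard tool: the `Evidence` fields, the `first_pass` function, and the order of the checks are all assumptions made for illustration. It returns no label when the evidence is too thin, so nobody is forced to guess.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Cause(Enum):
    APP_BUG = "app bug"
    TEST_BUG = "test bug"
    DATA_ISSUE = "data issue"
    ENVIRONMENT_ISSUE = "environment issue"
    TIMING_ISSUE = "timing issue"


@dataclass
class Evidence:
    # Hypothetical fields: fill them from the logs and screenshots of the failed run.
    backend_status: Optional[int] = None   # HTTP status seen in the network log, if any
    selector_found: bool = True            # did the test find the element it targets?
    account_has_access: bool = True        # did the test account still have permission?
    runner_or_service_down: bool = False   # CI runner, browser, or dependency outage
    ui_still_loading: bool = False         # e.g. an overlay covered the button


def first_pass(evidence: Evidence) -> Optional[Cause]:
    """Place a failure in the most likely bucket; None means 'need more evidence'."""
    # The order of these checks is an assumption, not a rule from any framework.
    if evidence.runner_or_service_down:
        return Cause.ENVIRONMENT_ISSUE
    if evidence.backend_status is not None and evidence.backend_status >= 500:
        return Cause.APP_BUG
    if not evidence.account_has_access:
        return Cause.DATA_ISSUE
    if not evidence.selector_found:
        return Cause.TEST_BUG
    if evidence.ui_still_loading:
        return Cause.TIMING_ISSUE
    return None
```

With the checkout example, a 500 from the backend lands in "app bug", a missing selector in "test bug", and a lost permission in "data issue". The value of writing the rules down is that two people looking at the same evidence pick the same label.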

That simple sort changes the tone of triage. People stop guessing. They stop blaming QA for product bugs and stop blaming developers for broken fixtures. The team gets a shared map of what failed, why it failed, and who should take the next step.

Find the tests that hurt delivery the most

Not every failing test deserves attention first. Some fail often but waste only a few seconds. Others fail less often and still block a merge, delay a release, or slow down a hotfix when the team is already under pressure.

Start with normal delivery work from the last two to four weeks. Look at pull request checks, main branch runs, release pipelines, and hotfix pipelines. Ignore one-off experiments. You want the tests that interrupt real work, not odd failures from unusual runs.

For each failing test, track a few plain facts: how often it failed, how many merges it blocked, how many releases or hotfixes it delayed, how much time the team spent waiting or rerunning, and whether a rerun usually passed without changes.

That last point matters because a failure that clears on rerun is usually cheap, however annoying it feels. A noisy test can fail ten times and still do less damage than one test that freezes a Friday release twice a month. Do not fix these tests in the order they annoy people. Fix them in the order they waste time.

The difference gets obvious fast. Imagine one test that fails 12 times in a month but clears on rerun in two minutes and never stops work. Another fails only 3 times, but each failure pulls in two engineers and burns 45 minutes while the release waits. The second test belongs at the top of the list.

You do not need a fancy scoring model. A short spreadsheet is enough. Mark each test as noisy or blocking, then rare or frequent. After that, sort by delivery impact first and failure count second.
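If the spreadsheet outgrows manual sorting, the same ordering fits in a few lines of code. This is a minimal sketch under stated assumptions: `FailureStats`, its field names, and the "engineer-minutes" measure are made up for illustration, and the two sample rows reuse the numbers from the example above with hypothetical test names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureStats:
    name: str
    failures: int              # failures in the review window
    minutes_per_failure: int   # time spent waiting or rerunning, per failure
    engineers: int             # people pulled in each time
    blocks_delivery: bool      # did it stop a merge, release, or hotfix?


def engineer_minutes(stats: FailureStats) -> int:
    """Rough cost of a test over the window: failures x minutes x people."""
    return stats.failures * stats.minutes_per_failure * stats.engineers


def repair_order(rows: List[FailureStats]) -> List[FailureStats]:
    """Sort by delivery impact first, failure count last."""
    return sorted(
        rows,
        key=lambda s: (s.blocks_delivery, engineer_minutes(s), s.failures),
        reverse=True,
    )


# Hypothetical test names; the numbers match the example in the text.
noisy = FailureStats("search_filters", failures=12, minutes_per_failure=2,
                     engineers=1, blocks_delivery=False)
blocker = FailureStats("release_checkout", failures=3, minutes_per_failure=45,
                       engineers=2, blocks_delivery=True)

for row in repair_order([noisy, blocker]):
    print(row.name, engineer_minutes(row))
```

The blocking test costs 270 engineer-minutes against 24 for the noisy one, so it sorts first even though it failed a quarter as often.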

Most teams find the same pattern: a long tail of annoying tests and a short list that hurts every week. That short list is your first repair batch. Fix even three of those tests and the whole release process usually feels calmer within a sprint.

A simple triage routine your team can follow

Most teams waste time because they treat every red test as a new mystery. A better habit is to start from the newest failed run, group similar failures, and decide whether you have one cause or several.

Use the original failed run, not a rerun, as the source of truth. The original failure usually has the cleanest screenshot, browser log, console output, and network trace. A fast rerun can erase the clue you needed.

  1. Start with the latest failed run and scan for repeats. If five tests die on the same login step, treat that as one cluster.
  2. Check the evidence before rerunning anything. Screenshots, logs, and network errors often tell you whether the app broke, the test ran ahead, or a service timed out.
  3. Reproduce the top blocker once in a stable environment. Use fixed test data, the same browser version, and low CI load.
  4. Open one issue for each cause, not for each failed run. Duplicate tickets make a small problem look bigger than it is.
  5. Assign one owner and a short time limit for the first fix. Two days is often enough to patch the test, fix the bug, or decide on quarantine.

This routine keeps flaky tests from turning into a pile of duplicate tickets. It also stops the team from arguing too early about whether the app or the test is at fault.

Say checkout, saved cart, and order history all fail after login with the same 401 response. That is one auth problem, not three separate test problems. One owner can trace the token refresh flow, fix it, and clear several failures at once.
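Grouping like this is easy to automate once each failure carries a step and an error. The sketch below assumes a hypothetical `Failure` record and uses the pair (step, error) as the cluster signature; a real suite might need a looser match, such as normalized error messages.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Failure:
    test: str    # test name
    step: str    # step where the test died
    error: str   # short error summary taken from the logs


def cluster(failures: List[Failure]) -> Dict[Tuple[str, str], List[str]]:
    """Group failed tests that share the same step and error."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for failure in failures:
        groups[(failure.step, failure.error)].append(failure.test)
    return dict(groups)


run = [
    Failure("checkout", "after login", "401"),
    Failure("saved cart", "after login", "401"),
    Failure("order history", "after login", "401"),
]

# One signature, three tests: open one issue for the cause, not three.
for signature, tests in cluster(run).items():
    print(signature, tests)
```

One key in the result means one issue to open, however many tests it touched.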

If one test blocks release every Friday, move it to the front of the line. Reproduce it once, collect the evidence, and decide fast. Teams get stuck when they spend half a day rerunning ten minor failures while one repeat blocker keeps holding delivery.

When to quarantine a test and when not to

Quarantine is a safety valve, not a cleanup strategy. Use it when a test fails for reasons that do not match the product change under review, such as unstable test data, shared environment issues, slow page loads, or a third-party sandbox that times out.

Do not quarantine a test just because it is noisy. If the failure points to a real user risk, keep it in the main signal and fix it fast. A broken checkout path, failed signup, or bad password reset flow should stay visible even if the test itself is imperfect.

A simple rule helps. If the product likely works and the test setup likely broke, a short quarantine can make sense. If users might hit the same problem in production, quarantine hides the issue and makes release decisions worse.

When you quarantine a test, record four things in one shared place: the test name, the reason, the owner, and the review date. That can live in a ticket board, a spreadsheet, or a small table in the test report. The format matters less than the habit. If the reason is missing, the test will sit there for months and trust in the suite will drop even further.

Review dates matter more than most teams think. A quarantine list without an expiry turns into a graveyard. Pick a near date, often within one or two sprints, and ask one blunt question at review time: fix it, replace it, or delete it.
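Whatever tool holds the list, the record itself is small. Here is a minimal sketch, assuming a hypothetical `QuarantineEntry` type that refuses blank fields and a helper that lists entries whose review date has arrived; the names are illustrative, not part of any test framework.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class QuarantineEntry:
    test_name: str
    reason: str
    owner: str
    review_date: date

    def __post_init__(self) -> None:
        # A quarantine without a reason or an owner is the one that sits for months.
        for field_name in ("test_name", "reason", "owner"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be blank")


def due_for_review(entries: List[QuarantineEntry], today: date) -> List[QuarantineEntry]:
    """Entries whose review date is today or already past."""
    return [entry for entry in entries if entry.review_date <= today]


entries = [
    QuarantineEntry("checkout confirm", "overlay race on Confirm button",
                    "test owner", date(2025, 8, 11)),
    QuarantineEntry("report export", "third-party sandbox times out",
                    "infra owner", date(2025, 8, 25)),
]

for entry in due_for_review(entries, today=date(2025, 8, 12)):
    print(entry.test_name, "-> fix it, replace it, or delete it")
```

Running the review check in CI or on a fixed weekday is one way to make sure the blunt question actually gets asked.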

Keep revenue and security flows visible even when they are flaky. If a payment test fails twice a week because of brittle selectors, move it out of release gating if you must, but still show it on the team dashboard and discuss it in release review. Hiding it completely is how small test problems turn into missed bugs.

Teams often argue about whether quarantine is good or bad. It is neither. It is a temporary label. Used like a parking spot with a return date, it helps. Used like storage, it quietly breaks the suite.

A realistic weekly release example

On Monday, a checkout flow test turns red in CI. It covers the path every team worries about: add an item, enter payment details, click confirm, and wait for the receipt page. By Friday, that same test fails three times.

The first failure is real. The app sends the payment request, but the gateway reply never reaches the final order step. The order stays "pending," and a customer would hit a broken checkout. The team marks that bug as release blocking and fixes it before anything else.

On Wednesday, the test fails again. This time the payment goes through, but the script clicks "Confirm" while a loading overlay still sits on the button. On Thursday, it fails one more time for the same reason. Logs show a normal payment flow. Video shows the UI settling a fraction later than the test expects.

This is what a normal week looks like with flaky tests. The test name stays the same, but the causes do not. If the team treats all three failures as one problem, they waste time and make worse release calls.

So they make three separate decisions. They keep the real payment bug in the release blocker lane until checkout works in production-like runs. They quarantine the timing failure for seven days with a note that explains the race and names the owner. Then they fix the wait logic so the test waits for the overlay to disappear and for the receipt page to load.

Friday looks much calmer after that. The team does not hide the payment bug under a vague "known flake" label. It also does not stop the release for a timing issue that already has a short quarantine, a clear owner, and a simple fix.

The release moves forward because the actual risk got fixed. The noisy failure stays visible, but it no longer blocks delivery while the team repairs the test.

Fix patterns that remove most flakes

Most flaky tests come from a few boring problems, not mysterious bugs. Teams often spend hours blaming the framework when the real issue is weak selectors, messy data, or leftover state from a previous run.

Start with selectors and waits

A test should click and read what a user can actually see. If it depends on a deep CSS path, a random class name, or the third button in a row, a small UI cleanup can break it. Prefer selectors tied to labels, roles, button text, and other stable page elements.

Hard sleeps cause the same pain. A fixed five-second wait is too short on a slow runner and too long on a fast one. Wait for a clear app signal instead: a button becomes enabled, a toast appears, a request finishes, or a table row shows the new record.
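Most browser automation frameworks ship their own explicit waits, and those are usually the better choice. To show the idea without tying it to one framework, here is a generic polling helper. The name `wait_until` and its parameters are assumptions for this sketch; the condition stands in for any app signal, such as "the button is enabled".

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 10.0,
               interval: float = 0.1) -> float:
    """Poll until condition() is true and return the seconds waited.

    Raises TimeoutError if the condition is still false after `timeout` seconds.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page: the button becomes enabled 0.2 seconds from now.
enabled_at = time.monotonic() + 0.2
wait_until(lambda: time.monotonic() >= enabled_at, timeout=5.0)
```

Unlike a fixed sleep, this returns as soon as the signal appears on a fast runner and keeps waiting, up to the timeout, on a slow one. The timeout is a safety limit, not the expected wait.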

Isolate each test run

Shared state poisons suites more than most teams admit. One test creates a customer, another edits that same customer, and a third fails because the name already changed. Each test needs its own data, its own setup, and a cleanup rule the team actually trusts.

A few habits remove a lot of noise. Create fresh records for each run instead of reusing old ones. Use unique IDs or timestamps so tests do not collide. Reset cookies, local storage, and feature flags between tests. Seed known data when the app needs a specific starting state. Keep parallel tests away from the same shared account.
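Unique data is the cheapest of these habits to adopt. The sketch below builds a fresh customer record per call using a random ID from Python's `uuid` module; the `unique_customer` helper and the record's fields are hypothetical, and creating the record in your app is left to whatever setup API your suite already uses.

```python
import uuid
from typing import Dict


def unique_customer(prefix: str = "e2e") -> Dict[str, str]:
    """Build customer data that will not collide with other tests or earlier runs."""
    run_id = uuid.uuid4().hex[:12]  # random, so parallel tests get different values
    return {
        "name": f"{prefix}-customer-{run_id}",
        # .test is a reserved top-level domain, so no real mailbox is ever hit.
        "email": f"{prefix}-{run_id}@example.test",
    }


first = unique_customer()
second = unique_customer()
print(first["name"], second["name"])
```

A shared prefix such as `e2e` also gives cleanup jobs an easy way to find and delete leftover records.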

Long end-to-end paths also fail more often because they depend on too many moving parts. If one test signs in, imports data, invites a user, changes billing, and checks a report, it tells you very little when a late step breaks. Split that path into smaller checks, and keep only one or two full-journey tests for the flows that can block a release.

If checkout fails every Friday, do not keep one giant test that builds a cart, edits an address, adds a coupon, and confirms payment. Keep a short checkout test for release blocking, then move coupon logic and address editing into their own checks.

That combination usually cuts noise fast. Fix selectors, wait for real signals, control the data, and isolate state. The tests that still fail will usually point to a real product bug, which is what the suite should do.

Mistakes that make flaky suites worse

A lot of teams make the problem worse by chasing green builds instead of clean signals. The build turns green for an hour, but nobody learns why the test failed. A week later, the same test blocks release again.

The first bad habit is quarantining every annoying test. Quarantine has a place, but it is not a cleanup plan. If a team hides every noisy test, the suite stops telling the truth. Soon you have a "passing" pipeline and no confidence in it.

Another common mistake is rerunning failures until one pass makes the problem disappear. That relieves stress in the moment, but it destroys evidence. If the first run failed because of a race condition, slow data setup, or an unstable dependency, repeated reruns only bury the pattern.

Teams also lose time when they mix product bugs and test bugs in the same ticket. A broken checkout flow and a broken selector are different problems. One needs a product fix. The other needs test maintenance. Put them together and ownership gets fuzzy fast.

Some habits look productive but create more noise. Teams change several tests at once when only one blocks releases. They accept unstable environments as normal background noise. They raise timeouts everywhere instead of finding the real wait condition. They close flaky test work after one temporary pass.

Small, targeted fixes work better. If one login test fails every Monday after deployment, fix that test and the setup around it first. Do not rewrite the whole auth suite because one case hurts the release.

Environment issues deserve the same attention as test code. If shared data resets late, queues lag, or a staging service restarts at random, the suite is telling you something real. Treating that as "just staging being staging" is how flaky tests become permanent.

A useful rule is simple: keep the signal, isolate the cause, and change the smallest thing that can prove the fix.

A short checklist for each failed test

When a test fails, the team needs a few clear answers before anyone starts blaming the code, the test, or the environment. If you can get those answers in ten minutes, failures stop turning into long Slack threads.

Use a short checklist:

  • Write the cause in one plain sentence. For example: "The checkout test failed because the tax API mock returned no response." If nobody can explain the failure simply, the investigation is still too fuzzy.
  • Decide how much it hurts right now. Put failures that block revenue, signups, or the release at the top of the queue.
  • Check whether the same error appeared in more than one run. One odd timeout can be noise. The same broken step twice usually means something real changed.
  • Ask one person to reproduce it on a normal setup. If they need special data, a custom branch, or a lucky CI worker, the failure may come from test setup drift.
  • Assign one owner and one review date. Without both, quarantined tests sit for weeks and then surprise everyone again.

This is simple failure triage, but it changes the discussion. People stop arguing in general terms and start talking about one failure, one impact level, and one next step.

For example, the signup flow fails on Tuesday. The team sees the same network error in three runs, one engineer reproduces it locally, and the path blocks new accounts. That test goes straight to the top. On the same day, a report export test fails once, nobody can reproduce it, and the release does not depend on it. That one can wait for another run.

If a failed test does not have a clear cause, repeatability, and an owner, it is not ready for a real fix yet. It is just noise with a screenshot.

What to do next

If flaky tests slow releases every week, start smaller than you think. Do not try to clean the whole suite at once. Pull the last month of failures and find the five tests that blocked merges, delayed a release, or forced repeated reruns. That short list usually tells you more than a hundred noisy failures.

Then pick one sorting method and stick to it every week. Keep the labels plain: app bug, test bug, data issue, environment issue, and timing issue. The exact names matter less than consistency. If the team changes the labels every few days, nobody sees patterns and the same debates return.

A weekly routine beats a big cleanup sprint. Refresh the top five blockers, review the quarantine list on the same day each week, remove tests from quarantine only after you prove they are stable, and retire tests that no longer protect anything important.
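"Prove they are stable" works best as a rule the team agrees on in advance. One possible rule, sketched below, is a streak of consecutive passes. The function name and the default of 20 runs are assumptions for illustration; pick a number that matches how often the test runs and how rarely it used to fail.

```python
from typing import Sequence


def ready_to_leave_quarantine(results: Sequence[bool], required_passes: int = 20) -> bool:
    """results is ordered oldest to newest; True means the run passed.

    The test is considered stable only if the most recent
    `required_passes` runs all passed.
    """
    if required_passes <= 0:
        raise ValueError("required_passes must be positive")
    if len(results) < required_passes:
        return False  # not enough history yet
    return all(results[-required_passes:])


history = [False, True, False] + [True] * 20
print(ready_to_leave_quarantine(history))
```

A test that failed once in a hundred runs needs far more than a handful of green runs before the streak means anything, so size the threshold to the old failure rate.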

Make the current state easy to see. A small dashboard is enough. Show failure counts, which tests are quarantined, how often they fail, and whether they affected a release. One screen that everyone reads cuts down on blame because the team works from the same picture.

Keep the dashboard simple. A spreadsheet can work. So can a panel in the tooling you already use. The point is not polish. The point is that nobody has to guess which failures matter.

If the same few failures keep coming back, the problem often sits outside the test itself. It can be bad test data, slow environments, weak cleanup between runs, or a release process that hides breakage until late. An outside review can help. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor, and this kind of release, CI/CD, and infrastructure cleanup is exactly the sort of problem he helps teams untangle.

Start this week. Pick the five blockers, choose your labels, set a fixed quarantine review day, and publish one dashboard the team can check before the next release.

Frequently Asked Questions

What makes a test flaky?

A flaky test fails even when the product did not change in a meaningful way. Most of the time, timing problems, weak selectors, shared test data, leftover state, or shaky environments cause it.

Should I rerun a failed test right away?

No. Check the original run first because it usually has the best clues. A quick rerun can turn the build green and hide the screenshot, logs, or network error you needed to find the cause.

How should my team classify a failed end-to-end test?

Use one simple set of labels and stick to it: app bug, test bug, data issue, environment issue, or timing issue. That shared language stops blame loops and helps the right person take the next step.

Which flaky tests should we fix first?

Fix the tests that block merges, delay releases, or pull engineers away from shipping. A test that fails less often can still hurt more than a noisy one if it stalls a release for 45 minutes every time.

When does quarantine make sense?

Quarantine makes sense when the product likely works but the test setup does not, such as unstable data, a slow third-party sandbox, or a runner issue. Give the test an owner and a review date so it does not sit there forever.

When should we avoid quarantine?

Keep it out of quarantine when users might hit the same problem in production. Checkout, signup, password reset, payments, and security paths need to stay visible even if the test itself needs cleanup.

What should we record when we quarantine a test?

Write down the test name, the reason, the owner, and the review date in one shared place. That small bit of discipline keeps quarantine temporary instead of turning it into storage for old problems.

What fixes remove a lot of flaky failures fast?

Start with selectors and waits. Use stable page signals, such as visible text, roles, enabled buttons, finished requests, or a loaded result, instead of deep CSS paths and fixed sleeps.

What does a good triage routine look like?

Look at the newest failed run, group similar failures, and decide whether one cause hit several tests. Then reproduce the top blocker once in a stable setup and open one issue per cause, not one per failed run.

How often should we review flaky tests?

Review the worst blockers every week, not once a quarter. A short weekly pass over failed runs, quarantined tests, and release impact helps you catch patterns before they become normal background noise.