Error tracking sampling plan for high-volume products
Build an error tracking sampling plan that keeps critical failures, trims noisy routes, and matches event volume to severity and customer impact.

Why high event volume gets expensive fast
At high traffic, error volume rarely rises in a neat line. One noisy route, worker, or webhook can throw the same failure thousands of times a minute. Then the issue you actually need to fix gets buried.
A broken checkout path that hits 200 users can look small next to a polling job that fails 50,000 times and hurts nobody. That is why flat sampling usually fails. If you keep 10% of everything, you keep too much noise and throw away too many rare but serious failures. The math looks fair. The outcome is not.
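To make the arithmetic concrete, here is a small sketch using the numbers above. The route names and the per-route percentages in the second plan are illustrative assumptions, not a recommendation.

```python
# Event counts from the example above; the route names are illustrative.
EVENTS = {"checkout": 200, "polling_job": 50_000}

def kept_per_route(counts, percent_by_route):
    """Expected stored events per route, given a keep percentage for each route."""
    return {route: n * percent_by_route[route] // 100 for route, n in counts.items()}

# Flat plan: 10% of everything.
flat = kept_per_route(EVENTS, {"checkout": 10, "polling_job": 10})
# Tiered plan (assumed rates): keep all checkout errors, 1% of polling noise.
tiered = kept_per_route(EVENTS, {"checkout": 100, "polling_job": 1})

print(flat)    # {'checkout': 20, 'polling_job': 5000}
print(tiered)  # {'checkout': 200, 'polling_job': 500}
```

The flat plan stores 5,020 events and only 20 of them come from checkout. The tiered plan stores 700 events and keeps all 200 checkout failures.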
Engineering pain and customer pain are different. A noisy internal error can fill dashboards and wake people up. A lower-volume error on signup, billing, or checkout may produce fewer events but hurt more people. When teams treat raw count as the main signal, they often fix the loudest route first and miss the one that costs revenue or trust.
Cost usually rises in the same places. Storage fills with repeated events, alert rules fire too often, and engineers waste time digging through duplicates before they find the issue that matters.
One global rule sounds simple, but it wastes both money and attention. You need tighter control on noisy routes and fuller retention on paths where a single failure matters.
Teams with very high volume learn this quickly. One bad webhook consumer or retry loop can generate more events in an hour than the rest of the app creates in a day. If you keep all of that, you pay to store noise, page people for noise, and spend real engineering time reading noise. The bill grows first, but the bigger loss is focus.
Start with the failures you never drop
Start with a short list of failures you will always keep, even if volume spikes. If an error can lock users out, block payment, lose data, or break a core promise of the product, keep every event.
Percentages come later. First protect the paths where one missed signal can turn into lost revenue, support load, or a trust problem that is hard to repair.
Most teams should keep full capture for failures in sign-in, password reset, and session handling, along with checkout, billing, refunds, subscription changes, and any write that can lose or corrupt customer data. Account changes with security or permission risk belong in the same bucket.
Apply the same rule to anything new. When an error shows up for the first time, keep it at 100% until someone reviews it. New issues often look small because grouping is messy, the route label is wrong, or only one customer has hit it so far. If you sample too early, you lose the best evidence.
Treat sudden spikes the same way. If one crash hits many users in a short window, stop sampling that group and keep the full stream until the team understands the cause. Count can mislead, but a sharp burst across the same route, release, or device type usually points to real customer impact.
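The first-seen rule and the spike rule can be sketched as one small gate that decides when a group bypasses sampling. This is a minimal in-memory sketch: the class name, the `fingerprint` grouping key, and the thresholds (50 users in 300 seconds) are assumptions to replace with your own values.

```python
from collections import defaultdict, deque

class FullCaptureGate:
    """Decides when an error group should bypass sampling entirely."""

    def __init__(self, spike_users=50, window_seconds=300):
        self.spike_users = spike_users        # assumed threshold
        self.window_seconds = window_seconds  # assumed window
        self.reviewed = set()                 # groups a person has looked at
        self.recent = defaultdict(deque)      # fingerprint -> (timestamp, user_id)

    def mark_reviewed(self, fingerprint):
        self.reviewed.add(fingerprint)

    def keep_all(self, fingerprint, user_id, now):
        hits = self.recent[fingerprint]
        hits.append((now, user_id))
        # Drop hits that have fallen out of the sliding window.
        while hits and now - hits[0][0] > self.window_seconds:
            hits.popleft()
        first_seen = fingerprint not in self.reviewed
        spiking = len({user for _, user in hits}) >= self.spike_users
        return first_seen or spiking
```

A group that nobody has reviewed always returns `True`. A reviewed group returns `True` only while enough distinct users hit it inside the window. A production version would also need to cap the memory each group can hold.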
Write these no-drop rules down before you tune any percentages. Keep them short and specific. Name the routes, job types, and error groups that stay at full capture, and say who can change that.
A one-page rule sheet beats guesswork. It helps when traffic jumps, when a new teammate edits observability settings, or when finance asks why some events stay unsampled. The answer is simple: missing these failures costs more than storing them.
Group events by route and job type
If every error lands in one bucket, the busiest parts of your product bury the failures that cost you money. Start with traffic shape, not raw totals.
Split events into groups that match how the product actually runs. Web pages, API endpoints, background workers, and scheduled jobs fail for different reasons and create very different noise levels.
A simple split is enough at first:
- web routes for user-facing pages
- API routes for frontend and app calls
- workers for queues, imports, and async tasks
- scheduled jobs for cron tasks, cleanups, and reports
This pays off fast. A broken checkout page should never compete with a noisy image resize worker. An admin screen used by five people a day does not need the same sampling rate as a payment flow or signup path.
Clean up route names before you sample. If your tracker stores /orders/18372, /orders/18373, and /orders/18374 as different routes, you waste volume on URL fragments instead of real patterns. Normalize them into stable names like /orders/:id, POST /api/payments, or worker.invoice_retry.
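A minimal normalization sketch might look like the following. It assumes IDs are either all-digit segments or UUIDs, and it prefixes every route with the HTTP method; the function name and both patterns are illustrative, so adjust them to the ID formats your routes actually use.

```python
import re

# Path segments that look like IDs rather than names (assumed formats).
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def normalize_route(method, path):
    """Turn a concrete request path into a stable route name."""
    path = path.split("?", 1)[0]  # ignore the query string
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"

print(normalize_route("get", "/orders/18372"))           # GET /orders/:id
print(normalize_route("GET", "/orders/18373?ref=email")) # GET /orders/:id
print(normalize_route("POST", "/api/payments"))          # POST /api/payments
```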
Some groups deserve aggressive filtering. Health checks often fail in bursts and tell you very little after the first few hits. Retry loops can do the same, especially when an upstream service slows down and every worker logs the same timeout.
Keep those noisy paths separate and sample them hard. On revenue paths, keep nearly every event, or do not sample at all. Many teams also put admin pages in the middle: enough data to debug issues, but not enough to let internal traffic dominate the budget.
A practical setup might keep 100% of checkout, login, and billing errors, 25% of admin page errors, and 1% of health check failures after the first event. That is usually far better than one global rate for everything.
Add severity rules that match real risk
Severity rules should follow customer harm, not just the label attached to the event. Keep full detail for failures that break access, damage data, block payments, or point to a security problem.
That usually means full capture for fatal crashes, auth failures, permission bugs, payment errors, and security events. When one of these fires, you want every stack trace, release tag, request path, and customer context you can safely store.
Warnings need a much tighter cap. In many products, warnings come from retries, browser quirks, noisy SDKs, and expected edge cases. If you capture them like exceptions, they bury the errors that actually stop work.
A simple starting point:
- fatal and security events: full capture
- unhandled exceptions in customer-facing flows: high sample rate
- background job warnings and temporary client warnings: low sample rate
- known validation noise: very low sample rate or count-only tracking
Deployments should change the rules for a while. If a route that was quiet yesterday starts throwing the same exception hundreds of times right after a release, raise the capture rate for that error family. The first few events tell you what broke. The next wave tells you how wide the damage is.
This matters even more at very high volume. Large production systems can flood a tracker with repeated post-deploy errors in minutes. A temporary rule that boosts capture for new or fast-growing exceptions gives you enough detail to debug, then you can lower the rate again after the fix ships.
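One way to model a temporary rule is a boost with an expiry time, so the higher rate disappears on its own if nobody remembers to remove it. The names `TemporaryBoost` and `effective_rate`, and the error family string, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TemporaryBoost:
    error_family: str
    rate: float
    expires_at: float  # seconds, on the same clock as `now`

def effective_rate(error_family, base_rate, boosts, now):
    """Return the highest unexpired boost for this family, or the base rate."""
    active = [
        boost.rate
        for boost in boosts
        if boost.error_family == error_family and now < boost.expires_at
    ]
    return max([base_rate, *active])

# Boost one error family to full capture for the first hour after a release.
boosts = [TemporaryBoost("KeyError:checkout", rate=1.0, expires_at=3600)]
print(effective_rate("KeyError:checkout", 0.1, boosts, now=600))   # 1.0
print(effective_rate("KeyError:checkout", 0.1, boosts, now=7200))  # 0.1
```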
Low-risk validation noise should move the other way. Bad coupon codes, empty form fields, expired invite links, or rejected file types can flood a project without telling you much. Keep a thin sample so you can spot changes in volume or message shape, but do not store every copy.
A good rule of thumb is simple: if the event can cost a customer time, money, access, or data, keep more of it. If it mostly reflects expected bad input, keep less and watch the trend.
Measure customer impact, not just count
Raw event count can fool you. A bug that fires 10,000 times from one broken retry loop looks huge in a dashboard, but it may affect only one account. A payment error that hits 300 people once each is usually the bigger problem.
Track affected users next to total events. Count unique users, accounts, or workspaces for each error group over a short window. That simple change tells you whether you have one noisy failure or a broad user problem.
The blocked action matters just as much. If the error blocks sign-up or payment, or stops your support team from doing its work, keep far more of it than you keep for a failure in a low-risk background job. Sampling should follow user pain, not log volume.
A small set of fields is often enough:
- unique users affected
- whether it blocked sign-up, payment, or support work
- which plan or account the user belongs to
- whether the same user triggered it again and again
The difference between repeat noise and broad damage is easy to miss. One user hit 10,000 times often means a loop, a stuck client, or one bad record. Ten thousand users hit once means a release problem, a bad dependency, or a broken route. You should sample those two cases very differently.
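A short sketch shows how the two cases separate once you count users next to events. It assumes each event carries a grouping key and a user ID; the function name and the sample data are illustrative.

```python
from collections import Counter

def impact_summary(events):
    """Summarize (fingerprint, user_id) pairs from one time window."""
    totals = Counter()
    users = {}
    for fingerprint, user_id in events:
        totals[fingerprint] += 1
        users.setdefault(fingerprint, set()).add(user_id)
    return {
        fp: {
            "events": totals[fp],
            "users": len(users[fp]),
            "events_per_user": totals[fp] / len(users[fp]),
        }
        for fp in totals
    }

# One account stuck in a loop versus 300 users hit once each.
events = [("retry_loop", "acct-1")] * 10_000
events += [("payment_error", f"user-{i}") for i in range(300)]

summary = impact_summary(events)
print(summary["retry_loop"])     # {'events': 10000, 'users': 1, 'events_per_user': 10000.0}
print(summary["payment_error"])  # {'events': 300, 'users': 300, 'events_per_user': 1.0}
```

A very high `events_per_user` value points to repeat noise that is safe to sample hard. A high `users` count points to broad damage that deserves fuller capture.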
Paid plans may need their own rule. If a failure touches enterprise customers, large contracts, or accounts your team watches closely, keep more context and more samples even when the event count is low. That gives support and engineering enough detail to act fast.
Lean teams usually learn this early. Storing every repeated stack trace gets expensive, but dropping errors that block checkout or lock out a paying workspace costs more. Count events, yes. Then ask who got hurt, what they were trying to do, and whether the business can feel it by the end of the day.
Build the plan step by step
Start with a plain table. List every API route, background job, webhook, and scheduled task in one place. Next to each item, note the business cost of failure. Checkout, login, billing sync, and import jobs usually sit near the top. A noisy image resize worker sits much lower.
Then give each group a default sample rate. Keep the first version simple. You might keep 100% of checkout errors, 50% of account settings errors, 10% of search timeouts, and 1% of low-risk batch noise. That first pass usually does most of the cost control.
A good rollout sequence looks like this:
- Group events by route or job type.
- Set one default rate for each group.
- Add severity overrides.
- Add customer-impact overrides.
- Test the rules on recent data.
Add severity after route rules, not before. A warning on checkout may matter less than a fatal crash in a background billing job, or it may matter more if it blocks orders. Decide what "fatal", "error", and "warning" mean in your product, then override the default rate when the label matches real risk. If your team uses severity labels loosely, fix that first. Messy labels lead to messy sampling.
Customer impact should break ties. Keep every error that blocks payment, breaks sign-in, or affects many accounts in a short window, even if that route usually has a low sample rate. If one customer sends half your traffic, add a rule for account tier or affected users so raw volume does not drown out the damage.
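The layering described above (route default first, then severity, then customer impact) can be sketched as a single function. The route rates come from the earlier example; the warning cap, the fallback rate, the 100-user threshold, and all names are assumptions.

```python
from dataclasses import dataclass

ROUTE_DEFAULTS = {"checkout": 1.0, "account_settings": 0.5, "search": 0.10, "batch": 0.01}
FALLBACK_RATE = 0.10                 # assumed rate for unlisted groups
FULL_CAPTURE_SEVERITIES = {"fatal", "security"}
WARNING_CAP = 0.05                   # assumed ceiling for warnings

@dataclass
class ErrorContext:
    group: str
    severity: str
    blocks_core_action: bool = False
    affected_users: int = 0

def resolve_rate(ctx, broad_impact_users=100):
    # 1. Route or job default.
    rate = ROUTE_DEFAULTS.get(ctx.group, FALLBACK_RATE)
    # 2. Severity override.
    if ctx.severity in FULL_CAPTURE_SEVERITIES:
        rate = 1.0
    elif ctx.severity == "warning":
        rate = min(rate, WARNING_CAP)
    # 3. Customer impact breaks ties and wins over everything else.
    if ctx.blocks_core_action or ctx.affected_users >= broad_impact_users:
        rate = 1.0
    return rate

print(resolve_rate(ErrorContext("batch", "fatal")))                               # 1.0
print(resolve_rate(ErrorContext("checkout", "warning")))                          # 0.05
print(resolve_rate(ErrorContext("checkout", "warning", blocks_core_action=True))) # 1.0
```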
Last, run the rules against last week's event data before you ship them. Check how much volume you drop, which serious failures you still keep, and whether any route now looks quieter than it really is.
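A dry run can be as simple as replaying exported events through the proposed rules and tallying the result per group. This sketch assumes events are dictionaries with `group` and `severity` fields and that `rate_for` is whatever function encodes your rules; both are placeholders.

```python
SERIOUS = {"fatal", "security"}  # assumed severity labels

def dry_run(events, rate_for):
    """Replay past events through proposed rules without dropping anything."""
    report = {}
    for event in events:
        rate = rate_for(event)
        row = report.setdefault(
            event["group"], {"seen": 0, "expected_kept": 0.0, "serious_at_risk": 0}
        )
        row["seen"] += 1
        row["expected_kept"] += rate
        # Flag serious events that the new rules would sample.
        if event["severity"] in SERIOUS and rate < 1.0:
            row["serious_at_risk"] += 1
    return report
```

Any group with a non-zero `serious_at_risk` count is a rule to fix before shipping. A group whose `expected_kept` value is tiny compared with `seen` is one that will look quieter than it really is.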
Teams that run lean infrastructure tend to avoid bad surprises by doing this dry run first. It takes about an hour and can save days of guessing later. That kind of discipline is common in AI-first engineering setups too. Oleg Sotnikov, for example, writes about running production systems with lean teams and tight observability control on oleg.is.
A simple example from a product team
A small product team runs a subscription app with a busy checkout flow, daily background sync jobs, and a web client that throws plenty of harmless browser noise. Their error volume jumps every time they ship a release or run a promotion, so a flat sampling rate stops working fast.
They fix that by changing sampling by route, severity, and customer impact.
During a sales week, they keep every exception from checkout. If payment fails, cart totals break, or the order confirmation page crashes, they want every event. Even a small bug there can cost real money in a few hours, so saving only 10% would be a bad trade.
They treat background sync very differently. A sync timeout that retries on its own matters less than a checkout failure, especially when thousands of jobs run each hour. The team keeps a low sample rate for those timeout errors, enough to spot a pattern without paying to store the same issue again and again.
A release changes the rules again. After a new version goes live, support reports that several first-week customers hit errors during onboarding. The team raises capture for events tied to new accounts and recent signups. That helps them answer a better question than "How many errors fired?" They ask, "Are new customers getting blocked right after release?"
They also clean up noisy browser errors once they confirm the cause. Say a browser extension injects bad JavaScript and floods the tracker with duplicate exceptions. After the team verifies that the problem does not come from their app, they drop most of that noise and keep a tiny sample for monitoring. The issue stays visible, but it no longer drowns real failures.
The result is straightforward. They spend their budget on checkout, releases, and customer-facing breakage. They sample repetitive low-risk failures hard. And when something starts hurting new customers, they turn capture back up before the dashboard fills with junk.
Mistakes that hide the failures that matter
A flat sample rate sounds tidy, but it usually hides the wrong things. If every route, worker, and background job gets the same rule, the loudest parts of the product win. A login failure that blocks revenue should not compete with a chatty health check that fails all day and fixes itself on the next try.
This is one of the most common ways a sampling plan goes wrong. Teams tune for total volume, then assume the remaining sample still tells the truth. Often it does not. It tells you where noise is high, not where pain is high.
Retries make this worse. One bad dependency call can trigger five retries in an API route, ten more in a queue worker, and another wave from a scheduled job. You do not have fifteen or more separate problems. You have one problem multiplied by code behavior. If you sample those events without retry awareness, the repeated failure can drown out rarer crashes that hit fewer users but block them completely.
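One possible form of retry awareness is to keep only the first and the final failed attempt of each operation and count the rest. The sketch assumes your events carry `operation_id`, `attempt`, and `max_attempts` fields, which many setups will need to add themselves.

```python
def retry_aware_filter(events):
    """Keep the first and final failed attempt per operation; count the rest."""
    kept = []
    suppressed = {}  # operation_id -> number of retry events not stored
    for event in events:
        is_first = event["attempt"] == 1
        is_final = event["attempt"] >= event["max_attempts"]
        if is_first or is_final:
            kept.append(event)
        else:
            op = event["operation_id"]
            suppressed[op] = suppressed.get(op, 0) + 1
    return kept, suppressed

# One operation that fails on all five attempts.
attempts = [
    {"operation_id": "sync-42", "attempt": n, "max_attempts": 5} for n in range(1, 6)
]
kept, suppressed = retry_aware_filter(attempts)
print(len(kept), suppressed)  # 2 {'sync-42': 3}
```

The first attempt shows what broke and the final one shows that retries did not recover. The suppressed count preserves the volume signal without storing every copy.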
Warnings can trick you too. Products often keep a large share of noisy warnings because they look harmless, then drop rare fatal errors because the budget is already gone. That trade is backward. A warning that appears 50,000 times and hurts nobody matters less than a crash that breaks checkout for 12 paying customers.
Product changes create another blind spot. Routes change, new jobs appear, and old assumptions stay in the rules for months. A background task that used to be low risk can become user-facing after one product update. If nobody revisits the rules, the plan slowly stops matching the product.
A few habits help:
- check whether repeated events come from retries, loops, or duplicate reporting
- compare dropped events against blocked actions, not raw count
- keep rare crashes longer than common warnings
- review rules after major releases, pricing changes, and new workflows
Volume matters because it affects cost. Customer impact matters more. If a small number of users cannot sign in, pay, export data, or finish a core task, keep those failures even when the graph looks quiet.
Quick checks before you ship new rules
Before you roll out new sampling rules, inspect the busiest parts of the product by hand. Start with the top ten routes or job types by event count. You want to see where volume really comes from, not where you assume it comes from.
This quick pass often finds one noisy endpoint that floods the system and a few quiet paths that break real user flows. If search throws 20,000 harmless parse errors but checkout throws 40 payment failures, the second one needs more protection even with lower volume.
First-seen errors need special attention. New failures are easy to miss when sampling gets aggressive because they begin as a single event. Keep first-seen issues visible at full capture, or close to it, until you know whether they repeat.
Protect the routes users notice first. Login, signup, billing, checkout, password reset, and account access usually belong in the do-not-drop group. A small spike there can turn into support tickets, refunds, or churn long before your total event count looks scary.
Counts alone can fool you. Compare sampled totals with user impact from the same time window. Ask simple questions: how many sessions hit the error, how many users got blocked, and did the failure stop a task people were trying to finish? A background job can produce huge noise. A broken login screen can produce far fewer events and hurt many more people.
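Sampled totals are only comparable once they are scaled back up. If each stored event records the rate it was sampled at (an assumption; the `sample_rate` field here is hypothetical), every kept event stands in for roughly `1 / rate` real events.

```python
def estimated_totals(kept_events):
    """Estimate true event counts per group from sampled events."""
    totals = {}
    for event in kept_events:
        weight = 1.0 / event["sample_rate"]  # one kept event represents ~1/rate events
        totals[event["group"]] = totals.get(event["group"], 0.0) + weight
    return totals

# 40 checkout events kept at 100%, 200 search events kept at 1%.
kept = [{"group": "checkout", "sample_rate": 1.0}] * 40
kept += [{"group": "search", "sample_rate": 0.01}] * 200
totals = estimated_totals(kept)
```

Here the 200 stored search events represent about 20,000 real ones, while the 40 checkout events are the full count. Put those estimates next to unique users blocked before you decide which group matters more.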
One last check matters for the team on call. Severe failures still need to trigger alerts after the new rules go live. Test that path on purpose. Send a sample high-severity error through the routes you protect, and confirm that paging, dashboards, and triage still work the way the team expects.
If these checks pass, the rules are probably safe enough to ship. If even one check looks off, pause and tune the filters before noisy routes bury the failures that matter.
What to review next
Sampling rules age faster than most teams expect. A launch can double traffic on one route, a migration can move failures into a new worker, and a pricing change can bring a different mix of customers and usage. If the rules stay frozen, you often keep the cheap noise and miss the failures people actually feel.
Check your rates after every launch, migration, and pricing update, even when the release looks calm. A route that was safe to sample last month may now touch checkout, billing, or a busy API path. One small product change can turn a minor error into a support problem.
Keep every rule in one short table that the whole team can read in under a minute. It should show the route or job name, which errors stay at 100%, which errors are sampled, the current sample rate, and the review date.
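If the table lives in code or config, a small check can flag rows whose review date has passed. The field names mirror the columns listed above; the class name, rule names, and dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SamplingRule:
    name: str                  # route or job name
    full_capture_errors: list  # errors that stay at 100%
    sampled_rate: float        # current rate for everything else
    review_by: date

def overdue(rules, today):
    """Return the names of rules whose review date has passed."""
    return [rule.name for rule in rules if rule.review_by < today]

rules = [
    SamplingRule("POST /api/payments", ["*"], 1.0, date(2025, 3, 1)),
    SamplingRule("worker.invoice_retry", ["DataLossError"], 0.05, date(2025, 6, 1)),
]
print(overdue(rules, today=date(2025, 4, 15)))  # ['POST /api/payments']
```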
This shared view cuts confusion during incidents. Product, support, and engineering can all see the same rule set instead of guessing why one issue has full detail and another has only a handful of events.
Track three numbers together: cost, missed incidents, and review time. Cost alone pushes teams to sample too hard. Review time alone pushes them to keep too much. Missed incidents tell you whether the plan still protects the routes and jobs that matter.
A simple example makes this obvious. If a new self-serve pricing tier brings many more small customers, your login and billing routes may spike in volume overnight. The right move may be to lower sampling on harmless retries, while keeping full capture for payment failures and account lockouts until traffic settles.
If the rules keep drifting or volume is hard to control, it helps to get a second opinion from someone who works close to production systems. Oleg Sotnikov does that kind of work as a Fractional CTO, with a focus on infrastructure cost, AI-first development, and practical observability for startup and small business teams.
Frequently Asked Questions
Why is a flat sampling rate a bad idea?
One global rate looks simple, but it keeps too much noise and drops too many useful events. A noisy worker or webhook can flood your tracker, while a smaller checkout or login failure gets lost even though it hurts real users more.
Which errors should I always keep at 100%?
Keep full capture for anything that blocks sign-in, payment, checkout, password reset, account access, or data writes that can lose or corrupt customer data. You should also keep security and permission failures at 100%, because one missed event there can cost more than storing the extra volume.
Should I keep first-seen errors at full capture?
Yes. Keep first-seen errors unsampled until someone reviews them. New issues often look small at first because grouping is messy, labels are wrong, or only one customer has hit them so far.
How should I group events before I set sample rates?
Split events by how your product actually runs: web routes, API routes, background workers, and scheduled jobs. Then normalize names like /orders/18372 into stable patterns such as /orders/:id so duplicates do not waste your budget.
What sample rates make sense as a starting point?
Start with business risk, not math. Many teams do well with 100% for checkout, login, and billing, a middle rate for admin paths, and a very low rate for health checks, retries, and other noisy low-risk paths.
What should I do with warnings and validation noise?
Treat warnings much more aggressively than exceptions. Keep a thin sample for browser quirks, retries, and expected validation failures so you can spot trends, but do not let them bury errors that stop users from finishing real tasks.
How do I measure impact instead of just counting errors?
Track unique users, accounts, or workspaces next to total events. A loop that hits one user 10,000 times usually matters less than a payment error that hits 300 users once each.
What should I check before I ship new sampling rules?
Look at the busiest routes and jobs by hand, then run the rules against recent event data before you turn them on. After that, send a test high-severity error through protected paths and confirm alerts, dashboards, and triage still work.
How often should I review my sampling plan?
Review it after launches, migrations, pricing changes, and any product update that changes traffic shape. If you never revisit the rules, they stop matching the product and you end up storing cheap noise while missing customer-facing failures.
How do retries and loops distort error volume?
Retries, loops, and duplicate reporting can make one failure look like many separate problems. If you do not account for that, repeated noise eats your budget and hides rarer errors that block users completely.