Mar 28, 2026 · 8 min read

Reduce logging costs by keeping errors, not every success

Reduce logging costs by keeping rich failure evidence, sampling healthy requests, and cutting repetitive success logs that add spend, not insight.

Why success logs pile up fast

Most teams start with good intentions. They log every 200 OK, every completed background job, and every "done" message because it feels safe. Early on, the cost barely shows up. After traffic grows, the same habit turns into a quiet bill that keeps climbing.

Success paths usually happen far more often than failures. If an API handles 2 million healthy requests a day, even one short line per request becomes a huge stream of repeated text. Then the logging stack stores it, ships it, parses it, indexes it, and keeps copies for retention. That is how a tiny message turns into a real observability budget problem.

The cost is not only storage. Ingest fees, indexing work, transfer between services, and longer search times all pile up. Teams often notice the bill first, but the bigger daily pain is noise.

During an incident, people open logs to answer a simple question: what broke, where, and for whom? If the search results are packed with endless success lines, the useful evidence gets buried. A timeout, a bad payload, or a retry storm can sit between thousands of lines that all say the request finished normally.

Repetition also changes behavior. When people see the same harmless line a few thousand times, they stop reading it. That is a problem because logs only help when humans still trust them enough to pay attention.

A simple example makes this obvious. Imagine a signup flow that logs "user created successfully" for every normal account. On a busy day, that line tells you almost nothing, yet it may appear hundreds of thousands of times. If the email step fails for 0.5% of users, those few failure records matter far more than the mountain of healthy confirmations around them.

If you want to reduce logging costs, start with volume, not rare edge cases. In most systems, healthy traffic creates the biggest pile. The usual suspects are request success logs, finished cron jobs, queue workers that report every normal completion, and health checks that write the same line all day.

What deserves full detail

Keep detail where it pays off: failures. When a request breaks, a short "500" line does not help much. You need enough evidence to tell the full story after the fact, even if nobody saw the incident live.

Start with the basics every failure needs. Keep the exact timestamp, request ID or trace ID, service name, route or job name, build version, and stack trace. Add the final error message and the status code the user or caller received. That gives your team a fast path from alert to root cause.

Some errors need more context because the first message is often vague. Capture details like:

retry count and backoff state
timeout length and which dependency timed out
validation errors with the field name and rule that failed
upstream status codes from databases, queues, auth services, or payment providers
safe business context such as tenant ID, region, or feature flag state

Context like this matters because a log should explain why the failure happened, not just confirm that it happened. If a worker failed because Redis timed out after three retries, write that clearly. If an API call failed because the payload missed a required field, log the field name and rule, not the entire raw payload.
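A compact failure record with those fields might look like the sketch below. It uses only Python's standard library. The service name, job name, version string, IDs, and the shape of the `context` dictionary are hypothetical placeholders, not a required schema.

```python
import json
import logging
import traceback
from datetime import datetime, timezone

logger = logging.getLogger("worker")


def build_failure_record(*, service, route, version, request_id,
                         status_code, error, stack_trace, context):
    """Assemble one failure record with the basics plus explanatory context."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "error",
        "service": service,
        "route": route,
        "version": version,
        "request_id": request_id,
        "status_code": status_code,
        "error": error,
        "stack_trace": stack_trace,
        # Why it failed: retries, timeouts, upstream codes, safe business IDs.
        "context": context,
    }


try:
    # Stand-in for a real dependency call that gave up after three retries.
    raise TimeoutError("Redis timed out after 3 retries")
except TimeoutError as exc:
    record = build_failure_record(
        service="signup-worker",          # hypothetical service name
        route="send_verification_email",  # hypothetical job name
        version="2026.03.1",              # hypothetical build version
        request_id="req-7f3a",            # hypothetical request ID
        status_code=503,
        error=str(exc),
        stack_trace=traceback.format_exc(),
        context={
            "dependency": "redis",
            "timeout_ms": 500,
            "retry_count": 3,
            "tenant_id": "t-1042",
            "region": "eu-west",
        },
    )
    logger.error(json.dumps(record))
```

The record names the dependency, the timeout, and the retry count, and it leaves out the raw payload entirely.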

Keep the context that helps someone fix the issue in one pass. Drop bulky data that rarely helps, like full request bodies, giant response blobs, or every header. Smaller teams feel this fast. A clean failure log can save 20 minutes of digging across dashboards and traces.

Privacy rules still apply when the app is on fire. Do not store passwords, API keys, session tokens, raw personal data, or full payment details. Mask, hash, or remove them before the log leaves the app. If support needs to match a user record later, a partial identifier is usually enough.
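One way to apply those rules is a small redaction step that runs before any record is written. This is a minimal sketch: the key names in `SENSITIVE_KEYS`, the `email` and `account_id` fields, and the `HASH_KEY` value are assumptions for illustration, and a real key should come from your secret store.

```python
import hashlib
import hmac

HASH_KEY = b"load-from-secret-store"  # assumption: placeholder, not a real key
SENSITIVE_KEYS = {"password", "api_key", "session_token", "card_number"}  # assumed names


def partial_id(value, keep=4):
    """Mask all but the last few characters so support can still match a record."""
    value = str(value)
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def keyed_hash(value):
    """Keyed hash: the same input correlates across lines without storing it."""
    digest = hmac.new(HASH_KEY, value.lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact(fields):
    """Return a copy of fields that is safe to log."""
    safe = {}
    for key, value in fields.items():
        if key in SENSITIVE_KEYS:
            continue  # drop secrets entirely
        if isinstance(value, dict):
            safe[key] = redact(value)  # nested objects get the same treatment
        elif key == "email":
            safe["email_hash"] = keyed_hash(str(value))
        elif key == "account_id":
            safe[key] = partial_id(value)
        else:
            safe[key] = value
    return safe
```

For example, `redact({"password": "x", "account_id": "1234567890"})` returns `{"account_id": "******7890"}`. A keyed hash is used instead of a plain one because short, guessable values such as email addresses are easy to recover from an unkeyed hash.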

Good failure logs are specific, compact, and safe. When something breaks at 10:42, your team should see what failed, where it failed, and what to try next without reading a wall of noise.

How to treat healthy traffic

Most healthy requests do not need a full log entry. If 50,000 requests finish exactly as expected, storing 50,000 near-identical success lines adds cost, not insight. Teams that want to reduce logging costs usually get the fastest win here.

For routine traffic, keep totals and timing as metrics instead of raw logs. Count successes by route, status code, region, and release version. Track latency percentiles and retry rates too. Those numbers tell you whether the system stays healthy without filling storage with endless "200 OK" records.
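The idea can be sketched with in-memory counters. In a real service you would send these numbers to whatever metrics system you already run. The structures below are a stand-in for that client, and the route, region, and version values are made up.

```python
import math
from collections import Counter, defaultdict

success_counts = Counter()
latencies_ms = defaultdict(list)


def record_request(route, status_code, region, version, latency_ms):
    """Count the request and keep its latency; write no log line."""
    success_counts[(route, status_code, region, version)] += 1
    latencies_ms[route].append(latency_ms)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# 100 healthy requests produce two numbers instead of 100 log lines.
for ms in range(1, 101):
    record_request("/login", 200, "eu-west", "v1", ms)

total = success_counts[("/login", 200, "eu-west", "v1")]
p95 = percentile(latencies_ms["/login"], 95)
```

Keeping raw latencies in a list is only acceptable in a sketch. Production metrics libraries use histograms or summaries so memory stays bounded.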

Then keep a small sample of successful requests. For a stable feature, 0.5% to 1% is often enough. That sample gives engineers real examples of headers, payload shapes, and timings when they need to compare a healthy case with a broken one.

New code needs a closer watch. If your team just changed auth, billing, or a background job, raise the sample rate for that path for a short period. Five percent for a week is often more useful than one hundred percent forever. Once the release settles and the metrics stay normal, lower it again.
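A sampling rule like this fits in a standard `logging.Filter`. The sketch below assumes each success line carries a `route` attribute passed through `extra`. The route name and the rates are examples, not recommendations for your traffic.

```python
import logging
import random

DEFAULT_SAMPLE_RATE = 0.01                       # 1% of stable, healthy traffic
ROUTE_SAMPLE_RATES = {"/billing/charge": 0.05}   # hypothetical path that just changed


class SuccessSampler(logging.Filter):
    """Pass every WARNING and above; pass only a sample of lower levels."""

    def __init__(self, default_rate, route_rates=None, rng=None):
        super().__init__()
        self.default_rate = default_rate
        self.route_rates = route_rates or {}
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never sample away failure evidence
        route = getattr(record, "route", None)
        rate = self.route_rates.get(route, self.default_rate)
        return self.rng.random() < rate


logger = logging.getLogger("api")
logger.setLevel(logging.INFO)
logger.addFilter(SuccessSampler(DEFAULT_SAMPLE_RATE, ROUTE_SAMPLE_RATES))

# About 5 in 100 of these reach the handlers; errors always do.
logger.info("request ok", extra={"route": "/billing/charge"})
```

Lowering the rate after a release is then a one-line change to `ROUTE_SAMPLE_RATES`. A filter attached to a logger only sees records logged on that logger, so attaching the same filter to a handler is the usual choice when child loggers should be sampled too.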

Keep a few clean reference examples for each major user path. Pick flows people use every day, such as sign in, search, checkout, or report export. Save enough detail to show what normal looks like: request metadata, timings, downstream calls, and the final result. When an incident starts, those examples save time because the team can compare a bad trace with a good one right away.

A simple default works well for many teams:

Count every successful request as a metric.
Sample a thin slice of stable traffic.
Raise sampling for changed code and risky releases.
Keep a handful of known-good examples for major paths.

This approach gives you coverage without noise. You still see traffic volume, speed, and success rates across the whole system. You still have real success traces when you need them. You just stop paying to store the same healthy event over and over again.

Build rules before you delete noise

Cutting logs without rules is how teams lose the one record they need during an outage. Start with the flows users actually notice: signup, login, checkout, password reset, billing, imports, and API calls that bring in money or move sensitive data.

Write those flows down in one place. A short file in the repo works well because engineers can review changes with the code. This is the part many teams skip when they try to reduce logging costs, and they usually regret it later.

For each event in those flows, choose one of three levels:

Full log: when failure details matter for debugging, audits, or support
Sampled log: when the event is healthy and very common
Metric only: when you only need counts, latency, or error rate

Keep the rule simple. A failed payment needs request context, error details, and trace IDs. A successful health check probably needs a counter and latency metric, not a full JSON blob every few seconds.
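The three levels can live in a small reviewed file next to the code. Here is one possible shape. The event names are hypothetical, and defaulting unknown events to full logging is a design choice in this sketch, made so that a new failure type is never dropped by accident.

```python
from enum import Enum


class LogRule(Enum):
    FULL = "full"
    SAMPLED = "sampled"
    METRIC_ONLY = "metric_only"


# Hypothetical event names; the real list comes from your own flows.
LOGGING_POLICY = {
    "payment.failed": LogRule.FULL,
    "login.failed": LogRule.FULL,
    "signup.succeeded": LogRule.SAMPLED,
    "healthcheck.ok": LogRule.METRIC_ONLY,
}


def rule_for(event):
    """Look up the rule; unknown events fall back to FULL until someone decides."""
    return LOGGING_POLICY.get(event, LogRule.FULL)
```

Because the policy is plain data, a change to it shows up in code review like any other diff.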

Retention should follow value, not one global number. High-detail failure logs often earn a longer window because people read them during incidents. Sampled success logs can expire much sooner. Metrics can stay longer if they are cheap and help with trends.

A practical application logging policy might look like this: keep auth failures and billing errors for 30 to 90 days, keep sampled success logs for 3 to 7 days, and keep aggregate metrics much longer. The exact numbers depend on your support load, compliance needs, and budget.
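Written as data, that example policy could look like the sketch below. Retention is normally enforced by the log backend rather than by application code, so treat this as a way to document and test the numbers. The type names are hypothetical, and the one-year window for metrics is an assumption standing in for "much longer".

```python
from datetime import datetime, timedelta, timezone

RETENTION = {
    "auth_failure": timedelta(days=90),
    "billing_error": timedelta(days=90),
    "sampled_success": timedelta(days=7),
    "aggregate_metric": timedelta(days=365),  # assumption: stands in for "much longer"
}


def is_expired(log_type, written_at, now=None):
    """True when a record of this type has outlived its retention window."""
    now = now or datetime.now(timezone.utc)
    return now - written_at > RETENTION[log_type]
```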

You also need a clear owner. If nobody owns the rules, anyone can add noisy logs and nobody cleans them up. Pick one engineer or team lead to approve logging changes, and ask product or support for input on flows that affect customers. If a fractional CTO helps set the first version, an internal owner should still control day-to-day updates.

That gives you a system people can follow. When an alert fires, the team knows which evidence exists, how long it stays, and who decided the rule.

Roll it out step by step

Start with numbers. Pull 7 to 30 days of log volume by service, environment, and log level, then attach a real cost to each bucket. Teams often debate logging in the abstract. The bill ends that debate fast, and it shows which service creates the most repetitive success noise.

Include retention in that baseline. A chatty service kept for 30 days can cost more than a busier service kept for 3. That simple view tells you where to act first.
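A baseline like this takes only a few lines to compute from a usage export. The rows, service names, and retention values below are invented for illustration. Multiply the result by your own per-GB price to get a cost.

```python
from collections import defaultdict

# Hypothetical export rows: (service, environment, level, GB written per day)
rows = [
    ("checkout-api", "prod", "info", 18.0),
    ("checkout-api", "prod", "error", 2.0),
    ("cron-runner", "prod", "info", 4.0),
]
RETENTION_DAYS = {"checkout-api": 3, "cron-runner": 30}  # assumed settings


def stored_gb_by_service(rows, retention_days):
    """Steady-state stored volume: daily volume times retention, largest first."""
    daily = defaultdict(float)
    for service, _environment, _level, gb_per_day in rows:
        daily[service] += gb_per_day
    stored = {svc: gb * retention_days[svc] for svc, gb in daily.items()}
    return dict(sorted(stored.items(), key=lambda item: item[1], reverse=True))


baseline = stored_gb_by_service(rows, RETENTION_DAYS)
```

With these numbers, the quieter `cron-runner` holds 120 GB while the busier `checkout-api` holds 60 GB, because of the longer retention window.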

Change one service first, not ten. Pick the noisiest path that still feels safe, like health checks, background jobs, or a signup success message that fires all day. Keep full error logs there. For healthy requests, keep a small sample and store only the fields people actually search, such as request ID, route, latency, and status code.

Test the blind spots

After the first cut, check everything that depended on those success logs. Some alerts read success counts from logs instead of metrics. Some dashboards break because they were built on log queries. During an incident, the on-call person may search for a message that no longer exists. Fix those gaps before you expand the policy.

A short review after each change keeps surprises small:

Compare log volume before and after
Check ingestion and storage cost
Time a few common searches
Replay a recent incident or run a drill
Ask the on-call team what got harder

Then move to the next service. That pace may feel slow, but it is much safer than deleting noise across the whole stack and discovering the missing evidence during an outage. Review the new volume, search speed, and monthly cost after each step. If people can still investigate failures quickly, you can expand the same application logging policy to the next noisy service and reduce logging costs without losing the logs that matter.

Example: a signup flow

A signup flow shows the logging tradeoff clearly. Most signups work the same way every day, so writing a full log for every success burns storage fast and gives almost nothing back. If you want to reduce logging costs, this is often the first place to clean up.

Picture a simple path: a user submits an email and password, your app checks the form, creates the account, sends a verification email, and returns a success message. That path may run thousands of times a day. The useful part is not a giant pile of identical success records. The useful part is knowing when the path breaks, slows down, or changes.

Keep full detail when something goes wrong. If the form fails validation, log the exact field errors, the request ID, the app version, and any rule that blocked the signup. If the email provider rejects the verification email or times out, keep the whole failure trail. That evidence helps a team fix the real problem instead of guessing.

For successful signups, log only a small sample. You might keep 1 out of every 100 successful requests, with a request ID and a short summary of the steps that passed. That gives you enough real examples to inspect later without paying to store every duplicate success.

Do not hide volume in the process. Count every successful signup in metrics, even when you do not keep the full log record. Your dashboard should still show total signups, validation failure rate, email send failures, and latency. Metrics show trends. Logs explain exceptions.
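Put together, the signup rules could be sketched like this. The metric names and the in-memory `kept_logs` list are stand-ins for a real metrics client and log sink. Hashing the request ID is one way to pick 1 in 100, chosen here so the same request always gets the same decision.

```python
import hashlib
from collections import Counter

metrics = Counter()
kept_logs = []  # stand-in for the real log sink


def keep_sample(request_id, one_in=100):
    """Deterministic sampling: roughly one request ID in `one_in` is kept."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0


def record_signup(request_id, ok, detail=None):
    """Count every signup; log all failures and only a sample of successes."""
    if ok:
        metrics["signup.success"] += 1
        if keep_sample(request_id):
            kept_logs.append({"request_id": request_id, "result": "ok", "steps": detail})
    else:
        metrics["signup.failure"] += 1
        kept_logs.append({"request_id": request_id, "result": "failed", "detail": detail})


for i in range(10_000):
    record_signup(f"req-{i}", ok=True, detail=["validated", "created", "email_sent"])
record_signup("req-bad", ok=False, detail={"step": "email", "error": "provider timeout"})
```

The dashboard still sees all 10,000 successes through `metrics`, while `kept_logs` holds roughly one percent of them plus every failure.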

One more habit helps a lot: save one clean trace for the full signup path on each release. Run a test signup after deployment and keep that trace as a known-good example. When a later bug appears, your team can compare the broken request with a trace that proves the path worked in the current version.

That setup keeps the signal. It drops the noise. More importantly, it leaves you with the records people actually read when a signup stops working.

Mistakes teams make

The cheapest log is the one you never store, but teams often cut the wrong lines first. To reduce logging costs, keep the evidence that explains failure, and trim the noise that adds nothing.

A common mistake is deleting context along with repetitive success messages. If an error log says "payment failed" but drops the request path, tenant ID, response code, retry count, and timing, nobody can diagnose it fast. You save a little storage and lose hours in incident response.

Teams also go too far with sampling. They keep only a tiny slice of normal traffic, then wonder why latency spikes never show up in logs. If one slow request out of 500 matters to users, your sample rate has to catch it often enough to spot a pattern.
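A quick calculation shows how thin a sample can get. The sketch assumes sampling is random and independent of whether a request is slow, and the 50,000 requests per day figure is only an example.

```python
def expected_slow_samples(requests_per_day, slow_fraction, sample_rate):
    """Average number of slow requests that land in the sample each day."""
    return requests_per_day * slow_fraction * sample_rate


def chance_of_at_least_one(requests_per_day, slow_fraction, sample_rate):
    """Probability that one day's sample contains at least one slow request."""
    p_hit = slow_fraction * sample_rate
    return 1 - (1 - p_hit) ** requests_per_day


for rate in (0.005, 0.01, 0.05):
    avg = expected_slow_samples(50_000, 1 / 500, rate)
    chance = chance_of_at_least_one(50_000, 1 / 500, rate)
    print(f"sample rate {rate:.1%}: {avg:.1f} slow samples/day, "
          f"{chance:.0%} chance of at least one")
```

At a 0.5% sample rate, one slow request in 500 shows up in the logs about once every two days on average, which is too thin to reveal a pattern. At 5% it appears about five times a day.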

Retention causes another quiet mess. Many teams pick one number, like 30 days, and apply it to everything. That sounds tidy, but it wastes money on low-value entries and deletes failure history too soon. Error logs, audit trails, and sampled success logs rarely need the same lifespan.

Debug output is another budget killer. A team turns it on during a release, fixes the issue, and forgets to turn it off. Two weeks later, every request writes extra payloads, headers, and internal state that nobody reads. Storage climbs, search gets slower, and useful signals sink in the noise.
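One guard against forgotten debug output is to give the setting an expiry date. This is a minimal sketch. The `DEBUG_UNTIL` value is a hypothetical setting that would normally come from config, and the level needs to be re-evaluated at startup or on a timer for the deadline to take effect.

```python
import logging
from datetime import datetime, timedelta, timezone


def effective_level(debug_until, now=None, normal=logging.INFO):
    """Return DEBUG only while the deadline is still in the future."""
    now = now or datetime.now(timezone.utc)
    if debug_until is not None and now < debug_until:
        return logging.DEBUG
    return normal


# Hypothetical setting: debug output switches itself off after 48 hours.
DEBUG_UNTIL = datetime.now(timezone.utc) + timedelta(hours=48)

logger = logging.getLogger("api")
logger.setLevel(effective_level(DEBUG_UNTIL))
```

Nobody has to remember to turn it off. Extending the window means setting a new date, which is a visible, deliberate change.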

Support work suffers when teams remove all proof of successful paths. You do not need every successful request forever, but you usually need some evidence. A customer says, "I signed up and never got the email." A small sample of healthy signup events shows what a normal send looks like, but it will rarely contain that one customer's request. For the few steps support asks about often, keep a compact record for every event, such as a message ID and timestamp, so support can check what happened without guessing.

A few warning signs show up early:

Search results are full of repeated 200 OK lines
On-call engineers open logs and still cannot explain the failure
Storage grows fast after every release
Support asks engineering for database access to answer basic cases

Oleg Sotnikov often works with teams that run lean infrastructure, and this is one place where discipline matters. Logging less is not the goal. Logging what helps is.

Quick checks before you ship

A lean logging setup should still answer the questions your team gets at 2 a.m. If a customer says "payment failed" or "I never got the email," your logs should show what happened without forcing anyone to guess.

Run a short test window before you roll the new policy out to all of production. Use real requests from staging or a small slice of live traffic, then check the results like a skeptic.

Trigger one real failure on purpose. Make sure the logs show the request path, timestamp, service name, error message, and enough context to explain the user problem from the logs alone.
Follow one healthy request all the way through. You do not need every success event, but you should still be able to trace a normal request across services with one request ID or trace ID.
Fire a few alerts in test mode. If your alerts depended on noisy success logs, they may go quiet after cleanup. Check that they still trigger from error rates, latency, or failed jobs.
Inspect the payloads you kept. Remove tokens, passwords, session data, email addresses, and anything else a support agent does not need to read.
Compare cost before and after. If storage, ingest, or query spend did not drop during the test window, you probably kept too much noise.

One small example helps. Say a signup request fails because the email service times out. You should see the signup attempt, the timeout, the retry result, and the final response sent to the user. For a normal signup, one sampled trace and a few summary metrics are usually enough.

This is where teams often miss the point. They reduce logging costs, but they also remove the evidence needed to debug real problems. Good cleanup cuts repetition, not clarity.

If your team can explain a broken request, trace a healthy one, trust the alerts, confirm that private data is gone, and see lower spend in the test window, the policy is ready for production.

What to do next

Start with a one-page application logging policy. Keep it short enough that engineers will read it and specific enough that they can use it during a normal sprint. Define three things: which failures need full detail, which healthy flows you only sample, and how long each log type stays in storage.

A simple draft often works better than a perfect one. If a service owner can answer "do we log this, sample it, or drop it?" in under a minute, the policy is doing its job.

Keep complete error logs for failed requests, exceptions, retries, timeouts, and any action tied to money, auth, or data changes.
Sample healthy requests for common flows such as login, search, and signup. Pick a low default rate, and raise it only for a short period when a path has just changed.
Set shorter retention for noisy operational logs and longer retention for audit or security events.
Write down the fields every service must keep, such as request ID, user or tenant ID when allowed, service name, version, and timing.
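The required-fields rule is easy to check automatically, for example in a unit test or a CI step. The field names below follow the list above but are otherwise assumptions. Adjust them to whatever your policy actually names.

```python
REQUIRED_FIELDS = ("request_id", "service", "version", "duration_ms")  # assumed names


def missing_fields(record):
    """Return the required fields that are absent or empty in a log record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


sample = {"request_id": "req-7f3a", "service": "signup-api", "duration_ms": 42}
problems = missing_fields(sample)  # the sample record has no "version" field
```

A check like this catches a service that quietly stops sending a field before the gap shows up during an incident.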

Then run a two-week trial on one service, not the whole stack. Compare log volume, storage spend, and alert quality before and after the change. If you reduce logging costs but engineers lose the details they need during incidents, adjust the sampling rate or add one more field to failure logs.

Ask support and engineering the same plain question: "What missing detail slows you down when something breaks?" Their answers usually point to a small set of fields that matter a lot more than thousands of success lines. Support may need account IDs and timestamps. Engineers may need trace IDs, release version, and the exact upstream error.

Log sampling and retention rules also need an owner. Put one person in charge of reviewing new services and checking whether teams follow the policy. Without that, noisy defaults creep back in within a month.

If your team wants an outside review, Oleg Sotnikov can look at log noise, infra spend, and team habits in a Fractional CTO consultation. That kind of review helps most when costs keep rising and nobody can explain which logs still earn their place.