Nov 13, 2024·7 min read

Observability cost audit for teams with rising bills

Run an observability cost audit to spot noisy labels, duplicate telemetry, and bad retention rules before logs and metrics cut into margin.

Why spend rises before traffic does

Observability bills often rise for a simple reason: telemetry grows by combinations, not just by visits. Traffic can stay flat while storage, indexing, and query load double in a month.

Metrics are a common cause. Add one more label to a busy metric and you can create a flood of time series. A counter split by service and region might stay manageable. Add build_id, tenant_id, or user_id, and the count jumps fast. That's metric label cardinality in plain English: every distinct combination of label values becomes its own series to store and query.
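A rough way to see the multiplication is to treat the series count as the product of distinct values per label. The sketch below uses made-up numbers (12 services, 4 regions, 500 tenants) and gives an upper bound, since real systems only store combinations that actually occur.

```python
from math import prod

def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical label sizes, chosen only to illustrate the jump.
base = {"service": 12, "region": 4}
with_tenant = {**base, "tenant_id": 500}

print(max_series(base))         # 48
print(max_series(with_tenant))  # 24000
```

One extra label turned 48 possible series into 24,000 without a single extra request.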

Logs waste money in a different way. Teams add a new agent, SDK, or cloud forwarder during a migration, then forget to remove the old path. Dashboards still work, so nobody notices. The bill notices.

Retention is another quiet leak. Many tools keep everything for the default period even when the data is only useful for a few days. Debug logs might help right after a release, but keeping them for 30 or 90 days rarely makes sense if nobody reads them after the first incident window.

The same problems show up again and again. A developer adds a high-cardinality label to a busy metric. Two collectors send the same logs from one service. Staging or preview environments ship almost as much data as production. Trace and debug data sit in storage far longer than anyone needs.

A small SaaS team can hit all four without adding a single customer. One deploy changes metric labels, another adds a second log pipeline, and default retention stays untouched. The next invoice looks like growth, but much of it is waste.

Start the audit with structure, not traffic charts. Series count, ingestion paths, and retention usually explain the jump faster than a traffic report.

Map what you collect today

Before you change sampling, retention, or dashboards, write down every telemetry stream that enters your stack. This part is dull, but it usually finds the first waste. Most teams know the big buckets and still miss side paths like an old shipper from a past migration or browser errors sent to two tools.

A simple table works better than a diagram. Give each row one stream, not one vendor account. Split application logs, infrastructure metrics, traces, browser events, database logs, and error events into separate rows. If you use Prometheus, Loki, Grafana, Sentry, OpenTelemetry, or cloud services, note where each stream lands and whether another pipeline keeps a copy.

Keep the columns simple and consistent: source and scope, where the data goes, daily volume, monthly cost, retention days, purpose, and owner. A row like "API logs from production" or "browser errors from checkout" is clear enough.
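If a spreadsheet feels too loose, the same table can live in a small script. This is a minimal sketch, assuming the columns listed above; the rows, costs, and team names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stream:
    source: str                 # source and scope, e.g. "API logs from production"
    destination: str            # where the data lands
    daily_gb: float
    monthly_cost_usd: float
    retention_days: int
    purpose: str
    owner: Optional[str]        # None means nobody has claimed it yet

# Hypothetical rows; replace with your own streams and numbers.
inventory = [
    Stream("API logs from production", "log search", 40.0, 600.0, 30, "incident debugging", "backend team"),
    Stream("browser errors from checkout", "error tracker", 2.0, 90.0, 90, "release checks", None),
]

def by_cost(streams):
    """Most expensive streams first."""
    return sorted(streams, key=lambda s: s.monthly_cost_usd, reverse=True)

def unowned(streams):
    """Streams with no named owner."""
    return [s for s in streams if not s.owner]

for s in unowned(inventory):
    print(f"No owner: {s.source} ({s.retention_days} days retention)")
```

Sorting by cost and filtering for missing owners are usually the first two questions the table has to answer.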

The owner column matters more than teams expect. Unowned telemetry tends to stay forever, even when nobody reads it. People keep paying for it because no one feels allowed to delete it or shorten retention.

Be strict about purpose. If a stream only helps during incident response, you probably do not need to keep it for 90 days. If finance or compliance needs an audit trail, keep that separate from logs you only use to debug a failed deploy. When those use cases get mixed together, teams usually end up with one bad rule: keep everything too long.

Growing SaaS teams often find the same pattern once they put everything in one table. Access logs sit in one tool for search, a second system stores a duplicate copy for alerts, and raw files stay in object storage "just in case." Each piece sounds reasonable on its own. Together, the overlap is obvious.

Once you can see every stream, where it goes, what it costs, and why it exists, the next cuts get much easier and much safer.

Audit label counts step by step

Label counts are often the best place to start because label explosion can raise your bill long before traffic does. Every new combination of label values creates another time series. One metric can turn into thousands if you attach too much detail to it.

Start by exporting the metrics with the highest series count from your monitoring stack. In Prometheus, the TSDB status page lists the metric names and labels with the most series, and most Prometheus-compatible backends offer a similar cardinality view. If you use Grafana or another dashboard layer, pull the same data into a sheet and add two useful columns: who uses this metric, and which alert or dashboard depends on it.

Then sort each metric by the labels with the most unique values. You want to see which labels multiply the combinations. Labels like service, region, or status code usually stay small. Trouble starts when a label changes on almost every request.
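Both sorts can be done in a few lines once the series are exported. The sketch below assumes a hypothetical export format of one (metric name, labels) pair per active series; your monitoring stack will have its own shape.

```python
from collections import Counter, defaultdict

# Hypothetical export: one (metric_name, labels) pair per active time series.
series = [
    ("http_requests_total", {"service": "api", "path": "/product/1"}),
    ("http_requests_total", {"service": "api", "path": "/product/2"}),
    ("http_requests_total", {"service": "web", "path": "/product/3"}),
    ("queue_depth", {"service": "worker"}),
]

def top_metrics(series, n=10):
    """Return the n metric names with the most series, largest first."""
    return Counter(name for name, _ in series).most_common(n)

def label_cardinality(series, metric):
    """Return (label, distinct value count) pairs for one metric, largest first."""
    values = defaultdict(set)
    for name, labels in series:
        if name == metric:
            for label, value in labels.items():
                values[label].add(value)
    return sorted(((label, len(v)) for label, v in values.items()),
                  key=lambda pair: pair[1], reverse=True)

print(top_metrics(series))
print(label_cardinality(series, "http_requests_total"))
```

In this toy data the path label has the most distinct values, which is the kind of label to inspect first.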

A few labels deserve suspicion right away:

  • user IDs
  • request IDs
  • session IDs
  • raw URLs with IDs or query strings
  • any free-form text turned into a label

A metric like http_requests_total looks harmless until path=/product/48291 becomes a unique series for every item page. Change that to a route name such as /product/:id, and the count drops fast. If you need more detail, put it in logs or traces, not in metric labels.
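Many web frameworks can report the matched route directly, and that is the better source when it exists. Where it does not, a small normalizer is one option. This sketch assumes IDs are purely numeric path segments; UUIDs or slugs would need their own patterns.

```python
import re

# Matches a path segment made only of digits, e.g. "/48291".
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

def route_template(path: str) -> str:
    """Drop the query string and replace numeric path segments with :id."""
    path = path.split("?", 1)[0]
    return _NUMERIC_SEGMENT.sub("/:id", path)

print(route_template("/product/48291"))                # /product/:id
print(route_template("/projects/43/tasks/9?page=2"))   # /projects/:id/tasks/:id
```

Apply the function before the value is used as a label, so the raw path never reaches the metrics backend.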

Some fixes are simple. Replace exact values with buckets. Use response_size_bucket=small|medium|large instead of exact byte counts. Group customers by plan or segment instead of customer ID. Keep detailed data only for sampled requests when you need deeper debugging.
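Bucketing can be a plain function that runs before the label is set. The thresholds here are arbitrary assumptions; pick cutoffs that match how your team actually reads the data.

```python
def response_size_bucket(size_bytes: int) -> str:
    """Collapse an exact byte count into one of three label values."""
    if size_bytes < 10_000:        # assumed cutoff for "small"
        return "small"
    if size_bytes < 1_000_000:     # assumed cutoff for "medium"
        return "medium"
    return "large"

print(response_size_bucket(512))        # small
print(response_size_bucket(250_000))    # medium
print(response_size_bucket(5_000_000))  # large
```

The label now has three possible values no matter how many distinct response sizes the service produces.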

Do not ship label changes blindly. Dashboards and alerts often depend on the old label names and values. Before you roll out changes, copy the most-used dashboards, update them, and compare the old and new views for a day or two. Do the same for alerts. If an alert used to fire on a raw URL and now uses a route name, make sure it still catches the same problem.

This work is tedious, but it pays back fast. In many teams, a handful of noisy labels drive most of the growth. Clean those up first and the rest of the bill becomes much easier to manage.

Find duplicate ingestion

Duplicate ingestion is often the fastest place to cut waste. Teams add tools over time, and old paths keep running after new ones go live. One error event, one log line, or one metric lands in two or three systems, and you pay for every copy.

Start with a plain inventory of every collector and sender. Check hosts, containers, and serverless jobs. Many teams run a host agent, a cluster sidecar, and an app SDK at the same time without meaning to.

Go through each layer in order. List the agents on VMs and bare metal hosts. Check Kubernetes DaemonSets, sidecars, and log shippers. Review app code for OpenTelemetry and vendor SDKs. Inspect serverless functions and batch jobs for hidden wrappers. This does not take long, and it often explains a surprising amount of spend.

Then trace one real event from end to end. Pick a single log line with a unique request ID. Follow it from the app, through collectors, queues, processors, and exporters, until it reaches billable storage. If the same record shows up twice, ask what each copy does. Sometimes one path feeds search and another feeds alerts. Often both copies do the same job.
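The same check scales past a single event if each billable store can export request IDs. A minimal sketch, assuming a hypothetical list of (request ID, destination) pairs gathered from those exports:

```python
from collections import defaultdict

def duplicate_landings(records):
    """Map each request ID to every place it was stored, keeping only IDs stored more than once."""
    seen = defaultdict(list)
    for request_id, destination in records:
        seen[request_id].append(destination)
    return {rid: sorted(dests) for rid, dests in seen.items() if len(dests) > 1}

# Hypothetical export; destination names are placeholders.
records = [
    ("req-1", "log-search"),
    ("req-1", "vendor-platform"),
    ("req-2", "log-search"),
    ("req-3", "log-search"),
    ("req-3", "log-search"),   # same store twice: two shippers on one host
]

print(duplicate_landings(records))
```

Here req-1 lands in two systems and req-3 lands twice in the same one. Both are paid for twice, and only the first might have a reason.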

Mixed setups create most of the overlap. A team sends traces with OpenTelemetry, keeps a vendor tracing SDK in the app, and also runs an agent that scrapes logs and turns them into events. That feels safe, but it often adds cost without adding coverage.

Mirrored pipelines deserve extra scrutiny. They usually appear after migrations, acquisitions, or a rushed incident fix that nobody cleaned up later. If a duplicate path does not support a clear need, turn it off in a controlled test and watch alerts, dashboards, and incident flow for a few days.

One owner per telemetry path solves a lot of this. Give each path a named owner and a short written purpose. If nobody can explain why a pipeline exists, that usually tells you enough.

Set retention by real use case

One retention number for every log and metric is the lazy option, and it usually costs too much. Most teams keep noisy data far longer than anyone reads it, while the data that matters for audits or incident review gets buried in the same pile.

Start with a simple split: what people watch every day, what they check only during an incident, and what they keep for legal or finance reasons. That small change often cuts storage fast without hurting visibility.

Debug logs are usually the first place to trim. They are noisy, repetitive, and useful for a short window while a team fixes a fresh problem. If nobody looks at those logs after a week, keeping them for 30 or 90 days is just paying rent on old noise.

Security events, billing records, and incident timelines are different. You may need them months later because an auditor asks, a customer disputes a charge, or the team needs to reconstruct what happened during an outage. Give those streams a longer window on purpose, not by accident.

A practical policy often looks like this:

  • Debug and trace-heavy application logs: 3 to 7 days
  • Standard app and infrastructure logs: 14 to 30 days
  • Incident review data: 30 to 90 days
  • Security and access logs: longer, based on risk and policy
  • Finance-related events: as long as accounting or legal work requires
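To see what a tier change is worth, note that at a steady ingest rate the stored volume is roughly daily volume times retention days. The sketch below uses the upper end of the first three tiers and an invented 20 GB per day debug stream; the security and finance tiers are left out because the list gives no fixed number for them.

```python
# Upper end of the ranges in the list above.
RETENTION_DAYS = {
    "debug": 7,
    "standard": 30,
    "incident_review": 90,
}

def stored_gb(daily_gb: float, retention_days: int) -> float:
    """Approximate steady-state volume held in hot storage."""
    return daily_gb * retention_days

daily_debug_gb = 20.0  # hypothetical volume
before = stored_gb(daily_debug_gb, 30)                      # 600.0
after = stored_gb(daily_debug_gb, RETENTION_DAYS["debug"])  # 140.0
print(f"{before:.0f} GB -> {after:.0f} GB")
```

How that maps to money depends on whether your vendor bills on stored volume, ingest, or both, so treat the number as a sizing aid, not an invoice forecast.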

If you still need old data for rare lookups, move it to cheaper storage instead of keeping it in your hot observability system. Search will be slower, but that is usually fine for data someone checks twice a year.

Teams using Prometheus, Loki, or Sentry often save more with stream-by-stream rules than with one global setting. It looks less tidy on paper, but it matches how people actually work.

Review retention whenever you add a new product area, enter a regulated market, or change how customers pay. Bills drift because retention drifts.

A simple example from a growing SaaS team

A small SaaS team ran a project management app with a steady bill for logs and metrics. Then they added background jobs for imports, email syncing, and report generation. Traffic went up by about 20%, but their observability bill jumped from roughly $800 a month to almost $2,400.

That gap bothered them because the product did not change much for users. The audit showed that usage had grown, but collection rules had grown much faster.

The first problem was metric label cardinality. Their API metrics stored raw request paths, so /projects/41, /projects/42, and /projects/43/tasks/9 all became separate series. Once job workers started calling internal endpoints with IDs in every path, the number of time series exploded.

The second problem was duplicate telemetry. The team had installed a host agent on each VM, then later added telemetry through their container stack. Both agents sent similar CPU, memory, disk, and process data. They paid twice for data they only looked at once.

Logs added a third layer of waste. The app kept 30 days of verbose application logs, including debug entries for routine background jobs. When the team reviewed actual incidents, they found a simpler truth: most checks happened within the first 72 hours. After that, they rarely opened those noisy logs again.

They cleaned things up in a week. They replaced raw paths with route templates such as /projects/:id/tasks/:id. They removed one set of host agents. They split log retention by use case: three days for verbose debug logs, 14 days for normal app logs, and 30 days only for audit and security records.

The team worried that alerts would go blind after the cleanup. That did not happen. Their dashboards still showed latency, error rate, queue delay, and host health. On-call still had enough context to fix problems.

The bill fell to about $1,050 the next month. That is why cost control usually starts with three plain questions: which labels create too many series, where are two tools sending the same data, and how long do people really need each log type?

Mistakes that keep the bill high

A lot of teams do not have a traffic problem. They have a collection problem. Bills rise because data keeps multiplying in small, ordinary ways, and nobody stops to trim it.

One common mistake is mixing staging and production telemetry in the same pipeline. That sounds harmless until a noisy test run floods your logs, skews dashboards, and burns retention on data nobody needs next month. Staging should help teams move faster, not quietly raise the production bill.

Another expensive habit is keeping every trace for the same length of time. Most teams do not need a full history of healthy requests. They usually need longer retention for errors, rare spikes, and a small sample of normal traffic for comparison.
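One way to express that rule is a keep-or-drop decision per trace: always keep errors and slow requests, and keep a fixed fraction of the rest. This is a sketch, not any vendor's sampler; the 5% rate and the 2,000 ms threshold are assumptions. Hashing the trace ID makes the decision repeatable, so every service reaches the same verdict for the same trace.

```python
import hashlib

def keep_trace(trace_id: str, is_error: bool, duration_ms: float = 0.0,
               sample_rate: float = 0.05, slow_ms: float = 2000.0) -> bool:
    """Keep all errors and slow traces; keep a deterministic sample of healthy ones."""
    if is_error or duration_ms >= slow_ms:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64

print(keep_trace("trace-a", is_error=True))                      # True
print(keep_trace("trace-b", is_error=False, duration_ms=3500))   # True
print(keep_trace("trace-c", is_error=False, sample_rate=0.0))    # False
```

The same split can drive retention: kept errors go to the longer window, and the healthy sample expires sooner.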

Label growth causes a slower kind of damage. Someone adds labels for convenience (build ID, request path, customer note, temporary debug field) and nobody removes them later. Then cardinality climbs, queries slow down, and storage costs creep up month after month. This leak is easy to miss because each label looks small on its own.

Costs also jump when every team installs its own collector, agent, or side pipeline. That creates duplicate telemetry, inconsistent sampling, and three versions of the same metric with slightly different names. One team thinks it is being careful. Across the company, it turns into waste.

The last mistake is simple: nobody reviews spend until renewal time. By then, bad habits have turned into defaults.

A monthly review should catch staging data flowing into long-term production storage, full trace retention for routine requests, labels nobody uses in alerts or debugging, multiple collectors sending the same events, and any cost spike with no owner assigned.

Quick checks before the next invoice

A short audit often takes less than an hour, and it can stop another month of waste. You do not need a full redesign to find savings. You need a small set of numbers that shows what grew, what nobody uses, and what gets stored twice.

Start with the items that move cost the fastest:

  • Pull the 10 metrics with the highest series count. If one metric explodes because of labels like user ID, request ID, or full URL, fix that first.
  • Pull the 10 log streams by daily volume in GB. Big streams usually come from chatty services, debug logs left on, or repeated stack traces.
  • Mark any metric or log stream that feeds no dashboard, no alert, and no report. If nobody reads it, cut it or sample it.
  • Check each workload for two collectors doing the same job. A node agent plus a sidecar, or two log shippers on one host, can double the bill with no benefit.
  • Read your retention rules next to current product needs. Teams often keep 90 or 180 days of data because that was the old default, not because anyone still needs it.

The fastest win usually comes from the first two checks. A single high-cardinality metric can create millions of time series, and one noisy log stream can fill storage all day. Fixing either one often saves more than tuning ten smaller items.
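The third check, finding streams that feed no dashboard, alert, or report, reduces to a set difference once you have the names. The lists below are hypothetical; in practice they would come from exports of your dashboard and alert definitions.

```python
def unused_streams(collected, dashboards, alerts, reports):
    """Return collected metric or log stream names that nothing reads."""
    used = set(dashboards) | set(alerts) | set(reports)
    return sorted(set(collected) - used)

# Hypothetical names for illustration.
collected = ["http_requests_total", "queue_depth", "jvm_gc_debug", "import_job_trace_log"]
dashboards = ["http_requests_total", "queue_depth"]
alerts = ["queue_depth"]
reports = []

print(unused_streams(collected, dashboards, alerts, reports))
```

Anything in the result is a candidate to cut or sample, after a quick check that nobody queries it by hand during incidents.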

Usage matters more than habit. If your team only uses logs to debug incidents from the last 14 days, keep short retention for hot storage and move older data out of the expensive path. The same logic applies to metrics. Keep long retention only for business or capacity trends you actually review.

Even mature stacks miss duplicate ingestion. One service sends logs to the platform, then the host agent sends the same file again. The invoice makes it obvious, but only after the money is gone.

If you repeat these checks before every invoice, cost spikes stop feeling mysterious. They turn into a short list of fixes with clear owners.

What to do next

Do not try to clean up everything at once. Pick one service this week, preferably the one that creates the biggest bill or the noisiest dashboards, and audit only that. One focused pass usually shows the pattern fast: too many labels, the same data sent twice, or retention that no longer matches how people work.

A small plan works better than a big cleanup project. Review one service's logs, metrics, and traces from the last 7 to 30 days. Set a monthly budget guardrail for each signal type so spikes show up early. Name one owner who approves new labels and retention changes. Write down what stays, what gets sampled, and what gets deleted sooner.

The owner part matters more than most teams expect. If nobody approves label changes, cardinality grows by accident. If nobody owns retention, old defaults stick around for months and the bill keeps climbing.

Set guardrails in numbers, not vague goals. Give logs a monthly cap, metrics a separate cap, and traces their own limit. If one area jumps 20% without a matching traffic jump, treat that as a bug and investigate it that week.
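The 20% rule is easy to automate as a monthly check. In this sketch, "without a matching traffic jump" is interpreted as cost growth outpacing traffic growth by at least the threshold, which is one reasonable reading, not the only one.

```python
def needs_investigation(prev_cost: float, cost: float,
                        prev_traffic: float, traffic: float,
                        threshold: float = 0.20) -> bool:
    """Flag a signal type whose cost grew at least `threshold` faster than traffic."""
    cost_growth = (cost - prev_cost) / prev_cost
    traffic_growth = (traffic - prev_traffic) / prev_traffic
    return cost_growth - traffic_growth >= threshold

# Bill from $800 to $2,400 while traffic grows 20%: flagged.
print(needs_investigation(800, 2400, 100, 120))    # True
# Cost and traffic both grow about 20 to 25%: not flagged.
print(needs_investigation(1000, 1200, 100, 125))   # False
```

Run it separately for logs, metrics, and traces so a spike in one signal type does not hide behind savings in another.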

A short written rule helps. New labels need a reason. High-cardinality fields such as user IDs, request IDs, and raw URLs do not belong in metrics unless the team agrees they are worth the cost. Retention should match use case: maybe 30 days for most application logs, longer for security events, and much shorter for noisy debug data.

If you want a second opinion, Oleg Sotnikov at oleg.is advises startups and small teams on architecture, infrastructure, and Fractional CTO work. He has run production stacks with tools like Sentry, Grafana, Prometheus, and Loki, so an observability review can fit into a broader cost and systems audit.

That is enough to get moving. One service, one owner, one budget, and one written policy will do more than another month of hoping the next invoice looks better.

Frequently Asked Questions

Why is my observability bill going up even though traffic looks flat?

Traffic is only one input. Your bill can jump when one busy metric gains a high-cardinality label, when two pipelines send the same logs, or when noisy data sits in storage far longer than anyone uses it.

Which metric labels should I check first?

Start with the metrics that have the highest series count. Labels like user_id, request_id, raw URLs, session IDs, and any free-form text usually drive the fastest cost growth.

Should I keep user IDs and raw URLs in metric labels?

Usually no. Put that detail in logs or traces instead. In metrics, use route templates like /orders/:id or small groups such as plan type so series count stays under control. Make an exception only when the team agrees the extra series are worth the cost.

How do I find duplicate ingestion?

Trace one real event from the app to storage and see where it lands. If the same log line or error shows up twice, you likely run two collectors or left an old pipeline in place after a migration.

What should I put in a telemetry inventory?

Make a simple table with one row per stream, such as API logs, browser errors, traces, or database logs. For each row, note the source, destination, daily volume, monthly cost, retention, purpose, and owner.

How long should I keep debug logs?

Keep debug logs for the short window when people actually read them. For many teams, 3 to 7 days works fine, while normal app logs often need 14 to 30 days and audit or security data may need longer.

Should staging and production share the same telemetry pipeline?

Usually no. Staging and preview systems can flood storage with noisy test data and distort dashboards. Send them to a separate place or give them much shorter retention.

What is the fastest audit I can run before the next invoice?

Pull the 10 metrics with the highest series count and the 10 log streams with the most daily volume. One noisy metric or one chatty log stream often costs more than many small fixes.

Will shorter retention or label cleanup break my alerts?

They can if you change them without checking. Copy the dashboards and alerts that matter most, update them to the new labels or retention rules, and compare old and new views for a day or two before you remove the old path.

Who should own observability cost and telemetry rules?

Give one person clear ownership for each telemetry path and for label and retention changes. When nobody owns those rules, labels pile up, defaults stay forever, and costs drift month after month.