Loki vs ClickHouse for long log retention and query speed
Loki vs ClickHouse for long log retention: compare search speed, storage cost, team effort, and the trade-offs that shape observability spend.

Why long log retention becomes a budget problem
Teams rarely choose six months of logs on day one. They start with 7 or 30 days, then extend it bit by bit. Support needs older data. Security asks for a longer window. An audit arrives late. Daily ingest barely changes, but the bill keeps creeping up.
When people compare Loki and ClickHouse for long retention, they usually look at storage price first. That misses most of the cost. Long retention puts pressure on three things at once: where the data lives, how quickly people can search it, and how much work the system creates for the team.
Those three do not stay cheap together for long. Fast search usually needs more indexing, more compute, or both. Cheap storage often means slower reads. Lower maintenance usually means paying somewhere else for a simpler setup.
Another mistake is treating logs like cold backups. Logs are stored, but they are also queried under pressure. During an incident, a five-minute search delay is not a small annoyance. It slows the investigation, burns engineer time, and can keep a customer issue open longer than it should be.
The budget usually spreads across a few places: raw storage, index or metadata storage, compute for ingestion and queries, and engineer time for tuning, retention rules, and cleanup. A poor fit wastes money even when daily ingest stays flat. One team might keep sending the same 200 GB a day, yet costs still rise because broad queries get heavier, old data gets slower to search, and more hardware gets added just to keep response times tolerable.
That is why retention decisions look fine at 30 days and painful at 180. The system seems affordable while the data set is young. The trouble starts later, when old logs stop feeling like cheap storage and start behaving like a second production system.
How Loki changes at 30, 90, and 180 days
Loki is not a general database. It stores compressed log chunks and relies on labels to narrow the search. That works well when engineers mostly inspect recent logs during incidents, deploys, and support work.
At 30 days, Loki usually feels good if label design is clean. A team filters by service, environment, region, or tenant and gets to recent logs quickly. Narrow questions are where Loki feels most comfortable.
At 90 days, small label mistakes start showing up in the bill and in search times. Broad labels force queries to scan too much data. Very detailed labels push cardinality up, increase memory use, and make the system harder to run. Teams often learn this the hard way after someone runs a wide search across many services and a long time range.
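The cardinality problem is easy to underestimate, because Loki keeps a separate stream for every unique combination of label values. A minimal Python sketch of the upper bound follows. The label names and value counts are made-up assumptions for illustration, not measurements from a real system.

```python
from math import prod


def max_streams(distinct_values: dict[str, int]) -> int:
    """Upper bound on streams: one per unique combination of label values."""
    return prod(distinct_values.values())


# Hypothetical label sets; the counts are illustrative only.
clean_labels = {"service": 20, "environment": 3, "region": 4}
noisy_labels = {**clean_labels, "request_id": 1_000_000}

print(max_streams(clean_labels))  # 240
print(max_streams(noisy_labels))  # 240000000
```

Real stream counts are usually lower than this bound, since not every combination occurs, but one label that changes on almost every line is enough to multiply the total.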
By 180 days, the trade-off is clear. Storage can still look cheap because older chunks sit in object storage, and object storage costs much less than hot disks. The problem shows up when people search that older data. Broad queries often slow down because Loki still has to open and scan many chunks before it finds the lines you need.
The rough pattern is simple:
- 30 days: usually comfortable for daily debugging
- 90 days: label design and query habits start deciding cost and speed
- 180 days: retention can stay cheap, but wide searches on old data often get frustrating
That makes Loki a strong fit for teams that mostly care about fresh operational logs. If developers spend their time checking the last few hours or days, Loki is often a sensible choice. If support, security, or analytics teams ask open-ended questions across months of history, Loki can feel slow even when storage still looks inexpensive.
That matches a practical lesson from Oleg Sotnikov's work with Grafana and Loki at scale: decide early what labels mean, what people really search for, and how often anyone needs old logs. Loki rewards discipline. It does not forgive messy labels for long.
How ClickHouse changes at 30, 90, and 180 days
ClickHouse treats logs more like analytics data. That changes the trade-off quickly. With a good table design, it can scan large time ranges fast and compress repeated fields very well.
At 30 days, ClickHouse often looks impressive. Recent data stays easy to search, dashboards feel quick, and storage use may come in lower than expected because column compression works well on fields like service name, status code, region, and log level.
At 90 days, the early design choices start to matter. If logs land in sensible partitions, use a clear timestamp column, and expose the fields people actually query, ClickHouse still feels fast. If the schema is loose, or every search has to unpack raw text or JSON, speed drops and costs rise because the database reads more data than it should.
A few decisions shape the next several months: how data is partitioned, which fields get their own columns, how long hot data stays before moving to cheaper storage, and who owns retention and cleanup.
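As a rough illustration of those decisions, the Python sketch below builds a ClickHouse table definition with monthly partitions, dedicated columns, and a delete TTL. The table name, column names, and the 180-day default are assumptions chosen for this example, not a recommended schema. Moving older parts to cheaper storage would additionally need a storage policy, which is not shown here.

```python
def logs_table_ddl(table: str = "logs", retention_days: int = 180) -> str:
    """Return a CREATE TABLE statement for a hypothetical log table."""
    return f"""
CREATE TABLE {table}
(
    timestamp   DateTime64(3),
    service     LowCardinality(String),
    level       LowCardinality(String),
    customer_id String,
    message     String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (service, timestamp)
TTL toDateTime(timestamp) + INTERVAL {retention_days} DAY DELETE
""".strip()


print(logs_table_ddl())
```

Keeping the retention period as a parameter makes it visible in code review, which helps with the ownership question: someone has to decide that number and change it on purpose.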
At 180 days, ClickHouse can still perform well, but only if someone designed and maintained it on purpose. Long retention means merges, TTL rules, partition cleanup, and storage tiering need attention. Compression helps, but good compression does not rescue a weak layout.
A simple example makes the difference obvious. Suppose a team asks, "What changed for this customer across the last 120 days?" ClickHouse handles that well when customer ID, service name, and timestamp live in clear columns. The same query feels much worse when those values are buried inside raw JSON that every search has to parse.
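One way to avoid parsing JSON at query time is to lift the commonly searched fields into columns at ingest. The sketch below assumes a hypothetical JSON log format with `ts`, `service`, `customer_id`, and `msg` keys; real field names will differ.

```python
import json
from datetime import datetime


def to_row(raw_line: str) -> dict:
    """Split one JSON log line into queryable columns plus leftover attributes."""
    event = json.loads(raw_line)
    row = {
        # Assumed key names; adjust to the actual log format.
        "timestamp": datetime.fromisoformat(event.pop("ts")),
        "service": event.pop("service", ""),
        "customer_id": str(event.pop("customer_id", "")),
        "message": event.pop("msg", ""),
    }
    # Everything else stays as JSON for the rare query that needs it.
    row["attributes"] = json.dumps(event, sort_keys=True)
    return row


sample = (
    '{"ts": "2024-05-01T12:00:00+00:00", "service": "billing", '
    '"customer_id": 42, "msg": "charge failed", "attempt": 3}'
)
print(to_row(sample))
```

The fields people filter on every week become plain columns, and the long tail stays in one attributes field instead of forcing a wider schema.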
This is the part many teams miss. ClickHouse often buys speed and lower storage use with design work up front. If one engineer spends a few focused hours on schema, partitions, and retention jobs, the system can still feel cheap and fast six months later. If nobody owns those choices, the maintenance work and the bill usually rise together.
Query speed changes how teams investigate problems
Slow log searches do more damage than they seem to. They do not just waste a few minutes. They change how people think during an incident.
Most investigations are not one query followed by a clean answer. A developer starts with the last hour, spots a clue, widens the range to six hours, then a day, then maybe a week. If each step drags, the whole process gets blunt. People stop testing ideas and settle for the first explanation that seems good enough.
A common incident workflow looks like this:
- check errors from the last hour for one service
- widen to all instances of that service
- search the same pattern across a day
- compare it with a deploy window
- scan older logs to find when the pattern first appeared
This is where the difference between Loki and ClickHouse becomes practical. Loki usually feels fine when the time range is short and the labels are clean. If you know the service, cluster, and time window, you can move fast. But repeated ad hoc searches become painful when the query turns into a broad text scan across older data.
ClickHouse usually feels stronger as the search widens. Large scans, repeated comparisons, and aggregated views fit it better, especially when the table design matches the questions your team asks all the time. The price is more work up front around schema, ingestion, and tuning.
Dashboards and human searches also stress a system differently. Dashboards repeat the same patterns. Engineers do the opposite. Under pressure, they try messy searches, change filters, widen the range, and rerun variations until something clicks.
That difference matters because slow search has a payroll cost before it has a cloud cost. Oleg Sotnikov has made the same point in his observability and infrastructure work: if a team cannot move from a one-hour view to a seven-day view without friction, incident response slows down fast.
What actually drives storage cost
A log bill rarely matches the raw amount of data you ingest. A team may send 500 GB a day and then wonder why the bill looks as if they were storing 800 GB a day or more. The gap usually comes from indexes, replicas, fast disks for recent data, and the traffic needed to move data between tiers or systems.
Loki and ClickHouse charge you for different kinds of overhead, but the pattern is similar. You are not only storing the message itself. You are also storing metadata that makes search possible, copies for safety, and data placed on storage tiers with very different prices.
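The gap between ingested and billed volume can be sketched as a small model. Every number below (compression ratio, index overhead, replica count) is a placeholder assumption; measured values from your own system should replace them.

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    daily_ingest_gb: float    # raw log volume per day
    retention_days: int
    compression_ratio: float  # raw bytes divided by stored bytes
    index_overhead: float     # index and metadata as a fraction of stored data
    replicas: int             # number of full copies kept

    def overhead_multiplier(self) -> float:
        """How much indexes and replicas inflate the compressed data."""
        return (1 + self.index_overhead) * self.replicas

    def billed_gb(self) -> float:
        """Steady-state volume once the retention window is full."""
        compressed = self.daily_ingest_gb * self.retention_days / self.compression_ratio
        return compressed * self.overhead_multiplier()


# Hypothetical inputs for illustration only.
example = Footprint(
    daily_ingest_gb=500, retention_days=30,
    compression_ratio=8, index_overhead=0.10, replicas=2,
)
print(f"{example.billed_gb():.0f} GB billed")
print(f"{example.overhead_multiplier():.1f}x overhead on compressed data")
```

The model leaves out transfer and query costs on purpose. It only shows that metadata and copies can more than double the stored volume before anyone runs a query.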
The biggest cost drivers are usually straightforward:
- indexes and metadata
- replicas for availability
- hot storage on faster disks
- read and transfer costs when people query or move old data
Hot retention is where many teams overspend. Most engineers read the last few hours or days. They almost never inspect month-old logs unless there is an audit, a billing dispute, or a rare incident. Keeping 7 to 14 days on hot storage and pushing the rest into cheaper long-term storage often cuts spend without losing history.
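A back-of-the-envelope comparison shows the effect of a shorter hot window. The per-GB prices and the stored volume below are invented for illustration; plug in real prices from your provider.

```python
def monthly_storage_cost(
    stored_gb_per_day: float,
    retention_days: int,
    hot_days: int,
    hot_price_per_gb: float,
    cold_price_per_gb: float,
) -> float:
    """Monthly cost when the first hot_days sit on fast disks and the rest on a cold tier."""
    hot_days = min(hot_days, retention_days)
    cold_days = retention_days - hot_days
    return stored_gb_per_day * (
        hot_days * hot_price_per_gb + cold_days * cold_price_per_gb
    )


# Assumed figures: 20 GB stored per day after compression,
# $0.10 per GB-month hot, $0.02 per GB-month cold.
all_hot = monthly_storage_cost(20, 180, 180, 0.10, 0.02)
tiered = monthly_storage_cost(20, 180, 14, 0.10, 0.02)
print(f"all hot: ${all_hot:.2f} per month")
print(f"14 days hot: ${tiered:.2f} per month")
```

With these assumed prices the tiered layout costs roughly a quarter of the all-hot one. The real ratio depends entirely on the price gap between tiers, and it ignores read costs on the cold tier.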
Noise matters even more than tool choice. Debug logs, duplicate events, health checks, and chatty background jobs can flood a system with data nobody uses. Cutting noisy logs by 30% often saves more money than switching backends. Teams that trim log noise and shorten hot retention usually see the bill drop before they change anything else.
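Before trimming anything, it helps to measure how much of the volume is noise. The sketch below estimates the byte share of lines matching a noise pattern in a sample. The pattern (`DEBUG` lines and a `/healthz` endpoint) is an assumption; each team has its own noisy sources.

```python
import re

# Hypothetical noise pattern; adjust to the log formats actually in use.
NOISE = re.compile(r"\bDEBUG\b|GET /healthz")


def noise_share(lines: list[str]) -> float:
    """Fraction of sampled bytes that belong to lines matching the noise pattern."""
    total = sum(len(line.encode("utf-8")) for line in lines)
    if total == 0:
        return 0.0
    noisy = sum(len(line.encode("utf-8")) for line in lines if NOISE.search(line))
    return noisy / total


sample_lines = [
    "INFO order created",
    "DEBUG cache miss",
    "GET /healthz 200",
    "ERROR payment failed",
]
print(f"{noise_share(sample_lines):.0%} of sampled bytes look like noise")
```

Running this over an hour of real logs per service usually shows which sources are worth filtering first.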
That is also where architecture choices matter. A faster engine does not fix waste. If an app writes five lines where one line would do, both Loki and ClickHouse get more expensive. Start by storing less junk, keep less data on expensive disks, and treat long retention as a storage design problem, not only a database problem.
Operator time is part of the observability bill
The hidden cost in this decision is often not storage. It is the time someone spends keeping the system healthy.
Loki usually looks simpler at first. Many teams get it running and only run into trouble later, when labels have grown without a plan. A few bad label choices can drive cardinality up, slow queries down, and create regular cleanup work. Query habits matter too. If engineers keep running broad searches across huge time ranges, Loki can look cheap on paper and expensive in operator time.
ClickHouse has a different pattern. It can store logs efficiently and answer hard queries quickly, but it asks for more care early on. Table design, partitions, retention rules, merges, backups, and cluster health all need attention. If those choices are weak, the database keeps reminding you.
Before choosing, ask four plain questions:
- Who owns the system after launch?
- Who handles upgrades and broken queries?
- Who deletes old data and checks retention rules?
- Who gets paged when search performance drops?
If the honest answer is "the same busy engineer who already does five other jobs," operator time should matter more than raw storage price.
One extra hour a week sounds small. Over a year, that is about 50 hours. At $100 an hour, that is roughly $5,000 a year, often spent to save a much smaller amount on storage.
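The same arithmetic can be written down so the team can plug in its own numbers. With a full 52 weeks the figure comes out slightly above the rounded $5,000. The $200 monthly storage saving is a made-up value for illustration.

```python
def yearly_operator_cost(
    hours_per_week: float, hourly_rate: float, weeks_per_year: int = 52
) -> float:
    """Cost of recurring maintenance time over one year."""
    return hours_per_week * weeks_per_year * hourly_rate


def net_yearly_saving(
    storage_saving_per_month: float, extra_hours_per_week: float, hourly_rate: float
) -> float:
    """Storage savings minus the operator time needed to get them."""
    return storage_saving_per_month * 12 - yearly_operator_cost(
        extra_hours_per_week, hourly_rate
    )


print(yearly_operator_cost(1, 100))    # 5200
print(net_yearly_saving(200, 1, 100))  # -2800
```

A negative result means the cheaper storage option costs more overall once the extra hours are counted.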
For lean teams, Loki often wins when log usage is simple and label discipline is strict. ClickHouse often wins when the team can design it well and keep it tidy. The cheaper option is the one your team can run without constant babysitting.
How to choose without guessing
The right choice usually comes from team habits, not benchmark charts. If engineers mostly search the last few days during incidents, while finance or compliance wants six months of history, then "fast" retention and "cheap" retention are doing two different jobs.
A simple evaluation works better than a long debate.
Start by measuring normal daily ingest, peak ingest, and the real retention target. Peaks matter because a noisy deploy or broken service can double volume for hours.
Then write down the searches people actually run. During incidents, teams often search by service, time range, error text, request ID, or customer ID. Support work may need broader searches across several fields. If nobody can list real searches, the team is guessing.
Next, split retention into tiers. You might keep 14 to 30 days fast for daily work and move older logs into the cheapest setup you can tolerate. That one decision often has more impact than the choice of engine.
After that, assign ownership. Someone needs to own upgrades, broken ingestion, tuning, and cleanup. A faster system on paper can still cost more if it eats hours every month.
Finally, test with real logs. Not synthetic data. Load a week or two of production-like logs and run the same searches in both setups. Time a few common tasks: finding one error around a deploy, tracing a request across services, and checking a customer issue from last month. Then count the setup work too. If one option saves a few seconds per query but adds a day of tuning every quarter, the trade may not be worth it.
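A small timing harness keeps that comparison honest. The sketch below takes the median of several runs per task and backend. The two `fake_search` functions are stand-ins: in a real trial each callable would wrap an actual query against the setup under test.

```python
import statistics
import time
from typing import Callable, Dict


def median_seconds(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Run a query several times and return the median wall-clock duration."""
    samples = []
    for _ in range(repeats):
        started = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - started)
    return statistics.median(samples)


def compare(
    tasks: Dict[str, Dict[str, Callable[[], object]]]
) -> Dict[str, Dict[str, float]]:
    """Time every task on every backend it is defined for."""
    return {
        task: {name: median_seconds(run) for name, run in backends.items()}
        for task, backends in tasks.items()
    }


def fake_search() -> None:
    # Placeholder for a real query call; sleeps briefly to simulate work.
    time.sleep(0.001)


results = compare(
    {
        "error around a deploy": {"backend_a": fake_search, "backend_b": fake_search},
        "customer issue from last month": {"backend_a": fake_search, "backend_b": fake_search},
    }
)
for task, timings in results.items():
    print(task, {name: f"{seconds:.4f}s" for name, seconds in timings.items()})
```

The median is used instead of the mean so that one cold-cache run does not dominate the result. Setup and tuning hours still need to be tracked separately, since no harness measures those.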
The practical rule is simple: choose the setup your team can operate calmly at 2 a.m. That is usually the cheaper choice over a full year.
A simple example: one product, six months of logs
Picture a small SaaS team with one customer-facing product. They keep application logs, nginx logs, and background worker logs for 180 days. Support checks fresh errors every day. Engineering looks at the last few hours during incidents. Older logs sit untouched most weeks, except when someone investigates a billing dispute, a rare bug, or a security issue.
That usage pattern matters more than any product page. If the team reads recent data all the time and barely touches month-old logs, Loki usually makes more sense. They can keep recent logs easy to query, move older data to cheaper storage, and accept that old searches may take longer. That trade often lowers retention cost without hurting daily work.
Now change one habit. Say the team runs broad searches across three to six months several times a week. Product wants trend checks. Support compares error spikes across releases. Engineers search old worker failures by user ID, endpoint, or service name. In that case, ClickHouse starts to look better. It handles broad scans and repeated historical analysis more comfortably, even if the storage plan and day-to-day care take more effort.
A plain rule works well here:
- Loki fits a "hot recent data, cold archive later" workflow.
- ClickHouse fits a "search across months as normal work" workflow.
The wrong choice shows up quickly. If a team keeps six months in Loki but constantly asks broad questions over the full range, people wait on queries and start avoiding the system. If they put everything into ClickHouse but mostly inspect yesterday's errors, they may spend extra money and operator time for speed they rarely use.
Team habits beat tool slogans. Count how often people search old logs, how wide those searches are, and how much delay they will tolerate. That tells you more than any benchmark chart.
Mistakes that make the bill climb
The biggest cost jumps usually come from habits, not from the database itself.
One common mistake is keeping debug logs forever. Debug lines help during an incident, but their value drops quickly. If a team stores months of noisy retries, verbose app output, and low-signal traces in the same retention tier as real errors, storage grows for little benefit.
Another mistake starts with good intent. Engineers add more fields so searches will be easier later, then include request IDs, user IDs, session tokens, container hashes, or other values that change almost every line. That makes indexes heavier, queries slower, and compression worse.
A few patterns appear again and again:
- treating all logs as hot data
- ignoring slow queries and buying more compute instead of fixing the data model
- copying a setup built for a company ten times larger
Hot storage should hold recent logs that people search every day. Older data often belongs in a cheaper tier with slower access that still works for audits, compliance, or a rare incident. Skip that split and the bill keeps rising.
Slow queries cause a second wave of waste. Once searches lag, teams add RAM, CPU, and extra nodes. That can help for a while, but the real problem often sits in labels, partitions, or query patterns.
The oversized stack mistake is especially common in startups and smaller SaaS teams. They borrow settings from a much larger company, then spend time patching and tuning a system that never matched their scale.
The boring rule works best: keep less, index less, cool older logs sooner, and fix query habits before buying hardware.
Next steps
Before you settle on a logging stack, ask one question: which searches does your team run every week? Real examples expose the answer quickly. You need either fast recent retention, cheap archive retention, or a mix of the two.
Keep recent logs and older logs separate on purpose. Most teams need quick access to the last few days or weeks. Older data often matters for audits, incident follow-up, or the occasional trend check, but not at the same speed. Once you split hot retention from archive retention, the cost picture gets clearer.
Do not stop at storage pricing. Operator time can cost more than disks. If one option saves a little on raw storage but adds hours every month in tuning, cleanup, upgrades, and query troubleshooting, it is not the cheaper option.
A short review usually catches the biggest mistakes:
- list your five most common searches
- mark how far back each one usually goes
- separate fast retention from cheap archive retention
- estimate monthly operator hours
- test both options with a real sample workload
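The first three items of that review can be turned into a quick calculation: given a list of searches and how far back each goes, what share of weekly queries would a given hot window serve? The searches and frequencies below are invented examples, not data from a real team.

```python
from typing import NamedTuple


class Search(NamedTuple):
    name: str
    lookback_days: int
    runs_per_week: int


def hot_coverage(searches: list[Search], hot_days: int) -> float:
    """Share of weekly query runs that stay inside the hot retention window."""
    total = sum(s.runs_per_week for s in searches)
    if total == 0:
        return 0.0
    covered = sum(s.runs_per_week for s in searches if s.lookback_days <= hot_days)
    return covered / total


# Hypothetical search inventory for illustration.
inventory = [
    Search("errors after a deploy", 1, 40),
    Search("trace one request", 2, 25),
    Search("customer issue", 30, 5),
    Search("release comparison", 90, 2),
    Search("billing dispute", 120, 1),
]
print(f"14 days hot: {hot_coverage(inventory, 14):.0%}")  # 89%
print(f"30 days hot: {hot_coverage(inventory, 30):.0%}")  # 96%
```

If a short hot window already covers most weekly runs, the archive tier can be optimized for price. If a large share reaches back months, fast search over old data deserves more weight in the decision.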
A trial with 30 to 60 days of real logs tells you more than a feature checklist. Measure storage growth, common query times, and how much babysitting the system needs.
If you want a second opinion before observability costs turn into a habit, Oleg Sotnikov at oleg.is helps startups and smaller teams review architecture, infrastructure, and retention plans. A short consultation can be cheaper than months of overspending and a painful migration later.
The best choice is usually the one your team can explain on one page, run without drama, and still afford when log volume doubles.
Frequently Asked Questions
Is Loki cheaper for six months of logs?
Usually yes, if your team mostly reads the last few hours or days. Loki keeps old chunks in object storage cheaply, but broad searches across months often get slow, and that extra wait turns into real labor cost.
When does Loki start to feel slow?
Trouble often starts around 90 days if labels drift or people run wide searches. By 180 days, old data may still sit in low-cost storage, but broad text scans can feel sluggish.
When should I choose ClickHouse instead?
Pick ClickHouse when people search across months as part of normal work. It handles large time ranges better when you store fields like customer ID, service name, and timestamp in clear columns.
Should I keep all logs on hot storage?
No. Keep only the recent window on fast storage and move older logs to a cheaper tier. Most teams read fresh logs every day and touch old logs only for audits, disputes, or rare incidents.
What makes log retention cost more than expected?
Raw storage is only part of it. Indexes, replicas, query compute, transfer costs, and engineer time often push the bill up more than people expect.
Does query speed really matter during incidents?
Yes, because slow searches change how people debug. When every query drags, engineers test fewer ideas, settle early, and spend longer in incidents.
What is the first thing to fix if my log bill keeps growing?
Start with noise. Cut debug spam, duplicate events, health checks, and other low-value logs before you switch tools. Many teams save more by sending less data than by replacing the backend.
How should I test Loki and ClickHouse fairly?
Load real production-like logs and run the searches your team uses every week. Time common tasks, check storage growth, and count the hours needed for setup, cleanup, and tuning.
Who should own the logging system after launch?
One person should own upgrades, retention rules, broken ingestion, and slow queries. If nobody owns those jobs, the system will drift and costs will rise no matter which tool you pick.
What usually works best for a small SaaS team?
A small SaaS team usually does well with Loki for recent operational logs and a cheaper archive for older data. If support, security, or product teams search months of history often, ClickHouse usually fits better.