May 31, 2025·8 min read

Database cleanup jobs that stay clear of live traffic

Learn how to run database cleanup jobs with small batches, short lock limits, and safe time windows so users keep working without slowdowns.

Database cleanup jobs that stay clear of live traffic

Why cleanup jobs slow down live traffic

A cleanup query can look harmless and still cause a mess in production. The problem is rarely the idea of deleting old rows. The problem is doing too much at once while real users are reading and writing the same database.

Large delete or update statements keep locks open longer than many teams expect. While those locks stay open, normal requests wait. A few waiting requests can turn into a queue fast, and latency climbs.

The query can also fight with live traffic before locking becomes obvious. If the database cannot find the target rows through a narrow index, it scans a big part of the table. That scan burns CPU, memory, disk, and cache that your app also needs for normal reads and writes.

Peak hours make everything worse. A job that feels fine at 2 a.m. can push response times up within seconds at 2 p.m. Busy systems have very little spare room, so even a moderate cleanup job can tip them into slowdowns.

One oversized batch is often enough to cause visible trouble. Deleting 50,000 or 100,000 rows at once can create long transactions, bigger write bursts, and more pressure on replication. Replicas fall behind, read traffic sees stale data, and some requests start timing out.

A simple example: an app deletes old sessions in one huge statement while users are logging in. The cleanup job touches the same table that active requests need. Users do not care that the job is "maintenance." They only see a slow login screen.

That is why cleanup jobs need hard limits. Without limits on lock time, scan size, and run windows, maintenance stops being background work and starts competing with customers.

What invisible maintenance looks like

The best cleanup jobs are the ones nobody notices. A customer opens a page, updates an order, or saves a form, and the app feels normal. Response times stay steady. Writes still go through. Support does not get odd reports about frozen pages or random timeouts.

That usually happens when the job works in small, quick bursts. It selects a limited set of rows, deletes or updates them, commits, and gets out of the way. Because each batch ends fast, locks do not sit on hot tables for long. Other sessions can keep reading and writing without lining up behind maintenance work.

Good cleanup also knows when to back off. If request latency starts creeping up, lock waits grow, or the database gets busier than usual, the job slows down or pauses. That matters more than raw speed. Finishing an hour later is usually fine. Making the product feel slow for ten minutes is not.

A clean stop matters too. You should be able to pause the job at any point and leave the database in a tidy state. That means no giant transaction holding work open forever and no half-done pass that needs manual repair. The next run should start from a clear checkpoint and continue without guesswork.

Think about a busy app during the workday. People add records every few seconds while a cleanup task removes old logs in the background. If users cannot tell the difference, the job is doing its job. If page saves start to drag, the maintenance is too loud.

Invisible maintenance is not magic. It is restraint: short batches, short locks, quick commits, and a job that knows live traffic comes first.

Choose the rows before you touch them

A cleanup job should know its target before it deletes a single row. "Old data" is too vague. Use a rule you can test, such as "sessions expired more than 30 days ago" or "email logs already archived and older than 90 days." If you archive before deletion, use the same rule in both steps so you do not move one set of rows and remove another.

Use filters that match indexed columns whenever you can. Dates, status fields, tenant IDs, and primary keys are common choices. If the database has to scan the whole table just to find candidates, the cleanup already puts pressure on live traffic. Check the index first, then run a count query with the same filter.

Process rows in a stable order. ID and created_at are usually the safest options because they stay predictable between batches. That makes the job easier to pause and resume, and it keeps each batch close to the last one. Random order is a bad habit. It can drag active rows into the cleanup and make user requests wait on the same pages or locks.

Before the first real run, count how many rows match the rule. That number changes the plan. Ten thousand rows may finish quietly in one evening. Twelve million rows need a slower schedule and tighter control. The first count also tells you if your filter is too broad.

One more check helps a lot: run a dry pass that selects, but does not delete, the first small batch. Look at the first and last IDs, the execution time, and whether a second run returns the next rows you expect. If that preview looks messy, fix the selection rule before you touch production data.

Set a batch size that stays boring

A good batch size is one that nobody notices. If users feel a pause, see slower pages, or wait longer on writes, the batch is too big.

Start smaller than you think. In many systems, that means testing 25, 50, or 100 rows first, not 5,000. Run the job during normal product traffic and time every batch. One fast test on a quiet database tells you very little.

Watch the batch itself, but also watch the app. Check request latency, lock waits, CPU, disk I/O, and query time while the job runs. A batch can finish in 40 ms and still push page loads up. If users feel the change, the limit is already too high.

Increase the size in small steps. Doubling works at first if the numbers stay flat, then smaller jumps make more sense. A path like 50, 100, 200, 300 is usually better than jumping straight to 1,000. The moment latency moves in a way users might feel, step back to the last quiet number and leave it there.

Different jobs need different limits. Deletes usually need the smallest batches because they create more churn and can hold locks longer. Updates often need careful limits, especially if they touch indexed columns. Backfills can sometimes run in larger batches if they write new data without fighting hot rows.

A busy app might handle 200-row backfills all day and still struggle with 50-row deletes on the same table. That is normal. Treat each job as its own test, not as a rule you can copy everywhere.

A useful sanity check is simple: after each change, ask whether anything moved that a user might notice. If the answer is yes, the batch is too ambitious.

Stop when lock time grows

Untangle Hot Tables
Find out why even tiny batches still collide with active rows.

A cleanup job should give up fast when it meets live work. If one batch waits on a row that a customer action already uses, the safe move is to stop that batch. Waiting sounds harmless, but a few stuck queries can pile up and turn a quiet job into a user problem.

Set a short lock wait limit for every batch. Keep it low enough that the job disappears into normal traffic, often a fraction of a second or a couple of seconds at most. The exact number depends on your app, but the rule is simple: if the batch cannot get the lock quickly, it should move out of the way.

That changes the job's behavior in a good way. It aborts one small batch instead of blocking active requests. It leaves busy rows alone and returns later. It also gives you useful logs when the system is under real load.

Record every lock timeout or early stop. Count how many batches ended early, how long they ran, and which table or query caused trouble. After a day or two, a pattern usually appears. Maybe the job is quiet at night and noisy at 9:00 a.m. Maybe one table stays hot all day.

Do not retry the same failed batch right away. That often creates a loop where the job keeps colliding with the same user traffic. Pause for a minute, or wait for the next scheduled run, then try again with the same small batch size.

Cleanup stays out of the way when it treats lock pressure as a stop sign. If lock time grows, back off and try later.

Use a time window that fits real traffic

Traffic is rarely quiet just because the clock says midnight. Some apps get a second wave from another time zone. Others stay calm most nights, then spike at month end or after a marketing email. Pick the quietest hours from real usage data, not habit.

Give the job a fixed start time and a fixed stop time. That matters more than people expect. A cleanup that runs from 1:00 to 3:30 every day is predictable, easy to watch, and easy to stop. If it does not finish, let it continue on the next run. Cleanup works best when it acts like background noise.

Set a simple rule for pausing early. If request latency rises, active connections climb, or the queue starts to build, stop before users notice. Waiting 15 or 20 minutes is usually cheap. Letting the job push into a traffic spike is not.

Teams also get into trouble by scheduling cleanup during the same window as other heavy work. The database does not care that each task looked safe on its own. Backups can flood storage and I/O. Reports can scan huge tables. Imports add write pressure. Sync jobs hold locks longer than expected. Index work and vacuum can compete for the same resources.

A good window leaves space around those jobs instead of overlapping with them. Even a 30-minute gap helps.

If you run a global product, use more than one window when needed. A short run during the quietest US hour and another during the quietest European hour can be gentler than one long session. Short windows force discipline. They keep the job boring, and boring maintenance is usually the safest kind.

Build the job step by step

Start with the safest part: the selector. Write the query that finds the rows you want to clean up, then run it as a plain read. Check the count, inspect a few sample rows, and make sure it only touches old data you really want gone.

After that, run one very small delete or archive batch. Think 50 or 100 rows, not 50,000. While it runs, watch app latency, error rate, and query time. If users feel nothing, you can move forward.

A simple build order works well. Test the selector with read-only queries first. Run one tiny batch in production and watch the app. Add a short sleep between batches, even a second or two. Save a checkpoint after each successful batch. Give the team an easy pause and resume switch.

That short sleep matters more than people expect. Cleanup usually becomes visible when it runs batch after batch with no break. A brief pause lets normal traffic get in, finish its work, and keep the system calm.

Always save progress after each batch. The simplest checkpoint is often the last processed ID or timestamp. If the job stops halfway through, you do not want to scan the same rows again or guess where it left off.

Make pause and resume boring. Store a stop flag in a table, config value, or job control panel that the team can change without a code deploy. If support sees latency rise, they should be able to pause the job in seconds.

Once this setup is in place, you can tune batch size, sleep time, and lock limits with much less risk. The job becomes predictable, and predictability is what keeps maintenance invisible.

A simple example from a busy app

Reduce Replica Lag Risk
Tighten batch pacing and commit patterns before lag hits users.

A SaaS product keeps old session rows for far too long. After a few months, the session table gets heavy, but the real problem is traffic. Users log in, refresh sessions, and sign out all day, so the table never really goes quiet.

The team does not try one giant delete. That would block normal work and make the app feel slow. Instead, they remove the oldest expired sessions first, in small batches, ordered by time.

A typical run looks like this:

  • Select the oldest 500 expired session rows.
  • Delete only those rows.
  • Commit right away.
  • Sleep for a few seconds.
  • Start the next batch only if the database still looks calm.

This is what good cleanup looks like in practice. It stays boring. If one batch takes 60 to 100 ms, users usually never notice. If a bigger batch pushes lock waits up to 800 ms or more, people start to feel it during login.

The team also watches the traffic curve, not just the database. Morning logins, lunch breaks, and end-of-day spikes all change the safe batch size. A batch that works fine at 2:00 a.m. may be too big at 9:05 a.m.

So the job has two stop rules. It pauses when lock wait time rises above a small limit, and it pauses when login traffic climbs past the normal baseline. Then it tries again later inside its maintenance window.

That sounds slower, and it is. But slow is often the right choice. Deleting 2 million old rows over several nights is better than freezing a busy table for 20 seconds once. In a live app, invisible work wins.

Mistakes that make cleanup visible

Most cleanup pain starts with one bad assumption: the app can tolerate a big maintenance query while users are active. It usually cannot.

The loudest mistake is deleting too much in one transaction. A single delete of millions of rows can hold locks for too long, grow the transaction log fast, and leave a huge pile of dead rows behind until the commit finishes. Users may not see the delete itself, but they will feel slower writes, stuck queries, or random timeouts.

Another common mistake happens before any row gets deleted. If the job finds rows by sorting or filtering on columns without a useful index, the database has to scan far more data than expected. That extra work burns CPU and I/O, and live traffic pays the price.

Teams also make cleanup visible by running several workers at once. It sounds faster, but parallel cleanup often turns one quiet job into table and index contention. Workers compete for the same pages, trigger more lock waits, and sometimes force retries. One steady worker is often less disruptive than four busy ones.

The last trap is watching only the primary database and ignoring side effects. Cleanup can look fine on the main node while replicas fall behind, logs grow fast, or vacuum cannot keep up. That is still user impact, just delayed.

A few warning signs show up early:

  • Replica lag climbs and does not recover quickly.
  • Transaction or binlog volume jumps far above normal.
  • Vacuum or autovacuum starts trailing behind.
  • Lock waits rise during the cleanup window.

A busy app that deletes old sessions is a simple example. If the team starts three workers, each deleting 100,000 rows per pass, the app may survive for a few minutes and then get noisy. Smaller batches, fewer workers, and a query that uses the right index keep cleanup quiet, which is the whole goal.

Quick checks before you press start

Review Your Cleanup Plan
Get a second opinion on batch sizes, lock limits, and run windows.

A short test run catches most bad surprises. The goal is simple: the app keeps serving users, and almost nobody notices the work.

Run the selector by itself and inspect a small sample of rows. Check the dates, status flags, tenant filters, and sort order. If a few rows look wrong, stop there.

Time one real batch under conditions that match live traffic. It should finish well below your lock time limit, not barely under it. If the limit is 200 ms, a batch that takes 180 ms is already too close.

Watch your dashboards during the test. API latency, slow queries, lock waits, and error rates should stay flat. A tiny spike in one batch can turn into visible pain during a long run.

Decide who can pause the job right away. Name one owner and one backup. A shared chat with no clear owner wastes time when traffic jumps.

Write down the stop plan in plain language. Note how to disable the scheduler, how to find the last processed ID or timestamp, and what to check after you stop.

The written plan matters more than most teams expect. Once a job starts deleting in small batches, people relax and assume it is safe. That is often when a bad filter, a missing index, or a lunchtime traffic spike slips through.

One practical rule helps: do not start until you have checked row selection, batch timing, live latency, ownership, and stop steps. Five careful minutes now can save hours of lock chasing later.

What to do next if the job still fights traffic

If the same table keeps blocking user writes, the job is touching data that is still too hot. Smaller batches help for a while, but they do not fix a table where old and active rows live side by side and share the same busy indexes.

A cleaner fix is to separate cold data from live data. Move older rows into an archive table or older partitions first, then delete from the archive on its own schedule. That gives your app fewer collisions because the busiest writes stay on the smaller, newer slice of data.

Direct deletes are often the wrong move when traffic stays high all day. In that case, mark rows for archive, copy them out, verify counts, and delete them later during a quieter period. It takes an extra step, but it is usually less painful than repeated lock fights in production.

If cleanup still causes noise, review the basics again. Check retention rules. Teams often delete data earlier than they need to, or keep too much because nobody updated the policy. Look at indexes on the delete filter and sort order. A missing or bloated index can turn a small delete into a long scan. Compare job timing with real traffic, not assumed traffic, because peak periods drift over time. If needed, split one large job into separate jobs by tenant, date range, or status so one bad slice does not slow everything.

At that point, the problem often stops being a script problem and becomes an architecture problem. A fresh review can save a lot of trial and error.

If your team keeps hitting that wall, Oleg Sotnikov at oleg.is can help review the retention plan, batch and lock limits, rollout order, and infrastructure tradeoffs. An outside review is often faster than spending another week tuning the same failing job.

Frequently Asked Questions

What batch size should I start with?

Start tiny, usually 25 to 100 rows. Run the job during normal traffic, not on a quiet database, and watch request latency, lock waits, and query time. If anything moves, go smaller, then raise the size in small steps until the job stays boring.

How can I tell if users will notice the cleanup job?

Watch the app, not just the database. Slow logins, slower saves, higher lock waits, rising error rates, and replica lag all mean the job is too loud. When you see that, pause the job, lower the batch size, or add more sleep between batches.

Is one big nightly delete ever a good idea?

Usually no. One large delete can keep locks open too long, spike write load, and push replicas behind. Small batches with quick commits take longer overall, but they keep the app usable while the cleanup runs.

What lock wait timeout should I use?

Use a short limit so the job steps aside fast, often well under a second or up to a couple of seconds. If a batch cannot get its locks quickly, abort that batch, record it, and try again later. Waiting longer usually turns maintenance into a user problem.

How do I choose rows without scanning the whole table?

Pick a rule you can test, like expired sessions older than 30 days. Filter on indexed columns and process rows in a stable order such as id or created_at. Before you delete anything, run a read-only preview and confirm the sample rows look right.

Should I run more than one cleanup worker?

Start with one worker. Parallel cleanup often makes workers fight over the same table and indexes, which raises lock waits and retries. Add more workers only after one worker stays quiet under real traffic.

What checkpoint should the job save?

Save a checkpoint after every successful batch, usually the last processed id or timestamp. Keep a simple pause flag that your team can change without a deploy. That lets you stop fast and resume without guessing where the job left off.

Why does cleanup work at 2 a.m. but fail at 2 p.m.?

Because safe limits change with real traffic. A batch that feels fine overnight can compete with logins, writes, and cache churn during busy hours. Use short maintenance windows based on actual usage, and let the job pause when traffic rises.

What if tiny batches still hit hot rows?

The table may mix old rows and hot rows too closely. At that point, move cold data into an archive table or older partitions, then clean that data on its own schedule. That usually reduces collisions more than endless tuning on the live table.

What should I test before the first real run?

Check the selector, inspect a few sample rows, time one real batch, and confirm latency stays flat while it runs. Also make sure one owner and one backup know how to pause the scheduler and find the last checkpoint. A short test run catches most bad surprises.

Database cleanup jobs that stay clear of live traffic | Oleg Sotnikov