Rollback plan: what to restore beyond code changes
A rollback plan should restore more than code. Learn how to handle data, feature flags, jobs, cache, and customer messages when a release fails.

Why code rollback alone does not fix the problem
A code revert only changes what runs now. It does not undo what the bad release already changed.
That gap explains why rollback plans often fail in real incidents. The app version goes back, but users still hit broken pages, missing records, duplicate emails, stuck orders, or old error messages. The code changed back. The service did not.
A rollback plan has to cover every part of the release: data written under the new logic, flags and config that still point to the bad behavior, background jobs that are still running, cache and other derived state, and the messages your team sends to customers and coworkers.
Take a simple case. A release adds a new checkout rule, writes new values into the database, and starts a job that retries failed payments. You revert the code five minutes later. The new values stay in the database, the retry job keeps firing, and cached checkout pages still show the broken flow. From the team side, the deploy is gone. From the customer side, the problem is still live.
"Service restored" should mean something simple: users can finish the task they came to do, their data is correct, and the system has stopped creating new damage. If customers can log in but cannot place an order, service is not restored. If the page works but yesterday's bad job is still sending wrong invoices, service is not restored either.
This is where a rollback plan matters. It should define what to stop, what to reverse, what to recompute, and who needs an update. Lean teams, especially small AI-augmented teams, need this even more. When you move fast, a code-only rollback feels convenient. It is also how a small incident turns into a long cleanup job.
What to map before you ship
Before you deploy, write down every moving part the release touches. People remember the code diff. They forget the flag someone flipped, the job schedule that changed, or the support reply that now needs an update.
A good rollback plan starts with a small release map. It should cover the parts that can break service or keep the bug alive after you revert the code: application code, deployed services, database changes, feature flags, config, background jobs, queues, schedules, and customer-facing updates.
Keep code changes and state changes in separate buckets. Code is usually easy to roll back. State is harder. If the new version wrote data in a new format, queued thousands of jobs, or changed a production setting, the old code may still fail after the deploy is undone.
Ownership matters as much as the map itself. During an incident, each part needs one named owner and one backup. If "the team" owns the database rollback, nobody owns it once the pressure starts.
A simple release note can do the job. For each changed part, write four things: what changed, how to undo it, who does it, and what has to happen first. Order matters. If a release adds a new billing job and a new flag, someone may need to pause the job before another person reverts the app.
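The four fields and the ordering rule can be sketched in a few lines of Python. This is a minimal illustration, not a required format: the `ReleasePart` class, its field names, and the part names in the usage note are all hypothetical. It uses the standard-library `graphlib` module (Python 3.9+) to turn "what has to happen first" into an ordered list.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ReleasePart:
    """One row of the release map (all field names are illustrative)."""
    name: str
    what_changed: str
    how_to_undo: str
    owner: str
    backup_owner: str
    # Names of parts that must be undone BEFORE this one.
    undo_after: list[str] = field(default_factory=list)


def rollback_order(parts: list[ReleasePart]) -> list[str]:
    """Return part names in an order that respects every undo_after rule."""
    graph = {part.name: set(part.undo_after) for part in parts}
    # TopologicalSorter expects {node: predecessors}; predecessors come first.
    return list(TopologicalSorter(graph).static_order())


def missing_owners(parts: list[ReleasePart]) -> list[str]:
    """Return parts that lack a named owner or a backup."""
    return [p.name for p in parts if not p.owner or not p.backup_owner]
```

With a map where `app-code` lists `billing-job` in `undo_after`, `rollback_order` places the job pause ahead of the app revert, and `missing_owners` flags any row nobody has claimed. A circular dependency raises `graphlib.CycleError`, which is worth finding before launch rather than during the incident.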
Keep the document short enough to read in under a minute. One page is usually enough. If the list feels too long, cut anything that does not affect recovery. Under stress, shorter wins.
This is the kind of operating habit Oleg Sotnikov often pushes with startup teams on oleg.is: make the recovery path visible before launch. If a tired engineer can scan the map at 2 a.m. and know the next step, the document is doing its job.
Data needs its own rollback path
A deploy can roll back in minutes. Data usually cannot. If the release added bad rows, changed values, or wrote records in the wrong format, putting old code back may still leave the app broken.
A solid rollback plan names every data change the release can make. That includes new records, updates to existing fields, deleted data, and anything copied or transformed by a migration. Teams often remember schema files and forget the rows that changed under real traffic.
Some changes are easy to undo. You can remove duplicate rows, flip a status back, or delete test data that slipped into production. Other changes need more care. If a migration dropped a column, rewrote IDs, split one field into two, or started saving data in a new shape, the old app may not know how to read what is already there.
Mixed states cause the most pain. One customer may have old data, another may have half-migrated data, and a third may have new data plus a failed retry. That is why a database rollback needs rules for partial writes. Decide how you will find affected records, how you will mark them, and who will run the repair.
A short plan helps:
- list the tables or collections the release can change
- mark each change as reversible, repairable, or backup-only
- prepare a query that finds bad or half-written records
- decide who approves a restore and who runs the fix
Restoring backups should be the last move, not the first. Use it when the release corrupted a wide set of records, when you cannot separate good data from bad data, or when legal and financial records must match an exact point in time. Patch forward when the damage is narrow and you can repair records safely without wiping fresh customer activity.
A simple example: a release changes invoice status rules and writes "paid" too early for 300 orders. Old code will not correct those rows. The team needs a repair script, a way to stop more writes, and a clear count of which records changed before service is truly back.
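A repair for that kind of case might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for whatever database you run, and every table and column name (`invoices`, `payments`, `status`, `updated_at`, `needs_review`) is an assumption made for illustration. The rule it applies is also an assumption: an invoice marked paid during the release window with no payment row is suspect.

```python
import sqlite3


def find_suspect_invoices(conn: sqlite3.Connection, start: str, end: str) -> list[int]:
    """Invoices marked 'paid' inside the release window with no payment row."""
    rows = conn.execute(
        """
        SELECT i.id
        FROM invoices AS i
        LEFT JOIN payments AS p ON p.invoice_id = i.id
        WHERE i.status = 'paid'
          AND i.updated_at BETWEEN ? AND ?
          AND p.id IS NULL
        ORDER BY i.id
        """,
        (start, end),
    ).fetchall()
    return [row[0] for row in rows]


def mark_for_review(conn: sqlite3.Connection, invoice_ids: list[int]) -> int:
    """Flag suspect rows instead of guessing their status; return rows changed."""
    changed = 0
    with conn:  # commits if the block succeeds, rolls back on an exception
        for invoice_id in invoice_ids:
            cur = conn.execute(
                "UPDATE invoices SET status = 'needs_review' "
                "WHERE id = ? AND status = 'paid'",
                (invoice_id,),
            )
            changed += cur.rowcount
    return changed
```

Running the finder first gives the "clear count" of affected records. Marking rows for review rather than rewriting them keeps the repair reversible if the rule turns out to be too broad.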
Feature flags and config can keep the bug alive
A code rollback can still leave the problem in place if live behavior comes from switches outside the release. A flag may still send users into the broken path. An env var may still point to the wrong model, timeout, or API endpoint. A hidden admin switch can do the same, and teams often forget those exist until something breaks.
Treat config changes as part of the release, not as side notes. Track every flag that changed, what the old value was, who changed it, and where that switch lives. Do the same for remote config, env vars, and any manual toggles in an internal admin page.
Keep the recovery notes practical. List each flag or setting the release touched, record the safe value to restore, name the person who can change it during an incident, and note whether the change needs a restart, cache clear, or worker restart.
Ownership matters more than many teams expect. During an outage, one person may control feature flags, another may control secrets, and someone else may have access to the admin switch. If nobody knows who can flip what, the team wastes time asking for access while users keep hitting the bug.
Stale values also cause trouble. A flag can be off in one region and still on in another because sync lagged or a local cache did not refresh. Background workers often keep old env vars until someone restarts them. That is why a rollback can look successful in one place and broken in another.
A common example looks like this: the team rolls back a release, but users still see the new checkout flow. The code is old again, but the remote flag still routes 20 percent of traffic to the bad path on two servers. Until someone checks flag state across regions, the team keeps blaming the code and loses another 15 minutes.
If the bug survives the revert, compare flags and config first. The fix may be one switch, not another deploy.
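That comparison is easy to script. The sketch below assumes you can export live flag and config values per region as plain dictionaries. How you fetch them depends on your flag service, so that part is left out, and the function name and sample keys are hypothetical.

```python
def flag_drift(safe_values: dict, live_by_region: dict) -> dict:
    """Return {region: {flag: (live_value, safe_value)}} for every mismatch.

    A flag that is missing in a region shows up with a live value of None.
    """
    drift = {}
    for region, live in live_by_region.items():
        diffs = {
            name: (live.get(name), safe)
            for name, safe in safe_values.items()
            if live.get(name) != safe
        }
        if diffs:
            drift[region] = diffs
    return drift


safe = {"new_checkout": False, "payment_timeout_s": 10}
live = {
    "eu": {"new_checkout": False, "payment_timeout_s": 10},
    "us": {"new_checkout": True, "payment_timeout_s": 10},
}
print(flag_drift(safe, live))  # {'us': {'new_checkout': (True, False)}}
```

An empty result means flags and config match the recorded safe values everywhere you checked, which lets the team rule out this layer quickly and look elsewhere.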
Jobs and queues need a stop and restart plan
A release can be gone while its jobs keep running. Workers may still process old messages, retry failed tasks, or fire cron tasks that write the same bad data again. If your rollback plan ignores that layer, service may stay broken after the deploy is reversed.
Start with an inventory. Write down every worker, queue, scheduled task, webhook consumer, and retry rule touched by the release. Include the quiet jobs too: nightly imports, emails, billing syncs, search indexing, and cleanup tasks. Those are often where the mess spreads because nobody remembers to stop them.
Before rollback, decide which jobs must pause first. Any task that can keep changing records, charging users twice, or sending the wrong notice should stop before you restore the app. Some teams pause the whole queue. Others pause only the risky job types. Either choice works if it is written down and one person owns it.
You also need a rule for jobs already in flight. A worker may start with the new code and finish after you switch back. That can leave half-done updates or duplicate side effects. For each important job, define whether you let it finish, cancel it, or send it to manual review.
Keep the runbook simple:
- list queues, workers, cron tasks, and retry settings
- pause jobs that can keep writing bad data
- check in-flight jobs before workers restart
- decide which queued jobs you will drain, replay, or discard
- record who approves each step during recovery
Queued work deserves extra care. Draining works when jobs are safe but need old code to process them. Replay works when you can rebuild the queue from clean events. Discarding is sometimes the least bad option, but only if you know what users will miss and how you will fix that later.
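Writing that decision down per job type keeps it out of the incident call. The sketch below encodes the three options as a simple rule. The class and field names are hypothetical, and two things are assumptions beyond the text: the order in which the options are tried, and the extra `HOLD` outcome for jobs that fit none of them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DRAIN = "drain"      # let the restored old code process the queue
    REPLAY = "replay"    # rebuild the queue from clean source events
    DISCARD = "discard"  # drop the jobs and fix the gap later
    HOLD = "hold"        # assumption: keep paused and ask the rollback lead


@dataclass
class QueuedJobType:
    name: str
    safe_on_old_code: bool         # payloads are safe for the old code to process
    rebuildable_from_events: bool  # clean events exist to rebuild the queue
    loss_is_understood: bool       # team knows what users miss and how to fix it


def choose_action(job: QueuedJobType) -> Action:
    """Pick the least destructive option that applies (ordering is an assumption)."""
    if job.safe_on_old_code:
        return Action.DRAIN
    if job.rebuildable_from_events:
        return Action.REPLAY
    if job.loss_is_understood:
        return Action.DISCARD
    return Action.HOLD
```

The point is less the code than the table behind it: if nobody can fill in the three booleans for a job type before launch, that job is the one most likely to surprise you during recovery.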
A simple example: a bad release changes invoice logic, then background jobs keep generating incorrect invoices every minute. Rolling back the web app alone does nothing if the workers keep running. Pause the invoice jobs, inspect the queue, remove or reprocess the bad items, and restart workers in stages. That is when the rollback actually restores service.
Cache and derived state can outlast the release
A bad release can stay visible after you roll the code back. Caches, sessions, and indexes often keep serving old values, so users still see the bug and your team starts doubting whether the rollback worked.
That happens because code is only one part of system state. The app cache may still hold responses built by the broken version. The CDN may keep old pages at edge locations. User sessions may carry bad permissions or invalid cart data. Search indexes may still point to records that no longer match the live code.
Write down every store of derived data before you ship. For most teams, that means app cache, CDN cache, sessions, search indexes, generated feeds, and any prebuilt pages or reports. Skip one of them and the service can stay half-broken after the revert.
What to clear and what to rebuild
Do not clear everything by default. Some resets fix the issue fast. Others create a new problem.
Clear app cache when object shape, field names, or response rules changed. Purge CDN cache when users may still get broken pages, stale API responses, or old assets. Reset sessions only when session data itself changed or stored a bad state. Rebuild search indexes when filters, ranking, document fields, or visibility rules changed. Recompute reports or materialized views if they were generated from wrong data.
A full session wipe can log everyone out. A full search reindex can take hours. A CDN purge can spike traffic to your origin servers and raise costs. Decide in advance what is safe to clear fast, what should rebuild in the background, and what can wait for normal expiry.
Stale data can survive much longer than people expect. If your app cache lives for 30 minutes and the CDN keeps content for an hour, wrong results can linger well after the code rollback. Search indexes can last even longer if they only update on a schedule. Sessions may stay alive until users sign out or their token expires.
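To put rough numbers on the first case, here is a small sketch. It assumes plain time-based expiry with no purge. The `layered` switch is an illustration: when the CDN can cache a response that was built from the app cache just before that entry expired, the lifetimes add up. When the layers expire independently, the longest one sets the limit.

```python
def stale_window_minutes(ttls_minutes: list[int], layered: bool = True) -> int:
    """Worst-case minutes that a bad value can keep being served after rollback.

    layered=True:  each layer may copy the previous layer's stale value at the
                   last moment, so the lifetimes stack.
    layered=False: layers expire independently, so the longest one wins.
    """
    if layered:
        return sum(ttls_minutes)
    return max(ttls_minutes, default=0)


# App cache: 30 minutes. CDN: 60 minutes.
print(stale_window_minutes([30, 60]))                 # 90
print(stale_window_minutes([30, 60], layered=False))  # 60
```

Either way, the team should expect stale results for at least an hour unless someone clears the relevant layer on purpose.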
Use a small check during recovery. Open the product as a new visitor, as a signed-in user, and through search. If one path still shows the bug, stale state is still in play. That is often why a rollback seems incomplete even when the deploy itself succeeded.
How to run a rollback step by step
When production breaks, speed helps, but order matters more. A rollback plan should name one person to lead the call, approve each move, and stop the team from trying five different fixes at once.
Before anyone reverts code, stop the damage. If the release writes bad records, pause those writes first. Put risky actions into read-only mode, stop imports, and pause workers that keep feeding the problem. If you changed a schema or data format, treat database rollback as a separate task, not as an automatic result of a deploy revert.
A practical sequence looks like this:
- Confirm the issue and assign one rollback lead. That person decides whether to roll back, who pauses traffic or jobs, and who posts updates.
- Turn off fast-moving parts first. Reverse feature flags and config that can stop the bad behavior in seconds, and pause any jobs that keep writing bad data.
- Revert the release. Deploy the last known good version, then clear or rebuild cache only where stale data can keep the bug alive.
- Restart background jobs in stages. Check queue depth, failed jobs, and retries before you let every worker run again.
- Test the paths users actually take. Log in, create or edit one record, complete one payment or checkout flow if relevant, and verify one job finishes end to end.
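The sequence above can be sketched as a tiny runner that executes steps in order and stops at the first failure. Everything here is hypothetical scaffolding: the `Step` class, the owner names, and the idea that each step is a function returning `True` when its own check passes. Real steps would call your deploy tool, flag service, or queue admin.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    owner: str
    run: Callable[[], bool]  # returns True when the step's own check passes


def run_rollback(
    steps: list[Step], log: Callable[[str], None] = print
) -> tuple[list[str], Optional[str]]:
    """Run steps in order. Return (completed step names, failed step or None)."""
    done: list[str] = []
    for step in steps:
        log(f"[{step.owner}] {step.name}")
        if not step.run():
            # Stop here so nobody improvises the remaining steps out of order.
            log(f"STOP: '{step.name}' failed; escalate to the rollback lead")
            return done, step.name
        done.append(step.name)
    return done, None
```

Stopping on the first failure is a deliberate choice in this sketch. It matches the rule that one lead approves each move, and it keeps workers from restarting against a release that was never actually reverted.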
Do not call the incident closed just because the app loads again. Check error rate, write volume, and support inbox patterns for at least a short window. Teams often miss delayed jobs, stale sessions, or bad data that keeps spreading after the code is back.
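One way to make "a short window" concrete is to require several consecutive healthy samples before anyone closes the incident. The sketch below assumes you already collect an error rate at a regular interval. The function name, the threshold, and the window size are placeholders to tune for your own service.

```python
def held_steady(error_rates: list[float], max_error_rate: float, window: int) -> bool:
    """True only if the last `window` samples all sit at or below the threshold."""
    if window < 1:
        raise ValueError("window must be at least 1 sample")
    recent = error_rates[-window:]
    # Too few samples means we have not watched long enough yet.
    return len(recent) == window and all(r <= max_error_rate for r in recent)


# Per-minute error rates after the revert (made-up numbers).
print(held_steady([0.09, 0.01, 0.01, 0.02], max_error_rate=0.02, window=3))  # True
print(held_steady([0.01, 0.09, 0.01], max_error_rate=0.02, window=3))        # False
```

The same check works for write volume or queue depth. Support inbox patterns still need a person to read them.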
Once service holds steady, send a short customer update. Keep incident communication plain: what users may have seen, whether data looks safe, and what happens next if you still need cleanup. That message matters because silence makes even a small outage feel worse.
A simple example from a failed release
A team ships a new checkout flow at 2:00 PM. The code looks fine in staging, so they use a feature flag to turn it on for 25 percent of customers. Within minutes, support sees complaints: some carts show the wrong total, a few customers get charged twice, and order emails arrive for purchases that never finished.
The bug is not only in the web app. The release also changed how the database stores discount reservations, and a background worker now retries failed payment events more aggressively. A code rollback on its own will not clean up the broken rows or stop the worker from replaying bad events.
First 30 minutes
The first move is to stop the spread. At 2:07 PM, the team turns off the new checkout flag, pauses the payment worker, and blocks new deploys. Support gets a short note: checkout has issues, the team is working on it, and affected customers will get an update.
At 2:12 PM, one engineer checks logs and finds that the new flow writes duplicate discount holds when a payment call times out. Another engineer queries the queue and sees hundreds of retry jobs waiting. A third person compares recent orders against payment records to find who might have been charged twice.
At 2:18 PM, they redeploy the last known good version. That helps new sessions, but old cart totals still look wrong because the cache holds bad values. They clear the checkout cache and rebuild totals for active carts.
At 2:24 PM, they run a database repair script that removes duplicate discount holds and marks uncertain orders for manual review instead of automatic capture. Only after that do they restart background jobs, first with payment retries still disabled, then with safe jobs like confirmation emails.
By 2:30 PM, checkout works for new customers, the queue is under control, and support has a clean list of affected orders.
What changed after
The team updates its rollback plan the same day. Every checkout release now needs four things before launch: a reversible data change, a worker pause command, cache reset steps, and a customer message template.
They also add one rule that saves a lot of pain: nobody runs a database rollback or restarts workers from memory. Someone must follow a written checklist. That sounds boring. It also prevents the second mistake, which usually hurts more than the first.
Mistakes teams make during recovery
Teams under pressure often treat recovery like one action: roll back the deploy, restart a few services, and move on. That is where rollback plans break down. Service can still fail even after the old code is back.
One common mistake is assuming the database moved back in time with the app. It did not. If the new release changed rows, added bad records, or ran a one-way migration, the old code may come back to a database it no longer understands. Teams then waste 20 or 30 minutes chasing "new" errors that actually came from old data left behind.
Another mistake comes from background work that keeps running after the revert. Queues, retries, scheduled jobs, and webhooks can keep sending the same bad payloads or recreating broken records. A team might roll back at 10:05 and still see damage at 10:20 because workers never stopped.
Cache creates a different trap. People clear it fast, hoping for a clean reset. That can help, but it can also make things worse if the same bad job, flag, or query fills the cache again. Before you flush anything, check what rebuilds that state and whether it is safe to let it run.
Customer updates often go wrong too. Teams post "fixed" when the error rate drops, but they have only tested the first screen or one happy path. Customers then hit the same bug in checkout, account settings, or a delayed email flow. That second failure hurts trust more than the first.
A calmer recovery usually follows the same order every time: stop jobs and retries that can keep causing harm, verify database state instead of checking app version alone, confirm that flags and config no longer point to the bad behavior, and test the full customer path end to end before you announce anything.
A small example makes the point. A release adds a broken billing rule, the team rolls back the code, and the site still charges the wrong amount. Why? A retry worker keeps replaying the bad billing events, and cached totals still show old numbers. The deploy was reversed. The service was not restored.
Quick checks and next steps
A rollback is done only when users can complete the action that failed. Getting the old code back is only part of the fix. You also need to confirm that jobs stopped, data looks normal again, and support knows what to say.
Use a short post-rollback check before you close the incident:
- ask one real person on the team to complete the broken action from start to finish
- confirm that background jobs, queue workers, or scheduled tasks stopped if they were writing bad data or retrying failed work
- check that data returned to a safe state, with correct totals, readable records, and no half-finished updates
- clear or rebuild cache only where needed, then test again
- give support a short message they can send customers about what broke, what works now, and whether users need to retry anything
That last step matters more than many teams admit. If a customer still sees a failed payment, a missing record, or a stuck order, they do not care that your rollback plan succeeded on paper. They care whether they can finish the task now.
After service is stable, write down what slowed recovery. Maybe the team lacked a database rollback path. Maybe feature flags stayed on. Maybe nobody owned customer updates. Small notes made right after the event often prevent the same mess next month.
If this keeps happening, outside help can be worth it. Oleg Sotnikov works with startups and small teams on release discipline, recovery planning, and AI-first engineering operations, and that kind of review is often cheaper than repeating the same outage twice.
Frequently Asked Questions
Why doesn't a code rollback fix everything?
A code revert only changes what runs now. It does not remove bad rows, stop active jobs, clear stale cache, or undo a flag that still sends users into the broken flow.
Treat rollback as service recovery, not just a deploy action. Check data, config, workers, and customer impact before you call it fixed.
What should I map before a release?
Write a short release map before you deploy. Include code, database changes, feature flags, config, workers, queues, schedules, cache, and any customer message the release changes.
For each part, note what changed, how to undo it, who owns it, and what order the team should follow under pressure.
How do I plan for database rollback?
Give every data change its own rollback path. Note which tables or collections the release can touch, how you will find bad records, and who approves a repair or restore.
Use backups only when you cannot separate good data from bad data or when records must match an exact point in time. If the damage is small, repair the affected rows instead.
Can feature flags and config break rollback?
Yes. One old setting can keep the bug alive even after you restore the old code, so check every flag, env var, remote config value, and admin toggle the release touched.
Record the safe value, who can change it, and whether the team needs a restart or cache clear after the change.
What should I do with jobs and queues during rollback?
Stop them before they do more damage. A worker can keep replaying bad events, sending wrong emails, or writing the same broken data long after the deploy is gone.
Pause risky jobs first, inspect what is already in flight, and decide whether to drain, replay, or discard queued work before you restart everything.
Do I need to clear cache after a revert?
Clear cache only where stale state can keep the problem visible. If you wipe everything without a plan, you may log users out, spike traffic, or refill the cache with the same bad data.
Think about app cache, CDN cache, sessions, search indexes, and generated reports. Reset only the parts that still show the broken state.
Who should run the rollback?
Pick one rollback lead. That person decides the order of actions, approves changes, and stops the team from trying five fixes at once.
Give each recovery area a named owner and a backup. When everyone owns the database or the flags, nobody owns them during the outage.
When can I say service is restored?
Use a simple rule: users can finish the task they came to do, their data looks correct, and the system has stopped creating new damage.
If login works but checkout fails, service is not back. If the page loads but a worker still sends wrong invoices, service is not back either.
What should I check right after rollback?
After the revert, test the real user path from start to finish. Log in, create or edit one record, complete the broken flow, and confirm one background job finishes cleanly if that path matters.
Then watch error rate, write volume, queue depth, and support messages for a short window. Delayed failures often show up after the app looks normal again.
What mistakes slow recovery the most?
Teams often make three mistakes. They assume the database moved back with the app, they forget workers and retries, and they announce a fix after testing only one screen.
A calmer recovery stops harmful jobs first, verifies data, checks flags and config, and tests the full customer path before support tells users the issue is over.