Nov 04, 2025·7 min read

Safe schema migration pattern for calmer Friday deploys

Learn a safe schema migration pattern that uses expand, backfill, switch, and cleanup so your team can ship database changes with less risk.

Why schema changes break normal releases

A normal code deploy assumes the app and the database can change together. In production, they usually can't. One server may run the new code while another still runs the old version, and both hit the same tables at the same time.

That overlap is where schema changes break. If the new release expects a column that isn't there yet, requests fail. If the database starts requiring a new field before old code knows how to fill it, writes fail.

Background jobs make this worse. Workers often keep running through a deploy for minutes or hours. An old worker can still read and write data after the web app has moved on, so a change that looked safe in staging starts throwing errors in production.

Rollbacks don't fully save you either. You can roll back code in a few minutes, but the data may already be in a new shape. If the old version can't read it, the rollback brings the old app back and leaves the outage in place.

Small mismatches create very visible problems. A missing default can block checkout. A renamed column can break account updates. A stricter constraint can turn a harmless edge case into a 500 error that customers hit right away.

Safe migrations work because they accept this messy overlap instead of pretending it doesn't exist. The database, app servers, workers, and queued jobs all move at different speeds. Treat deployment as a period of mixed versions and you avoid the release that looked fine in staging, then failed under real traffic.

How the four-step pattern works

Most failed releases happen when the app and the database change at the same time. A safer approach splits one risky change into four smaller steps: expand, backfill, switch, and cleanup.

First, expand. Add the new column, table, or index, and keep any new field nullable. Old code must keep working after this change. If the current app can't run against the new schema, you're moving too fast.

Next, backfill. Move existing data into the new shape in small batches. Keep each batch short, watch database load, and make the job safe to rerun if it stops halfway.

Then switch. Point reads and writes to the new path only after you confirm the data is there. Many teams switch in stages or use a feature flag so they can turn the new path off quickly.

Last comes cleanup. Drop old columns, old writes, and compatibility code only after the app stays stable for a while. This is the only step that removes your easy rollback path.

The order matters more than the exact tools. During expand and backfill, the old app still runs. During switch, you still have a fallback because the old shape hasn't gone away yet. That makes rollback boring, which is exactly what you want on a Friday.

A simple rule helps: never introduce the new thing and remove the old thing in the same deploy. Keep both paths alive until production checks pass. Once the new path has handled real traffic without surprises, cleanup is safe.

What to check before you start

Most bad migrations fail before the first SQL statement runs. Teams miss one reader, one background job, or one report, and the deploy looks fine until something quiet breaks 20 minutes later.

Start by mapping the table you plan to change. Write down every place that reads or writes it, not just the main app code. Check API endpoints, web pages, workers, scheduled tasks, admin tools, internal scripts, exports, reports, dashboards, and any partner integration or ETL job. A forgotten report can fail just as hard as customer traffic.

Pick a rollback point for each phase, not just for the deploy as a whole. After expand, decide how you'll back out if the new columns or indexes cause trouble. After backfill, decide whether you can stop and leave both shapes in place. After switch, define the exact point where you'd send writes back to the old path. If you can't describe the rollback in a sentence or two, the step is still fuzzy.

Then do the math on the backfill. Count the rows, estimate batch size, and work out how long it will run under normal load. A table with 50,000 rows behaves very differently from one with 80 million. Small batches take longer, but they usually hurt production less. That's often the right tradeoff.
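
As a rough illustration of that math, here is a minimal sketch. The batch size, per-batch time, and pause length are assumed numbers for the example, not measurements; replace them with timings from your own staging run.

```python
import math


def estimate_backfill(total_rows: int, batch_size: int,
                      seconds_per_batch: float, pause_seconds: float) -> dict:
    """Rough duration estimate for a batched backfill (all inputs are guesses)."""
    batches = math.ceil(total_rows / batch_size)
    total_seconds = batches * (seconds_per_batch + pause_seconds)
    return {"batches": batches, "hours": total_seconds / 3600}


# Assumed: 1,000 rows per batch, 0.2 s of work and a 0.1 s pause per batch.
small = estimate_backfill(50_000, 1_000, 0.2, 0.1)       # 50 batches, about 15 seconds
large = estimate_backfill(80_000_000, 1_000, 0.2, 0.1)   # 80,000 batches, about 6.7 hours

print(small["batches"], large["batches"], round(large["hours"], 1))
```

Under these assumed numbers the same script that finishes in seconds on the small table runs for most of a working day on the large one, which is why the estimate belongs in the plan.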

Watch the signals that usually show trouble first: query latency, lock waits, error rate, replication lag, and queue buildup. If your migration adds an index or updates rows in batches, slow queries often appear before users report anything.

Run the full sequence in staging with data that looks like production. Toy data hides real problems. If staging can't handle a realistic rehearsal, shrink the first production run. Migrate one table, one code path, or one customer segment first.

Expand without breaking old code

The first deploy should only add options, not remove them. Add the new columns or the new table, keep the old columns in place, and let the current app keep working as if nothing changed. That gives you room to pause or roll back the app without scrambling to repair the database.

Make new fields nullable at first, or give them a harmless default. A required field can break older app nodes, background jobs, or admin scripts the moment they insert a row without it. Null isn't pretty, but it's forgiving, and forgiving is what you want during a live release.

Your code also needs to live with both shapes for a while. Reads can prefer the new field when it has data and fall back to the old one when it doesn't. Writes can stay on the old field for one deploy, or write to both if the logic is simple and well tested.

A safe expand step is usually simple:

  • add the new column, table, or index
  • leave the old column untouched
  • allow null in the new field
  • update the app to read either shape
  • avoid renames and drops for now
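
The steps above can be sketched with Python's built-in sqlite3 module. The `users` table and the `username` and `handle` columns are hypothetical names chosen for the example, and your database and migration tool will differ; the point is only that old inserts keep working and reads accept either shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
conn.execute("INSERT INTO users (username) VALUES ('ada')")

# Expand: add the new column as nullable and leave the old column untouched.
conn.execute("ALTER TABLE users ADD COLUMN handle TEXT")

# Old code that has never heard of `handle` can still insert rows.
conn.execute("INSERT INTO users (username) VALUES ('grace')")


def read_handle(row: sqlite3.Row) -> str:
    """Prefer the new column when it has data, otherwise fall back to the old one."""
    return row["handle"] if row["handle"] is not None else row["username"]


# Pretend one row has already been migrated to the new column.
conn.execute("UPDATE users SET handle = 'ada_l' WHERE username = 'ada'")

rows = conn.execute("SELECT * FROM users ORDER BY id").fetchall()
handles = [read_handle(r) for r in rows]
print(handles)  # ['ada_l', 'grace']
```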

Renames are a common trap. They look small, but older code still asks for the old name, and staggered deploys mean old and new app versions may run at the same time. Treat a rename as a multi-step change instead: add the new name first, move data later, then remove the old one after the switch proves stable.

If this part feels a little dull, that's a good sign. Quiet deploys usually start with boring database changes.

Backfill old rows in controlled batches

Backfills fail when teams try to do too much at once. One query that rewrites millions of rows can lock tables, spike CPU, increase replication lag, and turn a safe release into a late-night fix.

Small batches are slower on paper, but much safer in production. Update a few hundred or a few thousand rows, commit, check the database, then move to the next batch. That gives the app room to keep serving normal traffic.

Pick one clear way to measure progress and stick to it. Most teams use an increasing id, a created_at timestamp, or another stable column. That makes the job easy to resume if it stops halfway through.

A simple backfill loop does five things:

  • selects the next chunk of rows that still need the new value
  • updates only that chunk
  • records the last processed id or timestamp
  • pauses briefly if load rises
  • writes failed row ids to a retry log

That short pause matters more than people expect. If read latency climbs or background jobs start to queue, slow the backfill down. A migration that takes two hours without user impact is better than one that finishes in ten minutes and hurts everyone.
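
Here is a minimal sketch of such a loop using sqlite3. The `items` table, the `old_value` and `new_value` columns, and the `transform` and `load_is_high` callables are all assumptions made for the example. It returns the last processed id and the failed ids instead of persisting them; a real job would store both somewhere durable so it can resume after a crash.

```python
import sqlite3
import time


def backfill(conn, transform, batch_size=500, start_after=0,
             load_is_high=lambda: False, pause_seconds=0.5):
    """Fill new_value from old_value in id order, committing once per batch.

    Returns (last_id, failed_ids) so an interrupted run can resume from last_id.
    """
    last_id, failed_ids = start_after, []
    while True:
        # Select the next chunk of rows that still need the new value.
        rows = conn.execute(
            "SELECT id, old_value FROM items "
            "WHERE id > ? AND new_value IS NULL ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return last_id, failed_ids
        for row_id, old_value in rows:
            try:
                conn.execute("UPDATE items SET new_value = ? WHERE id = ?",
                             (transform(old_value), row_id))
            except ValueError:
                failed_ids.append(row_id)  # stand-in for a retry log
        conn.commit()
        last_id = rows[-1][0]  # progress marker; failed rows are skipped, not retried
        if load_is_high():
            time.sleep(pause_seconds)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, old_value TEXT, new_value INTEGER)")
conn.executemany("INSERT INTO items (old_value) VALUES (?)",
                 [("1",), ("2",), ("oops",), ("4",), ("5",)])

last_id, failed_ids = backfill(conn, int, batch_size=2)
print(last_id, failed_ids)  # 5 [3]
```

Because the loop only selects rows where the new column is still empty, rerunning it from the start is harmless, and passing `start_after` lets it pick up where an earlier run stopped.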

Don't trust the script just because it's still running. Check row counts as you go. Compare how many rows still need backfill before and after each pass, and spot-check a few records to confirm the new data matches the old source.
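
A check like this can be small. The sketch below assumes a hypothetical `items` table where `new_value` should equal the integer form of `old_value`; the comparison rule is specific to that made-up example, so swap in whatever "matches the old source" means for your data.

```python
import sqlite3


def backfill_report(conn, sample_size=5):
    """Return (rows still waiting for backfill, ids of sampled rows that look wrong)."""
    remaining = conn.execute(
        "SELECT COUNT(*) FROM items WHERE new_value IS NULL"
    ).fetchone()[0]
    sample = conn.execute(
        "SELECT id, old_value, new_value FROM items "
        "WHERE new_value IS NOT NULL ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    ).fetchall()
    # Assumed rule for this example: new_value is old_value parsed as an integer.
    mismatched = [row_id for row_id, old, new in sample if str(new) != old.strip()]
    return remaining, mismatched


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, old_value TEXT, new_value INTEGER)")
conn.executemany("INSERT INTO items (old_value, new_value) VALUES (?, ?)",
                 [("1", 1), ("2", 2), ("3", 30), ("4", None)])

remaining, mismatched = backfill_report(conn, sample_size=10)
print(remaining, mismatched)  # 1 [3]
```

Run it before and after each pass: the remaining count should fall, and the mismatch list should stay empty.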

Some rows will fail. Bad data, odd encoding, or old edge cases tend to show up during backfills. Log them, skip them, and retry later with a smaller script. One stubborn row shouldn't block the whole deploy.

When the remaining count gets close to zero, run one last verification pass before you switch traffic.

Switch traffic to the new shape

The risky moment usually isn't the schema change itself. It's the moment your app stops reading the old shape and starts trusting the new one.

Don't switch reads first. Run dual writes for at least one deploy before the read change, so every new row updates both the old fields and the new ones. That gives the new path fresh data instead of half-empty rows.

Put the new read path behind a feature flag and roll it out gradually. You might send 1% of requests to the new fields, then 10%, then all traffic once the numbers stay steady.
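
One common way to do a percentage rollout, shown here as a sketch rather than the behavior of any particular feature-flag product, is to hash a stable identifier into a bucket from 0 to 99. The function name, the flag name `read_new_columns`, and the user ids are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int, flag: str = "read_new_columns") -> bool:
    """Deterministically place a user in a 0-99 bucket and compare it to the rollout percent."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


users = [f"user-{n}" for n in range(1000)]
at_1 = {u for u in users if in_rollout(u, 1)}
at_10 = {u for u in users if in_rollout(u, 10)}

# Raising the percentage only adds users; nobody flips back to the old path.
print(at_1 <= at_10)  # True
```

Hashing keeps each user on the same side of the flag between requests, so a person does not bounce between the old and new read paths while the rollout is in progress.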

While the flag rolls out, compare both paths on live traffic. The app can read from the new fields for flagged requests while it still calculates the old result in the background and logs any mismatch. Real traffic finds edge cases that test data misses.
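
A comparison wrapper along these lines is one way to do it. The function and parameter names are made up for the example, and the sketch runs both reads inline for simplicity; in a real service you would likely move the comparison off the request path.

```python
def compare_reads(key, read_old, read_new, use_new, on_mismatch):
    """Read both shapes, report any difference, and return the one the flag selects."""
    old_result = read_old(key)
    new_result = read_new(key)
    if old_result != new_result:
        on_mismatch(key, old_result, new_result)
    return new_result if use_new else old_result


# Hypothetical data: row 2 has not been backfilled into the new shape yet.
old_store = {1: "paid", 2: "refunded"}
new_store = {1: "paid", 2: None}
mismatches = []


def record(key, old, new):
    mismatches.append((key, old, new))


first = compare_reads(1, old_store.get, new_store.get, True, record)
second = compare_reads(2, old_store.get, new_store.get, False, record)
print(first, second, mismatches)  # paid refunded [(2, 'refunded', None)]
```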

Keep an eye on a few simple checks: null rates in the new fields, record counts between old and new queries, totals or status values that should match, and latency if new joins or indexes change response time.

If the numbers drift, turn the flag off first. That's usually the fastest rollback you have. Users go back to the old read path, while dual writes keep both versions current and give you time to inspect the problem.

Even after the switch looks clean, keep the old reads around for one more deploy. Teams get into trouble when they remove the fallback too early, then discover one worker, cron job, or admin screen still depends on the old shape.

Clean up after the switch holds

Cleanup is where a migration either finishes cleanly or lingers for months. Once the app reads the new shape and the switch has held, give the system a little time. Confirm that the backfill sits at 100%, error rates stay flat, and support stays quiet. If anything still looks odd, keep the old column and the dual writes a bit longer and watch them.

After that window, remove the compatibility code from the app. Old reads, dual writes, feature flags, and fallback paths make future changes harder. They also confuse new engineers, who can't tell which path still matters.

A cleanup pass usually includes:

  • remove reads from the old column or table
  • stop duplicate writes that kept both schemas in sync
  • delete one-off backfill jobs and retry scripts
  • retire flags that only existed for the migration
  • update alerts and dashboards if they still track the old shape

Don't drop old columns in the same release where you switch traffic. That's where many Friday deploys go bad. Keep the removal in a separate release, after you know the app, jobs, and reports no longer touch the old data. A short delay costs little. A rushed drop can turn a small mistake into a restore.

This is also the right moment to clean up the operational side. Update runbooks, onboarding notes, data diagrams, and any manual query examples. The new shape should become the normal one everywhere, not just in production code.

A simple example: split full_name into two columns

Splitting full_name into first_name and last_name feels small, but it can still break forms, exports, and user profiles if you rush it.

Start by adding first_name and last_name next to full_name. Keep the old column in place and let the old code keep working. At this stage, the new columns can stay nullable.

Then update the app so every save writes to both shapes. If a user edits their profile, the form should still store full_name, but it should also fill the new columns. For a first pass, use plain rules: trim extra spaces, put a single word into first_name, and if there are multiple words, put the first word into first_name and the rest into last_name.
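
Those plain rules fit in a few lines. This is a sketch of the first-pass rule described above, not a general name parser, and the helper names are hypothetical. It stores an empty string in last_name for single-word names, which is one possible choice.

```python
from typing import Tuple


def split_full_name(full_name: str) -> Tuple[str, str]:
    """First word goes to first_name, everything after it to last_name."""
    words = full_name.split()  # drops leading, trailing, and repeated spaces
    if not words:
        return "", ""
    return words[0], " ".join(words[1:])


def profile_fields(full_name: str) -> dict:
    """Dual write: keep the old column as submitted and fill both new columns."""
    first_name, last_name = split_full_name(full_name)
    return {"full_name": full_name, "first_name": first_name, "last_name": last_name}


print(split_full_name("  Ada   Lovelace "))  # ('Ada', 'Lovelace')
print(split_full_name("Cher"))               # ('Cher', '')
print(split_full_name("Mary Ann Evans"))     # ('Mary', 'Ann Evans')
```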

That rule isn't perfect, and that's fine. Names are messy. The goal is to move most rows safely and flag odd cases for review instead of blocking the deploy.

Next, backfill existing rows in small batches. Read rows where first_name or last_name is empty, split the old value, and write the result. Skip rows you've already fixed. If the table is large, process a few hundred or a few thousand rows at a time so you don't hammer the database.

After you verify the results, switch reads to the new columns. Build the display name from first_name and last_name, but keep writing full_name for a short period. That gives you a fallback if reports, admin screens, or email templates still depend on the old field.
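
The read side of that switch might look like the sketch below. It assumes rows arrive as dictionaries with the three column names from this example, and it falls back to full_name for any row the backfill has not reached.

```python
def display_name(row: dict) -> str:
    """Build the name from the new columns, falling back to full_name if they are empty."""
    parts = [row.get("first_name"), row.get("last_name")]
    name = " ".join(p for p in parts if p)
    return name or (row.get("full_name") or "")


migrated = {"first_name": "Ada", "last_name": "Lovelace", "full_name": "Ada Lovelace"}
not_yet = {"first_name": None, "last_name": None, "full_name": "Cher"}
single = {"first_name": "Cher", "last_name": "", "full_name": "Cher"}

print(display_name(migrated), "|", display_name(not_yet), "|", display_name(single))
# Ada Lovelace | Cher | Cher
```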

Only remove full_name after the switch holds in production and support stays quiet. Make that cleanup a later release. If something looks off, you can still rebuild the old value and avoid a Friday rollback.

Mistakes that trigger rollbacks

Most rollbacks start when a team assumes every part of the system will change at the same time. Real systems don't work that way. One app instance updates first, a worker lags behind, an admin tool still reads the old field, and suddenly a small schema change turns into a production bug.

Renaming a column and deploying app code in the same release is a classic trap. The web app may use the new name, but older app instances, queued jobs, and scheduled tasks can still ask for the old one. Even a short overlap can break writes, return empty data, or throw errors in places you didn't test.

Backfills cause a different kind of rollback. A single huge update during busy hours can lock rows, push up CPU, and slow normal requests. Users feel that right away. Batch the update, limit how much work runs at once, and watch load closely. If the backfill starts hurting the database, stop and resume later.

Teams also miss the parts of the system that sit outside the main app. Background jobs, support scripts, admin panels, reports, and exports often keep old assumptions long after the first deploy. A safe migration includes all of them, not just the user-facing code.

Cleanup is where people get overconfident. They remove old fields because the new app looks fine, but finance reports or internal tools still depend on them. Keep old columns around until reports, jobs, and dashboards all run on the new shape without errors.

Staging can give false confidence. It rarely has stale workers, odd data, production traffic, or forgotten tools. Write down a rollback plan before deploy: which flag to turn off, which code can still read both schemas, and which cleanup step must wait. That habit saves more Friday deploys than luck.

Checks before you press deploy

Calm deploys usually depend on a few simple checks, not heroics. If one of them fails, wait. A delayed release is cheaper than a rollback during peak traffic.

  • Confirm the old app version can still read and write with the expanded schema.
  • Confirm the new app version can handle mixed data during the backfill.
  • Watch error rate, query latency, lock waits, and replication lag.
  • Make sure you can pause the backfill cleanly and resume it later.
  • Name one person who decides go or no-go.

A small dry run helps more than a long meeting. Deploy to staging or a low-risk environment, run a short backfill against realistic data, and check whether your alerts fire for the problems you expect. If a forced bad row or slow query does nothing, your monitoring is too quiet.

A release is ready when old code still runs, new code tolerates incomplete backfill, the backfill can stop at any point, and one owner can make the call without debate. If you can't say yes to all four, wait.

What to do next

On your next database change, do less in one deploy. Split the migration into four releases: expand, backfill, switch, and cleanup. That one habit makes releases easier to control because old code and new code can run side by side for a while.

Test the rollback path with the same care you give the happy path. It's not enough to confirm that the new version works. You also want to know what happens if the backfill stops halfway, if a read still hits the old column, or if you need to put the previous app version back into production fast.

If your team feels unsure, start with a small change. Add a nullable column, copy data in batches, or begin dual writes before you remove anything. One quiet success teaches more than a stressful rescue late on Friday.

A short plan is enough:

  • split one migration into four releases
  • write down the rollback move for each release
  • run the backfill in small batches with resume support
  • test on staging with messy, realistic data
  • wait before cleanup until the new path stays stable

This pattern works best when one person owns the whole rollout, not just the SQL. Database changes touch code, jobs, alerts, and deploy timing. If any part feels shaky, slow down and keep the first step boring.

If a migration touches application code, workers, and infrastructure at the same time, a second review can save a painful rollback. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor, and this kind of release planning is exactly where experienced technical oversight helps.