Mar 25, 2025·7 min read

Definition of ready for infrastructure changes before deploys

Use a definition of ready for infrastructure changes with rollback steps, target metrics, and clear owner coverage before production work starts.

Why production changes go wrong

Most production trouble starts before the first command runs. A team calls a change small, low risk, or easy to reverse, but nobody writes down what success looks like. That is usually where a weak definition of ready for infrastructure changes shows up.

Vague plans create false confidence. Someone plans to "update config" or "resize a service," but the request does not say which systems will change, what should improve, or what numbers should stay stable. People fill in the gaps with guesses.

Rollback often fails for the same reason. Teams assume they can undo a change from memory, then discover the old settings are missing, the previous image tag is unclear, or the database step cannot simply run backward. A rollback plan for production changes only helps if a tired teammate can follow it fast, under pressure, without asking the original author what they meant.

Teams also watch the wrong signals. They stare at CPU or memory because those charts are easy to find, while users feel the problem somewhere else: slower page loads, longer queue time, more failed jobs, or login errors. If the metrics expected after the deploy are not written down first, people can convince themselves the change went fine while the system quietly gets worse.

Owner coverage breaks more changes than most teams admit. The person who knows the change best starts the deploy, then steps into a meeting, goes to sleep, or loses connection. Everyone else is left reading chat messages and guessing intent. A 10-minute fix turns into a long incident.

A small team can hit all of these at once. One engineer changes a caching rule on Friday evening. The dashboard looks calm, so the deploy stays live. Twenty minutes later, error rates rise in one region, checkout slows down, and nobody knows the last working config. The engineer is offline. At that point, the team is not managing a change. They are managing confusion.

What "ready" means before production

A change is ready when the team can explain it in plain language, predict what should happen next, and name who is on point if something goes wrong. If any of that is missing, the change is not ready.

Start with one sentence. Describe the exact change without filler: "Move background jobs to a new queue" or "Increase database connection limits on the API service." If the team cannot explain the change in one clear line, the request is still too vague.

Then say what should improve, or what must stay stable. Maybe page load time should drop, queue delay should shrink, or error rate should stay flat after a config update. This is where the definition of ready for infrastructure changes stops being paperwork and becomes useful. People need something real to check after the deploy.

Ownership must be clear before production work starts. One person leads the change. One person backs them up. The backup is not there for appearances. They need enough context to step in if the lead loses connection, gets pulled into another issue, or decides to roll back.

At minimum, every request needs four things:

a one-sentence description of the change
the expected result or stable condition after deploy
a named lead owner
a named backup owner

These are not nice extras. They are the gate. If even one item is missing, block the change and finish the request first. That pause costs a few minutes. Cleaning up a bad deploy can cost the rest of the day.
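
The gate can be expressed as a small check. The sketch below is one possible illustration in Python, not a prescribed tool. The `ChangeRequest` class, its field names, and the owner name are assumptions made for this example.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeRequest:
    """Hypothetical record holding the four required items."""
    description: str = ""      # one-sentence description of the change
    expected_result: str = ""  # what should improve or stay stable
    lead_owner: str = ""
    backup_owner: str = ""

    def missing_items(self) -> list[str]:
        """Return the names of required items that are empty or blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_ready(self) -> bool:
        """The change is ready only when nothing is missing."""
        return not self.missing_items()


# Example request with no backup owner (all values are made up).
request = ChangeRequest(
    description="Move background jobs to a new queue",
    expected_result="Queue delay stays under 2 minutes",
    lead_owner="Alex",
)
print(request.is_ready())       # False
print(request.missing_items())  # ['backup_owner']
```

A check like this could run when the request is filed, so a missing owner blocks the change before anyone books a window.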

What every change request should include

A production change request should read like a short operating note, not a vague summary. If someone else has to take over halfway through, they should know what will change, what will stay untouched, and what success looks like.

Start with scope. Name the systems, services, configs, or database objects that will change. Then name the parts that should not change. That second part matters more than people think. If the request says "update the load balancer config," it should also say whether DNS, TLS settings, autoscaling rules, and application code stay the same.

The request also needs rollback steps that a tired person can follow at night. Put them in order. Include any command, setting, or version to restore, plus a rough time estimate. "Rollback in 10 minutes" is useful. "Revert if needed" is not.

A practical production change checklist usually includes timing, expected user impact, the metrics to check after 5, 15, and 60 minutes, the lead owner, the backup owner, and the exact point where the team will continue, pause, or roll back.

Keep the metrics concrete. Pick a few numbers people already trust, such as error rate, latency, CPU load, queue depth, or failed login count. If a metric should stay flat, say that. If traffic should shift from one node group to another, say how much.

One rule saves a lot of pain: if the owner, backup, timing, rollback, and metrics are not written down, the change is not ready.
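
The 5, 15, and 60 minute checks are easier to keep when the request lists them as clock times. A minimal sketch, assuming a hypothetical 2 p.m. start; the function name is invented for this example.

```python
from datetime import datetime, timedelta

# Checkpoint offsets named in the checklist: 5, 15, and 60 minutes after deploy.
CHECKPOINT_MINUTES = (5, 15, 60)


def checkpoint_times(deploy_start: datetime) -> list[datetime]:
    """Return the times at which the team reviews the agreed metrics."""
    return [deploy_start + timedelta(minutes=m) for m in CHECKPOINT_MINUTES]


start = datetime(2025, 3, 25, 14, 0)  # hypothetical change window start
for when in checkpoint_times(start):
    print(when.strftime("%H:%M"))
# prints 14:05, 14:15, and 15:00 on separate lines
```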

How to run the ready check

A ready check works best as a short live review, not a form someone filled in yesterday. Get everyone involved in the change on a call, open the request, and walk through it line by line. If the owner cannot explain the plan in plain language, the plan is not ready.

Have the owner read the change out loud. It sounds simple, but it catches fuzzy thinking fast. People notice missing steps, hidden assumptions, and bad timing once they hear the plan clearly.

Then ask the owner to show the rollback path on screen. Do not accept "we can roll back if needed." Ask what command runs, who runs it, how long it takes, and what data or config might not return cleanly. A usable rollback plan should feel boring and specific.

Before anyone starts, open the dashboard you will use during and after the deploy. Confirm the exact metrics, their normal range, and how soon after the deploy you expect them to move. If nobody can answer "what should improve, stay flat, or spike briefly," the team is guessing.

A simple ready check covers five points:

the owner can explain the change without hand-waving
the rollback steps are written and trusted
the team has the right dashboard open before the first action
a backup owner is available and can respond fast
everyone agrees to stop if any part stays unclear

That last point matters a lot. Small teams often push ahead because the window is booked and everyone is already online. That is how messy nights start.

This rule still holds in lean teams: if one person owns the change, another person must be able to step in. A ready check only works if it gives the team permission to pause.

How to write rollback steps people can use

If the rollback steps only make sense to the person who wrote them, the plan is too thin for production.

Write them as a strict sequence. Do not write summaries like "revert the app" or "restore the old config." Name the exact command to run, the image or version to deploy, the service to restart, and the setting to change.

Each rollback note should answer five questions:

What triggers the rollback, and after how long?
What actions happen in order?
What data will not roll back cleanly?
Who makes the call, and who executes it?
What should look normal again when the rollback finishes?

Be blunt about data risk. Some changes can restore the app but cannot undo side effects. If a migration removes data, or a background job already pushed updates to customers, say that plainly so nobody assumes the system will return to the exact earlier state.

Test the rollback in a safe environment before release day. A staging run often exposes the detail that would cause trouble later, like an old container image that no longer exists or a setting name that changed.

Good rollback steps save time because they remove debate. When the stop point is clear and the actions are exact, the team can revert fast and get the service stable before a small issue becomes a long outage.
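
One way to keep rollback notes strict is to give them a fixed shape. The sketch below is an assumption about how that could look in Python: the class names, fields, and example values are all hypothetical, and the steps borrow from a background-job move rather than any real system.

```python
from dataclasses import dataclass


@dataclass
class RollbackStep:
    action: str   # exact action or command, written out in full
    minutes: int  # rough time estimate for this step


@dataclass
class RollbackPlan:
    trigger: str      # what triggers the rollback, and after how long
    decided_by: str   # who makes the call
    executed_by: str  # who runs the steps
    data_risk: str    # what will not roll back cleanly
    done_when: str    # what should look normal again afterwards
    steps: list[RollbackStep]

    def total_minutes(self) -> int:
        return sum(step.minutes for step in self.steps)

    def as_runbook(self) -> str:
        """Render the plan as numbered plain text for the change request."""
        lines = [
            f"Trigger: {self.trigger}",
            f"Decision: {self.decided_by} / Execution: {self.executed_by}",
            f"Will not roll back cleanly: {self.data_risk}",
        ]
        lines += [
            f"{n}. {step.action} (~{step.minutes} min)"
            for n, step in enumerate(self.steps, start=1)
        ]
        lines.append(f"Done when: {self.done_when}")
        lines.append(f"Estimated total: {self.total_minutes()} min")
        return "\n".join(lines)


# Hypothetical plan for moving background jobs to a new server.
plan = RollbackPlan(
    trigger="Job delay above 2 minutes",
    decided_by="lead owner",
    executed_by="backup owner",
    data_risk="Emails already sent from the new server stay sent",
    done_when="Queue is draining and job delay is back under 2 minutes",
    steps=[
        RollbackStep("Stop workers on the new server", 2),
        RollbackStep("Point workers back to the old server", 5),
        RollbackStep("Confirm jobs start draining again", 3),
    ],
)
print(plan.as_runbook())
```

Because every field is required, a plan with no trigger or no data-risk note cannot be created at all, which mirrors the five questions above.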

Which metrics to watch right away

Keep the watch list short. If it is too long, people miss the signal.

Error rate comes first. If requests start failing, background jobs stop completing, or 5xx responses jump above the normal range, treat it as a real problem.

Response time matters next. For web traffic, watch p95 or p99 latency, not just the average. For workers, imports, reports, or sync tasks, watch job duration and completion time. A system can stay "up" while work slows to a crawl.

Infrastructure health still matters, but tie it to impact. CPU, memory, disk pressure, and queue depth can warn you early, especially after config changes, scaling updates, or database work.

A short watch list often includes:

error rate for the changed service
response time or job duration
CPU and memory on the busiest nodes
queue depth or retry count
one or two user actions such as login or checkout

Those user actions keep the team honest. A dashboard can look fine while customers cannot sign in, create an order, or finish a payment. Pick the actions that bring revenue or unblock daily work, and check them on purpose.

Every metric needs a rollback threshold. "We will keep an eye on it" is not a threshold. "Rollback if 5xx errors stay above 2% for 5 minutes" is clear. "Rollback if report jobs take more than 3 times the normal duration" is clear too.
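
Both example thresholds can be written as small checks. This is a sketch under stated assumptions: the metric samples arrive as (timestamp in seconds, value) pairs sorted by time, and the function names and sample values are invented for illustration. How you fetch the samples depends on your monitoring system and is not shown.

```python
def stays_above(samples, threshold, window_seconds):
    """Return True if values stay above `threshold` for at least
    `window_seconds` without dropping back.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # back under the threshold, reset the clock
    return False


def too_slow(duration, normal_duration, factor=3):
    """Return True if a job took more than `factor` times its normal duration."""
    return duration > factor * normal_duration


# Hypothetical 5xx rate in percent, sampled once a minute.
sustained = [(0, 0.5), (60, 2.4), (120, 3.1), (180, 2.8),
             (240, 2.6), (300, 2.9), (360, 3.0)]
brief_spike = [(0, 0.5), (60, 4.0), (120, 0.6), (180, 0.4)]

print(stays_above(sustained, threshold=2.0, window_seconds=300))    # True
print(stays_above(brief_spike, threshold=2.0, window_seconds=300))  # False
print(too_slow(duration=400, normal_duration=120))                  # True
```

The brief spike does not trigger a rollback because the rate drops back under 2% before 5 minutes pass. That is the point of pairing the number with a time window.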

Who needs to be available during the change

A good change request names people, not just tasks. If nobody owns the change in real time, small issues turn into long outages.

Start with one lead. That person runs the change, watches the clock, and decides when to pause. Teams get into trouble when two people both think they are in charge, or when nobody wants to make the call.

You also need one backup who can take over fast. The backup should know the plan, have the same access, and stay present for the whole window. If the lead loses internet, gets pulled into another incident, or makes a bad call under stress, the backup keeps the work moving.

One more role matters just as much: the person who can approve a rollback. Do not assume that approval will be easy to get in the middle of a problem. Name that person before the change starts, and make sure they know what would trigger a rollback.

A simple owner coverage plan for deployments looks like this:

the lead responds in chat within 2 minutes and answers calls right away
the backup stays online, follows each step, and can take over within 5 minutes
the rollback approver stays reachable for the full change window
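
Coverage for the full window is easy to check before the change is scheduled. A minimal sketch, assuming each person states when they are reachable; the role names follow the list above, while the function names, date, and times are hypothetical.

```python
from datetime import datetime


def covers_window(available_from, available_until, window_start, window_end):
    """True if a person is reachable for the whole change window."""
    return available_from <= window_start and available_until >= window_end


def coverage_gaps(window, availability):
    """Return the roles whose availability does not span the window.

    `window` is a (start, end) pair; `availability` maps each role to a
    (from, until) pair. All values are datetimes.
    """
    start, end = window
    return [
        role
        for role, (available_from, available_until) in availability.items()
        if not covers_window(available_from, available_until, start, end)
    ]


def at(hour, minute=0):
    """Helper for this example: a time on a hypothetical change day."""
    return datetime(2025, 3, 25, hour, minute)


window = (at(14), at(15))
availability = {
    "lead": (at(13), at(17)),
    "backup": (at(13, 30), at(16)),
    "rollback approver": (at(14, 30), at(18)),  # joins 30 minutes late
}
print(coverage_gaps(window, availability))  # ['rollback approver']
```

Any role in the result means the window should move or someone else should take that role.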

Timing matters more than teams like to admit. Do not schedule a risky change when the service owner is asleep, in transit, or working over a bad connection. A database update at 2 p.m. with the right people online is safer than the same update at 11 p.m. when the only person who understands the system is half awake.

If a change needs rare knowledge, make that person part of the plan. Waiting 20 minutes for answers during a bad deploy feels much longer when users are already affected.

A simple example from a small team

A five-person startup wants to move its background jobs to a new server. The jobs send emails, process imports, and clean up old records. The change sounds small, but if the queue slows down, customers feel it fast.

The engineer who owns the job system writes the request. It includes the exact deploy steps, the rollback path, and the names of the people on point during the move. Nobody treats "we can switch it back" as a real plan. The owner writes each step in order, including how to point workers back to the old server and how to confirm jobs start draining again.

Before the team starts, they agree on the numbers they will watch for the first hour:

job delay stays under 2 minutes
error rate does not rise above the normal range
support inbox and chat do not show new complaints about missing emails or stuck imports

They also set clear coverage. The main engineer runs the change. A second engineer stays on call for one hour after the move, even if nothing looks wrong. That extra person matters when the owner gets pulled into logs, queues, or a quick config fix.

The cutoff is simple. If job delay goes past the 2-minute limit, they revert. They do not wait for a longer trend, and they do not argue about whether the spike is "probably fine." They switch workers back, confirm the queue starts clearing, and note what failed before trying again.
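
The team's cutoff is a hard limit with no trend analysis, which a few lines can capture. This is a sketch only: the readings are made-up values and the function name is an assumption, not something the team in the example is said to use.

```python
JOB_DELAY_LIMIT_SECONDS = 120  # the agreed limit: job delay under 2 minutes


def first_breach(readings, limit=JOB_DELAY_LIMIT_SECONDS):
    """Return the minute of the first reading over the limit, or None.

    `readings` is a list of (minutes_since_move, job_delay_seconds) pairs.
    A single reading over the limit is enough; no trend is required.
    """
    for minute, delay in readings:
        if delay > limit:
            return minute
    return None


readings = [(5, 40), (10, 75), (15, 130), (20, 90)]  # hypothetical values
breach = first_breach(readings)
print("revert" if breach is not None else "continue", breach)  # revert 15
```

The delay recovers by minute 20, but the decision was already made at minute 15. That matches the rule of not arguing about whether a spike is "probably fine."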

That is what a practical ready check looks like. The team knows who does the work, what success looks like, and when to stop.

Mistakes that create avoidable risk

Most bad deploy days start with casual thinking. "We can fix it live" is not a plan. When production breaks, people stop thinking clearly, skip steps, and argue about what changed.

A real rollback plan tells the team what to do, who does it, and how long it should take. It also says what happens if the first rollback step fails. Without that, "we'll just revert" is wishful thinking.

Vague success checks cause a different kind of mess. If the goal is "looks fine," one person will say the change worked while another sees rising latency or failed jobs. Pick numbers before the change starts. Error rate, p95 latency, queue depth, login success, or order completion are much better than gut feel.

Ownership gets blurry in tickets all the time. "Platform team" is not an owner. One person should approve the change, one person should run it, and one person should watch the metrics. In a tiny team, one person may do two of those jobs, but the names still need to be explicit.

Timing also adds risk fast. Late Friday changes are a bad habit. A small mistake at 5:30 p.m. can turn into a weekend outage because the people who know the system are offline, tired, or stuck on a phone.

Teams skip rehearsal when a change seems small. That is where simple edits bite back. A dry run in staging often catches missing permissions, bad environment variables, or a dashboard nobody set up.

A good ready standard blocks all of this. No one touches production until the rollback steps are written, the expected metrics are clear, and coverage is in place.

Quick checks before anyone touches production

Most bad infrastructure changes do not fail because the idea was wrong. They fail because someone skipped one small check, then had to guess under pressure.

The rule is simple: if the team cannot answer a few plain questions before the deploy, the change waits.

Use a short pre-flight check every time:

the rollback steps are written down and easy to follow
the team knows which numbers should stay normal after the change
a backup owner can respond right now
the rollback trigger is clear
people who may feel the impact know the timing

This takes a few minutes. It can save hours.

Lean teams need this even more. If one engineer is deploying and also handling alerts, the second owner matters a lot. That is especially true in AI-first operations, where fewer people can run more systems only if the handoff and rollback rules are clear.

What to do next

Write this down once, then reuse it. Your definition of ready for infrastructure changes should fit on one page, use plain language, and take only a few minutes to review before a deploy. If the process feels heavy, people will skip it the first time they get busy.

A short checklist usually works better than a long policy. Keep the same ready check for every production change so the team builds a habit. Over time, that consistency matters more than fancy wording.

Start with a simple template:

what is changing
how to roll it back
which metrics should move, and which should stay stable
who owns the change, and who covers if they step away

Then make one rule: nobody touches production until those four parts are filled in clearly. No half-written rollback notes. No missing owner. No guessing which dashboard to watch after the deploy.

Teams often overbuild this. Try not to. One page is enough for most changes, especially in a small team. If a request needs three documents and two meetings, the process is too big.

If you want an outside review, Oleg Sotnikov at oleg.is works as a Fractional CTO and helps startups and small teams tighten production processes without making them heavy. That kind of review is useful when a team ships often, runs lean, or has very little room for mistakes.

A good ready check is boring on purpose. People can read it fast, use it every time, and trust it when production is on the line.

Frequently Asked Questions

What does a definition of ready mean for infrastructure changes?

It is a short standard your team checks before any production work starts. It says what will change, how you will roll it back, which metrics you will watch, and who owns the work if trouble starts.

What should every production change request include?

Keep it simple and concrete. Write one clear sentence about the change, name the systems in scope, state what should improve or stay stable, add rollback steps, set rollback thresholds, and name the lead and backup owners.

Why do rollback steps matter so much?

Because memory fails under stress. Written rollback steps let any teammate revert fast without guessing which command to run, which version to restore, or who makes the call.

How detailed should rollback instructions be?

Write them so a tired teammate can use them at night without asking questions. Include the exact commands, versions, restart steps, who approves the rollback, how long it should take, and what will not return to the old state cleanly.

Which metrics should we check right after a deploy?

Start with user impact, not just server health. Watch error rate, p95 or p99 latency, job duration or queue delay, and one or two real user actions like login, checkout, or report completion.

What makes a good rollback threshold?

Pick a number and a time window before you start. For example, roll back if 5xx errors stay above 2% for 5 minutes or if queue delay rises past your agreed limit. Clear thresholds stop debate when pressure rises.

Who needs to be available during a production change?

You need one lead who runs the change, one backup who can take over fast, and one person who can approve a rollback. All three need to stay reachable for the whole window.

Can a small team skip the backup owner?

No. Small teams need backup coverage even more because one person often handles deploys, alerts, and fixes at the same time. If the lead drops offline, the backup keeps the change from turning into confusion.

When should we postpone a production change?

Block it when any basic part is missing or unclear. If the team cannot explain the change in plain language, show the rollback on screen, name the owners, or agree on the metrics, wait and finish the request first.

Do we need a long process, or is a short checklist enough?

Keep it short enough that people will actually use it. For most teams, a one-page template and a brief live review work better than a long policy nobody reads under pressure.