Infrastructure runbooks that replace heroic fixes
Infrastructure runbooks help teams turn manual deploys, secret fixes, and restore steps into clear routines that anyone can follow under pressure.

Why infra turns into hero work
Infrastructure becomes hero work when routine jobs live in someone's memory instead of in a shared process. A deploy that should feel boring starts to depend on a specific order: update one service, restart another, clear a queue, wait two minutes, then check the logs. If only one person knows that order, the team does not have a process. It has a habit.
Small teams hide this problem well. One engineer remembers that a migration has to run before the API restarts. Another knows which secret broke last month and how they fixed it. Nobody means to keep that knowledge private. They just solve the issue fast and move on.
Then the record ends up in chat. A rushed message says "fixed it" or "rotated the key," but it leaves out the actual steps, what changed, and how to verify that the system is healthy again. The next person has to guess. Guessing creates stress, and stress leads to more shortcuts.
Restores expose the gap even faster. Plenty of teams feel safe because backups exist. That confidence disappears when nobody has tested the restore on a calm day. If the first real restore happens during an outage, people are doing it under pressure while customers wait. Even a simple recovery can get messy fast.
Urgency keeps the cycle going. When production breaks, writing notes feels slower than fixing the problem. Teams skip the deploy checklist, skip the incident notes, and promise to document everything later. Later rarely happens.
That is why runbooks matter. They remove mystery from routine work and lower the team's dependence on the usual fixer: the person who knows the odd restart order, the hidden credential issue, or the one backup command that worked once at 2 a.m.
Which jobs need a written routine
Start with the work people do under stress, late at night, or only a few times a year. If someone completes a task from memory, from old chat messages, or by copying commands from shell history, that task needs a written routine.
A good runbook is not a giant manual for everything. It covers the jobs that can break production, block a release, or leave the team stuck when one person is away.
In most teams, the first batch is easy to spot: deploys that depend on habit, secret changes that only admins know how to make, database restores pulled from old incident notes, DNS edits and certificate renewals that happen rarely enough to be forgotten, and rollback steps nobody wants to improvise during a bad release.
Deploys are usually the first place to look. Teams often say they have a process, but the real process lives in one engineer's head: run the migration, clear a cache, restart a worker, check a dashboard, wait a bit, then switch traffic. Write that order down. Include who does it, what to check before and after, and what success looks like.
Secrets need the same treatment. Password rotation, API key replacement, service account changes, and expired credentials often cause quiet failures. People remember the fix only because they handled it once six months ago. A written routine should say where the secret lives, who can change it, what needs a restart afterward, and how to confirm the new value works.
Restores usually need more detail than teams expect. Backups are only half the job. The real test is whether people can restore the system, start the right services in the right order, and prove that users can actually use the product again.
Find the work that stops when one person is away
Use a blunt test: what stops when one person takes a day off?
If deploys pause, database restores wait, or nobody can fix a broken secret without one engineer, the problem is obvious. The team still depends on one person's memory.
Ask this for every repeating infra task: who can do this today, from start to finish, without asking the usual expert? Use real names. If the answer is one person, mark that task.
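That question can be turned into a quick audit. The sketch below is only an illustration in Python: the task list and the names are made up, and `single_person_tasks` is a hypothetical helper, not part of any tool. It flags every task that fewer than two people can finish without help.

```python
# Hypothetical audit data: task -> people who can run it end to end today.
# Tasks and names are placeholders; replace them with your team's real answers.
CAN_RUN_UNAIDED = {
    "production deploy": ["Sam"],
    "database restore": ["Sam"],
    "secret rotation": ["Sam", "Priya"],
    "DNS change": [],
}


def single_person_tasks(coverage):
    """Return tasks that zero or one person can complete without help."""
    return sorted(
        task for task, people in coverage.items() if len(set(people)) < 2
    )


if __name__ == "__main__":
    for task in single_person_tasks(CAN_RUN_UNAIDED):
        owners = ", ".join(CAN_RUN_UNAIDED[task]) or "nobody"
        print(f"AT RISK: {task} (can run it today: {owners})")
```

A spreadsheet works just as well. What matters is writing real names next to each task and treating an empty or single-name row as a risk.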
The risky work often hides in plain sight. A task may look documented, but the real steps live somewhere else: shell history, a scratch file on one laptop, a private note, or terminal commands copied from an old incident. That still counts as single-person work.
A few warning signs show up again and again. One person knows the exact commands and the order. The task depends on private notes. Only one person has the right access or tokens. People say, "Ask Sam, he usually does it." The work touches production, and nobody wants to guess.
Start with the tasks that can hurt live systems. Put deploys, rollbacks, restores, secret rotation, DNS changes, and emergency access resets at the top of the list. If a mistake there can take the product down, do not leave it stuck in one person's head.
This shows up a lot in small product teams. One engineer knows the deploy script flags. Another keeps restore notes in a local text file. The founder has the only admin login for a payment or email account. Everyone means well, but one sick day turns a normal issue into a long outage.
Write a deploy checklist people can actually follow
Many deploys fail not because the code is bad but because the steps live in one person's head. A deploy checklist turns that memory test into a routine.
Start with the normal path. Do not try to capture every edge case on day one. If your team usually ships from main to production after tests pass, write that path first. You can add rare branches later.
Write every action as if a tired teammate will run it at 6 p.m. Use exact commands, exact screens, and exact approval points. "Deploy the API" is too vague. "Merge the pull request, wait for CI to pass, confirm the release tag, run the deploy command, check the health endpoint" is much better.
The checklist becomes easier to trust when each step includes the expected result. That gives people a reason to stop early instead of pushing through confusion.
For example, say what should happen after the build starts, after the deploy job runs, after the app health check returns, and after logs or error tracking stay quiet for a few minutes. If the expected result does not appear, the checklist should tell the reader what to do next.
Put the rollback step right next to the risky step. Do not hide it in another document. If a database migration can break the app, the next line should say who decides to roll back, what command they run, and how they confirm that the old version is healthy again.
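One way to picture this structure is a checklist where each step carries its own expected result, and risky steps carry their rollback on the very next line. The Python sketch below is a minimal illustration under assumptions: the `Step` fields, the step wording, and the rollback text are placeholders, not a real team's procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str                     # what the person does, in exact terms
    expected: str                   # what they should see if it worked
    rollback: Optional[str] = None  # filled in only for risky steps


# Placeholder steps; the wording and the rollback text are illustrative.
DEPLOY = [
    Step("Merge the pull request", "CI starts on main"),
    Step("Wait for CI to pass", "All checks green"),
    Step("Confirm the release tag", "Tag matches the merged commit"),
    Step("Run the deploy command", "Deploy job finishes without errors",
         rollback="Redeploy the previous release tag; on-call engineer decides"),
    Step("Check the health endpoint", "Health check returns OK"),
]


def render(steps):
    """Format the checklist so the rollback sits directly under its step."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.action}")
        lines.append(f"   Expect: {step.expected}")
        if step.rollback:
            lines.append(f"   If not: {step.rollback}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(DEPLOY))
```

The format matters less than the rule it enforces: no step without an expected result, and no risky step without a rollback beside it.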
Approvals matter too. Name the role or person who gives the final go-ahead before production changes. That removes awkward pauses and last-minute guessing.
Then test the draft with someone who did not write it. Ask them to run it once without coaching. If they stop to ask, "Which tag do I use?" or "Where do I check errors?" the checklist still has holes. That simple test is what makes a runbook useful to the whole team instead of only the person who wrote it.
Document secret and access fixes
Secret problems often turn into fire drills because nobody has the full map. A token expires, a teammate loses access, a CI job fails, and the fix lives in chat history or in one engineer's head.
Start with an inventory. For each secret, write down where it lives, who owns it, what uses it, and what breaks if it changes. Put API keys, database passwords, cloud credentials, service accounts, SSH keys, and signing keys in the same format so people do not have to guess.
Keep the record practical. Name the source of truth, such as the password manager or secret store. Name the person or team that approves changes. List every app, worker, cron job, or outside service that uses the secret. Then write the actual rotation steps in order.
That order matters. "Rotate the key" is not enough. People need the sequence: create the new credential, store it in the right place, update the app or job, confirm that it works, and only then revoke the old one. Skip the order and people break production while trying to fix it.
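The sequence can be sketched as a small function that refuses to revoke the old credential until the new one has been verified. This is an assumption-heavy illustration in Python: every argument is a hypothetical hook the caller supplies, so nothing here reflects a specific secret store or cloud API.

```python
def rotate_secret(create_new, store, update_consumers, verify, revoke_old):
    """Run a rotation in the safe order; keep the old credential if checks fail.

    All five arguments are caller-supplied functions (hypothetical hooks).
    """
    new_value = create_new()         # 1. create the new credential
    store(new_value)                 # 2. put it in the source of truth
    update_consumers(new_value)      # 3. update the apps or jobs that use it
    if not verify():                 # 4. confirm the new value works
        # The old credential is still valid, so consumers can fall back to it.
        return "verification failed: old credential kept"
    revoke_old()                     # 5. only now remove the old one
    return "rotated: old credential revoked"


if __name__ == "__main__":
    log = []
    result = rotate_secret(
        create_new=lambda: "new-key",
        store=lambda value: log.append("store"),
        update_consumers=lambda value: log.append("update"),
        verify=lambda: log.append("verify") or True,
        revoke_old=lambda: log.append("revoke"),
    )
    print(result, log)
```

Even if the team never automates rotation, the written runbook should follow the same shape: revocation is the last step and depends on a passing check.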
Access recovery needs the same level of detail. If someone loses admin access to cloud, Git hosting, CI, or the password manager, spell out the approved path. Include who verifies identity, who can grant access, and what to do outside normal working hours.
Remove any fix that depends on one personal laptop. If a backup decryption key, SSH config, or certificate exists only on a founder's machine, that is not a process. Move it into a managed store, document how to retrieve it, and make sure another trusted person can follow the steps without hidden context.
When this is done well, access issues become dull. That is exactly what you want at 2 a.m.
Practice restores before you need one
A backup helps only if you can turn it into a working system under pressure. Pick one recent backup and restore it in an isolated environment that is close enough to production to reveal real problems. Do not rely on memory. Use the same commands, access, files, and approvals you would use during a real incident.
Start a timer at the first step. Track how long it takes to download the backup, import data, start services, warm caches, and get the app into a usable state. Teams often guess wrong here. What sounds like a 15-minute restore can easily take an hour once indexes rebuild and workers catch up.
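A stopwatch and a notepad are enough, but if the drill is scripted, per-phase timing is easy to capture. The Python sketch below is one possible approach: the phase names are examples, the `time.sleep` calls stand in for real restore commands, and `timed` and `report` are illustrative helpers.

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> elapsed seconds


@contextmanager
def timed(phase):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


def report(results):
    """Return one line per phase plus a total, all in minutes."""
    total = sum(results.values())
    lines = [f"{phase}: {seconds / 60:.1f} min" for phase, seconds in results.items()]
    lines.append(f"total: {total / 60:.1f} min")
    return lines


if __name__ == "__main__":
    with timed("download backup"):
        time.sleep(0.1)  # stand-in for the real download step
    with timed("import data"):
        time.sleep(0.1)  # stand-in for the real import step
    print("\n".join(report(timings)))
```

Keep the numbers from each drill. Comparing them over time shows whether the restore is getting faster or slowly drifting.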
Then check the product like a user would. Can people sign in? Do app files and object storage still match the database? Do queues, cron jobs, and webhooks restart safely? Does monitoring show normal signals afterward? Data can come back cleanly while the product is still half broken.
Write down the point where the responder stops trying random fixes and switches to the backup plan. Be specific. If login checks fail after restore, or if the main database is still unhealthy after a set amount of time, the on-call person should know exactly when to move to rollback or failover.
Run the drill again until two people can complete it without help from the usual expert. On the second round, swap roles. Let one person follow the checklist exactly while the other watches for missing steps, unclear wording, or secret knowledge that never made it onto the page.
A restore procedure should include the timing, the post-restore checks, and the stop or go decision in plain language. If only one person knows how to restore production, you do not have a process yet. You have a dependency.
A small team example
A five-person product team had one deployment method: the founder opened a laptop late at night, ran a few commands from memory, watched the logs, and hoped nothing looked strange. It worked often enough that nobody pushed to change it.
Then a small failure turned into a bad night. A token for the job queue expired right after a release, so background work stopped. New signups still went through, but emails, imports, and billing tasks piled up. The app looked healthy from the outside, which made the problem harder to spot.
The team tried to recover quickly and found another gap. They had backups for the database and object storage, but nobody knew the restore order. Should they restore the database first, then files, or the other way around? Which services had to stay off until the data matched again? They lost time arguing about steps they should have written months earlier.
The fix was simple. They wrote three short checklists: one for deployment with exact commands and rollback steps, one for tokens and secrets with rotation and test steps, and one for restores with the order for database, storage, queues, and app startup.
They kept each checklist short enough to use under stress. One engineer owned deploy updates, another owned secrets, and the founder reviewed restore changes after each backup test.
A week later, a release broke a migration. This time nobody guessed. The engineer on call followed the checklist, ran the rollback, and restored the previous version in a few minutes. The failure still happened, but it did not turn into midnight drama.
Mistakes that keep the cycle going
Documentation usually fails in a specific way: teams write intentions instead of actions. "Deploy safely" or "rotate secrets quickly" sounds fine until someone tired opens the page at 2 a.m. A useful checklist says which command to run, where to verify health, who approves the change, and when to stop.
Another common problem is fake access. The team says everyone can handle production, but one senior person still has the only working credentials, the only cloud billing login, or the only SSH path that actually works.
Storage matters more than people admit. Runbooks do not help if they live in a private note, an old chat thread, or a repo the on-call person cannot open from a locked-down laptop. Put the docs where responders can reach them during an incident, and keep the latest deploy, incident, and restore steps together.
Teams also skip rollback notes because the last few deploys went well. That is how one harmless-looking release turns into an hour of guessing. Every deploy page should answer a simple question: if this breaks, how do we return to the last known good version, and what data risks come with that move?
The last mistake is letting docs sit untouched after an incident. People patch the server, fix the alert, and move on. Then the same gap appears again a month later. Update the steps while the details are still fresh. Remove dead commands, add missing output, and note the permission that blocked the first responder.
A small team can survive heroics for a while. It cannot grow on them.
How to tell a runbook is ready
A runbook is not done when the steps look neat. It is done when someone new can use it on a normal day without guessing. If a teammate joins next week and still needs a private chat to finish the deploy, the document is only half written.
Start with access. One person should never be the only path to a secret, dashboard, backup store, or deployment tool. At least two people need working access, and both should test it.
The checklist also needs a clear stop point. That might be a failed migration, a missing backup check, a secret that does not validate, or a rollback that behaves oddly. Write the exact signal and the exact person or team to contact. People make worse mistakes when they feel pushed to improvise.
Restore steps need the same reality check. If the business says it can live with 30 minutes of downtime, but the last restore test took two hours, the procedure is not ready. Either speed it up or change the expectation. Hope is not a recovery plan.
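The comparison is simple arithmetic, and writing it down keeps the conversation honest. In the Python sketch below, the function names are illustrative and the numbers come from the example above: 30 minutes of accepted downtime against a two-hour restore test.

```python
def restore_is_ready(measured_minutes, accepted_minutes):
    """True only if the last drill finished inside the accepted downtime."""
    return measured_minutes <= accepted_minutes


def gap_minutes(measured_minutes, accepted_minutes):
    """How far the last drill overshot the accepted downtime (0 if it fit)."""
    return max(0, measured_minutes - accepted_minutes)


if __name__ == "__main__":
    measured, accepted = 120, 30  # two-hour drill vs. 30 accepted minutes
    print("ready:", restore_is_ready(measured, accepted))
    print("gap to close:", gap_minutes(measured, accepted), "minutes")
```

In this example the gap is 90 minutes, which the team closes either by speeding up the restore or by renegotiating the accepted downtime.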
A quick review helps. Can a new teammate run the process in a test environment from start to finish? Can two people open every tool and reach every secret they need? Does the document name the stop points and who owns escalation? Did a recent restore drill finish inside the accepted time window? Did the owner clean up old steps after the last change?
That final check gets skipped a lot. Every deploy, access change, or backup change leaves behind small lessons. If nobody updates the notes, the runbook drifts away from reality. After a few months, people stop trusting it and go back to heroic fixes.
If you want a blunt test, hand the document to the person with the least context on the team and watch quietly. Where they pause, your system still depends on memory instead of routine.
What to do next
Start with the task that causes the most last-minute stress. Pick one job that people still treat like special knowledge, such as a deploy, a restore, or a secret change, and write the steps down this week.
Do not wait for a perfect template. A plain checklist is enough if another person can follow it without guessing. Most runbooks start as simple notes and get better each time the team uses them.
A small team can make real progress with a few moves. Choose one painful task and document the exact steps, checks, and rollback path. Give one person ownership so updates do not drift. Put a short restore drill on the calendar and run it like a real event. Read through old incidents and turn repeat fixes into standard routines.
Ownership matters more than polish. If everyone can edit the document but nobody owns it, the checklist gets stale fast.
The restore drill should stay small. Pick a backup, restore it into a safe environment, confirm that the app starts, and note what broke or took too long. Teams usually learn more from one short drill than from months of confident assumptions.
Past incidents are full of missing checklist items. If someone once had to restart services in a certain order, clear a stuck queue, or repair access by hand, that fix should not live only in chat history.
If your team keeps running into deploy gaps, weak restore steps, or unclear ownership, Oleg Sotnikov at oleg.is works as a fractional CTO and startup advisor and helps teams tighten infrastructure routines and reduce risky manual work. A brief outside review can spot the jobs your team has learned to treat as normal.
When one painful task becomes routine, pick the next one. That is how hero work starts to disappear.
Frequently Asked Questions
What does heroic infrastructure work actually mean?
It means routine infra work depends on one person's memory. If a deploy, restore, or access fix only works when the usual expert is around, the team has a habit, not a process.
Which tasks should I document first?
Start with tasks that can break production or block a release. Deploys, rollbacks, restores, secret changes, DNS edits, and emergency access resets usually give the biggest payoff because people often do them under stress.
How detailed should a deploy checklist be?
Write it so a tired teammate can run it without guessing. Use exact commands, exact places to check, the expected result after each step, and the rollback action right next to the risky step.
Where should we store runbooks?
Keep runbooks in one place the on-call person can open during an incident. Do not leave them in private notes, old chat threads, or on one laptop.
Who should own a runbook?
Give each runbook one owner. That person updates it after deploy changes, incidents, and restore drills so the steps stay close to real life.
What should a secret rotation runbook include?
For secrets, document where each one lives, who can change it, what uses it, and the exact order for rotation. Create the new credential, update the app or job, confirm it works, and only then remove the old one.
How often should we test restores?
Run a restore drill on a calm day, then repeat it after major infra or app changes. Even a short test exposes timing, access, and startup problems before a real outage does.
How do I know our restore process is good enough?
Look at two things: can two people complete the restore without help, and does the full recovery finish inside the downtime your business accepts. If either answer is no, keep working on the process.
What makes a runbook hard to use?
Most teams write intentions instead of actions. Phrases like "deploy safely" or "rotate the secret" do not help at 2 a.m.; people need the command, the check, the stop point, and the person who approves the next move.
Can a small team fix this without adding a lot of process?
Yes. Pick one painful task this week, write a plain checklist, and test it with someone who did not write it. Small teams improve fast when they stop waiting for a perfect template and start with one real routine.