Apr 26, 2025·8 min read

Recovery time targets that fit your team and budget

Set recovery time targets that match real outages, staff hours, and budget limits, so your disaster plan turns into clear restore goals.

Why vague plans fail during an outage

A plan that says "restore fast" isn't a plan. It's a wish, and wishes fall apart when a service goes down at 2:13 a.m. One person reads that line and thinks 15 minutes. Another thinks half a day is fine. Both feel certain until the outage starts.

That gap in expectations causes trouble fast. Sales wants the customer portal back first. Finance wants billing. Engineers may focus on the database because everything else depends on it. None of those priorities are irrational, but vague language forces people to argue when the clock is already running.

A short outage can turn into a messy debate about what to restore first. Every minute spent deciding is still downtime. Teams often blame the outage itself, but confusion is usually the bigger problem. Clear recovery time targets remove that guesswork before anyone is under pressure.

Untested plans hide slow steps that look harmless on paper. A backup may exist, but restore time depends on details: who has the login for the backup system, where restored data should go, how long a database import really takes, who can approve a failover or DNS change, and what still breaks after the service comes back.

Those details rarely appear in a neat document. They show up when someone can't access the admin console, an old script fails, or the only person who knows the process is asleep or on vacation.

Small companies feel this harder than big ones. If you have three engineers, you can't pretend you have a 24/7 recovery team. Your plan has to fit the people you actually have, the tools they know, and the monthly cost you can keep paying.

Picture a small SaaS company that tells customers it can recover "quickly." During an outage, the app server comes back first, but logins still fail because the session store and email service weren't part of the first restore step. The team did restore something quickly. They just didn't restore the right thing.

That's why vague disaster recovery planning breaks down. It leaves timing, order, and ownership open to interpretation, and outages punish ambiguity.

Choose what needs to come back first

Most outage plans fail for a simple reason: they treat every system as if it matters the same amount. Not every system does. If your team is small, you need a short restore order that protects cash flow, customer support, and people getting paid.

Start by naming the systems that stop the business when they go down. For most companies, that means anything tied to sales, support, or payroll. A storefront, payment flow, CRM, shared inbox, phone system, time tracking tool, and payroll app often matter more than internal extras that can wait a day.

A simple sort usually works:

  • Now: systems that must come back first so the business can take orders, reply to customers, or avoid missed payroll
  • Same day: systems that cause pain but don't stop the company in the first hour
  • Later: tools people miss but can work around for a day or two

Keep the groups small. If everything lands in "now," the list is useless.

Give each system one owner. Not a team, not a department, one person. That person doesn't need to perform every restore step alone, but they must know where the backup lives, how to log in, and who can approve access. If three people own it, nobody owns it.

Then write down what each system needs before it can start. This catches hidden problems early. Your billing app may depend on a database, DNS records, cloud storage, API keys, and an email service. Your support inbox may need single sign-on working first. Payroll may depend on VPN access and a finance admin account that only one person has.

A small example makes the point. Say an online business has three top systems: checkout, support email, and payroll. Checkout belongs in "now," but it still can't run until the database and payment credentials are available. Support email is also "now" if customers expect quick replies. Payroll may fit in "same day" unless payday is today.
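The tiers and dependencies described above can be turned into a restore order with a small topological sort. This is a minimal sketch, assuming Python 3.9 or later for the standard `graphlib` module. The system names, tier numbers, and dependency map are hypothetical and only mirror the example in this section; your own inventory will differ.

```python
import heapq
from graphlib import TopologicalSorter

# Hypothetical inventory: tier 0 = "now", 1 = "same day", 2 = "later".
TIERS = {
    "checkout": 0,
    "support_email": 0,
    "database": 0,
    "payment_credentials": 0,
    "single_sign_on": 0,
    "payroll": 1,
    "vpn": 1,
}

# Each system maps to the things that must work before it can start.
DEPENDS_ON = {
    "checkout": {"database", "payment_credentials"},
    "support_email": {"single_sign_on"},
    "payroll": {"vpn"},
}


def restore_order(tiers, depends_on):
    """Order systems so dependencies come first and lower tiers are preferred."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies loop
    ready = []  # heap of (tier, name) for systems whose dependencies are restored
    order = []
    while sorter.is_active():
        for name in sorter.get_ready():
            heapq.heappush(ready, (tiers.get(name, 99), name))
        _, name = heapq.heappop(ready)
        order.append(name)
        sorter.done(name)
    return order


for position, system in enumerate(restore_order(TIERS, DEPENDS_ON), start=1):
    print(position, system)
```

With this sample data, checkout lands after the database and payment credentials, and payroll waits behind the VPN. If two systems depend on each other, the sort fails loudly, which is a useful thing to discover before an outage.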

This is when recovery time targets start to mean something. You're not picking numbers in a vacuum. You're choosing restore goals for real systems, in a real order, with real people and dependencies attached.

Set recovery time targets in four steps

Start with one system, not the whole company. When several things fail at once, people guess, and guesses turn a short outage into a long one.

Good recovery time targets come from a narrow view: one service, one clear failure, one timer.

Write the outage in plain words first. "Customers can't pay," "staff can't open orders," or "nobody can log in" works better than vague labels. Clear wording keeps everyone focused on the same problem.

  1. Pick the service that causes the most pain when it stops. Name the exact thing users or staff can't do.
  2. Ask how long work can stop before the damage gets worse. A 15-minute pause and a 4-hour pause need different plans.
  3. Time a real restore with your current backups, access, and tools. Use a stopwatch. A team can easily lose 20 minutes just finding credentials or locating the latest backup.
  4. Set a restore goal your team can hit on a bad day, without late-night heroics or one person rebuilding everything from memory.

That last step matters more than most people expect. A target isn't real if it works only when your most experienced engineer is online and ready to improvise. If recovery takes six people, manual fixes, and luck, the target is too aggressive.

A small example makes this easier to judge. Say a five-person company depends on its invoicing app. The owner wants it back in an hour, but a timed test shows the restore takes 3 hours and 40 minutes. That's the number you can plan around. Then you decide whether faster recovery is worth the extra backup cost, a simpler setup, or written restore steps.
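The gap between the wanted target and the measured restore is simple arithmetic, and it helps to write it down. A minimal sketch using the numbers from the invoicing example; the function name `target_gap` is hypothetical.

```python
from datetime import timedelta


def target_gap(wanted: timedelta, measured: timedelta) -> timedelta:
    """Return how far the measured restore overshoots the wanted target (zero if it fits)."""
    return max(measured - wanted, timedelta(0))


wanted = timedelta(hours=1)                # what the owner asked for
measured = timedelta(hours=3, minutes=40)  # what the timed test showed

gap = target_gap(wanted, measured)
print(f"Measured restore exceeds the wanted target by {gap}")
# prints "Measured restore exceeds the wanted target by 2:40:00"
```

That 2 hours and 40 minutes is what any extra spending or simplification has to remove before the one-hour target becomes honest.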

Repeat this only for the few systems that matter most. Focus on the ones that stop revenue, customer support, or daily work. You don't need detailed disaster recovery planning for every tool on day one. A short list of tested restore goals beats a thick plan nobody has tried.

Match the plan to your actual team

A recovery plan works only if the people on call can do the work. Many teams write restore goals as if everyone is available, fully trained, and sitting at a desk at 2 p.m. on a Tuesday. Real outages happen at bad times, and small teams feel that first.

Start by counting actual hands, not job titles. Who can restore the app? Who can bring back the database? Who can reset access, DNS, cloud settings, or VPN access if one system fails and blocks the rest?

In practice, you need an app restore owner, a data restore owner, an access and identity owner, and a backup person for each area. If one person covers three areas, count that as a risk, not a strength. A plan that depends on your only database expert answering the phone during a family trip is wishful thinking.

Recovery time targets should reflect nights, weekends, vacations, and sick days. If your best-case restore takes 90 minutes with two people online, the realistic target may be four hours when one person is alone and digging through notes. That's not failure. That's honest planning.

Small teams usually get in trouble when they promise a fast recovery for every system. Most companies can't restore everything at once. Pick the few services that must come back first, then accept slower targets for the rest. If payroll can wait until morning but customer logins can't, write that down and plan around it.

Drop any promise that needs more people than you have. If your runbook says one person rebuilds servers while another verifies backups and a third tests user access, but you only have two trained people, fix the promise or fix the staffing. Don't keep the gap on paper and hope it disappears during an outage.

The first version should be small enough to practice. One application, one database, one access path, one test restore. Teams learn more from a plain plan they rehearse twice a year than from a thick document nobody trusts. The shorter plan is often the one that gets the business back online.

Decide what faster recovery is worth

Faster recovery costs money. The useful question is whether that spend is lower than the cost of being down.

Start with one hour of downtime. If your store, booking flow, or support queue stops, how much revenue do you lose? Add the quieter costs too: refunds, missed leads, staff sitting idle, and the time spent calming customers.

Now price the upgrade that would cut that hour down. For most teams, the bill isn't just one tool. It usually includes extra backup storage, a spare server or cloud instance, software or backup license fees, engineer time to set it up, and time to test restores every few months.

That last part matters because a cheap backup that nobody has restored under pressure is often a false bargain.

Say one hour of downtime costs about $2,000. If a better setup costs $300 a month, or $3,600 a year, and saves two hours in a real outage, it recovers about $4,000 per outage and roughly pays for itself if you have one such outage a year. But if the system is an internal wiki that people can live without for a day, spending the same amount makes little sense.
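To run the same comparison with your own numbers, a few lines are enough. This is a rough sketch: the function names are hypothetical, and the number of outages per year is an assumption you have to supply, since the example gives only the hourly cost, the monthly price, and the hours saved.

```python
def yearly_net_benefit(cost_per_hour, hours_saved_per_outage, outages_per_year, monthly_cost):
    """Money saved by faster recovery in a year, minus what the upgrade costs."""
    saved = cost_per_hour * hours_saved_per_outage * outages_per_year
    spent = monthly_cost * 12
    return saved - spent


def break_even_outages(cost_per_hour, hours_saved_per_outage, monthly_cost):
    """Outages per year at which the upgrade pays for itself."""
    return (monthly_cost * 12) / (cost_per_hour * hours_saved_per_outage)


# Figures from the example; one outage per year is an assumed frequency.
print(yearly_net_benefit(2000, 2, 1, 300))  # 400
print(break_even_outages(2000, 2, 300))
```

If the break-even figure is well above how often the system actually fails, the money is probably better spent elsewhere.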

This is where recovery time targets stop being guesswork. Group your systems in plain terms: money-making, customer-facing, internal but important, and low-impact. Then spend first on the group where each lost hour hurts most.

Don't buy the fanciest option by default. Find the longest delay in your current restore path and remove that first. Sometimes the slow part is downloading backups. Sometimes it's rebuilding a server by hand. Sometimes one tired engineer is the bottleneck because only they know the steps.

The cheapest fix often wins. A ready-to-run server image, a tested restore script, or a pre-created database replica can save more time than a full second environment.

Leave low-impact systems on slower targets. Archives, old reports, and noncritical internal tools rarely need premium recovery. Save the fast path for the systems that keep the business open.

A simple example for a small company

A small SaaS team with eight people doesn't need a huge disaster plan. It needs a few clear restore goals that match what the team can actually do at 2 a.m. with one engineer awake and one founder on the phone.

In this example, the team sells a subscription product. If payments stop for half a day, revenue takes a hit fast, support tickets pile up, and customers lose trust. So they give billing a four-hour recovery time target.

The marketing site matters, but it doesn't need the same target. If the homepage stays down overnight, the company loses some leads, not existing customers or active revenue. They set that restore goal for the next morning and move on.

That choice saves money. Instead of trying to make every system recover in a few minutes, they spend effort where the pain is real.

They also write one short restore order for shared tools, because outages get messy when people guess:

  1. Restore access to the cloud account and password manager.
  2. Bring up the database backup and confirm the latest good snapshot.
  3. Restore billing services and verify that test payments work.
  4. Bring back internal chat, alerts, and support inbox access.
  5. Leave the marketing site until the urgent work is done.

This list looks almost too basic, but basic beats vague. A tired team can follow five lines faster than ten pages.

The team then reviews where recovery time actually slips. The problem isn't the lack of extra servers. The slow part is the database restore, because backups run too rarely and no one has timed a full restore.

So they fix that first. They tighten backup schedules, store copies in a second location, and run one real restore test. That usually helps more than buying standby servers they may not even know how to fail over to.

Most small companies should follow that pattern. Put a fast target on the system that protects cash and active customers, let lower-stakes systems wait, and improve the restore steps that waste the most time.

Mistakes that make recovery slower

A lot of outages last longer than they should because the plan treats every system the same. That sounds fair on paper, but it wastes time when people should be restoring the few things the business needs first. Your customer checkout, internal wiki, and expense tool do not need the same recovery time targets.

Another common miss is the dependency list. Teams restore the app, see servers running, and assume they're done. Then they find out DNS didn't switch, login depends on a third-party identity service, password resets need email, or orders still fail because the payment service is down.

Small companies feel this fast. One missed dependency can turn a two-hour restore into an all-day outage.

Guessing restore times is another expensive habit. Backups may finish every night, but restore speed is a different question. Large databases can take much longer to load than people expect. Access keys may be stored in the wrong place. Someone may need to rebuild search indexes or reissue certificates before users can log in.

Backup restore testing fixes that. Even a simple test gives you real numbers instead of wishful ones.

A few checks catch most of these problems. Restore one production-sized backup into a test environment and time every step, not just the file copy. Write down the outside services the system needs. Confirm who gets paged, who actually responds, and who can approve emergency spending.
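Timing every step, not just the file copy, can be as simple as wrapping each step in a timer. The sketch below is one possible shape: `RestoreDrill` and the step names are hypothetical, and the body of each `with` block is where your real restore work would go.

```python
import time
from contextlib import contextmanager


class RestoreDrill:
    """Records how long each named step of a restore drill takes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the class can be tested without waiting
        self.steps = []      # list of (step name, elapsed seconds)

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the step fails, since failures cost time too.
            self.steps.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.steps)

    def slowest(self):
        return max(self.steps, key=lambda item: item[1])


drill = RestoreDrill()
with drill.step("find credentials"):
    pass  # placeholder: log in to the backup system
with drill.step("load database backup"):
    pass  # placeholder: run the actual import
print(drill.steps)
```

The slowest recorded step is usually the first thing worth fixing.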

Staffing promises cause trouble too. Many plans quietly assume a 24/7 response, even though only two or three people know the system well enough to fix it. That breaks on weekends, during vacations, or when someone is sick. If one person holds all the access, your restore goals depend on that person answering the phone.

Money can slow recovery just as much as tech. During an outage, a team may need extra cloud capacity, contractor help, replacement hardware, or a paid support case. If nobody has pre-approval to spend, people wait for answers while the clock keeps moving.

The better plan is usually the simpler one. Promise less, test it, and make sure the team you have can deliver it.

Quick checks before you sign off

A plan is ready only when someone can use it at 2 a.m. without a debate. If people still need a meeting to decide who acts first, the plan isn't finished.

Run one last review before you approve the document. These checks catch weak recovery time targets before an outage does.

  • Every system that affects revenue, support, operations, or compliance has one named owner, plus a backup owner for when that person is away.
  • Every restore goal comes from a recent timed test, not a guess. If nobody has measured a real restore in the last few months, the target is still hope.
  • The first three outage actions are written in plain language. People know who checks the scope, who protects the latest data, and who starts the restore.
  • The budget covers the full job, including backup tools, storage growth, test time, and on-call time if you expect people to respond outside normal work hours.
  • Leaders have agreed on what stays down longer. If the team must choose between payroll, customer logins, and internal reporting during an outage, that choice should already be made.

One missing piece can break the whole plan. A company may say its billing system must return in two hours, but if only one engineer knows the steps and nobody has tested the restore since last year, that target is fiction.

The same goes for cost. Teams often approve a fast restore goal, then skip the extra storage, failover setup, or paid support hours needed to reach it. The paper plan looks cheap. The real outage doesn't.

A short reality check helps. Ask five direct questions: who owns this, when did we last time a real restore, what happens first, what does it cost each month, and what can wait until tomorrow? If the answers are clear and match across the team, you can sign off with a straight face.
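Some of these checks are mechanical enough to automate. The following is a rough sketch under stated assumptions: the `RestoreGoal` fields, the sample names, and the 180-day freshness window are all hypothetical choices, not a standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RestoreGoal:
    system: str
    owner: Optional[str]
    backup_owner: Optional[str]
    target: timedelta
    measured: Optional[timedelta]  # result of the last timed restore, if any
    last_tested: Optional[date]


def sign_off_problems(goal, today, max_age=timedelta(days=180)):
    """List the reasons a restore goal is not ready to sign off (empty list means ready)."""
    problems = []
    if not goal.owner:
        problems.append("no named owner")
    if not goal.backup_owner:
        problems.append("no backup owner")
    if goal.measured is None or goal.last_tested is None:
        problems.append("never timed")
    else:
        if today - goal.last_tested > max_age:
            problems.append("timed test is stale")
        if goal.measured > goal.target:
            problems.append("measured restore is slower than the target")
    return problems


# Hypothetical entry: a two-hour billing target last tested over a year ago.
billing = RestoreGoal(
    system="billing",
    owner="Dana",
    backup_owner=None,
    target=timedelta(hours=2),
    measured=timedelta(hours=3),
    last_tested=date(2024, 1, 10),
)
print(sign_off_problems(billing, today=date(2025, 4, 26)))
```

A check like this covers ownership and test freshness. Restore order, budget, and what can wait still need a conversation.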

What to do next

Start small. Pick the three systems that cost you the most money, support time, or customer trust when they stop working. For many teams, that means the main app, the database, and the login or payment flow. Leave the rest for later. A short list you can test beats a long plan nobody uses.

Write one restore goal for each of those systems. Keep it plain: who restores it, where the backup lives, what "working again" means, and how long the business can wait. That's when recovery time targets start to help. They stop being guesses and turn into numbers your team can prove or challenge.

Then book one restore drill this month. Don't turn it into a discussion session. Have one person run the restore, start a timer, and write down every delay. Missing access, bad notes, slow backup downloads, expired credentials, and unclear ownership show up fast when the clock is running. The real timing matters more than the timing in your spreadsheet.

A simple checklist is enough:

  • Pick the top three systems.
  • Run one real restore drill for one of them.
  • Record the restore time and every blocker.
  • Update the target if the test shows your number is wrong.

Keep the plan current after that. Staff changes matter. Tool changes matter. Architecture changes matter. If one senior engineer leaves, if you switch backup vendors, or if you move from one server to several containers, your old restore goals may no longer fit the team you have now.

Many small companies make the same mistake: they buy more backup tools before they test the restore they already have. Test first. You may find that one missing runbook or one access problem adds 40 minutes, while a new product wouldn't help at all.

If the tradeoffs still feel messy, an outside review can help. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor, and his work includes infrastructure and technical operations for smaller teams. A second opinion like that is often cheaper than discovering, during a live outage, that your restore target was never realistic.

Frequently Asked Questions

What is a recovery time target?

A recovery time target says how long you can let one system stay down before the damage gets too high. Tie it to a real service, like checkout or payroll, not to the whole company.

Keep it honest. If the app takes 3 hours and 40 minutes to restore in a test, don't promise 1 hour.

How do I decide what to restore first?

Start with the systems that protect revenue, customer support, and people getting paid. For many small teams, that means checkout, logins, billing, support email, and the database behind them.

If everything feels urgent, force a simple order: now, same day, and later.

Should every system have the same recovery target?

No. Give fast targets only to systems that hurt the business right away when they fail. Let lower-stakes tools wait.

Your storefront and your internal wiki should not share the same deadline.

How do I set a realistic target?

Run one real restore and time every step with a stopwatch. Count login delays, backup download time, approvals, DNS changes, and app checks after the data comes back.

Then pick a number your actual team can hit at night or on a weekend, not just on a calm weekday.

How many systems should I plan for first?

Start with three. Pick the ones that cost you the most money, support time, or customer trust when they stop.

A short, tested plan beats a huge document that nobody rehearses.

What usually makes recovery slower than expected?

Teams often lose time on missing access, hidden dependencies, and restore steps nobody timed. A server may boot, but users still can't log in because email, DNS, payments, or single sign-on still fail.

One person holding all the knowledge also slows everything down.

How often should we test restores?

Test at least twice a year, and test again after big changes like a new backup tool, a cloud move, or a staff change. If you changed the setup, your old timing no longer tells the truth.

Even one drill can uncover bad notes, expired credentials, or a backup that takes far longer to load than you thought.

How do I make the plan fit a small team?

Write the plan around the people you really have on call. Name one owner and one backup owner for each system, and make sure both can log in and follow the steps.

If one engineer covers the app, database, and access, treat that as a risk and set slower targets until you reduce that risk.

When does it make sense to pay for faster recovery?

Compare the monthly cost of faster recovery with the cost of one hour down. Include lost sales, refunds, idle staff time, and customer cleanup.

Spend first where each lost hour hurts most. Sometimes a tested script or a ready server image saves more time than a full standby setup.

When should I ask for outside help?

Bring in outside help when your team argues about priorities, nobody trusts the restore times, or one person holds too much access and knowledge. An outside review can also help before a vendor move or a major architecture change.

If you want a second opinion, a Fractional CTO like Oleg Sotnikov can review the plan, test the restore process, and cut promises that your team can't keep.