Uptime promises a tiny team can actually keep
Uptime promises feel simple until a small team has to answer at night. Learn how to match support hours, alerts, and recovery habits to real capacity.

Why this hurts small teams
Customers hear "99.9% uptime" or "24/7 support" and picture someone watching alerts all night, posting updates within minutes, and fixing problems before most users notice. In a tiny company, the reality is often much smaller: a founder with phone notifications on, an engineer who is away for the weekend, and a support inbox that waits until morning.
That gap hurts trust fast. When a service fails at 2 a.m., customers do not care about your staffing plan. They care that the product is down and nobody is answering. Even a short outage feels worse when users get silence instead of a quick update.
There is a cost inside the team too. Nights stop feeling off. Weekends turn into "just in case" time. People bring laptops to dinner and check alerts before sleep. Do that for a few months and the quality of decisions drops. Tired people miss obvious fixes. They also make new mistakes while trying to fix the original problem.
This is not mainly a hosting problem. It is a staffing and process problem. Good infrastructure will not help if alerts go to one person who sleeps through them. Clean systems will not save you if nobody knows who leads an incident, who talks to customers, or when to roll back.
Tiny teams usually lose trust during handoffs, not in the data center. Someone promises "we are always available," but the team has no rotation, no written recovery steps, and no clear support hours. Then one outage exposes all of it.
A smaller promise is usually the smarter one. If you cover weekdays from 8 a.m. to 6 p.m., say that. If urgent issues get a reply within an hour during those hours and the next morning outside them, say that too. Customers can work with limits when the limits are clear. What frustrates them is a promise that sounds much bigger than the team behind it.
What an uptime promise really includes
Most uptime promises fail because they bundle several different commitments into one vague sentence. "We offer 99.9% uptime with fast support" sounds reassuring, but it tells customers almost nothing when something breaks late at night.
A useful promise has four parts:
- the uptime target
- the support hours
- the response time
- the recovery target
Those are not the same thing, and small teams often blur them together.
Response time means a person has seen the problem and replied. Recovery time means the product works again, or the team has a safe workaround. A reply in 15 minutes can still lead to hours of downtime if the database is damaged, a release needs to be rolled back, or the only person with access is asleep.
Customers also need to know who communicates during an incident. One person can work on the fix while another posts short updates. In a very small team, the same person may do both, but there still has to be a rule such as "post an update every 30 minutes until service is back."
When these pieces stay fuzzy, every outage turns into an argument. Customers expect one thing. The team means another. Stress goes up because nobody agreed on what "fast" or "available" actually meant.
The basic rule is simple: a small team can keep systems stable, but only when the promise matches the people, the hours, and the habits behind it.
Start with the team you have
Your promise should start with people, not software.
Before you write a single number, write down who does what during a production issue. Someone needs to notice the alert, someone needs to confirm it is real, someone needs to restart or roll back the service, someone needs to update customers, and someone needs to approve a risky fix. One person may cover several of those jobs. That is fine. It still sets a hard limit on what you can promise.
Then look at the hours when nobody watches alerts. Be honest about them. If your team sleeps from midnight to 7 a.m. and no one carries a phone for production incidents, you do not have 24/7 response. A dashboard is not coverage if no one is looking at it.
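One way to keep that honesty check concrete is to write the watched hours down and test a timestamp against them. The sketch below assumes weekday coverage from 8 a.m. to 6 p.m. in a single time zone; the names `COVERAGE_START`, `COVERAGE_END`, and `is_covered` are hypothetical and only illustrate the idea.

```python
from datetime import datetime, time

# Assumed coverage window: weekdays, 08:00 to 18:00 local time.
COVERAGE_START = time(8, 0)
COVERAGE_END = time(18, 0)


def is_covered(moment: datetime) -> bool:
    """Return True if someone is expected to be watching alerts at `moment`."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = COVERAGE_START <= moment.time() < COVERAGE_END
    return is_weekday and in_hours


if __name__ == "__main__":
    # A Monday morning falls inside the window; 2 a.m. on a Sunday does not.
    print(is_covered(datetime(2024, 3, 4, 10, 30)))
    print(is_covered(datetime(2024, 3, 3, 2, 0)))
```

If the function returns False for a time at which your public page promises a fast reply, the promise is wider than the coverage.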
This is where many startups fool themselves. They buy monitoring, connect a few alerts, and assume that means they can promise quick recovery at any hour. They cannot, unless someone can wake up, log in, and act.
You also need to check for single points of failure. Maybe only one founder understands the payment flow. Maybe only one engineer can access the database. Maybe only one person knows how to undo the last deployment. If recovery depends on one human being, then your recovery target depends on their availability too.
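A quick way to surface those dependencies is to list each recovery task next to the people who can actually perform it, then flag every task with only one name. The sketch below is illustrative: the task names, the people, and the function `single_points_of_failure` are all made up for the example.

```python
# Hypothetical map of recovery tasks to the people who can perform them.
RECOVERY_SKILLS = {
    "roll back a deployment": {"ana", "ben"},
    "access the production database": {"ana"},
    "debug the payment flow": {"founder"},
    "post customer updates": {"founder", "ben", "cy"},
}


def single_points_of_failure(skills: dict[str, set[str]]) -> dict[str, str]:
    """Return each task that exactly one person can do, mapped to that person."""
    return {
        task: next(iter(people))
        for task, people in skills.items()
        if len(people) == 1
    }


if __name__ == "__main__":
    for task, person in single_points_of_failure(RECOVERY_SKILLS).items():
        print(f"Only {person} can: {task}")
```

Every task this reports is a place where the recovery target quietly becomes "whenever that person is reachable."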
Ask one more blunt question: can the team restore service without pulling in extra people? If the answer is no, the promise is already too wide. A promise that depends on a contractor replying at night or a founder leaving dinner will fail sooner or later.
This is often the first thing an experienced Fractional CTO cleans up. The job is not to make the promise sound impressive. It is to map alert coverage, access, runbooks, and ownership so the team can recover with the people already on hand.
Set the promise in plain numbers
Many teams get this backward. They pick a pretty uptime percentage first, then realize nobody can answer an alert at 2 a.m. on Sunday.
Start with support hours. If your team watches production from 8 a.m. to 6 p.m. on weekdays, put that first. A lower promise that matches real habits builds more trust than a shiny number that falls apart in the first after-hours incident.
For a tiny team, a realistic service promise might read like this:
- Support hours are Monday to Friday, 8 a.m. to 6 p.m.
- Outages affecting core features get a response within 30 minutes during those hours.
- Standard issues get a response within 4 business hours.
- Severe incidents are usually resolved within 4 hours during support hours.
- Smaller issues are fixed or worked around by the next business day.
That tells customers much more than "99.95% uptime" on its own.
The wording matters too. Skip vague lines like "we aim to resolve issues promptly." That means nothing. Say what happens in plain language:
"We monitor the service during business hours. For outages that stop core features, we reply within 30 minutes and aim to restore service within 4 hours during those hours. For smaller issues, we reply within 4 business hours and usually fix them by the next business day."
That is easy to scan. It is also honest. It leaves out anything that depends on luck, one heroic engineer, or perfect vendor behavior.
A good rule for small teams is simple: trim every line that relies on heroics. Keep the promise you can meet on a normal week, with normal people, after a rough night.
Build the promise from real incidents
Do not build policy from optimism. Build it from your last 10 to 20 incidents.
For each one, write down four times: when the team noticed the problem, when someone first replied, when service came back, and when users got an update. That record shows what your team already does under pressure. It often reveals a pattern quickly. Maybe the fixes are fast but updates lag for an hour. Maybe people answer quickly, then spend too long figuring out the cause.
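The record can be as small as four timestamps per incident. The following sketch shows one possible shape for it; the `Incident` class, its field names, and the sample times are assumptions made for illustration, not data from a real log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """The four timestamps worth recording for each incident."""
    noticed: datetime       # when the team noticed the problem
    first_reply: datetime   # when someone first replied
    first_update: datetime  # when users got an update
    restored: datetime      # when service came back

    def time_to_reply(self) -> timedelta:
        return self.first_reply - self.noticed

    def time_to_update(self) -> timedelta:
        return self.first_update - self.noticed

    def time_to_restore(self) -> timedelta:
        return self.restored - self.noticed


# Made-up example: the fix was fast, but the customer update lagged behind it.
example = Incident(
    noticed=datetime(2024, 3, 4, 14, 0),
    first_reply=datetime(2024, 3, 4, 14, 10),
    first_update=datetime(2024, 3, 4, 15, 5),
    restored=datetime(2024, 3, 4, 14, 45),
)

if __name__ == "__main__":
    print("reply:  ", example.time_to_reply())
    print("update: ", example.time_to_update())
    print("restore:", example.time_to_restore())
```

In this invented case the service was back after 45 minutes, yet users waited 65 minutes for any word, which is exactly the kind of pattern the record is meant to expose.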
A practical process looks like this:
- Group incidents by severity. A full outage, failed payments, and a minor admin bug should not all trigger the same response.
- Find the times your team already hits on ordinary weeks. If you usually detect serious issues within 15 minutes during work hours, that is a real starting point.
- Turn those habits into customer-facing numbers. Promise reply times, update times, and recovery targets that match your staffing.
- Test the policy for a month. Track every miss, then adjust the promise or the routine.
Keep the first version boring. That is usually a good sign.
Severity levels help more than most teams expect. If nobody can sign in or payments fail for everyone, wake someone up. If one internal report is wrong, it can wait until business hours. That keeps the team calm and gives customers a more honest picture of what happens next.
If you want a rough rule, promise only what the team can hit most of the time without losing sleep. Then review misses after 30 days. Small team reliability improves when the promise fits real recovery habits, not when the team chases a number that looked good on a sales page.
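"Most of the time" can be checked against history before it goes on a public page. The sketch below computes how often past recoveries would have met a candidate target. The function name `hit_rate` and the sample durations are hypothetical, and what counts as a good enough rate is a judgment call for the team.

```python
def hit_rate(durations_minutes: list[float], target_minutes: float) -> float:
    """Return the fraction of past incidents resolved within the target."""
    if not durations_minutes:
        raise ValueError("need at least one past incident")
    hits = sum(1 for d in durations_minutes if d <= target_minutes)
    return hits / len(durations_minutes)


# Made-up recovery times, in minutes, for ten past severe incidents.
past_recoveries = [35, 50, 90, 120, 240, 45, 60, 200, 75, 30]

if __name__ == "__main__":
    for target in (60, 120, 240):
        rate = hit_rate(past_recoveries, target)
        print(f"{target} min target met in {rate:.0%} of incidents")
```

With these invented numbers, a 2-hour target would have been met in 8 of 10 incidents and a 4-hour target in all 10, which argues for publishing the 4-hour figure first.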
A realistic example for a team of four
Picture a small SaaS company with one founder and three engineers. The founder handles sales, customer calls, and product decisions. The engineers build features, fix bugs, and keep production running.
That team should not pretend it has round-the-clock support. It does not. A better setup is business-hours support for normal issues, with after-hours alerts only for true outages.
A realistic policy could be this:
- Support runs Monday to Friday, 9 a.m. to 6 p.m. in the company's main time zone.
- After-hours alerts are reserved for events that block many users from logging in, paying, or using a core feature.
- If login fails for many users, the team acknowledges the issue within 30 minutes and aims to restore access within 2 hours.
- If pages are slow across the app, the team investigates during support hours and aims for a fix or workaround within 1 business day.
- If all payments fail, the team treats it like an outage. If one account has a billing problem, the team handles it during support hours.
That split matters. A strange invoice reported by one customer at 11:30 p.m. does not need a midnight page. A systemwide login failure does.
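That paging rule is simple enough to write down as code, which also forces the team to agree on what "many users" means. The sketch below is one possible encoding: the `Alert` class, the `CORE_FEATURES` set, and the threshold of 10 users are assumptions chosen for the example, and each team would set its own.

```python
from dataclasses import dataclass

# Assumed definitions; adjust both to your own product.
CORE_FEATURES = {"login", "payments"}
MANY_USERS_THRESHOLD = 10


@dataclass
class Alert:
    feature: str         # which part of the product is affected
    affected_users: int  # how many users are blocked


def should_page_after_hours(alert: Alert) -> bool:
    """Page someone only when a core feature is blocked for many users."""
    is_core = alert.feature in CORE_FEATURES
    is_widespread = alert.affected_users >= MANY_USERS_THRESHOLD
    return is_core and is_widespread


if __name__ == "__main__":
    print(should_page_after_hours(Alert("login", 500)))    # system-wide login failure
    print(should_page_after_hours(Alert("payments", 1)))   # one strange invoice
```

Anything that does not meet both conditions goes to the normal support queue and waits for business hours.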
The customer-facing wording can stay simple: "We respond to general support requests during business hours. Outside those hours, we respond to service outages that block access for multiple users or stop core functions such as login and payments."
That does not sound evasive. It sounds honest. Customers know when they will hear back, and the team knows what deserves a phone alert.
Mistakes that create long nights
The biggest mistake is copying support language from a much larger company. Big companies can promise round-the-clock coverage because they have shifts, backups, and people whose only job is incident response. A startup with four people usually has none of that.
Another common mistake is chasing numbers too early. A promise like 99.99% uptime allows only about four minutes of downtime in a 30-day month. If customers are still the first ones to notice outages, or one person carries the phone every night, that number is fantasy.
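The arithmetic behind an uptime percentage is short enough to check before it goes on a sales page. The sketch below converts a percentage into a downtime budget; the function name `downtime_budget_minutes` and the 30-day month are assumptions made for the example.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over `days` at a given uptime."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes per 30 days")
```

Over 30 days, 99.9% leaves roughly 43 minutes and 99.99% roughly 4 minutes. If it takes the team longer than that just to wake up and log in, the number cannot be kept.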
Teams also mix up first response time and recovery time. These are different promises. "We reply within 30 minutes" sounds strong, but customers mostly care about when the product works again. A fast reply is nice. A working checkout is better.
Updates are another weak spot. If nobody owns communication, users get silence at the exact moment they want clarity. One person digs through logs, another assumes someone else posted a notice, and nothing goes out. Give updates a clear owner.
The last trap is building recovery around one expert. It works until that person is asleep, on a flight, or burned out. Then a short issue turns into a three-hour outage because nobody else knows the restart order or the rollback steps.
If any of these sound familiar, slow down before you publish a bigger promise.
Before you publish
Run a few checks against real incidents, not ideal plans.
Make sure a serious outage triggers an alert within minutes. If customers tell you first, your response clock is already broken.
Make sure at least two people can restore service without waiting for one expert to wake up. Shared access and written steps matter more than polished wording.
Write support hours in plain language. Customers should be able to tell when the team starts reading messages, when that coverage ends, and what happens outside those hours.
Assign updates to one person during an incident. Silence makes even a short outage feel longer.
Then compare the public promise with your last few incidents. If the page says one thing and the incident log says another, change one of them.
That last check is humbling, but useful. Look at the last three incidents and write down when the team noticed the problem, when service came back, and when customers got the first update. Those numbers tell the truth faster than any marketing copy.
What to do next
Put your current promise on the calendar and review it every quarter. Review it again after any outage that turned into a long, messy day. Those moments show the gaps quickly. If people missed handoffs, searched for old notes, or guessed their way through a restore, the promise is ahead of the team.
Do not tighten the promise because it sounds better. Tighten it only after the team meets the current one again and again, both in normal work and during practice drills.
Most teams improve in the same order. First, fix alerts so the right person hears about the problem quickly. Next, write short runbooks for the failures you see most often. Then test backups and restore steps, not just backup jobs. After that, close the obvious gaps for evenings, weekends, and holidays.
Better wording will not make an uptime promise easier to keep. Better habits will. A one-page runbook can save 20 minutes during a failed deployment. A tested restore can save hours when a database breaks.
If the gap still feels messy, an outside review can help. Oleg Sotnikov, whose work is described on oleg.is, advises startups and small companies on product architecture, infrastructure, and AI-first engineering workflows. That kind of review is often enough to turn a fuzzy support promise into one a small team can actually keep.
Customers usually trust the smaller promise that you keep. They stop trusting the bigger one the first time you miss it.
Frequently Asked Questions
What should a tiny team promise instead of 24/7 support?
Start with your real support hours and name which outages wake someone up outside those hours. If only login, payment, or full access failures trigger an urgent response, say that plainly. Customers usually accept limits when you make them clear.
Is an uptime percentage enough on its own?
No. The uptime number alone does not tell customers who will answer, when they will hear back, or how long recovery may take. Add support hours, first response time, update timing, and a recovery target.
What is the difference between response time and recovery time?
Response time ends when someone on your team has seen the issue and replied. Recovery time ends when users can use the product again or you give them a safe workaround. A fast reply helps, but users care most about getting the service back.
How can we tell if our promise is realistic?
Look at your last 10 to 20 incidents and write down when the team noticed the problem, when someone replied, when users could work again, and when you posted an update. Those numbers show what your team actually does under pressure.
Should we wake someone up for every alert?
Not every alert deserves a midnight call. Reserve after-hours pages for outages that block many users or stop core features, and let smaller bugs wait for business hours. That keeps the team rested and the urgent channel useful.
How often should we update customers during an incident?
Set one simple rule and follow it every time. For many small teams, an update every 30 minutes works well during a serious outage. Even a short note builds more trust than silence.
Why do outages feel so painful for small teams?
Most long nights start with people and process gaps, not servers alone. One person holds access, nobody owns customer updates, or the team has no rollback steps written down. Fix those gaps before you promise tighter targets.
Do we really need runbooks if the system is small?
Even a simple service needs written steps for the failures you see most often. A short runbook for rollback, restart order, and restore steps saves time when people feel tired or stressed. It also lets a second person jump in when the usual expert is away.
When should we make our support promise stricter?
Tighten it only after your team meets the current promise again and again in real incidents and practice drills. If people still miss handoffs, search old notes, or depend on one founder at night, keep the smaller promise for now.
Can a Fractional CTO help us set a better uptime promise?
Yes. If your public promise does not match real incident handling, an outside review can clean that up fast. A Fractional CTO like Oleg Sotnikov can check alert coverage, access, runbooks, and ownership so the team can keep the promise it sells.