Aug 27, 2025·8 min read

Uptime SLA promises: set targets your team can support

Uptime SLA promises should match your monitoring, support hours, and incident process. Set terms your team can measure, report, and keep.

What goes wrong when uptime gets promised too early

Trouble usually starts long before anyone means to cause it. Sales wants a number the buyer will accept. Legal wants wording that looks familiar. Procurement asks for a higher target because it feels safer. Then operations reads the draft and realizes nobody checked whether the team can actually watch, measure, and support that promise.

That is how an uptime commitment turns into daily stress. A line like "99.95% availability" sounds minor in a meeting. On paper, it looks close to 99.9%. In practice, it cuts the monthly downtime allowance roughly in half, from about 43 minutes to about 22, which puts more pressure on after-hours response and leaves less space for routine maintenance.
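To make the gap between targets concrete, here is a minimal sketch of the downtime budget behind each number. It assumes a 30-day month and counts every minute of the period; the function name `allowed_downtime_minutes` is illustrative, not part of any monitoring tool.

```python
def allowed_downtime_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of downtime a target permits over a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


for target in (99.5, 99.9, 99.95):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.1f} minutes per 30 days")
# 99.5% -> 216.0 minutes per 30 days
# 99.9% -> 43.2 minutes per 30 days
# 99.95% -> 21.6 minutes per 30 days
```

A single slow restart or a failed deployment can use up the whole 99.95% budget, which is why the extra "0.05" matters more than it looks.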

For a two-person team, that gap shows up fast. If alerts fire at 2 a.m., the contract does not care that one person is on vacation and the other is already covering deployments, support, and vendor issues. A target that looked harmless in procurement review can quietly become a standing night and weekend job.

The reporting burden gets missed too. One vague sentence about availability, exclusions, or incident response can force someone to build manual reports every month. They end up pulling logs from different tools, arguing about what counts as downtime, and explaining gaps to the customer line by line. None of that makes the product better. It just burns time.

The business cost is easy to explain. Overpromising raises support costs, pulls engineers away from delivery, and makes refund requests or contract disputes more likely. It can also push a small company into buying monitoring, paging, and backup coverage earlier than planned.

The safer approach is plain and a little boring. Promise what your team can measure today, support with current staffing, and report without manual patchwork. You can raise the target later. Walking back a signed promise is much harder.

What uptime actually means in your contract

Most fights start with one fuzzy sentence. A contract says "99.9% uptime," but nobody agrees on what was up, what was down, or who had to respond.

Start with one distinction that matters a lot. Availability is not the same as response time or restoration time. Availability answers a simple question: could users use the service? Response time is how fast your team starts working on an incident. Restoration time is how long it takes to bring the service back to normal. If you mix those together, you create a promise your team cannot track cleanly.

Your contract should also name the service, not your whole company. If you run a customer portal, an API, and an internal admin tool, say which one the SLA covers. A company-wide promise sounds tidy in a draft, but it pulls in systems customers may never touch and that your team may not monitor the same way.

Then define downtime. A full outage is obvious. Partial failure is where arguments start. If logins fail for 40% of users, or payments stop while the homepage still loads, does that count? Write it down before legal and procurement broaden the wording.

A short definition block usually saves weeks of back and forth:

Covered service: the named product, API, or customer-facing feature set
Availability: whether that covered service works for users during the measurement period
Incident response: how fast your team acknowledges the issue and starts work
Restoration: when the service returns to normal use
Exclusions: events that do not count against the uptime number

Maintenance windows need the same clarity. If you patch systems every Sunday at 2 a.m., say whether that time is excluded from the calculation. If maintenance can affect customers, require notice and set a limit. If you leave this vague, routine work can look like a breach on paper.

A clean SLA definition is boring by design. That is a good thing. Boring language is much easier to run than a promise your team has to argue about every month.

Map your real support coverage

Many uptime commitments fail for a simple reason: the contract assumes someone is always ready, but the team is not staffed that way. Before you agree to any target, map who actually watches the service and who can act when something breaks.

Start with business hours. Put real names next to alert handling, not team names. If alerts go to a shared inbox or a noisy chat channel, treat that as weak coverage. One person should own first response during work hours, and someone else should back them up when they are in meetings, on leave, or buried in another issue.

After-hours support is where the gap usually becomes obvious. Many small teams can support a product well from 9 to 6, but they do not have a real evening, weekend, or holiday rotation. That is fine, as long as the contract says so.

Ask four direct questions. During work hours, who gets the first alert? After hours, who checks and responds? On weekends and holidays, who can log in and fix the issue? If the first person misses the alert, who takes over?

A small startup with two engineers may think it offers 24/7 support because both founders keep their phones nearby. That usually falls apart after a few weeks. People sleep, travel, get sick, or miss notifications. Contracts should match the team you have now, not the team you hope to hire later.

Then check the speed at each step. How long does it take the team to see an alert, confirm that it is real, and escalate it to the person who can fix it? Those are separate actions. If alerts arrive instantly but nobody reviews them for 45 minutes after midnight, your real support window is slower than the draft suggests.

Write this down in a simple table for weekdays, nights, weekends, and holidays. Once you compare promised coverage with actual staffing, weak spots become obvious. You may find that you can support a strong business-hours SLA today and add after-hours coverage later. That is much safer than signing up for round-the-clock support your team cannot sustain.
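The coverage table can be as small as a few rows. The sketch below is one possible shape, not a required format: the `CoverageRow` type, the `coverage_gaps` helper, and the names "Ana" and "Ben" are all hypothetical. It flags any period where nobody owns first response or where there is no backup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoverageRow:
    period: str              # e.g. "weekday", "night", "weekend", "holiday"
    primary: Optional[str]   # named person who owns first response
    backup: Optional[str]    # named person who takes over if the primary misses it


def coverage_gaps(table: List[CoverageRow]) -> List[str]:
    """Return a readable note for every period with weak coverage."""
    gaps = []
    for row in table:
        if row.primary is None:
            gaps.append(f"{row.period}: nobody owns first response")
        elif row.backup is None:
            gaps.append(f"{row.period}: no backup if {row.primary} misses the alert")
    return gaps


# Hypothetical two-person team.
table = [
    CoverageRow("weekday", "Ana", "Ben"),
    CoverageRow("night", "Ana", None),
    CoverageRow("weekend", None, None),
    CoverageRow("holiday", None, None),
]

for gap in coverage_gaps(table):
    print(gap)
```

Any period that shows up in the output is a period the contract should either exclude or describe honestly.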

Check what you can measure today

Your SLA should rely on systems your team already uses every week. If the number comes from screenshots, memory, or a spreadsheet someone updates by hand, the promise will fail as soon as a customer asks for proof.

List the exact tools that show availability today. For some teams that is an external uptime monitor. For others it is cloud health checks, a load balancer report, or a Prometheus and Grafana dashboard. If you also use Sentry, logs, and incident tickets, note that too. You need one clear source for the uptime number and one simple rule for how you calculate it.
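As an illustration of "one simple rule," the sketch below computes availability as the share of successful external checks in the period. This is only one possible rule, and it assumes your monitor can export one pass/fail result per check; the function name and sample data are made up for the example.

```python
from typing import Sequence


def availability_percent(check_results: Sequence[bool]) -> float:
    """Share of successful checks in the period, as a percentage."""
    if not check_results:
        raise ValueError("no checks recorded for this period")
    return 100 * sum(check_results) / len(check_results)


# Hypothetical month: one check per minute for 30 days, 30 failed checks.
checks = [True] * (43_200 - 30) + [False] * 30
print(f"{availability_percent(checks):.3f}%")
```

Whatever rule you pick, the point is that two people running it on the same export get the same number.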

Then trace where the evidence lives. Logs may sit in one place, alerts in another, and incident notes in chat. That is common, but it becomes a problem when procurement asks for monthly reports or service credits. You should know where raw logs live, where alert history lives, and where someone records outages, planned maintenance, and status updates.

A few checks expose most weak spots. Which tool gives the official uptime number? Where can the team see alert history for the full month? Where do incident timelines and maintenance records live? Can someone export the data into a monthly report without manual cleanup?

If any answer is vague, the metric is not ready for contract language.

Monthly export matters more than teams expect. A metric is only useful if someone can pull it at the end of the month, explain how it was calculated, and use the same method next month. If one person has to stitch data together from dashboards, chat, and ticket notes, the process will break under pressure.

Drop any metric you cannot measure the same way every time. Do not promise degraded performance thresholds if you do not track them. Do not promise response times by severity if severity is not recorded in a consistent system. Tight wording sounds good in a draft, but unmeasurable promises turn into weekly arguments.

A smaller promise with clean evidence is better than an ambitious target nobody can defend.

Set the target step by step

Start with facts, not ambition. If your service has stayed around 99.6% to 99.8% over the last few months, do not promise 99.95% just because it sounds better in a draft. Most bad SLA commitments come from describing the service you want someday, not the one your team runs today.

Pick one clear boundary first. Choose the part customers actually rely on, such as the public app or API, and leave internal tools, admin panels, and third-party add-ons out unless you can monitor them the same way. One boundary keeps the number honest and makes disputes less likely.

Then choose one measurement method and stick to it. External checks every minute are easy to explain. Mixing internal logs, support tickets, and manual judgment usually creates a fight at the end of the month.

A simple process works well. Review your real uptime for the last three to six months. Choose one service boundary and describe it in plain language. Use one source of truth for measurement. Match the target to your staffing, failover setup, and support hours. Then make sure someone can report the number every month without inventing a process on the spot.

Staffing matters more than many teams admit. A small team with weekday support should be careful with very high availability targets. If nobody watches alerts at 2 a.m., the contract should not quietly assume a round-the-clock response.

Add maintenance rules before procurement turns them into vague legal text. Say when planned maintenance can happen, how much notice you give, and whether that time counts against uptime. Name exclusions clearly too, such as customer-side internet issues or outages inside a third-party provider you do not control.

One quick test helps. Ask your team to produce last month's uptime number using the method in the draft. If engineering, support, and finance each get a different result, the target is not ready.

A realistic first target often beats an impressive one. If your service currently runs at 99.7%, you might commit to 99.5% for the public API, exclude planned maintenance, and raise the number later after improving redundancy and support coverage.
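Here is a rough sketch of how exclusions change the reported number. The cause labels, the `Outage` type, and the sample incidents are assumptions for illustration. The sketch also assumes excluded time stays in the denominator; your contract may define that differently, so treat the formula as an example rather than a standard.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for causes the contract excludes from downtime.
EXCLUDED_CAUSES = {"planned_maintenance", "customer_side", "third_party"}


@dataclass
class Outage:
    minutes: float
    cause: str


def monthly_availability(outages: List[Outage], days_in_month: int = 30) -> float:
    """Availability in percent, counting only outages with non-excluded causes."""
    total_minutes = days_in_month * 24 * 60
    counted = sum(o.minutes for o in outages if o.cause not in EXCLUDED_CAUSES)
    return 100 * (total_minutes - counted) / total_minutes


# Made-up month: one internal failure, one maintenance window, one provider outage.
outages = [
    Outage(95, "internal"),
    Outage(120, "planned_maintenance"),
    Outage(40, "third_party"),
]

availability = monthly_availability(outages)
print(f"{availability:.2f}% (target 99.5%): {'met' if availability >= 99.5 else 'missed'}")
```

Without the exclusions, the same month would count 255 minutes of downtime and miss a 99.5% target, which is why the exclusion wording belongs in the contract and not in someone's head.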

Write the hard edges clearly

A vague SLA looks polite in a draft and painful in real life. If support only runs on weekdays, say that plainly. Write the hours in a fixed time zone such as UTC, and add the customer's local time in brackets if needed.

That small detail avoids a common fight. A customer reads "business hours" as their business hours. Your team reads it as yours. The contract should remove that gap before anyone signs.

Planned maintenance needs the same level of detail. State when you can do it, how much notice you give, and whether that time counts against uptime. If Sunday 02:00 to 04:00 UTC is your maintenance window, write exactly that.
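A window written that precisely is also easy to check in code. The sketch below assumes the Sunday 02:00 to 04:00 UTC window from the example, treats the end time as exclusive, and uses a hypothetical helper name.

```python
from datetime import datetime, time, timedelta, timezone

MAINTENANCE_WEEKDAY = 6         # datetime.weekday(): Monday is 0, Sunday is 6
MAINTENANCE_START = time(2, 0)  # 02:00 UTC
MAINTENANCE_END = time(4, 0)    # 04:00 UTC, exclusive


def in_maintenance_window(moment: datetime) -> bool:
    """True if `moment` falls inside the weekly maintenance window."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must carry a time zone")
    utc = moment.astimezone(timezone.utc)
    return (
        utc.weekday() == MAINTENANCE_WEEKDAY
        and MAINTENANCE_START <= utc.time() < MAINTENANCE_END
    )


# 04:30 in a UTC+2 zone on Sunday, 31 August 2025 is 02:30 UTC.
customer_zone = timezone(timedelta(hours=2))
print(in_maintenance_window(datetime(2025, 8, 31, 4, 30, tzinfo=customer_zone)))
```

Converting to UTC first is the design choice that matters here: it keeps the answer the same no matter which time zone the customer or the engineer is in.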

You also need clear boundaries around incidents you do not control. If the customer breaks their own setup, changes DNS, disables a required integration, or ignores a security requirement, that is not your outage. The same goes for failures at outside providers when your service itself still works as designed.

Most teams can cover this with four short clauses: support hours and response windows, planned maintenance notice and timing, excluded causes of downtime, and the service credit method with a clear cap.

Service credits should fit the size of the deal. Small contracts should not carry open-ended penalties. A simple credit table tied to the monthly fee is easier to manage and easier to defend in procurement review.

For example, if a customer pays $3,000 per month, a credit capped at part of that month's fee is reasonable. Promising broad refunds, extra work, and penalty payments on top of credits is how an uptime promise turns into a support burden your team cannot carry.
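A credit table can be small enough to fit in a few lines. In the sketch below, the tier thresholds, the credit percentages, and the 25% cap are invented for illustration; only the $3,000 monthly fee comes from the example above.

```python
# Hypothetical tiers: (availability below this %, credit as % of the monthly fee).
# Listed from the lowest threshold up so the largest credit is matched first.
CREDIT_TIERS = [(99.0, 25), (99.5, 10)]
CREDIT_CAP_PERCENT = 25  # hypothetical cap on total credit per month


def service_credit(monthly_fee: float, availability: float) -> float:
    """Credit owed for one month, capped at a share of that month's fee."""
    for threshold, credit_percent in CREDIT_TIERS:
        if availability < threshold:
            capped = min(credit_percent, CREDIT_CAP_PERCENT)
            return monthly_fee * capped / 100
    return 0.0


for availability in (99.7, 99.3, 98.5):
    print(availability, service_credit(3000, availability))
# 99.7 0.0
# 99.3 300.0
# 98.5 750.0
```

Because the credit is a fixed share of the monthly fee, finance can calculate it from one number and the exposure never exceeds the cap.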

Simple words help. Write "we respond within 4 business hours" instead of "commercially reasonable efforts shall be undertaken." Write "downtime does not include customer misconfiguration" instead of a paragraph that invites debate.

If one engineer covers after-hours alerts, do not let procurement turn that into wording that sounds like a fully staffed 24/7 operations desk. The contract should describe the team you have, not the team someone imagines.

A simple example before procurement rewrites the draft

A startup sells one hosted app and puts "99.95% uptime" into an early draft. On paper, it sounds reasonable. In practice, it allows about 22 minutes of downtime in a month, which is tight for a small team with one on-call rotation and no full-time support desk.

Then procurement sends back edits. They want service credits to start sooner, a wider definition of outage, and faster response times. Suddenly, a promise that looked simple becomes a daily operations risk.

The first problem is scope. Procurement may try to count almost anything as downtime: slow pages, broken third-party integrations, failed customer networks, planned maintenance, and admin tools that end users never touch. If the team accepts that language, they are no longer making a promise for one app. They are taking blame for systems they do not fully control.

So the team narrows the draft. They define availability as the production customer-facing app only, measured by their own monitoring from the public internet. They exclude scheduled maintenance, beta features, customer-side internet issues, and outages caused by third-party providers outside their system boundary. They also separate incident severity from support hours.

That leads to a version they can actually support:

99.95% uptime applies only to the production hosted app
Uptime is measured monthly by the team's monitoring checks
Planned maintenance does not count if the team gives notice
Only complete or major loss of service counts as an outage
24/7 on-call covers severe incidents, while normal support follows business hours

They adjust credits too. Instead of triggering credits after every short blip, they start them only when monthly availability drops below the target by a clear amount. That gives the team room to handle brief incidents without turning every event into a contract dispute.

The final draft is less flashy, but much safer. Sales can still offer a strong target. Operations can monitor it, support it, and explain it. Legal gets cleaner language. That is a far better place to be before procurement pushes for terms your team cannot meet.

Mistakes that turn a promise into daily pain

A bad SLA often starts with one shortcut. Someone copies language from a much bigger vendor, changes the company name, and assumes the rest will sort itself out. It will not. A 200-person support team can promise things that a five-person team cannot keep without burning out.

This gets worse when sales language, legal edits, and hopeful operations assumptions all end up in the same draft. The contract sounds clean, but the team carrying it every day knows it is built on guesswork.

Most of the pain comes from a few repeat mistakes. Teams copy enterprise SLA text without checking who will answer alerts, own incidents, and approve exceptions. They promise 24/7 support because the customer expects it, even though nobody is actually on call around the clock. They count third-party services inside the uptime target even when they do not control them. They mix uptime, response time, and resolution time into one fuzzy promise. And they forget that finance needs a clear way to calculate service credits.

One common mess looks like this: the contract promises 99.95% uptime, 24/7 response, and service credits for any outage over 15 minutes. The company has one daytime engineer, basic monitoring, and several outside dependencies. That is not an ambitious target. It is a trap.

Before procurement edits the draft, compare every promise to three plain questions: who will monitor it, who will support it, and how will you measure it on a bad day? If nobody can answer all three, the wording needs to change.

Quick checks before you send the contract

Send the draft only after you test the promise against real operations. This is where an uptime commitment stops being a sales line and starts becoming an on-call obligation.

Start with detection. If your monitoring needs 20 minutes to catch an outage, a 5-minute incident response clause makes no sense. Your team should know how fast it can spot a real issue today, with the tools already in place, not with tools you plan to add later.

Then check alert acknowledgment. A contract can say "15 minutes," but someone still has to see the page, confirm the problem, and start the response. If nights, weekends, or holidays depend on one tired engineer checking a phone, trim the promise or limit the support window.
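One way to test a response promise is to add detection time and acknowledgment time for each coverage period and compare the sum with the contract number. The sketch below assumes the response clock starts when the outage starts, which your contract may define differently. The timings and the helper name are hypothetical.

```python
from typing import Dict, List, Tuple


def slow_periods(
    timings: Dict[str, Tuple[float, float]], promised_minutes: float
) -> List[str]:
    """Periods where detection plus acknowledgment exceeds the promise.

    `timings` maps a period name to (detect_minutes, acknowledge_minutes).
    """
    return [
        period
        for period, (detect, acknowledge) in timings.items()
        if detect + acknowledge > promised_minutes
    ]


# Made-up measurements for a small team.
timings = {"weekday": (2, 5), "night": (2, 45), "weekend": (2, 30)}
print(slow_periods(timings, promised_minutes=15))
# ['night', 'weekend']
```

Any period in the result needs either faster coverage or a contract that limits the promise to the hours the team can actually meet.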

A small mismatch between finance and ops causes trouble fast. Credits look simple until both teams use different formulas. One team may count a partial outage as downtime while the other excludes it. Agree on the exact math before the customer asks for a credit note.
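The sketch below shows how three reasonable-sounding formulas give three different downtime totals for the same month. The incident data and function names are invented for illustration; the point is only that the contract has to pick one.

```python
from typing import List, Tuple

# (duration in minutes, fraction of users affected)
Incident = Tuple[float, float]


def downtime_full_outages_only(incidents: List[Incident]) -> float:
    """Count only incidents that affected every user."""
    return sum(minutes for minutes, fraction in incidents if fraction >= 1.0)


def downtime_any_impact(incidents: List[Incident]) -> float:
    """Count the full duration of any incident that affected anyone."""
    return sum(minutes for minutes, fraction in incidents if fraction > 0)


def downtime_weighted(incidents: List[Incident]) -> float:
    """Weight each incident's duration by the share of users affected."""
    return sum(minutes * fraction for minutes, fraction in incidents)


# One 30-minute full outage, one 60-minute incident affecting 40% of users.
incidents = [(30, 1.0), (60, 0.4)]
print(downtime_full_outages_only(incidents))  # 30 minutes
print(downtime_any_impact(incidents))         # 90 minutes
print(downtime_weighted(incidents))           # about 54 minutes
```

If ops uses the first formula and finance uses the second, the same month produces a threefold difference in counted downtime and possibly a different credit.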

Your monthly report matters too. If you cannot produce a clean uptime report from your current monitoring stack, the contract is ahead of your process. You want one source of truth, a clear reporting period, and a simple way to explain maintenance windows, excluded events, and incidents that affected only part of the service.

A short internal check catches most problems:

How many minutes does it take to detect a real outage?
Who responds during evenings, weekends, and holidays?
Can finance and ops calculate one sample credit the same way?
Can you pull a sample monthly uptime report from current data?
Are you promising anything your team cannot support every month?

This check is dull, but it saves a lot of pain. If you need another set of eyes, a fractional CTO can pressure-test the wording against your monitoring, staffing, and reporting before procurement turns a loose draft into a hard obligation.

Next steps before you sign

A contract gets expensive when the promise looks clear on paper but falls apart in day-to-day work. Before anyone signs, put the latest draft in front of the people who will carry it: ops, support, finance, and legal. Each group sees a different risk, and the weak spots usually show up fast.

Ops should confirm what the team can monitor, who gets alerts, and how incidents move from detection to response. Support should check coverage hours, escalation paths, and who handles nights, weekends, and holidays. Finance should review service credits, penalty exposure, and the real cost of extra coverage. Legal should make sure the wording matches what the team can actually do.

Be strict with numbers. If an availability target depends on a tool you do not have, remove it or rewrite it. A promise based on future hires, planned monitoring, or a dashboard you hope to build later is still a guess.

The same goes for reporting. If the contract says you will measure downtime, exclusions, response times, or third-party outages in a specific way, make sure you can produce that report today. If you cannot, the contract is ahead of your operation.

One final pass helps. Can you measure uptime with your current tools? Can your team support the stated hours every week? Do service credits fit the size of the deal? Does the language match your real escalation process? Would the promise still feel reasonable during your busiest month?

If the team already feels stretched, pause and get another opinion before you sign. On oleg.is, Oleg Sotnikov reviews SLA terms, monitoring setup, and support coverage with a practical operations lens, which is exactly the kind of review that can catch trouble early.

The best draft is usually the plain one. It matches the staff, tools, and support hours you have right now. If that means offering a smaller promise today, that is still better than signing a bigger one your team has to defend every month.