Apr 13, 2026·8 min read

On-call without burnout for a five-person engineering team

On-call without burnout starts with fewer alerts, a simple rotation, and clear runbooks. Use this plan to protect sleep, weekends, and team morale.

Why small teams burn out on call

In a five-person team, every page hits hard. One person loses sleep. Another has to cover planned work the next day. The rest of the team feels the gap almost right away.

Bigger teams can spread that pain around. Small teams cannot.

The cost is not just the time spent fixing the issue. Broken sleep shows up later as slow thinking, short tempers, and sloppy calls. A 15-minute alert at 2 a.m. can ruin half the next day. If that happens twice in a week, people stop trusting the schedule and start dreading their turn.

Weekends wear people down in a different way. Even when nothing breaks, the person on call often stays in waiting mode. They stay close to a laptop, skip longer plans, and keep one ear open for a notification. That constant background tension adds up. It also makes people resent systems that page for noise instead of real customer problems.

Small teams also have less room for fuzzy ownership. If an alert fires and nobody knows whether the app, database, queue, or cloud setup caused it, recovery slows down fast. People wake the wrong teammate, repeat the same checks, and guess under pressure. Burnout grows from that pattern. One bad night is rough. The same messy night over and over is what really drains people.

The fix is boring, and that is good. Send fewer pages. Make it obvious who owns each problem. Give the on-call person a short path from alert to first action. When alerts are quieter, ownership is clear, and recovery is faster, weekends start to feel like weekends again.

What good coverage looks like

A five-person team should cover failures that hurt users now, not every message that arrives after hours. If everything is urgent, nothing is.

Start by naming the few things that should wake someone up. For most teams, that list is short: the product is down, sign-in or payment is broken, data may be lost, or there is a security issue. A slow report, a typo, or a one-off customer request can wait until business hours.

That split matters more than many teams expect. On-call is for incidents. Routine support belongs in a normal queue with normal working hours and a clear owner.

Three priorities are usually enough. P1 means the service is down, customers cannot use a main path, data is at risk, or there is a security problem. Someone responds within 15 minutes, any time. P2 means a main feature is degraded but there is a workaround. That gets a response within an hour during the day, and by the next morning at night or on weekends. P3 covers low-impact bugs, internal requests, and cleanup work. Those stay in business hours.

Keep the rules concrete. Some teams add a fourth priority, but most small teams only need three, each with a written response time and a clear rule for what happens after hours.
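
One way to keep the priorities concrete is to write them down as data. The sketch below encodes the three priorities and response targets described above. The names `Priority`, `RESPONSE_TARGET`, and `should_page` are made up for illustration, and treating a daytime P2 as a page is an assumption, not a rule from any paging tool.

```python
from datetime import timedelta
from enum import Enum


class Priority(Enum):
    P1 = "P1"  # service down, main path unusable, data at risk, or security problem
    P2 = "P2"  # main feature degraded, but a workaround exists
    P3 = "P3"  # low-impact bugs, internal requests, cleanup


# Response targets from the text. P3 has no target here because it
# stays in the normal business-hours queue.
RESPONSE_TARGET = {
    Priority.P1: timedelta(minutes=15),
    Priority.P2: timedelta(hours=1),
}


def should_page(priority: Priority, business_hours: bool) -> bool:
    """Return True if this priority should reach a phone right now."""
    if priority is Priority.P1:
        return True  # any time, day or night
    if priority is Priority.P2:
        # Assumption: P2 notifies during the day and waits until
        # the next morning at night or on weekends.
        return business_hours
    return False  # P3 never pages


print(should_page(Priority.P2, business_hours=False))  # False
```

If the team cannot agree on what goes in a table this small, the alert rules are not ready yet.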

It also helps to agree on the parts of the product that matter most. For a lot of teams, that list is short: sign-in, the main customer workflow, billing, and data integrity. If one of those breaks, page someone. If an internal dashboard looks odd, nobody should lose sleep over it.

A simple example makes the difference clear. If checkout fails for every customer on Saturday, page right away. If one customer needs help exporting a file, log it for Monday unless a contract says otherwise.

That is how you protect the business without treating every after-hours issue like a fire.

Build a rotation people can live with

A five-person team cannot act like it has a full support department. If everyone feels half on call all the time, nobody really rests.

A simple setup works best: one primary person for the week, one backup, and one person fully off the rota. The other two work normal hours and can help during the day, but they do not carry after-hours responsibility. That protected week off the rota matters more than it seems. It gives someone a full break from pager stress and from the habit of checking just in case.

Weekly rotation is usually the sweet spot. Daily handoffs sound fair on paper, but they create more confusion, more missed context, and more "I thought you had it" moments. A full week gives the primary enough time to learn the current state of the system, spot repeat issues, and settle into a routine.

A five-week cycle can stay simple. In week one, Alex is primary, Sam is backup, and Jordan is off rota. In week two, Sam becomes primary, Priya is backup, and Mei is off. In week three, Priya is primary, Jordan is backup, and Alex is off. In week four, Jordan is primary, Mei is backup, and Sam is off. In week five, Mei is primary, Alex is backup, and Priya is off. Over the cycle, everyone is primary once, backup once, and fully off once.

That pattern is easy to remember, and people can plan weekends, travel, and deep work without guessing.
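
A cycle like this can be generated instead of maintained by hand. The sketch below shows one possible scheme: the primary moves down the list each week, the backup is next week's primary, and the off-rota slot goes to the person three places after the primary. The function name and the offset are assumptions chosen so that everyone is primary, backup, and off exactly once per cycle.

```python
def build_rotation(team: list[str], off_offset: int = 3) -> list[dict[str, str]]:
    """Return one entry per week with primary, backup, and off-rota names."""
    n = len(team)
    weeks = []
    for week in range(n):
        weeks.append({
            "primary": team[week % n],
            "backup": team[(week + 1) % n],       # next week's primary
            "off": team[(week + off_offset) % n],  # fully off the rota
        })
    return weeks


rotation = build_rotation(["Alex", "Sam", "Priya", "Jordan", "Mei"])
for number, week in enumerate(rotation, start=1):
    print(f"Week {number}: primary={week['primary']}, "
          f"backup={week['backup']}, off={week['off']}")
```

With this offset, each person gets their off week two weeks after their primary week, which leaves a normal week in between to catch up on planned work.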

Write swap rules before anyone needs a favor. If you wait until someone gets sick or has a family event, the conversation gets awkward fast. Keep the rules plain: swap whole weeks when possible, ask early, make sure the backup agrees before anything changes, update the shared calendar right away, and keep the current primary responsible until the handoff is confirmed.

What protects people most is not a clever schedule. It is fewer handoffs and clear ownership. People can handle a hard week now and then. What wears them down is coverage that feels messy and never fully ends.

Write runbooks people will use

A runbook has one job: help a sleepy engineer make the first safe move in under a minute.

If someone gets paged at 2 a.m., they should not have to guess, search chat, or rely on something they heard six months ago. Start with one page for each service. Name the service, explain what it does in one line, and name the owner or backup owner. In a small team, that alone removes a lot of stress because people know who can answer the weird question no one else can.

Each runbook should cover the same small set of details: the first checks to run, the most likely causes, the exact rollback or disable step, the command or dashboard to use, the log location, and the point where the on-call person should wake someone else up.

Plain language beats perfect detail. Write the first action the way you would text a teammate: "Check error rate first. If it jumped after the last deploy, roll back that release." That is far more useful than a wall of theory.

Keep everything in one place. If the runbook says "open Grafana," include the dashboard name. If it says "check logs," say which service and which filter to use. Teams using Grafana, Prometheus, and Sentry should list the exact saved views or query names, not just the tool itself. Small gaps waste the most time.

A payments API runbook might start with three checks: recent deploys, error rate, and database connections. The likely causes might be an expired secret, a failed migration, or a third-party timeout. The rollback step should be one clear command or one release tag, not a paragraph.
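
If the team keeps runbooks as structured data, a small check can catch gaps before anyone needs the page at night. The sketch below is one way to do that. The `Runbook` class, the owners, the dashboard name, and the log filter are all hypothetical placeholders; only the three checks and three likely causes come from the payments example above.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    service: str
    purpose: str              # what the service does, in one line
    owner: str
    backup_owner: str
    first_checks: list[str]
    likely_causes: list[str]
    rollback_step: str        # one command or release tag, not a paragraph
    dashboard: str
    log_location: str
    escalate_when: str

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so gaps show up before 2 a.m."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values: owners, dashboard, and log filter are hypothetical.
payments = Runbook(
    service="payments-api",
    purpose="Charges cards and records settlements.",
    owner="Alex",
    backup_owner="Sam",
    first_checks=["recent deploys", "error rate", "database connections"],
    likely_causes=["expired secret", "failed migration", "third-party timeout"],
    rollback_step="",  # deliberately empty to show the check
    dashboard="Payments / Overview",
    log_location="service=payments-api level=error",
    escalate_when="Data loss looks possible or rollback does not help.",
)

print(payments.missing_fields())  # ['rollback_step']
```

A runbook with no rollback step is the most common gap, and this kind of check makes it visible in daylight.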

Review the runbook after every real page. Remove stale steps, add the clue that actually helped, and cut anything nobody used. That habit does more than writing a giant handbook once and forgetting it.

Trim alerts before they reach a phone

Most small teams do not have an incident problem. They have a filtering problem.

If every wobble reaches a phone, the rota fails fast and people stop trusting the alerts. A simple rule fixes a lot: page someone only when users feel the problem now. If the app is down, sign-in is broken, payments fail, or error rates jump high enough that customers cannot finish a task, call a human. If the issue can wait until morning without hurting users, send it to a team channel instead.

This matters even more in a five-person rotation. One noisy weekend can poison the rest of the month.

Teams often keep too many alerts because it feels safer. In practice, the opposite happens. People learn to ignore pings, and the alert that matters arrives with the same tone as a harmless warning.

A few simple rules help. Page only for user-facing failures or real data-loss risk. Group repeat alerts from the same root cause into one incident. Send low-risk warnings to a channel, not a phone. Add quiet hours for issues with enough buffer to wait until morning. Delete alerts that never lead to action.

Grouping matters more than many teams think. In Prometheus, Grafana, or Sentry, one database slowdown can trigger several alarms. The person on call does not need five pages. They need one incident with enough context to start.
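
The grouping idea can be sketched in a few lines. This is only an illustration, not the API of any monitoring tool: the alert fields (`time`, `name`, `cause`), the ten-minute window, and the function name are assumptions. Tools such as Alertmanager have their own grouping settings that do this job in production.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(minutes=10)):
    """Fold alerts that share a root-cause key and fire close together.

    `alerts` is a list of dicts with "time", "name", and "cause" keys.
    Returns a list of incidents, each listing the alerts it absorbed.
    """
    incidents = []
    open_by_cause = {}  # cause -> incident currently collecting alerts
    for alert in sorted(alerts, key=lambda a: a["time"]):
        incident = open_by_cause.get(alert["cause"])
        if incident and alert["time"] - incident["last_seen"] <= window:
            incident["alerts"].append(alert["name"])
            incident["last_seen"] = alert["time"]
        else:
            incident = {
                "cause": alert["cause"],
                "started": alert["time"],
                "last_seen": alert["time"],
                "alerts": [alert["name"]],
            }
            incidents.append(incident)
            open_by_cause[alert["cause"]] = incident
    return incidents


t0 = datetime(2026, 4, 11, 2, 0)
alerts = [
    {"time": t0, "name": "db_latency_high", "cause": "db-primary"},
    {"time": t0 + timedelta(minutes=1), "name": "api_p95_slow", "cause": "db-primary"},
    {"time": t0 + timedelta(minutes=2), "name": "checkout_errors", "cause": "db-primary"},
    {"time": t0 + timedelta(minutes=3), "name": "cert_expiring", "cause": "tls"},
]
incidents = group_alerts(alerts)
print(len(incidents))  # 2
```

Four alarms become two incidents, and the on-call person gets one page for the database slowdown instead of three.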

Quiet hours help too. If disk usage is rising but there is still plenty of headroom until 9 a.m., let it wait. Save night pages for problems that are live, visible, and getting worse.
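
The disk example comes down to simple arithmetic: how many hours of headroom are left, and is that comfortably more than the hours until morning? The sketch below assumes linear growth and a safety factor of two. Both are assumptions to tune, and the function names are illustrative.

```python
def hours_until_full(used_pct: float, growth_pct_per_hour: float) -> float:
    """Estimate hours of headroom left, assuming linear growth."""
    if growth_pct_per_hour <= 0:
        return float("inf")  # not growing, so it never fills
    return (100.0 - used_pct) / growth_pct_per_hour


def route_disk_alert(used_pct, growth_pct_per_hour, hours_until_morning,
                     safety_factor=2.0):
    """Page only if the disk could fill before morning, with a margin."""
    headroom = hours_until_full(used_pct, growth_pct_per_hour)
    if headroom < hours_until_morning * safety_factor:
        return "page"
    return "channel"


# 80% full, growing 0.5% per hour, 8 hours until 9 a.m.: 40 hours of headroom.
print(route_disk_alert(80, 0.5, 8))   # channel
# 95% full, growing 1% per hour: 5 hours of headroom.
print(route_disk_alert(95, 1.0, 8))   # page
```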

A blunt cleanup test works well: look at the last 20 alerts and ask, "Did someone take a clear action because of this?" If the answer is no, change it or remove it.
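
If the paging history can be exported, the cleanup test takes a few lines. The sketch below assumes each record is an `(alert_name, action_taken)` pair, oldest first. That format and the function name are made up for illustration.

```python
def review_alerts(history, sample_size=20):
    """Return alert names from recent pages that never led to action.

    `history` is a list of (alert_name, action_taken) tuples, oldest first.
    """
    recent = history[-sample_size:]
    acted_on = {name for name, action_taken in recent if action_taken}
    candidates = []
    for name, _ in recent:
        if name not in acted_on and name not in candidates:
            candidates.append(name)
    return candidates


history = (
    [("disk_warn", False)] * 12
    + [("checkout_errors", True)] * 3
    + [("cpu_spike", False)] * 5
)
print(review_alerts(history))  # ['disk_warn', 'cpu_spike']
```

Anything on that list is a candidate to change or remove.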

Roll it out step by step

Start small. If you try to redesign every service, every alert, and every handoff in one week, the team will get tired before the rota even starts.

Pick one service first, ideally the one already causing the most noise. Then look at one week of real incident and alert data. That is long enough to show patterns and short enough that people still remember what happened.

Write down how many pages each person would have received during that week. Split the count into total pages and pages after midnight. Those numbers tell different stories. Ten daytime alerts might be annoying. Two alerts at 2 a.m. can wreck a weekend.
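
A short script can do the counting. The sketch below assumes pages are available as `(person, datetime)` pairs and treats 00:00 to 06:00 as "after midnight". The cutoff, the sample data, and the function name are assumptions.

```python
from collections import Counter
from datetime import datetime


def count_pages(pages, night_start=0, night_end=6):
    """Count total and night pages per person.

    `pages` is a list of (person, datetime) tuples. Night is the hour range
    [night_start, night_end), an assumed stand-in for "after midnight".
    """
    total = Counter()
    night = Counter()
    for person, when in pages:
        total[person] += 1
        if night_start <= when.hour < night_end:
            night[person] += 1
    return {p: {"total": total[p], "night": night[p]} for p in total}


pages = [
    ("Alex", datetime(2026, 4, 6, 14, 5)),
    ("Alex", datetime(2026, 4, 7, 2, 10)),
    ("Alex", datetime(2026, 4, 8, 2, 40)),
    ("Sam", datetime(2026, 4, 9, 11, 0)),
]
summary = count_pages(pages)
print(summary["Alex"])  # {'total': 3, 'night': 2}
```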

Then fix the noisiest alert before adding new rules. The worst offender is often a low-signal alert that fires, clears, and fires again. Tighten the threshold, add a time window, or remove it if nobody acts on it.
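
Adding a time window usually means the condition has to hold for several checks in a row before anyone is paged. The sketch below shows the idea with a plain function. The threshold, sample values, and function name are illustrative. Prometheus alerting rules express a similar idea with the `for` field.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if the last `min_consecutive` samples are all above `threshold`."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Error rate in percent, one sample per minute.
flapping = [0.2, 6.0, 0.3, 7.0, 0.1]   # fires, clears, fires again
sustained = [0.2, 6.0, 6.5, 7.0]       # stays high

print(sustained_breach(flapping, threshold=5.0, min_consecutive=3))   # False
print(sustained_breach(sustained, threshold=5.0, min_consecutive=3))  # True
```

The flapping series never pages, while the sustained one does after three bad minutes.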

Create a short runbook for the service you picked. Keep it plain: what broke before, what to check first, how to confirm customer impact, and when to escalate. If an engineer cannot use it while half awake, it is too long.

End each on-call week with a handoff note. Keep it brief. Include open issues, flaky alerts, recent deploys, and anything likely to page the next person. Then test that note by asking the next engineer to follow it and point out what is missing.

Run the schedule for a few weekly rotations, about a month, before judging it. After that, change the order, backup rules, or shift times based on what actually happened, not on guesses.

This slow rollout works better than a big policy document. Small teams learn more from one noisy service than from ten clean diagrams. If you already track errors and uptime in Sentry, Grafana, or Prometheus, use that data to keep the setup honest. The goal is simple: fewer surprise pages, calmer weekends, and a rota people do not dread.

A weekend incident, handled well

Saturday at 8:12 a.m., a payment job starts failing after a small config change. New orders still come in, but settlements stop moving. The alert goes to the primary person on call, and only that person. Nobody else gets a noisy wake-up for a problem one engineer can probably fix in a few minutes.

The alert is short and useful. It says the payment worker failed three runs in a row, shows the affected service, and includes the runbook name.

The primary opens the runbook first, not Slack. The first page answers three things right away: where to check logs, how to measure customer impact, and what to roll back if the latest change caused the failure. That saves time because nobody has to guess which dashboard matters or which command is safe.

In the logs, the engineer sees a parsing error tied to the morning deploy. The runbook says to confirm whether payments are stuck or only delayed. A quick check shows about 40 orders waiting, with no double charges and no lost data. That matters. A delay is serious, but it is not the same as data corruption.

The engineer rolls back the worker, restarts the failed job, and watches the queue drain. Ten minutes later, new payments process normally again. The backup never joins because the first fix works, customer impact stays limited, and the runbook does not call for escalation.

That last part matters. The backup should help when the primary gets stuck, when the issue spreads, or when the fix could make things worse. The backup should not join by default just because an alert fired.

On Monday, the team makes one small runbook update. They add the exact log filter that exposed the parsing error, so the next person can find it in seconds instead of hunting for it. Small edits like that do more than long postmortems. Over time, they turn weekend incident response from a team interruption into a short, contained task.

Mistakes that break the schedule

Most schedules break long before anyone quits. They break when the phone goes off for noise, when nobody trusts the handoff, and when one small bug wakes half the team.

The fastest way to ruin a rota is to send every error to the on-call phone. A single failed background job, one slow query, or a retry that heals itself should not drag someone out of bed. Page people only for problems users feel right now, or for problems that will turn into user pain soon.

Changing the rota every few days causes a different kind of stress. People stop planning their weekends because the schedule feels slippery. Keep the pattern boring. In a five-person team, longer blocks usually work better than constant swaps because everyone knows when they are truly off.

Runbooks matter most at 2:17 a.m. If the only fix lives in one engineer's head, the rota turns into a hidden single point of failure. Write short steps for the common messes: how to check impact, where to restart a worker, when to roll back, and when to wake a second person.

A lot of teams also wake two or three people for one problem. That is panic, not process. One person should own the first response. They can pull in another engineer only if the runbook sets a clear line, such as data-loss risk or a broken payment flow.

Internal tools should clear a higher bar than customer-facing systems before they disrupt anyone's sleep. If an internal dashboard is down at midnight but customers can still buy and log in, it can wait until morning. Treating every internal tool issue like a public outage teaches the team that priorities are fake.

A quick gut check helps. Would a customer notice this right now? Is money, data, or security at risk? Will anything get worse if it waits until morning?

If every answer is no, do not page. If any answer is yes, make sure one person can handle it from a written playbook. Protecting weekends is often less about heroics and more about saying, calmly, "this can wait."

Checks before you go live

A rota can look neat on paper and still ruin a weekend. Before you turn it on, do a short reality check with real numbers, real runbooks, and real people.

Start with last month's pages. Count how many alerts fired, then count how many turned out to be noise. If half the pages did not need a human, fix that first. A five-person team does not have enough slack to carry alert fatigue for long.

The rest of the pre-launch check is short. Make sure every alert has one clear owner, not "the team." Open three runbooks at random and time how long it takes to find the first action. Check vacation weeks now and name the backup for each gap. Ask each engineer to explain the rota in one minute, including who covers swaps and when escalation starts.

That runbook test matters more than people expect. If someone opens a runbook at 2:10 a.m. and spends five minutes hunting for the first command, the document failed. Good runbooks start with the first safe move, the expected result, and the point where the engineer should wake someone else up.

Ownership also needs to be plain. An alert should say who owns the service, where the runbook lives, and what counts as urgent. If an alert lands and the on-call person still has to guess which system it belongs to, you are not ready.
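
The ownership check is easy to automate if alert definitions live in a file or a list. The sketch below flags alerts with no owner, no runbook, no priority, or a vague owner such as "the team". The field names and the list of vague owners are assumptions for illustration.

```python
REQUIRED = ("owner", "runbook", "priority")
VAGUE_OWNERS = {"team", "the team", "everyone", "engineering"}


def alert_problems(alert: dict) -> list[str]:
    """List what is missing or vague in one alert definition."""
    problems = [f"missing {key}" for key in REQUIRED if not alert.get(key)]
    owner = str(alert.get("owner", "")).strip().lower()
    if owner in VAGUE_OWNERS:
        problems.append("owner is not a named person")
    return problems


alert_rules = [
    {"name": "checkout_errors", "owner": "Priya",
     "runbook": "payments-api", "priority": "P1"},
    {"name": "queue_depth", "owner": "the team",
     "runbook": "", "priority": "P2"},
]
for rule in alert_rules:
    print(rule["name"], alert_problems(rule))
```

An empty list for every alert is the bar to clear before the rota goes live.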

Do one dry run before the first full week. Pick a small incident, send the page, and watch what happens. If people hesitate, argue about coverage, or cannot explain the rota, pause and fix it. That hour of cleanup does more than any fancy schedule.

What to do next

Run the new rota for 30 days before you judge it. A five-person setup can feel fine in week one and rough by week three, so wait for real data before changing it again.

Track a small set of numbers during that month: how many alerts woke someone up, how many were false alarms, how long it took to respond, and how many incidents needed a second person. If weekend pages keep landing for the same harmless issue, the rule is wrong or the system needs a fix.
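
Those numbers fit in a very small report. The sketch below assumes each incident is recorded with four fields: whether it woke someone, whether it was a false alarm, minutes to respond, and whether a second person joined. The field names and sample values are hypothetical.

```python
from statistics import median


def monthly_summary(incidents):
    """Summarize a month of pages from a list of incident dicts."""
    total = len(incidents)
    if total == 0:
        return {"pages": 0}
    return {
        "pages": total,
        "wake_ups": sum(1 for i in incidents if i["woke_someone"]),
        "false_alarms": sum(1 for i in incidents if i["false_alarm"]),
        "median_response_minutes": median(
            [i["response_minutes"] for i in incidents]
        ),
        "second_person_needed": sum(
            1 for i in incidents if i["needed_second_person"]
        ),
    }


incidents = [
    {"woke_someone": True, "false_alarm": False,
     "response_minutes": 5, "needed_second_person": False},
    {"woke_someone": True, "false_alarm": True,
     "response_minutes": 12, "needed_second_person": False},
    {"woke_someone": False, "false_alarm": False,
     "response_minutes": 8, "needed_second_person": False},
    {"woke_someone": False, "false_alarm": False,
     "response_minutes": 30, "needed_second_person": True},
]
report = monthly_summary(incidents)
print(report["false_alarms"], report["median_response_minutes"])  # 1 10.0
```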

A short review at the end of the month is enough. Keep the alerts that lead to action within a few minutes. Remove or downgrade alerts that people keep closing without doing anything. Fix the top repeat issue before adding more rules. Check whether one person carried more painful shifts than the rest.

Runbooks need the same discipline. Right after an incident, open the runbook and fix it while the details are still fresh. Add the exact error message, the first check that helped, the command or dashboard that saved time, and the point where the issue should escalate. If you wait a week, people forget the awkward little detail that made the fix obvious.

Be strict about what stays in the system. Rules, handoffs, and runbook steps should earn their place by cutting noise or helping someone solve the problem faster. If they do neither, remove them.

Sometimes teams need an outside review because they are too close to the mess. Oleg Sotnikov at oleg.is does this kind of work as a fractional CTO, especially for small companies tightening alerts, handoffs, and runbooks while moving toward more AI-assisted engineering and operations. A short review can catch noisy paging rules, thin runbooks, and schedule gaps before they turn every weekend into a fire drill.

Frequently Asked Questions

What should wake someone up at night?

Wake someone up only for issues that hurt users now or put money, data, or security at risk. If the product is down, sign-in fails, payments stop, or data may be lost, page the primary right away. If it can wait until morning without real damage, keep it out of the phone.

How often should a five-person team rotate on call?

Most five-person teams do best with a weekly rotation. A full week gives the primary enough context to spot repeat issues and avoids messy daily handoffs. Keep one primary, one backup, and one person fully off the rota each week if you can.

Do we really need a backup every week?

Yes, but the backup should stay quiet unless the primary gets stuck, the problem spreads, or the fix could make things worse. That keeps one alert from turning into a team-wide interruption.

What should an on-call runbook include?

Keep each runbook to one page per service. Start with what the service does, who owns it, the first checks to run, the most likely causes, the safe rollback step, where logs and dashboards live, and the point where the primary should wake someone else.

How do we cut down noisy alerts?

Look at recent alerts and ask one blunt question: did anyone take a clear action because of this page? If not, tighten it, group it with similar alerts, send it to a channel, or delete it. Phones should get real incidents, not every wobble.

Should internal tool issues page the team?

Usually no. If an internal dashboard breaks at midnight and customers can still sign in, buy, and finish the main flow, let it wait until business hours. Save after-hours pages for customer-facing failures and real risk.

When should the primary call the backup?

The primary should pull in the backup when the runbook says to escalate, when data loss looks possible, when payments or another main flow stay broken, or when the primary cannot find a safe fix fast. Call early if the issue grows. Do not call just because a page fired.

How should we handle swaps and vacation weeks?

Set swap rules before anyone needs them. Swap whole weeks when possible, ask early, confirm the backup agrees, and update the shared calendar right away. For vacations, name coverage in advance so nobody guesses who owns the phone.

How do we know if the rota is working?

Give it 30 days and track a few numbers: total pages, after-midnight pages, false alarms, response time, and how often a second person had to join. If the same harmless issue keeps waking people up, the rule is wrong or the system still needs work.

When should a small team bring in outside help?

Ask for outside help when your team keeps arguing about priorities, runbooks stay thin, or the same alerts ruin weekends month after month. A short review from an experienced CTO can clean up paging rules, handoffs, and ownership before burnout turns into turnover.