Oct 23, 2025·8 min read

Who makes the call during an outage on a small team

Who makes the call during an outage on a small team? Learn how to split diagnosis, customer updates, and rollback authority to stop chat loops.

Why small-team outages turn into chat loops

Most small teams do not fail during an outage because they lack skill. They fail because one chat turns into five jobs at once. People post logs, ask for updates, suggest fixes, guess at causes, and debate whether to roll back. After a few minutes, nobody can tell which message needs action and which one is just noise.

The biggest drag is split attention. The same engineer often tries to debug the issue, answer the founder, calm support, and write a customer update. That sounds efficient, but it usually slows everything down. When someone keeps switching from stack traces to status messages, they miss clues, repeat checks, and lose the thread.

A second problem is fuzzy authority. Many startups never decide who can stop a bad release, who can approve a rollback, or who owns the customer message. So the team waits for group agreement. In practice, that means four people type at once and no one acts. Users stay blocked while the team asks, "Should we revert?" three different ways.

This is why the question "who makes the call during an outage" matters so much on a small team. If nobody owns the call, everyone tries to own part of it. That creates loops. One person says the database looks fine, another suspects the deploy, a third asks whether support should say anything yet, and the original issue still sits there.

A simple example shows the pattern. A five-person SaaS team pushes a release on Friday afternoon. Logins start failing. Two engineers inspect auth tokens, the founder asks for an ETA in the same chat, and support wants wording for affected customers. Meanwhile, no one knows who can hit rollback without approval from the rest. Ten minutes of work becomes forty minutes of discussion.

Chat loops are not really a communication problem. They are a decision problem disguised as conversation.

The decisions you need to separate

When teams ask who makes the call during an outage, they usually bundle three separate calls into one. That is why the chat fills with theories, half-written customer updates, and arguments about rollback. Split the work, and the noise drops fast.

The first decision is about cause. One person owns diagnosis and says what the team knows, what it does not know, and what to test next. Nobody else should run side quests in the main thread unless that person asks for help. If five people guess at once, you do not move faster. You just create five versions of the story.

The second decision is about customers. This person does not need the full root cause before speaking. They need three facts: what users can see, when the issue started, and when the next update will come. "Checkout is failing for some users since 10:12. We are working on it. Next update in 15 minutes" is enough. Clear beats complete.

The third decision is about recovery. Someone must decide whether the team should roll back, pause changes, or keep digging. That choice is different from diagnosis. The engineer closest to the logs may still want ten more minutes. The person responsible for service recovery may decide that ten more minutes costs too much and roll back now.

A simple split looks like this:

Diagnosis owner: finds the current cause and directs technical checks
Customer update owner: posts updates on a fixed schedule
Recovery owner: decides rollback, pause, or continue investigation
Tie-breaker: makes the final call when two owners disagree

The tie-breaker matters more than most small teams think. If the diagnosis owner wants to keep digging, but the recovery owner wants to roll back, one person must end the debate in under a minute. On a small team, that is often the CTO, founder, or the most senior engineer on call.

Picture a bad deploy that breaks login. One engineer checks logs and recent changes. Another sends the first customer update and sets the next one for 15 minutes later. The recovery owner decides whether to revert the deploy. Others can help, but only these owners make those calls. That keeps the outage moving forward instead of spinning in chat.

Pick one person for each job

When an outage hits, one problem turns into three very fast. Someone needs to find the cause. Someone needs to tell customers what is happening. Someone needs to decide whether to roll back. If one person tries to do all of that, the fix slows down. If three people try to do each job at once, the chat fills with guesses.

A small team does better with named roles. The titles matter less than the split.

The diagnosis lead follows the clues and picks the next check.
The update owner writes status messages and keeps them short, clear, and factual.
The rollback owner decides when to revert and says yes or no.

The diagnosis lead should not stop every five minutes to explain half-formed theories. They need space to test one path, rule it out, and move to the next. Good incident chats are boring on purpose. One person says what to check next. Everyone else either helps with that check or stays out of the way.

The update owner protects that focus. They gather confirmed facts, turn them into plain language, and send the same message everywhere customers might hear from you. That keeps support, sales, and founders from pulling engineers into side chats.

Rollback authority needs a separate owner because this call is emotional. Teams often wait too long because nobody wants to admit the release caused the outage. A clear owner cuts through that. They listen to the diagnosis lead, but they make the final decision.

Add one backup for each role. Small teams have sleep schedules, meetings, school pickup, and plain bad timing. If the main person disappears, the backup takes over without a debate.

Keep the same split across tools and shifts. If your team uses chat, phone, email, or an incident channel, the same person owns the same job in each place. When the shift changes, hand off the role by name. That is how you answer who makes the call during an outage before the pressure starts.
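The role split with backups can be written down as a small roster. This is a minimal sketch, not a prescribed tool: the role keys, the names, and the `current_owner` helper are all placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    primary: str
    backup: str


# Hypothetical roster: role keys and names are placeholders, not a real team.
ROSTER = {
    "diagnosis": RoleAssignment(primary="mia", backup="sam"),
    "customer_updates": RoleAssignment(primary="jon", backup="ana"),
    "rollback": RoleAssignment(primary="leah", backup="mia"),
}


def current_owner(role: str, available: set[str]) -> str:
    """Return the primary if reachable, otherwise the backup, with no debate."""
    assignment = ROSTER[role]
    if assignment.primary in available:
        return assignment.primary
    if assignment.backup in available:
        return assignment.backup
    raise LookupError(f"nobody is reachable for role {role!r}")


if __name__ == "__main__":
    # Leah is unreachable, so the rollback call passes to her backup.
    print(current_owner("rollback", available={"mia", "jon"}))
```

The point of the lookup is that the handoff is decided in advance. If neither person is reachable, the sketch fails loudly instead of guessing.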

The first 15 minutes

Most outage damage in a small team comes from noise, not just from the bug. Three people start checking logs, someone pushes a "quick fix," and nobody writes down what changed. The first 15 minutes should feel a little strict.

Start by confirming two things: did the alert fire for a real reason, and what are users experiencing right now? "CPU is high" is not enough. "Checkout fails for 40% of users" is useful. If you can name the user impact in one sentence, the team stops arguing about whether this is urgent.

Then freeze anything that can make the picture worse. No new deploys. No config edits. No "I already had a patch ready." If a team needs to keep moving fast during normal work, it needs to slow down on purpose during an outage.

A simple rhythm works well:

Minute 0 to 2: verify the alert and check one customer-facing symptom.
Minute 2 to 4: pause deploys and risky changes.
Minute 4 to 7: open one incident note and add time stamps for every action.
Minute 7 to 10: give each person one check only, such as database health, recent deploy, or third-party status.
Minute 10 to 15: send a short customer update and set the next decision time.

That incident note matters more than people think. One shared record cuts repeat work and stops the team from circling the same question. Write plain facts: "10:04 alert fired," "10:06 deploys paused," "10:09 API errors rose after cache restart."
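A shared incident note can be as simple as an append-only list of timestamped facts. The sketch below assumes a hypothetical `IncidentNote` class; in practice a shared doc or pinned message does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class IncidentNote:
    """Append-only timeline of plain facts (hypothetical helper)."""

    entries: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, fact: str, at: Optional[datetime] = None) -> None:
        # Default to the current time so every action gets a timestamp.
        self.entries.append((at or datetime.now(), fact))

    def render(self) -> str:
        # Sort by time so late additions still read as one timeline.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"{ts:%H:%M} {fact}" for ts, fact in ordered)


if __name__ == "__main__":
    note = IncidentNote()
    note.log("alert fired")
    note.log("deploys paused")
    print(note.render())
```

Entries are only ever added, never edited, which keeps the record usable for the debrief afterwards.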

Keep the team on one-question moves. Ask one person to inspect the last deploy. Ask another to test the main user path. Ask a third to check infrastructure health. Do not ask everyone to "look around."

Send the first customer update early, even if it is brief: "We see the issue, we are investigating, next update in 15 minutes." That buys trust and reduces pressure in the incident chat.

Before the 15-minute mark, pick the next decision point. For example: "If sign-in errors are still above 20% at 10:15, we roll back." That is the moment when who makes the call during an outage stops being vague and becomes real.
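A timed decision point is a deadline plus a threshold. One way to sketch it, assuming a hypothetical `DecisionPoint` type and an error rate expressed as a fraction:

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class DecisionPoint:
    decide_at: time        # when the owner must decide
    max_error_rate: float  # fraction, e.g. 0.20 for 20%
    action: str            # what happens if the rate is still too high

    def evaluate(self, now: time, error_rate: float) -> str:
        if now < self.decide_at:
            return "wait"
        if error_rate > self.max_error_rate:
            return self.action
        # Below the threshold at the deadline: keep going and set a new point.
        return "continue"


if __name__ == "__main__":
    point = DecisionPoint(time(10, 15), 0.20, "roll back")
    print(point.evaluate(now=time(10, 15), error_rate=0.35))
```

The rule still needs a named person to act on it. The code only shows that the condition can be stated before the deadline arrives, so nobody argues about it afterwards.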

Customer updates that do not slow the fix

Customers do not need your theory during an outage. They need a plain description of what they can feel right now. Say "some users cannot log in" or "checkout fails after payment". Do not guess at the cause until the team confirms it.

A good update has two facts and one promise: the current impact, what still works if you know it, and the next update time. That gives people something solid to act on. It also cuts down the flood of follow-up questions in Slack, email, and support chats.

One person should own every external and internal update. Everyone else stays on diagnosis, mitigation, or rollback. If three engineers start wording status messages, the fix slows down and the wording drifts.

Support and leadership should get the same base message. Support can reuse it with customers. Leadership can make business calls from the same facts. When each group hears a different version, people start asking the incident channel to "clarify," and the loop starts again.

Short updates work best:

What users see: errors, delays, missing data, failed actions
Who is affected: all users, new sign-ins, one region, paid plans only
What happens next: investigating, testing rollback, partial recovery, next update time

Skip lines like "we are looking into it." They say almost nothing. A better message is: "Some users cannot log in. Existing sessions still work. We are testing a rollback. Next update in 15 minutes."
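The update structure described here (impact, what still works, current action, next update time) fits a small template. This is a sketch with a hypothetical `compose_update` helper. It refuses to produce a message without the impact and the next update time, because those are the two parts customers act on.

```python
def compose_update(impact: str, next_update: str,
                   still_works: str = "", action: str = "") -> str:
    """Build a short status message from confirmed facts only."""
    if not impact.strip():
        raise ValueError("state the user-visible impact")
    if not next_update.strip():
        raise ValueError("promise a next update time")
    parts = [impact, still_works, action, f"Next update {next_update}"]
    # Skip empty parts and end each sentence with exactly one period.
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())


if __name__ == "__main__":
    print(compose_update(
        impact="Some users cannot log in",
        still_works="Existing sessions still work",
        action="We are testing a rollback",
        next_update="in 15 minutes",
    ))
```

There is deliberately no field for a suspected cause, so an unconfirmed theory has nowhere to go in the message.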

If you do not know the cause yet, say that directly. Clear uncertainty is better than a wrong answer. "We see elevated checkout failures and are narrowing the change that triggered them" is fine. "This may be a database issue" is not, unless you have proof.

Keep the rhythm steady even when nothing changes. A short note at the promised time builds trust: "Impact is unchanged. Rollback is still in progress. Next update at 11:30." Silence usually creates more pressure than bad news.

When to roll back and who says yes

If the newest deploy lines up with the first spike in failures, roll it back fast. Do not turn the incident chat into a debate club while users fail to sign in, pay, or finish the one task they came for.

A small team needs one person with clear rollback authority. In many startups, that is the tech lead, CTO, or the engineer on call. The title matters less than the rule: one named person can say yes without waiting for group approval.

Keep the trigger simple. If the latest change matches the failure and users are stuck on a core action, rollback should be the default move.

Set the trigger before the outage

Pick a few numbers your team will trust under stress:

error rate above your normal range for 5 to 10 minutes
sign-in failures that block new or returning users
payment or checkout errors that stop revenue
a clear count of lost orders, failed jobs, or stuck sessions

These thresholds do not need to be perfect. They need to be clear enough that nobody argues about them while the site burns.
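A trigger like "error rate above normal for 5 minutes" can be checked mechanically. The sketch below assumes one error-rate sample per minute, newest last. The function name and the sample values are illustrative only.

```python
def sustained_breach(per_minute_rates: list[float], threshold: float,
                     minutes: int = 5) -> bool:
    """True if every one of the last `minutes` samples exceeds the threshold."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if len(per_minute_rates) < minutes:
        # Not enough history yet to call the breach sustained.
        return False
    return all(rate > threshold for rate in per_minute_rates[-minutes:])


if __name__ == "__main__":
    # One calm minute followed by five bad ones, checked against a 5% threshold.
    rates = [0.01, 0.30, 0.32, 0.29, 0.31, 0.30]
    print(sustained_breach(rates, threshold=0.05))
```

Requiring every sample in the window to breach avoids triggering on a single spike. A team that prefers to act faster could shorten the window instead of loosening the rule.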

Keep digging only when rollback creates more risk than the outage itself. That usually means the release changed data in a way you cannot safely undo, ran an irreversible migration, or fixed a security issue you cannot re-open. In those cases, the rollback owner should say that out loud in one sentence and keep the team on the live fix.

A short rule works well: if rollback is safe and the latest change is the likely cause, rollback first and investigate on a stable system. If rollback is unsafe, contain the damage and push the fastest safe fix.
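The short rule above reduces to two yes-or-no questions. A minimal sketch follows, with hypothetical names. The third outcome (rollback is safe but the latest change is not the likely cause) is not covered by the rule and is an assumption added here.

```python
from enum import Enum


class Move(Enum):
    ROLL_BACK = "roll back first, investigate on a stable system"
    CONTAIN_AND_FIX = "contain the damage and push the fastest safe fix"
    KEEP_DIGGING = "keep diagnosing until the next decision point"  # assumption


def next_move(rollback_is_safe: bool, latest_change_is_likely_cause: bool) -> Move:
    if not rollback_is_safe:
        # Irreversible migration, data changes, or a security fix at stake.
        return Move.CONTAIN_AND_FIX
    if latest_change_is_likely_cause:
        return Move.ROLL_BACK
    return Move.KEEP_DIGGING


if __name__ == "__main__":
    print(next_move(rollback_is_safe=True, latest_change_is_likely_cause=True).value)
```

The rollback owner still answers both questions. The sketch only shows that the answers map to a move without further debate.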

If you still debate who makes the call during an outage, decide before the next deploy. Write one line in your runbook with the owner’s name, backup, and thresholds. That single line can save 20 minutes of looping, and 20 minutes is a lot when checkout is down.

A simple outage example

At 10:02, a small startup team ships a checkout change. By 10:03, the payment error rate jumps hard, and new orders start dropping. Support sees the same complaint twice: customers can fill the cart, but the final payment step fails.

The team does not open a free-for-all chat. They split the work.

Mia owns diagnosis.
Jon owns customer updates.
Leah owns rollback authority.

Mia starts with facts, not guesses. She compares logs, metrics, and the last release. The new deploy touched checkout totals, so she checks whether a bad tax or discount value is breaking the payment request. She does not trade theories in chat. She stays in the data, and that saves time.

Jon sends the first status note at 10:05. He keeps it plain: "We are seeing checkout errors after a recent release. We are working on it now. Next update at 10:10." He sends the same message to the internal team and the customer-facing channel. Nobody asks engineers for custom wording every two minutes, which matters more than most teams admit.

At 10:07, Mia tries one fix. She rolls back a config value tied to the new checkout logic. It fails. Errors stay high. By 10:11, Leah makes the call to revert the full release. Nobody votes. Nobody starts a side debate about whether they should wait another five minutes. She owns rollback authority, so she says "revert now," and the team does it.

Recovery does not end the incident. The team watches success rates, logs, and real orders for the next several minutes. When checkout stays stable and fresh payments keep clearing, they hold deploys a bit longer instead of pushing a rushed patch.

That is who makes the call during an outage on a small team: one person for diagnosis, one for updates, one for rollback. If all three jobs sit in one noisy chat, the chat becomes the problem.

Mistakes that slow everyone down

If nobody has decided who makes the call during an outage, the same pattern shows up fast. The team opens a chat, everyone jumps in, and work spreads in five directions at once.

The first problem is duplicate investigation. Three people pull logs, restart services, and test theories at the same time, but nobody knows who owns diagnosis. That feels busy. It rarely moves the service back faster. It mostly creates noise, repeated questions, and conflicting guesses.

The next problem is communication by volume. The loudest person often takes over customer updates, even if they do not know the latest facts. Then support, sales, or founders repeat half-checked details. Customers get mixed messages, and engineers lose time correcting them instead of fixing the issue.

A more dangerous version happens when one person pushes a patch while someone else prepares a rollback. Those are opposite paths. If both move at once, the team can erase evidence, ship a partial fix, or roll back a change that was not the cause. One person needs authority to choose the path and stop the other one.

Blame is another time sink. Small teams do this more often than they admit. Someone asks who merged the change, who approved it, or why this was missed. That discussion can wait. During an outage, blame steals attention from recovery.

The quietest mistake is poor recordkeeping. Nobody writes down the time the issue started, the exact symptom, who owns diagnosis, who talks to customers, or who approved the rollback. Twenty minutes later, the chat is full of opinions but missing facts.

A team is probably stuck if you see this:

two or more people testing different fixes without a clear owner
customer messages going out before the team agrees on the current status
a patch and rollback moving in parallel
debate about fault before the service is stable
no simple timeline in the incident chat

A short incident note fixes more than most teams expect. Write the time, owner, current theory, next action, and last customer update. Five lines can cut twenty minutes of looping.

A short checklist for every incident

If nobody answers "who makes the call during an outage" in the first few minutes, the same argument repeats all hour. A short written checklist stops that. Keep it in one fixed place and use the same order every time.

Name the diagnosis lead. One person owns the technical picture right now. Everyone else sends that person logs, test results, and observations.
Name the customer update owner. One person writes the next message so customers do not get mixed signals from support, product, and the founder.
Name the rollback approver. One person can say yes or no fast if the team needs to back out a release, disable a feature flag, or restore an older version.

On a very small team, one person may hold two of these roles. One person should not hold all three. That is how incident chats turn into guesswork and delay.

Write the customer impact in plain language. Skip vague notes like "site issues" or "service disruption." Say what action fails: checkout fails after payment, login codes do not arrive, dashboards do not load, or API requests time out. If the problem only hits one region, one plan, or one browser, record that too.

Then set the next decision point. Do not leave it open-ended. Pick a time such as "10:20 we decide whether to roll back" or "in 15 minutes we choose between deeper diagnosis and mitigation." A timed decision keeps the team moving.

Record facts and times in one shared place. A single incident doc, ticket, or pinned chat message is enough.
Log only what the team knows: first alert time, customer reports, changes made, tests run, and decisions taken.
Add timestamps for every action. Later, those five or six lines often explain the whole outage faster than memory does.

This checklist takes less than two minutes to fill out. It usually saves much more than that, especially when pressure rises and everyone wants to help at once.
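The checklist can also be checked for gaps before the team moves on. The sketch below uses a hypothetical `IncidentChecklist` record. It flags empty fields, one person holding all three roles, and the vague impact phrases named above.

```python
from dataclasses import dataclass

# Phrases the checklist says to avoid; extend as needed.
VAGUE_IMPACTS = {"site issues", "service disruption"}


@dataclass
class IncidentChecklist:
    diagnosis_lead: str
    update_owner: str
    rollback_approver: str
    customer_impact: str
    next_decision_at: str

    def problems(self) -> list[str]:
        issues = [f"{name} is empty"
                  for name, value in vars(self).items() if not value.strip()]
        owners = {self.diagnosis_lead.strip().lower(),
                  self.update_owner.strip().lower(),
                  self.rollback_approver.strip().lower()}
        # Two roles per person is allowed on a very small team; three is not.
        if len(owners) == 1 and "" not in owners:
            issues.append("one person holds all three roles")
        if self.customer_impact.strip().lower() in VAGUE_IMPACTS:
            issues.append("customer impact is too vague")
        return issues


if __name__ == "__main__":
    checklist = IncidentChecklist("mia", "jon", "leah",
                                  "checkout fails after payment", "10:20")
    print(checklist.problems())
```

An empty result means the checklist is filled in. It does not mean the incident is under control.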

Next steps for a small team

A small team does not need a thick incident playbook. It needs a few clear rules that people can remember under stress. If you want to stop the same outage chat from looping again, make one change after each incident, not ten.

Keep the review short and factual. A 15-minute debrief is enough for most startup incidents. Write down the timeline, who made each decision, when customer updates went out, and what delayed the fix. Skip blame. Blame turns a useful review into a quiet room where nobody says what actually broke.

Then fix one unclear handoff before the next incident. Pick the point where people hesitated or talked over each other. Maybe the engineer debugging the issue also tried to write customer outage updates. Maybe nobody knew who could approve a rollback. Choose one of those gaps and close it.

A single page often does more than a long document. Put three things on it:

who leads diagnosis
who sends customer updates
who can approve a rollback, and at what threshold

That last part matters. Do not wait until production is down to debate rollback authority. Write simple thresholds such as: roll back if failures on a core action begin right after the last deploy, if error rates stay above the agreed limit after a release, or if the team cannot confirm a fix within 10 minutes. Simple rules beat clever ones.

Run one practice drill on a quiet day. Keep it small. Use a made-up outage, a 20-minute timer, and the same chat channel you would use in real life. You will spot awkward gaps fast. People usually find them in the first five minutes: two people posting updates, nobody owning the decision, or everyone asking who makes the call during an outage.

A concrete example helps. If your last incident stalled because the founder, lead engineer, and support person all weighed in on rollback authority, assign one owner now and write down when they act without asking for consensus. That single change can save 20 minutes in the next outage.

If your startup needs a cleaner incident setup, Oleg Sotnikov can help define simple roles, rollback rules, and lean operating habits as a fractional CTO. The useful version is not a big process. It is a small system your team will actually use at 2 a.m.