Dec 03, 2024·8 min read

Vendor outage runbook for sales, support, and ops

Build a vendor outage runbook that tells sales, support, and operations what to say, what to do first, and when to escalate.

Why teams struggle when a vendor goes down

Teams rarely fall apart because the outage is too hard to understand. They struggle because everyone reacts at once, each person makes different assumptions, and nobody knows which message is official. A vendor outage runbook fixes that. Most companies only realize they need one after a bad day.

Sales usually feels the pressure first. Reps still have demos, renewals, and follow-up calls on the calendar. When a CRM, signing tool, or payment service stops working, customers want timing right away. If sales has no facts, people fill the gap with guesses. A rough estimate turns into a promise, and the rest of the day goes into walking it back.

Support runs into a different problem. Messages show up through chat, email, phone, and social channels within minutes. If agents do not have one approved update, each person writes their own version. One customer hears that the vendor is investigating. Another hears that a fix is coming soon. A third hears nothing at all. The outage is already frustrating. Mixed answers make your team look lost.

Operations often spots the issue and still waits. That sounds strange, but it happens all the time. One person checks the vendor status page, another starts an internal thread, and everyone assumes somebody else owns the response. Smaller companies feel this even more. The same person might cover ops in the morning, support by lunch, and customer calls in the afternoon.

The trust problem starts fast:

  • customers hear different answers
  • managers ask for updates nobody can confirm
  • internal teams repeat the same questions
  • simple workarounds show up too late

Customers usually understand that outside tools fail sometimes. They lose confidence when your team sounds confused, slow, or inconsistent. The outage belongs to the vendor. The customer experience still belongs to you.

Decide who owns the response

When a tool goes down, confusion spreads faster than the outage. A runbook only works when one person owns the business response and everybody else knows where to look.

That person does not need to repair the vendor's system. The job is simpler and just as important: confirm the issue, open the shared update channel, assign backups, and keep messages consistent. If two people try to do that at the same time, sales tells one story, support tells another, and ops wastes time sorting it out.

For a small team, keep the roles simple:

  • one incident owner for business teams
  • one backup for sales
  • one backup for support
  • one backup for operations

Write those names down before anything breaks. Do not rely on memory or a Slack search during a live problem. If the usual lead is away, part-time, or tied up in meetings, the backup needs full permission to make calls and send updates.
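
One way to keep that list out of people's memory is a small roster that a script or chat bot can read. The sketch below is illustrative only: the names, the `ROSTER` layout, and the `responder_for` helper are all hypothetical, and it assumes each role has exactly one lead and one backup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    lead: str
    backup: str


# Hypothetical names; replace them with your own team before an outage.
ROSTER = {
    "incident_owner": Role(lead="Dana", backup="Lee"),
    "sales": Role(lead="Priya", backup="Marco"),
    "support": Role(lead="Sam", backup="Ines"),
    "operations": Role(lead="Tom", backup="Aiko"),
}


def responder_for(role: str, unavailable: set[str]) -> str:
    """Return the lead for a role, or the backup if the lead is unavailable."""
    entry = ROSTER[role]
    for person in (entry.lead, entry.backup):
        if person not in unavailable:
            return person
    raise LookupError(f"Nobody is available for {role}; fix the roster now")


print(responder_for("support", unavailable={"Sam"}))  # prints "Ines"
```

The error at the end is deliberate: if both the lead and the backup are out, the gap should surface before an outage, not during one.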

Keep vendor contacts in one shared place that the response group can reach. That usually includes the account manager, support portal details, emergency phone numbers if you have them, contract tier, and your customer ID. Put it in the same document as the runbook so nobody has to dig through old email threads.

Use one shared chat room for the outage. Keep all status notes there, even if people also talk in team channels. That room becomes the source of truth for what happened, what the vendor said, which workarounds are active, and when the next update goes out.

One extra rule helps more than people expect: only the incident owner posts official internal updates, unless that job is handed to the backup. Under pressure, that single rule cuts down a lot of noise.

Run the first 30 minutes step by step

Most teams lose time in the first ten minutes because too many people start guessing. A calm start matters more than a fast guess.

First, confirm that the issue is real and figure out how wide it is. Check two sources before you call it an outage: your own failed actions, logs, or screenshots, plus something outside your team, such as the vendor's status page, a support reply, or the same failure reported by another department.

Then map the business impact in plain language. If the tool touches sales, support, and operations, write down what stops working right now. Keep it concrete: sales cannot update deals, support cannot see account history, ops cannot process orders.

A practical first half hour usually looks like this:

  • Minute 0 to 5: verify the problem from two sources and capture one screenshot or error message.
  • Minute 5 to 10: list affected teams, blocked tasks, and customer-facing risk.
  • Minute 10 to 15: send the first internal update, even if you know very little.
  • Minute 15 to 20: assign one person to contact the vendor and keep replies in one place.
  • Minute 20 to 30: stop making recovery promises and focus on workarounds and customer impact.
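
If the team uses a bot or a shared script during incidents, the timeline above can be stored as data so anyone can ask what should be happening right now. This is a minimal sketch: `FIRST_30_MINUTES` and `current_step` are made-up names, and treating each window as including its start minute and excluding its end is an assumption.

```python
# (start_minute, end_minute, action) taken from the checklist above.
FIRST_30_MINUTES = [
    (0, 5, "Verify the problem from two sources and capture a screenshot or error message"),
    (5, 10, "List affected teams, blocked tasks, and customer-facing risk"),
    (10, 15, "Send the first internal update, even if you know very little"),
    (15, 20, "Assign one person to contact the vendor and keep replies in one place"),
    (20, 30, "Stop making recovery promises; focus on workarounds and customer impact"),
]


def current_step(minutes_elapsed: float):
    """Return the action for this point in the timeline, or None outside the first 30 minutes."""
    for start, end, action in FIRST_30_MINUTES:
        # Assumption: each window includes its start minute and excludes its end.
        if start <= minutes_elapsed < end:
            return action
    return None


print(current_step(12))  # prints the "Send the first internal update..." step
```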

That first internal update should be short. Say what is down, who is affected, what people should stop doing, and when the next update will arrive. If you do not know the cause yet, say that clearly. People handle uncertainty better than fake confidence.

Recovery time is where teams often get themselves into trouble. Do not repeat a random estimate from chat. Do not tell sales to promise a fix by noon because it feels likely. Until the vendor gives a solid answer, say the team is investigating and the next update will come at a set time.

Pick one person to deal with the vendor. That person asks questions, tracks answers, and pushes for updates if the issue grows. Without that, you get five separate messages, mixed replies, and missed details.

If the outage hits a tool that supports customer work, ask each team lead for one temporary workaround. Sometimes that means a spreadsheet, a delayed non-urgent task, or a backup channel for customer replies. Small moves in the first 30 minutes can save hours later.

Write updates people can send fast

A fixed format cuts delay. When a tool goes down, teams waste time deciding what to say, and then every group sends a different version. Your outage communication plan should give people one short update they can copy, trim, and send in under a minute.

For internal messages, keep it to five lines in the same order every time.

Service impacted:
What still works:
What fails:
What changed since last update:
Next review time:

That format does three jobs at once. It tells sales what they can still promise, tells support what to stop suggesting, and tells operations whether the issue is spreading or holding steady. The 'What changed' line matters because 'no change since 10:30' is still useful. It saves people from rereading old chat threads just to guess the current state.
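
For teams that post updates through a script or bot, the five-line format can be enforced in code so nobody drops a line under pressure. The sketch below assumes a hypothetical `InternalUpdate` class; the field names and the sample values are illustrative, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class InternalUpdate:
    service_impacted: str
    still_works: str
    fails: str
    changed: str
    next_review: str

    def render(self) -> str:
        """Return the five lines in the same order every time."""
        return "\n".join([
            f"Service impacted: {self.service_impacted}",
            f"What still works: {self.still_works}",
            f"What fails: {self.fails}",
            f"What changed since last update: {self.changed}",
            f"Next review time: {self.next_review}",
        ])


# Sample values are made up for illustration.
update = InternalUpdate(
    service_impacted="CRM",
    still_works="Email and phone",
    fails="Deal updates and account history lookups",
    changed="No change since 10:30",
    next_review="11:00",
)
print(update.render())
```

Because every field is required, an update cannot be created without a next review time.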

Support also needs a plain customer reply that sounds calm and clear. Skip internal jargon, ticket numbers, and vendor drama. A good reply says you know about the issue, names the visible effect, gives a workaround if one exists, and says when the customer should expect the next update.

We are seeing an issue with [tool name]. Some customers cannot [action]. [Working alternative, if any]. We are tracking it and will send the next update by [time].

Every message should include the next review time, even if you expect no news. That small line cuts repeat questions because people know when to check again. Pick a real time, not 'soon' or 'as available'. If you miss it, send a short note and set the next one.
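
The customer template and the "real time" rule can be combined in a small helper that refuses vague timing. This is a sketch under stated assumptions: `customer_reply` and the `VAGUE_TIMES` list are hypothetical, and the list of rejected phrases is only a starting point.

```python
# Phrases that do not count as a real time (illustrative, not exhaustive).
VAGUE_TIMES = {"soon", "shortly", "as available", "asap", "tbd"}


def customer_reply(tool: str, action: str, next_update: str, workaround: str = "") -> str:
    """Fill in the customer template; reject a missing or vague next-update time."""
    cleaned_time = next_update.strip().rstrip(".")
    if not cleaned_time or cleaned_time.lower() in VAGUE_TIMES:
        raise ValueError("Give a real clock time for the next update")

    parts = [
        f"We are seeing an issue with {tool}.",
        f"Some customers cannot {action}.",
    ]
    if workaround.strip():
        # Normalize the workaround so it always ends with one period.
        parts.append(workaround.strip().rstrip(".") + ".")
    parts.append(f"We are tracking it and will send the next update by {cleaned_time}.")
    return " ".join(parts)


print(customer_reply("our billing portal", "download invoices", "15:30",
                     workaround="Invoices are still available by email request"))
```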

Keep guesses out of the updates. If you do not know the cause, say you are investigating with the vendor. Do not write that their database is probably down or that their team broke something. Blame spreads fast inside a company and reaches customers even faster. Facts travel better: what users can do, what they cannot do, and when you will speak again.

Set workarounds for each team

When a tool goes down, people still need a way to keep work moving. The best workaround is usually boring. Write down the minimum safe process for each team and keep it easy enough to use under pressure.

Sales, support, and operations do not need the same backup plan. If you give everyone one generic outage note, they will improvise, and that creates more cleanup later.

Team-by-team fallback

For sales, use a shared sheet when the CRM is unavailable. Reps can log account name, contact, deal stage, next step, and any promise made to the customer. That is enough to protect active deals without asking the team to rebuild the whole CRM by hand.

For support, pick one backup intake path before you need it. A shared mailbox often works well. A simple backup form can work too if the main help desk is down. Ask agents to capture only the basics: customer name, contact details, issue summary, severity, and any deadline.

For operations, switch to manual approval steps for the tasks that cannot wait. Keep the rule simple. If the normal tool handles order approval, access requests, or payment checks, name who reviews them manually and where that decision gets recorded.

A short checklist helps:

  • keep one temporary place for each type of record
  • assign one person to watch for duplicates
  • record who approved what and when
  • flag urgent items that need follow-up after recovery
  • stop low-value tasks that can wait

Every workaround should include a cleanup marker. Add a simple tag, note, or column such as 'entered during outage' so the team can find those records later. Without that marker, cleanup turns into a scavenger hunt.
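
If the temporary record is a CSV file or an exported sheet, the marker can be added automatically. The sketch below uses the sales fallback columns described earlier plus an `entered_during_outage` column; the column names, `write_outage_rows`, and `rows_to_reconcile` are assumptions for illustration.

```python
import csv
import io

FIELDS = ["account", "contact", "deal_stage", "next_step", "promise_made",
          "entered_during_outage"]


def write_outage_rows(rows, out) -> None:
    """Write fallback records as CSV and tag every row with the cleanup marker."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "entered_during_outage": "yes"})


def rows_to_reconcile(csv_text: str) -> list:
    """Return only the rows that carry the cleanup marker."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["entered_during_outage"] == "yes"]


# Made-up sample record.
buffer = io.StringIO()
write_outage_rows([{
    "account": "Acme",
    "contact": "jo@example.com",
    "deal_stage": "Proposal",
    "next_step": "Send quote",
    "promise_made": "Quote by Friday",
}], buffer)
print(len(rows_to_reconcile(buffer.getvalue())))  # prints 1
```

After recovery, the marked rows are the ones to import or re-enter in the main tool.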

It also helps to say what people should skip. During an outage, staff often waste time trying to preserve every small step from the normal process. If a task does not affect revenue, customer promises, compliance, or security, pause it and note it for later.

A good runbook keeps these workarounds short and specific. If a new team member can follow the backup process in two minutes, it will probably hold up when the pressure starts.

Know when to escalate

Teams often wait too long because they assume the vendor will recover in a few minutes. That delay gets expensive fast. Your vendor escalation process should draw a clear line between monitoring and escalating based on business impact, not hope.

Escalate as soon as the outage blocks revenue work. If sales cannot send quotes, renewals, or contracts, someone senior needs to know right away. The same goes for support that cannot process refunds or operations that cannot move orders forward.

You should also escalate the moment data looks unsafe. If records might disappear, sync twice, or create duplicates, stop treating the issue as a normal delay. A slow tool is annoying. Bad data creates cleanup work for days.

Use a simple rule:

  • escalate when customer-facing or revenue work stops
  • escalate when data may be lost, delayed, or duplicated
  • escalate when the vendor misses its own promised update time
  • escalate when a manager must approve manual work or exceptions

That missed-update trigger matters more than most teams think. If a vendor says the next update will come in 30 minutes and nothing arrives, move to the next step in the escalation process. Silence often creates more confusion than the outage itself.
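
The four triggers can be written as one check that returns every reason that applies, so the incident owner can see why it is time to escalate. This is a sketch only: `OutageState` and `escalation_reasons` are hypothetical names, and it assumes the caller resets `vendor_update_due` each time a vendor update actually arrives.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class OutageState:
    customer_or_revenue_work_stopped: bool
    data_at_risk: bool  # data may be lost, delayed, or duplicated
    manager_approval_needed: bool  # manual work or exceptions are pending
    # When the vendor promised its next update. Assumption: the caller resets
    # this whenever a vendor update actually arrives.
    vendor_update_due: Optional[datetime] = None


def escalation_reasons(state: OutageState, now: datetime) -> list:
    """Return every trigger that applies; an empty list means keep monitoring."""
    reasons = []
    if state.customer_or_revenue_work_stopped:
        reasons.append("customer-facing or revenue work stopped")
    if state.data_at_risk:
        reasons.append("data may be lost, delayed, or duplicated")
    if state.vendor_update_due is not None and now > state.vendor_update_due:
        reasons.append("vendor missed its promised update time")
    if state.manager_approval_needed:
        reasons.append("manual work or exception needs manager approval")
    return reasons
```

Returning a list instead of a single yes or no keeps the reason attached to the decision, which makes the escalation easier to log.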

Manager approval is another clear line. If support wants to log requests in a shared sheet, or ops wants to re-enter orders by hand later, a manager should approve that choice. Manual work can help for an hour, but it can also create errors, refunds, and audit problems.

Write down every escalation while it happens. Log the time, the reason, who approved it, and what action the team took. Names matter here. 'Approved by Sarah, sales director, at 11:20' is far better than 'manager approved.'
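
A fixed record shape makes it harder to write 'manager approved' and move on. The sketch below assumes a hypothetical `EscalationEntry` class with one field for each item in the paragraph above; the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EscalationEntry:
    at: datetime
    reason: str
    approver_name: str
    approver_title: str
    action: str

    def __post_init__(self):
        # Names matter: refuse an entry that has no named approver.
        if not self.approver_name.strip():
            raise ValueError("Record the approver by name")

    def render(self) -> str:
        """Return one log line with the time, reason, approver, and action."""
        return (
            f"{self.at:%H:%M} | {self.reason} | "
            f"approved by {self.approver_name}, {self.approver_title} | "
            f"action: {self.action}"
        )


entry = EscalationEntry(
    at=datetime(2024, 12, 3, 11, 20),
    reason="CRM sync creating duplicate notes",
    approver_name="Sarah",
    approver_title="sales director",
    action="paused manual re-entry",
)
print(entry.render())
```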

Picture a CRM sync failure during a busy afternoon. Sales cannot update deals, support sees duplicate account notes, and the vendor misses its 2:00 update. That is no longer a watch-and-wait problem. The team should escalate, pause risky manual changes, and record who made each call.

A simple outage example

At 2:10 p.m. on a Tuesday, the CRM stops loading right as new demo requests start coming in. Sales cannot add leads, support cannot see recent customer notes, and ops loses the usual view of order handoffs. This is where a runbook stops panic from spreading.

The incident owner confirms the problem within a few minutes. Two sales reps report the same error, support sees failed lookups, and the vendor status page shows an active incident. From that point on, one person owns updates and checks the vendor page every 15 minutes so the rest of the team can keep working.

Sales does not wait for the CRM to return. The team switches to a shared sheet with a few required columns: time, lead name, company, contact details, source, and owner. That is enough to avoid losing demand during a busy afternoon. When the CRM comes back, one person can import or re-enter the list in order.

Support keeps its message simple. Agents tell customers that the CRM is down, but email support still works and the team is still replying. They do not guess at a fix time. They log new conversations in the help inbox and keep a short note with the customer name, issue, and promised follow-up.

Ops takes a different path. The team makes a manual list of orders that need action before the end of day. For each order, they note the customer, current step, blocker, and next action. If a shipment, approval, or refund needs a person to step in, ops assigns that work directly instead of waiting for normal automations.

A short internal update might read like this:

'CRM outage started at 2:10 p.m. and is confirmed on the vendor status page. Sales is logging leads in the shared sheet. Support is handling requests by email. Ops is tracking manual order follow-up in a manual order list. Next update at 2:30 p.m.'

This example works because each team gets one clear fallback. Nobody tries to do everything. The incident owner watches the vendor, the teams protect current work, and the business avoids a bigger mess than the outage itself.

Mistakes that slow the team down

A runbook fails fastest when the company sounds like five different teams. One person tells customers the issue is minor, another says the vendor is fully down, and a third says nothing at all. That confusion spreads faster than the outage.

Mixed messages usually start with good intent. Sales wants to calm a deal, support wants to answer tickets, and operations wants people to stop asking questions. If they all write their own updates, customers lose trust and internal teams start arguing about which version is correct.

Recovery promises cause a different kind of damage. Somebody says it should be back in 30 minutes because they want to be helpful, but the vendor never confirmed that. Then the clock runs out, customers get angrier, and your team has to explain why it guessed.

Retrying broken actions is another common mess. A rep clicks submit three more times, an agent resends an order, or ops re-imports the same file. When the tool comes back, you may have duplicate records, duplicate charges, or tickets tied to the wrong account. One outage turns into cleanup that lasts all day.

Manual work creates trouble too when nobody writes it down. If support refunds a customer outside the usual system, or ops tracks changes in a temporary sheet, someone needs to log exactly what changed. Without that record, the team cannot reconcile data later, and finance or account managers end up guessing.

Then comes the quiet ending. The system returns, everyone feels relief, and the team moves on. Customers still wait for confirmation, front-line staff still use workarounds, and managers still think the incident is open because nobody sent a recovery update.

A few habits prevent most of this:

  • give one person ownership of outgoing updates
  • ban time estimates unless the vendor confirms them
  • pause retries until the team approves a safe workaround
  • track every manual action in one shared place
  • send a clear 'service restored' message when work returns to normal

Small mistakes pile up fast during an outage. Clean communication and disciplined logging save more time than rushed heroics.

Quick checks before you publish the runbook

A runbook can look fine in a document and still fall apart in a real outage. Test it the way people will use it under stress: quickly, with little context, and without the author in the room.

Start with a simple standard. If someone new cannot follow it alone, it is not ready.

Give the draft to a new hire and ask them to handle one sample outage from start to finish. Watch where they pause, guess, or ask for help. Put one owner on every action. Avoid lines like 'team checks status.' Write 'support lead checks vendor status page' or 'ops manager calls backup contact.'

Read every template out loud. If a message sounds stiff or vague, rewrite it in plain words people would actually send to a customer or coworker. Check that business continuity steps cover each group. Sales may need a backup demo flow, support may need approved reply text and refund guidance, and ops may need a manual process to keep orders moving.

Review contacts and backup owners every quarter. Vendor reps change, internal roles shift, and old phone numbers fail at the worst time.

A small example makes this easier to test. If your CRM vendor goes down, can sales still log leads somewhere, can support explain the issue in one short message, and can ops keep the handoff moving without waiting for a manager? If one team has no fallback, the runbook still has a hole.

This review should feel a little picky. That is the point. Clear names, plain words, and current contacts save time when people are tired and customers want answers fast.

What to do next

Pick one vendor that would hurt most if it failed for half a day. Payment, CRM, support chat, shipping, or e-signature tools are common starting points. Then draft the first version of your vendor outage runbook this week.

Do not wait for perfect wording. A rough document people can use today is better than a polished file nobody opens during an outage.

Start small:

  • name one incident owner
  • write one internal update and one customer update
  • add one fallback action for sales, support, and ops
  • set one point where the team contacts the vendor and one point where a manager joins

After that, test it in a 20-minute tabletop exercise. Put a few people on a call and give them a simple prompt: 'The vendor is down right now. Show me what you do in the first 10 minutes.'

Pay attention to hesitation. Maybe support does not know what it can promise. Maybe sales sends mixed messages. Maybe ops knows the workaround, but nobody else does. Fix every step that makes people stop, guess, or ask who owns the decision.

Keep the edits small and practical. If a step is too long to follow under pressure, shorten it. If a workaround depends on one person being online, replace it with something the team can actually use.

If outages keep exposing the same gaps, the problem is larger than one runbook. The escalation path may be fuzzy, the backup workflow may be weak, or too much knowledge may live in one person's head. In cases like that, an outside Fractional CTO can help tighten the process. Oleg Sotnikov at oleg.is works with startups and smaller businesses on technical operations, backup workflows, and AI-first systems, which can make outage response much easier to manage.

A good runbook is not big. It is clear enough that your team can use it fast when the next outage hits.