Break glass account drills before the next SSO outage
Break glass account drills help teams prove emergency access before an outage. Learn who joins, what to test, and how to fix gaps fast.

Why backup access fails when SSO stops
When SSO breaks, the damage spreads fast. People do not lose just one login. They lose email, chat, VPN, cloud consoles, admin panels, and sometimes the ticket system too.
That creates a bad kind of silence. The people who should fix the outage cannot reach each other, cannot read alerts, and cannot open the tools that hold the recovery steps.
A lot of teams think they have emergency access because a document says they do. Then the document lives in a wiki behind the same identity provider that just failed. Or the recovery codes sit in a password manager that also needs SSO. The plan exists on paper. In practice, nobody can use it.
Communication often fails the same way. Teams keep the contact list in company email or chat, both of which depend on the broken login path. Now nobody knows who owns the backup accounts, who can approve their use, or who can confirm a risky change.
The failure patterns are usually boring. The backup password expired months ago. MFA codes go to an inbox nobody can open. The only person who knows the process is on vacation. The account still exists, but nobody knows what it can actually reach.
Access also fails because teams test the account in isolation instead of testing the whole path. Logging in to one cloud console proves very little if the same person also needs a VPN, a bastion host, a database admin page, and a way to tell the rest of the team what changed.
Small gaps become full lockouts. One missing phone, one stale phone number, or one disabled account can delay recovery for hours.
Picture a simple case. The identity provider fails at 9:10 a.m. The ops lead tries the backup admin account, but MFA goes to company email. The runbook sits in the company wiki. The network admin can help, but chat is down and nobody has her personal number. A short auth issue just turned into a business outage.
That is why break glass account drills matter. They tell you whether the account works, whether the people can use it, and whether the plan still holds up when normal access disappears.
Start with the systems that matter most
Begin with the systems that can stop most of the company within minutes. If people cannot sign in, message each other, reach servers, or reset access, the rest of the plan does not matter much.
A good first pass is simple. Ask two questions: which tools freeze work for almost everyone, and which tools help you recover the rest? Those are not always the same tools. Email might block daily work, but your identity admin console or secret store may decide whether you can fix anything at all.
For most teams, the first drill should cover communication, remote access, cloud administration, shared secrets, and identity tools. If downtime hits customers quickly, move customer-facing systems higher on the list. That often includes billing, support, deployment tools, a status page, and anything the team needs to stop a bad release or answer urgent tickets.
This is where teams usually find an ugly dependency. They create backup admin accounts for a cloud provider, but the recovery codes live inside the same vault that uses SSO. Or they protect emergency access with MFA tied to a device managed through the identity system that just failed. The account exists, but every route to it runs through the thing that is down.
If your team runs modern ops tooling, include the places where people actually restore service. That may be a cloud console, GitLab, CI runners, DNS, or observability tools such as Grafana and Sentry. You do not need every system in the first drill. You do need the few that let you see the problem, reach production, and make changes safely.
A simple label helps. Mark each system as one that blocks work, restores access, or protects customers. Some systems fit two labels. Those should go first because they matter twice when SSO fails.
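The labeling idea above can be sketched in a few lines of Python. This is only an illustration: the label names, the System class, and the sample inventory are made up for the example, not taken from any real tool. Systems with more labels sort first, and ties keep their original order.

```python
from dataclasses import dataclass, field

# Hypothetical label names for the three categories described above.
LABELS = {"blocks_work", "restores_access", "protects_customers"}

@dataclass
class System:
    name: str
    labels: set = field(default_factory=set)

def drill_order(systems):
    """Return systems sorted so those carrying more labels come first."""
    for s in systems:
        unknown = s.labels - LABELS
        if unknown:
            raise ValueError(f"{s.name}: unknown labels {sorted(unknown)}")
    # sorted() is stable, so systems with equal label counts keep input order.
    return sorted(systems, key=lambda s: -len(s.labels))

# Illustrative inventory only; replace with your own systems.
inventory = [
    System("email", {"blocks_work"}),
    System("identity admin console", {"blocks_work", "restores_access"}),
    System("status page", {"protects_customers"}),
    System("secret store", {"restores_access"}),
]

ordered = [s.name for s in drill_order(inventory)]
print(ordered)
```

With this sample data, the identity admin console lands at the top because it carries two labels. A spreadsheet column does the same job; the point is to make the ordering rule explicit.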
Pick the right people
Emergency access tests fail when the wrong people show up. Use the same people who would respond during a real outage, not stand-ins who know the script but would never touch production.
Give each system one named owner. That person does not need to do every task alone, but they do need to own the result. For every system in scope, someone should answer a plain question: who signs in, checks permissions, and confirms the system still works?
A small group usually works best. One person should cover identity or directory systems. Add one owner for each app, cloud account, or admin console in the drill. Include a note taker who records time, blockers, and missing instructions. Bring in a manager only if approvals, audit rules, or legal limits affect access.
Keep the room small. Four to six people is often enough. Bigger groups slow the drill down and hide confusion because everyone assumes someone else is handling it.
The note taker matters more than people think. They should record how long each step takes, where people get stuck, which secrets or devices are missing, and whether anyone relied on tribal knowledge instead of the written process. That record tells you much more than a simple pass or fail.
If access depends on approval, test that too. Some teams cannot use backup admin accounts, enter production, or view certain data without sign-off. If that approval path is part of real life, it belongs in the drill. Otherwise, you are testing an easier version of the process.
One mistake keeps coming back. A senior admin runs the whole drill alone, succeeds, and everyone relaxes. Then that person is asleep, on a flight, or no longer with the company when the outage happens. Pick the people who would actually be available, and make them do the work themselves.
A good rule is simple: everyone in the room should perform an action, approve an action, or document an action. If they do none of those, leave them out.
Run the drill like an outage
Treat the exercise like the real thing. If people can quietly fall back to normal SSO, the test tells you almost nothing. Start at a fixed time, announce that normal sign-in is unavailable, and have everyone use only the emergency path until the drill ends.
Timing matters. Start a shared timer the moment the notice goes out. Have one person write down what happens as it happens, not from memory later.
Use a real device for each login. That means the same browser, laptop, phone, or VPN setup the person would use during an actual outage. A spare lab machine often hides the exact problems that break recovery on normal work devices.
Then retrieve the backup username, password, and instructions the same way you would in real life. If that means opening a sealed file, using a hardware token, getting into a vault, or asking another admin for part of the process, test that path too. Any delay counts.
Check the second factor before moving on. Make people complete MFA, try recovery codes if needed, and confirm they can reach the phone, app, or token that holds the code. If the code goes to a system tied to the broken identity provider, you just found a weak spot.
After login, do one safe action in the real admin interface. Read audit logs, open the user list, or change a harmless test setting and change it back. You need proof that the account can do the job, not just sign in.
Write down every delay, error, and surprise. Expired passwords, missing permissions, locked accounts, stale instructions, and long waits all matter.
Do not rescue people too early. If someone needs a teammate to explain where the code is stored, that is part of the result. If an account logs in but cannot reach the admin screen, the drill failed for that system.
A useful drill usually leaves you with a messy list. That is good. You found the gaps on a calm day instead of during a real outage.
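If the note taker prefers a script to a notepad, the shared timer and running notes could look something like the sketch below. The DrillLog class, its field names, and the outcome values are assumptions made for this example; a shared document with timestamps works just as well.

```python
import time

class DrillLog:
    """Minimal running log: every note is stamped with seconds since the drill started."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()  # the timer starts when the log is created
        self.entries = []

    def note(self, system, step, outcome, detail=""):
        """Record one step as it happens. `outcome` is "ok" or anything else for a problem."""
        self.entries.append({
            "elapsed_s": round(self._clock() - self._start, 1),
            "system": system,
            "step": step,
            "outcome": outcome,
            "detail": detail,
        })

    def blockers(self):
        """Return every entry that did not go cleanly."""
        return [e for e in self.entries if e["outcome"] != "ok"]

# Illustrative entries only.
log = DrillLog()
log.note("cloud console", "retrieve credentials", "ok")
log.note("cloud console", "complete MFA", "blocked", "code goes to company email")

for entry in log.blockers():
    print(entry)
```

The elapsed time on each entry is what turns "MFA was slow" into a number you can compare across drills.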
A simple outage example
At 7:40 on a Monday, people start getting login errors before the workday even begins. The company portal that handles sign-in is down, so staff cannot open email, chat, or internal apps with their usual account.
By 8:00, the support team has a bigger problem. Customers can still send requests, but agents cannot open the ticket system because it uses the same identity provider. Phones ring, inboxes fill up, and nobody on the team can see queue status or assign work.
One admin avoids the usual login path and signs in to the cloud console with the emergency account kept for this situation. This is where a break glass drill either pays off or falls apart. If the password is old, the MFA device is missing, or nobody remembers where the runbook lives, the outage keeps spreading.
In a good drill, the admin gets in and the team follows a simple restore order: email first so leaders and support staff can coordinate, then the ticket tool so customer work starts moving again, and then deployment access so engineers can fix app issues if the outage causes follow-on problems.
That order matters. If engineers rush to restore build or production access first, support and operations still work blind. Restoring communication early often saves more time than diving straight into deeper systems.
The test also uncovers a very ordinary problem. One of the backup recovery phones is sitting in a locked office, and the office manager is on vacation. On paper, the company has a second factor. In real life, nobody can reach it before business hours.
That is why a real drill beats a checklist. You learn who can act at 7:40, which systems fail together, and what still works when SSO does not.
Common failure points
Most emergency access plans fail in plain, predictable ways. The team has a spreadsheet, a named owner, and a polished runbook, but nobody tries the login on a real system until the day SSO is down.
That gap is bigger than it looks. A backup admin account can exist on paper and still fail because the password changed, the account lost its role, or the login page now forces a new step that nobody wrote down. Break glass account drills only work when real people sign in to real tools and complete the full path.
A common mistake is storing the backup secret inside the same vault that depends on the identity system you are trying to bypass. If SSO locks everyone out, the password may still be safe and completely useless. Keep at least one recovery path outside that chain, with clear controls and a very small group of trusted people.
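One way to spot this kind of hidden chain before a drill is to write down what each emergency account needs and walk the dependencies. The sketch below assumes a hand-maintained map; every name in it is hypothetical, and the map is only as accurate as the person who filled it in.

```python
def reaches(item, target, deps, seen=None):
    """Return True if `item` depends on `target`, directly or through other items."""
    seen = set() if seen is None else seen
    if item == target:
        return True
    if item in seen:
        return False  # already explored; this also guards against cycles
    seen.add(item)
    return any(reaches(dep, target, deps, seen) for dep in deps.get(item, ()))

# Hypothetical dependency map: each entry lists what you must have in hand to use it.
deps = {
    "cloud backup admin": ["recovery codes", "mfa phone"],
    "recovery codes": ["password vault"],
    "password vault": ["sso"],
    "mfa phone": ["device management"],
    "device management": ["sso"],
    "dns backup admin": ["sealed envelope", "hardware key"],
}

emergency_accounts = ["cloud backup admin", "dns backup admin"]
blocked = [a for a in emergency_accounts if reaches(a, "sso", deps)]
print(blocked)
```

In this made-up map, the cloud backup admin is flagged because its recovery codes sit in a vault that needs SSO, while the DNS backup admin is not. The drill still has to confirm the result on real systems; the map only tells you where to look first.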
Teams also create single points of failure without noticing. One person has the only MFA token, the only recovery email, or the only phone that can receive the code. If that person is asleep, offline, or gone, the backup plan stops there.
Expiry causes quiet damage. Phone numbers change. SIMs get replaced. Licenses lapse. Admin roles disappear after an audit. Nobody sees the problem until someone actually needs the account.
The runbook often misses tiny steps that block the whole drill. The admin portal only allows sign-in from the office VPN. The backup account needs manager approval before privilege elevation. A firewall rule blocks the jump host used for recovery. The MFA reset process still sends codes to an old number. The account can sign in, but it cannot reach the billing or identity settings pages.
These are not edge cases. They are the usual failures.
Ownership is another weak spot. When everyone assumes somebody else updates the emergency account, nobody does. Give one person responsibility for keeping the record current, and give a second person enough access to verify the work.
If you want an outage plan that holds up, treat backup access like a live control, not a document. Test it, watch where people hesitate, and fix every missing step while normal login still works.
Checks before you sign off
A drill is not done when someone says, "looks fine." It is done when the team proves they can get in from the place they will actually work, with the tools they will actually have during an outage.
Start with shared access, not solo access. Every emergency account should be reachable by at least two named people. If one person is on a flight, asleep, or locked out too, the account still needs to be usable.
Next, check whether the team can find the credentials without normal sign-in. A sealed password vault entry is no help if the vault depends on the same identity provider that just failed. Teams miss this all the time, then learn the recovery path has the same weakness as production access.
MFA is another trap. A backup account is not really a backup if the second factor lives on one lost phone. Test spare devices, hardware keys, or recovery codes. Put a real person through the full login flow and make sure they can finish it without borrowing somebody else's device.
Network access matters just as much. Admins should reach each login page from the network they will use in a real incident, whether that is a company laptop on VPN, a home connection, or a secured jump box. Many teams test the password and forget that the login page itself is blocked by DNS rules, VPN policy, or an office-only firewall.
Before closing the drill, confirm five things. Two people can use each emergency account. The team can retrieve credentials without the usual SSO path. MFA works with spare hardware or stored recovery codes. Admins can open the login page from the expected network. The runbook shows which systems come back first.
That last point saves time when stress is high. If email, cloud control, VPN, and endpoint tools all depend on identity, the team needs a clear restore order. Write it down in plain language. "Get cloud admin access, then DNS, then VPN, then email" is much better than a vague note that says to restore services as needed.
If one of these checks fails, the drill found something useful. Fix it now, then test that part again.
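The five sign-off checks can also be kept as a small record per emergency account, so a gap shows up as text rather than as a feeling. This is a minimal sketch; the AccountCheck fields and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class AccountCheck:
    account: str
    people_with_access: int
    credentials_retrievable_without_sso: bool
    mfa_spare_tested: bool
    login_page_reachable: bool
    restore_order_in_runbook: bool

def signoff_gaps(check):
    """Return a list of plain-language gaps; an empty list means the account passes."""
    gaps = []
    if check.people_with_access < 2:
        gaps.append("fewer than two named people can use the account")
    if not check.credentials_retrievable_without_sso:
        gaps.append("credentials cannot be retrieved without SSO")
    if not check.mfa_spare_tested:
        gaps.append("no spare MFA method or recovery code was tested")
    if not check.login_page_reachable:
        gaps.append("login page not reachable from the expected network")
    if not check.restore_order_in_runbook:
        gaps.append("runbook does not state the restore order")
    return gaps

# Illustrative result from one drill.
result = AccountCheck(
    account="cloud backup admin",
    people_with_access=1,
    credentials_retrievable_without_sso=True,
    mfa_spare_tested=False,
    login_page_reachable=True,
    restore_order_in_runbook=True,
)

for gap in signoff_gaps(result):
    print(f"{result.account}: {gap}")
```

Each printed line is a candidate for the follow-up list, and the account is retested once the gap is closed.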
Repeat the test
A break glass account drill is not a one-time project. People leave, phones get replaced, MFA apps reset, and permissions drift. Six months later, the account that worked in the last test can fail for a very ordinary reason.
Run a short check any time the team changes in a way that affects access. That includes an admin leaving, a new admin taking over, a phone number change, an MFA reset, or a password vault migration. These checks do not need to take an hour. A 10- to 15-minute login test catches most problems early.
Set a fixed date for the full drill and treat it like any other operational test. Quarterly works well for many teams. If your systems change often, monthly may be safer.
The rhythm should stay simple. Do a quick access check after staffing or MFA changes. Run a full drill on the calendar, not only after an incident. Record every point of confusion during the test. Fix stale accounts and bad instructions the same day.
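As a rough sketch, that rhythm reduces to two small rules: certain events trigger a quick check, and the full drill is due once a fixed interval has passed. The event names and the 90-day default below are assumptions chosen to approximate a quarterly schedule; adjust both to your own setup.

```python
from datetime import date, timedelta

# Hypothetical event names for the access changes listed above.
QUICK_CHECK_TRIGGERS = {
    "admin_left",
    "admin_joined",
    "phone_number_changed",
    "mfa_reset",
    "vault_migration",
}

def needs_quick_check(event):
    """True if this event should trigger a short login test."""
    return event in QUICK_CHECK_TRIGGERS

def full_drill_due(last_drill, today, interval_days=90):
    """True once `interval_days` have passed since the last full drill."""
    return today >= last_drill + timedelta(days=interval_days)

print(needs_quick_check("mfa_reset"))
print(full_drill_due(date(2024, 1, 10), date(2024, 4, 15)))
```

A calendar reminder covers the second rule. The first rule is the one teams tend to forget, so it helps to tie it to the offboarding and device-change checklists.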
Old accounts create false confidence. The name still appears in the admin panel, so everyone assumes access is covered. Then it turns out the person left months ago, or the mailbox behind the account no longer exists. Remove those accounts as soon as you find them. If a runbook step is vague, rewrite it while the test is still fresh.
Keep two versions of the plan
Leaders usually need one short page. It should say who declares the outage, who approves emergency access, which systems come first, and how the team confirms recovery. That is enough for decisions under pressure.
Admins need the full runbook. Include exact account names, MFA recovery steps, vault locations, support contacts, rollback notes, and the order of systems to test. If one instruction makes a trained admin stop and guess, the runbook is still incomplete.
A small example makes the risk obvious. A company changes its MFA provider and updates the production login flow, but nobody updates the emergency access notes. The next SSO outage turns a five-minute recovery into a 90-minute scramble. Regular drills stop that kind of avoidable failure.
What to do next
A drill only pays off if the team fixes the rough spots while the details are still fresh. Take the notes from the session, cut them down to a short action list, and keep it practical. Each item needs one owner, one due date, and one clear result.
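The "one owner, one due date, one clear result" rule is easy to check mechanically. The sketch below assumes a simple ActionItem record with made-up field names and sample data; it lists any item that is missing one of the three parts.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    gap: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done_when: Optional[str] = None

def incomplete(items):
    """Return (gap, missing fields) for every item lacking an owner, due date, or result."""
    report = []
    for item in items:
        missing = [name for name in ("owner", "due", "done_when") if not getattr(item, name)]
        if missing:
            report.append((item.gap, missing))
    return report

# Illustrative action list from a drill.
actions = [
    ActionItem("backup MFA phone locked in office", "Dana", date(2024, 5, 3),
               "second hardware key enrolled and tested"),
    ActionItem("runbook missing VPN step", owner="Sam"),
]

for gap, missing in incomplete(actions):
    print(f"{gap}: missing {', '.join(missing)}")
```

Anything the function reports is not an action item yet. It is still just a note from the drill.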
Reset or rotate any emergency credentials you exposed during the test. Fix the gaps that slowed people down, such as missing MFA devices, stale phone numbers, or unclear approval steps. Update the runbook so the next on-call person can follow it without asking around. Keep a short note on what worked too, not only what failed.
Book the next drill before everyone drifts back into normal work. If you leave it open-ended, it usually slips for months. A 30-minute repeat test in six to eight weeks is better than a polished document that nobody touches.
Keep the next round narrow. Add one more system, then test it well. If you already checked cloud admin access, bring in your VPN, endpoint tool, password manager, or billing console next time. Teams learn faster when they expand the scope in small steps instead of trying to test every account in one long session.
That steady rhythm is what makes break glass account drills real. People remember the steps, the runbook stays current, and the backup admin accounts stay tied to working phones, tokens, and devices.
Some teams can handle this on their own. Others need outside help, especially when ownership is fuzzy or the environment has grown messy over time. In those cases, Oleg Sotnikov at oleg.is can help as a fractional CTO and startup advisor with emergency access planning, infrastructure review, and practical drills based on the systems your team actually uses.
Do the cleanup, schedule the next test, and widen the scope one system at a time. That is how a backup plan becomes something your team can trust during an SSO outage.
Frequently Asked Questions
What is a break glass account drill?
Think of it as a practice outage. Your team avoids normal SSO, retrieves the backup credentials, completes MFA, and signs in to the real admin tools. A good drill proves people can actually restore access, not just point to a document.
Which systems should we test first?
Start with the systems that stop work fast or help you recover the rest. For most teams, that means identity admin, email, chat, VPN, cloud consoles, shared secrets, and any tool you need to reach production safely. If customers feel the outage right away, include support, billing, deployment, or your status page early.
Who should take part in the drill?
Bring the people who would respond in a real outage. That usually means one owner for identity, one owner for each system in scope, and one person to take notes. Skip spectators, because they slow the drill down and hide confusion.
How often should we run these drills?
Run a quick check after any access change, like a phone swap, MFA reset, admin departure, or vault migration. Put a full drill on the calendar every quarter if your setup changes at a normal pace. Test more often if your team or tooling changes every month.
What counts as a failed drill?
The drill fails when people cannot finish the full path under outage rules. That includes expired passwords, missing MFA devices, stale phone numbers, blocked networks, unclear approvals, or an account that signs in but cannot reach the admin page. A partial login is not enough.
Where should we store emergency credentials?
Keep at least one recovery path outside the same SSO chain you plan to bypass. A sealed record, an offline vault method, or a tightly controlled backup store works better than a password manager that also needs SSO. Limit access to a small group and check that they can actually retrieve it.
How should we test MFA for backup accounts?
Use a spare method that real people can reach during an outage. That can mean backup hardware keys, recovery codes, or a second device under clear control. Then make someone complete the full login flow, because MFA often breaks the drill even when the password still works.
Should we test from home or only from the office?
Test from the place and device people would really use. If your on-call admin would work from home on a company laptop over VPN, use that setup. Lab machines and office-only tests miss DNS rules, firewall blocks, browser issues, and missing apps.
What should the runbook include?
Write the runbook in plain language and keep two versions. Leaders need a short page with approvals, restore order, and who declares the outage. Admins need the exact account names, MFA recovery steps, vault location, network requirements, and one safe action to confirm access works.
When should we bring in outside help?
Get outside help when ownership is fuzzy, the environment has grown messy, or nobody trusts the current plan. An experienced fractional CTO can map dependencies, tighten the recovery path, and run practical drills with your real systems. That helps a lot when your team keeps finding the same gaps and never closes them.