Colocated server runbook for your first bare-metal box
A simple colocated server runbook for first-time operators, covering remote hands, spare parts, monitoring, access rules, and outage prep.

Why a single server still needs a runbook
A single colocated server looks simple on paper. One box, one bill, lower monthly spend. Then a drive fails at 2:10 a.m., the VPN account belongs to a former contractor, the only admin is on a flight, and the datacenter asks one basic question: "What do you want us to do?" Cheap hardware stops being cheap the moment nobody can answer.
Most outages drag on for a dull reason: nobody owns the next step. One person has root access but cannot approve remote hands. Another person can approve work but does not know the server layout or spare part numbers. Alerts go to a shared inbox that nobody watches after hours. You can lose 45 minutes before anyone starts the fix.
The usual gaps show up fast. Teams forget to document access rules for SSH, VPN, console, and the facility ticket portal. They skip spare parts because the server is "just one box." They set noisy alerts and mute them, or they set none at all. They leave approvals vague, so simple actions like a reboot or disk swap turn into a string of calls and messages.
A runbook does not need to be a big manual. For one server, a short working document is enough if it answers the questions people ask under stress. Who gets paged first? Who can approve downtime? Which parts are on site? What exact steps should remote hands follow, and what should they never do without permission?
Keep it plain and current. If a new engineer cannot read it in ten minutes and act on it, it is too long. A good runbook will not prevent every failure, but it will stop a small hardware issue from turning into a half-day outage with three people guessing.
Decide who gets access before delivery
Access rules fail when everyone assumes someone else can fix the box at 2 a.m. Before the server reaches the rack, name the people who can act and write down what each person can do.
Start with three separate access levels. They are different, and mixing them causes trouble fast. A developer may need OS login to check logs or restart a service. A senior admin may need console access to recover the machine when the network stack breaks. Physical access is different again. That includes pressing the power button, moving cables, swapping a disk, or reading a serial number from the chassis.
List people by name, not role alone. Roles change. Names, phone numbers, time zones, and backup contacts save time when the main person is asleep or offline.
A simple split usually works, with three access levels and two decision roles:
- OS access for login, sudo, and service changes
- Console access for reboots, rescue media, BIOS, or IPMI
- Physical access for rack visits and remote-hands instructions
- Change approval for risky work
- Incident command for the person who makes the final call during an outage
That last point matters more than most teams expect. If two people can approve changes and disagree, the server sits broken while chat fills up with opinions. Pick one owner for each incident. Everyone else can give input, but one person decides whether to reboot, fail back, swap hardware, or wait.
Keep contact details practical. Include mobile numbers, messaging handles if your team uses them, and local time zones. Add a second contact for every person with elevated access. If you use an outside admin or a Fractional CTO, write down their access level and backup too.
A short access table beats a long policy. When the box drops off the network, you want one page that answers who can log in, who can touch it, and who can say "do it."
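As a rough illustration, the access table can live as a small data structure that you query and sanity-check. This is only a sketch: the capability names, the `Contact` fields, and the people and phone numbers in `TEAM` are placeholders invented for the example, not a required format.

```python
from dataclasses import dataclass, field

# Hypothetical capability names that mirror the split described above.
CAPABILITIES = {"os", "console", "physical", "approve_change", "incident_command"}

@dataclass
class Contact:
    name: str
    phone: str
    timezone: str
    backup: str  # name of the person who covers when this one is offline
    capabilities: set = field(default_factory=set)

def who_can(contacts, capability):
    """Return the names of everyone who holds the given capability."""
    if capability not in CAPABILITIES:
        raise ValueError(f"unknown capability: {capability}")
    return [c.name for c in contacts if capability in c.capabilities]

def access_gaps(contacts):
    """List capabilities nobody holds and backups missing from the table."""
    names = {c.name for c in contacts}
    problems = []
    for cap in sorted(CAPABILITIES):
        if not who_can(contacts, cap):
            problems.append(f"nobody holds '{cap}'")
    for c in contacts:
        if c.backup not in names:
            problems.append(f"backup for {c.name} is not in the table: {c.backup}")
    return problems

# Placeholder people and numbers, for illustration only.
TEAM = [
    Contact("Alex", "+1-555-0100", "UTC-5", "Sam",
            {"os", "console", "approve_change", "incident_command"}),
    Contact("Sam", "+1-555-0101", "UTC+1", "Alex", {"os", "physical"}),
]

print(who_can(TEAM, "console"))
print(access_gaps(TEAM))
```

Running `access_gaps` after every staffing change is one cheap way to notice that a capability has been left with no owner.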
Build the first version step by step
Start with facts you should never have to guess under pressure. Put the server name at the top, then add the rack and unit position, asset tag, vendor model, and every serial number you may need for support or replacement. If a drive, RAID card, or power supply has its own serial, record that too.
Next, write down the setup that makes the box boot and talk to the network. Include the management IP, production IPs, VLANs, switch ports, MAC addresses, hostname, BIOS version, RAID layout, boot mode, and boot order. Keep it plain. A tired person should understand it in one read.
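One way to keep these facts honest is to store them as structured data and check for blanks before you need them. The sketch below assumes a plain dictionary. The field names and every value in `EXAMPLE` are made up for illustration (the IP addresses come from the reserved documentation ranges), so treat it as a starting shape rather than a standard.

```python
# Fields taken from the inventory and network notes described above.
REQUIRED_FIELDS = [
    "server_name", "rack", "unit", "asset_tag", "model", "serial",
    "management_ip", "production_ips", "vlans", "switch_ports",
    "mac_addresses", "hostname", "bios_version", "raid_layout",
    "boot_mode", "boot_order",
]

def missing_fields(facts):
    """Return required fields that are absent or left empty."""
    return [f for f in REQUIRED_FIELDS if facts.get(f) in (None, "", [], {})]

# Placeholder values only; replace every one with your own records.
EXAMPLE = {
    "server_name": "app-01",
    "rack": "R12",
    "unit": 18,
    "asset_tag": "AT-0001",
    "model": "example 1U chassis",
    "serial": "SN-PLACEHOLDER",
    "management_ip": "192.0.2.10",
    "production_ips": ["198.51.100.10"],
    "vlans": [100, 200],
    "switch_ports": ["sw1/12", "sw1/13"],
    "mac_addresses": ["00:00:5e:00:53:01", "00:00:5e:00:53:02"],
    "hostname": "app-01.example.com",
    "bios_version": "1.0.0",
    "raid_layout": "RAID 1, bays 0 and 1",
    "boot_mode": "UEFI",
    "boot_order": ["RAID volume", "network"],
}

print(missing_fields(EXAMPLE))
```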
The runbook also needs actions, not just inventory. Write the reboot steps exactly as you would perform them: graceful shutdown, console check, power cycle, and what to verify after the system comes back. Do the same for rescue mode and full reinstall. List the installer image, where credentials live, how disks should be partitioned, and which service checks must pass before you call the server healthy.
You do not need many pages. One page for hardware identity and location, one for network and firmware notes, one for reboot, rescue, and reinstall steps, and one for backup, restore, and ownership is enough for most first deployments.
Backup notes need names, not vague ownership. Record what gets backed up, where it goes, how often it runs, how long restores usually take, and who approves a restore. If a reinstall requires someone to confirm the latest good snapshot, name that person.
Then test the draft during a planned restart. Follow the runbook line by line and note where you hesitate, skip a detail, or need a second system to fill in a blank. Those weak spots matter. A missing switch port number can waste 30 minutes. An old boot-order note can turn a routine reboot into a datacenter call.
After the test, cut anything nobody uses and fix anything that caused delay. The first version does not need to look pretty. It needs to work when you are rushed.
Set clear rules for remote hands
Remote hands can save a late-night trip to the datacenter, but only if they know what they can touch. A small mistake on the wrong cable or drive can turn a simple fix into a real outage.
Split remote-hands work into two groups: tasks they can do right away and tasks that need approval first. Pre-approved work should be low risk and easy to verify, like power cycling a named outlet, reseating a labeled drive, or moving one clearly tagged cable to a specific port. Anything that can affect data, boot order, firmware, or uplinks should stop until someone responsible signs off.
Short instructions matter more than perfect instructions. For common tasks, write a few plain steps with the rack position, server name, labels to match, and one clear stop rule. "If the labels do not match, stop and call" prevents a lot of damage.
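The two groups and the stop rule can be written down as a tiny decision function. This is a sketch under stated assumptions: the task names in `PRE_APPROVED` and the area names in `NEEDS_SIGN_OFF_AREAS` are invented labels for the examples in the text, and your own lists will differ.

```python
# Hypothetical task names for low-risk, easy-to-verify work.
PRE_APPROVED = {"power_cycle_outlet", "reseat_drive", "move_cable"}

# Anything touching these areas stops until a responsible person signs off.
NEEDS_SIGN_OFF_AREAS = {"data", "boot_order", "firmware", "uplinks"}

def remote_hands_decision(task, affects=(), labels_match=True):
    """Return 'proceed', 'needs_approval', or 'stop_and_call' for a request."""
    if not labels_match:
        # The stop rule wins over everything else.
        return "stop_and_call"
    if set(affects) & NEEDS_SIGN_OFF_AREAS:
        return "needs_approval"
    if task in PRE_APPROVED:
        return "proceed"
    # Unknown tasks are treated as risky by default.
    return "needs_approval"

print(remote_hands_decision("reseat_drive"))
print(remote_hands_decision("move_cable", affects=["uplinks"]))
print(remote_hands_decision("reseat_drive", labels_match=False))
```

Defaulting unknown tasks to "needs approval" is a deliberate choice here: a missing entry in the list should slow things down, not let risky work through.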
Photos help more than most teams expect. Ask for one photo before the change and one after it. A wide shot shows the rack and unit number. A close shot shows the exact drive bay, cable, or port. That gives you a quick audit trail and makes follow-up work much easier.
Write down cost control too. Set a simple limit for urgent on-site work, such as a maximum spend and a maximum amount of time they can use without asking again. Past that limit, they pause and call the approved contact.
Use the same request format every time:
- Device name and rack location
- Exact action to take
- How to confirm success
- Spending limit
- Approval status
- Who to call if something does not match
A good ticket is boring and specific. "Reseat drive in bay 3 of server app-01, confirm green activity light, send photos" works. "Please check the server" does not.
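If you want every request to come out in the same shape, a small formatter can refuse to produce a ticket with a blank field. The function name, the field labels, and the sample values below are assumptions made for this sketch. Your facility's portal may need a different layout.

```python
def format_ticket(device, rack_location, action, success_check,
                  spend_limit, approval_status, mismatch_contact):
    """Build a remote-hands ticket and reject any empty field."""
    fields = [
        ("Device", device),
        ("Rack location", rack_location),
        ("Action", action),
        ("Confirm success", success_check),
        ("Spending limit", spend_limit),
        ("Approval status", approval_status),
        ("If anything does not match, stop and call", mismatch_contact),
    ]
    empty = [label for label, value in fields if not str(value).strip()]
    if empty:
        raise ValueError("ticket is missing: " + ", ".join(empty))
    return "\n".join(f"{label}: {value}" for label, value in fields)

# Placeholder values for illustration.
print(format_ticket(
    device="app-01",
    rack_location="rack R12, unit 18",
    action="Reseat drive in bay 3",
    success_check="green activity light on bay 3, send before and after photos",
    spend_limit="30 minutes of on-site time",
    approval_status="pre-approved",
    mismatch_contact="Alex, +1-555-0100",
))
```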
Store the spare parts that save real time
A failed $60 part can keep a server down for half a day. In a runbook, the spare shelf matters almost as much as the server itself.
Start with the parts that fail often and take time to replace. For most first deployments, that means at least one matching drive stored at the facility, not in your office or at home. If the server needs a special tray or carrier, keep that with the spare too. A replacement drive without the right carrier does not help much.
Power items deserve the same care. Keep a known-good power cable, and store any small parts that fit only that server model. Rails, screws, adapters, and fan modules sound minor until someone on site cannot finish a swap because one piece is missing.
A small, clean inventory beats a big box of random hardware. Mixed drives, mystery cables, and unlabeled memory waste time because nobody knows what fits or whether it ever worked. Test each spare before you store it, then label it with the model, size, and purchase date. If firmware version or drive format matters, write that down too.
For most single-server setups, a basic spare kit is enough: one matching drive, the correct tray or carrier, one known-good power cable, and any model-specific part you cannot buy nearby on short notice.
Warranty terms do not solve this on their own. A vendor may promise a fast replacement, but "next business day" can still mean a long outage if a part fails late at night, on a weekend, or during a holiday. Compare the warranty delay with the downtime your service can actually handle.
One rule works well: if a missing part would stop service and you cannot replace it locally within a few hours, store it at the rack. That simple bit of spare-parts planning saves real time when remote hands are waiting for instructions and every extra minute costs you.
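To see how a "next business day" promise compares with the downtime you can tolerate, it helps to do the arithmetic for a bad moment such as a Friday evening. The sketch below makes several assumptions that are not part of any real warranty: the part arrives at 17:00 on the next weekday, public holidays are ignored, and the function names are invented for the example.

```python
from datetime import datetime, timedelta

def next_business_day_arrival(failed_at, arrival_hour=17):
    """Assumed worst case: the part shows up at arrival_hour on the next weekday."""
    day = failed_at + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=arrival_hour, minute=0, second=0, microsecond=0)

def should_store_on_site(stops_service, failed_at, tolerated_hours):
    """Store the spare at the rack if waiting for the vendor exceeds tolerance."""
    if not stops_service:
        return False
    wait = next_business_day_arrival(failed_at) - failed_at
    return wait > timedelta(hours=tolerated_hours)

friday_evening = datetime(2024, 1, 5, 18, 40)  # a Friday
print(next_business_day_arrival(friday_evening))
print(should_store_on_site(True, friday_evening, tolerated_hours=4))
```

Under these assumptions, a failure on Friday evening waits until Monday afternoon, which is close to three days. That gap is the number to compare with what your service can absorb.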
Watch the signals that matter
If your only alerts say CPU is high or RAM is low, you will miss the failures that hurt a colocated server most. A drive can start failing days before system load changes. A fan can stop, heat can rise, or packet loss can make the service feel broken while the server still looks up.
Watch the hardware signals that usually give you a short warning window: disk SMART health and read or write errors, RAID state, ECC memory errors, temperatures from CPU, board, and drives, plus packet loss and latency on the network path.
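As one concrete example of watching RAID state, here is a minimal sketch that assumes Linux software RAID (md), where the kernel reports array status in `/proc/mdstat`. Hardware RAID controllers need the vendor's own tool instead. The function name and the sample text are made up for the example.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing disk ('_')."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines look like: "... blocks super 1.2 [2/1] [U_]"
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630464 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      1048576 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a real host you would pass in the contents of `/proc/mdstat` and page someone when the list is not empty.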
That is only half the job. Users do not care that the machine has spare CPU if logins fail or pages take 12 seconds to load. Watch one or two service checks all the time, such as whether the app answers a request, whether the database completes a simple query, or whether your public endpoint returns the expected status code.
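A service check can be as small as one HTTP request judged on both status and speed. The sketch below uses only the Python standard library. The function names, the 2-second threshold, and the result labels are assumptions for illustration, and the URL you check is your own.

```python
import time
import urllib.error
import urllib.request

def judge(status, seconds, expected_status=200, max_seconds=2.0):
    """Healthy only if the status matches and the reply came back fast enough."""
    if status is None:
        return "down"
    if status != expected_status:
        return "wrong_status"
    if seconds > max_seconds:
        return "slow"
    return "ok"

def check_endpoint(url, expected_status=200, max_seconds=2.0, timeout=10.0):
    """Fetch url once and classify the result with judge()."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx and 5xx replies still carry a status code
    except (urllib.error.URLError, OSError):
        status = None      # refused connection, DNS failure, or timeout
    seconds = time.monotonic() - start
    return judge(status, seconds, expected_status, max_seconds)

# A page that answers 200 but takes 12 seconds still counts as a problem.
print(judge(200, 12.0))
```

Keeping the decision in `judge` separate from the network call makes the thresholds easy to review without touching a live endpoint.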
Keep these views separate. One dashboard should answer, "Is the server healthy?" Another should answer, "Can customers use the service?" That split makes triage faster. If hardware looks clean but response time spikes, you know where to look first.
A simple setup is enough: one after-hours alert path that wakes a real person, one dashboard for hardware health, one dashboard for customer impact, and one escalation rule for when to call remote hands.
Do not send alerts to a place nobody checks after work. Email alone often fails this test. Use a phone alert, paging app, or chat channel with clear on-call ownership.
Trim noisy alerts before the first real incident. If something fires every night and nobody acts, fix the threshold or remove it. A short alert list that people trust is much better than a busy dashboard that trains everyone to ignore it.
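Finding candidates to trim is mostly counting. This sketch assumes you can export recent alert events as pairs of alert name and whether anyone acted on it. The event format, the `min_fires` threshold, and the alert names are invented for the example.

```python
from collections import Counter

def noisy_alerts(events, min_fires=5):
    """Return alerts that fired at least min_fires times and were never acted on.

    events is an iterable of (alert_name, acted_on) tuples from a review window.
    """
    fires = Counter()
    acted = Counter()
    for name, acted_on in events:
        fires[name] += 1
        if acted_on:
            acted[name] += 1
    return sorted(
        name for name, count in fires.items()
        if count >= min_fires and acted[name] == 0
    )

# Placeholder history: one alert fires nightly and is always ignored.
history = [("cpu_high", False)] * 7 + [("raid_degraded", True)] + [("disk_latency", False)] * 2
print(noisy_alerts(history))
```

Anything this returns needs a decision: fix the threshold or remove the alert.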
Walk through one realistic outage
At 6:40 p.m. on a Friday, one drive in your RAID set starts throwing SMART errors. Five minutes later, monitoring sends an alert that reports a degraded array, higher disk latency, and the bay ID of the failing drive. Users do not notice anything yet, but you have lost your safety margin.
The on-call engineer opens the dashboard, confirms that the server is still serving traffic, and checks the runbook. It shows the server name, rack position, chassis model, RAID layout, disk slot map, and the serial-number pattern for each drive. That matters because the fastest way to turn a small issue into a long outage is to pull the wrong disk.
The ticket to remote hands is short and precise. It says which bay to replace, which spare drive to use, where that spare is stored, and who must approve the swap before anyone touches the server. If the runbook includes a photo of the front bays and drive labels, the tech has very little room for guesswork.
Remote hands confirms the label, removes the failed drive, and inserts the spare from your stock. No one needs to search local stores. No one debates compatibility. That alone can save one to three hours on a bad evening.
Now the job shifts back to your team. The on-call engineer watches the rebuild in the RAID controller and tracks three things: rebuild progress, read latency, and error count. A second person, often the team lead or founder, posts updates to the people waiting on the fix. One person repairs. One person communicates. That split keeps both jobs clear.
The runbook cuts delay at each step. Monitoring catches the problem before customers report it. Asset notes identify the exact failed drive. Access rules stop an unapproved swap. Spare stock removes the wait for parts. Named owners keep rebuild checks and updates from being missed.
By 8:05 p.m., the array is rebuilding, traffic is stable, and everyone knows the next update time. The failure still happened. It just stayed small.
Mistakes that stretch a short issue into hours
Most long outages start with a small problem and a missing detail. A fan fails, a boot disk drops, or the server stops answering after a reboot. The hardware issue may take 10 minutes to fix, but confusion around access, notes, and parts can turn it into half a day.
One common mistake is giving full admin rights to too many people. It feels faster at first. In practice, it creates guesswork, accidental changes, and no clear owner during an incident. One person should approve risky actions, and only a small number of people should have the power to touch BIOS, IPMI, VPN, firewall rules, and rescue tools.
Another problem is keeping the only recovery note in one person's head. If that person is asleep, on a flight, or no longer with the company, the team stalls. The runbook should live in a shared place and answer basic recovery questions without a phone call.
A short checklist helps. Make sure you know who can approve a reboot or power cycle, who can access out-of-band management, where recovery steps and passwords are stored, which spare parts match the exact server model, and what remote hands can do without extra approval.
Spare parts cause more wasted time than people expect. An unlabeled drive, rail kit, or power supply is almost useless when the clock is running. Teams often discover too late that the spare uses the wrong connector, wrong form factor, or wrong firmware. Label every part with the server model, bay size, and any setup note that matters.
Monitoring can also give false comfort. Default checks often tell you only that the server is up. They do not warn you when SMART errors rise, a RAID array degrades, a power supply dies, or temperatures drift up. A box that still answers ping can still be one reboot away from a hard stop.
Untested remote-hands requests are another trap. Many teams never test the wording they will send during an outage. Then they ask the datacenter to "check the server" and get a vague update 40 minutes later. Run one dry request before you need it. Write exact actions, exact labels, and exact success checks. That small habit saves more time than buying faster hardware.
Quick checks before you call it ready
A runbook fails when it only looks complete on paper. Before you trust it, make someone else use it for one small task and watch where they stop, guess, or call you.
Start with people. Write down the names, roles, phone numbers, backup contacts, and time zones for everyone who may need to act fast. Add approval rules too: who can authorize a reboot, power cycle, drive swap, or console session. Staff changes break more runbooks than hardware does, so review this page often.
Then test the parts that usually go wrong under pressure:
- Give remote hands one sample task, such as swapping a labeled SSD or moving a network cable to a marked port, and see whether your notes are enough on their own.
- Match every spare part to the exact server model, tray, rail, cable type, and power supply in the rack.
- Confirm that alerts cover hardware trouble and service trouble, not just one side.
- Run one planned recovery test, time it, and record each step that felt slow or unclear.
- Verify that login details, out-of-band access, and approval contacts still work from outside your office network.
A short test tells you more than a long document. Ask remote hands to replace a failed boot drive using only your notes. If they need three extra calls to confirm the serial number, tray type, or reboot order, fix the runbook now, not during an outage.
The best runbook feels a little boring because every detail is already settled. If another person can recover the server without you on the phone, it is ready.
What to do after the first draft
A runbook starts as a guess. After the first real incident, it becomes useful.
Update it while the details are still fresh. If a drive failed, a switch port changed, or someone added a new out-of-band access step, edit the runbook the same day. Small delays turn clear memory into bad notes.
Use each incident as a test. If remote hands solved the problem in 10 minutes, keep that step. If they stalled because your note said "check the bad disk" without a slot number or label, rewrite it with exact details.
After every change, do a short review. Add any hardware swap, cable move, BIOS change, or access change. Remove notes that nobody used during the last incident. Expand any step that made someone stop and ask a question. Record what the datacenter staff needed from you before they could act. Note which alert fired first and which one arrived too late.
This is where most runbooks get bloated. Resist that. A page full of old workarounds and half-true notes slows people down. Keep the steps that help at 2 a.m. Cut the rest.
Consistency matters once you have more than one server. Reuse the same format every time: system name, rack location, serial number, remote access method, common failure steps, spare parts on hand, and who can approve work. When every server follows the same shape, nobody wastes time hunting for basic facts.
A simple example makes the rule clear. If the last outage took an extra 25 minutes because the hands team could not confirm which NIC port carried management traffic, add that exact port mapping to the runbook. Do not write "verify network." Write the port, label, and expected link state.
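Written as data, that port mapping might look like the sketch below. The interface names, labels, switch ports, and the `port_mismatches` helper are all hypothetical. The point is only that each entry carries the port, the label, and the expected link state.

```python
# Hypothetical mapping; replace with the labels and ports from your own rack.
EXPECTED = {
    "eno1": {"label": "APP01-MGMT", "switch_port": "sw1/12", "link": "up"},
    "eno2": {"label": "APP01-PROD", "switch_port": "sw1/13", "link": "up"},
}

def port_mismatches(expected, observed_links):
    """Compare observed link states ({nic: 'up' or 'down'}) with the runbook."""
    problems = []
    for nic, want in expected.items():
        seen = observed_links.get(nic, "unknown")
        if seen != want["link"]:
            problems.append(
                f"{nic} ({want['label']} on {want['switch_port']}): "
                f"expected {want['link']}, saw {seen}"
            )
    return problems

print(port_mismatches(EXPECTED, {"eno1": "down", "eno2": "up"}))
```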
If you want a second pair of eyes, Oleg Sotnikov at oleg.is can review your colo setup, monitoring, and access rules. His background as a Fractional CTO and advisor is especially useful for small teams that need solid operations without adding a lot of process.
When the next server arrives, do not start from scratch. Copy the same template, update the facts, and keep the format unchanged.
Frequently Asked Questions
Do I really need a runbook for just one server?
Yes. One server still needs clear recovery steps, contact names, access rules, and spare-part notes. When something fails late at night, a short runbook cuts guesswork and gets people moving fast.
What should go on the first page of the runbook?
Start with the facts nobody should guess under stress: server name, rack and unit position, asset tag, model, serial numbers, management IP, production IPs, hostname, and switch port details. Add the first on-call contact and the person who can approve risky actions.
Who should have access to a colocated server?
Split access into OS login, console or IPMI, physical access, change approval, and incident command. Put real names, phone numbers, time zones, and backup contacts in the document so nobody has to hunt for the right person.
How do I write safe instructions for remote hands?
Give remote hands a short ticket with the server name, rack location, exact action, success check, spend limit, and one stop rule. Pre-approve only low-risk work like a power cycle or a swap of a clearly labeled drive.
Which spare parts should I keep at the datacenter?
For most single-server setups, keep one matching drive on site with the right tray or carrier, a known-good power cable, and any small model-specific part that could block a repair. Label each spare so the tech does not guess under pressure.
What should I monitor besides CPU and RAM?
Watch disk SMART errors, RAID state, ECC memory errors, temperatures, packet loss, and latency. Pair that with one or two service checks, like an app request or a simple database query, so you catch customer impact early.
Where should I store the runbook?
Keep it in a shared place that the team can reach from outside the office network. Do not leave recovery steps or access notes in one person's head or on one laptop.
How often should I test the runbook?
Test it during a planned reboot, after major changes, and after every real incident. Ask someone else to follow it line by line, because fresh eyes catch missing details faster than the author does.
What mistakes turn a short issue into a long outage?
Most long outages start with vague approvals, too many admins, old contact details, unlabeled spares, or alerts that nobody trusts. Teams also lose time when they send remote hands a fuzzy request like "check the server."
Should I get outside help with my colo setup?
If your team lacks on-call depth, clear access rules, or confidence in recovery, outside help makes sense. Oleg Sotnikov can review your colo setup, monitoring, remote-hands process, and runbook so small gaps do not turn into expensive downtime.