Self-hosted runners on bare metal vs cloud VMs for CI
Compare self-hosted runners on bare metal vs cloud VMs across queue stability, monthly cost, and recovery speed when CI slows daily work.

Why CI queues start hurting the team
A CI queue becomes a problem long before builds stop completely. When developers wait 10 or 15 minutes just to start a test run, they switch tasks, lose context, and come back slower. Reviews drag because nobody wants to merge code that is still stuck in line. Releases slip for the same reason. Hotfixes feel worst of all, because the fix may be ready while customers still wait for a runner.
Not every delay means the same thing. Some pipelines are slow because the jobs are heavy: large test suites, bulky Docker builds, or steps that download the same dependencies over and over. Other teams have jobs that would finish quickly if they could start on time, but too few runners force everything into a queue. Those are different problems. One needs pipeline cleanup. The other needs more runner capacity, better scheduling, or both.
The strain usually shows up in small ways first. Pull requests stay open longer. Release branches sit unmerged near the end of the day. Hotfixes compete with routine test jobs. People rerun jobs just to see if the queue moves.
Rare spikes are annoying. Daily friction is what wears a team down. A Monday morning backlog after a big merge is normal. A queue that slows every review every afternoon is something else. People start batching changes to avoid another build, which makes each merge bigger and riskier. They test less often. They delay small fixes. Over a month, that habit costs more than one bad incident.
That is why CI queue stability matters more than peak speed on a benchmark. A runner setup can look cheap or fast in short bursts and still hurt the team if normal work turns into waiting. That pain is often what pushes teams to compare self-hosted runners on bare metal and cloud VMs in the first place.
What bare metal runners change
Bare metal runners usually make CI feel more predictable. You get the same CPU model, the same disk speed, and the same network path every day. That consistency matters more than many teams expect. A test suite that takes 9 minutes on Monday is far more likely to take about 9 minutes on Thursday too, instead of bouncing between 7 and 18 because a shared VM host got noisy.
Jobs also tend to start faster because the runner is already there. You do not wait for a VM to boot, attach storage, warm caches, or pull the same base image from scratch. If your team pushes code all day, those small delays add up fast. Saving even 30 to 90 seconds per job can remove a lot of queue pain before you change a single test.
Capacity planning changes as well. With bare metal, you usually buy or rent enough machines for normal load, then leave headroom for the busy parts of the day. If mornings are packed with merges and afternoon deploys trigger full pipelines, you need spare capacity before the rush starts. Teams that size runners for average load often regret it quickly.
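One rough way to put numbers on that headroom is to size the pool from peak arrivals instead of the daily average. The sketch below is only an illustration: the function name `runners_needed`, the arrival rates, the 20-minute job length, and the 20% headroom are all assumptions, not figures from a real team.

```python
def runners_needed(jobs_per_hour, job_minutes, headroom_pct=0):
    """Estimate runners required to keep up with a steady arrival rate.

    Each job occupies one runner for job_minutes, so the pool must supply
    jobs_per_hour * job_minutes runner-minutes every hour. Integer inputs
    keep the ceiling arithmetic exact.
    """
    runner_minutes = jobs_per_hour * job_minutes * (100 + headroom_pct)
    # Ceiling division by (60 minutes * 100 percent) without floats.
    return -(-runner_minutes // 6000)


# Hypothetical workload: 20-minute jobs, 12 per hour on average, 30 per hour at peak.
average_sized = runners_needed(12, 20)                # sized for the average
peak_sized = runners_needed(30, 20, headroom_pct=20)  # sized for the rush plus headroom

print(f"average-sized pool: {average_sized} runners")
print(f"peak-sized pool with headroom: {peak_sized} runners")
```

With these made-up inputs, the average-sized pool is a third of what the peak estimate calls for, which is the gap teams tend to discover during the first busy morning.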
The upside is simple: stable hardware gives you steadier build times, warm runners cut startup delay, and local caches stay useful. The trade-off is just as simple. Your team owns recovery when hardware fails. If a disk dies, a power supply fails, or a network switch starts dropping packets, nobody replaces the machine for you. You need spare parts, remote hands, or someone who can get on-site.
That is why experienced teams keep one extra runner ready and keep runner setup scripted. Oleg Sotnikov often talks about lean infrastructure and careful sizing, and that mindset fits bare metal well. Predictable machines are great, but only if you can replace them fast when one goes down.
What cloud VM runners change
Cloud VM runners solve a different problem. They make it much easier to add capacity when the queue suddenly grows. If ten pull requests land in an hour, you can spin up more runners instead of waiting for one fixed machine to clear the backlog. That flexibility is the main reason many teams move CI to VMs first.
The catch is easy to miss. More capacity does not always mean fast starts.
A cloud runner usually needs a few steps before it does useful work. The VM boots. The runner connects. The job pulls its base image. Caches start cold if the machine is fresh. On a busy day, that setup time may feel small. After a quiet night or a scale-up event, the first jobs often wait longer than people expect.
Small teams notice this quickly. A test suite that takes 6 minutes on a warm runner can take 9 or 10 minutes on a new VM if it has to download everything again. If developers push often, those extra minutes stack up and the queue still feels slow even though more machines are available.
Cloud VMs also bring more variation in job speed. You do not control the physical host, so the same build can run at different speeds across the day. One run may finish in 7 minutes, the next in 8.5, with no code change at all. Shared hardware, storage performance, and network noise all play a part. That variability gets old fast when the team wants stable feedback.
Then there is the bill. Compute is only part of it. Storage for runner disks and cached images adds up. Network traffic can rise if jobs pull large containers or upload big artifacts on every run. A setup that looks cheap per hour can cost much more by the end of the month if runners start cold often and move a lot of data.
Cloud VM runners work best when load changes a lot and fast recovery matters more than perfectly steady build times. They give you room to grow, but they reward teams that manage images, caches, and artifact sizes carefully.
Queue stability under real load
Teams rarely feel CI pain during a calm morning. The queue starts to hurt when many merge requests land in a short window, usually before lunch, before a release, or after a long branch finally gets split into smaller changes. That is when the gap between bare metal and cloud VMs becomes obvious.
With steady daily traffic, both setups can look fine on paper. A team that runs 20 builds spread across the day may see similar average wait times on both. The trouble starts with bursts. If 12 developers push within 15 minutes, fixed bare metal runners often keep queue times more predictable because the machines are already there, already warm, and already holding the same caches.
Cloud VM runners can absorb spikes, but only after new workers exist. That sounds obvious, yet many teams forget the startup delay. The autoscaler sees the queue, asks for more VMs, waits for them to boot, pulls the runner image, downloads base containers, and only then starts the job. By that point, the spike may already be fading, and people still waited.
Most of that hidden delay comes from the same few places: cache warm-up after a fresh machine starts, container image pulls from the registry, dependency installs that a warm runner would have skipped, and artifact upload time after the build finishes.
On fixed hardware, those costs do not vanish, but they flatten out. Local caches stay hot. Images stay on disk. Network paths stay familiar. Queue times feel steadier because each job starts from a known state instead of rebuilding the same setup again and again.
That does not mean bare metal wins every burst. If the spike is much larger than normal, a fixed pool hits a hard ceiling. Cloud VM runners help most when bursts last long enough to justify the startup time, or when workloads vary so much that keeping idle machines around feels wasteful.
A better way to judge real behavior is to measure p50 and p95 queue time, not just average build time. If most jobs start fast but every afternoon a chunk of them waits 6 to 10 minutes, the autoscaler is reacting too late. If a warm bare metal pool keeps that delay closer to 1 or 2 minutes, the team will feel the difference every day.
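A minimal sketch of that measurement, assuming you can export queue wait times in seconds from your CI system. The sample values and the `percentile` helper are hypothetical. The helper uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical queue waits in seconds: most jobs start fast, two wait for minutes.
queue_waits = [20, 25, 30, 35, 40, 45, 50, 60, 420, 540]

p50 = percentile(queue_waits, 50)
p95 = percentile(queue_waits, 95)
mean = sum(queue_waits) / len(queue_waits)

print(f"p50={p50}s p95={p95}s mean={mean:.1f}s")
```

In this sample the median looks healthy while the p95 exposes the slow tail, and the mean sits somewhere in between without describing either group well.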
How the cost curve shifts over a month
Monthly CI cost rarely matches the sticker price on a server or VM. A cleaner way to think about it is cost per successful build. A runner that looks cheap on paper can get expensive once you count failed jobs, retries, and the hours people spend waiting.
Bare metal usually gives you a flatter monthly bill. You pay for the machine, power, storage, and some maintenance whether the runner is busy or idle. That works well when the team pushes code every day and keeps the runner warm through most of the week. The more clean builds you get from the same hardware, the lower the unit cost drops.
Cloud VM runners move the other way. Quiet weeks can be cheap. Release week can spike fast. Teams often budget for compute and miss the extras around it: storage, snapshots, caches, log retention, idle time, retries, and the admin time spent fixing runner images or bad startup scripts.
Bare metal has hidden line items too. If one machine carries most of the load, you may need a second box or at least spare disks and RAM on the shelf. That spare capacity costs money even when it sits unused. If a drive dies and the team loses half a day, that loss belongs in the math.
Cloud has a different trap. Teams like the flexibility, then leave larger VMs running just in case. Storage snapshots keep growing. Cache volumes stick around. Egress charges appear when builds pull or push a lot of artifacts between regions. None of these costs looks dramatic alone, but together they can erase the savings from avoiding hardware.
A small example makes the point. Say a team spends $500 a month on two bare metal runners, spare parts, and power, and gets 8,000 successful builds. That is about $0.06 per successful build. A cloud setup might show $380 in compute, then another $140 in storage, snapshots, and network charges, for $520 in total, plus a higher retry rate during busy hours. In that case, the cloud bill is not really lower.
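The arithmetic behind that example can be written out in a few lines. The dollar figures come from the example above. The cloud build count is an assumption: the text does not give one, so the sketch reuses the same 8,000 builds to keep the comparison even, and it ignores the extra retries.

```python
def cost_per_successful_build(monthly_cost, successful_builds):
    """Unit cost: everything spent in the month divided by builds that passed."""
    if successful_builds <= 0:
        raise ValueError("need at least one successful build")
    return monthly_cost / successful_builds


bare_metal = cost_per_successful_build(500, 8_000)

# Compute plus storage, snapshots, and network charges.
cloud_bill = 380 + 140
# Assumption: same 8,000 successful builds as the bare metal setup.
cloud = cost_per_successful_build(cloud_bill, 8_000)

print(f"bare metal: ${bare_metal:.4f} per build")
print(f"cloud:      ${cloud:.4f} per build (bill ${cloud_bill})")
```

A higher retry rate would lower the number of successful builds per dollar on the cloud side, which pushes its unit cost up further.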
Steady daily load usually favors bare metal. Uneven load often favors cloud. The wrong choice gets expensive when teams track server price and ignore everything around it.
When runners fail and how teams recover
CI trouble rarely starts with a dramatic outage. It usually starts with small failures that pile up: disks fill, a base image gets stale, one job never exits, or a host reboots and takes half the queue with it. If nobody notices quickly, the team loses hours to retries and guesswork.
The same problems show up in most runner pools. Disks fill with caches, Docker layers, and old artifacts. Bad images break every fresh job in the same way. Stuck jobs hold a slot for 30 or 40 minutes. One host crash can remove several runners at once.
Bare metal and cloud VM runners fail differently. On bare metal, one strong host often runs many executors. That keeps cost down, but one host outage can wipe out most of the pool in a single hit. If a team put eight runners on one machine, one kernel issue, power problem, or bad update can turn a healthy queue into a full stop.
Cloud VM runners spread risk better, but they can spread mistakes faster too. One broken VM image, startup script, or dependency change can poison every new runner. Instead of one bad machine, you get a wave of identical failures, and builds keep failing until someone stops new instances from launching.
Teams recover faster when they use simple rules and stick to them. Set alerts for low disk space, long queue wait time, and jobs that run far longer than usual. Keep runner logs in one place so people can compare failures instead of guessing. Decide the replacement window before anything breaks: restart in 5 minutes, rebuild in 15, fail over in 30.
If a runner misses that window, remove it from the pool and replace it. Do not leave sick machines online just because they sometimes pass. They create the worst kind of delay: random delay.
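Those windows can be written down as a small escalation table so nobody has to argue about them during an incident. This is a sketch of one possible reading of the rule: the function name, the action labels, and the idea of keying on minutes since the runner went unhealthy are assumptions for illustration.

```python
# (deadline in minutes, action to try before that deadline passes)
RECOVERY_WINDOWS = [
    (5, "restart"),
    (15, "rebuild"),
    (30, "fail over"),
]


def next_action(minutes_unhealthy):
    """Pick the escalation step for a runner that has been unhealthy this long."""
    for deadline, action in RECOVERY_WINDOWS:
        if minutes_unhealthy < deadline:
            return action
    # Every window was missed: stop nursing the machine.
    return "remove from pool and replace"


for minutes in (2, 10, 20, 45):
    print(f"{minutes:>2} min unhealthy -> {next_action(minutes)}")
```

Keeping the table in code or config also makes it easy to review the windows after each incident and tighten them if they proved too generous.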
One habit helps more than most teams expect. Keep one failure small. Split bare metal runners across at least two hosts. Roll VM image changes to one canary runner first, then expand. That is often the difference between one failed pipeline and a whole afternoon of red builds.
A simple example from a growing team
Picture a team of 20 developers who push code all day, mostly between 9 a.m. and 6 p.m. Their CI jobs are not tiny. Each pull request runs unit tests, integration tests, a Docker build, and a few checks that hit databases and test services. During busy hours, they create bursts of 10 to 15 jobs at once.
Now compare two setups for the same workload. The first uses two bare metal hosts kept in an office or a rented rack. The second uses an autoscaled pool of cloud VM runners that grows during peak hours and shrinks at night.
| Metric | 2 bare metal hosts | Autoscaled cloud VM pool |
|---|---|---|
| Average queue time | 3 to 6 min | 2 to 5 min |
| Peak queue time at lunch and late afternoon | 12 to 18 min | 5 to 10 min |
| Average build time | 18 to 22 min | 22 to 28 min |
| Failed retries per month | 8 to 12 | 4 to 8 |
| Monthly spend | $1,200 to $1,800 | $1,900 to $3,200 |
The bare metal side looks better on cost, and often on raw speed too. That usually happens because caches stay warm, container layers are already there, and the hosts do not need to boot from scratch when the queue spikes. For a team that runs the same heavy jobs every day, that matters a lot.
The cloud VM pool usually wins on elasticity. If ten developers push within five minutes, the pool can add runners instead of forcing everyone into one long line. Still, the queue does not disappear by magic. New VMs need time to start, pull images, and fetch dependencies. If jobs are heavy, that startup tax shows up in build time.
Failure recovery is where the trade-off becomes clearer. If one bare metal host dies, the team loses half its runner capacity at once. Queue time can jump from 5 minutes to 30 minutes until someone fixes the host or shifts jobs around. On cloud VMs, one runner failure hurts less because the pool can replace it quickly. But cloud pools have their own weak spots: bad images, quota limits, or broken startup scripts can hit many runners at the same time.
For this kind of team, the choice should follow job shape, not fashion. If the workload is steady and predictable, two strong bare metal hosts often give the best cost per build. If traffic swings hard during the day and the team cannot tolerate queue spikes, the VM pool may cost more but keep work moving.
Common mistakes that make queues worse
Teams often blame the runner when the delay starts inside the pipeline. Faster hardware can hide waste for a week or two, then the queue comes back.
A common mistake is buying bigger machines before trimming slow test steps. If jobs install dependencies from scratch, rebuild the same assets every time, or run large test suites on every commit, a faster runner just burns more money. The work is still bloated.
Measure where the minutes go before you spend anything. One team might spend 12 minutes on setup and only 3 on actual tests. Cutting duplicate setup work often helps more than doubling CPU or memory.
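A quick way to see the split is to turn step durations into shares of the total. The sketch below uses the 12-minute setup and 3-minute test figures from the example above. The step names and the `time_share` helper are hypothetical, and real pipelines will have more steps.

```python
def time_share(step_minutes):
    """Return each step's fraction of the total job time."""
    total = sum(step_minutes.values())
    if total <= 0:
        raise ValueError("job has no recorded time")
    return {name: minutes / total for name, minutes in step_minutes.items()}


# Durations in minutes for one job.
job = {"setup": 12, "tests": 3}

for name, share in time_share(job).items():
    print(f"{name}: {share:.0%} of the job")
```

When setup dominates like this, doubling CPU mostly speeds up the small slice, which is why trimming duplicate setup work tends to pay off first.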
Another mistake is mixing deployments with regular builds on the same runners. A normal pull request then competes with a release job, image publish, or migration step. That creates an ugly queue because urgent work and routine work block each other all day.
Separate pools usually fix this quickly. Keep everyday build and test jobs on one set of runners. Put deployment jobs on another. Even a small team feels the difference because developers stop waiting behind release tasks.
Storage problems also grow quietly. Teams ignore cache placement, skip disk cleanup, and keep artifacts far too long. Then each job spends extra time downloading old files, unpacking oversized caches, or fighting for free disk space.
This hurts both bare metal and cloud VM runners. A slow shared cache can drag down every job. A nearly full disk can turn a healthy runner into an unpredictable one.
Teams also scale runner count without measuring real concurrency. Adding six more runners sounds safe, but it does not help if only two jobs can make progress at once because they all wait on the same database, package registry, or long integration test.
Count how many jobs truly run in parallel during busy hours. Size the pool around that number, then leave a bit of room for retries and short spikes.
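One way to count that number is a sweep over job start and end times. The sketch assumes you can export, for each job, the moment it started making progress and the moment it finished. The sample intervals (in minutes since the start of the busy hour) and the function name are made up for illustration.

```python
def peak_concurrency(jobs):
    """Return the largest number of jobs running at the same moment.

    jobs is a list of (start, end) pairs in any consistent time unit.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))   # a job begins
        events.append((end, -1))    # a job finishes
    # At equal timestamps, process finishes (-1) before starts (+1) so that
    # back-to-back jobs on one runner are not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical busy-hour sample.
busy_hour = [(0, 10), (2, 8), (5, 12), (10, 15), (11, 14)]
print(f"peak concurrency: {peak_concurrency(busy_hour)}")
```

If the peak stays well below the runner count during busy hours, the bottleneck is probably a shared dependency and extra runners will mostly sit idle.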
Good CI feels boring. Jobs start quickly, deployments stay out of the way, and new runners appear only when the data says you need them.
Quick checks and next steps
When teams compare self-hosted runners on bare metal and cloud VMs, they often skip the simplest numbers. Start with queue data from the last two weeks, not vendor prices or hardware specs.
Look for the hours when jobs pile up. A runner setup can look fine in the morning and still slow everyone down after lunch when merges, test runs, and release builds hit at once. Check how often jobs wait more than 3 to 5 minutes before they start. Break that wait down by job type: unit tests, integration tests, Docker builds, and deploys. Measure how long a fresh runner takes to become usable, not just how long it takes to be created. And be honest about who gets pulled in when runners stop taking work.
The fresh-runner number matters more than it seems. If a cloud VM runner starts in 90 seconds but needs another 4 minutes to pull images, warm caches, and register cleanly, burst capacity is weaker than it looks. If a bare metal runner is already warm but fails once a week and nobody owns the fix, the queue still loses.
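The gap between "created" and "usable" is easy to record once the phases are named. The sketch below uses the 90-second boot and 4-minute preparation figures from the paragraph above. The phase labels and the helper name are assumptions, and a real setup would log each phase separately.

```python
def seconds_until_usable(phases):
    """Sum every phase between requesting a VM and running a job at full speed."""
    return sum(phases.values())


# Durations in seconds.
phases = {
    "boot": 90,
    "pull images, warm caches, register": 4 * 60,
}

total = seconds_until_usable(phases)
print(f"created after {phases['boot']}s, usable after {total}s ({total / 60:.1f} min)")
```

Comparing that total with how long a typical burst lasts shows whether new runners arrive in time to help or only after the queue has already drained.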
Put a price on delay. If six engineers each lose 10 minutes waiting on CI twice a day, that is two hours gone every day. Over a month, that cost can beat the savings from cheaper runner hardware or low spot VM prices.
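The same calculation can be kept in a short script and rerun with your own numbers. The engineer count, wait length, and frequency come from the example above. The 21 working days and the $75 hourly cost are placeholder assumptions, not figures from the article.

```python
def daily_wait_hours(engineers, minutes_per_wait, waits_per_day):
    """Total hours the team spends waiting on CI in one day."""
    return engineers * minutes_per_wait * waits_per_day / 60


hours_per_day = daily_wait_hours(engineers=6, minutes_per_wait=10, waits_per_day=2)

WORKING_DAYS_PER_MONTH = 21  # assumption
HOURLY_COST_USD = 75         # assumption, replace with your own loaded rate

monthly_cost = hours_per_day * WORKING_DAYS_PER_MONTH * HOURLY_COST_USD

print(f"{hours_per_day:.1f} hours lost per day")
print(f"about ${monthly_cost:,.0f} per month at the assumed rate")
```

Set that monthly figure next to the runner bill before deciding which option is the cheaper one.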
A small pilot usually tells you more than a long debate. Move one busy workflow first. Set targets for average queue time, p95 queue time, monthly runner spend, and recovery time after one forced runner failure. Keep the pilot long enough to catch at least one busy release day.
Then do one failure drill on purpose. Stop a runner, break registration, or remove a cache volume. Watch how long the team needs to notice, reroute jobs, and recover normal throughput. If nobody owns that sequence, assign an owner now.
If the team needs a second set of eyes, Oleg at oleg.is works with startups and small businesses on CI/CD setup, infrastructure sizing, and failure recovery as part of fractional CTO and advisory work. The useful part is not a bigger tool list. It is getting clear numbers on where the queue hurts and fixing the part that slows daily work first.
Frequently Asked Questions
When does a CI queue become a real problem?
It becomes a real problem when people wait often, not just during a rare spike. If jobs sit for 10 minutes before they even start, developers switch tasks, reviews slow down, and small fixes pile up.
Are slow builds and long queue times the same thing?
No. A slow build means the job itself takes too long. A long queue means the job could finish fast, but not enough runners can start it on time.
Why do bare metal runners often feel faster?
Bare metal runners stay warm and predictable. They keep the same hardware, local caches, and images on disk, so jobs usually start faster and finish at a steadier pace.
When do cloud VM runners make more sense?
Cloud VMs fit teams with bursty load. If many pull requests land at once or traffic changes a lot during the day, extra VMs can absorb that spike better than a fixed pool.
What should we measure besides average build time?
Track queue time, especially p50 and p95, not just average build time. Also measure how long a fresh runner needs before it can do real work, because boot time alone hides a lot.
Is bare metal usually cheaper over a month?
Often, yes, if your team runs heavy builds every day. A fixed monthly bill, warm caches, and fewer cold starts can lower cost per successful build.
Why do cloud runners still feel slow after autoscaling?
Fresh VMs start cold. They need to boot, register the runner, pull base images, download dependencies, and rebuild caches before the first job moves at full speed.
How should a team handle runner failures?
Set clear recovery rules before anything breaks. Restart fast, rebuild quickly, and remove flaky runners from the pool instead of hoping they recover on their own.
What mistakes usually make CI queues worse?
Teams often buy bigger runners before they trim waste in the pipeline. Rebuilding the same assets, reinstalling dependencies, and mixing deploy jobs with normal test jobs can clog the queue all day.
How can we test bare metal vs cloud without a full migration?
Run a small pilot with one busy workflow. Compare queue time, build time, monthly spend, and recovery after one forced failure, then keep the option that makes daily work feel smoother.