Aug 01, 2025 · 8 min read

What to monitor before hiring SREs at a small company

Learn what to monitor before hiring SREs: the few signals that show user pain, queue buildup, and bad releases without growing a noisy stack.

Why small teams drown in dashboards

A small company can collect 80 charts and still miss the one problem that costs money. Teams add graphs because every tool offers them, not because someone truly needs them. A few weeks later, the dashboard turns into wallpaper.

Alerts make it worse. The app sends one alert, the database sends another, and the queue tool sends five more. Soon nobody knows what matters first, so people mute the noise and move on.

Users usually feel trouble before infrastructure charts look dramatic. A checkout page can slow to six seconds while CPU looks normal. A signup email can arrive ten minutes late while memory stays flat. A queue can grow for half an hour before any server graph looks odd.

That is why too many charts often hide simple problems. The team watches system health and misses customer pain. Support tickets pile up, jobs wait in line, and a bad release slips through because the dashboard answers the wrong questions.

A metric earns its place only when someone will check it and act on it. If nobody owns it, or the team cannot do anything with it, it is noise.

Good signals usually pass four simple tests. They match a real user action such as logging in or paying. One person knows what to do when the number goes bad. The team can see it change within minutes. And it answers a plain question: "Are users stuck right now?"

Start smaller than you think. A few user pain metrics, a few queue checks, and a few release health signals will beat a giant observability setup that nobody trusts. Fewer charts may look less impressive, but they save time and catch the problems people actually notice.

Start with the user path that pays the bills

A small team does not need fifty charts on day one. It needs a short list of actions that turn a visitor into a customer, or keep a customer active. If those actions work, the business keeps moving. If they break, users feel it fast.

Most companies can name those actions in a few minutes: create an account, start a trial, sign in, pay, renew, upload the first file, send the first request, invite a teammate, or finish setup. Pick only the steps tied to revenue, activation, or retention. Leave the rest for later.

For each action, write down what success looks like in plain language. "User signs up and reaches the dashboard in under 20 seconds" is clear. "Payment completes and access changes right away" is clear too. If nobody can describe a good outcome, nobody will agree when the system slips.

Then track three things on that path: failure rate, wait time, and retries. Failure rate tells you when users get stuck. Wait time tells you when the system still works but feels bad. Retries tell you when users or background jobs have to push again because the first attempt did not go through.

Picture a simple SaaS product that makes money when users upload data and run a report. If uploads fail 2% of the time, users complain. If uploads succeed but take 90 seconds instead of 10, many users leave before the report starts. If the client retries three times before success, your queues and workers may already be under strain even though the final status says "ok."
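
To make the three numbers concrete, here is a minimal Python sketch. It assumes you can log one record per user attempt; the `Attempt` record and the `path_summary` helper are hypothetical names made up for this illustration, not part of any monitoring tool.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Attempt:
    """One user attempt at a tracked action (hypothetical record shape)."""
    ok: bool          # True if the action finally succeeded
    seconds: float    # total time the user waited for the outcome
    retries: int      # extra tries needed before that outcome


def path_summary(attempts):
    """Summarize failure rate, typical wait, and retry share for one path."""
    total = len(attempts)
    if total == 0:
        return None  # no traffic yet, so there is nothing to judge
    return {
        "failure_rate": sum(1 for a in attempts if not a.ok) / total,
        "median_wait_s": median(a.seconds for a in attempts),
        "retried_share": sum(1 for a in attempts if a.retries > 0) / total,
    }


# Made-up upload attempts, only to show the shape of the data.
uploads = [
    Attempt(ok=True, seconds=10, retries=0),
    Attempt(ok=True, seconds=90, retries=3),   # "ok", but slow and retried
    Attempt(ok=False, seconds=30, retries=1),
    Attempt(ok=True, seconds=12, retries=0),
]
print(path_summary(uploads))
```

The second upload is the one a plain status check hides: it ends as "ok", yet it shows up in both the wait time and the retry share.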

That is the right place to start before hiring SREs: the few steps where money, trust, and daily use meet. Keep the list short enough that one person can check it in five minutes and the whole team can discuss it without opening ten tabs.

Signals that show user pain fast

Start with the numbers users feel within seconds. The main question is simple: "Can people do the thing they came here to do?"

The first signal is error rate on the flows that make money or keep customers active. Pick two or three paths such as signup, login, checkout, file upload, or report export. If errors jump on one of those paths, users notice right away even if the rest of the app looks fine.

Response time matters too, but only where people actually wait. Track p95 latency on pages and actions where slowness feels broken. Average speed hides bad experiences. If most requests finish in 300 ms but the slowest 5% take 8 seconds on checkout, that is a real problem.
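
A short Python sketch shows why the average misleads. The sample is invented (94 fast requests and 6 very slow ones), and `percentile` is a small helper written for this example using the nearest-rank method; your metrics tool may compute percentiles slightly differently.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical checkout latencies in milliseconds: 94 fast, 6 very slow.
latencies_ms = [300] * 94 + [8000] * 6

average_ms = mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
print(average_ms, p95_ms)
```

On this sample the average comes out at 762 ms, which looks tolerable, while p95 lands on 8000 ms, which is what the unlucky users actually wait through.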

Check uptime from outside the app, not only from your own logs. Internal health checks can stay green while real users hit DNS issues, broken redirects, CDN problems, or region-specific outages. A simple external probe on your main entry points gives you a more honest view.
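
An outside-in probe can be very small. The sketch below uses only Python's standard library and is meant to run from a machine outside your own network. The URLs are placeholders, and treating "HTTP 200" as "up" is an assumption you may want to tighten, for example by checking the page content.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the final HTTP status code, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code          # the server answered, but with 4xx or 5xx
    except OSError:
        return None              # DNS failure, refused connection, timeout


def probe(urls, fetch=http_status):
    """Map each URL to True when it answered with HTTP 200."""
    return {url: fetch(url) == 200 for url in urls}


# Placeholder entry points; replace with your own homepage, login, checkout.
ENTRY_POINTS = [
    "https://example.com/",
    "https://example.com/login",
    "https://example.com/checkout",
]

if __name__ == "__main__":
    for url, up in probe(ENTRY_POINTS).items():
        print("UP  " if up else "DOWN", url)
```

The `fetch` parameter exists so the check can be tested without touching the network. A scheduled job that runs this every minute from a second region is usually enough to start.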

A small starting set is enough: error rate for your main user flows, p95 response time for pages where users wait, outside-in uptime for homepage, login, and checkout, plus failed session counts compared with normal traffic.

Numbers alone can still mislead you. Put support complaints next to your dashboards. If customers say "billing is stuck" and your charts look normal, your monitoring is watching the wrong path, using the wrong threshold, or missing the user context.

Failed session counts or session replays can close that gap. A team might see only a mild rise in API errors, while support tickets spike and session data shows users clicking the same button three times before giving up. That is user pain. It deserves attention before any deeper observability work.

Signals that show queue buildup

Queue trouble often starts in the background. The app still opens, the dashboard looks calm, and nobody sees a fire yet. Meanwhile imports, emails, reports, or billing jobs sit and wait.

Start with queue depth, the number of jobs waiting to run. If that count keeps rising for 10 or 15 minutes, work is arriving faster than workers can finish it.

Depth can fool you, though. A queue may spike during a busy hour and recover on its own. That is why the age of the oldest job matters so much. If 200 jobs are waiting but the oldest is only 30 seconds old, you may be fine. If 40 jobs are waiting and the oldest is 18 minutes old, users already feel the delay.
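
The two queue examples above can be checked with a few lines of Python. This is a sketch that assumes you can read the enqueue timestamp of every waiting job; `queue_snapshot` and `users_feel_delay` are illustrative names, and the 10-minute limit is only a starting assumption.

```python
from datetime import datetime, timedelta, timezone


def queue_snapshot(enqueued_at, now):
    """Return (depth, oldest_age_seconds) for the jobs still waiting."""
    depth = len(enqueued_at)
    if depth == 0:
        return 0, 0.0
    oldest_age = (now - min(enqueued_at)).total_seconds()
    return depth, oldest_age


def users_feel_delay(oldest_age_seconds, limit_seconds=600):
    """Judge the queue by the age of its oldest job, not by its depth."""
    return oldest_age_seconds > limit_seconds


now = datetime(2025, 8, 1, 12, 0, tzinfo=timezone.utc)

# Busy hour: 200 jobs waiting, none older than 30 seconds.
busy = [now - timedelta(seconds=i % 31) for i in range(200)]

# Stuck queue: only 40 jobs, but the oldest has waited 18 minutes.
stuck = [now - timedelta(minutes=18)] + [now - timedelta(seconds=5)] * 39

for name, jobs in (("busy", busy), ("stuck", stuck)):
    depth, age = queue_snapshot(jobs, now)
    print(name, depth, age, users_feel_delay(age))
```

The deeper queue passes the check and the shallower one fails it, which is the reason to chart oldest job age next to depth.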

Then measure the full time from job creation to job finish. That tells you what the customer experiences, not what the worker reports. A report generator that used to finish in 45 seconds but now takes 9 minutes creates real pain even if every worker process still looks active.

Worker failure rate and retry rate add the missing context. Retries can make a system look busy while it gets almost nothing done. If failures rise, retries rise, and queue age rises at the same time, the problem usually is not traffic alone. It is often bad input, a slow dependency, or a release that made each job heavier.

One pattern deserves extra attention: backlog grows while workers stay busy. That usually means capacity is not the whole story. Workers may be stuck on slow jobs, retrying the same work, or wasting time on one bad step.

A small SaaS team can spot this fast. Say invoice PDFs normally finish in under a minute. After a release, workers stay near full usage, queue depth doubles, oldest job age reaches 25 minutes, and retries jump. That points to a release problem much faster than CPU or memory graphs would.

For a small team, these signals are often enough. They stay close to user pain, and they tell you whether to add workers, fix broken jobs, or roll back code.
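
As a rough illustration, the choice between those three actions can be written as a tiny decision helper in Python. Treat it as a starting heuristic under simplified assumptions, not a rule: the function name, its boolean inputs, and the order of the checks are all made up for this sketch.

```python
def triage_backlog(age_rising, failures_rising, retries_rising, deployed_recently):
    """Suggest a first move when a queue backs up (simplified heuristic)."""
    if not age_rising:
        return "no action"                 # jobs are draining fast enough
    jobs_look_broken = failures_rising or retries_rising
    if jobs_look_broken and deployed_recently:
        return "consider rollback"         # trouble started with a release
    if jobs_look_broken:
        return "inspect failing jobs"      # bad input or a slow dependency
    return "consider adding workers"       # healthy jobs, not enough capacity


# The invoice PDF case: queue age and retries climb right after a release.
print(triage_backlog(age_rising=True, failures_rising=False,
                     retries_rising=True, deployed_recently=True))
```

A human still makes the call. The point of writing it down is that the team agrees in advance which signal leads to which first move.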

Signals that expose bad releases

Bad releases usually leave fingerprints within minutes. You do not need a giant observability stack to catch them. You need a small set of charts that tie deploys to user impact.

Start by putting every deploy marker on the same charts the team already checks each day. If errors, latency, or failed jobs spike two minutes after a release, the chart should make that obvious. When deploy markers live in a separate tool, teams waste time arguing about timing.

Compare error rate right before and right after each release. A 15 to 30 minute window works well for small teams. Check the overall error rate, but also the endpoint or action the release touched. A tiny checkout change can double payment errors while the total system graph still looks fine.
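
The before-and-after comparison is easy to sketch in Python. The example assumes request logs are available as `(timestamp, endpoint, is_error)` tuples; that shape and the function names are assumptions for illustration, not any tool's API.

```python
from datetime import datetime, timedelta


def error_rate(events, start, end, endpoint=None):
    """Share of failed requests in [start, end), optionally for one endpoint.

    events: list of (timestamp, endpoint, is_error) tuples.
    Returns None when the window holds no matching traffic.
    """
    hits = [is_error for ts, ep, is_error in events
            if start <= ts < end and (endpoint is None or ep == endpoint)]
    return sum(hits) / len(hits) if hits else None


def around_deploy(events, deploy_at, minutes=15, endpoint=None):
    """Return (rate_before, rate_after) for equal windows around a deploy."""
    window = timedelta(minutes=minutes)
    before = error_rate(events, deploy_at - window, deploy_at, endpoint)
    after = error_rate(events, deploy_at, deploy_at + window, endpoint)
    return before, after
```

Call `around_deploy(events, deploy_at)` for the overall picture, then again with `endpoint="/checkout"` (or whichever path the release touched). A doubled error rate on one endpoint can vanish in the overall number when that endpoint carries a small share of traffic.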

Rollback count matters. Time to recovery matters even more. If the team rolls back often, or needs 45 minutes to stop the damage, that tells you more than a green deploy status ever will. Keep a simple record of how often releases need a rollback and how long users feel the problem.
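
That record does not need a tool. A minimal Python sketch, with a hypothetical `Release` record and field names chosen for this example, could look like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Release:
    """One deploy and, if it went wrong, how long users felt it."""
    version: str
    deployed_at: datetime
    impact_started_at: Optional[datetime] = None  # first sign of user pain
    recovered_at: Optional[datetime] = None       # when users were fine again
    rolled_back: bool = False


def release_health(releases):
    """Count rollbacks and find the longest stretch of user-visible impact."""
    recoveries = [
        r.recovered_at - r.impact_started_at
        for r in releases
        if r.impact_started_at and r.recovered_at
    ]
    return {
        "releases": len(releases),
        "rollbacks": sum(1 for r in releases if r.rolled_back),
        "worst_recovery": max(recoveries, default=timedelta(0)),
    }
```

Filling in `impact_started_at` and `recovered_at` by hand after each incident is fine at this scale. The value comes from looking at the totals every few weeks.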

For mobile or desktop apps, watch session crash rate after updates. One bad client release can look quiet on the server while users deal with frozen screens, restarts, or lost work.

One more check catches a lot of trouble: watch the business flow that matters most after every release. Pick one path such as signup, invoice payment, report export, or order submission. If that flow drops from 96% success to 82%, you have a release problem even if CPU, memory, and uptime stay normal.

A small team can get far with one screen that shows deploy markers, error rate, rollback count, time to recovery, and one business flow success rate. If a release touches the app client, add session crash rate.

How to choose your first dashboard

Your first dashboard should answer one question in under 30 seconds: can users still complete the part of the product that pays the bills? If it cannot answer that, it is already too big.

Start with one user flow and one queue. For a small SaaS team, that might be signup and the background job that sends trial emails or creates accounts. This keeps the dashboard tied to real customer pain instead of a wall of system graphs.

Keep the first version small. Five to seven charts are enough for most teams. You want one failure signal such as success rate or error rate on the chosen flow, one delay signal such as p95 response time or oldest job age, one release risk signal such as errors by deploy version, one volume chart so traffic spikes do not fool you, and one simple note that says who owns the dashboard during release windows.

Set thresholds that people can explain without opening a runbook. If signup success drops below 98% for five minutes, page the on-call person. If the oldest queue job sits for more than 10 minutes, alert the team channel. If errors double within 15 minutes of a deploy, stop the rollout and check the release.
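
Rules this plain fit in a few lines of code, which is a good test of whether they are simple enough. The Python sketch below encodes the three example thresholds; the function name, the input shapes, and the action labels are assumptions for illustration.

```python
def decide_actions(signup_success_by_minute, oldest_job_age_s,
                   error_rate_before, error_rate_after):
    """Turn the three example thresholds into a list of actions.

    signup_success_by_minute: success rates (0.0 to 1.0) for each of the
    last five minutes, oldest first.
    """
    actions = []

    # Below 98% for the whole five-minute window, not just one bad minute.
    if signup_success_by_minute and all(
            rate < 0.98 for rate in signup_success_by_minute):
        actions.append("page on-call")

    # Oldest queued job has waited more than 10 minutes.
    if oldest_job_age_s > 600:
        actions.append("alert team channel")

    # Errors at least doubled after the deploy. A zero baseline needs its
    # own rule, so this sketch skips the check in that case.
    if error_rate_before > 0 and error_rate_after >= 2 * error_rate_before:
        actions.append("stop rollout")

    return actions


print(decide_actions([0.97, 0.96, 0.97, 0.95, 0.97], 700, 0.01, 0.02))
```

If a rule needs more than one short comment to explain, it is probably too clever for a first alert.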

Ownership matters more than most teams expect. During releases, one person should watch the charts, one person should ship, and one person should decide whether to roll back. If one person does all three, they miss things.

One more rule saves a lot of clutter: after every incident, delete any graph that nobody checked. A dashboard should earn its space. If a chart did not help during the last real problem, cut it.

A simple example from a small SaaS team

A six-person SaaS team shipped a Friday change to invoice export. The release looked minor. They changed how export jobs moved from the app into a background worker, and the deploy finished without errors.

Ten minutes later, one chart started bending the wrong way: export queue age. Jobs still entered the queue, but the oldest waiting job kept getting older. That mattered more than CPU or memory because customers felt it first. A busy server can still work fine. A stuck queue cannot.

Support messages arrived soon after. Customers said exports sat in "processing" and never finished. Server load still looked normal. The database was fine. The web app still loaded fast. If the team had watched only host metrics, they would have missed the problem until more customers complained.

They only needed three charts to find the cause: export queue age, export success rate, and a deploy marker on the same timeline.

Those charts told a clear story. Queue age rose right after the release. Success rate dropped at the same time. The deploy marker showed the exact moment the change went live. The bug sat in the worker code, not in the API, database, or infrastructure.

One engineer checked the worker logs and found a retry loop caused by a bad state check. Jobs failed, went back into the queue, and blocked newer work behind them. The team rolled back in a few minutes, and queue age started falling almost at once.

That is why observability at a small company should stay narrow. A few release health signals beat a wall of graphs. Pick the numbers that match real user pain, then put them beside your deploy history. When something breaks, you want a short path from "customers are stuck" to "this release caused it."

Mistakes teams make early

When teams ask what to monitor before hiring SREs, they often start too low in the stack. They watch CPU, memory, and disk, then miss the moment signup breaks or checkout stops working.

A calm server does not mean a healthy product. If users hit an error on the payment page for 10 minutes, the damage is real even when host load looks fine.

They monitor boxes, not the path users take

Small teams often add charts service by service because that feels concrete. After a month, they can tell you Redis memory usage and API latency by endpoint, but they cannot answer one basic question: can a new user sign up, verify an email, and start using the product?

That gap matters more than people expect. A company can survive a brief CPU spike. It does not get the same grace when leads cannot create an account on Monday morning.

A better first habit is simple: map one money path and watch it end to end. For many teams, that means signup, login, payment, and one first success action inside the app.

They let alerts rot

Early alerts often start with good intentions and turn into background noise fast. Someone adds a page for every traffic spike, every deploy bump, and every short queue jump. Two weeks later, the team stops taking the channel seriously.

Old alerts make this worse. The product changes, the architecture changes, and the alert stays. Six months later it still fires for a job that no longer matters, while a new import flow fails quietly.

Teams also mix debug charts and business charts too early. Then revenue, signups, pod restarts, queue depth, and one-off test metrics all fight for space on the same screen. When something breaks, people waste time scanning noise.

Keep two views instead. One should answer, "Are users getting through the main path?" The other should help engineers debug why.

A quick review every few weeks helps. Look at which alerts fired and nobody acted on, which user-facing steps still have no signal, which charts only matter during deep debugging, and which alerts belong to features the team removed. Most teams do not need more graphs early. They need fewer signals and cleaner alerts.

A short weekly checklist

A small team does not need a long review. Once a week, spend 15 minutes on the same few checks and write down the answers in plain words. If the answer takes too long, the monitoring is already too messy.

Ask questions that point to user pain, queue trouble, and release risk:

Can one person tell within 30 seconds whether users can sign in and complete a payment?
Did any queue hold jobs for hours instead of draining?
Did the last release change error rate, response time, or job wait time?
Did the team dismiss the same alert three times or more?
Can you remove one chart that nobody used this week?

A simple rule helps: every chart should answer a live question. "Can users pay?" "Are jobs stuck?" "Did yesterday's release hurt anything?" If a graph cannot do that, it probably does not belong on the first dashboard.

A small SaaS team can run this review on Monday morning and catch a lot with very little. If sign-in works, payments work, queue age stays flat, and releases do not move error rate, the week starts from a steady place.

What to do next

Keep the setup smaller than you think you need. A small team does not need a wall of charts. It needs a few signals that tell you, fast, whether users can finish the action that keeps the business running.

Start with one user flow, one queue, and one release view. Pick the flow that brings in money or keeps customers active, such as signup, checkout, or report delivery. Then choose the queue that can quietly pile up and hurt that flow. Last, add a release view that shows errors, latency, and rollback risk after each deploy.

A good starting set is simple: one chart for success rate and latency on the main user flow, one chart for queue depth and oldest job age, one chart for errors right after a release, and one clear owner for each chart and alert.

Run that setup for two weeks before adding more tools or metrics. That short trial tells you a lot. You will see which alerts fire for real user pain, which ones only create noise, and which charts nobody checks. If a signal never changes a decision, cut it.

Every signal needs one owner who decides whether the threshold is right, who gets paged, and what the first fix should be. If nobody owns an alert, the team will ignore it after the third false alarm.

Some teams get stuck because they already have too many dashboards and no clear starting point. In that case, a short outside review can help. Oleg Sotnikov at oleg.is works with startups and small businesses as a fractional CTO, and this kind of cleanup is usually less about buying more tooling and more about choosing the few signals that match real user pain.

A good first version is boring on purpose. Three or four signals, checked every week, beat fifty charts that nobody trusts.

Frequently Asked Questions

What should a small company monitor first?

Start with one money path, one background queue, and release health. For most teams, that means signup or checkout success rate, p95 latency on that flow, oldest job age in the queue behind it, and error rate after each deploy.

Why are CPU and memory charts not enough?

Because users can get stuck while servers still look calm. A payment page can slow down, an email can arrive late, or a worker can loop on retries without a big jump in host metrics.

How many charts should the first dashboard have?

Keep it small: about five to seven charts. If one person cannot scan it in under 30 seconds and answer whether users can finish the main action, cut more.

Which user flow should we track first?

Pick the path tied to revenue, activation, or retention. Good first choices are signup, login, checkout, first upload, or report delivery. Choose the one that hurts the business fastest when it breaks.

What queue metrics matter most?

Watch queue depth, oldest job age, full job time from creation to finish, and retry or failure rate. Oldest job age usually tells the story fastest because it matches the delay users feel.

How do we spot a bad release quickly?

Put deploy markers on the same charts your team already watches. Then compare error rate, latency, job wait time, and the success rate of one business flow for the 15 to 30 minutes after each deploy.

What alert thresholds should we start with?

Use thresholds people can explain without reading docs. For example, page someone if signup success stays under 98% for five minutes, warn the team if the oldest queue job passes 10 minutes, and stop a rollout if errors jump right after a deploy.

Who should own the dashboard and alerts?

Give each signal one owner. That person keeps the threshold useful, decides who gets paged, and updates the alert when the product changes. During a release, split roles so one person watches charts, one ships, and one decides on rollback.

How often should we review our monitoring setup?

Run a short review once a week. Check whether users could sign in and pay, whether any queue held jobs for too long, whether the last release moved errors or latency, and which alerts people ignored. Remove charts nobody used.

When does it make sense to get outside help?

Ask for help when you already have many dashboards but still cannot tell what hurts users, or when alerts fire all day and nobody trusts them. A short review from an experienced CTO can shrink the setup and make it easier to act on.