Post-launch infrastructure review for a growing product
A post-launch infrastructure review helps teams catch slow deploys, weak backups, alert noise, and pool limits before growth turns small issues into outages.

Why launch week can hide weak spots
Launch week brings unusual traffic. Friends, early adopters, and curious visitors often hit the same pages at the same time. A week later, real usage spreads across dashboards, search, exports, background jobs, and mobile sessions at odd hours.
That shift matters more than many teams expect. A system can look fine during the first rush, then start bending under a slower, wider, messier load. The first week tests excitement. The next few weeks test habits.
A delay of 200 milliseconds rarely scares anyone on its own. Put that delay on login, on every API call, or on a page that loads five widgets, and people feel it fast. The app still works, but it starts to feel sticky.
Users usually do not file a ticket for that right away. They blame their browser, their Wi-Fi, or bad timing. Meanwhile, those tiny slowdowns stack up on the paths people use every day, and the team misses the pattern.
Growth also eats the quiet hours. Night backups, long reports, and heavy deploys may fit neatly when only a few users are online. After a few months, those safe windows get smaller or disappear. Jobs start to overlap, and the database stays busy for longer stretches.
Teams usually spot product bugs before they spot infrastructure drift. A broken button is obvious. A cache that misses a little more each day, or a connection pool that gets tight only during batch work, can sit there for weeks. That is why this review should happen early, while the problems are still cheap to fix.
Start with the numbers that changed since launch
Put launch week next to the last two to four weeks and compare them side by side. This review works best when you start with change, not guesses. If traffic doubled but error rates stayed flat, the system absorbed the growth. If traffic rose only 20% and retries tripled, something other than raw volume is straining it.
Do not lean too hard on daily averages. They smooth out the exact spikes that hurt users. Check peak hours, the busiest background job windows, and the minutes right after a deploy. A system can look calm over 24 hours and still struggle every day at 9:00 AM.
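A minimal sketch of how an average hides a bad hour. The latency values and the name `hourly_p95_ms` are made up for illustration, not measurements from a real system:

```python
from statistics import mean

# Hypothetical p95 latency in milliseconds for each hour of one day
# (index 0 = midnight). The 9:00 AM value is set high on purpose.
hourly_p95_ms = [120] * 24
hourly_p95_ms[9] = 1800

daily_average = mean(hourly_p95_ms)
worst_hour = max(range(24), key=lambda hour: hourly_p95_ms[hour])

print(f"daily average: {daily_average:.0f} ms")
print(f"worst hour: {worst_hour}:00 at {hourly_p95_ms[worst_hour]} ms")
```

With these numbers the daily average is 190 ms, which looks healthy, while the 9:00 AM hour sits at 1800 ms.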
Focus on numbers that map to user pain and team pain: page load time on your busiest screens, failed jobs and queue delays, database latency and connection waits, and growth by service rather than only for the product as a whole.
Write down anything that got slower, noisier, or more expensive since launch. That usually means one or two pages now lag, a worker retries too often, or one service grows much faster than the rest. Teams often miss this because they look at total usage while a smaller service quietly carries most of the new load.
A simple example: signups may rise by 30%, but email jobs can jump by 80% if users trigger more reminders, invites, and resets than expected. That mismatch is where cracks start.
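That comparison can be sketched in a few lines. The service names and weekly counts below are hypothetical, chosen only to match the 30% and 80% figures above:

```python
def growth_pct(before: int, after: int) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) * 100 / before

# Hypothetical weekly counts: (launch week, most recent week).
weekly_counts = {
    "signups": (1000, 1300),
    "email_jobs": (5000, 9000),
}

growth = {name: growth_pct(b, a) for name, (b, a) in weekly_counts.items()}
fastest = max(growth, key=growth.get)

for name, pct in growth.items():
    print(f"{name}: +{pct:.0f}%")
print(f"fastest-growing: {fastest}")
```

Sorting services by growth rather than by total volume is what brings the smaller, faster-growing one to the top.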
If you already use dashboards in tools like Grafana or error tracking in Sentry, keep the same charts for each review. Consistent charts make trends easier to spot and save time.
Check caches before misses pile up
Spend real time on cache behavior. Caches can hide slow queries and heavy app work for weeks, then fail all at once when traffic grows.
Start with the pages or API routes that get the most traffic. If your pricing page, dashboard, or search results serve most requests, compare their cache hit rate now with what you saw right after launch. Even a small drop can create a big load increase because every miss pushes work back to the database or app server.
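The arithmetic behind that is worth seeing once. This sketch assumes a made-up volume of 100,000 requests and two illustrative hit rates. The point is that backend load follows the miss rate, not the hit rate:

```python
def backend_requests(total_requests: int, hit_rate: float) -> float:
    """Requests that miss the cache and reach the database or app server."""
    return total_requests * (1 - hit_rate)

# Hypothetical numbers: same traffic, hit rate slips from 95% to 90%.
at_launch = backend_requests(100_000, 0.95)
now = backend_requests(100_000, 0.90)

print(f"misses at launch: {at_launch:.0f}")
print(f"misses now: {now:.0f}")
print(f"backend load: {now / at_launch:.1f}x")
```

A five-point drop in hit rate looks small on a chart, yet in this example it doubles the work that reaches the backend.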
Poor cache behavior usually shows up in plain signals: hit rate drops on busy routes, entries expire sooner than users revisit them, memory stays near the limit, and evictions jump during normal traffic instead of only at peaks.
Short TTLs often cause the trouble. Teams set safe expiration times early, then forget them. If a page changes twice a day, a 60-second TTL is usually wasteful. You keep rebuilding the same response for no good reason.
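A rough count makes the waste concrete. The sketch assumes one cache key that is requested at least once per TTL period, so every expiry triggers a rebuild:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def rebuilds_per_day(ttl_seconds: int) -> int:
    """Upper bound on daily rebuilds for one key that is requested constantly."""
    return SECONDS_PER_DAY // ttl_seconds

content_changes_per_day = 2
rebuilds = rebuilds_per_day(60)
unnecessary = rebuilds - content_changes_per_day

print(f"rebuilds per day with a 60-second TTL: {rebuilds}")
print(f"rebuilds that returned unchanged content: {unnecessary}")
```

That is up to 1,440 rebuilds a day for content that changed twice. A longer TTL, or invalidating the entry when the content changes, would remove most of them.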
Memory pressure matters just as much. Watch cache memory use during your busiest hour, not just the daily average. If memory sits near the ceiling and evictions spike, the cache stops acting like a speed layer and starts acting like a lottery.
Also test a cold cache on purpose. Restart the cache in staging, or clear part of it, then watch response time, database load, and queue depth. That test shows whether the product can recover cleanly after a deploy, failover, or unexpected restart.
Small SaaS teams often miss this because the app feels fast in normal use. Then one cold start after a release turns a 200 ms page into a 4-second wait and pushes the database to its limit. That is usually fixable, but only if you catch it before traffic doubles again.
Review connection pools and query pressure
Traffic growth rarely breaks a database all at once. More often, requests wait for a free connection, background jobs pile up, and page speed turns uneven. This is often the first hard limit a review uncovers.
Start with open connections during your busiest hour, not the daily average. Count how many web workers, job runners, scheduled tasks, and admin scripts connect to the database, and compare that with the pool size each service can use. Teams often set a sensible default early, then add more workers later and forget that every worker wants its own share of the pool.
A small mismatch can cause real slowdown. If 20 app workers, 10 job runners, and a migration task all hit the same database, a pool cap that looked fine at launch can turn into waits, timeouts, and retry storms.
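Here is a minimal sketch of that check. It assumes each process keeps its own pool, and the pool sizes, launch-time worker counts, and the database limit of 100 connections are hypothetical values, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    processes: int
    pool_size: int  # assumed max connections each process may open

    def max_connections(self) -> int:
        return self.processes * self.pool_size

def headroom(services: list, db_connection_limit: int) -> int:
    """Connections left over if every pool fills up at the same time."""
    return db_connection_limit - sum(s.max_connections() for s in services)

DB_CONNECTION_LIMIT = 100  # hypothetical database-side limit

at_launch = [Service("app workers", 8, 5), Service("job runners", 4, 5)]
after_growth = [
    Service("app workers", 20, 5),
    Service("job runners", 10, 5),
    Service("migration task", 1, 1),
]

print("headroom at launch:", headroom(at_launch, DB_CONNECTION_LIMIT))
print("headroom after growth:", headroom(after_growth, DB_CONNECTION_LIMIT))
```

A negative result means the services can ask for more connections than the database will grant, which is where waits and timeouts come from.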
Then look for slow queries that hold connections too long. A query does not need to fail to hurt you. If it scans too much data, sorts a large result, or waits on a lock, it can tie up a connection long enough to block other work. Check query logs and database metrics for spikes at peak traffic, after imports, and during reports or batch jobs.
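One way to estimate the effect is Little's law: at steady state, the average number of busy connections is roughly the query rate multiplied by the average time each query holds a connection. The rates and durations below are illustrative assumptions:

```python
def busy_connections(queries_per_second: float, avg_query_seconds: float) -> float:
    """Little's law estimate of connections in use at steady state."""
    return queries_per_second * avg_query_seconds

# Hypothetical load: 200 queries per second in both cases.
fast = busy_connections(200, 0.02)  # 20 ms per query
slow = busy_connections(200, 0.25)  # 250 ms per query

print(f"busy connections at 20 ms per query: {fast:.0f}")
print(f"busy connections at 250 ms per query: {slow:.0f}")
```

The traffic is identical in both cases. Only the time each query holds a connection changed, and the pool demand went from about 4 connections to about 50.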
Deploys need their own check. A release may start extra workers, run migrations, warm caches, or replay queued jobs. That short burst can push pool pressure higher than normal traffic. Watch connection use during deploy time, not only during steady traffic.
If you find pressure, fix the simplest mismatch first. Lower worker counts, split noisy jobs, raise pool limits with care, or rewrite the worst query. Small changes here often remove the random slowness users notice long before an outage forces everyone to pay attention.
Make sure backup windows still make sense
A backup plan that worked at launch can turn into a problem once the product grows. More users usually mean more rows, more files, and longer backup jobs. Check the schedule again before a quiet slowdown turns into a real outage.
Time full and partial backups on a normal weekday, not during a staged test with almost no traffic. Measure how long each job takes and watch what happens to CPU, disk I/O, network use, and replica lag while it runs. A job that used to finish in 12 minutes can creep up to 45 without anyone noticing.
Trouble usually starts when backups overlap with active customers. Pages slow down, queries wait longer, and queues begin to pile up. This is common with products that now serve users across several time zones, because the old "nighttime" window may no longer be quiet.
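A quick way to check for that overlap is to put every window into UTC. This sketch works at whole-hour granularity and ignores daylight saving time. The offsets and hours are hypothetical:

```python
def utc_hours(start_local_hour: int, duration_hours: int, utc_offset: int) -> set:
    """UTC hours (0-23) touched by a window that starts at a local hour."""
    start_utc = (start_local_hour - utc_offset) % 24
    return {(start_utc + h) % 24 for h in range(duration_hours)}

# Backup starts at 02:00 in the home office (assumed UTC+1) and runs about an hour.
backup = utc_hours(start_local_hour=2, duration_hours=1, utc_offset=1)

# Customers elsewhere work 09:00-18:00 local time (assumed UTC+10).
customers = utc_hours(start_local_hour=9, duration_hours=9, utc_offset=10)

overlap = sorted(backup & customers)
print("overlapping UTC hours:", overlap)
```

Here the "nighttime" backup at 02:00 lands at 01:00 UTC, which is late morning for the second group of customers.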
Backup success alone is not enough. A green status only tells you files were written somewhere. It does not tell you whether the team can restore the database fast enough after bad data, an accidental delete, or a failed release. Test a small restore and a full restore, then record how long each one takes to get the app usable again.
Keep ownership clear. Pick one person to run restore drills, put those drills on a real schedule, write down where backups live and what access details are needed, and save the last restore time along with any problems you found.
If nobody owns restore tests, they usually do not happen. Then the first real restore becomes the test, and that is when small backup mistakes get expensive.
Cut deploy time before releases stack up
Slow releases create a backlog fast. One bug fix waits for another, then the team starts batching changes together. That makes each deploy riskier, and a simple fix ends up stuck behind work that has nothing to do with it.
Time the release in parts instead of treating it as one number. A team may say, "deploys take 25 minutes," but that hides the real problem. Build time, test time, and rollout time usually slow down for different reasons.
A typical breakdown is simple: maybe eight minutes to build artifacts, 11 minutes to run tests, four minutes to roll out, and two extra minutes waiting for manual approval. That last part matters more than people think. Manual steps block tiny fixes, especially when the person who knows the process is away or asleep.
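The breakdown above is easy to keep as data. In this sketch the stage names are illustrative. A real pipeline would report its own timings:

```python
# Stage durations in minutes, taken from the example breakdown above.
stage_minutes = {
    "build": 8,
    "tests": 11,
    "rollout": 4,
    "manual approval": 2,
}

total = sum(stage_minutes.values())
slowest = max(stage_minutes, key=stage_minutes.get)

for stage, minutes in stage_minutes.items():
    print(f"{stage}: {minutes} min ({minutes * 100 // total}% of deploy)")
print(f"total: {total} min, slowest stage: {slowest}")
```

Tracking the parts separately shows that tests take 44% of the 25 minutes here, so speeding up the build alone would not change much.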
Look for handoffs, copy-paste commands, one-off checks in a private note, and anything that makes a release depend on one person remembering the next step.
Schema changes deserve extra attention. Database migrations often turn a quick release into a slow one, or force deploys into off hours. If a migration locks tables, rewrites too much data, or needs careful operator timing, the team will delay shipping. Split risky migrations into smaller steps when you can, and make app changes tolerate both old and new schema for a short period.
Rollback steps also need to stay short enough to use under stress. If people need to read six pages, guess command order, or choose between three paths, rollback will be slow when it matters. Define who can trigger it, list the exact commands, note what to check after rollback, and write down how long it usually takes.
Teams that ship often do not need heroic release nights. They need boring deploys that finish fast and fail in a way people can reverse in minutes.
Fix alerts people already ignore
A noisy alert system trains people to look away. If the team gets 60 alerts a day and only two need action, the next real problem will blend in with the rest.
Count alerts by day, by service, and by person. One team member might get paged for database issues, deploy failures, error spikes, and disk warnings, while another gets nothing. That imbalance usually means the rules grew faster than the process.
For each alert, ask four plain questions. Did someone act on it the last few times it fired? Does it point to user pain, like failed logins or slow requests? Does it repeat the same event another tool already reported? Does the right person receive it at the right hour?
If nobody takes action, remove the alert or send it to a dashboard instead of a pager. A warning that stays open for weeks teaches the team that red does not mean urgent.
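The first question is the easiest one to answer with data. This sketch assumes you can export an alert history as pairs of rule name and whether someone acted. The rule names and records are hypothetical:

```python
from collections import Counter

def rules_never_acted_on(history: list) -> list:
    """Rules that fired at least once but never led to any action."""
    fired = Counter(rule for rule, _ in history)
    acted = Counter(rule for rule, was_acted_on in history if was_acted_on)
    return sorted(rule for rule in fired if acted[rule] == 0)

# Hypothetical export: (rule name, did someone act on it?)
history = [
    ("disk_warning", False),
    ("disk_warning", False),
    ("disk_warning", False),
    ("checkout_failures", True),
    ("cpu_spike", False),
    ("cpu_spike", True),
]

print("candidates to remove or move to a dashboard:", rules_never_acted_on(history))
```

Rules on that list are not automatically wrong, but each one needs a reason to keep paging people.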
Put user impact first. Alerts for checkout failures, API error spikes, queue backups, and broken deploys should stay loud. Minor CPU swings or short-lived retries can stay visible without waking anyone up.
Duplicate alerts are common after growth. One database slowdown can trigger app errors, latency alarms, queue lag notices, and uptime checks at the same time. Group those into one incident so the team sees one problem, not five versions of it.
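A simple heuristic for that grouping is time: alerts that fire within a few minutes of each other probably share a cause. This sketch assumes a five-minute window and made-up alert names. Alerting tools usually group by labels as well, which this example does not attempt:

```python
from datetime import datetime, timedelta

def group_into_incidents(alerts: list, window: timedelta = timedelta(minutes=5)) -> list:
    """Group alerts that fire within `window` of the previous alert."""
    incidents = []
    for timestamp, name in sorted(alerts):
        last_timestamp = incidents[-1][-1][0] if incidents else None
        if last_timestamp is not None and timestamp - last_timestamp <= window:
            incidents[-1].append((timestamp, name))
        else:
            incidents.append([(timestamp, name)])
    return incidents

# Hypothetical alerts from one afternoon.
alerts = [
    (datetime(2024, 1, 15, 14, 0), "database latency"),
    (datetime(2024, 1, 15, 14, 1), "app errors"),
    (datetime(2024, 1, 15, 14, 2), "queue lag"),
    (datetime(2024, 1, 15, 14, 3), "uptime check"),
    (datetime(2024, 1, 15, 16, 30), "disk warning"),
]

incidents = group_into_incidents(alerts)
for incident in incidents:
    print(f"{incident[0][0]:%H:%M} incident with {len(incident)} alert(s)")
```

Five alerts become two incidents: one database slowdown seen four ways, and one unrelated disk warning.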
On stacks that use Sentry, Grafana, Prometheus, and Loki together, this overlap happens fast. Clean alerting should feel almost boring. When the pager goes off, someone should know why it matters within a few seconds.
Run the review step by step
Use one recent busy day, not a weekly average. A single loaded day shows where the product bends under real traffic, while averages hide the rough parts. Pick a day with real customer activity, a deploy, and normal background jobs.
Move through the same areas in the same order every time. That keeps the review focused and stops the team from jumping straight to the loudest graph.
Start with caches and look for hit rate drops, rising miss spikes, and slow rebuilds after deploys. Then check connection pools and see whether requests wait too long for a free connection or whether the database stays under steady pressure. After that, review backups and confirm they still run in a quiet window without colliding with peak usage. Measure deploy time across the full path from merge to healthy production, not just build time. End with alerts and find the ones that fire often, get ignored, or wake people up for noise.
For each area, write down three things: the problem, the user impact, and the fix. Keep it plain.
Mistakes that waste review time
A lot of teams lose hours on the wrong problems. They chase the weird spike from last Tuesday, then miss the slow daily drift that keeps showing up every afternoon. Rare events matter, but repeated patterns usually tell you where the next incident will start.
Averages cause the same problem. A database can look calm across the whole day while one ugly hour does all the damage. If requests pile up from 2 to 3 PM, the daily average hides it. Check the worst hour, not just the nice-looking summary.
Backups fool people too. A green "backup completed" status only proves that a job ran. It does not prove you can restore data fast enough, or restore it at all. Teams often find this out at the worst time, when a restore needs extra steps, missing secrets, or more disk space than anyone planned for.
Alerting gets messy for a different reason. People keep every alert because one might matter someday. Then the inbox fills up, phones buzz all night, and the team learns to ignore noise. An alert that nobody trusts is dead weight.
Big redesign plans can waste even more time. If deploys take 18 minutes, caches miss too often, and a connection pool runs hot, fix those first. Small changes often buy weeks or months of breathing room: tune pool limits after checking real query load, trim alert rules that never lead to action, test one restore end to end, and cut slow deploy steps that add no safety.
Most wins here are boring. That is fine. Boring fixes prevent loud outages.
Example: a small SaaS team after a growth spike
A B2B SaaS team gets a nice surprise: signups jump after a partner mentions the product in a newsletter. Traffic holds, nothing crashes, and everyone assumes the stack handled it well. Two weeks later, support tickets say the dashboard feels slow for returning users.
The review shows why. New users load fresh data, but returning users should hit cache. Instead, the cache hit rate drops after a recent feature adds more user-specific queries and shorter cache lifetimes. The app still works, yet each miss adds more database reads, and the slowdown shows up first on the pages people open every day.
Backups look fine on paper too, until the team checks where users actually log in from. Nightly jobs run during off hours for the home office, but those jobs now overlap with active customers in another time zone. Disk activity spikes, pages take longer to load, and the pattern looks random until someone compares backup times with response time charts.
Then a bug slips into production during a busy afternoon. The fix is simple, but deploys take 18 minutes because the pipeline runs heavy steps one after another. That delay feels much longer when users wait and errors keep coming in.
At first, alerts do not help. The team gets so many low-value pings that they ignore half of them. After they clean up alert noise, one signal finally stands out: database connections keep hitting the pool limit when cache misses rise and backups run.
That is the real issue. The team adjusts connection pool limits, trims expensive queries, moves backups to a better window, and shortens deploy time. After that, the product feels stable again, not just lucky.
Quick checks before the next traffic jump
Traffic spikes rarely create new problems. They expose small limits that looked fine at normal load. End the review with a short pass over the parts that usually fail first.
Watch cache hit rate during your busiest hour, not the daily average. Leave headroom in connection pool limits so background jobs, admin access, and one unexpected task do not push the pool to the edge. Check when backups start, how long they run, and what else runs at the same time. Time a normal deploy from merge to stable production. Review alerts with the person who gets paged.
These checks are simple, but they force honest answers. Maybe the cache still helps, but only until one popular page goes cold. Maybe the pools look fine, but they leave no room for a migration or a support session. Maybe backups finish, but they slow the app for 20 minutes every morning.
That is usually enough to set priorities. Fix the issue that hurts peak traffic first, then the one that slows releases, then the one that wakes people up for no reason. Alert cleanup can wait a day. A pool that blocks logins cannot.
What to do next with the findings
A review only helps if it turns into a short, ranked plan. Start with the issue users feel first. If cache misses make pages slow, fix that before you spend a day trimming backup logs. If deploys take 40 minutes and block urgent fixes, move that higher.
Keep the first pass small. Pick one problem to solve this week, one or two to schedule, and a few to watch. Teams get stuck when they try to clean up everything at once.
Turn notes into a repeatable routine
Some checks should become part of the normal calendar, not a one-time cleanup. Run a restore drill every month, not just a backup check. Measure deploy time each month and note where it slows down. Keep a short list of limits to recheck after growth, such as cache hit rate, connection pool limits, backup duration, and alert volume. Revisit that list after traffic jumps, a pricing launch, or a new large customer.
This also helps during stressful weeks. Instead of arguing about what matters, the team can compare fresh numbers with the last review and see what changed.
If your team keeps debating priorities, outside help can speed things up. Someone like Oleg Sotnikov at oleg.is can review the stack, rank the work, and keep the scope tight, especially when the team is busy shipping and does not want a big rewrite.
Write the outcome down in plain language. One page is enough: the problem, the user impact, the limit you will watch, the owner, and the date for the next check. That page is usually more useful than a long audit document nobody opens again.
When the next traffic jump hits, you want a tested restore, a deploy that still finishes fast, and a short list of limits to check before small cracks turn into an outage.
Frequently Asked Questions
Why can a product feel fine at launch and slow down later?
Launch traffic often hits the same few pages at the same time. A few weeks later, real users spread across dashboards, exports, background jobs, and mobile sessions. That wider pattern exposes slow queries, tight connection pools, and weak caches that launch week never touched.
What numbers should I compare after launch?
Compare launch week with the last two to four weeks. Look at busiest hours, not daily averages. Check page speed on busy screens, failed jobs, queue delays, database latency, connection waits, and growth by service so you can see what actually changed.
How do I know the cache is becoming a problem?
Watch the cache hit rate, evictions, memory use, and response time on your busiest routes. If hit rate drops, entries expire before users come back, or evictions jump during normal traffic, the cache stopped saving enough work.
Should I test the app with a cold cache?
Yes, you should. Clear part of the cache in staging and watch page speed, database load, and queue depth. That test shows whether one restart or deploy will turn a fast page into a slow one.
How do connection pools create random slowness?
Every worker wants database connections. When web processes, job runners, and admin tasks all ask at once, requests wait instead of running. Users feel that as uneven speed, timeouts, and random retries.
When do backups start affecting production?
Backups start hurting when they overlap with active users or take much longer than before. Time them on a normal weekday and watch CPU, disk I/O, network use, and replica lag. Then run a restore test, because a green backup job does not prove recovery will go smoothly.
What is the best way to measure deploy time?
Break deploy time into build, test, rollout, and manual wait time. That makes the real bottleneck obvious. Teams often find that approvals, schema changes, or slow rollback steps cause more pain than the build itself.
Which alerts should wake someone up?
Page someone for user-facing failures, rising API errors, queue backups, broken checkout flows, and failed deploys. Send short CPU spikes and repeat noise to dashboards instead. If nobody acts on an alert, remove it or rewrite it.
How often should we do this review?
Run it after noticeable traffic growth, a feature launch, a pricing change, or a new large customer. Even without a trigger, a monthly check works well for most SaaS teams because small drifts add up fast.
What should we fix first if we find several issues?
Fix the issue users feel first. If cache misses or pool waits make pages slow, handle that before you tune backup logs or clean up dashboards. Most teams do not need a rewrite here; they need a short ranked plan and one solid fix this week.