Mar 16, 2025·8 min read

Goroutine leaks in Go services: hidden patterns to fix

Learn how goroutine leaks in Go services hide behind normal metrics. Spot worker pool issues, channel mistakes, and missed cancels early.

Why healthy services still leak

A Go service can return 200s all day and still leak goroutines. Users see normal pages, jobs keep moving, and dashboards may look calm. The leak grows off to the side because some work starts and never reaches an exit.

That is why goroutine leaks in Go services are easy to miss. A CPU spike gets attention fast. A leak usually stays quiet. It adds a little memory, keeps a timer alive, holds a socket open, or leaves one worker waiting on a channel that nobody will use again. One stuck goroutine rarely hurts. A few thousand will.

High load and stuck goroutines are different. Under real traffic, goroutine count goes up and then comes back down when traffic drops. Response time may get worse for a while, but the shape matches demand. A leak does not follow that pattern. The count keeps drifting upward, even at night, because some goroutines stay parked on a receive, a send, a lock, or a context that nobody cancels.

The garbage collector does not fix this for you. If a goroutine is still alive, its stack stays alive too. That often means objects tied to that goroutine stay in memory longer than you expect. So the service looks healthy from the outside while it gets heavier inside.

The first signs are usually plain:

memory grows a little every hour or every deploy
shutdown hangs because background tasks never stop
goroutine count rises over days instead of tracking traffic

You may notice smaller clues first. A restart takes longer than it used to. A test hangs once every few runs. A worker pool seems full when request volume looks normal. These are easy to dismiss because the service still answers requests.

That is what makes leaks annoying. Success responses only tell you the request path worked. They do not tell you whether every goroutine that started for that request actually finished.

Leak patterns that show up in worker pools

A worker pool often leaks in quiet ways. Requests still finish, CPU stays normal, and the service looks fine. Meanwhile, a few goroutines keep waiting on channels or retrying work that nobody will ever read.

One common pattern starts with workers that sit in a loop like for job := range jobs. That code is fine only if someone closes jobs at the right time. If the producer stops early, or shutdown skips the close, those workers wait forever. In small numbers, you barely notice. In a busy API, each config reload or pool restart inside the same process can leave another batch behind.

A simple example is a pool that handles webhook delivery. The service starts 20 workers on boot. Later, an error path replaces the producer, but the old jobs channel never closes. Traffic still moves through the new path, so nobody spots the old workers. This is how goroutine leaks in Go services slip past normal health checks.
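
The safe version of that loop can be sketched as follows. This is a minimal illustration, not the webhook service itself: runPool and its parameters are hypothetical names, and the only point is that the code that feeds the channel also closes it.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool (hypothetical) starts n workers, feeds them items, and returns how
// many jobs were processed. This function is the single owner of the jobs
// channel: it is the only sender and the only place that closes it.
func runPool(n int, items []int) int {
	jobs := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The range loop ends only when jobs is closed and drained.
			for range jobs {
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	for _, it := range items {
		jobs <- it
	}
	close(jobs) // without this close, wg.Wait() below would block forever
	wg.Wait()
	return processed
}

func main() {
	fmt.Println(runPool(4, []int{1, 2, 3, 4, 5, 6})) // 6
}
```

If an error path swaps in a new producer, the old channel still needs this close, otherwise the first batch of workers stays parked in the range loop.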

When senders outlive receivers

The reverse problem is just as common. A receiver exits on error or timeout, but senders keep pushing jobs into the channel. If that channel is unbuffered, the sender blocks right away. If it has a buffer, the service keeps working until the buffer fills, then more goroutines pile up behind the send.

This gets worse when code tries to "fix" blocking with go func() { jobs <- item }(). That removes backpressure, but now every slow or dead worker can leave another goroutine stuck on send. A burst of traffic turns a small bug into hundreds of parked goroutines.
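
One way to keep backpressure without risking a stuck sender is to let the send race against a context. The sketch below assumes a hypothetical helper named trySend and simulates a pool whose workers have already exited.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// trySend (hypothetical helper) blocks until the job is accepted or ctx ends,
// whichever comes first. The caller keeps backpressure but always gets an exit.
func trySend(ctx context.Context, jobs chan<- int, item int) error {
	select {
	case jobs <- item:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sendToDeadPool simulates a pool whose workers are gone: nobody reads jobs.
func sendToDeadPool() error {
	jobs := make(chan int) // unbuffered and never read
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return trySend(ctx, jobs, 42)
}

func main() {
	fmt.Println(sendToDeadPool()) // context deadline exceeded
}
```

The caller gets an error it can log or count, instead of a goroutine that silently parks on the send.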

Retry loops that never settle

Retry logic can turn one failed job into a leak factory. A worker hits an error, sleeps, and tries again forever. If the downstream service stays down, that worker never returns to the pool. Some teams make it worse by spawning a new goroutine for each retry or by requeueing the same job without a stop condition.
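
A bounded retry can be sketched like this. The names retry, errDown, and the attempt limit are assumptions for illustration; a real service would likely add backoff and jitter on top.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errDown = errors.New("downstream unavailable")

// retry (hypothetical helper) calls fn up to maxAttempts times, waiting delay
// between attempts. It stops early if ctx ends. It returns the number of
// attempts made and the last error.
func retry(ctx context.Context, maxAttempts int, delay time.Duration, fn func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		timer := time.NewTimer(delay)
		select {
		case <-timer.C:
		case <-ctx.Done():
			timer.Stop()
			return attempt, ctx.Err()
		}
	}
	return maxAttempts, err
}

// attemptsAgainstDeadService shows that the worker gives up after 3 tries.
func attemptsAgainstDeadService() int {
	n, _ := retry(context.Background(), 3, time.Millisecond, func() error { return errDown })
	return n
}

func main() {
	fmt.Println(attemptsAgainstDeadService()) // 3
}
```

The retry runs inside the worker that already owns the job, so no extra goroutine is spawned per attempt and the worker returns to the pool after a known number of tries.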

Background workers can pile up in a similar way. A scheduler starts a pool every minute, on every tenant, or after every config refresh, but it never stops the old one. The service still answers requests, so the bug hides until memory rises or shutdown hangs.

A few checks catch most of these cases:

Make one owner responsible for closing the job channel.
Stop producers before workers exit, or signal both sides with the same context.
Set a retry limit or deadline for each job.
Never hide a blocking send inside a new goroutine.
Count active workers and watch whether that number returns to baseline after load drops.

Worker pools are useful, but they are not self-cleaning. If a pool has no clear start, stop, and retry rules, it usually leaks long before memory makes the problem obvious.

How channel misuse traps goroutines

Many goroutine leaks in Go services start with one blocked channel operation. The service still answers requests, dashboards still look fine, and the leak grows slowly in the background.

A send can block forever when nobody reads from that channel anymore. This happens a lot after a timeout, an early return, or a worker that stopped on error while producers kept sending. A buffered channel only hides the problem for a while. Once the buffer fills, every sender parks and stays in memory.

The other side is just as common. A goroutine that waits on <-ch or loops over for v := range ch needs a sender or a close. If neither happens, it never exits. One forgotten close in a fan-out path can leave dozens of readers stuck. You see this in request pipelines where one stage stops early but the next stage still waits for more work.

Three traps show up again and again:

A channel variable stays nil, so any send or receive on it blocks forever.
A producer returns without closing the output channel, and downstream goroutines keep ranging.
A select waits on cases that can no longer make progress.

Nil channels are especially sneaky. Inside select, a nil channel disables that case. That can be useful when you do it on purpose. It is painful when a channel was never initialized or got set to nil too early, because now one branch of your logic silently stops working.
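
A small sketch makes the nil-channel behavior visible. waitOn is a hypothetical helper, and the timeout case exists only so the demo can finish.

```go
package main

import (
	"fmt"
	"time"
)

// waitOn reports which select case fired. A nil ch never becomes ready, so
// with a nil argument only the timeout case can win.
func waitOn(ch chan int) string {
	select {
	case <-ch:
		return "received"
	case <-time.After(20 * time.Millisecond):
		return "timed out"
	}
}

func main() {
	var uninit chan int // declared but never made: nil
	ready := make(chan int, 1)
	ready <- 1

	fmt.Println(waitOn(uninit)) // timed out
	fmt.Println(waitOn(ready))  // received
}
```

Without the timeout case, the first call would block forever, which is exactly how an uninitialized channel turns into a parked goroutine.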

Blocked select statements also create leaks when they miss an exit path. If a goroutine waits on results <- item or <-jobs but never listens for ctx.Done(), it can outlive the request that created it. The request ends, the caller leaves, and the goroutine keeps waiting for an event that will never come.

A small habit helps a lot: every goroutine should have a clear stop condition. For channel code, that usually means one of these rules: the sender closes, the receiver listens for context cancellation, or both sides stop through a shared shutdown path. If you cannot point to that exit, the goroutine probably has none.
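
Those rules can be combined in one pipeline stage. In this sketch, generate and sumFirst are hypothetical names: the sender closes its output, and it also watches the context so a reader that stops early does not strand it.

```go
package main

import (
	"context"
	"fmt"
)

// generate sends 0..n-1 on the returned channel. It closes the channel when
// it is done and gives up if ctx is cancelled.
func generate(ctx context.Context, n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out) // sender closes, so downstream range loops end
		for i := 0; i < n; i++ {
			select {
			case out <- i:
			case <-ctx.Done():
				return // reader left early: stop instead of blocking on send
			}
		}
	}()
	return out
}

// sumFirst reads at most k values (k >= 1) and then stops early.
func sumFirst(n, k int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // releases the generator if we break out early

	sum, count := 0, 0
	for v := range generate(ctx, n) {
		sum += v
		count++
		if count == k {
			break
		}
	}
	return sum
}

func main() {
	fmt.Println(sumFirst(100, 3)) // 3: reader stops early, cancel frees the sender
	fmt.Println(sumFirst(3, 10))  // 3: sender finishes and closes, range ends
}
```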

Forgotten cancels and background work

A lot of leaks start with work that outlives the request that created it. The handler returns, the client is gone, but a child goroutine keeps polling, retrying, or waiting on I/O. The service still looks fine for hours because each leaked goroutine does almost nothing. Keep traffic high for long enough, though, and thousands of them pile up.

One common cause is losing the request context on the way down. A handler gets r.Context(), but a helper starts work with context.Background() or never passes a context at all. That small break means the child task no longer knows when the request timed out, failed, or got canceled.

A simple example: an API request starts a goroutine to fetch data from two backends and write the result to a channel. If the client disconnects early and the goroutine ignores the original context, it may sit on a blocked send or keep retrying for minutes. You will not notice it in a quick test. You will notice it in production.

Timers and tickers create the same kind of slow leak. A time.NewTicker keeps firing until you call Stop(). A retry loop with a ticker inside a goroutine can stay alive forever if nothing stops it. time.After inside a loop can also add pressure because it creates a new timer each time, and before Go 1.23 each of those timers stayed allocated until it fired.
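
A ticker loop with both a Stop and a cancel path might look like the sketch below. poll and pollReturns are hypothetical names, and the intervals are shortened so the demo finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// poll calls fn on every tick until ctx ends, then returns the tick count.
func poll(ctx context.Context, every time.Duration, fn func()) int {
	ticker := time.NewTicker(every)
	defer ticker.Stop() // stop the ticker on every exit path

	ticks := 0
	for {
		select {
		case <-ticker.C:
			ticks++
			fn()
		case <-ctx.Done():
			return ticks
		}
	}
}

// pollReturns checks that the loop actually exits once its context expires.
func pollReturns() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	done := make(chan struct{})
	go func() {
		poll(ctx, 5*time.Millisecond, func() {})
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println(pollReturns()) // true
}
```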

Long-lived background tasks need a clear stop signal too. Cache refreshers, queue pollers, metric pushers, and cleanup loops should all watch ctx.Done() and return fast. If a goroutine has a for loop, check how it exits. If the answer is "it usually keeps running," that is a problem.

A few habits catch most of this early:

Pass the same context from the request into every child call that should end with that request.
Use context.WithCancel or context.WithTimeout, and always call the cancel function.
Stop every ticker and timer you create when the work ends.
Put case <-ctx.Done(): return in long loops that wait, poll, or retry.
During tests, compare goroutine counts before and after canceled requests.
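
The first two habits can be sketched in a handler. fetchBackend and the one-second budget are assumptions for illustration; the part that matters is deriving the child context from r.Context() and deferring cancel.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchBackend is a hypothetical slow call that gives up when ctx ends.
func fetchBackend(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // pretend the backend is slow
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Derive from r.Context(), not context.Background(), so a client
	// disconnect or server timeout reaches the child call.
	ctx, cancel := context.WithTimeout(r.Context(), time.Second)
	defer cancel() // runs on every return path, not only on errors

	if err := fetchBackend(ctx); err != nil {
		http.Error(w, err.Error(), http.StatusGatewayTimeout)
		return
	}
	fmt.Fprintln(w, "ok")
}

// statusForCancelledRequest simulates a client that has already gone away.
func statusForCancelledRequest() int {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	req := httptest.NewRequest(http.MethodGet, "/", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	handler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusForCancelledRequest()) // 504
}
```

Because the request context is already cancelled, the helper returns at once instead of waiting out the slow backend.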

Teams that run lean Go services learn this fast. One forgotten cancel can waste more memory than a slow query, because it never leaves.

A simple example from a real service

A common production bug starts with a request that fans out to a few helpers. Think of a Go API that builds one response from PostgreSQL, Redis, and an AI call. To keep the page fast, the handler starts three goroutines and waits for their answers.

The code looks harmless:

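func handle(ctx context.Context, id string) (Resp, error) {
    // Unbuffered: a helper's send blocks until the handler receives.
    results := make(chan Part)
    errs := make(chan error, 1)

    // All three helpers share the caller's ctx; nothing here cancels it.
    go loadProfile(ctx, id, results, errs)
    go loadCache(ctx, id, results, errs)
    go loadSummary(ctx, id, results, errs)

    var resp Resp
    for i := 0; i < 3; i++ {
        select {
        case part := <-results:
            resp.Merge(part)
        case err := <-errs:
            // Early return: nobody receives from results after this point.
            return Resp{}, err
        }
    }

    return resp, nil
}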

The leak appears when one branch fails fast. Say loadCache hits a timeout and sends an error. The handler returns right away. That sounds fine, but the other two goroutines may still be running. If they later try results <- part, nobody is left to receive. They block and stay in memory.

This gets worse when the workers ignore cancellation. If the handler does not create a child context and call cancel(), those background calls keep waiting on the database, network, or model API even though the request is already dead. One bad path leaves two goroutines behind.

On a quiet service, nobody notices. A few blocked goroutines do not break anything. CPU stays normal. Success rate can still look good. That is why goroutine leaks in Go services often hide in systems that seem healthy.

Keep steady traffic on that endpoint for a few hours, though, and the math changes. If 2 percent of requests exit early and each one leaks two goroutines, the service keeps piling up stuck work. At 100 requests per second, for example, that is 4 leaked goroutines per second, or about 14,400 per hour. Memory climbs slowly. Garbage collection runs more often. Connection pools stay busy longer than they should. Then the service starts feeling random: a bit slower, a bit noisier, harder to trust.

That is the pattern to watch for: fan-out, early return, no cancel, and goroutines still waiting on a send.
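
One possible repair is sketched below. It is not the original service code: gather, loader, and result are hypothetical stand-ins for the handler and its three helpers. Two changes matter: the results channel has one buffer slot per helper, so a late send never blocks, and a deferred cancel tells the remaining helpers to stop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

type Part string

// loader stands in for loadProfile, loadCache, and loadSummary.
type loader func(ctx context.Context) (Part, error)

type result struct {
	part Part
	err  error
}

// gather runs every loader and returns on the first error.
func gather(parent context.Context, loaders ...loader) ([]Part, error) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // on early return, remaining helpers see ctx.Done()

	// One slot per helper: a send after the handler left cannot block.
	results := make(chan result, len(loaders))
	for _, load := range loaders {
		go func(load loader) {
			p, err := load(ctx)
			results <- result{p, err}
		}(load)
	}

	parts := make([]Part, 0, len(loaders))
	for range loaders {
		select {
		case r := <-results:
			if r.err != nil {
				return nil, r.err
			}
			parts = append(parts, r.part)
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return parts, nil
}

// failFastError runs two slow helpers and one that fails at once, then waits
// until all three helper functions have returned.
func failFastError() string {
	var running sync.WaitGroup
	slow := func(ctx context.Context) (Part, error) {
		defer running.Done()
		<-ctx.Done() // stands in for a DB or network call that honors ctx
		return "", ctx.Err()
	}
	fail := func(ctx context.Context) (Part, error) {
		defer running.Done()
		return "", errors.New("cache timeout")
	}

	running.Add(3)
	_, err := gather(context.Background(), slow, fail, slow)
	running.Wait() // would hang if a helper ignored the cancel
	return err.Error()
}

func main() {
	fmt.Println(failFastError()) // cache timeout
}
```

The buffer alone stops the blocked sends, but the helpers would still run to completion. The cancel is what ends the database, network, or model call early.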

How to find the leak step by step

A leak gets easier to fix when you turn it into a count, a stack trace, and a repeatable test. Guessing wastes time. A small routine helps you catch the exact place where goroutines stop making progress.

Start with numbers. Record the goroutine count at idle, then during load, then a few minutes after the load stops. runtime.NumGoroutine() gives you a quick view, and a goroutine profile shows the blocked stacks behind that number. In a service with Prometheus, Grafana, or pprof, this usually takes a few minutes to wire up.

If the count rises during traffic and does not fall back near baseline after idle time, you likely have a real leak. A healthy service may spike under load, but those goroutines should finish their work and disappear.
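
To make the measurement concrete, the sketch below leaks a few goroutines on purpose, reports the growth, and dumps the goroutine profile. leak and growth are hypothetical names, and in a real service you would sample the count from a metrics endpoint rather than from main.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// leak starts n goroutines that block forever on a channel nobody writes to.
func leak(n int) {
	never := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-never }()
	}
}

// growth returns how much the goroutine count rose after leaking n of them.
func growth(n int) int {
	before := runtime.NumGoroutine()
	leak(n)
	time.Sleep(10 * time.Millisecond) // give the goroutines time to park
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines added:", growth(5))

	// With debug=1 the profile groups identical stacks and prints a count
	// per group, so a leak shows up as one frame repeated many times.
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
}
```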

What to inspect first

Look at the goroutine stacks in groups. You will usually see the same blocked frame repeated many times. That pattern tells you where to dig:

workers waiting forever on a receive
senders blocked on a full channel
goroutines stuck in a select with no cancel path
HTTP calls, DB calls, or timers with no timeout
background loops that never stop on shutdown

Worker pools deserve extra attention. Check who owns the job channel, who closes it, and what tells workers to exit. If producers can keep sending after consumers stop, you get stuck senders. If workers wait on a channel that nobody closes, they sit there forever.

Then trace context flow. Every request-scoped goroutine should inherit a context, and code that creates context.WithCancel or context.WithTimeout should call the cancel function. Forgotten cancels often hide in retries, fan-out calls, and background helpers.

Add guards while you debug. Put timeouts on external calls. Add logs or metrics when workers start and stop. Write a shutdown test that starts the service, pushes some traffic through it, stops it, waits a few seconds, and checks whether the goroutine count returns close to baseline.
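
A baseline check for that shutdown test could be sketched like this. startService is a hypothetical stand-in for real background workers, and the one-second budget is an arbitrary choice.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// settles polls until the goroutine count is at or below baseline, or until
// the time budget runs out.
func settles(baseline int, within time.Duration) bool {
	deadline := time.Now().Add(within)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= baseline {
			return true
		}
		time.Sleep(5 * time.Millisecond)
	}
	return runtime.NumGoroutine() <= baseline
}

// startService stands in for a service that starts background workers.
// Each worker exits when ctx is cancelled.
func startService(ctx context.Context, workers int) {
	for i := 0; i < workers; i++ {
		go func() { <-ctx.Done() }()
	}
}

// shutdownIsClean starts the fake service, stops it, and checks the count.
func shutdownIsClean() bool {
	baseline := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	startService(ctx, 10)
	cancel()
	return settles(baseline, time.Second)
}

func main() {
	fmt.Println(shutdownIsClean()) // true
}
```

Polling with a short budget tends to be less flaky than a single sleep followed by one comparison, because goroutines need a moment to unwind after cancel.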

Run the same load again after each fix. Compare the count during idle, not only at peak traffic. If you still see extra goroutines after the service calms down, read the new profile and repeat. That loop is boring, but it finds leaks faster than staring at code for an hour.

Mistakes that create leaks

Most goroutine leaks in Go services come from ordinary code, not strange edge cases. The service looks healthy, requests still pass, and logs stay quiet. Meanwhile, a few stuck goroutines pile up every hour until memory and file descriptors start to creep.

The first mistake is ownership. If you start a goroutine, some part of your code must own its full life cycle. That owner should know why the goroutine exists, when it should stop, and what signal tells it to exit. If nobody owns it, cleanup turns into wishful thinking.

A common example is request code that starts background work "just for now". The request ends, but the goroutine keeps waiting on a channel or retrying a call. After enough traffic, you have hundreds of leftovers from jobs nobody watches anymore.

Database and network calls cause another quiet leak. If you skip context propagation, a slow query or hung HTTP call can keep a goroutine blocked far longer than the request that created it. You do not need a total outage for this. One bad dependency, a tiny timeout bug, or a dead TCP connection is enough.

Background retry loops deserve extra suspicion. Many teams write a loop that retries forever with a sleep, then forget to add a stop condition. That code feels safe because it "keeps trying," but it often traps a goroutine for the life of the process. Put limits on retries, backoff, or both, and let context end the loop.

Tickers and timers look harmless, but they leak more often than people admit. A ticker started in a worker, cache refresher, or metrics loop will keep firing until you stop it. If the goroutine exits without calling Stop(), you leave work behind. If the goroutine never exits, the ticker keeps it alive.

A short checklist helps:

Give each goroutine one clear owner and one clear shutdown path.
Pass context.Context into DB queries, HTTP calls, and any blocking operation.
Cap retries in background loops, or stop them when context ends.
Stop every ticker and timer in the same code path that created it.

I would treat any go func() line as a small liability until the exit path is obvious. That habit catches more leaks than fancy tooling, and it catches them early.

Quick checks before memory climbs

A leaky service often looks fine during steady traffic. The trouble shows up after a burst, when requests fall off but goroutines do not. That is one of the fastest ways to spot goroutine leaks in Go services before memory starts to drift up.

Run a short load test, then let the service sit. If the goroutine count stays high minutes later, something is still waiting on a channel, a timer, a network call, or a context that never ends.

A few checks catch a lot of leaks early:

Watch whether the goroutine count drops back near its starting level after traffic falls.
Stop the service and see if shutdown finishes quickly.
Read each worker loop and confirm it has one clear exit path.
Check every spawned task for a timeout or a cancel path.

Shutdown speed tells you more than many logs do. A healthy service usually stops fast because workers exit when their context closes, queues stop accepting work, and background loops return. If shutdown hangs, some goroutine still waits for input that will never come.

Worker code should be boring. That is a good sign. A worker reads from one place, handles one job, and exits when the channel closes or the context ends. When a worker can leave in five different ways, people forget one. That is where leaks hide.
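
A boring worker, in that sense, might look like the sketch below. worker, waitFor, and exitsOnBothPaths are hypothetical names. The loop has exactly two ways out, and the demo exercises both.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// worker reads from one place, handles one job at a time, and exits when the
// channel closes or the context ends.
func worker(ctx context.Context, jobs <-chan int, handle func(int)) {
	for {
		select {
		case job, ok := <-jobs:
			if !ok {
				return // channel closed: no more work
			}
			handle(job)
		case <-ctx.Done():
			return // shutdown requested
		}
	}
}

// waitFor reports whether done closes within one second.
func waitFor(done <-chan struct{}) bool {
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

// exitsOnBothPaths checks each exit path on its own worker.
func exitsOnBothPaths() bool {
	// Path 1: the owner closes the jobs channel.
	jobs := make(chan int)
	closed := make(chan struct{})
	go func() {
		worker(context.Background(), jobs, func(int) {})
		close(closed)
	}()
	close(jobs)

	// Path 2: the context is cancelled while the channel stays open.
	ctx, cancel := context.WithCancel(context.Background())
	cancelled := make(chan struct{})
	go func() {
		worker(ctx, make(chan int), func(int) {})
		close(cancelled)
	}()
	cancel()

	return waitFor(closed) && waitFor(cancelled)
}

func main() {
	fmt.Println(exitsOnBothPaths()) // true
}
```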

Timeouts deserve the same level of discipline. If you start a goroutine for a database call, webhook, retry loop, or cache refresh, give it a deadline. If you create a context with cancel, call the cancel function on every path, not only on errors.

A simple test helps: send a burst of jobs, wait for the queue to drain, then check whether the process settles down. In a healthy service, counts flatten out. In a leaky one, they keep stacking a little higher after each burst. That slow climb is easy to miss in production and cheap to catch in staging.

If one of these checks fails, pause and fix it before adding more traffic. Leaks rarely stay small.

What to do next

Treat leak prevention like part of release work, not a cleanup task for later. Most goroutine leaks in Go services stay quiet for a while, then show up as rising memory, slower restarts, or pods that never settle after traffic drops.

Add one habit to your tests: start the code, stop it, and check that the goroutine count returns close to baseline. Run the same check in staging after load tests. A service can look fine under traffic and still leave workers blocked on channels, timers, or retry loops after the request path ends.

Track goroutine count next to memory and latency. Those three numbers tell a better story together than any of them alone. If memory climbs slowly but goroutines jump after each deploy, you probably have background work that never shuts down.

A short checklist is enough:

Run leak-focused tests in CI for worker pools, retries, and shutdown paths.
Put goroutine count on the same dashboard as memory, CPU, and latency.
Alert on steady growth over time, not only on a hard limit.
During incident follow-ups, review how pools stop, who closes channels, and where contexts end.
Test restart behavior, because many leaks appear only when code starts and stops more than once.

Keep incident reviews specific. Do not stop at "we saw too many goroutines." Write down which goroutine started, what it waited on, and what signal should have stopped it. That gives the next engineer a fixable problem.

A small team often misses the same leak twice because nobody steps back and reviews the pool design itself. That review matters. A worker pool that looks tidy in code can still hide stuck sends, unbounded retries, or one missing cancel.

If your team wants outside help, Oleg Sotnikov offers practical CTO advice for Go services and production operations. His work spans Go, infrastructure, observability, and lean production systems, so he tends to focus on the boring parts that usually cause the leak: shutdown flow, ownership of channels, and background jobs that outlive their request.

Make every goroutine easy to start, easy to see, and easy to stop. When your team checks that in tests and staging, leaks stop being a surprise.