Mar 02, 2025·8 min read

Go to Rust for one performance hotspot, not the whole app

Go to Rust for one performance hotspot only after you measure the bottleneck, price the learning curve, and set a clear rollback plan.

Why one slow path can trigger rewrite talk

Users do not judge your system by average speed. They judge it by the moment that makes them wait, refresh, or give up. One slow path can poison the whole product, even if 95% of requests finish fast.

You see this a lot in file processing, search, exports, video work, pricing engines, and batch jobs. A customer clicks one button and waits 12 seconds. Support hears about that delay every week. The database, API, admin panel, and billing flow may all be fine, but that one path becomes "the app is slow."

That is why rewrite talk starts early. A team sees one painful bottleneck and jumps to a bigger story: "Maybe Go is the problem. Maybe we should move everything." That reaction is understandable, but it is usually too big. Most of the platform may already run well enough. Rewriting healthy parts often buys very little.

A full rewrite has a real price:

new bugs in places that used to be stable
months of delay before users feel any gain
stress on the team while two stacks overlap
slower feature work while engineers learn and rebuild

That cost is easy to ignore when one hotspot hurts. Still, replacing an entire platform because one path is slow is like rebuilding a house because one room gets too hot. Fix the room first.

This is where Rust enters the conversation. If one function or service burns CPU time, memory, or cloud spend, Rust can be a smart bet for that slice alone. The idea behind "Go to Rust for one performance hotspot" is not to chase purity. It is to contain risk. Keep the parts that already work. Move only the path that blocks users or eats money.

That approach also gives you a cleaner decision. If the hotspot speeds up enough, keep the mixed setup. If the gain stays small, roll it back without dragging the whole company through a rewrite. A narrow fix is often the adult choice, even when the engineering brain wants a fresh start.

Measure the hotspot before you choose Rust

Most rewrite talk starts too early. A slow screen or a busy CPU graph does not mean Go is the problem. First, prove that one part of your code burns enough time or memory to justify the switch.

Start with real production data. Pull traces for the exact endpoint, worker, or batch job that hurts users most. Then check CPU profiles, heap profiles, allocation counts, and memory growth under normal load. A tiny benchmark on a laptop can mislead you fast.

Averages hide the pain. If a request looks fine at 120 ms on average but jumps to 1.8 s at p99, users still feel it. Chart p95 and p99 for the slow path, not just p50. Tail latency often comes from bursts, retries, garbage collection, or one bad code path that only shows up under pressure.
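A minimal sketch of why the average hides the tail, assuming you can export raw per-request durations for the slow path. The sample data is invented, and the nearest-rank method used here is only one way to compute a percentile; histogram-based tools such as Prometheus estimate quantiles differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (0 < p <= 100) of samples.
// It sorts a copy, so the caller's slice is left untouched.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// mean returns the arithmetic average of samples.
func mean(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	var total time.Duration
	for _, s := range samples {
		total += s
	}
	return total / time.Duration(len(samples))
}

// sampleLatencies builds invented data: 97 fast requests and 3 slow ones.
func sampleLatencies() []time.Duration {
	out := make([]time.Duration, 0, 100)
	for i := 0; i < 97; i++ {
		out = append(out, 70*time.Millisecond)
	}
	for i := 0; i < 3; i++ {
		out = append(out, 1800*time.Millisecond)
	}
	return out
}

func main() {
	s := sampleLatencies()
	fmt.Println("mean:", mean(s))
	fmt.Println("p50: ", percentile(s, 50))
	fmt.Println("p95: ", percentile(s, 95))
	fmt.Println("p99: ", percentile(s, 99))
}
```

With this invented sample the mean lands near 122 ms and even p95 looks healthy at 70 ms, while p99 sits at 1.8 s. That is the gap a p50-only chart would never show.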

You also need to split app time from waiting time. If your handler spends 70% of its life waiting on PostgreSQL, Redis, or another service, Rust will not rescue it, because a port can only shrink the other 30%. The same goes for network round trips. Measure how much time your Go code actually runs on CPU versus how much time it sits idle.

If you already run Prometheus and Grafana, this is usually easy to see. If you do not, logs with timestamps still give you a decent first cut. Either way, write one baseline table before anyone ports a line of code:

the endpoint or job name
p95 and p99 latency
CPU time and allocations per request
memory use at steady load and at peak
database and network wait share

Keep that table simple and frozen. It gives the team one reference point when opinions start flying.
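One way to keep the table frozen is to store it as a small file next to the code. The sketch below assumes a JSON file per slow path; the `Baseline` type, its field names, and the example numbers are all hypothetical, chosen only to mirror the list above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Baseline is one frozen row of the "before" table for a single slow path.
// Field names are illustrative assumptions, not a standard schema.
type Baseline struct {
	Name            string  `json:"name"`
	P95Millis       float64 `json:"p95_ms"`
	P99Millis       float64 `json:"p99_ms"`
	CPUMillisPerReq float64 `json:"cpu_ms_per_request"`
	AllocsPerReq    int     `json:"allocs_per_request"`
	SteadyMemMB     int     `json:"steady_mem_mb"`
	PeakMemMB       int     `json:"peak_mem_mb"`
	WaitShare       float64 `json:"wait_share"` // fraction of wall time spent on database and network
}

// Freeze serializes the baseline so it can be committed and reviewed.
func Freeze(b Baseline) ([]byte, error) {
	return json.MarshalIndent(b, "", "  ")
}

// Thaw reads a frozen baseline back for a later comparison.
func Thaw(data []byte) (Baseline, error) {
	var b Baseline
	err := json.Unmarshal(data, &b)
	return b, err
}

// exampleBaseline returns invented numbers for a hypothetical resize worker.
func exampleBaseline() Baseline {
	return Baseline{
		Name:            "image-resize-worker",
		P95Millis:       310,
		P99Millis:       1800,
		CPUMillisPerReq: 42.5,
		AllocsPerReq:    1200,
		SteadyMemMB:     350,
		PeakMemMB:       900,
		WaitShare:       0.02,
	}
}

func main() {
	data, err := Freeze(exampleBaseline())
	if err != nil {
		fmt.Println("freeze failed:", err)
		return
	}
	fmt.Println(string(data))
}
```

Committing the file gives the baseline a history, so any later change to the "before" numbers shows up in review.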

A small example makes this clearer. Say a Go image processing worker uses 85% CPU during resize and color conversion, while database time stays near zero. That is a clean candidate for a Rust trial. But if your API spends 900 ms waiting on queries and only 40 ms in Go code, a Rust port is mostly theater.
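The arithmetic behind "mostly theater" can be sketched in a few lines. This assumes the port speeds up only the CPU-bound part and leaves waiting time untouched; the function names and the 2x and 10x factors are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// afterPort estimates request time if only the CPU-bound part gets faster.
// speedup is the factor applied to cpu (2 means twice as fast); wait is untouched.
func afterPort(wait, cpu time.Duration, speedup float64) time.Duration {
	if speedup <= 0 {
		return wait + cpu
	}
	return wait + time.Duration(float64(cpu)/speedup)
}

// gainPercent reports how much of the original request time the port removes.
func gainPercent(wait, cpu time.Duration, speedup float64) float64 {
	before := wait + cpu
	if before <= 0 {
		return 0
	}
	after := afterPort(wait, cpu, speedup)
	return 100 * float64(before-after) / float64(before)
}

func main() {
	// The API from the example: 900 ms waiting on queries, 40 ms in Go code.
	wait, cpu := 900*time.Millisecond, 40*time.Millisecond
	for _, k := range []float64{2, 10} {
		fmt.Printf("%.0fx faster code: %v -> %v (%.1f%% gain)\n",
			k, wait+cpu, afterPort(wait, cpu, k), gainPercent(wait, cpu, k))
	}
}
```

Even a routine that runs 10 times faster moves this request from 940 ms to 904 ms, a gain under 4%. The ceiling is the 40 ms itself.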

If you cannot point to one hotspot with real numbers, do not switch languages yet. You are still guessing.

Check whether Rust fits this job

Rust earns its keep when the slow part is local, repeatable, and heavy on CPU time. Think tight loops, binary parsing, compression, codecs, encryption, image work, or a parser that runs thousands of times per minute. In cases like that, the code spends most of its time doing actual computation, so a faster language can make a real dent.

Rust will not rescue time spent waiting. If your service stalls on a slow SQL query, a cache miss, disk I/O, or a remote API call, rewriting that path in Rust usually changes nothing users can feel. You just move the same wait into a different language.

A good candidate has a narrow edge. One function or worker takes clear input, does a lot of work, and returns clear output. That is the sweet spot for a mixed Go and Rust service, because you can swap one part without pulling apart the whole app.

Ask four plain questions before you touch code:

Does this path burn CPU, not wait on the network or database?
Can you describe the input and output in one sentence?
Can you test it with saved data outside the full service?
Can the Go version stay in place as a fallback?

If any answer is no, pause. Rust gets harder to justify when the boundary spreads across handlers, database models, queues, and shared app logic. That kind of rewrite grows fast, and the benchmark often gets muddy.

A small parser is a good bet. A checkout service that calls three APIs is not. An image resize worker may be worth a trial. A report page slowed by one bad query needs query work first.

This is where teams often need a blunt opinion. Oleg Sotnikov has spent years fixing performance problems at the architecture level, and the pattern is usually the same: first isolate the hot path, then test one small Rust slice, then keep or drop it based on real numbers. If the job is not narrow enough to benchmark on its own, it is probably not the right Rust target yet.

Count the learning cost honestly

A faster module can still be a bad bet if the team pays for it every week after launch. The real cost is not learning Rust syntax. It is the time people need to build, review, debug, deploy, and support a mixed Go and Rust service without slowing down the rest of the quarter.

Start with people, not code. If one engineer has shipped Rust in production and two others can review it with confidence, the risk is manageable. If nobody has done that before, expect slower reviews, more cautious releases, and more time spent reading compiler errors than benchmark charts.

The wiring work adds up fast. A small Rust library inside a Go app sounds neat, but someone still has to handle bindings, memory boundaries, build scripts, packaging, and CI changes. If your team already has a clean pipeline, this may take days. If your builds are already fragile, it can take much longer.

Debugging also gets harder. When a bug crosses the Go and Rust boundary, the person on call needs to know where to look first. That matters more than people admit. A 30% speed gain loses some shine if a 2 a.m. incident now takes twice as long to diagnose.

Use a simple estimate before you approve the trial:

who can write Rust now
who can review Rust now
how many build and test changes the integration needs
who owns on-call after release
what work slips if this experiment takes two extra weeks

Be blunt about focus. If the team is already busy with a launch, a migration, or overdue reliability work, this experiment may cost more than it returns. A narrow performance win is easiest to justify when one engineer can isolate the hotspot, ship it behind a flag, and support it without pulling half the team into a new toolchain.

This is where outside help can change the math. If you can borrow review support from someone who already runs mixed stacks in production, the learning cost drops a lot. If you cannot, count the full training burden up front and treat it as part of the benchmark, not as a side note.

Run the trial in small steps

Pick one unit that hurts in a measurable way. Good candidates are a parser, an image worker, a ranking loop, or a compression step. The unit needs a hard number problem, not a vague feeling that "Rust is faster."

Write the pass line before anyone opens an editor. For example: cut CPU by 30%, keep memory flat, and add no new error cases. If you cannot name the win in numbers, the trial can turn into a hobby project.

Keep the shape of the service the same. If you go to Rust for one performance hotspot, keep the Go interface and swap only the slow unit behind it. That lets you test the bet without changing routing, storage, auth, or deployment all at once.

A small trial usually works best like this:

freeze the current Go version and record baseline metrics
port one function or one worker only
feed both versions the same real inputs
compare CPU time, memory use, output correctness, and failure rates
track developer hours as carefully as runtime numbers

Use data that looks like production. Toy samples hide the ugly parts: long strings, bad inputs, spikes, and weird edge cases. If the slow path handles uploaded files, test real file sizes. If it parses events, use a messy event batch from actual logs with sensitive fields removed.
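A rough sketch of the "same inputs, both versions" step, using a checksum as the unit under test. Everything here is an assumption for illustration: `portedChecksum` is a second Go implementation standing in for the Rust port so the example runs on its own, and the inputs are invented rather than taken from real logs.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// checksumFn is the shape both implementations must share.
type checksumFn func(payload []byte) uint32

// goChecksum stands for the frozen Go version.
func goChecksum(payload []byte) uint32 {
	return crc32.ChecksumIEEE(payload)
}

// portedChecksum is a stand-in for the Rust port. In a real trial this would
// call into the Rust code; here it is plain Go so the sketch is self-contained.
func portedChecksum(payload []byte) uint32 {
	return crc32.Update(0, crc32.IEEETable, payload)
}

// mismatches returns the indexes of inputs where the two versions disagree.
func mismatches(ref, candidate checksumFn, inputs [][]byte) []int {
	var bad []int
	for i, in := range inputs {
		if ref(in) != candidate(in) {
			bad = append(bad, i)
		}
	}
	return bad
}

// sampleInputs returns invented payloads, including the awkward ones:
// an empty body, a tiny body, raw binary bytes, and a large block.
func sampleInputs() [][]byte {
	return [][]byte{
		{},
		[]byte("a"),
		[]byte("line one\nline two\x00\xff"),
		make([]byte, 1<<16),
	}
}

func main() {
	bad := mismatches(goChecksum, portedChecksum, sampleInputs())
	fmt.Printf("%d of %d inputs disagree\n", len(bad), len(sampleInputs()))
}
```

Returning the failing indexes, rather than a single pass or fail, makes it easier to pull the exact payload that broke the port and add it to the saved test set.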

Developer time matters more than teams admit. A Rust benchmark can look great on a chart and still fail if two engineers spend three weeks learning lifetimes, FFI, and build tooling for a tiny gain. Count setup time, review time, debugging time, and the extra work for CI.

A simple example: a Go service spends 40% of its CPU budget in a checksum worker. The team ports only that worker to Rust, keeps the same Go call site, and tests a day of real payloads. If CPU drops by 35%, memory stays close, and the team can support the code, keep it. If the gain lands at 8%, roll it back and move on.

Plan the rollback before rollout

If you go to Rust for one performance hotspot, treat the first release as a reversible trial. Production traffic is messy, and lab results can look better than real life.

Keep the Go path running until the Rust path proves itself with live load. Do not delete working code just because the benchmark looked good on a staging box. A mixed Go and Rust service is easier to carry for a while than a rushed full switch that traps your team.

A simple flag makes this safe. It can sit in config, your deploy system, or a small routing layer, as long as the team can flip it fast without a code change. The flag should do four jobs:

send a small share of traffic to Rust first
record which path handled each request
fall back to Go when Rust errors spike
let ops turn Rust off in minutes
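The four jobs above can be sketched as a small in-process router. This is only one possible shape, under loud assumptions: the `Router` type, its fields, the thresholds, and the use of a numeric request ID to pick the traffic share are all hypothetical, and the Rust path is faked with a plain Go function.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// impl is the shared shape of the Go path and the Rust path.
type impl func(payload []byte) (string, error)

// Router decides which path handles a request. All fields are assumptions.
type Router struct {
	mu         sync.Mutex
	Enabled    bool    // kill switch: ops set this to false to turn Rust off
	Percent    uint32  // share of traffic sent to Rust, 0 to 100
	MaxErrRate float64 // disable Rust when its error rate exceeds this
	MinCalls   int     // do not judge the error rate on fewer calls than this
	rustCalls  int
	rustErrors int
}

func (r *Router) useRust(requestID uint32) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.Enabled && requestID%100 < r.Percent
}

// record counts one Rust call and trips the kill switch on an error spike.
func (r *Router) record(failed bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.rustCalls++
	if failed {
		r.rustErrors++
	}
	if r.rustCalls >= r.MinCalls &&
		float64(r.rustErrors)/float64(r.rustCalls) > r.MaxErrRate {
		r.Enabled = false
	}
}

// Handle returns the output plus a label for the path that produced it,
// so the caller can log which implementation served the request.
func (r *Router) Handle(requestID uint32, payload []byte, rust, goImpl impl) (string, string, error) {
	if r.useRust(requestID) {
		out, err := rust(payload)
		r.record(err != nil)
		if err == nil {
			return out, "rust", nil
		}
		out, err = goImpl(payload) // this request still gets an answer from Go
		return out, "go-fallback", err
	}
	out, err := goImpl(payload)
	return out, "go", err
}

func goPath(payload []byte) (string, error)      { return "go result", nil }
func healthyRust(payload []byte) (string, error) { return "rust result", nil }
func failingRust(payload []byte) (string, error) { return "", errors.New("rust path failed") }

func main() {
	r := &Router{Enabled: true, Percent: 100, MaxErrRate: 0.5, MinCalls: 3}
	for id := uint32(0); id < 5; id++ {
		_, path, _ := r.Handle(id, []byte("x"), failingRust, goPath)
		fmt.Println(id, path)
	}
}
```

In this run the first three requests fall back to Go one at a time, the error rate trips the switch, and the remaining requests never touch the Rust path. A production version would likely reset the counters per time window and read `Enabled` and `Percent` from config rather than from struct fields.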

That flag matters more than the benchmark chart. When latency jumps, users do not care that the new path is written in a faster language. They care that the page hangs or the API times out.

Watch the numbers that affect the business after release, not just CPU time. Track p95 or p99 latency, crash rate, and host cost. Then watch the hidden cost too: slower builds, harder deploys, noisy logs, more pager alerts, and longer debugging sessions.

Write the stop rule before rollout. For example, if Rust cuts p99 latency by less than 10% over two weeks, or if crash rate rises, or if on-call work grows enough to erase the gain, switch traffic back to Go. Pick the threshold early so nobody moves the goalposts later.
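Writing the stop rule as code is one way to keep it from drifting. The sketch below assumes the three conditions from the example; the `Window` and `StopRule` types are hypothetical, and "on-call work grows enough to erase the gain" is reduced to a simple hours limit, which is an assumption you would replace with your own measure.

```go
package main

import (
	"fmt"
	"time"
)

// Window holds the numbers observed over one comparison period.
type Window struct {
	P99         time.Duration
	CrashRate   float64 // crashes per 1,000 requests
	OnCallHours float64 // hours spent on incidents for this path
}

// StopRule holds thresholds agreed before rollout. Values are assumptions.
type StopRule struct {
	MinP99GainPercent   float64
	MaxExtraOnCallHours float64
}

// Decide compares the Rust trial with the Go baseline and reports whether
// traffic should switch back to Go, with a short reason.
func (s StopRule) Decide(goBase, rustTrial Window) (rollback bool, reason string) {
	gain := 0.0
	if goBase.P99 > 0 {
		gain = 100 * float64(goBase.P99-rustTrial.P99) / float64(goBase.P99)
	}
	switch {
	case gain < s.MinP99GainPercent:
		return true, fmt.Sprintf("p99 gain %.1f%% is below the %.1f%% target", gain, s.MinP99GainPercent)
	case rustTrial.CrashRate > goBase.CrashRate:
		return true, "crash rate rose"
	case rustTrial.OnCallHours-goBase.OnCallHours > s.MaxExtraOnCallHours:
		return true, "on-call work grew past the agreed limit"
	}
	return false, "gain holds; keep the Rust path"
}

func main() {
	rule := StopRule{MinP99GainPercent: 10, MaxExtraOnCallHours: 4}
	base := Window{P99: 1800 * time.Millisecond, CrashRate: 0.2, OnCallHours: 6}
	trial := Window{P99: 1700 * time.Millisecond, CrashRate: 0.2, OnCallHours: 6}
	rollback, reason := rule.Decide(base, trial)
	fmt.Println("rollback:", rollback, "-", reason)
}
```

The thresholds live in one struct on purpose. If someone wants to move the goalposts after the trial, the change has to show up in a diff.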

This saves arguments when the result is only marginal. Teams often keep a change because they already spent time learning FFI, packaging, and debugging across language boundaries. That is a bad reason to keep it.

Rollback should feel boring. One switch, one deploy, clear logs, done. If reverting is painful, the experiment was too risky from the start.

A simple example of a good bet

A Go API starts to run hot under load. The team profiles it and finds one clear problem: about 40% of CPU time goes into JSON validation before the request reaches the business logic.

That result changes the conversation. They do not touch routing, auth, handlers, database code, or deployment. They port only the validator to Rust.

The boundary stays narrow on purpose. The Go service sends raw payload bytes to the Rust module and gets back a simple result: valid or invalid, plus a small set of error codes. That keeps the mixed Go and Rust service easy to reason about. It also cuts down the amount of glue code, which is where these experiments often get messy.

This matters more than people think. If the Rust code needs to know about Go structs, request context, logging, and half the app's data model, the "small test" stops being small.
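A narrow boundary like that can be sketched as a single Go interface. This is an illustration, not the team's actual code: the `Validator` interface, the `Code` values, and the size limit are hypothetical, and the Go implementation only checks JSON syntax as a stand-in for whatever rules the real validator enforces. A Rust-backed implementation would satisfy the same interface; it is left out because the binding details depend on the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Code is the whole vocabulary that crosses the boundary.
type Code int

const (
	Valid Code = iota
	InvalidSyntax
	TooLarge
)

// Validator is the narrow edge: raw bytes in, one small code out.
// No request context, no app structs, no logger.
type Validator interface {
	Validate(payload []byte) Code
}

// goValidator is the existing Go path, kept in place as the fallback.
type goValidator struct {
	maxBytes int
}

func (v goValidator) Validate(payload []byte) Code {
	if len(payload) > v.maxBytes {
		return TooLarge
	}
	if !json.Valid(payload) {
		return InvalidSyntax
	}
	return Valid
}

// accept is the call site. It depends only on the interface, so swapping
// the implementation does not touch handlers or business logic.
func accept(v Validator, payload []byte) bool {
	return v.Validate(payload) == Valid
}

func main() {
	var v Validator = goValidator{maxBytes: 1 << 20}
	fmt.Println(accept(v, []byte(`{"sku":"A-1","qty":2}`)))
	fmt.Println(accept(v, []byte(`{"sku":`)))
}
```

Because the call site sees only `Validator`, rolling back means constructing `goValidator` again instead of the Rust-backed type.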

The team then runs benchmarks on the same workload they already use for the Go service. CPU use drops by 28%, which is a real gain. But total request time drops by only 6%, because validation was just one part of the full request path.

That kind of result is not disappointing. It is honest.

A small speedup in end-to-end latency can still be worth keeping if it solves a real problem. Maybe the service is CPU-bound during peak hours, so a 28% CPU cut means fewer instances. Maybe the API sits on a hot path where 6% helps keep tail latency under an internal target. In those cases, the Rust change earns its keep.

If users will not notice the difference and the hosting bill barely moves, the team should be willing to remove it. Because the boundary is tight, rollback is simple: switch the Go service back to the old validator, keep the benchmark notes, and move on.

That is what a good bet looks like. One hotspot, one narrow module, one clear measurement, and one decision rule before rollout. The team does not chase the idea of a full rewrite. They test whether Rust pays for this specific job.

A rough version of the decision can fit on one page:

Keep the Rust module if it lowers cost, protects latency goals, or frees enough CPU to delay scaling.
Drop it if the gain stays stuck in benchmarks and does not matter in production.
Keep the interface small so either choice stays cheap.

That is usually better than rewriting a whole app just because one function is slow.

Mistakes that skew the decision

A Rust trial goes off track when the team measures the wrong thing. A function can look 8x faster in a microbenchmark and still change nothing users notice. If request time drops from 820 ms to 800 ms because the database still does most of the work, the port did not fix the real problem.

The idea behind "Go to Rust for one performance hotspot" only works when the test stays narrow enough to trust. Teams often port several parts at once and ruin the comparison. If parsing, validation, and caching all change in the same sprint, nobody can tell which change helped, which one hurt, or whether Go was the problem at all. A mixed Go and Rust service is easier to judge than a half-rewritten codebase.

Another miss is forgetting the cost of crossing between Go and Rust. Data has to move across that boundary somehow, and that can mean extra copies, marshaling work, and annoying bugs. This hurts most when the hotspot handles lots of small requests. A Rust routine that saves 15 ms on paper can give most of that back if every call spends 10 ms packaging and unpacking data.
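The boundary arithmetic is simple enough to write down before the port. A small sketch, assuming you have measured both the time saved inside the routine and the time spent packaging and unpacking data per call; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// netSaving is what one call keeps after paying to cross the boundary.
// A negative result means the port makes each call slower.
func netSaving(savedInRoutine, crossingCost time.Duration) time.Duration {
	return savedInRoutine - crossingCost
}

// givenBackPercent is the share of the routine's saving that the crossing eats.
func givenBackPercent(savedInRoutine, crossingCost time.Duration) float64 {
	if savedInRoutine <= 0 {
		return 0
	}
	return 100 * float64(crossingCost) / float64(savedInRoutine)
}

func main() {
	saved, crossing := 15*time.Millisecond, 10*time.Millisecond
	fmt.Printf("net saving per call: %v\n", netSaving(saved, crossing))
	fmt.Printf("given back to the boundary: %.0f%%\n", givenBackPercent(saved, crossing))
}
```

With the numbers from the paragraph above, the call keeps 5 ms of its 15 ms saving and gives back about two thirds. Measure the crossing cost on real payload sizes, since it tends to grow with the data you copy.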

Teams also underrate the human cost. Review time usually grows before runtime drops. Engineers read Rust more slowly, ask more questions, and spend longer on error handling, ownership rules, and unfamiliar tooling. The build can get messier too if CI now needs extra compilers, new containers, or separate checks for each target.

A trial is not a win just because a benchmark chart looks good. The ops team has to live with it. If deploys get harder, logs get murkier, or crash reports land inside an FFI layer that nobody wants to debug at 2 a.m., the result is weaker than it looked in the test. Let the people who handle incidents run it in staging, monitor it, and roll it back at least once.

One small example makes this clear. Say a team ports an image resize path to Rust and gets a 30% speedup in that function. Sounds great. But if upload time, object storage, and response encoding still dominate the request, users may not feel any change. Meanwhile, reviews take twice as long and packaging the service gets more brittle.

A fair bet stays boring. Change one hotspot, measure user impact, count review and build time, and wait until operations says the service feels normal. If the gain stays marginal, keep the Go code and drop the experiment.

Quick checks before you say yes

Go to Rust for one performance hotspot only when the target is painfully clear. If nobody can name the slow path in one sentence, the scope is still too fuzzy. "Image resize worker hits 95% CPU during upload spikes" is clear. "The app feels slow" is not.

Next, get baseline numbers from real traffic. Pull p95 or p99 latency, CPU time, memory use, queue depth, and error rate for that one path. Local tests help, but they do not tell you how the code behaves under normal load, bad inputs, retries, and peak traffic.

A quick yes-or-no pass usually saves weeks of drift:

Can the team point to one endpoint, one worker, or one loop?
Do you have before numbers from live requests, not guesses?
After launch, can one engineer own the Rust code without panic or delay?
Can you switch the Rust path off in minutes with a flag or route change?
Will the gain change user wait time, cloud spend, or peak capacity in a way that matters?

If one of those answers is no, stop and tighten the plan. A mixed Go and Rust service adds friction even when the Rust code is small. The learning cost is rarely syntax. It is build tooling, debugging, profiling, deployment, on-call support, and the quiet fact that fewer people on the team may want to touch it.

Rollback matters more than most teams admit. If the Rust version ships and the speedup is tiny, you need a clean exit. Keep the old Go path alive for a while, put the new path behind a flag, and log both versions during the trial if you can. That makes the decision boring, which is good.

The last check is simple. Ask what changes if this works. Maybe pages load 200 ms faster, maybe you drop two servers, maybe batch jobs finish before the morning rush. If the answer is "not much," keep the code in Go and fix the cheaper problem first.

Your next move

If your team wants to go to Rust for one performance hotspot, write down the bet before anyone opens a new repo. Put the problem in numbers, not feelings: latency, CPU load, memory use, queue delay, or cloud spend. Add one target that would make the work worth it, one estimate for team time, and one clear rollback path if the result disappoints.

A one-page note is enough if it answers four things:

what part of the Go service is slow
what result would count as a win
how much learning time the team will spend on Rust
how you switch traffic back to Go if the gain stays small

Keep the trial short. Two weeks is often enough to profile the hotspot, build a small Rust path, run the benchmark, and test it behind a flag. If the experiment starts pulling in auth changes, schema changes, or a brand new service boundary, stop. That is rewrite drift, and it costs more than most teams expect.

Share the result with more than engineering. Product should judge whether users will notice the change. Ops should review deploy steps, logs, alerts, and failure modes. Finance should see the full cost, including training time and slower hiring if you add Rust to the stack.

Be strict about the result. If the mixed Go and Rust service trims a few milliseconds in a lab test but adds build pain, harder debugging, or on-call stress, keep the Go version. A small speed gain rarely pays for extra complexity over the next year.

If you want a second opinion before the team commits, Oleg Sotnikov can review the hotspot, the test plan, and the rollback plan. That can save a lot of wasted motion, especially when the cheaper fix is better caching, less copying, or a simpler query instead of Rust.