Feb 01, 2026·8 min read

Python to Rust migration for a single bottleneck service

Planning a Python to Rust migration for one bottleneck service? Start with profiling, move only the slow path, and keep deployment easy for the team.

What this problem looks like

A bottleneck service rarely looks broken at first. Most of the day, it works. Requests go through, users get results, and the team keeps shipping features in Python without much trouble. Then traffic spikes, one job grows larger than expected, or a customer runs a heavier query, and one part of the service starts dragging everything else down.

You usually see it in small, annoying ways before it turns into an outage. Pages load slower. Background jobs stack up. CPU use jumps on one instance while the rest of the system waits. A deploy that should feel routine turns stressful because one endpoint or worker can no longer keep up.

In daily work, the pattern is often simple:

  • one API route suddenly takes 2 seconds instead of 200 ms
  • one report job blocks a queue for minutes
  • one parser or scoring function burns most of the CPU
  • one customer action causes timeouts for everyone behind it

That is why a single slow function matters so much. Even if 95 percent of the service is fine, the slow part can set the pace for the whole system. It fills worker pools, delays retries, and makes autoscaling look like a fix when the real problem sits inside a few lines of code.

Python may still be the right choice for most of the service. It is fast to change, easy to read, and good for business logic, APIs, and glue code. Teams get into trouble when they assume slow performance means the whole stack is wrong. Often, it means one loop, one serializer, one matching algorithm, or one data transformation needs a different tool.

That is where a Python to Rust migration makes sense. Not as a full rewrite, and not as a bet on a new language solving every issue. The better version is smaller and less dramatic: keep the Python service, find the slow path, and move only that part to Rust.

That approach gives the team room to learn without turning the service into a science project. You fix the pain users actually feel, keep most of the codebase familiar, and avoid months of rewrite work that may never pay off.

Find the slow part first

A Python to Rust migration usually starts with a guess. That guess is often wrong. The slow request may look like a Python problem, but the real delay might come from a database query, a third-party API, disk I/O, or a queue that backs up under load.

Start with numbers from real traffic or a replay of recent production requests. Measure total request time, CPU use, and memory use for the service as it runs today. If CPU stays low while latency climbs, Python code may not be your main problem. If one worker pegs a core and response time rises with it, that is a stronger signal.

Keep compute time separate from waiting time. A request that spends 40 ms doing math and 300 ms waiting on Postgres does not need Rust first. A request that spends 250 ms inside one parsing, scoring, or transformation step might.

A short checklist helps:

  • time the full request from entry to response
  • time the heavy function or code block inside it
  • record CPU use during peak load
  • record memory growth across many requests
  • test with real payloads, sizes, and error cases
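The first two checklist items can be sketched with the standard library alone. This is a minimal illustration, not a full profiler: `score_records` and `handle_request` are hypothetical stand-ins for your own hot function and request handler, and the `time.sleep` call only simulates waiting on a database or API. Comparing wall-clock time with CPU time gives a rough split between computing and waiting.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall-clock and CPU time for the enclosed block under `label`."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        results[label] = {
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        }


def score_records(records):
    # Hypothetical CPU-heavy step standing in for the real hot function.
    return [sum(ord(c) for c in r) % 97 for r in records]


def handle_request(records, results):
    # Time the full request and the heavy block inside it.
    with measure("request", results):
        time.sleep(0.05)  # simulated wait on a database or third-party API
        with measure("score_records", results):
            scores = score_records(records)
    return scores


timings = {}
handle_request(["alpha", "beta", "gamma"] * 1000, timings)
for label, t in timings.items():
    waiting = t["wall_s"] - t["cpu_s"]
    print(f"{label}: wall={t['wall_s']:.4f}s cpu={t['cpu_s']:.4f}s waiting~{waiting:.4f}s")
```

Note that `time.process_time` counts CPU time for the whole process, so in a multi-threaded server the split is only approximate. A sampling profiler gives a finer picture, but even this rough split shows whether the request is mostly computing or mostly waiting.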

Toy benchmarks waste time. A tiny JSON blob, a clean cache, or a fake ten-row dataset can make Python look fine. Then production sends a 12 MB payload, a weird customer file, or ten thousand records, and the slow path shows up fast. Use samples that match the messy cases your service already handles.

Write down one target before you touch Rust. Pick something concrete, such as "cut p95 latency from 420 ms to 220 ms" or "drop CPU use by 35% on this endpoint." One target keeps the work honest. It also tells you when to stop.

Oleg Sotnikov often pushes teams to profile before they rewrite anything, and that advice holds up. Teams learn faster when they move only the slow path, measure again, and leave the rest of the service alone until the data says otherwise.

Choose a small Rust target

Once profiling shows where the time goes, keep the first Rust change boring. Pick one function that does heavy CPU work, takes clear input, and returns a clear result. That gives the team a small surface area to learn, test, and explain.

Leave request handling in Python if it is already fast enough. The same goes for validation, routing, and most business rules. Those parts often change more than the hot loop does, and Python is usually easier to adjust when product logic keeps moving.

A strong first target is a pure compute path. Think of a ranking function that scores thousands of records, a parser that chews through large payloads, a dedup step, or a loop that transforms data item by item. These jobs stay close to the CPU, so Rust can help without forcing you to redesign the whole service.

A weak first target is anything tied to outside systems. Database calls, queue consumers, auth checks, retries, and network requests often spend more time waiting than computing. Moving those pieces to Rust in the first pass adds work, but it rarely removes the real bottleneck.

The boundary should fit in one plain sentence. For example: "Python sends a batch of events to Rust, and Rust returns scored results." If the team cannot describe the handoff that simply, the target is probably too large.

One quick test helps. Ask whether you can write the Rust part as a single module with a tiny interface:

  • input data in a predictable format
  • no direct database access
  • no queue or auth logic
  • one result or one small set of results

That shape keeps the mixed Python/Rust service easy to reason about. Python stays in charge of the request, logging, auth, and database work. Rust handles the one expensive section that actually burns CPU.

A small example makes this concrete. Say your service receives a request, loads customer data, applies ten business rules, and then runs a fraud score over 50,000 transactions. Keep the request flow and rules in Python. Move only the fraud scoring loop if that is where the CPU time lives.

This approach feels less dramatic than a full Python to Rust migration, but it usually works better. You get a measurable speedup, the team learns Rust on a narrow problem, and you avoid turning one slow path into a full rewrite.

Keep deployment simple

A migration goes smoother when production still looks familiar. If your team already runs one Python service, keep that shape for as long as you can. The safest first move is often a Rust library that Python calls for one slow function, not a brand new Rust service with its own container, config, and on-call burden.

That choice reduces risk in boring but important ways. You keep one network boundary, one deployable unit, and one place to inspect when something breaks. The team can learn Rust without learning a whole new release model at the same time.

For the first version, leave these parts alone if you can:

  • the public API
  • request and response payloads
  • log format and log fields
  • alerts, dashboards, and error codes

If clients send the same JSON and get the same JSON back, rollout gets much easier. Support teams do not need new runbooks. Existing dashboards still help because people can search the same fields and compare behavior before and after the change.

A mixed Python/Rust service often works better than people expect. Python can keep the web layer, validation, auth, and general glue code. Rust can take the tight loop that burns CPU or the parser that handles too much data. That split is simpler to debug than two services talking over HTTP.

Keep the build and release flow familiar too. If your team already builds one Docker image in CI and ships it the same way every time, keep doing that. Add one build step for the Rust extension, then publish the same image to the same place. Simple Rust deployment usually means fewer moving parts, not more.

Package the Rust part so Python imports it like any other module. When something fails, return the same status codes and the same error shape you return today. That discipline saves a lot of time during rollout, because every change in behavior looks like a real bug instead of a side effect from a new architecture.

One new language is enough. New language plus new service plus new operations work is where teams lose weeks.

A step-by-step migration plan

Most teams move too much code too early. A safer Python to Rust migration starts with proof, not hope, and keeps the first change small enough to roll back in minutes.

Start by freezing the current behavior. Write tests around the slow code, but do not stop there. Save real sample inputs and expected outputs from production too, especially ugly cases: empty values, strange Unicode, large payloads, and borderline numbers. Those files will catch the bugs that neat unit tests miss.
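One way to freeze behavior is to record input and output pairs from the current Python code and replay them against any later implementation. The sketch below assumes JSON-serializable inputs and outputs. `normalize_name` is a hypothetical stand-in for the slow function, and the sample inputs are invented examples of the ugly cases mentioned above.

```python
import json
import tempfile
from pathlib import Path


def normalize_name(value):
    # Hypothetical slow-path function whose behavior we want to freeze.
    return " ".join(value.strip().lower().split())


def record_cases(func, inputs, path):
    """Run `func` on each input and save input/expected pairs as JSON."""
    cases = [{"input": item, "expected": func(item)} for item in inputs]
    Path(path).write_text(
        json.dumps(cases, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(cases)


def replay_cases(func, path):
    """Return the saved cases where `func` no longer matches the expected output."""
    cases = json.loads(Path(path).read_text(encoding="utf-8"))
    return [c for c in cases if func(c["input"]) != c["expected"]]


# Invented ugly inputs: empty value, odd Unicode, large payload, stray tabs.
ugly_inputs = ["", "  Ünïcödé  Name ", "A" * 10_000, "tab\tseparated\tvalue"]
golden_path = Path(tempfile.mkdtemp()) / "normalize_name_cases.json"
record_cases(normalize_name, ugly_inputs, golden_path)

mismatches = replay_cases(normalize_name, golden_path)
print(f"{len(mismatches)} mismatches")
```

In a real project the golden file would live in the repository next to the tests, and the same `replay_cases` call would run against the Rust-backed function once it exists.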

Then make the Python side boring. Put the slow function behind one clear interface with stable input and output types. If the rest of the service can call that interface without caring whether the work happens in Python or Rust, the migration gets much easier.

A simple plan usually works best:

  1. Lock down behavior with tests and sample cases from real traffic.
  2. Wrap the hot function in a thin boundary so one call site controls the swap.
  3. Port only that function to Rust and keep the surrounding service in Python.
  4. Run both versions on the same workload and compare every result, including errors and edge cases.
  5. Release the Rust path behind a flag or a small traffic slice, with the Python path ready as a fallback.

When you compare outputs, do it with care. Check ordering, rounding, null handling, text encoding, and timeout behavior. "Close enough" often turns into a production bug. Line-by-line checks feel slow, but they save days later.
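A comparison harness along these lines can make the checks repeatable. It is a sketch under assumptions: results are plain Python values, floats may differ only by a tiny relative tolerance that you choose, and errors are compared by exception type. It does not cover timeout behavior, which needs a separate check. `old_mean` and `new_mean` are invented stand-ins for the Python and Rust paths.

```python
import math


def run_safely(func, payload):
    """Return ("ok", result) or ("error", exception type name)."""
    try:
        return ("ok", func(payload))
    except Exception as exc:
        return ("error", type(exc).__name__)


def same_value(a, b, rel_tol=1e-9):
    """Strict comparison: same types, same order, tiny tolerance on floats only."""
    if type(a) is not type(b):
        return False
    if isinstance(a, float):
        if math.isnan(a) and math.isnan(b):
            return True
        return math.isclose(a, b, rel_tol=rel_tol)
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(same_value(x, y, rel_tol) for x, y in zip(a, b))
    if isinstance(a, dict):
        return a.keys() == b.keys() and all(same_value(a[k], b[k], rel_tol) for k in a)
    return a == b


def compare(old_func, new_func, payloads):
    """Run both implementations on every payload and collect the differences."""
    diffs = []
    for i, payload in enumerate(payloads):
        old, new = run_safely(old_func, payload), run_safely(new_func, payload)
        if old[0] != new[0] or not same_value(old[1], new[1]):
            diffs.append({"index": i, "old": old, "new": new})
    return diffs


def old_mean(values):
    return sum(values) / len(values)


def new_mean(values):
    # Stand-in for the new path: returns 0.0 for empty input instead of raising.
    return sum(values) / len(values) if values else 0.0


diffs = compare(old_mean, new_mean, [[1.0, 2.0], [], [0.1, 0.2, 0.3]])
for d in diffs:
    print(d)
```

Here the empty input is reported as a difference because the old path raises and the new one returns a value. That kind of silent change in error behavior is easy to miss when only happy-path results are compared.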

Use the same workload for benchmarking both versions. Run them on the same machine if you can. Measure total request time, not just the Rust function in isolation, because data conversion between Python and Rust can eat part of the gain.

Keep deployment simple while the team learns. One compiled module inside the current service is usually easier than a brand new Rust microservice with its own build, logs, alerts, and release process. Oleg often pushes teams toward this kind of small, reversible change first, because it cuts risk and teaches the team faster than a big rewrite.

If the flag stays quiet, latency drops, and errors do not move, then expand the Rust path. If anything looks off, switch back and inspect the diff instead of guessing.
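The flag, the traffic slice, and the fallback can be sketched in a few lines. Everything named here is an assumption for illustration: the constants would normally come from your existing config or flag system, and `python_impl` and `rust_impl` stand for the two implementations behind the boundary.

```python
import hashlib
import logging

logger = logging.getLogger("scoring")

RUST_ENABLED = True        # kill switch: set to False to force the Python path
RUST_TRAFFIC_PERCENT = 10  # share of requests routed to the new path


def in_rust_slice(request_id: str, percent: int) -> bool:
    """Deterministically map a request id to a bucket from 0 to 99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def score(request_id, payload, python_impl, rust_impl, percent=RUST_TRAFFIC_PERCENT):
    """Route a slice of traffic to the Rust path, with Python as the fallback."""
    if RUST_ENABLED and in_rust_slice(request_id, percent):
        try:
            return rust_impl(payload)
        except Exception:
            # Log the failure and fall through so the user still gets a result.
            logger.exception("rust path failed, falling back; request_id=%s", request_id)
    return python_impl(payload)


def broken_rust_impl(payload):
    # Simulated failure in the new path, used only to show the fallback.
    raise RuntimeError("simulated failure")


result = score("req-42", [1, 2, 3], sum, broken_rust_impl, percent=100)
print(result)
```

Hashing the request id keeps routing stable, so the same request lands on the same path every time, which makes a reported problem easier to reproduce.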

A simple example

A file import worker is a good case for this approach. Users upload large CSV files, wait for the import to finish, and expect clear status updates if something goes wrong. The Python worker already does a lot well: it accepts the file, validates columns, stores job progress, and turns failures into messages people can understand.

The trouble starts when the file is huge. Think of a customer import with 1.5 million rows, where each row needs cleanup and a match against existing records. Most of the wall time does not come from uploads or database writes. It comes from one tight loop that normalizes text, compares fields, and scores possible matches over and over.

A profiler points straight at that loop. It might take 70 to 80 percent of the CPU time, while the rest of the worker stays fairly ordinary. That is a solid target for a Python to Rust migration, because the team can move only the slow path and leave everything else alone.

Python still handles the parts users notice:

  • file upload and job creation
  • progress updates
  • retry logic
  • readable error messages
  • final result storage

Rust replaces one function: given parsed rows and lookup data, return the best match for each row. The worker becomes a mixed Python/Rust service, but only in one small place. Python prepares the data, calls the Rust code, and continues as before.
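To make the contract concrete, here is a hypothetical Python reference for that one function. The field names, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity score are assumptions for illustration, not the real worker's logic. What matters is the shape: plain data in, one result per row out, no database or queue access.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def best_matches(
    rows: List[Dict[str, str]],
    lookup: List[Dict[str, str]],
    threshold: float = 0.8,
) -> List[Optional[Tuple[str, float]]]:
    """For each row, return (matched record id, score) or None if nothing qualifies."""
    candidates = [(rec["id"], normalize(rec["name"])) for rec in lookup]
    results: List[Optional[Tuple[str, float]]] = []
    for row in rows:
        name = normalize(row["name"])
        best: Optional[Tuple[str, float]] = None
        # The tight loop: every row is scored against every candidate.
        for rec_id, candidate in candidates:
            score = SequenceMatcher(None, name, candidate).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (rec_id, score)
        results.append(best)
    return results


rows = [{"name": "  ACME   Corp "}, {"name": "Zzz"}]
lookup = [{"id": "c1", "name": "Acme Corp"}, {"id": "c2", "name": "Globex"}]
matches = best_matches(rows, lookup)
print(matches)
```

A Rust port would need to return the same structure for the same inputs, which is what the saved sample cases and output comparisons are meant to verify.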

The deployment stays simple too. The team ships the same worker image, with one compiled Rust extension inside it. They do not add a new network service, a new queue, or a second deploy step while everyone is still learning. If the Rust call fails, they can fall back to the old Python path for a while.

From the user's side, nothing changes. They still upload the same file on the same screen. They still see the same progress states. The only visible difference is speed. A job that took 12 minutes may now finish in 3, and the team reached that result without rewriting the whole worker.

Mistakes that waste time

One common mistake is trusting laptop profiles too much. A service that looks slow on a developer machine can behave very differently under real traffic, real data size, and real network conditions. If you profile on a quiet laptop and assume production will match, you can spend a week speeding up the wrong function.

Teams also lose time when they rewrite more than they need. If the team is still learning Rust, moving a whole module may feel cleaner, but it often turns a small performance fix into a long training project. Keep the Python to Rust migration narrow. Move the hot loop, parser, or CPU-heavy worker, then measure again.

Another trap is changing behavior while changing language. If one pull request rewrites business logic and ports it to Rust, debugging gets messy fast. Nobody can tell whether a bug came from the new rules or the new implementation. Keep the logic the same first. Once the Rust code matches Python output, then improve the design.

A lot of teams split one service into several services too early. That adds RPC calls, more deploy steps, and more places for failures to hide. For a mixed Python/Rust service, a shared library or one container is often the calmer choice while the team learns. Simple Rust deployment is not glamorous, but it saves hours.

The last waste shows up after release. Some teams ship the Rust path, see one good benchmark, and stop watching it. That is risky. You still need monitoring for latency, error rate, memory use, and queue depth, especially in the first few days.

A few warning signs show you are drifting off course:

  • Production numbers stay flat, but the rewrite keeps growing
  • Deploys have become a bigger problem than the original bottleneck
  • The team is fixing porting bugs instead of improving the slow path
  • Memory use climbs even though CPU time dropped

If you avoid these mistakes, you usually learn Rust faster and keep the service stable at the same time.

Quick checks before rollout

Run the new path against real requests before you flip any traffic. A fast rewrite that changes results will create more work than it saves.

Start with output checks. Take a sample of real requests, send them through the Python path and the Rust path, and compare the results line by line. If the service handles floats, timestamps, or sorting, define what counts as "the same" before you test. Small format differences can hide real bugs.

Speed checks need two numbers, not one. Median time tells you whether normal requests got faster. Tail latency, such as p95 or the worst case, tells you whether rare slow requests still hurt users. If median latency drops from 120 ms to 40 ms but some requests still spike to 3 seconds, the rollout is not done.
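Both numbers are easy to compute from recorded latencies. The sketch below uses the standard library's `statistics` module. The sample data is invented to mirror the case above, where most requests are fast but a few still spike.

```python
import statistics


def latency_summary(samples_ms):
    """Return median, p95, and max for a list of latency samples in ms."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
        "max_ms": max(samples_ms),
    }


# Invented samples: 95 fast requests and 5 that still spike to 3 seconds.
rust_samples_ms = [40] * 95 + [3000] * 5
summary = latency_summary(rust_samples_ms)
print(summary)
```

With data like this the median looks healthy while p95 and the maximum do not, which is the signal to keep investigating before widening the rollout.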

A short pre-release checklist helps:

  • replay saved requests and diff outputs
  • compare median latency and tail latency on the same workload
  • confirm build, deploy, and rollback use the same team workflow
  • check that logs, errors, and alerts still read clearly at 2 a.m.

Keep deployment boring. If your team already ships one container, try to keep shipping one container. If you already have a CI pipeline that works, keep it. This is not the moment to add a new package manager, a new release tool, and a new dashboard.

The on-call view matters more than people expect. If the Rust code changes log fields, error text, or metric names, the team can lose time during an incident. Keep old names where you can. If you cannot, update dashboards and alert rules before rollout. Teams using tools like Sentry, Grafana, or Prometheus should see the new path in familiar places, with labels they already understand.

One last check is ownership. Pick one person who owns the first week after release and one person who owns the next round of tuning. Rust often removes one bottleneck and exposes the next one. If nobody owns that follow-up work, the migration stalls right after the hardest part.

What to do next

Start with the smallest trial you can defend. Pick one endpoint or one background worker that is slow often enough to measure, but isolated enough that a rollback is boring. That keeps the team's first Python to Rust migration small and teachable.

Before anyone writes Rust, agree on one number that decides whether the trial worked. It might be p95 latency, jobs per minute, CPU time per request, or cloud cost for that worker. Choose one metric, write down the current baseline, and check it again after release. If the number did not move in a useful way, stop and ask why.

A simple first pass looks like this:

  • Keep the Python API or worker contract the same.
  • Move only the hot loop, parser, transform, or compute-heavy step.
  • Deploy the Rust part in the plainest way your team can support.
  • Add a rollback switch before real traffic hits it.

After release, capture what the team learned while it is still fresh. Write down build pain, packaging issues, test gaps, logging problems, and how long small fixes took. Two pages of notes can save weeks on the next path. Teams often skip this and repeat the same mistakes with a larger service.

If the first trial works, repeat the pattern on the next bottleneck instead of opening a broad rewrite. One small win is enough to build trust. Three half-finished migrations usually create more support work than speed.

If you want an outside view before you commit engineering time, Oleg Sotnikov can help scope a small, low-risk migration as part of Fractional CTO or architecture advice. That helps when you are not sure the bottleneck needs Rust at all, or whether a better design in Python would fix the problem faster.