Nov 09, 2024·8 min read

Redis Streams vs PostgreSQL queues for workflow backbones

Choosing between Redis Streams and PostgreSQL queues changes ordering, replay, tooling, and on-call work. Compare both models before your team commits.

Why this choice gets expensive later

The first version of a queue usually works fine in a calm test setup. The real costs show up later, when traffic gets uneven, workers crash, and someone needs to replay last Tuesday's jobs without sending duplicate emails or charging a customer twice.

Your workflow backbone decides more than where jobs wait. It shapes retries, failure handling, and recovery after a bad deploy or an outage. When that model is unclear, teams patch around it. One service gets a retry flag, another gets a timeout, and eventually someone runs a manual SQL script because nothing else works.

That is why "Redis Streams vs PostgreSQL queues" is not a simple speed test. You are choosing operating rules. How strict does ordering need to be? How long should messages stay around? What counts as done? Who cleans up when workers fall behind?

Changing direction later is expensive. Code built around database rows and transactional updates does not move neatly to a stream consumer model. Code built around consumer groups and acknowledgements does not drop into a SQL queue without rewriting retries, visibility rules, and monitoring. Then the data move starts. Historical jobs, retry state, and failure records all need a new home, and teams often find out too late that they never stored enough detail to replay safely.

Replay is where many early designs crack. A team may focus on throughput at the start and skip audit history because the system still feels small. Six months later, support asks basic questions: Which jobs ran? Which failed? Which were retried three times? Can we rerun only the broken ones? If the answer is "maybe, with a script," support costs start climbing.

Small choices add up fast. One weak deduplication rule can create duplicate side effects every week. Loose ordering can scramble account state. Short retention can erase the evidence you need during an incident. None of that looks dramatic in a design doc. It feels very real when one engineer spends 20 minutes every morning checking stuck jobs and cleaning them up.

What each option gives you

Redis Streams starts with an event log. Producers append entries to a stream, and Redis keeps them in order inside that stream. Consumers usually read through consumer groups, acknowledge work after they finish it, and leave unacknowledged entries in a pending list. If a worker dies, another worker can claim that stuck work after it has been idle long enough.
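The pending list is easier to picture with a toy model. The sketch below is plain in-memory Python, not a Redis client, and every name in it (`ToyStreamGroup`, `claim_idle`, the worker names) is made up for illustration. Its methods roughly mirror what the Redis commands XADD, XREADGROUP, XACK, and XAUTOCLAIM do for a single consumer group.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStreamGroup:
    """In-memory toy model of one stream with one consumer group."""

    entries: list = field(default_factory=list)   # append-only log of (entry_id, payload)
    next_index: int = 0                           # position of the next never-delivered entry
    pending: dict = field(default_factory=dict)   # entry_id -> (consumer, delivered_at)

    def add(self, payload):
        """Append an entry to the log and return its id."""
        entry_id = len(self.entries) + 1
        self.entries.append((entry_id, payload))
        return entry_id

    def read(self, consumer, now):
        """Deliver the next new entry and record it as pending for this consumer."""
        if self.next_index >= len(self.entries):
            return None
        entry_id, payload = self.entries[self.next_index]
        self.next_index += 1
        self.pending[entry_id] = (consumer, now)
        return entry_id, payload

    def ack(self, entry_id):
        """Acknowledge finished work: the entry stays in the log but leaves the pending list."""
        self.pending.pop(entry_id, None)

    def claim_idle(self, consumer, now, min_idle):
        """Hand entries that have been pending for at least min_idle seconds to another consumer."""
        claimed = []
        for entry_id, (owner, delivered_at) in list(self.pending.items()):
            if owner != consumer and now - delivered_at >= min_idle:
                self.pending[entry_id] = (consumer, now)
                claimed.append(entry_id)
        return claimed


group = ToyStreamGroup()
first = group.add({"job": "send_welcome_email"})
second = group.add({"job": "sync_crm"})

group.read("worker-a", now=0)   # worker-a takes the first entry, then crashes
group.read("worker-b", now=1)   # worker-b takes the second entry
group.ack(second)               # worker-b finishes and acknowledges

# A minute later, worker-b picks up whatever worker-a left behind.
reclaimed = group.claim_idle("worker-b", now=60, min_idle=30)
```

Acknowledging an entry does not delete it from the log. That is the property that makes replay possible, and also the reason trimming becomes a separate chore.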

That model feels close to a message broker. You write once, read many times if needed, and keep some history for replay. It works well when your workflow looks like a flow of events moving through fast workers.

PostgreSQL queues start with a jobs table. A producer inserts a row, and workers claim rows and update state as they go. Most teams add columns such as status, attempts, available_at, locked_by, locked_at, and last_error. The queue is not a separate system. It is regular application data with SQL around it.

That difference matters every day. In PostgreSQL, retries usually mean updating the same row and moving its next run time forward. In Redis Streams, retries often mean the same stream entry stays pending until a worker acknowledges it or another worker claims it. Both models can retry work, but they track that work in different ways.
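On the PostgreSQL side, "moving the next run time forward" can be sketched as a small function over one job row. This is a minimal illustration, assuming the columns named earlier, a dict standing in for the row, an exponential backoff policy, and a hypothetical `dead` status for jobs that run out of attempts. None of these choices come from a specific library.

```python
from datetime import datetime, timedelta, timezone


def next_available_at(now, attempts, base_seconds=30, cap_seconds=3600):
    """Exponential backoff: 30s, 60s, 120s, ... capped at one hour (assumed policy)."""
    delay = min(cap_seconds, base_seconds * 2 ** (attempts - 1))
    return now + timedelta(seconds=delay)


def record_failure(job, now, error, max_attempts=5):
    """Update a job row (a dict here) after a failed attempt."""
    job["attempts"] += 1
    job["last_error"] = error
    job["locked_by"] = None          # release the claim so another worker can pick it up
    job["locked_at"] = None
    if job["attempts"] >= max_attempts:
        job["status"] = "dead"       # hypothetical terminal state for manual review
    else:
        job["status"] = "pending"
        job["available_at"] = next_available_at(now, job["attempts"])
    return job


now = datetime(2024, 11, 9, 12, 0, tzinfo=timezone.utc)
job = {
    "id": 1,
    "status": "running",
    "attempts": 0,
    "available_at": now,
    "locked_by": "worker-a",
    "locked_at": now,
    "last_error": None,
}
record_failure(job, now, "timeout calling CRM")
```

The retry history lives on the row itself, which is why a plain query can later answer how many times a job failed and why.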

Cleanup follows different paths too. Redis Streams keeps data until you trim it by length or age. Forget that, and memory grows. Trim too hard, and replay options vanish. PostgreSQL queues need deletes, archiving, or partitioning. Keep every old job in one busy table, and queries slow down while vacuum work grows.

For many teams, PostgreSQL feels simpler because the data is visible and familiar. You can inspect rows, join them to business records, and audit changes with normal SQL. Redis Streams often feels lighter and faster for heavy event flow, but it adds a separate operating model. That is the real trade: event log semantics versus row state semantics.

Ordering under real traffic

Teams often say they need strict order, but real systems rarely process work as one clean line. In both models, order starts to bend as soon as you add more than one worker.

A queue can store jobs in a clear sequence. That does not mean workers will finish them in that same sequence. Worker A may grab job 101, worker B may grab 102, and job 102 may finish first because it is smaller or because job 101 is waiting on a slow API.
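A tiny simulation shows the drift. It assumes each free worker simply takes the next job in enqueue order, and the durations are invented for the example.

```python
import heapq


def completion_order(jobs, worker_count):
    """jobs: list of (job_id, duration) in enqueue order. Returns job ids in finish order."""
    free_at = [(0, worker) for worker in range(worker_count)]  # (time worker is free, worker index)
    heapq.heapify(free_at)
    finished = []
    for job_id, duration in jobs:
        start, worker = heapq.heappop(free_at)   # the earliest free worker takes the next job
        end = start + duration
        finished.append((end, job_id))
        heapq.heappush(free_at, (end, worker))
    return [job_id for end, job_id in sorted(finished)]


jobs = [(101, 5), (102, 1), (103, 1)]   # job 101 waits on a slow API

one_worker = completion_order(jobs, worker_count=1)
two_workers = completion_order(jobs, worker_count=2)
```

With one worker the finish order matches the enqueue order. With two workers, jobs 102 and 103 both finish before 101, even though nothing failed and nothing was retried.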

Retries make this messier. A failed job often goes back for another attempt after a delay. While it waits, newer jobs keep moving. Enqueue order, pickup order, and completion order can all diverge.

What "in order" usually means

Once you use partitions, shards, or consumer groups, order usually means "ordered inside one lane," not "ordered everywhere." That is often enough if you define the lane well.

For example, you might keep all events for one customer, account, or invoice in the same lane. That protects local order where it matters and lets unrelated work run in parallel. In practice, this is usually better than forcing one global sequence across the whole system.
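One common way to define a lane is to hash the ordering key. The sketch below is one possible scheme, not a feature of either system: `lane_for` and the lane count are assumptions, and each lane would map to one stream, or to one worker that owns a slice of a jobs table.

```python
import hashlib


def lane_for(key, lane_count):
    """Map an ordering key (customer, account, invoice) to a stable lane number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lane_count


events = [
    ("customer-1", "signup"),
    ("customer-2", "signup"),
    ("customer-1", "upgrade"),
    ("customer-1", "invoice"),
]

lanes = {}
for customer, action in events:
    lanes.setdefault(lane_for(customer, 4), []).append((customer, action))

# Every event for customer-1 sits in one lane, still in its original order.
customer_1_lane = lanes[lane_for("customer-1", 4)]
customer_1_actions = [action for customer, action in customer_1_lane if customer == "customer-1"]
```

A hash from `hashlib` is used rather than Python's built-in `hash()` because the built-in string hash changes between processes, which would send the same customer to different lanes on different workers.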

PostgreSQL queues and Redis Streams both run into this limit. PostgreSQL workers that poll concurrently can race each other. Redis consumer groups also spread messages across consumers, and a slow consumer can fall behind while others keep moving.

Where transactions change the story

Order problems get worse when writing data and enqueueing work happen in separate steps. If your app saves a row and then crashes before it sends the job, you now have data with no follow-up work. The reverse can happen too: the job exists, but the data change never committed.

PostgreSQL has a clear advantage here. You can write business data and insert the queue record in one transaction. That keeps state changes and queued work tied together.
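A minimal sketch of that pattern follows. It uses Python's built-in `sqlite3` module as a stand-in so it runs without a database server; with PostgreSQL the idea is the same, only the driver and SQL dialect differ. The table names, columns, and the `fail_before_commit` switch are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " kind TEXT NOT NULL,"
    " account_id INTEGER NOT NULL REFERENCES accounts(id),"
    " status TEXT NOT NULL DEFAULT 'pending')"
)
conn.commit()


def sign_up(conn, email, fail_before_commit=False):
    """Insert the account and its follow-up job in one transaction."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        cur = conn.execute("INSERT INTO accounts (email) VALUES (?)", (email,))
        conn.execute(
            "INSERT INTO jobs (kind, account_id) VALUES (?, ?)",
            ("send_welcome_email", cur.lastrowid),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")


def count(conn, table):
    """Row count helper for the two tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


sign_up(conn, "ok@example.com")
try:
    sign_up(conn, "crash@example.com", fail_before_commit=True)
except RuntimeError:
    pass  # the failed signup left neither an account nor a job behind
```

After both calls there is exactly one account and one job. The crashed signup produced no orphaned row on either side, which is the guarantee a separate queue service cannot give without extra machinery.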

With Redis, teams often need an outbox table or another recovery step to get the same safety. That is not a deal breaker, but it adds operator work and one more place for ordering bugs to hide.

If order matters, define the unit first. Decide whether order matters per customer, per account, per document, or per workflow run. That choice usually matters more than the queue brand.

Replay, backfills, and audit trails

Replay matters the first time a bug sends 4,000 jobs down the wrong path, or a rule change means you need to process last month's records again. Teams often think about throughput first and replay later. That sequence gets expensive.

With Redis Streams, replay only works if the events still exist. That sounds obvious, but many teams trim streams to control memory use. If you keep only the last 100,000 entries and discover a bug three days later, the older events may already be gone. Consumer groups help you track what each consumer has read and acknowledged, but they do not replace long-term history.

Redis can still work well for replay if you plan for it. You need a retention policy that matches your recovery window, and you need to accept the memory cost of keeping more history. If your workflow is short lived and you only need to recover recent jobs, that trade may be fine.
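A quick back-of-the-envelope check helps when a stream is trimmed by length. The arithmetic below assumes a steady average write rate and a 1.5x headroom factor, both of which are placeholders to replace with your own measurements.

```python
import math


def hours_covered(max_entries, entries_per_second):
    """How many hours of history a length-capped stream holds at a steady write rate."""
    return max_entries / entries_per_second / 3600


def entries_needed(entries_per_second, window_days, headroom=1.5):
    """Length cap needed to cover a recovery window, with headroom for bursts (assumed factor)."""
    return math.ceil(entries_per_second * window_days * 86400 * headroom)


# At one event per second, a 100,000-entry cap holds a little over a day of history.
coverage = hours_covered(100_000, entries_per_second=1)

# Covering a three-day recovery window at the same rate needs a much larger cap.
cap = entries_needed(entries_per_second=1, window_days=3)
```

At one event per second, 100,000 entries cover roughly 27.8 hours, so a bug found three days later is already out of reach. Traffic bursts shrink that window further, which is why the cap should come from the recovery window and not from a round number.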

PostgreSQL replay usually starts from saved rows. A common setup keeps each job or event in a table with fields such as status, attempts, created_at, processed_at, and error_message. When a bug is fixed, the team can reset a group of rows from failed to pending, or create a fresh batch from the same source records. That is slower than reading from memory, but it is much easier to reason about.
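Resetting a group of failed rows can be as small as one UPDATE. The sketch below again uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, with invented step names, dates, and sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " step TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " attempts INTEGER NOT NULL DEFAULT 0,"
    " created_at TEXT NOT NULL,"
    " processed_at TEXT,"
    " error_message TEXT)"
)
conn.executemany(
    "INSERT INTO jobs (step, status, attempts, created_at, error_message) VALUES (?, ?, ?, ?, ?)",
    [
        ("grant_trial_credits", "failed", 3, "2024-11-01T10:00:00", "bad credit rule"),
        ("grant_trial_credits", "done", 1, "2024-11-02T10:00:00", None),
        ("send_welcome_email", "failed", 3, "2024-11-03T10:00:00", "smtp timeout"),
        ("grant_trial_credits", "failed", 3, "2024-10-01T10:00:00", "bad credit rule"),
    ],
)
conn.commit()


def reset_failed(conn, step, since):
    """Move failed jobs for one step, created on or after `since`, back to pending."""
    with conn:
        cur = conn.execute(
            # error_message is left in place so the earlier failure stays visible
            "UPDATE jobs SET status = 'pending', attempts = 0"
            " WHERE status = 'failed' AND step = ? AND created_at >= ?",
            (step, since),
        )
    return cur.rowcount


reset_count = reset_failed(conn, "grant_trial_credits", "2024-10-26T00:00:00")
```

Only one row is reset here: the failed email job belongs to a different step, and the older credit job falls outside the date range. Being able to scope a replay this narrowly is most of the appeal.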

A simple example makes the difference clear. Imagine a SaaS signup flow where every new customer gets a welcome email, trial credits, and a CRM record. If the credit rule was wrong for two weeks, a PostgreSQL queue can usually find those rows with a query and rerun only that step. In Redis, you can do the same only if the original events were retained, or if you also saved a durable copy somewhere else.

Audit needs push many teams toward PostgreSQL. People eventually ask very plain questions: What happened to customer 1824? Which worker ran this job? When did it fail, and how many times? Which version of the rule created this result?

A queue table or event table makes those answers easier to store and query. This decision is not only about speed. It is also about history. If your workflow affects billing, approvals, compliance, or customer records, durable rows usually beat fast transient messages.

What operators deal with every week

Operators usually feel this decision before developers do. The design may look similar on paper, but the weekly chores are different.

Redis Streams pushes you to watch memory closely. Old entries do not disappear on their own, so someone needs trim rules, retention limits, and checks for sudden growth after a worker slowdown. PostgreSQL moves the pain to disk: tables grow, indexes bloat, autovacuum falls behind, and cleanup jobs need regular attention.

Backup and restore also differ in a practical way. If jobs still matter after a restart, PostgreSQL is often easier to reason about because the queue sits next to the rest of your app data, with the same backup flow and the same recovery drills. Redis can persist stream data too, but teams often learn that keeping messages is not the same as restoring clean consumer state after a failover.

Slow workers create work in both systems, just in different places. In Redis, pending entries pile up inside consumer groups, idle timers start to matter, and one bad deploy can trigger a retry storm. In PostgreSQL, stuck jobs usually show up as rows that never leave "running" status, leases that expire late, or workers that keep picking the same failed item.
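For the PostgreSQL case, a stuck-job check can be a few lines. The sketch below works on dicts that stand in for job rows and assumes a ten-minute lease. The lease length, the `locked_at` field, and the function name are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def find_stuck(jobs, now, lease=timedelta(minutes=10)):
    """Return ids of jobs that are still 'running' after their lease expired."""
    return [
        job["id"]
        for job in jobs
        if job["status"] == "running" and now - job["locked_at"] > lease
    ]


now = datetime(2024, 11, 9, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "status": "running", "locked_at": now - timedelta(minutes=45)},  # worker likely died
    {"id": 2, "status": "running", "locked_at": now - timedelta(minutes=2)},   # still within lease
    {"id": 3, "status": "pending", "locked_at": None},
]
stuck = find_stuck(jobs, now)
```

The Redis equivalent of this check is looking at how long entries have sat in a consumer group's pending list. Either way, the lease or idle threshold should be longer than the slowest legitimate job, or healthy work gets picked up twice.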

A small weekly checklist helps more than a fancy diagram. Check queue growth against your retention policy. Review retry counts and dead letter volume. Find consumers or workers that stopped making progress. Test one restore path for jobs that still matter. Confirm failover behavior under load, not only in staging.

Failover rules shape on-call load more than most teams expect. Redis with replicas, Sentinel, or cluster mode needs clear rules for promotion, reconnects, and message ownership after a node change. PostgreSQL has its own sharp edges, but many teams already know how to run it, so adding a queue there can mean fewer moving parts.

That is often the weekly tradeoff. If your team already runs PostgreSQL well and the queue is close to your core data, PostgreSQL usually creates fewer surprises. If you need heavy event flow and short lived work, Redis can fit well, but only if someone owns memory limits, consumer recovery, and cleanup with discipline.

This is why it pays to price operator time, not only benchmark speed. A faster queue can still cost more if it wakes people up at 2 a.m.

A simple example from a SaaS signup flow

A new account looks simple on the surface. A user signs up, confirms an email address, and expects the product to work right away. Behind that moment, your system often kicks off several jobs at once.

One signup might trigger four different actions: send a welcome email, create or update a billing record, push the new customer into the CRM, and send a webhook to another app.

Those jobs do not all carry the same risk. If the welcome email goes out twice, that is annoying but usually fixable. If billing runs twice, you may charge the customer twice, and support will feel that pain immediately.

That difference shapes the decision more than most teams expect. PostgreSQL often feels safer for work that must happen once and must leave a clear record. You can store the signup, the billing state, and the queue item close together, then check exactly what happened. It usually handles fewer jobs per second than an in-memory stream, but the paper trail is better.

Redis Streams fits better when you care more about fast fan out and replay across several consumers. If the CRM sync falls behind for 20 minutes, or a partner endpoint goes down, support can replay missed webhook events later. That is a real advantage when outside systems fail, which they often do.

Now picture a small outage. Your app keeps accepting signups, but the webhook worker is down. Later, support gets messages from customers asking why their accounts never reached a connected tool. With a replay friendly event log, the team can rerun those webhook jobs without touching billing again. That is much harder if all work lives in one simple database queue with no event history.

Most teams do not need one perfect backbone for everything. They need the least risky model for the failure that hurts most. If duplicate charges would be a disaster, favor the model that makes billing strict and easy to inspect. If missed downstream events create the larger mess, favor the model that makes replay cheap and routine.

How to choose step by step

Start with the work, not the database. Teams often pick Redis because it feels like a queue, or PostgreSQL because they already run it. That shortcut gets expensive when you need replay, audits, or calm nights on call.

A simple decision process works better than abstract debate. Write down every job your app runs and mark each one as either "strict order" or "order does not matter much." A billing event and an account suspension often need a clear sequence. A welcome email usually does not.

Then decide how long job history needs to stay useful. If you may replay work next week, rebuild state after a bug, or answer "what happened to this customer?" three months later, put that in the design from day one.

Next, list the failures your team already sees: duplicate webhooks, workers that crash mid-job, slow third-party APIs, and deploys that leave half-finished work behind. The right queue model should make those failures boring.

After that, name the person who will own backups, cleanup, dead letter handling, and alerts. This sounds dull, but it often decides the outcome. If your team already knows PostgreSQL well and has no appetite for another service, that matters. If you already run Redis carefully and need very fast fan out, that matters too.

Finally, build one real workflow before you standardize. Use something close to production, not a toy demo. Run it for a few days, kill a worker on purpose, replay old jobs, and restore from backup once. Then count how much operator time the system actually needs.

That is where the choice gets practical. Pick the model that fits your ordering rules, your replay window, and the people who will support it at 2 a.m. A short pilot usually settles the argument faster than another architecture meeting.

Mistakes teams make early

Teams often pick a queue after one fast benchmark and call the problem solved. That is how small test results turn into a long weekly tax. A stream that looks cheap at 10,000 messages in a lab can cost much more once people start debugging stuck jobs, replaying missed work, trimming old data, and answering questions like "did this task run twice?"

Another common mistake is treating an event log and a job queue as the same thing. They are related, but they solve different problems. A job queue answers "what should a worker do next?" An event log answers "what happened, in what order, and can we read it again later?" Redis Streams sits somewhere in the middle, which is useful but easy to misuse. PostgreSQL can cover both patterns too, but only if you design tables, retries, and retention rules carefully.

Retries confuse people even more. A failed job that goes back into the queue does not automatically keep its original place in line. Under real traffic, retries can jump ahead, fall behind, or run beside newer work unless you add strict controls. That matters when one customer action depends on another, such as creating an account before sending a welcome package or creating a billing record.

Storage cleanup gets ignored for too long as well. Then one day the queue is huge, old jobs are still sitting around, and nobody agrees on what should stay. If you use Redis Streams, you need a clear trim policy and a rule for pending entries that workers never acknowledged. If you use PostgreSQL, you need a plan for old rows, indexes, vacuum pressure, and how long replay data should live.

A lot of teams also copy a choice from a company with a very different workload. That shortcut causes more pain than people expect. A product with short, disposable background jobs may do fine with one model. A system that needs replay, audits, and long running workflows may need another.

A better early check is simple. Write down what must stay in order per user, per account, or globally. Decide how you will replay work after a bug fix. Set retention and cleanup rules before launch. Count operator time, not just message throughput. Test failure cases, not only normal speed.

That is usually where the debate stops being about benchmarks and starts looking like an operations decision.

Quick checks before you decide

Most teams can narrow this choice in ten minutes if they answer a few blunt questions. The best design on paper often loses to the one your team can run calmly during a busy week and debug half asleep.

If your work looks like a plain job queue, PostgreSQL often wins by being boring and easy to support. If many consumers need the same event stream, Redis Streams starts to make more sense because it was built for fan out, consumer groups, and replay windows.

A few checks help. If one worker should claim a job, do the work, and mark it done, PostgreSQL is usually enough. If several services need to read the same event independently, Redis Streams fits better. If old jobs can disappear after a few days, Redis retention policies may be fine. If you need a longer trail, PostgreSQL is simpler to keep and inspect. If your team already lives in Postgres every day, keeping one database may reduce support work, backup chores, and on-call stress.

One warning matters in either model: if workers cannot handle duplicates safely, fix that first. Both approaches can deliver the same work more than once during failures.

That duplicate point deserves more weight than many teams give it. A queue choice will not save you from retry logic, crashed workers, or network timeouts. If a worker can send two welcome emails, charge a card twice, or create two accounts, the design is still fragile.
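One common guard is an idempotency key backed by a unique constraint. The sketch below is a minimal version that uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The `charges` table, the key format, and `charge_once` are all hypothetical, and a real billing worker would also need to protect the external payment call itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges ("
    " idempotency_key TEXT PRIMARY KEY,"
    " account_id INTEGER NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)
conn.commit()


def charge_once(conn, idempotency_key, account_id, amount_cents):
    """Return True if this call recorded the charge, False if it was a duplicate delivery."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO charges (idempotency_key, account_id, amount_cents) VALUES (?, ?, ?)",
                (idempotency_key, account_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        return False  # the key already exists, so this delivery is a repeat
    return True


# The same job is delivered twice, for example after a worker crash and retry.
first_delivery = charge_once(conn, "signup-1824-initial-charge", 1824, 4900)
second_delivery = charge_once(conn, "signup-1824-initial-charge", 1824, 4900)
```

The key has to come from the business event, such as the signup and the step, and not from the delivery attempt. A key that changes on every retry protects nothing.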

Debugging is the last check, and it deserves real weight. At 2 a.m., can someone on your team answer three questions quickly: what got queued, who picked it up, and what happened next? In PostgreSQL, that often means a few SQL queries against tables you already know. In Redis Streams, the answers are there too, but your team needs to be comfortable with stream IDs, pending entries, acknowledgements, and retention rules.

For many small SaaS teams, the simplest answer is enough: start with Postgres if you want a dependable job queue and easier support. Pick Redis Streams when you truly need multi-reader event flow, replay, and more control over consumer behavior.

What to do next

Do not make this choice in a meeting and move on. Write a short note first. Keep it plain: how much ordering you need, how replay should work, what operators will need to check every week, and what happens when a worker dies halfway through a job.

That note exposes weak assumptions quickly. Teams often say they need strict ordering, then realize they only need per user or per account ordering. They also say replay matters, but only for a few workflows such as billing fixes, imports, or delayed webhooks.

After that, build a small test that looks like your real system, not a demo queue with one worker and five clean messages. A good test should include failure on purpose. Stop a worker in the middle of processing. Send the same event twice. Delay one consumer by a few minutes. Replay yesterday's jobs into today's data. Check what an operator sees when jobs pile up.

You do not need a big benchmark yet. Fifty or a hundred realistic jobs often teach more than a synthetic load test. Watch where the team gets confused. If nobody can explain why a job ran twice, arrived late, or disappeared from the dashboard, take that as a warning.

Pick the simpler model your team can explain without hand waving. Simple wins twice: people debug it faster, and they make fewer bad fixes under pressure. If PostgreSQL gives you enough ordering and replay with fewer moving parts, that is often the safer call. If streams match your traffic pattern better, make sure someone on the team can run them calmly at 2 a.m.

If this decision could affect cloud spend, staff time, or uptime, a short architecture review is usually worth it. Oleg Sotnikov at oleg.is works with startups and small teams on decisions like this as a Fractional CTO and advisor, especially when workflow design, infrastructure, and AI driven development start to overlap.

The best next step is small and concrete: write the note, run the ugly test, and choose the option your team can keep alive without drama.

Frequently Asked Questions

Should most SaaS teams start with PostgreSQL queues?

Usually yes. If your app needs a dependable job queue, clear audit rows, and fewer moving parts, PostgreSQL gives most small SaaS teams the safer starting point.

When do Redis Streams make more sense?

Pick Redis Streams when several consumers need the same event, replay matters, and your team accepts the extra work around retention, pending entries, and consumer recovery.

Can either option guarantee strict ordering?

No. Once you run more than one worker, finish order starts to drift in both systems. Define order per customer, account, or document instead of chasing one global sequence.

Which option is safer for billing jobs?

PostgreSQL usually fits billing better because you can save business data and enqueue work in one transaction. That prevents missing follow-up work, and the durable rows make duplicate charges easier to guard against and inspect.

How does replay differ between them?

In PostgreSQL, teams often reset failed rows or create a new batch from saved records. In Redis Streams, replay works only if the events are still in the stream or you saved them somewhere durable.

Which one creates more operator work?

Redis often creates more weekly work around memory, trimming, and stuck pending messages. PostgreSQL shifts the work to table cleanup, vacuum, and archiving, but many teams already know those chores.

Do I still need idempotency with both models?

Yes. Both models can deliver the same work more than once after crashes, retries, or network trouble. Make workers safe to run twice before you trust either queue.

How long should I keep job history?

Keep history for as long as you may need to answer support questions or rerun work after a bug. If billing, approvals, or customer records depend on the queue, keep a longer trail than you first expect.

Is it hard to switch from one model later?

Yes, and the pain goes beyond moving data. Your code, retry rules, monitoring, and failure handling all change, so teams usually rewrite more than they planned.

What should I test before I choose?

Build one real workflow and break it on purpose. Kill a worker mid-job, replay old work, restore from backup, and check whether your team can explain what happened without guesswork.