SQL queue tables vs dedicated broker: how to choose
Compare failure modes, team effort, and throughput so you pick the simplest option that fits.

Why this decision gets hard fast
Most teams do not start by wanting a broker. They want a queue that works, and they do not want another service to install, patch, monitor, and fix at 2 a.m. If the app already uses SQL, a queue table feels cheap, familiar, and good enough.
The hard part shows up later. A job fails and retries. A worker sends an email, then crashes before it marks the row complete. The database slows down because it now handles both user traffic and background jobs. None of that looks scary on a diagram. It shows up during outages, partial failures, and cleanup work.
That is why the choice between a SQL queue table and a dedicated broker is rarely about raw speed alone. It is about how much failure your team can absorb before every incident turns into manual repair.
Small traffic can fool you. A product might process only a few hundred jobs on a normal day, then hit a sharp spike after a billing run, a big import, or one customer action that fans out into thousands of tasks. A queue that feels fine at noon can fall behind badly by 12:15.
Take a simple SaaS case. Most days, invoice jobs barely register. On billing day, the system creates 80,000 jobs in a short window. If one external API slows down and retries pile up, the backlog grows fast. At that point, the question is not abstract architecture. It is whether users wait, whether duplicate work slips through, and whether someone has to babysit the system.
Operations time matters as much as throughput. A broker often handles bursts better, but it adds another moving part. A queue table is easier to start with, but it pushes more responsibility into app code, query design, and database tuning.
A few blunt questions usually expose the real problem. How bad is duplicate processing? How often will jobs retry during outages? How sharp are your spikes? Who owns the system when it starts misbehaving? If those answers are still fuzzy, it is easy to pick the wrong tool.
What changes when you keep the queue in SQL
Keeping jobs in the same database as the app usually makes the first version easier to build. Orders, users, emails, imports, and background jobs sit in one place, so the team works with one system instead of two.
That also makes writes safer. If a customer places an order, the app can save the order row and insert the job in the same transaction. If the transaction fails, neither record appears. You do not need extra glue code to keep business data and queued work in sync.
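A minimal sketch of that pattern, using Python's built-in sqlite3 module as a stand-in for the app database. The orders and jobs tables, their columns, and the place_order helper are assumptions invented for illustration, not a schema from any real product.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    );
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'
    );
""")


def place_order(conn, customer, total_cents, fail_after_order=False):
    """Save the order and enqueue its job in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
            (customer, total_cents),
        )
        order_id = cur.lastrowid
        if fail_after_order:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES (?, ?)",
            ("send_receipt", json.dumps({"order_id": order_id})),
        )
    return order_id


def count(conn, table):
    """Row count for one of the two tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


place_order(conn, "alice", 4200)
try:
    place_order(conn, "bob", 1500, fail_after_order=True)
except RuntimeError:
    pass

# Only the successful call left rows behind: one order, one job.
print(count(conn, "orders"), count(conn, "jobs"))
```

The failed call leaves neither an order nor a job, which is the property a separate broker cannot give you without extra coordination.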
That is a big reason SQL feels so good early on. You reuse backups you already trust, the same access rules, and the same monitoring. The team does not need new dashboards, new authentication, or a separate on-call rotation just to run background jobs.
Debugging is often simpler too. When work gets stuck, normal SQL can answer most questions. You can check which jobs have been running too long, which type fails most often, whether retries pile up for one customer, or whether a worker stopped updating heartbeats. A developer can inspect exact rows, timestamps, payloads, and retry counts without learning another tool.
The downside is that all the pressure lands inside the database. Workers can block each other if they poll badly. A busy queue table can grow fast. Finished jobs turn into dead weight. Indexes get larger, cleanup takes longer, and routine queries start touching more data than they should.
Cleanup matters more than most teams expect. If you keep every completed job forever, the table slowly becomes a log warehouse. If you delete rows in huge bursts, you create write spikes and slowdowns. Most teams need a clear retention policy, steady purging, and a table layout built for constant inserts and deletes.
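Steady purging could look roughly like the sketch below. It deletes completed jobs in small batches so no single transaction holds the table for long. SQLite stands in for the real database, and the jobs table, the finished_at column (Unix seconds), the seven-day retention window, and the batch size are all assumptions.

```python
import sqlite3

DAY = 86_400  # seconds


def purge_completed(conn, now, retention_s, batch_size=500):
    """Delete completed jobs older than the retention window, one batch at a time."""
    cutoff = now - retention_s
    removed = 0
    while True:
        with conn:  # each batch is its own short transaction
            cur = conn.execute(
                "DELETE FROM jobs WHERE id IN ("
                "  SELECT id FROM jobs"
                "  WHERE status = 'done' AND finished_at < ?"
                "  ORDER BY id LIMIT ?)",
                (cutoff, batch_size),
            )
        removed += cur.rowcount
        if cur.rowcount < batch_size:
            return removed
        # A real purger would pause here so deletes trickle instead of spike.


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, finished_at INTEGER)"
)
now = 1_700_000_000  # fixed timestamp so the example is repeatable
rows = (
    [("done", now - 30 * DAY)] * 1200  # old completed jobs
    + [("done", now - 1 * DAY)] * 50   # recent completed jobs, inside retention
    + [("pending", None)] * 30         # unfinished work must never be purged
)
with conn:
    conn.executemany("INSERT INTO jobs (status, finished_at) VALUES (?, ?)", rows)

removed = purge_completed(conn, now, retention_s=7 * DAY)
remaining = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(removed, remaining)
```

The 1,200 old rows go away in three batches, while recent and pending rows stay. On a busy production table the right batch size and pause depend on the database engine and load, so treat these numbers as placeholders.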
A SQL queue usually fits well when job volume is moderate, failure handling is straightforward, and the team wants fewer moving parts. It stops feeling cheap when lock contention, table bloat, or slow polling starts stealing time every week.
What changes when you add a broker
A broker moves job traffic off the main database. That can be a relief when background work starts competing with customer requests for the same CPU, disk, and connection pool. During spikes, user writes stay calmer because workers pull from a separate system instead of hammering the same tables the app needs for normal reads and writes.
This matters most when traffic is uneven. A burst of 50,000 emails, webhooks, or image jobs can sit in the broker while the app keeps taking orders or saving form data. With a queue table, that same burst often shows up as slower queries, lock waits, or swollen indexes in the database that already runs the product.
The trade-off is simple. You reduce pressure on SQL, but you add another service to run. Someone now has to install it, patch it, tune it, back it up, and know what normal looks like when things go wrong.
A broker also forces the team to make delivery rules explicit. SQL tables let teams stay vague for a while because the data sits right there. Brokers make you decide how many times a worker retries, how long it waits between attempts, whether order matters, and when a message moves to a dead-letter queue for manual review.
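Written down, such a rule can be very small. The sketch below is one possible shape, not the configuration format of any particular broker: RetryPolicy, its fields, and the default numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry rule: exponential backoff with a cap and a dead-letter limit."""

    max_attempts: int = 5
    base_delay_s: float = 10.0
    max_delay_s: float = 600.0

    def next_step(self, failed_attempts: int):
        """Decide what happens after the given number of failed attempts (1 or more)."""
        if failed_attempts >= self.max_attempts:
            return ("dead_letter", None)
        # Double the wait after each failure, but never exceed the cap.
        delay = min(self.base_delay_s * 2 ** (failed_attempts - 1), self.max_delay_s)
        return ("retry", delay)


policy = RetryPolicy()
for failed in range(1, 6):
    print(failed, policy.next_step(failed))
```

With these defaults the waits are 10, 20, 40, and 80 seconds, and the fifth failure sends the message to the dead-letter queue. Many real setups also add random jitter to the delay so retries do not arrive in lockstep; that is left out here to keep the sketch short.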
Those rules are useful, but they take work. Skip them and failed jobs can loop forever, arrive twice, or disappear into a pile nobody checks.
Throughput often improves too, but only after worker counts, acknowledgments, and batching are tuned. A broker does not create infinite capacity. It gives you more room to grow without leaning so hard on the database.
For a small team, that extra room is worth it only when the queue has become a system of its own. If job volume is steady and modest, and the database still has headroom, a broker can just add alert noise. If the team already runs several services and has solid monitoring, the extra load is easier to absorb.
Where each option breaks first
Most queue problems start on the consumer side, not the producer side. A system can accept work all day and still fail once workers stop keeping up.
With a SQL queue, the first crack usually appears when several workers fight over the same pending rows. They scan the same table, hit the same index pages, and wait on locks or row updates. At first it feels minor. Jobs seem a little slower. Then the database spends more time finding work than finishing it.
That pain spreads quickly because the queue shares a home with app data. If job polling gets noisy, normal reads and writes slow down too. A design that feels fine with two workers can get messy with twenty.
Retries make this worse. If a worker sends an email, charges a card, or writes to another system and then crashes before it marks the row complete, the next worker may run the same job again. SQL did not create the duplicate, but it does not protect you from it either.
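The claim step that workers fight over can be sketched as a conditional update: a worker picks a candidate row, then marks it as running only if it is still pending. Postgres offers FOR UPDATE SKIP LOCKED for this job; the version below is a simpler, portable stand-in run on SQLite, and the jobs table, the claimed_by column, and claim_next_job are assumptions.

```python
import sqlite3


def claim_next_job(conn, worker_id):
    """Try to claim one pending job; return its id, or None if nothing was claimed."""
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE jobs SET status = 'running', claimed_by = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        # rowcount 0 means another worker claimed the row between the two statements.
        return row[0] if cur.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, "
    "status TEXT NOT NULL DEFAULT 'pending', "
    "claimed_by TEXT)"
)
with conn:
    conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("pending",)] * 3)

claims = [claim_next_job(conn, w) for w in ("w1", "w2", "w1", "w2")]
print(claims)  # [1, 2, 3, None]
```

Every worker running this query aims at the same lowest pending id, which is exactly where contention comes from as the worker count grows. The status check in the UPDATE keeps two workers from owning the same row, but it does nothing about a worker that crashes after doing the work.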
Brokers hide backlog longer
A broker usually breaks in a less obvious way. Producers keep publishing, the broker keeps accepting messages, and everyone assumes the system is healthy. Meanwhile consumers fall behind, message age climbs, and the backlog turns into a slow wall.
That is why brokers can feel better right up until they do not. They absorb pressure well, so teams notice trouble late. By the time someone checks consumer lag, some jobs may already be too old to matter.
Restarts and network issues also change delivery timing. A consumer may process a message, lose its connection before it acknowledges it, and then see the same message again after reconnecting. That is normal behavior in many brokers. The same pattern can happen in SQL when a worker restarts between "did the work" and "marked it complete."
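The redelivery pattern is easy to see in a toy model. ToyBroker below is an in-memory illustration of at-least-once delivery written for this article; it is not the client API of any real broker, and the message and worker names are made up.

```python
from collections import deque


class ToyBroker:
    """In-memory model of at-least-once delivery; not a real broker client."""

    def __init__(self):
        self.ready = deque()
        self.in_flight = {}  # consumer name -> message awaiting acknowledgment

    def publish(self, message):
        self.ready.append(message)

    def receive(self, consumer):
        if not self.ready:
            return None
        message = self.ready.popleft()
        self.in_flight[consumer] = message
        return message

    def ack(self, consumer):
        self.in_flight.pop(consumer, None)

    def connection_lost(self, consumer):
        # Unacknowledged work goes back to the front of the queue.
        message = self.in_flight.pop(consumer, None)
        if message is not None:
            self.ready.appendleft(message)


broker = ToyBroker()
broker.publish("send-receipt-42")
emails_sent = []

msg = broker.receive("worker-a")
emails_sent.append(msg)             # the side effect happens...
broker.connection_lost("worker-a")  # ...but the acknowledgment never arrives

msg = broker.receive("worker-b")    # the same message is delivered again
emails_sent.append(msg)
broker.ack("worker-b")

print(emails_sent)  # ['send-receipt-42', 'send-receipt-42']
```

The broker did its job correctly here. The duplicate email comes from the gap between the side effect and the acknowledgment, and only the handler can close that gap.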
Slow consumers hurt the most
Slow producers are usually easier to live with because they limit incoming work. Slow consumers do the opposite. They create backlog, stale jobs, retry storms, and noisy alerts.
If one job takes 30 seconds and new jobs arrive every second, the tool choice will not fix the math. At that point, the debate is usually in the wrong place. You need faster handlers, more workers, smaller jobs, or stricter limits on what enters the queue.
That is the pattern worth watching first. SQL often breaks with contention you can feel in the database. Brokers often break with lag you do not feel until it is already large. In both setups, handlers that are safe to run twice matter more than perfect delivery promises.
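The math behind the 30-second job example is short enough to write down. The sketch below assumes steady arrival rates, identical job runtimes, and an empty queue at the start; min_workers and backlog_after are hypothetical helper names, and the pool of 10 workers is an invented figure.

```python
import math


def min_workers(arrival_per_second, runtime_seconds):
    """Smallest worker count whose combined throughput matches the arrival rate."""
    # Rounding first avoids float noise pushing an exact product over the next integer.
    return math.ceil(round(arrival_per_second * runtime_seconds, 9))


def backlog_after(seconds, arrival_per_second, runtime_seconds, workers):
    """Jobs still waiting after `seconds`, assuming steady rates and an empty start."""
    service_per_second = workers / runtime_seconds
    return max(0.0, (arrival_per_second - service_per_second) * seconds)


# One new job per second, 30 seconds per job.
print(min_workers(1, 30))                    # 30
print(round(backlog_after(3600, 1, 30, 10)))  # 2400
```

With one job arriving per second and 30 seconds per job, anything under 30 workers falls behind no matter where the queue lives. With 10 workers, about 2,400 jobs are waiting after an hour.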
What the team has to run every day
A queue is never just code. Someone has to watch it, clean it, tune it, and fix it when work stops moving.
With a SQL queue, you usually run fewer systems, but the database takes more of the pain. If queue depth starts growing, workers throw more errors, or retries climb faster than jobs finish, small issues turn into support tickets fast. A stuck worker, one bad deploy, or one slow query can leave jobs sitting in the table while the app still looks mostly fine from the outside.
SQL queues also need regular cleanup. Old jobs, large payloads, and long retry histories make tables swell, slow backups, and raise the cost of routine queries. The team also has to tune connection limits, worker concurrency, retry rules, and deletion schedules for completed and dead jobs. None of that is especially hard on its own. Together, it adds up.
A broker changes the checklist, not the fact that you need one. You still watch queue depth, consumer errors, retry counts, lag, and dead-letter queues. You also need restart and failover practice before something breaks for real. If a node dies, if consumers reconnect all at once, or if a deployment rolls out in the wrong order, the team should know the steps from memory, not from a half-finished document.
This is where teams often misjudge the choice. They compare features, but they do not count operator time. A better test is blunt: how many hours per month does the team spend babysitting the queue? Count the time spent clearing stuck jobs, cleaning old payloads, tuning concurrency, checking dashboards, and handling false alarms. If that number stays low, the simpler option usually wins. If the queue needs constant attention, the cheap design is not cheap anymore.
How to size your needs step by step
Start with the work itself, not the tool. Many teams choose too early. They pick based on habit, then learn later that their jobs are tiny and infrequent, or that their traffic spikes far past what the first setup can handle.
Write down every job type you run in plain language: send email, resize image, create invoice PDF, sync CRM record, rebuild search index. Then note how often each one runs on a normal day and what happens during peaks such as imports, launches, or nightly batches.
A simple worksheet is enough:
- List each job type, how often it runs, and what peak traffic looks like.
- Note payload size, typical runtime, and how late the result can be before users notice.
- Test retries, worker crashes, and duplicate execution.
- Compare average load with short bursts, not just daily totals.
Average traffic hides the part that hurts. A system that handles 20 jobs per minute all day may still struggle if a bulk upload creates 2,000 jobs in three minutes. Peaks decide more than averages do.
Runtime matters just as much as job count. If a job takes 200 ms, a small worker pool can clear a lot of work. If it takes 45 seconds, the queue grows fast unless you add more workers. Payload size changes the picture too. A 1 KB payload is easy to keep in SQL. Large blobs, event streams, or heavy fan-out traffic push you toward a broker sooner.
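To put numbers on the runtime difference, the sketch below estimates how long a burst of 2,000 jobs takes to clear. It assumes no new jobs arrive during the drain and every worker stays busy; the pool of four workers is an invented figure, and drain_seconds is a hypothetical helper.

```python
def drain_seconds(jobs, runtime_s, workers):
    """Time to clear a burst, assuming no new arrivals and fully busy workers."""
    return jobs * runtime_s / workers


BURST = 2_000
WORKERS = 4  # hypothetical pool size

fast = drain_seconds(BURST, 0.2, WORKERS)  # 200 ms per job
slow = drain_seconds(BURST, 45, WORKERS)   # 45 s per job

print(f"200 ms jobs: {fast:.0f} s")
print(f"45 s jobs: {slow / 3600:.2f} h")
```

The same burst clears in under two minutes at 200 ms per job and takes more than six hours at 45 seconds per job. Job count alone says very little about how much queue you need.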
User tolerance sets the bar. A password reset email can wait a few seconds. A checkout confirmation probably should not. If users expect near-instant updates, measure the full delay, not just how fast the app enqueues work.
Do one ugly test before you decide. Kill a worker in the middle of a job. Force a timeout. Run the same job twice. If duplicate processing breaks billing, inventory, or customer messages, you need duplicate-safe job handling before you need a new piece of infrastructure.
After that exercise, the shape of the problem usually gets clearer. Low volume, small payloads, and relaxed timing often fit SQL well. Heavy bursts, strict latency, and many concurrent consumers usually justify a broker.
A simple example from one product team
A SaaS team runs a product that sends monthly invoices, email receipts, and partner webhooks. At first, one Postgres queue table handles all of it. The app writes a job row, workers claim rows with FOR UPDATE SKIP LOCKED, and most days the system feels boring. That is a good sign.
Normal traffic stays small. A few thousand emails go out, invoices render in the background, and failed jobs sit in the same table with a retry count and a next_run_at time. The team can inspect stuck work with plain SQL, which helps when only two engineers handle support.
Then billing day arrives. Ten thousand invoice jobs land in a short window, followed by email bursts and a pile of webhook calls from customer systems. Database CPU jumps, not because each job is hard, but because workers keep polling the same table while the app still needs that database for user traffic.
The real friction shows up in retries. Emails are fairly forgiving. If a provider slows down, the team can retry a few times over an hour and accept a small delay. Webhooks are different. One partner expects fast retries for a few minutes, another rate limits hard, and a third needs signed requests plus a clear delivery log for failed calls.
At that point, the team does not need to rewrite everything. It keeps invoices and routine email jobs in Postgres, where the flow is easy to query and easy to reason about. Those jobs are predictable, tied closely to app data, and cheap to manage in one place.
The team moves only the noisy webhook traffic to a broker. That removes the burstiest workload from the main database, gives webhook workers their own retry rules, and lets the team scale consumers without touching invoice processing. The database still stores business records and delivery status, but it stops acting as the traffic cop for every burst.
That split is often the practical middle ground. A queue table can carry a product much farther than people expect, but bursty jobs with uneven retry behavior often create the first clear case for a broker.
Mistakes that lead to the wrong choice
Most bad queue decisions come from solving a future problem while ignoring today's workload. Teams often pick the tool that looks more advanced, not the one that matches their failure patterns, traffic, and staff time.
One common mistake is copying the stack used by a large company. Their broker setup may make sense because they run several services, have separate platform engineers, and deal with constant high traffic. A small product team with one app and a few workers usually does not get the same payoff.
The opposite mistake happens too. Teams keep everything in SQL, then forget that queue tables need care. Old rows pile up, indexes grow, and worker queries slow down. After a few months, the database handles job dispatch, app reads, writes, and cleanup at the same time. That is when people say "SQL is slow," even though the real problem is neglected table hygiene.
Another bad call is adding a broker before anyone measures real peaks. "We might need 100,000 jobs a minute later" is not a reason. If the busiest hour only reaches a few hundred jobs a minute, you may be adding another moving part, another backup problem, and another source of alerts for little gain.
Retries fool teams too. People assume a failed job will just run again and finish cleanly. Real systems are messier. A worker can send an email, charge a card, or call an API, then crash before it marks the job done. If the retry runs the same job again, you now have duplicates. The queue choice matters less than whether the job itself is safe to run twice.
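One way to make a job safe to run twice is to derive an idempotency key from the job itself and pass it to the external system. The sketch below assumes the external API deduplicates on that key, which some payment APIs do but which you need to confirm for yours. FakePaymentProvider, handle_charge_job, and the completed set (a stand-in for a database table) are all hypothetical.

```python
class FakePaymentProvider:
    """Stand-in for an external API that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the original charge instead of charging again.
        return self.charges.setdefault(idempotency_key, {"amount_cents": amount_cents})


def handle_charge_job(job, provider, completed):
    """Safe to run twice: the key comes from the job, not from the attempt."""
    key = f"charge:{job['id']}"
    if key in completed:  # fast path: the work is already recorded as done
        return "skipped"
    provider.charge(key, job["amount_cents"])
    # A crash before the next line causes a retry, not a second charge.
    completed.add(key)
    return "charged"


provider = FakePaymentProvider()
completed = set()
job = {"id": 42, "amount_cents": 1999}

# Simulate a worker that charged the card, then crashed before recording success.
provider.charge(f"charge:{job['id']}", job["amount_cents"])

first = handle_charge_job(job, provider, completed)   # retry after the crash
second = handle_charge_job(job, provider, completed)  # accidental duplicate delivery
print(first, second, len(provider.charges))  # charged skipped 1
```

The local check alone would not be enough, because the crash happens before anything is recorded. The key passed to the provider is what keeps the customer from being charged twice. If the external system offers no such deduplication, the handler has to check the current state there before repeating the side effect.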
Another mistake creates pain quickly: putting every job in the same queue with the same rules. Urgent work gets stuck behind slow, cheap work. A password reset should not wait behind 40,000 analytics updates or image resizes.
Some warning signs are easy to spot. You picked a broker because it felt "more serious," not because SQL actually failed. You have no plan to archive or delete old jobs. You cannot say what peak job rate was last week. Your retry logic can repeat side effects. Your queue does not separate urgent work from bulk background work.
If even a couple of those sound familiar, stop changing tools for a moment. Measure queue depth, oldest job age, retry count, and peak jobs per minute first. That small set of numbers usually makes the decision much clearer.
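For a SQL queue, those four numbers can come from a few plain queries. The sketch below runs them on SQLite with tiny synthetic timestamps. The jobs table, the created_at column (Unix seconds), and the attempts column (attempts made so far) are assumptions about the schema.

```python
import sqlite3


def queue_metrics(conn, now):
    """Return depth, oldest pending age, retry count, and peak jobs created per minute."""
    depth, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM jobs WHERE status = 'pending'"
    ).fetchone()
    # Every attempt beyond the first counts as one retry.
    retries = conn.execute(
        "SELECT COALESCE(SUM(attempts - 1), 0) FROM jobs WHERE attempts > 1"
    ).fetchone()[0]
    # Integer division groups creation times into one-minute buckets.
    peak = conn.execute(
        "SELECT COALESCE(MAX(n), 0) FROM ("
        "  SELECT COUNT(*) AS n FROM jobs GROUP BY created_at / 60)"
    ).fetchone()[0]
    return {
        "depth": depth,
        "oldest_age_s": None if oldest is None else now - oldest,
        "retries": retries,
        "peak_per_minute": peak,
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
    "created_at INTEGER NOT NULL, attempts INTEGER NOT NULL DEFAULT 0)"
)
with conn:
    conn.executemany(
        "INSERT INTO jobs (status, created_at, attempts) VALUES (?, ?, ?)",
        [
            ("pending", 6000, 0),
            ("pending", 6010, 3),
            ("done", 6020, 1),
            ("pending", 6070, 0),
            ("done", 6075, 2),
        ],
    )

metrics = queue_metrics(conn, now=6300)
print(metrics)
```

Here three jobs are pending, the oldest has waited 300 seconds, three retries have happened, and the busiest minute saw three new jobs. Collecting the same numbers on a schedule for a few weeks gives you the peak data the decision needs.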
Quick checks and next steps
Most teams should keep the queue in SQL until the pain is clear. The most common wrong move is adding a broker because it feels more serious, not because the workload actually needs it.
If job volume is modest, the team is small, and the same people already run the database, SQL is usually the calmer choice. You get fewer moving parts, simpler debugging, and one place to inspect stuck work.
A broker starts to make sense when the workload changes shape. Sharp bursts, heavy fan-out, or strict isolation between services can push a queue table past the point where it stays simple.
Before changing anything, look at one full month of real peaks, not an average week. A system that handles 2,000 jobs most days may still fail during the two hours each month when it gets 80,000.
Write down the failure rules before moving jobs. Decide how retries work, when a job is dead, who can replay it, how duplicate work is handled, and what happens if a worker crashes halfway through.
A short final check helps:
- Stay with SQL if the queue is small enough to inspect by hand and workers rarely fall behind.
- Lean toward a broker if bursts create long backlogs or one noisy job type delays everything else.
- Count the operations work too: upgrades, alerts, backups, dashboards, and time spent on incidents.
One practical next step is to test the current design before replacing it. Run a burst test, kill a worker mid-job, force duplicate delivery, and measure how long recovery takes. That tells you more than any feature matrix.
If you want an outside review before adding more infrastructure, Oleg Sotnikov at oleg.is helps startups and small teams look at system design, operating cost, and where infrastructure changes are actually worth the extra complexity. A short architecture review is often cheaper than running the wrong setup for a year.
Frequently Asked Questions
Should I start with a SQL queue table?
Start with SQL if your job volume stays modest, your team is small, and the app already runs on one database. You get simpler setup, easier debugging, and one transaction for business data and queued work.
That choice stops feeling cheap when workers fight over rows, cleanup eats time every week, or queue load slows normal app traffic.
When should I move from SQL to a broker?
A broker makes sense when bursts get sharp, one noisy job type delays everything else, or background work starts hurting user traffic on the main database. It helps most when you need more isolation, more consumers, or separate retry rules.
If your queue still stays easy to inspect and workers rarely fall behind, a broker may just add more alerts and maintenance.
Is SQL safer because it supports one transaction?
Yes, for writes. SQL lets you write the app record and the job in one transaction. If the order save fails, the whole transaction rolls back and no job row appears.
That removes a lot of glue code early on, but it does not solve duplicate processing after a worker starts the job and crashes before it marks it done.
What usually breaks first with a SQL queue?
Most teams hit contention first. Workers poll the same table, scan the same rows, and spend too much time claiming jobs instead of finishing them.
After that, table growth hurts. Old jobs, retries, and large indexes raise query cost, slow cleanup, and make the database do extra work for both the queue and the app.
What usually breaks first with a dedicated broker?
A broker often hides trouble longer. Producers keep sending messages, the broker keeps accepting them, and the backlog grows until job age gets ugly.
You need to watch consumer lag, retry counts, and dead-letter traffic closely. If you only watch publish success, you will spot the problem late.
How do I stop retries from creating duplicate work?
Assume every job may run twice. A worker can send an email, charge a card, or call an API, then crash before it records success.
Make handlers idempotent where you can. Store external request IDs, check current state before repeating side effects, and keep replay rules clear for jobs that touch money or customer messages.
Should urgent jobs and bulk jobs share one queue?
No. Split urgent work from slow bulk work. A password reset or checkout email should not wait behind imports, analytics, or image jobs.
Separate queues also let you tune retries and worker counts by job type instead of forcing one rule onto everything.
How much cleanup does a SQL queue need?
Plan for regular cleanup from day one. Keep completed jobs only as long as you need them for support, audits, or debugging, then purge or archive them in small steady batches.
If you keep everything forever, the queue table turns into a log dump. That drives up storage, index size, backup time, and routine query cost.
Do occasional spikes mean I need a broker?
Not always. A monthly spike alone does not force a broker if the backlog clears fast and the database still has headroom.
Measure the real peak, the oldest job age during that peak, and how long recovery takes. If users feel the delay or the database slows down for the rest of the app, then the spike matters.
What should I test before I replace my current queue?
Run ugly tests on the system you already have. Fire a burst of jobs, kill a worker mid-task, force a timeout, and replay the same job twice.
Then measure queue depth, oldest job age, retry count, and recovery time. Those numbers tell you far more than a feature chart.