Dec 23, 2024·8 min read

Background job idempotency in Python for payments and sync

Learn background job idempotency in Python with simple payment and sync examples, safe retry rules, partial failure handling, and cleanup steps.

Why duplicate jobs cause real damage

One event can create the same job twice more easily than most teams expect. A customer clicks "Pay" again after a slow spinner. A webhook gets delivered twice. A queue puts the message back because the worker did not confirm it in time.

That sounds harmless until money or data moves. A retry can charge a card again, send the same invoice twice, or import the same customer record into your CRM with a new internal ID. The first bug hurts trust. The second bug creates cleanup work that can drag on for days.

The worst part is that duplicates often come from normal failures, not weird edge cases. A worker can crash after it charges a card but before it saves the result. A timeout can hide whether the payment provider accepted the request. Two workers can grab the same job at nearly the same moment. A replay can happen after a deploy, a manual rerun, or a queue redelivery.

Without idempotent jobs, retries turn routine failure handling into a business problem. You do want retries. They save jobs when a network fails or a service goes down for a minute. But each retry must aim at the same real-world outcome, not create a new one.

Think of a payment job for order 481. Ten attempts might happen because of timeouts, crashes, and redelivery. The safe result is still one charge, one order update, and one final status in your system.

The same rule applies to sync work. If the import job runs three times, the contact should still exist once, with one correct state. Many attempts are fine. One real outcome is the only result users should ever see.

Set the job contract before you code

A retry is safe only when the job has a clear contract. That contract matters more than the retry loop itself. If you skip it, two workers can do the same work twice and both think they did the right thing.

Start by naming every side effect the job can create. A payment job often does more than charge a card. It may also write a payment row, mark an invoice as paid, send a receipt, and push an event to another system. A sync job may create local records, update timestamps, and queue follow-up work.

Write the contract in plain language:

What external action can happen
What database rows can change
What counts as success for one run
What a second run must do if the first one already finished
When the worker must stop and raise an alert

Success should mean one business result, not just "the code finished." For a payment job, success is usually "the customer got charged once, and we stored that outcome." If the charge succeeded but the worker crashed before saving it, the retry should not charge again. It should fetch or confirm the existing charge and finish the missing local work.

Repeat runs need a defined return value. Some teams return the original result, such as the payment ID or imported record ID. Others return a no-op status like "already processed." Either choice works if every caller handles it the same way.

Stop rules need just as much care. If the provider returns conflicting data, if the customer record is missing, or if the job hits the same timeout five times, stop retrying and alert a person. Endless retries turn a small bug into a billing mess or a pile of duplicate records.
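One way to keep the contract honest is to write it down as data next to the job. The sketch below is only an illustration: `JobContract`, `RepeatPolicy`, `should_stop`, and the `charge_order` example are hypothetical names, and the limit of five timeouts simply mirrors the example above.

```python
from dataclasses import dataclass
from enum import Enum


class RepeatPolicy(Enum):
    RETURN_ORIGINAL = "return_original"   # hand back the stored result
    NO_OP = "already_processed"           # report that nothing new was done


@dataclass(frozen=True)
class JobContract:
    name: str
    external_actions: tuple   # what can happen outside our system
    tables_touched: tuple     # which database rows can change
    success_means: str        # one business result, in plain language
    on_repeat: RepeatPolicy   # what a second run returns
    max_same_timeouts: int = 5  # assumption: alert after 5 identical timeouts


def should_stop(contract: JobContract, consecutive_timeouts: int,
                provider_conflict: bool, customer_missing: bool) -> bool:
    """Return True when the worker must stop retrying and alert a person."""
    return (provider_conflict
            or customer_missing
            or consecutive_timeouts >= contract.max_same_timeouts)


CHARGE_ORDER = JobContract(
    name="charge_order",
    external_actions=("create charge at payment provider", "send receipt"),
    tables_touched=("payments", "invoices"),
    success_means="customer charged once and the outcome is stored",
    on_repeat=RepeatPolicy.RETURN_ORIGINAL,
)

print(should_stop(CHARGE_ORDER, 2, False, False))  # False: keep retrying
print(should_stop(CHARGE_ORDER, 5, False, False))  # True: alert a person
```

The point is less the dataclass than the habit: every caller reads the same repeat policy and the same stop rule.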

Choose an idempotency key that survives retries

Most retry bugs start with the wrong identifier. If each retry gets a fresh random ID, your worker cannot tell whether it is seeing the same job again or a brand new one.

Use a business ID that already means something outside the worker. For a payment, that is often an order ID or invoice ID. For a sync job, it is usually the source record ID from the system you are importing from. Those values stay the same when the process restarts, the queue replays a message, or a worker crashes halfway through.

This matters because retries are normal. A network timeout, a deploy, or a queue reconnect can run the same job twice within minutes. If the idempotency key changes between attempts, you lose the whole safety net.

A simple rule helps: build the key from the business event, not from the delivery attempt.

Good choices: order_18452, invoice_90017, crm_contact_4419
Bad choices: a UUID created inside the worker, the current timestamp, a hash of the raw payload if field order can change

Time-based values break easily. So do payload hashes when two equivalent payloads serialize in a different order. Even small formatting changes can create a different hash for the same real-world action.
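A small sketch makes the difference visible. The helper name `idempotency_key` and the sample payloads are made up for illustration. The two payloads describe the same order but serialize their fields in a different order.

```python
import hashlib
import json
from typing import Union


def idempotency_key(kind: str, business_id: Union[int, str]) -> str:
    """Build the key from the business event, never from the delivery attempt."""
    return f"{kind}_{business_id}"


# Same real-world action, two different serializations.
payload_a = json.dumps({"order_id": 18452, "amount": 4900})
payload_b = json.dumps({"amount": 4900, "order_id": 18452})

hash_a = hashlib.sha256(payload_a.encode()).hexdigest()
hash_b = hashlib.sha256(payload_b.encode()).hexdigest()
print(hash_a == hash_b)   # False: the raw-payload hash changed

key_a = idempotency_key("order", json.loads(payload_a)["order_id"])
key_b = idempotency_key("order", json.loads(payload_b)["order_id"])
print(key_a == key_b)     # True: both attempts map to "order_18452"
```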

A payment example makes this obvious. If a customer clicks "Pay" and your job runs three times, all three attempts should reuse the same payment key tied to that order. The gateway can then reject duplicates or return the first result instead of charging again.

The same pattern works for sync. If a customer record with source ID 4419 gets imported today and retried tomorrow, the worker should still look up 4419, not a new retry token. Stable input gives you stable behavior, which is the whole point.

Store state where every worker can check it

A job is not idempotent if one worker knows the history and another worker does not. Memory, local files, and process-level caches fail the moment the queue retries on a different machine. Put the job state in one shared place, usually the same database your workers already trust.

The storage rule is simple: every retry must be able to answer two questions fast. Did this job already start, and did it already finish?

What to save

A small table usually does the job:

the idempotency key, with a unique constraint
a status such as started, finished, or failed
the external request ID, such as a payment provider charge ID or sync batch ID
the final result, plus timestamps for creation, update, and expiry

That unique constraint matters more than many teams think. If two workers read first and insert later, they can both pass the check and both run the job. Insert the row first, let the database reject duplicates, then load the existing row and decide what to do next.
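Here is a minimal sketch of the insert-first pattern, using SQLite from the standard library so it runs anywhere. The table name `job_state`, its columns, and the `claim_job` helper are assumptions for this example, not a required schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS job_state (
    idempotency_key TEXT PRIMARY KEY,   -- the unique constraint does the dedup
    status          TEXT NOT NULL,      -- started | finished | failed
    external_id     TEXT,               -- e.g. provider charge ID
    result          TEXT,
    created_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def claim_job(conn: sqlite3.Connection, key: str):
    """Insert first; if the key already exists, load the existing row instead."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO job_state (idempotency_key, status) "
                "VALUES (?, 'started')",
                (key,),
            )
        claimed = True
    except sqlite3.IntegrityError:
        claimed = False  # another attempt inserted this key before us
    row = conn.execute(
        "SELECT * FROM job_state WHERE idempotency_key = ?", (key,)
    ).fetchone()
    return claimed, row


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(SCHEMA)

first, _ = claim_job(conn, "order_18452")
second, existing = claim_job(conn, "order_18452")
print(first, second, existing["status"])   # True False started
```

There is no "check, then insert" step here. The database decides who wins, and the loser reads the winner's row.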

Store the external request ID as soon as you get it. If a payment worker creates a charge request and crashes before it writes the order update, the next retry can look up that same external ID instead of charging again. The same idea works for sync jobs. If you already imported customer "A123" into your system, the next retry should reuse that mapping, not create a second customer.

Save the final result too, not just the status. A repeated job should return the same answer when possible. For a payment, that might be "charged" with a charge ID and amount. For a sync, that might be "imported" with the local record ID.

Do not delete these records too soon. Keep them longer than your longest retry window, and longer than any manual replay window your team uses. If retries can arrive for 72 hours, a one-day retention period is too short. Many teams keep these rows for at least a week, and payment flows often need longer.

If storage grows fast, archive old rows. Do not throw away the only proof that a job already ran.

Build the flow step by step

The order of operations matters more than the worker library. A safe flow is simple: check state, lock state, do the outside action once, save the outcome, then ack the queue.

A worker should follow the same path every time a retry happens:

Look up the job by its retry-safe identifier before you call anything outside your system. If you already have a finished result, return it and stop.
Start one database transaction and insert the job row, or lock the existing row. This stops two workers from charging the same card or importing the same record at the same time.
Run the side effect once. That might mean sending one payment request or pulling one page from a partner API. Capture the full response, including provider IDs, status codes, and error details.
Save that result to the database before you ack the queue message. If the worker crashes after the side effect but before the save, the next retry has no proof that the action already happened.
Retry only the parts that stay safe on repeat. Database reads, status checks, and writing the same final result again are usually fine. A second charge or a second "create" call usually is not.

That order handles the ugly case teams hit in production: the payment provider says "success," then the worker dies. On the next run, your code reads the stored row, sees the provider reference, and skips the charge. If you ack the message before you save the result, you lose that protection.

Keep the state model small. A row with "started," "finished," and "failed" plus timestamps and external reference IDs is often enough. If a job gets stuck in "started" for too long, another worker can check whether the outside action happened, then finish the row instead of doing the action again.

This is the part many teams overbuild. You do not need a fancy workflow engine to get this right this week. You need one shared table, one transaction around the lock, and strict queue ack timing.
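The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not a drop-in worker: `run_job`, the `job_state` table, and the `side_effect` and `ack` callables are hypothetical, and SQLite has no row locks, so the claim step here relies on the primary key alone. On a server database you would typically lock the row as well, for example with SELECT ... FOR UPDATE in PostgreSQL.

```python
import json
import sqlite3
from typing import Callable


def run_job(conn: sqlite3.Connection, key: str,
            side_effect: Callable[[str], dict],
            ack: Callable[[], None]) -> dict:
    # 1. Check state before calling anything outside the system.
    row = conn.execute(
        "SELECT status, result FROM job_state WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    if row is not None and row[0] == "finished":
        ack()
        return json.loads(row[1])   # repeat run: return the stored result

    # 2. Claim the job. The primary key keeps a single row per key.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO job_state (idempotency_key, status) "
            "VALUES (?, 'started')",
            (key,),
        )

    # 3. Run the side effect once, passing the same key to the outside system.
    response = side_effect(key)

    # 4. Save the outcome before acking the queue message.
    with conn:
        conn.execute(
            "UPDATE job_state SET status = 'finished', external_id = ?, "
            "result = ? WHERE idempotency_key = ?",
            (response["id"], json.dumps(response), key),
        )

    # 5. Ack last.
    ack()
    return response


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_state (idempotency_key TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, external_id TEXT, result TEXT)"
)
calls, acks = [], []


def fake_charge(key: str) -> dict:
    calls.append(key)               # stands in for one provider request
    return {"id": "ch_demo_1", "status": "succeeded"}


first = run_job(conn, "order_481", fake_charge, lambda: acks.append(1))
again = run_job(conn, "order_481", fake_charge, lambda: acks.append(1))
print(len(calls), first == again)   # 1 True
```

The second run never reaches the provider. It reads the finished row, returns the stored result, and acks.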

Payment example: charge once and record the outcome

Payments punish sloppy retry logic fast. If a worker crashes after the card charge but before the database write, the next retry can charge the customer again unless every step checks the same order_id.

Use order_id as the stable idempotency key. Keep it the same across retries, worker restarts, and manual replays. If each attempt creates a new UUID, your payment retry handling breaks on the first timeout.

Send that same order_id to the payment provider when you create the charge. The provider can then return the original result for repeat requests instead of creating a second charge. Your app and the provider now track the same payment with the same identity.

Keep the order of operations strict:

Load the order and stop if it is already paid.
Check shared state for order_id and reuse any saved result.
Create or confirm the charge with the provider using that same idempotency key.
Save the provider charge_id and status in your database.
Mark the order as paid only after that write succeeds.

Timeouts need special care. If the provider call times out, treat the result as unknown until you verify the charge. Do not mark it failed, and do not retry blindly.

Picture a real failure: the card charge succeeds, but the worker dies before it stores charge_id. The next job run should not refund or retry first. It should ask the provider whether a charge already exists for that order_id, recover the missing charge_id, save it locally, and then finish the order update.

That repair check should run before any refund or second charge attempt. If the provider shows a successful charge, record it and close the gap in your database. If the provider shows no charge, retry with the same order_id. This is the part of idempotent job design that saves money, support time, and angry emails.
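The timeout case is easier to reason about with a toy provider. Everything below is hypothetical: `FakeProvider`, `create_charge`, and `find_charge` stand in for whatever your real provider offers, and the `payments` dict stands in for the shared payments table. The sketch assumes the provider dedupes by idempotency key and lets you look a charge up by that key.

```python
class FakeProvider:
    """Hypothetical stand-in for a payment provider that dedupes by key."""

    def __init__(self):
        self.charges = {}
        self.drop_next_response = False

    def create_charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # The same key returns the original charge instead of a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
                "status": "succeeded",
            }
        if self.drop_next_response:
            # The charge happened, but the worker never hears about it.
            self.drop_next_response = False
            raise TimeoutError("no response from provider")
        return self.charges[idempotency_key]

    def find_charge(self, idempotency_key: str):
        return self.charges.get(idempotency_key)


def charge_order(provider, payments: dict, order_id: int, amount_cents: int):
    key = f"order_{order_id}"
    saved = payments.get(key)
    if saved and saved["status"] == "paid":
        return saved                      # already paid: stop

    # Repair check first: did an earlier attempt already charge?
    charge = provider.find_charge(key)
    if charge is None:
        try:
            charge = provider.create_charge(key, amount_cents)
        except TimeoutError:
            # Unknown, not failed. A later run verifies before retrying.
            payments[key] = {"status": "unknown", "charge_id": None}
            raise
    payments[key] = {"status": "paid", "charge_id": charge["charge_id"]}
    return payments[key]


provider, payments = FakeProvider(), {}
provider.drop_next_response = True
try:
    charge_order(provider, payments, 481, 4900)
except TimeoutError:
    pass
state_after_timeout = payments["order_481"]["status"]

result = charge_order(provider, payments, 481, 4900)
print(state_after_timeout, result["status"], len(provider.charges))
# unknown paid 1
```

The retry recovers the missing charge_id from the provider instead of creating a second charge.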

Sync example: import records without duplicates

A sync job goes wrong fast when it treats every incoming record as new. Queues retry, webhooks repeat, and upstream systems send events out of order. If your worker inserts blind, you end up with duplicate customers, broken counts, and hard cleanup work.

Use the source record ID as the thing that stays the same across retries. For a CRM import, that might be source_contact_id. For an ecommerce sync, it could be the remote order ID. Do not build this around a local database ID or a timestamp created by the worker. Those change too easily.

When the worker gets a record, it should upsert it. That means "create the row if it does not exist, or update the existing row if it does." With PostgreSQL or SQLite, teams often do this with a unique constraint on source_id and an INSERT ... ON CONFLICT DO UPDATE query. Same source ID, same local row. That rule keeps duplicates out.
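A minimal upsert sketch, again on SQLite so it runs without setup (the ON CONFLICT clause needs SQLite 3.24 or newer). The `contacts` table and the `upsert_contact` helper are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id        INTEGER PRIMARY KEY,    -- local ID, never used for matching
        source_id TEXT NOT NULL UNIQUE,   -- ID from the upstream system
        email     TEXT NOT NULL
    )
""")


def upsert_contact(conn: sqlite3.Connection, source_id: str, email: str) -> None:
    """Create the row if it is new, otherwise update the existing row."""
    with conn:
        conn.execute(
            """
            INSERT INTO contacts (source_id, email) VALUES (?, ?)
            ON CONFLICT(source_id) DO UPDATE SET email = excluded.email
            """,
            (source_id, email),
        )


# Three deliveries of the same record, then one real update.
for _ in range(3):
    upsert_contact(conn, "4419", "ana@example.com")
upsert_contact(conn, "4419", "ana.new@example.com")

rows = conn.execute("SELECT source_id, email FROM contacts").fetchall()
print(rows)   # [('4419', 'ana.new@example.com')]
```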

You also need one field that tells you which copy of the record is newer. Pick the best option the source gives you:

a version number
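a checksum of the source payload (it shows that the content changed, not which copy is newer, so pair it with one of the other fields)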
an updated_at timestamp
an event ID if events are ordered

When an old event shows up after a new one, compare that value with what you already stored. If the incoming record is older, skip it and log that choice. Otherwise, update the row. This stops a stale retry from overwriting fresh data.

Keep a small audit trail too. A compact import_events table with source_id, event_id, source_updated_at, checksum, and result is usually enough. Then if record 42 shows up three times, your team can see whether the worker imported it, skipped it as stale, or retried after a failure.
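The stale check and the audit trail fit in one small function. This sketch assumes the source provides an integer version; the table layout, the `import_contact` helper, and the result labels are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (
        source_id TEXT PRIMARY KEY,
        email     TEXT NOT NULL,
        version   INTEGER NOT NULL
    );
    CREATE TABLE import_events (
        source_id TEXT NOT NULL,
        event_id  TEXT NOT NULL,
        version   INTEGER NOT NULL,
        result    TEXT NOT NULL
    );
""")


def import_contact(conn, event_id: str, source_id: str,
                   email: str, version: int) -> str:
    with conn:
        row = conn.execute(
            "SELECT version FROM contacts WHERE source_id = ?", (source_id,)
        ).fetchone()
        if row is not None and version <= row[0]:
            result = "skipped_not_newer"   # stale event or plain retry
        else:
            # The WHERE guard still protects the row if another worker
            # wrote a newer version between the read and this write.
            conn.execute(
                """
                INSERT INTO contacts (source_id, email, version)
                VALUES (?, ?, ?)
                ON CONFLICT(source_id) DO UPDATE
                    SET email = excluded.email, version = excluded.version
                    WHERE excluded.version > contacts.version
                """,
                (source_id, email, version),
            )
            result = "imported"
        # Log every decision so the team can see what happened to record 42.
        conn.execute(
            "INSERT INTO import_events VALUES (?, ?, ?, ?)",
            (source_id, event_id, version, result),
        )
    return result


r1 = import_contact(conn, "evt_2", "42", "new@example.com", 2)
r2 = import_contact(conn, "evt_1", "42", "old@example.com", 1)  # arrives late
r3 = import_contact(conn, "evt_2", "42", "new@example.com", 2)  # queue retry
print(r1, r2, r3)   # imported skipped_not_newer skipped_not_newer
```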

Handle partial failures and cleanup rules

A retry turns risky when one step finishes and the next one does not. Split outside side effects from local updates on purpose. In a payment worker, call the payment provider and then write the result to your database with the same idempotency key. If the worker crashes between those two steps, do not guess what happened.

Treat "unknown" as a real status. Use it when the provider may have accepted a charge, but your app did not save the final state. A small repair job can later ask the provider for the charge status and update your records without charging twice.

Unknown beats wrong

Fake certainty causes more damage than a temporary unknown state. If a charge request may have left your system, keep the row and mark it unknown. If a sync job may have pushed a customer record but failed before saving the external ID, mark that record for recheck instead of creating a fresh one on the next retry.

Placeholder rows help workers avoid double work, but delete them only when no side effect happened. If the worker failed before it sent the API call, you can safely remove the placeholder and retry. If the API call might have gone out, keep the placeholder, store the job state, and let repair logic finish the story.

Cleanup should be small

Broad cleanup scripts create fresh problems. Keep each script narrow, easy to test, and easy to reverse if it touches the wrong rows.

A good repair task usually does only a few things:

filters by job type and a small time window
rechecks unknown states against the external system
updates one status field at a time
writes an audit log with the job key, old state, new state, and reason

That same rule fits sync work. If an import stops after 320 records, repair the uncertain rows first. Do not wipe the whole batch and start over. Cleanups should close gaps, not create new duplicates.
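A narrow repair task along those lines might look like the sketch below. The `job_state` and `job_audit` tables, the `repair_unknown` function, and the `lookup` callable (which asks the external system what happened and returns a new status or None) are all assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def repair_unknown(conn, job_type: str, lookup, now: datetime,
                   window_hours: int = 24) -> int:
    """Recheck 'unknown' jobs of one type inside a small time window."""
    cutoff = (now - timedelta(hours=window_hours)).isoformat()
    rows = conn.execute(
        "SELECT idempotency_key FROM job_state "
        "WHERE job_type = ? AND status = 'unknown' AND updated_at >= ?",
        (job_type, cutoff),
    ).fetchall()
    repaired = 0
    for (key,) in rows:
        new_status = lookup(key)   # ask the external system what happened
        if new_status is None:
            continue               # still unclear: leave it for a person
        with conn:
            # Update one status field and write the audit row together.
            conn.execute(
                "UPDATE job_state SET status = ?, updated_at = ? "
                "WHERE idempotency_key = ? AND status = 'unknown'",
                (new_status, now.isoformat(), key),
            )
            conn.execute(
                "INSERT INTO job_audit VALUES (?, 'unknown', ?, ?)",
                (key, new_status, "provider recheck"),
            )
        repaired += 1
    return repaired


now = datetime(2024, 12, 23, 12, 0, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job_state (
        idempotency_key TEXT PRIMARY KEY, job_type TEXT NOT NULL,
        status TEXT NOT NULL, updated_at TEXT NOT NULL);
    CREATE TABLE job_audit (
        idempotency_key TEXT NOT NULL, old_state TEXT NOT NULL,
        new_state TEXT NOT NULL, reason TEXT NOT NULL);
""")
recent = (now - timedelta(hours=1)).isoformat()
old = (now - timedelta(hours=48)).isoformat()
conn.executemany("INSERT INTO job_state VALUES (?, ?, ?, ?)", [
    ("order_1", "payment", "unknown", recent),
    ("order_2", "payment", "unknown", old),     # outside the time window
    ("contact_9", "sync", "unknown", recent),   # different job type
])
conn.commit()

provider_view = {"order_1": "finished"}         # what the provider reports
fixed = repair_unknown(conn, "payment", provider_view.get, now)
print(fixed)   # 1
```

Only one row changes, and the audit table records the key, the old state, the new state, and the reason.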

Mistakes that break idempotency

Most idempotency bugs come from small choices that look harmless during a normal run. They show up later, when a worker crashes, a timeout hides a success, or a retry lands hours after the first attempt.

A common payment bug is simple: the retry creates a fresh idempotency key. The worker thinks it is retrying the same charge, but the provider sees a brand new request. If the first call actually worked, the customer gets charged twice.

Another mistake happens earlier in the flow. Teams call the payment provider or third-party API before they save job state in shared storage. If the process dies after the external call but before the save, the next worker has no record of what happened and runs the action again.

These failure patterns cause most of the damage:

A retry generates a new idempotency key instead of reusing the original one.
The job sends the external request before it writes "started" or "in progress" state.
The team keeps deduplication data only in cache, even though retries may happen days later.
The worker treats every timeout as a failed action and retries without checking the provider first.
Cleanup deletes old idempotency rows before the retry window and reconciliation window end.

Cache-only tracking is risky for long retry windows. Redis can help with speed, but it should not be the only source of truth for a payment or a sync import that may retry after a queue delay, a deploy, or a weekend outage. Put the durable record in a database that every worker can read.

Timeout handling needs extra care. A timeout means "you do not know," not "it failed." For payments, ask the provider for the status using the same idempotency key or external reference before you retry. For sync jobs, check whether the target record already exists before you write it again.

Cleanup can also undo good work. If you delete idempotency rows after 24 hours but some retries arrive after 72, you reopen the door to duplicates. Keep rows long enough to cover queue delays, provider delays, manual replays, and chargeback or audit checks. Storage is usually cheaper than cleaning up a double charge.

Quick checks before release

A job is not idempotent because the code looks careful. It is idempotent when you can break it on purpose and still get one clean result.

Run a few ugly tests before you ship. They catch the bugs that normal happy-path tests miss, especially in payments and sync work.

Start the same job twice at the same time with the same ID. One worker should claim it, and the other should see an existing record and stop. If both workers go on to run the side effect, your lock, unique constraint, or claim step is too weak.
Crash the worker after the external action but before the final database write. Then rerun the job. You should still end up with one charge, one email, or one imported record set, not two.
Check what support staff can see. They should find the job by customer, order, or provider reference and read a plain status such as pending, done, failed, or needs review.
Test your repair path for unknown states. If the worker dies after sending a payment request, your team should be able to run a repair task that asks the provider what happened and updates the record without touching SQL by hand.
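The crash test is the easiest of these to automate. The sketch below injects a failure between the external action and the local save, then reruns the job. `process`, `Crash`, and `FakeProvider` are hypothetical names, and the fake provider is assumed to dedupe by idempotency key, as in the payment example.

```python
import sqlite3


class Crash(Exception):
    """Simulated worker death."""


class FakeProvider:
    def __init__(self):
        self.charges = {}

    def charge(self, key: str) -> str:
        # The same key always maps to the same charge.
        return self.charges.setdefault(key, f"ch_{len(self.charges) + 1}")


def process(conn, provider, key: str, crash_before_save: bool = False) -> str:
    row = conn.execute(
        "SELECT status FROM job_state WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row and row[0] == "finished":
        return "already processed"
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO job_state VALUES (?, 'started', NULL)",
            (key,),
        )
    charge_id = provider.charge(key)      # the external action
    if crash_before_save:
        raise Crash("died after the charge, before the local save")
    with conn:
        conn.execute(
            "UPDATE job_state SET status = 'finished', external_id = ? "
            "WHERE idempotency_key = ?",
            (charge_id, key),
        )
    return "finished"


def test_crash_then_retry_charges_once():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_state (idempotency_key TEXT PRIMARY KEY, "
        "status TEXT NOT NULL, external_id TEXT)"
    )
    provider = FakeProvider()
    try:
        process(conn, provider, "order_481", crash_before_save=True)
    except Crash:
        pass                              # the queue would redeliver here
    assert process(conn, provider, "order_481") == "finished"
    assert process(conn, provider, "order_481") == "already processed"
    assert len(provider.charges) == 1     # one charge, not two
    row = conn.execute("SELECT status, external_id FROM job_state").fetchone()
    assert row == ("finished", "ch_1")


test_crash_then_retry_charges_once()
```

The same shape works with a real test runner such as pytest. The important part is the injected crash, not the framework.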

A small admin view matters more than teams expect. When a customer says, "I think I was charged twice," support needs more than raw logs. They need the job id, request id, external reference, last error, and when the last retry ran.

Repair rules should also be narrow and boring. For example, if a payment record is still in the started or unknown state after 15 minutes, query the payment provider, save the final status, and close the job. If a sync imported 80 of 100 rows before a crash, rerunning it should skip the 80 rows already marked as done and finish the last 20.

If these tests pass, retries stop feeling risky. They become routine.

What to do this week

Start with one job that can cost money or corrupt data if it runs twice. A payment capture job is a strong first target. So is a sync that imports customers, invoices, or subscriptions from another system. Pick a real job that already retries under load, not a clean demo.

Keep the first pass simple and practical:

Add an idempotency table that every worker can check.
Put a unique constraint on the idempotency key.
Force a crash after the external call and before the local save.
Write one repair task for jobs that end in an unknown state.

That small set of changes catches the failures teams usually miss. The crash test matters most. Many bugs hide in the tiny gap between "the provider accepted the request" and "your app recorded the result." If your worker dies in that gap, you need a clear way to recover without charging twice or importing the same record again.

Keep the stored state boring. Save the idempotency key, current status, provider request ID if you have one, last error, and timestamps. That is enough for most Python workers. You do not need a huge framework to get this right.

A repair task should answer one question: what actually happened? For a payment job, it can ask the provider whether the charge exists, then mark the local record as paid or failed. For a sync job, it can re-check which records already landed and finish only the missing writes.

If you want a second opinion on Python job design, retries, or cleanup rules, Oleg Sotnikov does this kind of Fractional CTO work. A short review can spot weak points before they turn into refunds, duplicate records, or late-night cleanup.