Apr 08, 2025·8 min read

Webhook reliability for duplicates, delays, and bad payloads

Webhook reliability starts with clear signatures, safe retries, replay checks, and stable event schemas that keep integrations calm.

Why webhooks fail in normal systems

Most webhook failures do not start with a bug. They start with normal network behavior, ordinary timeouts, and two systems making slightly different assumptions.

A sender might deliver the same event twice because it never got a clean "200 OK" back. The receiver may have processed the first request just fine, but a slow response, dropped connection, or proxy timeout makes the sender think it failed. Now the customer sees two invoices, two emails, or two status changes.

Ordering breaks more often than people expect. Queues slow down, workers restart, and one event takes longer to process than the next. So account.updated can arrive before account.created, or a refund can show up before the payment event that explains it.

Payloads cause a different kind of trouble. One missing field, a null where a string should be, or malformed JSON can stop a parser cold. Even small changes hurt when the receiver assumes every event will always match one exact shape.

Timeouts turn small hiccups into noisy failures. If one downstream API call stalls for 20 seconds, the webhook handler may miss its response deadline. The sender retries, then retries again, and one slow request turns into a retry storm that keeps hitting the same endpoint.

Real systems pile these problems together. A duplicate event arrives late, the payload version changed, and the receiver is already under load from earlier retries. Nothing unusual happened on its own, but the combination creates hours of cleanup.

That is why good integrations treat every delivery as untrusted, out of order, and temporary. Accept that early, and the rest of the design gets simpler.

Build the receiver flow step by step

Most webhook bugs start in the first few milliseconds. A receiver should do as little as possible before it knows the request is real, fresh, and not a repeat.

A clean flow also makes failures easier to explain when a customer asks why an event did not apply.

  1. Read the raw request body exactly as it arrived. Do this before any JSON parser, framework helper, or middleware changes whitespace, encoding, or field order.
  2. Verify the webhook signature against those raw bytes. If the signature does not match, stop there and return an auth error.
  3. Check the timestamp in the signed data. If the request is too old, reject it. That blocks replay attempts and catches deliveries that sat in a queue too long.
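  4. Save the event_id in a dedupe store before you trigger side effects. Take it from a signed header, or read it from the body once the signature and timestamp checks have passed. If another copy of the same event shows up, you can detect it early and avoid a second charge, email, or status change.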
  5. Only after those checks pass should you parse the JSON, validate the fields you need, and hand the work to your app or queue.

Status codes matter more than many teams expect. Return a 2xx code only when you have accepted the event in a durable way. If the sender signed the request wrong, use 401 or 403. If the payload is malformed or missing required fields, use 400. If your database or queue is down, return 500 or 503 so the sender knows a retry makes sense.

One detail saves a lot of pain: acknowledge fast, then do heavy work in the background. Store the event, mark its ID as seen, return 202 or 204, and let a worker update orders or send emails after that.
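The steps and status codes above can be sketched as one small function. This is a minimal sketch under stated assumptions, not a definitive implementation: receive, signature_ok, is_fresh, and save_event are hypothetical names, the required field list is an assumption, and returning 400 for a stale timestamp is a choice this article does not prescribe. Because the sketch reads event_id from the body, it parses right after the signature and timestamp checks and before the dedupe write.

```python
import json
from typing import Callable

# Assumed required fields; a real contract would list its own.
REQUIRED_FIELDS = ("event_id", "type", "occurred_at")


def receive(
    raw_body: bytes,
    signature_ok: Callable[[bytes], bool],
    is_fresh: Callable[[], bool],
    save_event: Callable[[str, dict], bool],
) -> int:
    """Return the HTTP status code for one webhook delivery."""
    # Verify the signature against the raw bytes, before any parsing.
    if not signature_ok(raw_body):
        return 401
    # Reject stale requests (400 is this sketch's choice).
    if not is_fresh():
        return 400
    try:
        event = json.loads(raw_body)
    except ValueError:  # malformed JSON or invalid UTF-8
        return 400
    if not isinstance(event, dict):
        return 400
    if any(field not in event for field in REQUIRED_FIELDS):
        return 400
    try:
        # save_event stores the event durably and returns False for a
        # duplicate event_id; both outcomes are acknowledged as success.
        save_event(event["event_id"], event)
    except Exception:
        return 503  # storage or queue is down, so a retry makes sense
    return 202  # accepted; a background worker does the heavy work
```

The three callables keep the sketch independent of any web framework. In a real handler they would wrap the signature check, the timestamp window, and the durable insert.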

This flow is intentionally plain. Plain receivers survive duplicates, delays, and bad payloads better than clever ones.

Design events that stay clear over time

Clear event design saves more support time than most retry logic. If a customer gets an event called order.updated, they still have to guess what changed. A better name points to one business action, such as order.created, order.paid, or invoice.sent.

Each event should carry a few fields that stay put:

  • event_id for the unique event ID
  • type for the event name
  • version for the schema version
  • occurred_at for when the business action happened

That gives the receiver enough context to store, sort, and debug events without reading every field in the payload.

Field types should stay predictable. If customer_id is a string in one event, keep it a string everywhere. If amount is an integer in cents, do not switch to a decimal later. Small type changes break parsers, usually at the worst possible time.
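A receiver can enforce the stable fields and their types in one place. The sketch below is illustrative only: Envelope and parse_envelope are hypothetical names, and it assumes version is an integer and occurred_at is an ISO 8601 string, neither of which this article specifies.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Envelope:
    event_id: str
    type: str
    version: int           # assumption: integer schema version
    occurred_at: datetime  # assumption: sent as an ISO 8601 string


def parse_envelope(event: dict) -> Envelope:
    """Validate the stable fields and raise ValueError with a plain reason."""
    for name in ("event_id", "type", "occurred_at"):
        if not isinstance(event.get(name), str):
            raise ValueError(f"{name} must be a string")
    version = event.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        raise ValueError("version must be an integer")
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    occurred_at = datetime.fromisoformat(
        event["occurred_at"].replace("Z", "+00:00")
    )
    return Envelope(event["event_id"], event["type"], version, occurred_at)
```

Fields outside the envelope are ignored here on purpose, so an added field does not break an older receiver.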

When the schema needs to grow, add fields. Do not rename old ones unless you also keep the old field during a long transition. Receivers often pin their code to the shape they first integrated with. Quiet renames turn a normal deploy into a support ticket.

Keep payloads focused on what the receiver needs to act. A payment webhook usually needs the payment ID, order ID, status, amount, currency, and timestamps. It rarely needs every internal note, UI label, or unrelated nested object.

This part is not flashy, but it matters. Clear event names and stable schemas make duplicate handling, retries, and replay protection much easier because the receiver can trust what each event means.

Sign requests the same way every time

A lot of webhook pain starts before your app reads a single field. If the sender signs one thing and the receiver verifies another, valid requests fail and bad requests slip through. Both sides need one simple rule: sign the exact raw body bytes that went over the wire.

Do not parse JSON first and then recreate it for verification. Parsers can reorder fields, trim spaces, change number formats, or normalize Unicode. The payload may look identical to a person and still produce a different signature. Read the raw request body, build the signed string, verify it, and only then parse JSON.

A timestamp should be part of that signed string. It gives you a clean way to reject old requests, and it makes replay protection much easier. A common pattern is to sign timestamp + "." + raw_body with HMAC-SHA256.

X-Webhook-Signature: t=1712812010,v1=4a8c...,kid=2025-01

That header format matters more than people expect. Pick one format and keep it simple. Document the header name, the timestamp field, the signature version, the hash algorithm, the encoding, and the exact string to sign. If the receiver has to guess whether v1 is hex or base64, the integration is already harder than it should be.
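Here is one way the sender and receiver sides of that header could look, using Python's standard hmac module. Treat it as a sketch: sign, parse_header, and verify are hypothetical names, and hex encoding for v1 is an assumption, since the example header above does not state its encoding.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, raw_body: bytes, kid: str) -> str:
    """Build the header value over timestamp + "." + raw_body."""
    signed = str(timestamp).encode() + b"." + raw_body
    digest = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest},kid={kid}"


def parse_header(header: str) -> dict:
    """Split "t=...,v1=...,kid=..." into a dict, skipping malformed parts."""
    parts = {}
    for item in header.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:
            parts[key] = value
    return parts


def verify(secret: bytes, header: str, raw_body: bytes) -> bool:
    """Recompute the signature over the exact raw bytes and compare."""
    parts = parse_header(header)
    if "t" not in parts or "v1" not in parts:
        return False
    signed = parts["t"].encode() + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), parts["v1"].encode())
```

Note that verify takes the raw bytes as received. Passing re-serialized JSON, even with only a whitespace difference, produces a different digest and fails the check.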

Secret rotation also needs a plan from day one. Give each secret a small identifier such as kid. During a short overlap period, receivers accept both the current secret and the previous one while the sender switches to signing with the new secret. Remove the old secret on a fixed date. The overlap keeps integrations working when the two sides deploy at different times and when retries are still signed with the old secret.
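The overlap period can be as small as a lookup table keyed by kid. This sketch assumes the header has already been split into kid, timestamp, and a hex signature; verify_with_rotation and the secret values are hypothetical.

```python
import hashlib
import hmac
from typing import Mapping


def verify_with_rotation(
    secrets_by_kid: Mapping[str, bytes],
    kid: str,
    timestamp: str,
    signature_hex: str,
    raw_body: bytes,
) -> bool:
    """Verify against the secret named by kid; unknown kids are rejected."""
    secret = secrets_by_kid.get(kid)
    if secret is None:
        return False  # unknown or already retired kid
    signed = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

During the overlap the mapping holds two entries. Retiring the old secret is then a one-line change: remove its kid from the mapping on the agreed date.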

If you run multiple environments, keep secrets separate for test and production. Reusing one secret everywhere is a shortcut that causes avoidable mistakes later.

Good webhook signatures are boring in the best way. They verify the same bytes every time, fail for old or changed payloads, and keep working during secret changes.

Make duplicate deliveries harmless

Duplicate delivery is normal. Networks time out, senders retry, and some providers send the same event more than once. The sender may cause the repeat, but if your receiver then sends two emails, creates two accounts, or triggers the same workflow twice, your app is the one that has to handle it.

Give every event a stable event_id and use that as your dedupe handle. Do not guess from an email address, timestamp, or payload hash. Those can match by accident, and some payload fields change between retries.

Store each processed event_id in durable storage before you do any side effects. A database table with a unique index on event_id is usually enough. Save when you first saw it, whether processing started or finished, and what result you returned.

Do not keep this only in memory or in a short-lived cache. A restart will wipe it, and the next retry will look new.

When the insert succeeds, process the event. When the insert fails because the ID already exists, look up the saved status. If you already finished the work, return 2xx again and stop. Repeats should be uneventful.
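The insert-then-process pattern can be sketched with SQLite from the Python standard library. The table name, column names, and helper functions are hypothetical, and the in-memory default path is only for trying the code; durable use needs a file path or a server database.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the dedupe table; PRIMARY KEY gives the unique index."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " first_seen_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return db


def claim(db: sqlite3.Connection, event_id: str) -> bool:
    """Insert the ID; return False when it already exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO processed_events (event_id, status)"
                " VALUES (?, 'started')",
                (event_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def handle(db: sqlite3.Connection, event_id: str,
           do_work: Callable[[], None]) -> str:
    """Run do_work once per event_id and report what happened."""
    if not claim(db, event_id):
        row = db.execute(
            "SELECT status FROM processed_events WHERE event_id = ?",
            (event_id,),
        ).fetchone()
        return f"duplicate:{row[0]}"
    do_work()  # if this raises, the row stays 'started' for later review
    with db:
        db.execute(
            "UPDATE processed_events SET status = 'finished'"
            " WHERE event_id = ?",
            (event_id,),
        )
    return "processed"
```

A repeat that finds status 'finished' can be acknowledged with a 2xx right away. What to do with a repeat that finds 'started' is a policy decision this sketch leaves to the caller.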

Put every outside action behind the same idempotent check. That includes emails, billing calls, CRM updates, and message queues. If your API layer blocks duplicates but a background job ignores the same event_id, you can still create duplicate side effects.

The rule is simple: record the ID, do the work once, and treat every repeat as the same event, not a new request.

Retry without creating a mess

Retries only help when the next attempt has a real chance to work. If the receiver timed out, returned a 503, or briefly hit a rate limit, send the event again. If the payload is wrong, the signature is bad, or a required field is missing, stop and mark it as a permanent failure.

That split keeps webhook reliability sane. Blind retries turn small bugs into long support threads and bury the real error under pages of duplicate log entries.

A simple rule works well. Retry on timeouts, dropped connections, 429 responses, and 5xx responses. Do not retry schema validation failures, bad signatures, unknown event types, or missing required fields. Treat auth errors carefully too. Most of them need a human fix, not ten more attempts.
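On the sender side, that rule fits in a few lines. In this sketch, should_retry is a hypothetical name and a status of None stands for a timeout or dropped connection, meaning no response arrived at all.

```python
from typing import Optional


def should_retry(status: Optional[int]) -> bool:
    """Decide whether a failed delivery is worth another attempt."""
    if status is None:
        return True  # timeout or dropped connection: no response at all
    if status == 429:
        return True  # receiver is rate limiting; try again later
    # 5xx means the receiver had a temporary problem. Everything else,
    # including 400, 401, and 403, is treated as permanent.
    return 500 <= status <= 599
```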

Space each retry farther apart than the last one. A short first delay is fine, but the later waits should grow fast enough to give the other side time to recover. For example, try again after 30 seconds, then 2 minutes, then 10 minutes, then 1 hour. Add a small random delay so you do not hit the same customer endpoint at the same second every time.
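The example schedule above, with a small random delay added, could look like the sketch below. The names and the 10 percent jitter cap are assumptions, not recommendations from this article.

```python
import random
from typing import Optional

# 30 seconds, 2 minutes, 10 minutes, 1 hour
BASE_DELAYS_SECONDS = (30, 120, 600, 3600)


def next_delay(attempt: int, jitter_fraction: float = 0.1) -> Optional[float]:
    """Delay before retry number `attempt` (0-based), or None when done."""
    if attempt < 0 or attempt >= len(BASE_DELAYS_SECONDS):
        return None  # schedule exhausted; mark the delivery as failed
    base = BASE_DELAYS_SECONDS[attempt]
    # Add up to jitter_fraction of the base delay so retries for many
    # events do not all land on the same second.
    return base + random.uniform(0, base * jitter_fraction)
```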

Set a retry window and stick to it. For many systems, 24 hours is enough. Some payment or order flows may justify longer, but endless retries are a bad habit. An event that arrives three days late can do more harm than good if the customer already fixed the issue by hand.

When the window closes, mark the delivery as failed, keep the reason, and make it easy to inspect or resend manually. That gives support teams something concrete to work with.

The plain approach usually wins: retry temporary failures, let the delays grow quickly, and stop when the error is clearly permanent.

Stop replays and stale events

A valid signature is not enough. If someone captures a real webhook and sends it again an hour later, the signature can still match unless you also check when the sender created the request.

Put a timestamp inside the signed payload or in a signed header. When the request arrives, verify the signature first, then compare that timestamp with your server clock. Most teams allow a small window, often 5 minutes, to cover network delay and minor clock drift.

If the request falls far outside that window, reject it. Do the same when the timestamp is oddly far in the future. Both cases usually mean a replay, a broken clock, or a proxy that changed the request in transit.
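The window check is a small comparison once the signature has passed. In this sketch, timestamp_rejection is a hypothetical name, timestamps are Unix seconds, and using the same 5-minute tolerance for future timestamps is an assumption.

```python
from typing import Optional

TOLERANCE_SECONDS = 300  # 5 minutes


def timestamp_rejection(
    signed_timestamp: int,
    now: int,
    tolerance: int = TOLERANCE_SECONDS,
) -> Optional[str]:
    """Return a plain rejection reason, or None when the request is fresh."""
    if signed_timestamp < now - tolerance:
        return "timestamp too old"
    if signed_timestamp > now + tolerance:
        return "timestamp too far ahead"
    return None
```

Returning the reason as text means the same string can go straight into the rejection log.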

A short-lived record of recent deliveries closes the gap that time checks do not catch. Store the event_id, or a hash of the signature and timestamp, for a short period and check that record before you process the event body. If you see the same value again, treat it as a duplicate instead of creating a second action. Expire old records automatically so the store stays small.

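You do not need a large database table for this. A fast cache with short retention often works better, because replay protection only needs recent history. This record supplements the durable dedupe store used for idempotent handling. It does not replace it.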
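A short-lived record like this can be sketched as a dictionary of expiry times. RecentDeliveries is a hypothetical in-process example; a receiver that runs on several servers would more likely use a shared cache with key expiry, which this sketch does not cover.

```python
from typing import Dict


class RecentDeliveries:
    """Remember keys for ttl_seconds, then forget them."""

    def __init__(self, ttl_seconds: float = 600.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._expiry_by_key: Dict[str, float] = {}

    def first_time(self, key: str, now: float) -> bool:
        """Record the key; return False if it was seen within the TTL."""
        # Drop expired records so the store stays small.
        expired = [k for k, expiry in self._expiry_by_key.items()
                   if expiry <= now]
        for k in expired:
            del self._expiry_by_key[k]
        if key in self._expiry_by_key:
            return False
        self._expiry_by_key[key] = now + self.ttl_seconds
        return True
```

The key can be the event_id or a hash of the signature and timestamp. Keeping the TTL longer than the timestamp window means a replayed request is caught by one check or the other.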

Logs matter more than most teams think. When you reject a request, write the reason in plain language: "timestamp too old", "timestamp too far ahead", or "event ID already seen". Support can then explain the rejection in minutes instead of digging through raw request dumps for an hour.

Normal retries will still happen. That is fine. Replay protection should block stale or suspicious requests, while idempotent handling should make legitimate duplicate deliveries harmless.

A simple order event example

A store sends an order.shipped webhook to a warehouse app after it buys a label. The payload includes event_id, order_id, label_id, occurred_at, and an optional note. Five seconds later, the same event arrives again because the sender did not get a fast enough 200 response. That is normal.

The warehouse app should verify the signature, parse the payload, and check whether it already handled that event_id. If it did, it should return success and do nothing else. The first delivery prints the label and saves a record like event_id=9f2... processed. The second delivery sees that record and stops. No second label, no double shipment.

A late order.cancelled event is harder. Maybe the customer canceled at 10:12, but the shipping event reached the warehouse first and the cancel event arrived at 10:20. The handler should not guess. It needs a business rule.

If the order is not packed yet, stop fulfillment and mark it canceled. If the label is already printed or the package left, do not cancel shipment automatically. Create a support task or refund review instead.
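Written as code, the rule stays small enough to review with the business side. The state names and return values below are hypothetical, and routing an order that is packed but has no label yet to manual review is an assumption, since the rule above does not cover that case.

```python
from enum import Enum


class Fulfillment(Enum):
    NOT_PACKED = "not_packed"
    PACKED = "packed"
    LABEL_PRINTED = "label_printed"
    SHIPPED = "shipped"


def on_order_cancelled(state: Fulfillment) -> str:
    """Decide what a late order.cancelled event should do."""
    if state is Fulfillment.NOT_PACKED:
        # Nothing physical has happened yet: stop and mark it canceled.
        return "cancel_order"
    # Label printed or package gone: never cancel automatically.
    # PACKED is sent to review too (an assumption of this sketch).
    return "create_support_task"
```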

That rule matters more than clever code. Without it, two people will handle the same case in two different ways.

Optional fields should stay optional. If the note field is missing, the handler should still process the shipment. It can save an empty note and move on. Only reject the event when a field is required for a safe decision, such as order_id, event_id, or occurred_at.
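A handler can make the required and optional split explicit. In this sketch, read_shipment is a hypothetical name, and it treats an empty string the same as a missing value for required fields, which is an assumption.

```python
from typing import Optional, Tuple

REQUIRED = ("event_id", "order_id", "occurred_at")


def read_shipment(event: dict) -> Tuple[Optional[dict], Optional[str]]:
    """Return (shipment, None) on success or (None, reason) on rejection."""
    for name in REQUIRED:
        if event.get(name) in (None, ""):
            return None, f"Rejected: missing {name}"
    shipment = {name: event[name] for name in REQUIRED}
    # note is optional: a missing or null note becomes an empty string.
    shipment["note"] = event.get("note") or ""
    return shipment, None
```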

This is where schema design pays off. Keep the event type stable, make the event ID unique, include a clear timestamp, and mark only a few fields as required. That keeps webhook reliability high even when deliveries arrive twice, show up late, or miss a non-essential field.

Mistakes that waste hours

Most webhook bugs start as small shortcuts. They turn into long support threads later, usually when a customer says, "your signature check fails only in production."

One common mistake is verifying a changed request body instead of the raw bytes you actually received. If your framework trims whitespace, reorders JSON fields, or normalizes line endings before the signature check, you will reject valid events. Verify the exact raw body first, then parse it.

Another easy way to lose time is mixing test secrets with live traffic. A staging endpoint that sometimes gets production deliveries will fail in ways that look random. Keep separate endpoints and separate secrets, and make your logs show which environment handled the request.

Status codes cause more damage than people expect. If a payload is malformed or is missing a required field, do not return 500 unless you want endless retries for the same broken event. Use a 4xx response for permanent problems, record the reason, and stop the loop.

Schema changes also break customer code quietly. Renaming customer_id to user_id without a new version can take down an integration that worked fine yesterday. If you need to change field names, add a version, keep old fields for a while, and give customers time to update.

Logs that leave out identifiers make debugging harder instead of easier. If you hide the original event ID, nobody can trace duplicates, retries, or support complaints across systems. Put the event ID, delivery ID, signature result, and environment in every log entry tied to that request.
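One way to keep those fields consistent is to build every log entry through a single helper. The function name and key names below are hypothetical, and JSON lines are just one possible log format.

```python
import json
from typing import Optional


def delivery_log_line(
    event_id: str,
    delivery_id: str,
    environment: str,
    signature_ok: bool,
    status_code: int,
    reason: Optional[str] = None,
) -> str:
    """Build one JSON log line that carries every tracing field."""
    entry = {
        "event_id": event_id,
        "delivery_id": delivery_id,
        "environment": environment,
        "signature_ok": signature_ok,
        "status_code": status_code,
        "reason": reason,  # plain language, e.g. "timestamp too old"
    }
    return json.dumps(entry, sort_keys=True)
```

Searching the logs for one event_id then returns every delivery attempt for that event, together with the environment that handled it.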

These are small details, but they decide whether an integration feels calm or chaotic.

Quick checks before launch

A webhook can look fine in a test tool and still fail in its first week under real traffic. Before you ship, run one event all the way through your system and make sure a human can follow it without guessing.

Pick a single event ID and trace it from the sender log to the receiver log. You should see when you created it, when you sent it, how many times you retried it, and what the receiver did with it. If your team cannot answer "what happened to event 8f3..." in a minute or two, support will struggle later.

Use one real test event and check a few things. Send the same event twice and confirm the receiver applies it safely once. Return a permanent error on purpose, such as an invalid account ID, and confirm the sender stops retrying. Add a new optional field to the payload and confirm an older client still works. Reject one request with a bad signature or broken JSON and confirm support can see the exact reason. Make sure logs show both the event ID and the delivery attempt ID.

The duplicate test matters more than many teams think. Retries, timeouts, and manual replays happen in normal systems. If the receiver creates two orders, two emails, or two refunds, the bug gets expensive fast.

The versioning test matters too. Optional fields should be easy to ignore. If one extra field breaks an older client, the schema is too fragile.

Finally, look at the error message your own support team will see. "400 bad request" is not enough. "Rejected: timestamp too old" or "Rejected: missing customer_id" saves real time, especially when a customer insists they sent the right payload.

Next steps after the first working version

A webhook that works once is only a draft. Real reliability shows up when another team sends the same event twice, sends it late, or sends JSON that breaks your parser at 2 a.m.

Write a short contract for every event type you accept. Keep it plain and specific. Name the event, list required fields, show one real example payload, define the event ID, and state how you sign requests and how long a timestamp stays valid. Say what your system does when a field is missing or the payload is too old.

A short contract saves more time than a long spec. Most teams do not need pages of theory. They need one clear reference they can match against logs and tests.

Test the ugly cases on purpose

Do not stop at the happy path. Break the receiver in the same ways customers will break it later. Send the exact same event three times and confirm you store one result. Delay an event so it arrives after a newer state change. Remove a required field or change a field type. Corrupt the body so signature checks fail. Replay an old signed request outside your allowed time window.

Run those tests in staging, then keep them in CI if you can. A simple replay drill is worth doing before you onboard anyone. Capture one signed request, resend it inside the allowed window, resend it again after the window closes, and confirm your logs make the outcome obvious.

If your team needs ten minutes to explain why an event was accepted, ignored, or rejected, customers will struggle too. Tighten the logs, error messages, and event contract until the answer is clear.

If you want a second opinion on the flow, Oleg Sotnikov at oleg.is works with startups and smaller teams as a fractional CTO and advisor. A short review of signatures, retries, and event handling can catch weak spots before they turn into customer support threads.

Frequently Asked Questions

Why do webhooks show up twice?

Because the sender often retries when it does not see a clean 2xx response fast enough. Your app may have finished the first request, but a timeout or dropped connection can still make the sender try again.

What should my endpoint do before it parses JSON?

Read the raw body first and verify the signature on those exact bytes. After that, check the timestamp and event_id, then parse JSON and hand the work to your app or queue.

Should I reply before I finish the real work?

Yes, if you store the event in durable storage first. Return 202 or 204 quickly, then let a worker send emails, update orders, or call other APIs in the background.

How do I verify webhook signatures without random failures?

Sign and verify the exact raw request body, not re-serialized JSON. A simple pattern is timestamp + "." + raw_body with HMAC-SHA256, plus a short time window for freshness.

Which status codes should I return for webhook errors?

Use 400 for malformed JSON or missing required fields, and use 401 or 403 for auth or signature problems. Return 500, 503, or sometimes 429 only when a retry could actually work.

How do I stop duplicate emails, charges, or status changes?

Give every event a stable event_id and save it in durable storage before any side effect runs. If the same ID arrives again, return success and skip the second charge, email, or update.

Should I retry every failed delivery?

No. Retry timeouts, dropped connections, 429, and 5xx responses, but stop on bad signatures, broken payloads, unknown event types, or missing required fields.

How do I block replay attacks on webhooks?

Check a signed timestamp and reject requests that fall outside a small window, such as five minutes. Keep a short record of recent event_id values too, so an old but valid request does not trigger work again.

What makes a webhook schema stable over time?

Add fields instead of renaming or changing old ones in place. Keep event names specific, such as order.paid, and include stable fields like event_id, type, version, and occurred_at.

What should I log so webhook issues are easy to debug?

Log the event_id, delivery attempt ID, signature result, environment, response code, and the plain reason for any rejection. Clear messages like timestamp too old or missing customer_id save a lot of support time.