Jan 03, 2026·7 min read

Architecture review failure modes before naming services

Learn how to use architecture review failure modes to discuss loss, delay, duplication, and bad state first, so teams spot risk before drawing boxes.

Why teams get stuck when they start with services

Most architecture reviews go wrong in the first ten minutes. Someone draws an API, a queue, two databases, and a cache. Then the room starts arguing about tools, boundaries, and whether one box should become three.

It feels productive because the diagram grows fast. Usually, it is a trap.

The team starts naming parts before it agrees on what can fail and what that failure would do to a user or the business. A checkout flow shows this clearly. One person wants a payment service. Another wants an order service, an event bus, and a retry worker. Meanwhile, the harder question never gets asked: if payment succeeds but order creation is delayed, who notices, what does the customer see, and how do you fix it without charging twice?

When teams start with services, the design drifts toward personal preference. People defend the tools they know. They debate direct calls versus queues, but they do not name the actual harm: lost orders, late emails, duplicate shipments, or a record that says "paid" when nothing was shipped.

That leaves the dangerous paths out of the room. Clean diagrams rarely show quiet failures. A message arrives twice. A timeout hides a success. A retry updates one table and skips another. These are the paths that wake people up at 2 a.m., yet they often get almost no attention.

The meeting ends with a tidy picture and no hard decisions. The team still does not know which actions must be atomic, where idempotency matters, what can be eventually consistent, or which failures need an alert right away.

A better review starts with damage, not boxes. Ask what loss, delay, duplication, and bad state look like in one flow. Once the team agrees on those risks, the diagram usually gets smaller and clearer. Some boxes disappear. Others finally have a reason to exist.

The four failure modes to name first

Most review problems fit into four simple buckets.

Loss is the easiest to explain and one of the easiest to miss. A request, event, or update disappears, and nobody notices until something downstream looks wrong. A user clicks "Pay," but the order record never lands. Or an inventory update vanishes between systems.

Delay is different. The work does arrive, but too late to help. A five-second delay might be harmless for an email. The same delay can break a stock hold or a fraud check. Timing matters more than many teams admit.

Duplication sounds minor until money or side effects get involved. The same action runs twice. The system sends two emails, creates two shipments, or charges the same card again. If a team does not name that risk early, it often builds something that looks fine in a demo and causes real support pain later.

Bad state is the hardest one because nothing looks obviously broken. The screens load. The services report success. The data exists. But the story the data tells is wrong. An account shows "active" while billing failed. Stock looks available after two systems drift apart. Support, finance, and the customer all see different versions of reality.

These four labels keep the discussion grounded:

  • Loss: "What if it never arrives?"
  • Delay: "What if it arrives too late?"
  • Duplication: "What if it happens twice?"
  • Bad state: "What if every step succeeds, but the final picture is wrong?"

Once people use those words, the review improves fast. Instead of arguing about Kafka, gRPC, or cron jobs, they ask better questions about retries, time limits, idempotency, ownership, and repair paths.

A simple way to run the review

Keep the scope tight. Pick one user action, such as "customer places an order" or "user submits a refund request," and ignore the rest of the platform for now.

Then write the normal path in one sentence. Make it plain and boring: "The user submits the order, the system charges the card, saves the order, and sends a confirmation." If the team cannot agree on that sentence in two minutes, the design is already fuzzy.

A good review follows a simple order. Choose one user action. Write the happy path in one sentence. Ask how loss, delay, duplication, and bad state could happen. Write the user impact for each case. Only then choose controls.

That order matters. If someone jumps straight to "use Kafka" or "add retries," pause the room and bring it back to the failure. Ask what exactly gets lost, what arrives late, what runs twice, or what leaves the system in the wrong state.

Write the impact in user terms before you discuss tools. A delayed confirmation email is annoying. A delayed stock reservation can sell an item you do not have. A duplicated payment is worse than both, because finance, support, and the customer all feel it at once.

This changes the tone of the meeting. People stop defending favorite tech and start talking about consequences. That shift leads to better decisions.

After the group agrees on the failure and the impact, pick the control that fits the problem. Duplication may need idempotency. Delay may need timeouts, retries, or a status screen. Bad state may need reconciliation or a manual review path. You do not need every safeguard. You need the ones that match the damage you are trying to avoid.

Questions that keep the discussion grounded

Good reviews stay concrete. The fastest way to keep them there is to ask plain questions before anyone argues about queues, services, or storage.

Start with user loss. If this step fails, what does the user actually lose? Money, time, trust, data, or a little convenience are not the same thing. A failed email send matters less than a lost payment record, and the room should say that out loud.

Then ask about delay. If this step stalls for 30 seconds, 10 minutes, or 2 hours, when does someone notice? Some work can wait for a nightly retry. Other work breaks the product in a way users feel right away. That time limit shapes the design more than a tool debate ever will.

Next comes duplication, and teams often skip it. If the step runs twice, do you get two charges, two shipments, two welcome emails, or no real harm? Many distributed system failures look ordinary at first. The problem is not always a crash. Often, it is a retry that quietly does the same thing again.

Bad state needs a clear owner too. Ask who sees the wrong state first. It might be the customer looking at a stale order page. It might be support when tickets spike. It might be finance when totals stop matching. Once you know who notices first, you can decide where checks belong and what needs to reconcile later.

Then ask about signals. What tells the team a problem started? An error rate, a queue that keeps growing, missing records, duplicate records, or a customer message are very different signals. If nobody can name one, the review is still too abstract.

A short exercise helps. Take one step in the flow and answer five prompts in a minute each: what the user loses if it breaks, how long it can wait, what happens if it runs twice, who sees the wrong state first, and what signal proves something went wrong.

If a team cannot answer those clearly, it is too early to name services. The group still does not agree on the risk.

A simple example with an order flow

A customer buys the last pair of running shoes in your store. Payment goes through, stock should drop by one, the order should appear in the system, and a confirmation should reach the customer. It sounds routine, which is exactly why it works well in a review.

Start with loss. If the payment provider says "paid" but your app never creates the order, the customer sees a charge while your team has nothing to ship and no clear record to fix.
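One way to catch this kind of loss is a periodic check that compares what the payment provider reports against the orders your app created. The sketch below is a minimal illustration, not a production job: the `Payment` and `Order` records, their field names, and `find_orphan_payments` are all hypothetical, and real data would come from the provider and your database rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    payment_id: str
    status: str  # hypothetical values: "paid" or "failed"


@dataclass(frozen=True)
class Order:
    order_id: str
    payment_id: str


def find_orphan_payments(payments: list[Payment], orders: list[Order]) -> list[str]:
    """Return IDs of paid payments that have no matching order record."""
    order_payment_ids = {order.payment_id for order in orders}
    return [
        p.payment_id
        for p in payments
        if p.status == "paid" and p.payment_id not in order_payment_ids
    ]


# Illustrative data: pay_2 was charged, but no order was ever created for it.
payments = [
    Payment("pay_1", "paid"),
    Payment("pay_2", "paid"),
    Payment("pay_3", "failed"),
]
orders = [Order("ord_1", "pay_1")]

orphans = find_orphan_payments(payments, orders)
print(orphans)
```

Any ID this check returns is a customer who was charged with nothing to ship, which is the kind of signal worth an alert.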

Delay hurts in a quieter way. Maybe the order exists, but stock updates 10 minutes later because a queue backs up or a retry loop moves too slowly. The site still shows the item as available, someone else buys it, and support gets dragged into the mess when one of those orders has to be canceled.

Duplication gets expensive fast. A customer taps "Pay" twice after a slow spinner, or your system retries a request without checking whether it already finished. Now the card charges twice, two order records appear, or the warehouse ships the same item twice. People like to call this an edge case. It is common enough to plan for.
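A common control for this is an idempotency key: the client sends the same key for the same logical action, and the server returns the earlier result instead of doing the work again. The sketch below assumes a hypothetical `PaymentProcessor` with an in-memory dictionary. A real system would need a durable store and an atomic check, for example a unique constraint, because this check-then-write is not safe across processes.

```python
class PaymentProcessor:
    """Toy in-memory processor, for illustration only."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}  # idempotency key -> charge ID
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A key we have already seen means this is a retry or a double tap:
        # return the earlier result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        charge_id = f"ch_{self.charges_made}"
        self._results[idempotency_key] = charge_id
        return charge_id


processor = PaymentProcessor()
first = processor.charge("checkout-42", 8999)
second = processor.charge("checkout-42", 8999)  # customer taps "Pay" again
print(first, second, processor.charges_made)
```

Both calls return the same charge ID, and the card is charged once. The design choice to review is where the key comes from: it has to identify the customer's intent, not the individual request.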

Bad state is messier. Checkout says the order is paid, but fulfillment says unpaid. Finance sees one status, the warehouse sees another, and the customer already got a confirmation email. Nothing is fully lost, yet nobody agrees on what is true.
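A small reconciliation check can surface this kind of disagreement before the customer or finance does. This is a sketch under stated assumptions: the two dictionaries stand in for whatever checkout and fulfillment actually store, and `find_status_mismatches` is a hypothetical helper, not part of any library.

```python
def find_status_mismatches(
    checkout: dict[str, str], fulfillment: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Return (order_id, checkout_status, fulfillment_status) where the two differ."""
    mismatches = []
    # Walk every order known to either system, so one-sided records show up too.
    for order_id in sorted(checkout.keys() | fulfillment.keys()):
        checkout_status = checkout.get(order_id, "missing")
        fulfillment_status = fulfillment.get(order_id, "missing")
        if checkout_status != fulfillment_status:
            mismatches.append((order_id, checkout_status, fulfillment_status))
    return mismatches


# Illustrative data: ord_2 disagrees, and ord_3 exists only in fulfillment.
checkout = {"ord_1": "paid", "ord_2": "paid"}
fulfillment = {"ord_1": "paid", "ord_2": "unpaid", "ord_3": "paid"}

mismatches = find_status_mismatches(checkout, fulfillment)
print(mismatches)
```

The check only detects drift. The team still has to decide which system owns the final answer for "paid" and what the repair step is.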

This is where the review gets sharper. Instead of arguing about service names, the team can ask direct questions. Which step needs an idempotency token? When should stock change? Which system owns the final answer for "paid"? What should the customer see if payment clears but order creation fails?

If the team cannot explain how it detects loss, limits delay, blocks duplication, and repairs bad state, the design is still too early.

Mistakes that waste the review

The first mistake is starting with tools. Someone says queue, cache, webhook, worker, and read replica before anyone names the actual failure. It sounds concrete, but it hides disagreement. One person worries about lost data, another worries about slow updates, and a third worries about duplicate actions. They can talk for half an hour and still mean different problems.

Another common mistake is mixing delay with loss. They are not the same. If an order confirmation arrives 20 minutes late, you might need retry rules, timeouts, or a clearer status screen. If the confirmation never arrives, you need a different fix. If it arrives twice, you need idempotency or deduplication. When people dump all of that under the label "reliability," the discussion gets fuzzy and the design stays vague.

Vague labels make things worse. Phrases like "consistency issue" or "edge case" sound smart, but they do not tell the team what can break. A better statement is blunt: "The customer paid, but the order never appeared." Or: "The shipment was created twice." Or: "Support cannot tell whether the retry already ran."

Teams also stop at detection and skip recovery. That gets expensive. A system can notice a bad state and still leave staff with no safe way to fix it. Manual work matters here.

Ask a few direct questions. Who sees the problem first: the customer, support, or the system? What can a human do in five minutes to reduce the damage? What record proves the correct state? Who writes the retry, backfill, or cleanup step? Who owns the follow-up after release?

The last mistake is simple: nobody leaves with ownership. A review without owners is just a conversation with diagrams. Give each failure mode one person, one next action, and one date. If no one owns the retry path, the alert, or the support playbook, the same issue will show up again at the next meeting.

How to turn failure modes into design choices

A review becomes useful when each failure mode leads to one clear design response. Do not stop at "this might fail." Decide what the system will do when it fails, how fast it must react, and who gets notified.

A simple default works well:

  • Loss: store the intent before work starts, or keep a durable record that the action happened.
  • Delay: set a time limit, decide what happens after it expires, and alert the team if the delay changes the result.
  • Duplication: add duplicate checks so retries and repeated clicks do not create the same action twice.
  • Bad state: define a repair step for partial failure, such as retrying, rolling back, or marking the item for review.
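The loss default, storing the intent before work starts, can be sketched as follows. This is a minimal illustration with a hypothetical `IntentLog` held in memory; in practice the log would have to live in durable storage, or it would vanish along with the work it is meant to protect.

```python
from enum import Enum
from typing import Callable


class IntentStatus(Enum):
    PENDING = "pending"
    DONE = "done"


class IntentLog:
    """Toy in-memory intent log, for illustration only."""

    def __init__(self) -> None:
        self._intents: dict[str, IntentStatus] = {}

    def record(self, intent_id: str) -> None:
        # setdefault keeps an existing status, so a retry cannot reset DONE.
        self._intents.setdefault(intent_id, IntentStatus.PENDING)

    def mark_done(self, intent_id: str) -> None:
        self._intents[intent_id] = IntentStatus.DONE

    def unfinished(self) -> list[str]:
        return [i for i, s in self._intents.items() if s is IntentStatus.PENDING]


def place_order(log: IntentLog, intent_id: str, create_order: Callable[[str], None]) -> None:
    log.record(intent_id)    # write the intent first
    create_order(intent_id)  # the real work; may raise
    log.mark_done(intent_id)


def failing_create(intent_id: str) -> None:
    raise RuntimeError("order store unavailable")


log = IntentLog()
try:
    place_order(log, "order-1", failing_create)
except RuntimeError:
    pass  # the work failed, but the intent is still on record
place_order(log, "order-2", lambda _intent_id: None)

print(log.unfinished())
```

Because the intent was written before the work ran, the failed order is still visible and can be retried or reviewed instead of disappearing.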

Keep the response simple. If a team names three controls for one small risk, it usually means the group does not trust any of them.

Delay needs extra care because slow systems often look fine until money, stock, or customer trust starts slipping. Put numbers on it. If a task is harmless at 10 seconds but harmful at 2 minutes, write that down. Then put the timeout and the alert at that threshold, not somewhere later.
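Writing the threshold down can be as literal as a named constant that the check uses. The sketch below takes the 2-minute figure from the example above. The task names, the `overdue_tasks` helper, and the idea of a polling check are assumptions for illustration.

```python
from datetime import datetime, timedelta

# From the example: harmless at 10 seconds, harmful at 2 minutes.
HARMFUL_AFTER = timedelta(minutes=2)


def is_overdue(started_at: datetime, now: datetime, limit: timedelta = HARMFUL_AFTER) -> bool:
    """True once a task has been pending for at least the harmful threshold."""
    return now - started_at >= limit


def overdue_tasks(tasks: dict[str, datetime], now: datetime) -> list[str]:
    """Return IDs of pending tasks that should trigger a timeout and an alert."""
    return sorted(task_id for task_id, started in tasks.items() if is_overdue(started, now))


# Fixed timestamps keep the example deterministic.
now = datetime(2026, 1, 3, 12, 0, 0)
pending = {
    "hold-1": now - timedelta(seconds=10),  # still harmless
    "hold-2": now - timedelta(minutes=2),   # exactly at the threshold
    "hold-3": now - timedelta(minutes=5),   # well past it
}

late = overdue_tasks(pending, now)
print(late)
```

The comparison uses `>=`, so the alert fires at the threshold itself rather than somewhere after it.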

Partial failure needs the clearest rule of all. Write one sentence that says what the system does if step one works and step two fails. For example: keep the first result, mark the record as incomplete, and retry step two for 15 minutes before sending it to manual review. That is far better than "handle errors gracefully."
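That one-sentence rule translates almost directly into code. The sketch below is one possible reading of it: the `OrderRecord` fields and the action names returned by `decide` are hypothetical, and the 15-minute window comes from the example sentence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RETRY_WINDOW = timedelta(minutes=15)  # from the example rule


@dataclass
class OrderRecord:
    order_id: str
    step_one_done: bool
    step_two_done: bool
    step_two_first_failed_at: Optional[datetime] = None


def decide(record: OrderRecord, now: datetime) -> str:
    """Return the next action for a record, following the partial-failure rule."""
    if not record.step_one_done:
        return "not_started"
    if record.step_two_done:
        return "complete"
    # Step one worked and step two did not: keep step one's result,
    # treat the record as incomplete, and retry within the window.
    failed_at = record.step_two_first_failed_at
    if failed_at is None or now - failed_at < RETRY_WINDOW:
        return "retry_step_two"
    return "manual_review"


now = datetime(2026, 1, 3, 12, 0, 0)
recent = OrderRecord("ord_1", True, False, now - timedelta(minutes=5))
stale = OrderRecord("ord_2", True, False, now - timedelta(minutes=20))
done = OrderRecord("ord_3", True, True)

print(decide(recent, now), decide(stale, now), decide(done, now))
```

Every branch returns a named action, so there is no path where the system quietly does nothing.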

Keep the output reusable

The notes should stay short enough to paste into the next review without cleanup. A compact format works well: failure mode, trigger, response, owner, and alert.
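If the team keeps notes next to the code, the compact format could be a small record type. This is only a sketch: the `FailureNote` class, the `one_line` method, and the example values are illustrative, not a prescribed template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureNote:
    failure_mode: str  # loss, delay, duplication, or bad state
    trigger: str
    response: str
    owner: str
    alert: str

    def one_line(self) -> str:
        """Render the note as a single line that pastes into the next review."""
        return " | ".join(
            [self.failure_mode, self.trigger, self.response, self.owner, self.alert]
        )


note = FailureNote(
    failure_mode="duplication",
    trigger="payment callback retried",
    response="idempotency key on callback",
    owner="Sam",
    alert="duplicate charge count > 0",
)
print(note.one_line())
```

All five fields are required, so a note with no owner or no alert cannot be created.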

If the note does not fit on a few lines, it is probably still too vague or too complex. Short notes also make tradeoffs easier to revisit later when the system changes.

A quick checklist before the meeting ends

A good review should end with plain answers, not a cloud diagram that everyone interprets differently. If people still argue about service names but cannot say what failure hurts the business most, stop there and tighten the discussion.

Before the meeting ends, make sure the group can state the biggest loss risk in one sentence. Put a real time limit on harmful delays. Decide how the system blocks double processing. Check for bad state, not only missing state. Leave with owners and next actions.

One small habit helps a lot: read those answers out loud. Weak spots become obvious fast. "We will handle duplicates somehow" is not a decision. "Sam will add idempotency on payment callbacks by Friday" is.

This part of the review should feel a little strict. That is a good sign. A team that can name loss, delay, duplication, and bad state in plain language usually builds cleaner systems and wastes less time in follow-up meetings.

What to do next with your team

Pick one flow that already hurts users or creates manual cleanup. Good candidates are checkout, signup, password reset, invoice generation, or anything that leads to refunds, duplicate records, or support tickets. If the pain is current, people stay focused because they can picture the damage.

Keep the first meeting short. Thirty minutes is enough. Use one whiteboard or one shared canvas and walk through the flow in plain steps. Write what happens first, what changes data, what sends messages, and what users see.

Then ask the hard questions before anyone names services. Where can work get lost? Where can it show up late? Where can it happen twice? Where can the system end up in the wrong state? That is the part many teams skip, and it is why reviews often turn into diagram debates instead of design choices.

Wait until the end to draw the service diagram. Teams waste time when they jump straight to boxes, arrows, queues, and repo boundaries. Once the failure modes are on the board, the diagram gets smaller and more useful. You are no longer guessing what the design should do. You are asking how it prevents a specific mess.

A practical first session is simple: choose one user flow with recent pain, write the flow in plain language, mark loss, delay, duplication, and bad state at each step, sketch services only after the risks are visible, and leave with one owner and one next test for the riskiest point.

If your team keeps circling the same abstract arguments, an outside review can help. Oleg Sotnikov at oleg.is works with startups and smaller teams as a Fractional CTO, helping them turn vague architecture talk into concrete decisions about failure handling, systems, and delivery.

After the meeting, test one fix on the same flow within a week. Even a small change, like an idempotency check or a timeout rule, will teach you more than another long review.

Frequently Asked Questions

Why should a review start with failure modes instead of services?

Because service boxes hide the real risk. If you start with loss, delay, duplication, and bad state, the team talks about user harm first and picks controls for a real problem instead of defending favorite tools.

What are the four failure modes in a simple way?

Loss means something never arrives or never gets saved. Delay means it arrives too late. Duplication means the same action runs twice. Bad state means every step seems fine, but the final data tells the wrong story.

How do I run a short architecture review without it turning into a tool argument?

Pick one user action and write the normal path in one sentence. Then ask how that flow could get lost, arrive late, run twice, or end in the wrong state. After the team agrees on the damage, choose the control, owner, and alert.

What is the difference between loss and delay?

Loss means the system drops the work, and without a durable record no one can finish it later. Delay means the work still arrives, but the timing already caused harm. That difference changes the fix: lost work needs a durable record or recovery step, while slow work needs time limits, retries, or a better user status.

Why does duplication cause so many problems?

Retries, double clicks, and timeouts cause it all the time. One duplicate can charge a card twice, create two shipments, or flood support with cleanup work. Teams should treat duplicate handling as a normal part of the design, not a rare corner case.

What does bad state look like in a real product?

Bad state shows up when systems disagree about what is true. A checkout page may say paid while fulfillment shows unpaid, or finance sees totals that do not match orders. You need one source of truth and a repair step when parts of the flow drift apart.

When should we talk about queues, services, or storage choices?

Wait until the team can name the failure, the user impact, and the time limit. Once those answers are on the board, tools become easier to judge. A queue, direct call, or cron job only makes sense when you know what damage you are trying to stop.

What should the team write down before the meeting ends?

Leave with plain decisions, not just a diagram. Write the biggest risk in one sentence, the control for it, who owns the next action, and when the team will test it. If no one owns the retry path, alert, or cleanup step, the same issue will return.

What is a good first flow to review with the team?

Choose a flow that already hurts users or creates cleanup work. Checkout, refunds, signup, password reset, and invoice generation work well because people can picture the damage fast. Keep the first session small so the team stays focused on one story.

When does it make sense to ask an outside expert to review the architecture?

Bring in outside help when the team keeps circling the same abstract arguments or ships fixes that do not solve the real failure. An experienced Fractional CTO can push the room toward concrete decisions, ownership, and practical safeguards for the flow that hurts most.