Mar 30, 2025·8 min read

Model router failure codes that explain every handoff

Model router failure codes help support and product teams see why requests moved, spot weak prompts, and fix routing rules that cause repeat failures.

Why hidden handoffs create messy support work

A router sends a request to one model, then quietly switches to another. The user only sees the result: a slower reply, a different tone, or an answer that misses the point. Support gets the complaint, but not the cause. Product sees quality drop, but not the trigger.

A silent fallback is better than a full failure. It keeps the app moving. But it doesn't explain what went wrong. Did the first model time out? Did the request exceed the context limit? Did a provider rate limit kick in? Did the router switch because the task needed a tool the first model couldn't use? A log that says fallback_used tells you almost nothing.

That gap creates messy support work fast. An agent may tag the ticket as 'bad answer' when the real issue was a provider outage. A product manager may think the prompt needs work when the router actually switched because the request was too large. Two teams look at the same case and head in different directions.

Vague reasons make this worse. Logs full of labels like error, retry, or other sound harmless when you build the system, but they get expensive later. They hide patterns, mix unrelated problems together, and slow down triage. If ten handoffs share the same label, nobody knows whether to fix prompts, budgets, limits, tooling, or vendor settings.

Router failure codes need to be clear enough for humans, not just machines. A support agent should be able to read the handoff reason and explain it in one sentence. A product person should be able to group those reasons and spot what keeps breaking. Good logs turn routing handoffs from guesswork into something teams can act on.

The goal is simple. Every handoff should answer one plain question: why did the router switch this request to another model? When that answer is clear, support responds faster, product fixes the real issue, and the same ticket is less likely to come back next week.

When a handoff needs a failure code

A handoff needs a failure code any time the first path could not finish the job and the router had to change course. That includes switching to another model, calling a fallback flow, asking for a tool, or sending the case to a person. If nobody can tell why the change happened, the log won't help support or product fix anything.

Use a code when the handoff answers a real question. Why did the cheaper model give up? Why did the router skip the normal path? Why did the system stop instead of trying again? A good code turns a vague event into a specific reason a team can act on.

A few moments almost always deserve a code:

  • the router switched models after a failed attempt
  • the router sent the task to a tool and the tool did not return usable data
  • the system stopped because it ran out of time or budget
  • the answer failed checks and the router had to retry or escalate
  • a policy rule blocked the request even though the model could have answered

Keep user mistakes separate from router mistakes. If a user sends missing details, conflicting instructions, or a request your product doesn't support, put that in one group. If the router picked the wrong model, used a bad threshold, or sent the task down the wrong path, put that in another. Different teams own those problems.

Do the same for model limits and policy blocks. A model that can't fit the context, can't follow the output format, or can't solve the task is not the same as a hard rule that blocks medical advice, regulated content, or protected data. Both can end in a handoff, but they don't have the same fix.

Small code differences matter. timeout is not cost_cap_hit. tool_error is not bad_output. bad_output is not policy_block. In support work, those labels change who owns the problem. Infra teams fix slow calls. Product teams fix broken tool paths. Prompt or model owners fix invalid JSON, weak answers, and other output failures.

If one handoff has two problems, log the first reason that forced the switch. Keep that rule strict. Clean labels beat clever labels every time.
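One way to keep that rule strict is to let code pick the primary reason instead of a person. The sketch below is only an illustration: the Failure record, its at_ms and forced_switch fields, and the primary_code helper are assumptions, not part of any particular router.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Failure:
    code: str            # e.g. "timeout", "tool_error"
    at_ms: int           # hypothetical field: milliseconds since the request started
    forced_switch: bool  # hypothetical field: True if this failure made the router change course


def primary_code(failures: List[Failure]) -> Optional[str]:
    """Return the code of the earliest failure that forced the handoff."""
    forcing = [f for f in failures if f.forced_switch]
    if not forcing:
        return None
    return min(forcing, key=lambda f: f.at_ms).code


observed = [
    Failure("bad_output", at_ms=900, forced_switch=True),
    Failure("timeout", at_ms=400, forced_switch=True),
    Failure("tool_error", at_ms=100, forced_switch=False),  # recovered, no switch
]
print(primary_code(observed))  # timeout
```

The tool error happened first, but it did not force the switch, so it does not become the primary code. The timeout did, and it came before the bad output.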

The first set of codes to define

Start small. Most teams do better with five clear codes than with thirty labels that overlap. If you launch with too many, people stop trusting the data because the same handoff gets tagged three different ways.

A good starter set covers the most common reasons a request leaves its first path. Keep the names plain, keep them short, and make each one mean one thing.

  • timeout - the model or a dependency took too long. Action: retry once, then check latency and queue limits.
  • tool_error - the router called a tool and the tool failed. Action: inspect the tool logs and fix the integration.
  • policy_block - the request hit a safety or business rule. Action: review the blocked input and the rule that stopped it.
  • bad_format - the output arrived, but it broke the schema or missed required fields. Action: tighten the prompt, schema, or parser.
  • low_confidence - the router got an answer, but the score was too weak to trust. Action: send it to a stronger model, fallback flow, or person.

These work because each one points to a different owner. A timeout often belongs to infra. A tool error usually lands with the team that owns the integration. A policy block may belong to product, legal, or trust and safety. Bad format often means prompt or parser work. Low confidence usually means routing logic or evaluation needs attention.
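A closed list is easier to enforce than a naming convention. The Python sketch below shows one possible shape: the five starter codes as an enum, with a first owner and first action taken from the list above. The PLAYBOOK and triage names are made up for this example, and the owner strings are simplified guesses at who usually picks the code up.

```python
from enum import Enum


class FailureCode(str, Enum):
    TIMEOUT = "timeout"
    TOOL_ERROR = "tool_error"
    POLICY_BLOCK = "policy_block"
    BAD_FORMAT = "bad_format"
    LOW_CONFIDENCE = "low_confidence"


# Hypothetical mapping: code -> (usual first owner, first action to try).
PLAYBOOK = {
    FailureCode.TIMEOUT: ("infra", "retry once, then check latency and queue limits"),
    FailureCode.TOOL_ERROR: ("integration owner", "inspect the tool logs"),
    FailureCode.POLICY_BLOCK: ("product, legal, or trust and safety", "review the input and the rule"),
    FailureCode.BAD_FORMAT: ("prompt or parser owner", "tighten the prompt, schema, or parser"),
    FailureCode.LOW_CONFIDENCE: ("routing or evaluation owner", "send to a stronger model, fallback flow, or person"),
}


def triage(code: str) -> str:
    """Turn a raw code string into 'code: owner: first action'."""
    parsed = FailureCode(code)  # raises ValueError for labels outside the list
    owner, action = PLAYBOOK[parsed]
    return f"{parsed.value}: {owner}: {action}"


print(triage("timeout"))
```

Because the enum is closed, a label like other or model_issue fails loudly with a ValueError instead of slipping into the logs.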

Avoid labels like failed, other, or model_issue. They save a few seconds when you name the event, then waste hours later because nobody knows what to fix. The code should answer one plain question: what should the team check first?

It also helps to separate cause from result. 'Handoff to larger model' is a result, not a failure code. low_confidence is the cause. That distinction matters when teams review hundreds of routing events.

If a new code doesn't change the next action, you probably don't need it yet.

How to add codes to your router step by step

Track one handoff event first, not every strange thing your router might do. Pick the moment that creates the most confusion for support or product, such as when the first model gives up and a second model takes over. If you try to label every edge case on day one, people stop trusting the data.

For each handoff, write down the single reason you want the system to report. Keep that reason plain. 'Context too long' is better than 'input processing constraint'. 'Safety policy match' is better than 'compliance escalation'. Good codes read like something a support lead can understand at a glance.

A simple pattern works well:

  1. Name the event in plain language, such as 'router switched from default model to fallback model'.
  2. Assign one code for the main cause, not a stack of causes.
  3. Add one short sentence that says what triggered it.
  4. Store that code and sentence in the same log record as the handoff.

That short sentence matters a lot. A code like ctx_limit helps with counting, but the sentence explains the line the router crossed, such as 'conversation reached token limit after the customer pasted a refund policy and three email threads'. Support can read that fast. Product can act on it.
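The four steps above can be sketched as a single function that refuses to write a handoff without both a code and a trigger sentence. This is a minimal Python example under assumptions: the handoff_log_line name and the JSON field names are invented here, and your logger may want a different format.

```python
import json
from datetime import datetime, timezone


def handoff_log_line(event: str, code: str, trigger: str) -> str:
    """Build one JSON log line that keeps the code and its trigger sentence together."""
    if not code.strip() or not trigger.strip():
        raise ValueError("a handoff needs one code and one trigger sentence")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,             # step 1: plain-language event name
        "code": code.strip(),       # step 2: one code for the main cause
        "trigger": trigger.strip(), # step 3: one short sentence
    }
    return json.dumps(record)       # step 4: code and sentence in the same record


line = handoff_log_line(
    event="router switched from default model to fallback model",
    code="ctx_limit",
    trigger="conversation reached token limit after the customer pasted "
            "a refund policy and three email threads",
)
print(line)
```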

Then test the draft codes on old router logs and closed support tickets. Look for two problems. First, some handoffs won't fit any code. Second, too many handoffs will fit several codes at once. Both mean the list is still fuzzy.
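A quick script can surface both problems before anyone argues about names. The sketch below assumes each draft code can be written as a simple rule over an old log entry. The field names (elapsed_ms, deadline_ms, tool_status, schema_valid) are hypothetical, so swap in whatever your logs actually contain.

```python
from typing import Callable, Dict, List

# Hypothetical matchers: one rule per draft code, applied to an old log entry.
DRAFT_CODES: Dict[str, Callable[[dict], bool]] = {
    "timeout": lambda e: e.get("elapsed_ms", 0) > e.get("deadline_ms", float("inf")),
    "tool_error": lambda e: e.get("tool_status") == "failed",
    "bad_format": lambda e: e.get("schema_valid") is False,
}


def audit(entries: List[dict]) -> Dict[str, List[dict]]:
    """Split old handoffs into those that fit no code and those that fit several."""
    report: Dict[str, List[dict]] = {"no_fit": [], "multi_fit": []}
    for entry in entries:
        fits = [code for code, matches in DRAFT_CODES.items() if matches(entry)]
        if not fits:
            report["no_fit"].append(entry)
        elif len(fits) > 1:
            report["multi_fit"].append(entry)
    return report


old_logs = [
    {"id": 1, "elapsed_ms": 9000, "deadline_ms": 5000},          # fits timeout only
    {"id": 2, "tool_status": "failed", "schema_valid": False},   # fits two codes
    {"id": 3, "note": "fallback_used"},                          # fits none
]
report = audit(old_logs)
print([e["id"] for e in report["no_fit"]])     # [3]
print([e["id"] for e in report["multi_fit"]])  # [2]
```

A long no_fit list suggests a missing code. A long multi_fit list suggests two codes that overlap.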

If two codes often describe the same event, cut one. If people argue about which code to use, rewrite both until the choice feels obvious. A small list that everyone uses the same way beats a long list nobody remembers.

One practical rule helps: each handoff should have one primary code, one short trigger sentence, and no debate after the fact. If a support manager and a product manager read the same ticket and pick different codes, the code set still needs work.

What to record with each handoff

A handoff record should answer two questions fast: why did the router switch models, and what did the user experience because of it? If your log only says 'sent to fallback model', support can't tell whether the problem came from the prompt, a tool failure, a timeout, or a bad model choice.

Useful records stay small, but not vague. In most teams, five pieces of context do the job:

  • the source model and the target model
  • the trigger for the switch, the failure code, the timestamp, and the request type
  • the prompt or template version, plus any tool, function, or retrieval step involved
  • the user impact, such as delay, retry, partial answer, hard error, or no visible issue
  • one short human note that support can read in seconds

That last field is easy to skip and expensive to lose. A note like 'tool schema mismatch after billing lookup' is much better than a generic code that forces someone to dig through raw logs. Keep it brief and factual. One sentence is enough.

Prompt version is another field teams skip too often. If version 18 starts causing handoffs and version 17 did not, you have a clear place to look. The same goes for tools. If most handoffs happen after one retrieval step or one API call, the model may not be the real problem.

Keep user impact separate from the technical cause. A timeout that triggers a fallback might be harmless if the user still gets a good answer one second later. The same timeout matters much more if the user waits eight seconds, sees an error, and retries.

If you collect only five things, collect what changed, why it changed, when it changed, what the user felt, and one plain-English note. That's enough for support and product to start making decisions.
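As a rough sketch, the record described in this section could look like the Python dataclass below. The class name, field names, model names, and the set of allowed user_impact values are assumptions for illustration. The only point is that impact and the human note are required, not optional extras.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

# Hypothetical impact values, taken from the examples in the list above.
USER_IMPACTS = {"delay", "retry", "partial_answer", "hard_error", "none"}


@dataclass
class HandoffRecord:
    source_model: str            # what changed
    target_model: str
    code: str                    # why it changed
    trigger: str
    timestamp: str               # when it changed
    request_type: str
    prompt_version: str
    user_impact: str             # what the user felt
    note: str                    # one plain-English sentence
    tool: Optional[str] = None   # tool, function, or retrieval step, if any

    def __post_init__(self) -> None:
        if self.user_impact not in USER_IMPACTS:
            raise ValueError(f"unknown user impact: {self.user_impact}")
        if not self.note.strip():
            raise ValueError("the human note is required")


record = HandoffRecord(
    source_model="small-model",  # placeholder model names
    target_model="large-model",
    code="tool_error",
    trigger="billing lookup response did not match the tool schema",
    timestamp=datetime.now(timezone.utc).isoformat(),
    request_type="refund",
    prompt_version="18",
    user_impact="delay",
    note="tool schema mismatch after billing lookup",
    tool="billing_lookup",
)
print(asdict(record)["note"])  # tool schema mismatch after billing lookup
```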

A simple example from a support workflow

A customer writes to support: 'I was charged twice and I need a refund.' The router sends the message to the cheapest model first. Because that model is also fast, it handles the first pass for most tickets.

Its job is narrow. It must detect the issue type and extract the fields the refund flow needs, such as account email, order ID, and charge date. In this case, it labels the ticket as a refund request, but returns only the email and a vague summary.

The router shouldn't just hand the case to a stronger model and move on. It should attach a code like missing_required_fields and include which fields were absent. That tells the team the cheaper model didn't fail in some vague way. It failed on a specific contract.
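That contract check can be very small. Here is a hedged Python sketch: the REQUIRED_REFUND_FIELDS tuple and the check_extraction helper are invented for this example, and the field names mirror the ones mentioned above.

```python
from typing import Optional

# Fields the refund flow needs, as described in the example above.
REQUIRED_REFUND_FIELDS = ("account_email", "order_id", "charge_date")


def check_extraction(extracted: dict) -> Optional[dict]:
    """Return a handoff reason if required fields are absent, else None."""
    missing = [name for name in REQUIRED_REFUND_FIELDS if not extracted.get(name)]
    if not missing:
        return None
    return {"code": "missing_required_fields", "missing": missing}


cheap_model_output = {
    "issue_type": "refund_request",
    "account_email": "customer@example.com",
    "summary": "customer says they were charged twice",
}
print(check_extraction(cheap_model_output))
# {'code': 'missing_required_fields', 'missing': ['order_id', 'charge_date']}
```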

A short event trail might look like this:

  • cheap model: classify refund ticket and extract fields
  • handoff code: missing_required_fields with order_id and charge_date
  • mid-tier model: retry extraction and start order lookup tool
  • handoff code: tool_timeout with tool name and wait time
  • premium model: ask one follow-up question and park the case safely

The second handoff matters even more. The mid-tier model does a better extraction job, finds the likely order, and calls the billing tool. The tool stalls for 12 seconds and times out. The router escalates again, this time with tool_timeout plus the tool name, request ID, and timeout length.

Without those codes, people often blame the fallback model because it was the last thing in the chain. That's a bad habit. The fallback only absorbed the mess.

With the codes in the logs, support sees that customers often omit order IDs in refund emails, so the intake form needs a clearer prompt. Product sees that the billing lookup endpoint slows down during peak hours, so the team fixes the service instead of tuning prompts for days. The handoffs stop looking random. They point to the real issue.

Mistakes that make the data useless

Bad routing data wastes time twice. Support reads the handoff log, shrugs, and guesses. Product sees a trend report, trusts it, and fixes the wrong thing.

The most common problem is lazy labeling. If half your handoffs end up under other, you don't have a taxonomy. You have a trash can.

A few mistakes cause most of the damage:

  • One catch-all code takes over. When other becomes the default, nobody learns whether the handoff came from bad retrieval, a timeout, a policy block, or a weak answer.
  • Cost and quality get mixed together. too_expensive and low_confidence are different events. One is a budget rule. The other is a quality rule.
  • Prompt bugs get hidden under model_failed. That's wrong. If your router passed a broken prompt, the model wasn't the first thing to fail.
  • Nobody can explain the code in plain words. If a support lead can't read a code and explain it to a new teammate in one sentence, the code is too vague or too technical.
  • Code meanings keep changing. If needs_escalation means one thing in April and another in June, your trend line is dead.

Plain language helps more than clever naming. 'Context window exceeded' is fine. 'Semantic degradation event' says almost nothing.

A simple test works well: give the code list to someone from support and someone from product. Ask both to sort ten real handoffs the same way. If they disagree often, your codes need cleanup before launch.
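If you want a number for that test, a few lines of Python are enough. This is a sketch with invented sample labels. It reports how often the two people agreed and which pairs of codes they confused, and it leaves the judgment about what counts as 'often' to you.

```python
from collections import Counter
from typing import List, Tuple


def agreement(labels_a: List[str], labels_b: List[str]) -> Tuple[float, Counter]:
    """Return the share of matching labels and a count of confused code pairs."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both people must label the same non-empty set of handoffs")
    pairs = list(zip(labels_a, labels_b))
    matches = sum(a == b for a, b in pairs)
    confused = Counter(tuple(sorted((a, b))) for a, b in pairs if a != b)
    return matches / len(pairs), confused


# Made-up labels for ten handoffs, one list per reviewer.
support = ["timeout", "tool_error", "bad_format", "low_confidence", "timeout",
           "policy_block", "tool_error", "bad_format", "low_confidence", "timeout"]
product = ["timeout", "tool_error", "low_confidence", "low_confidence", "timeout",
           "policy_block", "timeout", "bad_format", "bad_format", "timeout"]

rate, confused = agreement(support, product)
print(rate)                     # 0.7
print(confused.most_common(1))  # [(('bad_format', 'low_confidence'), 2)]
```

The confused pairs are the useful part. In this made-up sample, bad_format and low_confidence are the two codes whose definitions need another pass.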

Stable definitions matter as much as the codes themselves. Freeze the meaning, write one short example for each code, and only add a new one when the old set clearly can't describe a repeated issue. That keeps the data boring, which is exactly what you want.

Quick checks before you ship it

If your router failure codes are hard to read, people will stop using them. A support agent should understand a code in a few seconds and know what to do next. If they still need to ask engineering what fallback_reason_7 means, the code failed its job.

Each code should point to one clear action. 'Context too large' might mean trim the conversation and retry. 'Safety block from upstream model' might mean send the case to manual review. When one code can mean three different fixes, support will guess and the data will get noisy fast.

A short pre-launch check catches most of the mess:

  • Read five sample codes out loud to someone on support. If they hesitate, rename them.
  • Match every code to one next step. If two teams would handle it differently, split the code.
  • Filter logs by code and prompt version together. That's how product teams catch bad prompt releases.
  • Record the user effect, not just the system event. 'User saw 12 second delay' matters more than 'router retried twice'.
  • Put repeated handoffs in one view so you can spot loops, retries, and dead ends.

That last point is easy to miss. One handoff can look harmless. Five handoffs on the same request usually mean the router is confused, the prompt is weak, or a downstream model keeps failing in the same way. You want one screen that shows the full path, in order, with timestamps.
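Before building a screen, you can get the same view from a script. The Python sketch below groups handoff events by request and prints each path in order. The request_id, ts, and code field names are assumptions, and min_hops is an arbitrary cutoff chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


def paths_by_request(events: List[dict]) -> Dict[str, List[dict]]:
    """Group handoff events by request and sort each group by timestamp."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        grouped[event["request_id"]].append(event)
    # ISO 8601 timestamps in one timezone and format sort correctly as strings.
    return {rid: sorted(hops, key=lambda e: e["ts"]) for rid, hops in grouped.items()}


def render(hops: List[dict]) -> str:
    """Show one request's full path on a single line."""
    return " -> ".join(f'{h["ts"]} {h["code"]}' for h in hops)


def repeated(paths: Dict[str, List[dict]], min_hops: int = 2) -> List[str]:
    """Request IDs with at least min_hops handoffs (the cutoff is an assumption)."""
    return sorted(rid for rid, hops in paths.items() if len(hops) >= min_hops)


events = [
    {"request_id": "r-1", "ts": "2025-03-30T10:00:09Z", "code": "timeout"},
    {"request_id": "r-2", "ts": "2025-03-30T10:01:00Z", "code": "bad_format"},
    {"request_id": "r-1", "ts": "2025-03-30T10:00:02Z", "code": "low_confidence"},
]
paths = paths_by_request(events)
for rid in repeated(paths):
    print(rid, render(paths[rid]))
# r-1 2025-03-30T10:00:02Z low_confidence -> 2025-03-30T10:00:09Z timeout
```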

A small support example makes this obvious. A refund request starts on a cheap model, moves to a larger one, then gets sent to a rules-based flow. If the log only says 'rerouted', nobody knows why. If it shows 'intent low confidence', then 'tool response too slow', then 'manual review required', support can explain the delay and product can see where the request went off course.

Before release, pull one day of test traffic and ask two simple questions. Can support act on these codes without help? Can product group failures by code, prompt version, and user impact? If either answer is no, fix the names and logging before real users hit the router.
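The grouping question is easy to check on a day of test traffic. A minimal Python sketch follows, assuming each handoff record is a dict with code, prompt_version, and user_impact keys. Those key names and the sample records are made up for the example.

```python
from collections import Counter
from typing import Iterable, Sequence


def group_failures(
    records: Iterable[dict],
    keys: Sequence[str] = ("code", "prompt_version", "user_impact"),
) -> Counter:
    """Count handoffs per combination of the chosen fields."""
    return Counter(tuple(record[k] for k in keys) for record in records)


test_traffic = [
    {"code": "bad_format", "prompt_version": "18", "user_impact": "retry"},
    {"code": "bad_format", "prompt_version": "18", "user_impact": "retry"},
    {"code": "bad_format", "prompt_version": "17", "user_impact": "none"},
    {"code": "timeout", "prompt_version": "18", "user_impact": "delay"},
]

print(group_failures(test_traffic).most_common(1))
# [(('bad_format', '18', 'retry'), 2)]
print(group_failures(test_traffic, keys=("code", "prompt_version")).most_common(1))
# [(('bad_format', '18'), 2)]
```

If a record is missing one of those keys, the function raises a KeyError. During a pre-launch check that is useful, because it shows which handoffs were logged without the fields product needs.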

Next steps for support and product teams

A code list matters only if people use it every week. Start with the last month of handoffs and sort by frequency. When the same code keeps showing up in the same flow, you have a real pattern, not random noise.

Look at those repeated failures through the user's eyes. A router problem is rarely only a router problem. The cause may sit in a prompt that asks for missing context, a tool that times out, or a policy that blocks a request that should pass. Good failure codes save support from guessing and give product a clear place to look.

A simple working routine is enough:

  • pull the top repeated codes from the last month
  • pick one code and trace it back to a prompt, tool, or policy
  • fix the routing rule that creates the most user pain first
  • rewrite code names and descriptions in plain language both teams understand

User pain should decide the order. If customers asking for refunds keep getting handed off twice before they reach the right path, fix that before you spend time on a rare fallback that only affects internal users. Frequency matters, but frustration matters more.

Support and product should also share the same short glossary. If support reads tool_context_missing and product reads retrieval_mismatch, the code is already doing a bad job. Name failures in words a new teammate can understand in ten seconds. That makes triage faster and cuts down on debates that go nowhere.

A small monthly review is enough for most teams. Pick one code, check a few real conversations, and decide where the change belongs. Sometimes the fix is a prompt edit. Sometimes it's a new tool retry. Sometimes the router rule itself is wrong.

If the same routing problem keeps bouncing between support, product, and engineering, outside help can save time. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor on AI-first development, infrastructure, and automation, which can be useful when handoff problems span prompts, tools, and ops.

Frequently Asked Questions

What is a router failure code?

A router failure code names the reason the router left its first path. It tells support and product whether the switch happened because of a timeout, a tool problem, a policy rule, bad output, or weak confidence.

When should I log a handoff code?

Log a code any time the first path could not finish the job and the router had to change course. That includes switching models, retrying through another flow, calling a tool that failed, or sending the case to a person.

How many failure codes should I start with?

Start with a small set, usually around five codes. Teams sort events more consistently when each code means one thing and points to one first action.

Which codes make sense for a first version?

A solid first set covers timeout, tool_error, policy_block, bad_format, and low_confidence. Those codes map to different owners, so people know where to look first instead of arguing about the cause.

Should I separate user mistakes from router mistakes?

Yes. Keep user issues like missing details or unsupported requests separate from router issues like bad thresholds or wrong model choice. Different teams fix those problems, so mixing them slows triage.

What should each handoff record include?

Record what changed, why it changed, when it changed, what the user felt, and one short note a human can read fast. Include the source and target model, the code, the timestamp, the request type, the prompt version, and any tool involved.

How do I choose one code when several problems show up?

Pick the first reason that forced the switch and use that as the primary code. If the router hit tool_timeout before anything else went wrong, log that first and keep the rule strict.

Why is using other a bad idea?

other hides patterns and turns real issues into noise. If half your traffic lands there, support starts guessing and product teams fix the wrong thing.

How can I test whether my codes make sense before launch?

Run old tickets and router logs through the draft codes, then ask one person from support and one from product to label the same cases. If they disagree often, rename the codes or split the fuzzy ones.

What should I do if the same handoff problem keeps coming back?

Watch for repeats by code, prompt version, tool, and user impact. When the same issue keeps bouncing between teams, fix the shared cause first, and bring in outside help if prompts, tools, and ops all play a part.