Feb 25, 2025·8 min read

Model federation for real products: routing rules that work

Model federation lets teams route each task to the right model with clear limits for cost, speed, privacy, and fallback behavior.

Why demos fail in production

A demo proves one narrow point: one prompt can work once. A product has to survive messy inputs, repeat users, long threads, traffic spikes, and tasks that change from one request to the next. That is when routing stops feeling like a neat model trick and starts looking like an operations problem.

In a demo, someone chooses a prompt that already behaves well. Real users do not. They send short questions, vague questions, angry messages, screenshots, partial forms, and requests with missing details. One message needs a quick summary. The next needs careful reasoning, tool use, or a firm refusal because the data is sensitive. One model and one routing guess rarely hold up under that mix.

Users also judge the product by things a demo can hide. They notice when a reply takes 12 seconds instead of 2. They notice when the answer sounds sure but gets a small fact wrong. They notice when private data seems to move somewhere it should not. Those are product problems, not benchmark problems.

Cheap models often look good on average requests. Then they fail on the cases that create support tickets: noisy screenshots, long email chains, policy edge cases, and prompts with hidden ambiguity. If your router sends all low-cost traffic there, you save money until retries, escalations, and churn wipe out the gain.

The strongest model has the opposite problem. It can handle more hard cases, but most routine work does not need that much power. If every password reset, shipping update, or meeting summary goes to your most expensive model, cost climbs while the user experience barely changes.

A small support inbox makes this obvious. An order status question should get a quick, cheap answer. A refund dispute with account history and policy checks needs more care. A message that includes personal records may need a different route altogether, or no outside model at all.

That is why LLM routing rules matter. Production systems need clear rules for cost, speed, quality, and data handling before the first live users arrive. Without those rules, the demo works and the product leaks money, time, and trust.

What each task needs

Most routing mistakes start with a bad assumption: teams treat every prompt as the same kind of work. Prompts are not all alike. You get better results when you sort tasks by job type before you sort them by model.

A support tool, for example, might do several different jobs in one flow:

sort a message into the right queue
pull facts from docs or old tickets
extract fields like order number or refund amount
draft a reply in the right tone
make a decision, or hand the case to a person

Each job needs its own rules. A short classification task usually needs speed and low cost more than deep reasoning. Drafting needs better writing. Extraction needs strict structure. Decision steps carry more risk, so they need tighter rules and sometimes no model at all.

For each task, write down the input size, the output you expect, and the cost of a bad answer. A model that works well on a 200-word email may break when you give it a 40-page contract. A task that returns one label is much easier to validate than a task that writes a customer reply.

Privacy matters just as much. Some tasks can run on public or lightly sensitive text. Others touch contracts, health details, payroll data, or internal plans. Mark those cases clearly. If a task needs private data, that should narrow your model choice before cost or speed enter the picture.

Output format deserves its own line in the task sheet. If the next step expects valid JSON, fixed fields, or a strict schema, say so. "Good enough" prose breaks workflows fast.

Treat customer-facing work separately from internal work. A rough draft for an analyst is one thing. A billing reply sent to a paying customer is another. The second needs cleaner tone, stricter checks, and a safe fallback if the model gets confused.

This simple breakdown saves time later. You stop arguing about models in the abstract and start matching each task to the job it actually has to do.
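One way to keep that breakdown honest is to write the task sheet down as data. The sketch below is only an illustration: the `TaskSpec` fields, the task names, and every value in `TASK_SHEET` are hypothetical placeholders, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    name: str
    max_input_words: int        # largest input the task is expected to handle
    output_format: str          # "label", "json", or "prose"
    bad_answer_cost: str        # "low", "medium", or "high"
    touches_private_data: bool  # narrows model choice before cost or speed
    customer_facing: bool


# Hypothetical task sheet for a support tool; all values are placeholders.
TASK_SHEET = [
    TaskSpec("classify_ticket", 500, "label", "low", False, False),
    TaskSpec("extract_fields", 500, "json", "medium", True, False),
    TaskSpec("draft_reply", 2000, "prose", "medium", False, True),
    TaskSpec("refund_decision", 2000, "json", "high", True, True),
]


def needs_strict_checks(task: TaskSpec) -> bool:
    """Customer-facing or high-stakes tasks get stricter checks and a safe fallback."""
    return task.customer_facing or task.bad_answer_cost == "high"


strict_tasks = [t.name for t in TASK_SHEET if needs_strict_checks(t)]
print(strict_tasks)
```

A sheet like this is short enough to review in one meeting, and it gives the router something concrete to read instead of tribal knowledge.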

Set your limits before you route

Routing usually breaks because teams start with models instead of limits. The expensive model gets used "just in case," private data slips into the wrong path, and nobody knows when a person should step in. That is how a tidy demo turns into a messy product.

Set a budget per task, not just per month. A monthly cap tells you spending went up, but not which workflow caused it. A support reply, a refund check, and a contract summary should each have their own cost ceiling. If one task can justify only a few cents, write that rule down and enforce it.

Response time needs the same treatment. Users judge the product by the wait, not by the model name. A live chat answer may need to arrive in a few seconds. An internal research summary can take longer. If you do not define an acceptable delay for each task, routing will drift toward slower and more expensive choices.
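As a rough sketch, per-task cost and latency ceilings can sit in one small table that the router checks before it picks a model. The task names match the examples above, but the `TaskLimits` type, the `within_limits` helper, and all the numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskLimits:
    max_cost_usd: float   # cost ceiling for one request of this task
    max_latency_s: float  # longest acceptable wait for this task


# Hypothetical limits; replace with numbers your product can justify.
LIMITS = {
    "support_reply": TaskLimits(max_cost_usd=0.02, max_latency_s=3.0),
    "refund_check": TaskLimits(max_cost_usd=0.10, max_latency_s=10.0),
    "contract_summary": TaskLimits(max_cost_usd=0.50, max_latency_s=60.0),
}


def within_limits(task: str, estimated_cost_usd: float, expected_latency_s: float) -> bool:
    """Return True if a candidate model fits both ceilings for this task."""
    limits = LIMITS[task]
    return (
        estimated_cost_usd <= limits.max_cost_usd
        and expected_latency_s <= limits.max_latency_s
    )
```

A model that fails this check for a task is simply not a candidate for it, no matter how good its answers look in a demo.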

Keep privacy rules plain. Split data into a few clear groups such as public, internal, and restricted. Then map each group to where it can run. Public text might go to an outside API. Customer records, contracts, health data, or payment details may need to stay in your cloud or on your own machines. Put that rule in code and config, not in a forgotten wiki.
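A minimal sketch of what "in code and config" could look like. The three data groups come from the text; the destination names (`external_api`, `own_cloud`, `on_prem`) and the exact mapping are assumptions, so adjust them to your own deployment and compliance rules.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"


# Hypothetical mapping from data group to the places it may run.
ALLOWED_DESTINATIONS = {
    DataClass.PUBLIC: {"external_api", "own_cloud", "on_prem"},
    DataClass.INTERNAL: {"own_cloud", "on_prem"},
    DataClass.RESTRICTED: {"on_prem"},
}


def can_run(data_class: DataClass, destination: str) -> bool:
    """Return True only if this data group is approved for the destination."""
    return destination in ALLOWED_DESTINATIONS[data_class]
```

Because the check is a plain lookup, it is easy to test and hard to bypass by accident.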

Human review also needs a hard line. "Review when needed" is not a rule. Decide which cases always go to a person, such as:

payment changes or refunds
legal or compliance wording
account suspension or closure
low confidence outputs or conflicting results

Many teams skip this part, and it costs them later. Clear limits make routing boring in the best way. The system spends what you expect, answers inside a known window, keeps sensitive data where it belongs, and hands risky work to a person before it causes trouble.
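The human review line can be written as a small check rather than a guideline. This is a sketch under assumptions: the case type labels, the `needs_human` helper, and the 0.8 confidence threshold are hypothetical and should be tuned on your own data.

```python
# Hypothetical case types that always go to a person.
ALWAYS_REVIEW = {
    "payment_change",
    "refund",
    "legal_wording",
    "account_suspension",
    "account_closure",
}

MIN_CONFIDENCE = 0.8  # assumed threshold, not a recommended value


def needs_human(case_type: str, confidence: float, results_conflict: bool = False) -> bool:
    """Return True when a person must review the case before anything is sent."""
    if case_type in ALWAYS_REVIEW:
        return True
    return confidence < MIN_CONFIDENCE or results_conflict
```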

Build the policy step by step

Most teams make routing too clever too early. A better policy starts small, with one default model that handles routine work well enough at a low price. That usually means short summaries, draft replies, simple classification, and routine extraction.

Then add one stronger model for the cases that actually need it. Use it for long inputs, messy reasoning, or tasks where a weak answer creates extra work for a human later. If the expensive model handles only 10 to 20 percent of traffic, you keep quality where it matters without paying premium rates on every request.

Privacy needs its own route. If a task includes customer records, contracts, internal code, or regulated data, send it through a private path from the start. Do not rely on a prompt that says "ignore sensitive data." The router should decide before the request leaves your system.

A simple policy is often enough:

send short, routine requests to the default model
send long, complex, or low confidence cases to the stronger model
send sensitive requests to the private path
stop waiting after a fixed timeout, then retry once or hand off

That last rule matters more than teams expect. Write down exact timeouts, retry counts, and fallback behavior. If the default model times out after four seconds, retry once. If it fails again, move the task to the stronger model only if the task is not sensitive and still fits the budget. If both paths fail, return a safe partial result or send the case to a person.
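The first three rules of that policy fit in a few lines. The sketch below assumes a `Request` type with a sensitivity flag and a confidence score from an earlier step; those names and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    text: str
    sensitive: bool    # set by a privacy check that runs before routing
    confidence: float  # from an earlier classification step (assumed to exist)


MAX_ROUTINE_WORDS = 300  # assumed cutoff for a "short" request
MIN_CONFIDENCE = 0.8     # assumed cutoff for a "low confidence" case


def choose_route(req: Request) -> str:
    """Pick a route: privacy first, then difficulty, then the cheap default."""
    if req.sensitive:
        return "private"
    if len(req.text.split()) > MAX_ROUTINE_WORDS or req.confidence < MIN_CONFIDENCE:
        return "strong"
    return "default"
```

The order of the checks is the design choice here: the sensitivity test runs first, so a long or confusing request with private data never reaches the public routes.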

Test the policy on real traffic samples, not hand-picked prompts from a workshop deck. Pull a week of real requests, remove private data, and replay them. That is where you find long messages, half-finished forms, mixed languages, and vague user intent. It is also where routing starts to look like a real product instead of a demo.

A support inbox example

A support inbox is a good place to use this approach because the work repeats, the stakes vary, and response time matters. Most tickets are simple. A few expose private data, create refund risk, or need a careful answer. If you send every message to the biggest model, you pay too much and wait too long.

Start with a small, fast model that only sorts the ticket. It does not answer yet. It reads the message and assigns a topic such as billing, refund, login trouble, bug report, sales question, or legal request. This first pass should finish in seconds and return a confidence score.

Then route by rule:

billing and refund tickets go to a model that follows strict output formats
account access, identity, and legal issues go through a private path
general product questions stay on the cheaper path
low confidence tickets go to a stronger model or straight to a person

The billing path needs extra discipline. If the model must produce a refund summary, a case tag, and a short reply draft, force that shape every time. Loose answers create more work for the support team. A slightly slower model is often worth it here if it sticks to the format.
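One way to force that shape is to reject any output that does not parse into the three expected fields. This sketch assumes the model is asked to return JSON; the field names `refund_summary`, `case_tag`, and `reply_draft` follow the text, while the `parse_billing_output` helper is hypothetical.

```python
import json
from typing import Optional

REQUIRED_FIELDS = ("refund_summary", "case_tag", "reply_draft")


def parse_billing_output(raw: str) -> Optional[dict]:
    """Return the parsed output, or None if it does not match the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        # Every field must be present and be a non-empty string.
        if not isinstance(value, str) or not value.strip():
            return None
    return data
```

A `None` result is a format failure: count it, then retry, switch models, or hand the ticket to a person instead of passing loose text downstream.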

Keep privacy rules simple. If a ticket mentions an account number, invoice, contract, personal data, or a threat of chargeback, do not send it through the normal public route. Send it to the private path, log the reason, and limit what the model can see.

Unclear tickets are where many systems break. A customer might write, "I was charged twice and now I can't log in." That is both billing and account access. The first model may guess wrong. Track how often a later step changes the route after the first pass. If reroutes happen a lot, your labels are too vague or your first model is too weak.
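Tracking reroutes takes little code. The sketch below assumes you log a pair of labels per ticket, the first-pass route and the final route; the `reroute_report` helper and the label names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def reroute_report(routes: List[Tuple[str, str]]):
    """routes holds (first_pass_route, final_route) pairs, one per ticket.

    Returns the share of tickets whose route changed, plus the most
    common (first, final) changes, most frequent first.
    """
    changed = [(first, final) for first, final in routes if first != final]
    rate = len(changed) / len(routes) if routes else 0.0
    return rate, Counter(changed).most_common()


rate, top_changes = reroute_report([
    ("billing", "billing"),
    ("billing", "account_access"),
    ("general", "general"),
    ("billing", "account_access"),
])
```

The list of most common changes matters as much as the rate: if one pair dominates, those two labels probably overlap and need clearer definitions.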

A good support flow does not chase perfect automation. It sorts the easy tickets fast, handles sensitive cases with care, and hands messy cases to someone who can judge them.

Plan for slow answers and hard failures

Users do not care which model failed. They care that the product still responds and does not leave them staring at a spinner. If a reply takes too long, send a short safe message before they leave, such as "I need a bit more time" or "A person will review this request." That small step keeps trust intact.

A good routing setup needs a time budget. Give the first model a short window to answer. If the error looks temporary, like a timeout or rate limit, retry once. If it fails again, switch to the backup model instead of waiting longer.

A simple failure ladder

One practical order looks like this:

try the preferred model with a strict timeout
retry once on a temporary error
send the task to a backup model with a smaller prompt if needed
return a safe fallback message if the answer still does not arrive
route high risk cases to a person

Do not retry forever. Two failed attempts usually mean the user needs a different path, not more waiting.
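The ladder above can be sketched as one function. Everything here is an assumption for illustration: `TemporaryError` stands in for whatever timeout or rate limit exception your model client raises, and `preferred` and `backup` are callables you supply that enforce their own timeout.

```python
from typing import Callable, Tuple


class TemporaryError(Exception):
    """Stand-in for a timeout or rate limit raised by a model client."""


SAFE_MESSAGE = "I need a bit more time. A person will review this request."

ModelCall = Callable[[str, float], str]  # (prompt, timeout_s) -> answer


def answer_with_ladder(
    prompt: str,
    preferred: ModelCall,
    backup: ModelCall,
    high_risk: bool,
    timeout_s: float = 4.0,
    backup_max_chars: int = 2000,
) -> Tuple[str, str]:
    """Return (outcome, text). Errors other than TemporaryError propagate."""
    # Steps 1 and 2: preferred model, with one retry on a temporary error.
    for _attempt in range(2):
        try:
            return "preferred", preferred(prompt, timeout_s)
        except TemporaryError:
            continue
    # Step 3: backup model with a trimmed prompt.
    try:
        return "backup", backup(prompt[:backup_max_chars], timeout_s)
    except TemporaryError:
        pass
    # Steps 4 and 5: safe message, with a human handoff for high-risk cases.
    if high_risk:
        return "human", SAFE_MESSAGE
    return "fallback", SAFE_MESSAGE
```

This sketch hands off only after the ladder fails. Cases that always need review would skip the ladder and go straight to a person, with any model output treated as a draft.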

Caching helps more than many teams expect. Stable requests, like FAQ answers, document labels, or short summaries of the same policy text, should not hit a model every time. A simple cache can cut seconds from repeat requests and lower cost at the same time.
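A cache for stable requests can start as a dictionary keyed by task and normalized text. This is a minimal in-memory sketch with hypothetical names; it has no expiry, so a real version would need to drop entries when the underlying policy or document changes.

```python
import hashlib
from typing import Callable, Dict


class AnswerCache:
    """Tiny in-memory cache keyed by task name and normalized input text."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(task: str, text: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(f"{task}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_compute(self, task: str, text: str, compute: Callable[[str], str]) -> str:
        """Return the cached answer, calling compute (the model) only on a miss."""
        key = self._key(task, text)
        if key not in self._store:
            self._store[key] = compute(text)
        return self._store[key]
```

Including the task name in the key keeps a cached FAQ answer from being returned for a summary request on the same text.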

Every fallback needs a reason in the logs. Record whether the system hit a timeout, a rate limit, a privacy rule, a malformed answer, or a confidence check. Later, those reasons show you where the policy breaks. Without that record, failures blur together and teams guess.
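A fixed set of reasons keeps those logs comparable over time. The sketch below uses Python's standard `logging` module; the `FallbackReason` enum mirrors the reasons listed above, while the logger name, the `record_fallback` helper, and the in-memory counter are assumptions.

```python
import logging
from collections import Counter
from enum import Enum


class FallbackReason(Enum):
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    PRIVACY_RULE = "privacy_rule"
    MALFORMED_ANSWER = "malformed_answer"
    LOW_CONFIDENCE = "low_confidence"


logger = logging.getLogger("router")  # hypothetical logger name
fallback_counts: Counter = Counter()


def record_fallback(task: str, from_route: str, to_route: str, reason: FallbackReason) -> None:
    """Count the fallback and log it with a machine-readable reason."""
    fallback_counts[(task, reason.value)] += 1
    logger.warning(
        "fallback task=%s from=%s to=%s reason=%s",
        task, from_route, to_route, reason.value,
    )
```

Requiring an enum value rather than free text means nobody can log a fallback as "other" without adding a new reason on purpose.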

Some requests should never end with an automatic guess. Payment disputes, account access problems, legal wording, and anything tied to safety need a manual path. The system can collect context, draft a response, and attach the failed output, but a person should make the final call.

That mix of fast fallback, one retry, caching, clear logs, and human review is what makes the system reliable.

Measure the system after launch

Once routing goes live, you need facts, not gut feel. A route that looks cheap on paper can cost more later if it times out, retries, or sends too many requests to a fallback model.

Track each route on its own. Do not blend support replies, document extraction, and classification into one dashboard. Each task fails in a different way, so each one needs its own scorecard.

A simple scorecard should include:

answer quality from a small review sample
cost per request or per 100 requests
median latency and p95 latency
timeout rate, retry rate, and handoff rate
format failure rate, such as broken JSON or missing fields
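Most of that scorecard can be computed from per-request logs for one route. The sketch below assumes a hypothetical `RequestLog` record and uses the nearest-rank method for p95; answer quality is left out because it comes from human review, not from logs.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestLog:
    latency_s: float
    cost_usd: float
    timed_out: bool = False
    retried: bool = False
    handed_off: bool = False
    format_ok: bool = True


def scorecard(logs: List[RequestLog]) -> Dict[str, float]:
    """Summarize one route. Call it once per route, never on blended traffic."""
    if not logs:
        raise ValueError("need at least one request log")
    n = len(logs)
    latencies = sorted(r.latency_s for r in logs)
    # Nearest-rank p95: the value at rank ceil(0.95 * n), using integer math.
    p95_rank = (95 * n + 99) // 100
    return {
        "cost_per_100": 100 * sum(r.cost_usd for r in logs) / n,
        "median_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[p95_rank - 1],
        "timeout_rate": sum(r.timed_out for r in logs) / n,
        "retry_rate": sum(r.retried for r in logs) / n,
        "handoff_rate": sum(r.handed_off for r in logs) / n,
        "format_failure_rate": sum(not r.format_ok for r in logs) / n,
    }
```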

Quality only means something next to cost. If one route gives slightly better answers but costs four times more, keep it only if the gain is real. A premium model might improve refund email drafts, but not enough to justify the extra spend on routine order status questions.

Review bad outputs by task type. Put extraction errors in one bucket, policy mistakes in another, and formatting failures in a third. When teams throw every bad answer into one pile, they miss the pattern. You cannot fix a JSON problem with the same rule you use for weak writing.

Some rules look smart and never fire. Remove them. Old fallback branches, rare exception paths, and duplicate privacy checks add noise and make debugging slower.

Tighten rules quickly when a route risks private data exposure or keeps breaking the expected format. If a cheaper model starts leaking details from support tickets or returns malformed fields, stop using it for that task. Send that task to a safer model, add stricter validation, or require human review.

Teams that review routes every week for the first month usually find obvious waste. One or two policy changes can cut spend, lower failure rates, and make the system easier to trust.

Mistakes teams make early

Early projects usually go wrong for a simple reason: teams pick models because they are popular, not because the task needs them. A flashy model can look great in a demo, then waste money on routine work like tagging tickets, cleaning text, or drafting short replies.

The safer approach is boring on purpose. Match the model to the job. If a task needs speed and low cost, use the cheaper fast model. If a task needs careful reasoning, longer context, or stricter privacy handling, route it there on purpose and write down why.

Many teams also make one model do everything. That feels easier for a week or two. Then the problems show up. Simple tasks cost too much, slow tasks block the queue, and one outage can break the whole product.

A split setup works better: use a small fast model for classification and short rewrites, a stronger model for harder reasoning, a private route for sensitive data, and a fallback model for timeouts or outages.

Context limits cause trouble earlier than most teams expect. A routing rule may look fine on paper, but real users paste long emails, logs, contracts, or chat history. If the input is too large, the model may fail, truncate the prompt, or answer from partial context. Those are the worst errors because the reply can still look polished.

Test ugly inputs before launch. Try oversized prompts, messy pasted text, mixed languages, and long threads with repeated instructions. If the router does not check size first, users become the test team.
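A size gate in front of the router can be very simple. The sketch below estimates tokens from character count, which is only a rough heuristic; the four characters per token figure, both limits, and the route names are assumptions, and a real system would use the tokenizer of the model it calls.

```python
def size_gate(
    text: str,
    default_limit_tokens: int = 8_000,
    long_limit_tokens: int = 100_000,
    chars_per_token: float = 4.0,
) -> str:
    """Decide where an input can go based on a rough token estimate."""
    # Round up slightly so borderline inputs land on the safer side.
    estimated_tokens = int(len(text) / chars_per_token) + 1
    if estimated_tokens <= default_limit_tokens:
        return "default"
    if estimated_tokens <= long_limit_tokens:
        return "long_context"
    # Too large for any route: split the input or ask the user to shorten it.
    return "reject_or_split"
```

The point is that the decision happens before a model call, so an oversized input becomes a visible routing outcome instead of a polished answer built on half the text.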

Failure handling gets skipped for the same reason. People assume they will add retries and fallbacks later. Later usually means after customers hit timeouts, half-written answers, or silent drops. Run drills early. Force a model timeout. Force a provider error. Force a privacy rule that blocks the first-choice model. Then check what the user sees.

Spending is another blind spot. Monthly totals hide waste. A team may think costs look fine while one task burns most of the budget because prompts are too long or the wrong model handles easy work.

Track cost per task, not just per month. Track latency the same way. If one support reply costs 20 times more than another, you want to know which route caused it and whether the extra quality was real. That level of tracking turns routing from guesswork into something you can fix.
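To see which task burns the budget, group spend by task and look at each task's share of the total. The helper name and the sample numbers below are hypothetical; the input is assumed to be one `(task, cost_usd)` pair per request.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def spend_share(records: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {task: (total_cost, share_of_total)}, largest spender first."""
    totals: Dict[str, float] = defaultdict(float)
    for task, cost_usd in records:
        totals[task] += cost_usd
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {
        task: (total, total / grand_total if grand_total else 0.0)
        for task, total in ranked
    }


# Made-up example: many cheap classifications, two expensive summaries.
sample = [("classify", 0.001)] * 100 + [("contract_summary", 0.45)] * 2
shares = spend_share(sample)
```

In this made-up sample, two contract summaries account for about 90 percent of spend, which a monthly total would never show.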

Quick checks before rollout

Before you turn routing on for live users, do one last review with the people who will own the outcome. Product, engineering, support, and anyone handling compliance should understand the rules. If one person cannot explain why a request goes to one model instead of another, the policy is still too messy.

A short checklist catches most expensive mistakes:

set a hard cost cap for every route, not just for the whole system
give each route a timeout and a fallback
mark sensitive work clearly and keep it on approved infrastructure only
review a small sample of outputs every week
write each rule as one plain sentence

A simple test case makes this real. Imagine a support inbox that classifies incoming email, drafts a reply, and flags urgent cases. Classification might go to a cheaper model with a low budget cap. Reply drafting might use a better model, but only with a five second timeout and a backup option. If the message includes billing details or account data, the route stays on approved infrastructure even if that path costs more.
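The checklist can also run as an automated check on the routing config. The sketch below encodes the test case as a plain dictionary and flags any route that lacks a cost cap, a timeout, a fallback, or a one-sentence rule. The route names, model labels, and cost values are hypothetical; only the five second timeout comes from the example.

```python
from typing import Dict, List

# Hypothetical routing config for the support inbox example.
ROUTES: Dict[str, dict] = {
    "classification": {
        "model": "cheap_model",
        "max_cost_usd": 0.001,
        "timeout_s": 2,
        "fallback": "human_queue",
        "rule": "Classify every incoming email with the cheap model.",
    },
    "reply_drafting": {
        "model": "better_model",
        "max_cost_usd": 0.03,
        "timeout_s": 5,
        "fallback": "backup_model",
        "rule": "Draft replies with the better model, within five seconds.",
    },
    "billing_or_account": {
        "model": "approved_infra_model",
        "max_cost_usd": 0.10,
        "timeout_s": 10,
        "fallback": "human_queue",
        "rule": "Keep billing and account data on approved infrastructure.",
    },
}

REQUIRED_KEYS = ("max_cost_usd", "timeout_s", "fallback", "rule")


def lint_routes(routes: Dict[str, dict]) -> List[str]:
    """Return one message per missing or empty required setting."""
    problems = []
    for name, config in routes.items():
        for key in REQUIRED_KEYS:
            if config.get(key) in (None, ""):
                problems.append(f"{name}: missing {key}")
    return problems
```

Running a check like this in CI means a new route cannot ship without a cap, a timeout, a fallback, and a sentence someone can read.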

Teams often skip these checks because the demo already works. Production does not care about a good demo. It cares about spend, response time, privacy, and whether someone can fix the system at 2 a.m. A one-page routing policy usually beats a clever setup nobody trusts.

Next steps for your team

Ten routes sound thorough. In practice, they usually create noise. Start with two or three clear paths so people can see why each request went where it did.

A good first pass is simple: one cheap and fast model for routine work, one stronger model for messy cases, and one human handoff when the risk is too high. That is enough to test this approach in a real product without turning policy into a maze.

Pick one customer flow and run it for a week. A support inbox works well because the volume is steady and the results are easy to judge. Compare reply time, cost per ticket, and how often the route changed after the first pass.

Write the rules in plain language before you ship them. If a teammate cannot read the rule and predict the route, the rule is too vague.

A simple starting set might be enough:

send billing, refunds, or account access issues to the stronger model first
hand off to a person when the message raises legal risk, identity doubt, or repeated confusion
keep private customer data inside the route you already approved for it
use the cheap model only for repeat questions and other low risk work

Use real triggers, not gut feel. A stronger model might step in when a message is long, includes several requests, or mentions a failed charge. A person should step in when the customer is still confused after one reply, or when the model asks for missing details twice.

Keep the rollout small. One week of clean notes beats a month of vague impressions. By the end, your team should know which rules saved money, which ones slowed things down, and which ones nobody trusted.

If you want a second opinion before a wider rollout, Oleg Sotnikov at oleg.is reviews routing rules, infrastructure, and rollout plans for startups and smaller companies moving AI into production.

Frequently Asked Questions

Why not send every request to one strong model?

Most product traffic does not need your strongest model. Routine requests need fast, cheap answers, while only harder cases need deeper reasoning. If you send everything to one top model, you raise cost and wait time without giving users much more value.

What should I sort first when I design routing rules?

Start with the job, not the model. Separate classification, extraction, drafting, and decision steps first, then set cost, speed, privacy, and format rules for each one.

When should a request stay off a public model API?

Keep data on a private path when a request includes customer records, contracts, payment details, health information, or internal code. Let the router make that choice before the request leaves your system.

How do I set a budget for each route?

Set a spending cap for each task, not just for the month. A support reply, a refund check, and a contract summary can justify very different costs, so write those limits down and enforce them in code.

How many routes should I launch with?

Begin with two or three paths. Use a cheap model for routine work, a stronger model for long or messy cases, and a human handoff for risky situations. That setup gives you clear rules without turning routing into a maze.

Should I test routing with demo prompts or real traffic?

Use real requests from production or a recent sample, not polished demo prompts. Real traffic shows you long messages, missing details, mixed languages, and odd edge cases that a workshop example will never catch.

How should I handle timeouts and model failures?

Give the first model a short timeout and retry once if the error looks temporary. If it fails again, switch to a backup path or hand the case to a person, and show the user a short safe message instead of leaving them waiting.

What metrics tell me if routing actually works?

Watch cost per request, median and p95 latency, timeout rate, retry rate, handoff rate, and broken output rates such as bad JSON. Review each task on its own, because extraction, drafting, and classification fail in different ways.

What mistakes do teams make early?

Teams often copy popular models, let one model do every job, ignore long inputs, and skip failure drills. They also miss per-task cost tracking, so waste hides until support tickets and cloud bills pile up.

When should a human take over?

Send a case to a person when money, access, legal wording, safety, or identity checks are involved. You should also hand off when the model shows low confidence, asks for missing details more than once, or keeps returning the wrong format.