Oct 10, 2025·8 min read

Multilingual evaluation sets for global support teams

Learn how multilingual evaluation sets expose support gaps, reflect real customer language, and give teams a clearer way to test quality.

Why English-only tests miss support issues

English is a useful starting point, but it can fool teams into thinking support quality is better than it is. An assistant might sound clear and polite in English, then become awkward, too blunt, or simply wrong in another language.

Some problems only appear when the model has to handle local grammar, tone, and social norms. A reply that sounds neutral in English can feel rude in Spanish. A short answer that seems efficient in English can sound cold in Arabic if it misses the expected level of formality. The facts may stay the same, but the user experience changes.

The shape of the question changes too. People do not write support requests in neat textbook sentences. They use slang, local terms, mixed languages, typos, and rushed phone typing. English test cases often smooth all of that out, and translating those clean cases into another language carries the tidiness over. The messy details that make real support hard never reach the set.

Take a simple login issue. In English, a user might write, "I can't log in after changing my phone." A translated version often stays short and tidy. A real Spanish message might include a typo, an emotional complaint, and a local phrase for a two-factor code. If the team tested only the clean English version, the assistant may miss the intent or give the wrong recovery steps.

Policy mistakes hide the same way. A model may follow refund or account recovery rules in English, then drift in another language because the prompt, retrieval, or guardrails do not carry across cleanly. That is how a test passes in English and still fails in Spanish or Arabic.

Users also describe the same problem in very different ways. One asks directly. Another tells a story. Another sends half a sentence and expects the assistant to fill in the gaps. If your tests only reflect English phrasing, you are not really testing support. You are testing one narrow writing style.

That is why multilingual evaluation sets matter. They show whether the assistant understands how people actually ask for help, not just how the team first wrote the case in English.

Decide what good support looks like

Before you write a test case, decide what success means. Teams often judge support by whether a reply sounds smooth in English. That is too shallow. A friendly sentence can still give the wrong refund rule, miss a safety issue, or fail to hand the case to a human.

Start with the real job. Focus on requests that appear every week, not unusual edge cases. Most support assistants need to answer account and order questions, explain policy, ask for missing details, set expectations, and escalate when a case goes beyond simple help.

A basic task list is enough to start. The assistant should answer common account or order questions, explain policy without guessing, ask for the one detail that unlocks the case, refuse unsafe requests, and hand off messy, urgent, or sensitive situations.

Then split scoring into two parts. One score should measure whether the assistant solved the support task. Another should measure language quality. That split matters a lot. A reply can be fluent in French and still give the wrong cancellation rule. It can sound slightly awkward in Japanese and still provide the correct steps and the right handoff.

Score the outcome and the wording separately

Check facts first. Did the assistant give the right policy, ask the right follow-up question, and avoid making things up? If the user asked about a refund, the assistant should not invent timing, exceptions, or account details.

Then check whether the reply sounds natural for that market. You want clarity, respect, and a tone that fits support. Grammar matters, but not more than accuracy. A polished sentence with bad guidance should fail.

Safe behavior needs its own rules. The assistant should protect user data, avoid asking for secrets such as full card numbers or passwords, and stop when identity checks or human review are required. It should also escalate when the user sounds stuck, angry, or at risk of harm.

Set pass rules before the first run. Keep them simple:

The policy and next steps must be correct.
The tone must stay calm, clear, and respectful.
The assistant must escalate when it lacks authority or context.
A safety failure should fail the case, even if the wording sounds good.

Once those rules are fixed, results become easier to trust. You can see whether the model has a language problem, a support problem, or both.
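
As a rough illustration, the four pass rules can be written down as a small check. This is a minimal sketch, not a prescribed implementation: the `Review` class, its field names, and the failure labels are all assumptions made for the example. It reports every broken rule instead of stopping at the first one, so a support problem and a language problem show up separately.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One reviewer's judgment of a single reply (field names are illustrative)."""
    policy_correct: bool          # policy and next steps are correct
    tone_ok: bool                 # calm, clear, and respectful
    escalated_when_needed: bool   # handed off when it lacked authority or context
    safety_ok: bool               # no unsafe data requests or leaks


def failures(review: Review) -> list[str]:
    """Return every rule the reply broke, keeping support and language problems apart."""
    broken = []
    if not review.safety_ok:
        broken.append("safety")
    if not review.policy_correct:
        broken.append("support task")
    if not review.escalated_when_needed:
        broken.append("escalation")
    if not review.tone_ok:
        broken.append("language")
    return broken


def passed(review: Review) -> bool:
    """A case passes only when no rule is broken; a safety failure alone fails it."""
    return not failures(review)


# A fluent reply with the wrong refund rule: a support problem, not a language problem.
fluent_but_wrong = Review(policy_correct=False, tone_ok=True,
                          escalated_when_needed=True, safety_ok=True)
print(failures(fluent_but_wrong))  # ['support task']
```

Returning a list rather than a single verdict is a design choice for this sketch; a team that only needs pass or fail can use `passed` on its own.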

Choose languages by risk, not by guesswork

Many teams choose languages by instinct or by whoever speaks up first. That creates a false sense of coverage.

Start with the languages tied to the most tickets or the most revenue. If a large share of support volume comes from Spanish and Portuguese, those languages belong in the set before a lower-volume market that only feels strategically interesting.

Revenue matters because support failures often show up after the sale. A weak answer about billing, shipping, or renewals can turn into churn faster than a minor product question. Good test coverage follows business risk, not translation convenience.

Risk also depends on the kind of mistake. Add markets where support teams handle refunds, cancellations, privacy requests, or policy questions that can create legal trouble or repeated chargebacks. A smaller market may deserve early testing if one wrong answer creates a painful mess.

A practical order is straightforward. Start with the languages that drive most tickets. Then add the ones tied to paid plans or larger accounts, markets where refund or compliance errors hurt, at least one non-Latin script if customers depend on it, and regional variants that change meaning in support replies.

That non-Latin script point matters more than many teams expect. If users rely on Arabic, Japanese, Korean, Thai, or another non-Latin script, include it early. Script handling can expose problems that never appear in English, such as broken formatting, odd tone, or names and addresses parsed the wrong way.

Regional differences matter too. "Spanish" is often too broad for support testing. A return policy reply that sounds normal in Spain can feel stiff or confusing in Mexico. The same issue appears with Portuguese, French, and Arabic.

If a company sells mostly in the US, Mexico, and Saudi Arabia, English alone tells you very little. Testing English, Spanish for Mexico, and Arabic covers ticket volume, revenue exposure, and script risk in one pass.
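
One way to make that ordering explicit is a simple weighted score per market. The sketch below is only an illustration: the `Market` fields, the weights, and every number in the sample data are invented, and a real team would set them from its own ticket and revenue reports.

```python
from dataclasses import dataclass


@dataclass
class Market:
    locale: str             # language plus region, e.g. "es-MX"
    ticket_share: float     # fraction of support tickets, 0..1
    revenue_share: float    # fraction of revenue, 0..1
    high_risk_policy: bool  # refund, privacy, or compliance errors are costly here
    non_latin_script: bool  # customers write in a non-Latin script


def risk_score(m: Market, w_tickets: float = 0.4, w_revenue: float = 0.3,
               w_policy: float = 0.2, w_script: float = 0.1) -> float:
    """Weighted sum of the risk factors; the weights are assumptions, not a standard."""
    return (w_tickets * m.ticket_share
            + w_revenue * m.revenue_share
            + w_policy * m.high_risk_policy
            + w_script * m.non_latin_script)


# Invented numbers for illustration only.
markets = [
    Market("fr-FR", 0.05, 0.03, False, False),
    Market("ar-SA", 0.10, 0.12, True, True),
    Market("en-US", 0.55, 0.60, False, False),
    Market("es-MX", 0.30, 0.25, True, False),
]

ranked = sorted(markets, key=risk_score, reverse=True)
for m in ranked:
    print(f"{m.locale}: {risk_score(m):.3f}")
```

With these made-up inputs, the smaller Arabic market ranks well above French because policy risk and script risk add to its score, which is the point of ranking by risk instead of volume alone.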

Collect real conversations first

Start with what customers already ask. If you invent test prompts in clean English, the assistant may look smarter than it is. Real support traffic shows how people write when they are tired, rushed, confused, or annoyed.

Pull examples from tickets, chat logs, email threads, and contact forms. Look for repeated moments like password resets, refund status, delivery delays, invoice copies, or account lockouts. A set built from real traffic usually catches more failures than one written in a meeting room.

Do not clean up the language. Keep the short forms, spelling mistakes, mixed-language phrases, missing punctuation, and half-finished sentences. Someone might type "cant login since yday pls help" or "where invoice??". That mess is part of the job.

Before you reuse any conversation, remove personal data. Strip out names, phone numbers, addresses, order numbers, card details, and anything else tied to one person. Replace them with simple placeholders such as "[order number]" or "[email]". It takes longer, but it prevents a privacy problem later.

Group cases by intent, not by the internal team that handled them. "Update my delivery address" is the same customer goal whether it first reached shipping, billing, or general support. That makes scoring much cleaner.

A short intent list is enough at the start: account access, billing and invoices, delivery and order status, returns and refunds, and plan or subscription changes. If one intent appears in several languages, keep those examples together so you can compare performance across markets.

Write cases in natural local language

A support test stops being useful when the message sounds like a translation exercise. If you write a case in English and translate it line by line, you usually lose the phrasing, shortcuts, and small mistakes real customers use. A native speaker should rewrite the case from scratch so it sounds normal in that market.

That rewrite should keep local details intact. Customers talk about prices in their own currency. They use familiar date formats and local address structures. A billing question from Spain should look like a message from Spain, not a US message with Spanish words pasted on top.

Tone changes the result more than many teams expect. One customer writes, "Could you please cancel my subscription?" Another writes, "Cancel this today." The task is the same, but the assistant has to handle both without getting defensive or missing the intent. Add polite, neutral, and blunt versions of the same request so the set reflects real traffic.

Code-switching belongs in the set too when users actually do it. Many people mix languages in one message, especially around product names, billing terms, or office jargon. A message like "Please send la factura for March to my office email" should not confuse the assistant if that pattern is common among your users.

The difference is easy to see. A direct translation might say, "I want to change my address for invoice delivery." A native rewrite might say, "Use this address on next month's invoice" and include a local postal code and date format. The second version is much closer to what support teams really receive.

Line-by-line translation is fast, but it creates a fake test set. Native rewrites expose real gaps.

Build and score the set step by step

Start small. For the first version, pick 20 to 30 support intents that drive most of your ticket volume. Think password resets, billing questions, refund requests, order status, cancellations, and account access. If you support several languages, keep the intent the same and write matching cases in each language. That gives you comparable results instead of a random mix.

Each case needs two things: one customer message and one expected outcome. Keep the expected outcome simple and specific. It should describe what the assistant must achieve, not the exact sentence it must produce. For example, "confirm the order number, explain the refund window, and send the case to a human if the payment looks disputed" is useful. "Give a helpful answer" is not.
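
In practice a case can be stored as a small record. The structure below is one possible shape, not a required schema: the `EvalCase` class, its field names, and the list of vague phrases are assumptions for illustration. The expected outcome is kept as a list of required actions instead of a target sentence.

```python
from dataclasses import dataclass

# Phrases too vague to score against; extend this set as reviewers find more.
VAGUE_OUTCOMES = {"give a helpful answer", "be helpful", "answer the question"}


@dataclass
class EvalCase:
    case_id: str
    intent: str                  # customer goal, e.g. "returns_and_refunds"
    locale: str                  # language plus region, e.g. "en-US"
    customer_message: str        # one message, kept as the customer wrote it
    expected_outcome: list[str]  # what the assistant must achieve, not exact wording


def problems(case: EvalCase) -> list[str]:
    """Flag cases that lack a message or whose expected outcome is missing or vague."""
    found = []
    if not case.customer_message.strip():
        found.append("missing customer message")
    if not case.expected_outcome:
        found.append("missing expected outcome")
    for step in case.expected_outcome:
        if step.strip().lower().rstrip(".") in VAGUE_OUTCOMES:
            found.append(f"vague outcome: {step!r}")
    return found


refund_case = EvalCase(
    case_id="refund-001",
    intent="returns_and_refunds",
    locale="en-US",
    customer_message="charged twice for [order number], can i get my money back??",
    expected_outcome=[
        "confirm the order number",
        "explain the refund window",
        "send the case to a human if the payment looks disputed",
    ],
)
print(problems(refund_case))  # []
```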

Score each case the same way

Use a short rubric that reviewers can apply without guessing. Score accuracy first. List the facts the assistant must include, the mistakes it must avoid, and whether it needs one follow-up question. Score tone separately, because a reply can be correct and still sound blunt, robotic, or too casual for that language. Score handoff rules as well. Make clear when the assistant should transfer the case, when it should refuse, and what context it must pass to a human.

Before you trust the rubric, test it on 5 to 10 cases with two reviewers. If they keep disagreeing, the rubric is vague and needs rewriting.
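
A quick way to see where two reviewers diverge is to compute how often they gave the same score in each rubric area. The sketch below uses plain percent agreement; the 0.8 threshold and the sample scores are assumptions chosen for the example, not a recommended standard.

```python
def agreement_rate(scores_a: list[int], scores_b: list[int]) -> float:
    """Share of cases where both reviewers gave exactly the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both reviewers must score the same non-empty set of cases")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Six pilot cases scored 0 to 2 by two reviewers (made-up numbers).
reviewer_a = {"accuracy": [2, 2, 1, 0, 2, 2], "tone": [2, 1, 2, 2, 0, 1]}
reviewer_b = {"accuracy": [2, 2, 1, 0, 2, 1], "tone": [1, 2, 2, 1, 0, 2]}

THRESHOLD = 0.8  # assumed cut-off; pick one that suits your team

rates = {area: agreement_rate(reviewer_a[area], reviewer_b[area])
         for area in reviewer_a}
needs_rewrite = [area for area, rate in rates.items() if rate < THRESHOLD]

for area, rate in rates.items():
    print(f"{area}: {rate:.2f}")
print("rewrite rubric for:", needs_rewrite)
```

Raw agreement does not correct for matches that happen by chance. A statistic such as Cohen's kappa is a stricter alternative once the pilot grows beyond a handful of cases.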

Keep the scale plain. Pass or fail is fine for some teams. A 0 to 2 score per area also works well: 0 for wrong, 1 for partly right, 2 for fully right. Fancy scoring usually creates more debate, not better support QA.
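
With a 0 to 2 scale, results can be summarized as a mean per area and per language. The following is a minimal sketch under assumed names: the area labels, the row format, and the sample scores are illustrative only.

```python
from collections import defaultdict
from statistics import mean

AREAS = ("accuracy", "tone", "handoff")


def summarize(rows: list[dict]) -> dict:
    """Average each rubric area per locale, rejecting scores outside 0, 1, 2."""
    collected = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for area in AREAS:
            score = row[area]
            if score not in (0, 1, 2):
                raise ValueError(f"score out of range for {area}: {score!r}")
            collected[row["locale"]][area].append(score)
    return {locale: {area: mean(values) for area, values in areas.items()}
            for locale, areas in collected.items()}


# Made-up scores: 0 = wrong, 1 = partly right, 2 = fully right.
rows = [
    {"locale": "fr-FR", "accuracy": 0, "tone": 2, "handoff": 2},  # fluent, wrong rule
    {"locale": "fr-FR", "accuracy": 2, "tone": 2, "handoff": 2},
    {"locale": "ja-JP", "accuracy": 2, "tone": 1, "handoff": 2},  # awkward, correct steps
    {"locale": "ja-JP", "accuracy": 2, "tone": 1, "handoff": 2},
]

summary = summarize(rows)
for locale, areas in summary.items():
    print(locale, areas)
```

Keeping the areas separate in the summary avoids blending a fluent but wrong reply with an awkward but correct one into the same overall number.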

Run the same set after every model, prompt, tool, or retrieval change. That consistency is what makes differences visible. If a prompt improves English but hurts Spanish refund cases, you will see it quickly.

Save old scores. Add a few new cases each month from real failures, but do not replace the whole set at once. A stable set shows whether quality is moving or whether the test simply changed.
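
Comparing two saved runs can be as simple as looking up each language and intent pair in both. The sketch below assumes scores are stored as a mean per `(locale, intent)` key; that storage format, the function name, and the numbers are illustrative. It also lists keys that exist in only one run, since those signal that the test set changed, not the assistant.

```python
def compare_runs(before: dict, after: dict, tolerance: float = 0.0):
    """Return score drops larger than `tolerance`, plus keys present in only one run."""
    shared = sorted(before.keys() & after.keys())
    regressions = [(key, before[key], after[key])
                   for key in shared
                   if before[key] - after[key] > tolerance]
    unmatched = sorted(before.keys() ^ after.keys())
    return regressions, unmatched


# Mean scores on a 0 to 2 scale, keyed by (locale, intent). Numbers are made up.
before = {
    ("en-US", "refunds"): 1.6,
    ("es-MX", "refunds"): 1.8,
    ("es-MX", "billing"): 1.5,
}
after = {
    ("en-US", "refunds"): 1.9,   # the prompt change helped English
    ("es-MX", "refunds"): 1.2,   # and hurt Spanish refund cases
    ("es-MX", "billing"): 1.5,
    ("es-MX", "account_access"): 2.0,  # new cases, no baseline yet
}

regressions, unmatched = compare_runs(before, after)
print("regressions:", regressions)
print("no baseline:", unmatched)
```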

A simple example from one support flow

Take a common case: a customer wants a refund after a charge they did not expect. This is a good test because the policy stays the same while tone and wording can shift a lot across languages.

Start with three messages that ask for the same outcome:

English: "I want a refund for the annual plan. I thought I canceled before renewal."
German, formal: "Guten Tag, ich möchte eine Rückerstattung für die Jahresgebühr beantragen. Ich bin davon ausgegangen, dass ich vor der Verlängerung gekündigt hatte."
Arabic, angry: "اريد استرداد المبلغ الان. تم تجديد الاشتراك بدون ما انتبهت، وهذا غير مقبول."

The assistant should do the same job in all three cases. It should explain the refund policy in plain language, ask for one missing detail such as the order email or charge date, and keep the tone calm. It should not blame the customer with lines like "you forgot to cancel" or "this was your mistake."

A good English reply might explain that refunds are possible within a certain period after renewal and then ask for the email tied to the purchase. The German version should keep the same meaning, not become colder or more legal just because the customer sounds formal. The Arabic version should stay respectful even if the user is upset.

Reviewers should compare more than grammar. They should ask four questions: Did every version explain the same policy? Did every version ask for only one missing detail? Did any version sound blaming or defensive? Did the angry and formal versions still lead to the same outcome path?

This kind of case often exposes hidden gaps. One language may give a clear answer, while another skips the policy, asks for too many details, or sounds harsher than intended. If meaning shifts by language, support quality shifts too.
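
Reviewer answers to those four questions can be recorded per language and checked in the same way for each one. The sketch below is illustrative only: the field names, the expected outcome path, and the sample annotations are all invented, and the judgments themselves still come from human reviewers.

```python
# The path the case definition expects, regardless of language (assumed label).
EXPECTED_PATH = "ask_for_purchase_email"

# Reviewer annotations for one refund case in three languages (made-up values).
reviews = {
    "en": {"explained_policy": True, "details_requested": 1,
           "blaming_tone": False, "outcome_path": "ask_for_purchase_email"},
    "de": {"explained_policy": True, "details_requested": 3,
           "blaming_tone": False, "outcome_path": "ask_for_purchase_email"},
    "ar": {"explained_policy": False, "details_requested": 1,
           "blaming_tone": True, "outcome_path": "generic_apology"},
}


def issues_for(review: dict, expected_path: str) -> list[str]:
    """Turn one reviewer annotation into a list of problems for that language."""
    issues = []
    if not review["explained_policy"]:
        issues.append("policy not explained")
    if review["details_requested"] > 1:
        issues.append("asked for more than one detail")
    if review["blaming_tone"]:
        issues.append("blaming or defensive tone")
    if review["outcome_path"] != expected_path:
        issues.append("different outcome path")
    return issues


report = {lang: issues_for(review, EXPECTED_PATH) for lang, review in reviews.items()}
for lang, issues in report.items():
    print(lang, issues or "consistent")
```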

Common mistakes that distort results

Most weak eval sets fail for ordinary reasons. A team writes cases in English, translates them, and assumes that covers the market. It does not. Real customers use shortcuts, local terms, spelling habits, and phrasing that never appear in neat translations.

A customer in Brazil may ask about installment payments in a way that feels normal to local support staff but odd in English. A translated test can miss that pattern completely. The assistant then looks strong in review and weak in production.

Scoring often goes wrong in a quieter way. Reviewers hear smooth language and give the answer a high score even when the policy is wrong. If the bot gives the wrong refund window, misses a required escalation, or asks for information it should not collect, the answer failed. Good grammar does not rescue a bad decision.

Another problem is comfort. Teams fill the set with simple requests because they are quick to write and easy to review. That inflates results. Real support gets messy fast, and the set needs to reflect that.

You can usually spot a distorted set quickly. It has too many basic questions with clear wording, too few cases with missing details or mixed language, too few policy edge cases, and almost no emotional range.

Then the set gets old. Product steps change. Pricing changes. Support rules change. Old cases stay in the spreadsheet for months. Scores stay stable, but they stop meaning much.

A simple rule helps: when support macros, help center text, or internal policy docs change, update the related cases that same week. Otherwise you end up measuring how well the assistant remembers a product that no longer exists.

Quick checks before each test run

A fast test can still mislead you. Ten minutes of preparation can save hours of argument over stale cases, uneven scoring, or weak language coverage.

Start with the cases. Refund windows, shipping limits, identity checks, and escalation paths change more often than teams admit. If one case still expects last month's rule, the assistant can give the correct answer and still get marked wrong. Check a small sample from every intent against current policy before running the full set.

Reviewers need calibration too. Give them the same rubric, the same examples, and a few borderline cases to score together. If one reviewer punishes tone and another only checks accuracy, the results will drift. That problem gets worse across languages because people read politeness differently.

Coverage matters as well. Each language needs enough cases for each intent, not just enough cases overall. Twenty billing cases in English and three in German do not tell you much about German support quality.
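
A coverage check like this is easy to script before each run. The sketch below counts cases per language and intent and reports any pair below a minimum; the minimum of 5 and the case format are assumptions for the example.

```python
from collections import Counter


def coverage_gaps(cases: list[dict], locales: list[str], intents: list[str],
                  minimum: int = 5) -> dict:
    """Return every (locale, intent) pair with fewer than `minimum` cases."""
    counts = Counter((case["locale"], case["intent"]) for case in cases)
    return {(locale, intent): counts[(locale, intent)]
            for locale in locales
            for intent in intents
            if counts[(locale, intent)] < minimum}


# Twenty billing cases in English and three in German, as in the example above.
cases = ([{"locale": "en", "intent": "billing"}] * 20
         + [{"locale": "de", "intent": "billing"}] * 3)

gaps = coverage_gaps(cases, locales=["en", "de"], intents=["billing", "refunds"])
print(gaps)
```

Passing the expected locales and intents explicitly matters here: a pair with zero cases would never appear if the check only looked at what is already in the set.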

When answers fail, sort them by cause before you sort them by language. A weak result may come from outdated policy in the test case, reviewer disagreement, poor retrieval, unnatural local phrasing, or a tone problem that only appears in one language. That extra step helps you fix the right thing first.

If the same policy error appears in Spanish, French, and English, the problem likely sits in the assistant or its knowledge source. If only one language struggles, check the local wording, the examples, and the reviewer notes.
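
That rule of thumb can be applied mechanically once each failure has a cause label from a reviewer. The sketch below groups failures by intent and cause and checks how many languages each group spans; the labels, the two-language threshold, and the sample data are assumptions.

```python
from collections import defaultdict


def classify_failures(failed_cases: list[dict], min_locales: int = 2) -> dict:
    """Mark each (intent, cause) as shared across languages or limited to one."""
    locales_by_issue = defaultdict(set)
    for case in failed_cases:
        locales_by_issue[(case["intent"], case["cause"])].add(case["locale"])
    result = {}
    for issue, locales in locales_by_issue.items():
        scope = "shared" if len(locales) >= min_locales else "single-language"
        result[issue] = (scope, sorted(locales))
    return result


# Cause labels are assigned by reviewers; these rows are made up.
failed = [
    {"locale": "es", "intent": "refunds", "cause": "wrong policy"},
    {"locale": "fr", "intent": "refunds", "cause": "wrong policy"},
    {"locale": "en", "intent": "refunds", "cause": "wrong policy"},
    {"locale": "de", "intent": "billing", "cause": "unnatural phrasing"},
]

diagnosis = classify_failures(failed)
for issue, (scope, locales) in diagnosis.items():
    print(issue, scope, locales)
```

A shared result points toward the assistant or its knowledge source. A single-language result points toward local wording, examples, or reviewer notes.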

Next steps after you find the gaps

Start with the errors that hit customers directly. If the assistant gives the wrong return policy in German or misses an urgent billing issue in Portuguese, fix that before you polish tone. A polite answer still fails if it is wrong, late, or routed to the wrong queue.

Group problems by cause, not only by language. Most gaps come from three places: the prompt, the knowledge source, or the routing logic. If you change all three at once, you will not know what actually helped.

A practical order is simple. Fix factual mistakes first. Then fix routing mistakes that send users into the wrong flow. After that, improve missing local terms, spelling, and phrasing, and then clean up the edge cases that matter for refunds, payments, and account access.

Make one change at a time. Update the prompt if the assistant sounds too certain when it should ask a follow-up question. Update the knowledge source if policy text is old, incomplete, or only written in English. Update routing if delivery, billing, and cancellation requests keep landing in the same generic path.

Then run the same test cases again. Do not swap in new cases right away. Repeating the same set shows whether the score moved because of your change or because the test changed. Keep a small log with the date, language, issue fixed, and before-and-after score. Patterns show up quickly.
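
The log itself can be a plain CSV file. Here is a minimal sketch using Python's standard `csv` module; the column names and the sample entry are assumptions, and the example writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "locale", "issue_fixed", "score_before", "score_after"]


def append_entry(stream, entry: dict, write_header: bool = False) -> None:
    """Write one change-log row; pass write_header=True for a new, empty log."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(entry)


# In real use, open a file instead: open("eval_log.csv", "a", newline="")
log = io.StringIO()
append_entry(log, {
    "date": date(2025, 1, 15).isoformat(),  # made-up entry
    "locale": "es-MX",
    "issue_fixed": "outdated refund window in knowledge source",
    "score_before": 1.2,
    "score_after": 1.7,
}, write_header=True)

print(log.getvalue())
```

One row per change keeps the "one change at a time" rule honest: if a row needs two issues in the same cell, two things changed between runs.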

This is where a multilingual set proves its worth. It stops teams from claiming progress based on one strong English result while the French, Arabic, or Japanese experience still breaks on basic requests.

If the gaps span prompts, knowledge sources, and support flows, an outside review can help. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor, and this kind of support assistant review fits naturally into that work. It is most useful when you already have a test set and a clear list of failures, because the discussion stays practical.

Frequently Asked Questions

Why do English-only tests miss real support problems?

Because support quality changes with language, not just with facts. A reply can sound clear in English and still feel rude, cold, or confusing in Spanish, Arabic, or Japanese.

English tests also miss the messy way people really ask for help. Slang, typos, mixed language, and rushed phone typing often reveal failures that clean English prompts never show.

What should a multilingual support test actually score?

Start with the job itself. The assistant should give the right policy, ask the right follow-up when it needs one, and hand the case to a human when the issue gets sensitive, urgent, or outside its scope.

Then check language quality on its own. Natural tone matters, but a smooth sentence should still fail if it gives the wrong refund rule or asks for data it should not collect.

How do I choose which languages to test first?

Pick languages by risk, not by guesswork. Start with the ones tied to the most tickets, the most revenue, and the biggest refund, billing, or privacy exposure.

If customers use a non-Latin script, include it early. Script handling often shows formatting, parsing, and tone problems that English never exposes.

Should I translate my English test cases?

No. A direct translation usually sounds too neat and removes the local phrasing that makes support hard.

Ask a native speaker to rewrite the case from scratch so it sounds like a real message from that market. Keep local currency, date format, address style, and common shorthand.

Where should I get good test cases from?

Use real support traffic first. Pull examples from tickets, chats, emails, and contact forms, then group them by customer intent instead of by internal team.

Keep the messy parts that users actually send, but strip out personal data. Replace names, order numbers, emails, and phone numbers with simple placeholders.

What should each test case include?

Each case needs one customer message and one expected outcome. The expected outcome should say what the assistant must do, such as explain the policy, ask for one missing detail, or send the case to a human.

Do not lock the case to one exact sentence. You want to judge whether the assistant solved the task, not whether it copied your wording.

How many cases do I need to start?

Start small. A set with 20 to 30 common support cases usually gives you enough signal to catch obvious failures in account access, billing, refunds, cancellations, and order status.

Keep the intent the same across languages when you can. That makes results easier to compare and shows whether one language breaks on the same job.

How do I stop fluent wording from hiding bad support decisions?

Score accuracy first and tone second. If the assistant gives the wrong policy, misses a required handoff, or asks for unsafe information, the answer fails even if it sounds polished.

A simple rubric helps. Reviewers should agree on the facts the reply must include, the mistakes it must avoid, and when it must escalate.

What mistakes usually make multilingual eval results unreliable?

Teams often make the set too easy. They write tidy prompts, avoid emotional or incomplete messages, and skip mixed language or local terms.

Old policy text also ruins scores fast. When refund windows, billing steps, or identity checks change, update those cases right away or your results stop meaning much.

What should I do after I find gaps in the results?

Fix customer-facing errors first. Wrong policy, bad routing, and missed escalation hurt more than awkward phrasing, so start there.

Change one thing at a time, then rerun the same set. Keep old scores and a short change log so you can see whether the prompt, knowledge source, or routing update actually helped.