Fine-tuning a small classifier before building an agent
Learn when fine-tuning a small classifier beats a larger agent for routing and tagging, using narrow labels, cheap evals, and a simple rollout plan.

Start with the actual decision
Most teams aim too high on version one. They picture an agent that reads a request, gathers context, calls tools, writes a response, and updates other systems. But if the real job is simply to sort requests into two or three buckets, start there.
Before fine-tuning a small classifier, write the decision as one plain question. It should be clear enough that two people would answer it the same way most of the time. "Should this message go to engineering, sales, or billing?" is clear. "What should we do with this customer?" is not.
Keep the first version tied to one workflow. Pick one inbox, one queue, or one form. A narrow-label classifier improves faster when every label belongs to one daily task instead of several half-related jobs.
Before you collect examples, write down the cost of a wrong answer. That sharpens the labels fast. If a billing issue lands in sales, you might lose 10 minutes. If a security report lands in general support, you might lose a day and frustrate the sender. Those costs tell you where you need tighter labels, more examples, or a human review step.
A simple setup needs four decisions on paper: the exact input the model sees, the exact label set, what happens after each label, and which mistakes you can accept for now.
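Those four decisions fit in a few lines of code, which makes gaps easy to catch. The sketch below is one possible shape, not a required format. The class name, field names, and queue names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoutingSpec:
    """The four decisions on paper. Every name here is illustrative."""
    input_field: str                    # the exact input the model sees
    labels: tuple[str, ...]             # the exact label set
    next_step: dict[str, str]           # what happens after each label
    # (true label, predicted label) pairs you can live with for now
    accepted_mistakes: tuple[tuple[str, str], ...] = ()

    def problems(self) -> list[str]:
        """Return a list of gaps in the spec; empty means it is consistent."""
        issues = []
        for label in self.labels:
            if label not in self.next_step:
                issues.append(f"label '{label}' has no next step")
        for label in self.next_step:
            if label not in self.labels:
                issues.append(f"next step defined for unknown label '{label}'")
        for true_label, predicted in self.accepted_mistakes:
            if true_label not in self.labels or predicted not in self.labels:
                issues.append(f"accepted mistake uses unknown label: {true_label} -> {predicted}")
        return issues


spec = RoutingSpec(
    input_field="email_body",
    labels=("engineering", "sales", "billing"),
    next_step={
        "engineering": "engineering queue",
        "sales": "sales queue",
        "billing": "billing queue",
    },
    accepted_mistakes=(("billing", "sales"),),
)
print(spec.problems())
```

If the list comes back with entries, fix the spec before you label anything.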
Teams often mix up routing and full automation. Routing means the model picks a lane. Automation means the system also takes action inside that lane. Keep those separate at the start. If you combine them too early, bad labels disappear inside tool calls, longer prompts, and messy logs.
You can test this cheaply before building anything fancy. Take 50 to 100 real examples, label them by hand, and check where the model disagrees. Five repeated mistakes teach you more than a polished demo where an agent gets one request right.
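A rough way to find the repeated mistakes is to count each (human label, model label) pair where the two disagree. The function below is a minimal sketch. It assumes you already have two parallel lists of labels; the function name and the sample labels are illustrative.

```python
from collections import Counter


def repeated_disagreements(human_labels, model_labels, min_count=2):
    """Count (human, model) pairs that disagree and keep the ones that repeat."""
    pairs = Counter(
        (human, model)
        for human, model in zip(human_labels, model_labels)
        if human != model
    )
    return [(pair, count) for pair, count in pairs.most_common() if count >= min_count]


human = ["billing", "billing", "bug", "sales", "billing"]
model = ["sales", "sales", "bug", "sales", "bug"]
for (expected, got), count in repeated_disagreements(human, model):
    print(f"{count}x labeled '{expected}' by hand, but model said '{got}'")
```

One-off misses drop out. What remains is the short list of confusions worth fixing first.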
If the classifier cannot route cleanly, a larger agent will only fail in more expensive ways.
Why a small classifier often wins first
A small classifier is often the better first move when the job is simple: send an input to one of a few clear buckets. If a support message only needs "billing," "bug," or "sales," a fixed set of labels usually works better than a full agent. It answers faster, costs less, and fails in ways your team can spot quickly.
The cost gap gets real fast. Tool calling often needs extra model turns, longer prompts, and retries when the model picks the wrong action. A classifier usually needs one short request. If you handle 50,000 requests a day, even a small drop in cost per request adds up by the end of the month.
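The arithmetic is simple enough to run with your own numbers. In the sketch below, both per-request prices are invented placeholders, not quotes from any provider. Only the 50,000 requests a day comes from the example above.

```python
def monthly_cost(requests_per_day: int, cost_per_request: float, days: int = 30) -> float:
    """Total spend for a month of traffic at a flat per-request cost."""
    return requests_per_day * cost_per_request * days


REQUESTS_PER_DAY = 50_000
CLASSIFIER_COST = 0.0002  # hypothetical dollars per request: one short call
AGENT_COST = 0.0020       # hypothetical dollars per request: extra turns and retries

classifier_month = monthly_cost(REQUESTS_PER_DAY, CLASSIFIER_COST)
agent_month = monthly_cost(REQUESTS_PER_DAY, AGENT_COST)

print(f"classifier: ${classifier_month:,.0f} per month")
print(f"agent:      ${agent_month:,.0f} per month")
print(f"difference: ${agent_month - classifier_month:,.0f} per month")
```

With these made-up prices, the gap is a few thousand dollars a month. Swap in your real costs before drawing any conclusion.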
Speed matters just as much. Users notice a pause. They notice it even more when the system stops to think, picks a tool, asks another question, and only then makes a basic routing choice. Measure response time on real traffic, not on a neat demo set. A model that routes most requests in 300 milliseconds is often more useful than an agent that takes 2 or 3 seconds to reach the same answer.
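To measure response time on real traffic, record how long each request takes and look at percentiles rather than the average, since a few slow requests are what users remember. This is a minimal sketch. `predict` stands in for whatever function calls your model, and the sample latencies are invented.

```python
import statistics
import time


def measure_ms(predict, text):
    """Time one call to predict(text) and return the elapsed milliseconds."""
    start = time.perf_counter()
    predict(text)
    return (time.perf_counter() - start) * 1000.0


def latency_summary(latencies_ms):
    """Median, 95th percentile, and worst case for a list of latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies_ms)}


# Invented sample: mostly fast requests with a slow tail.
sample = [280, 290, 300, 310, 295, 305, 2100, 285, 315, 2900]
print(latency_summary(sample))
```

Run the same summary for the classifier and for the agent on the same messages, so you compare like with like.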
Fixed labels also keep the problem honest. With a small decision set, you can write examples around the boundary and run cheap evals. You can ask simple questions: did "refund request" go to billing, and did "app crashes on login" go to bug? That is much easier than grading a long chain of hidden choices inside an agent.
Save tool calling for actions, not triage. Use tools when the system needs to do something real, like create a ticket, search an order, or update a record. Let the classifier decide where the request belongs first. Then let the right workflow take over.
That split holds up well in production. Teams that start by fine-tuning a small classifier usually learn faster, spend less, and avoid a lot of messy debugging later.
Choose labels that stay clear
A classifier gets shaky when labels look tidy on paper but blur together in real use. If two people read the same message and choose different labels, the model will learn that confusion too.
Use words your team already says out loud. "Billing issue" beats "commercial inquiry." "Bug report" beats "technical product concern." Plain labels speed up annotation and make mistakes easier to spot later.
Overlap is the usual trap. Teams often create pairs like "sales question" and "pricing question," or "bug" and "broken workflow." If both labels send the message to the same person or trigger the same next step, merge them. More labels do not make the system smarter. They usually spread your training data too thin and make the boundary worse.
An early "other" bucket saves time. Without it, annotators force odd samples into the closest label, and that pollutes the dataset. "Other" gives the model a safe place for edge cases, mixed intent, and messages that do not deserve automation yet. You can split that bucket later if it starts filling up with one clear pattern.
Keep each sample tied to one label unless you truly need more than one output. One label per sample is easier to review, easier to test, and easier to route. A message like "I need an invoice and the app crashes on login" contains two issues, but a routing model still needs one next action. Pick the label that matches the first owner or first queue.
A quick test helps before you label thousands of examples. A new teammate should be able to label 20 samples with almost no questions. Each label should map to one action. Edge cases should fit in "other" instead of distorting a real class. And after reading real messages, you should not feel an urge to rename half the labels.
This matters even more when you are fine-tuning a small classifier. Clear labels let you compare versions quickly and see exactly where the boundary slips.
Gather examples that teach the boundary
Your model learns the line between labels from examples, not from the label names. If the examples are clean, narrow, and real, training gets much easier.
Start with messages your team already handles. Tickets, chat logs, email threads, support forms, sales intake forms, and internal triage notes usually give you better training data than examples someone writes from scratch.
Made-up samples often sound too neat. Real messages are messy, short, emotional, vague, and full of mixed intent. That mess is exactly what the model needs to see.
Collect examples from the place where the decision already happens. If a support team routes inbound requests, use that queue. If a founder reviews lead forms by hand, use those forms. Do not train on polished summaries if the model will read raw text.
Keep the label counts close enough that you can compare results honestly. They do not need to match perfectly, but they should not be wildly uneven. If one label has 500 examples and another has 25, weak results may come from thin data rather than a bad label design.
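A quick count per label shows whether the dataset is wildly uneven. The sketch below flags any label that is much smaller than the largest one. The 5x cutoff is an arbitrary assumption, so pick a ratio that fits your data.

```python
from collections import Counter


def thin_labels(labels, max_ratio=5.0):
    """Return labels whose count is more than max_ratio times smaller than the largest."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: n for label, n in counts.items() if largest / n > max_ratio}


labels = ["billing issue"] * 500 + ["bug report"] * 200 + ["feature request"] * 25
print(thin_labels(labels))
```

A flagged label is a prompt to collect more examples before you judge the label design.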
You also need edge cases on purpose. Include messages that mention two intents at once, short notes with missing context, messages that use the wrong words for the right problem, rare but costly cases, and near duplicates that belong to different labels.
Picture a small SaaS inbox with labels like "billing issue," "bug report," and "feature request." A message like "I got charged twice and the invoice page crashes" is more useful than ten easy bug reports because it forces you to decide what the primary route should be.
Clean the data before training. Remove names, email addresses, phone numbers, account numbers, order IDs, and anything else that can identify a person or company. If private details affect the label, replace them with simple placeholders so the meaning stays intact.
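Regular expressions can replace the most mechanical identifiers with placeholders. The patterns below are deliberately rough illustrations, not a complete scrubber: they will miss names, company names, and unusual formats, so a manual pass or a dedicated tool is still needed. The placeholder strings are assumptions.

```python
import re

# Order matters: emails first, then phone-like digit runs, then long numeric IDs.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
    (re.compile(r"#?\b\d{5,}\b"), "<ID>"),
]


def scrub(text: str) -> str:
    """Replace emails, phone-like numbers, and long numeric IDs with placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(scrub("Email jane.doe@example.com or call +1 (555) 010-9999 about order #123456"))
```

The placeholders keep the meaning intact: the model still sees that the message mentions an order, without seeing which one.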
Do one last check by reading random samples from each label. If humans disagree on where a message belongs, the model will disagree too. Fix that boundary before you add more data.
Train and check it step by step
A classifier earns trust when it beats a dumb rule on data it has never seen. Split your labeled examples into three parts: train, test, and holdout. The train set teaches the model. The test set helps you compare runs while you tune. The holdout set stays untouched until the end, so you get one honest final check.
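One way to make the three-way split is to shuffle within each label, so every part keeps roughly the same label mix. The sketch below uses only the standard library. The 70/15/15 proportions and the seed are assumptions, not a recommendation from any particular source.

```python
import random
from collections import defaultdict


def split_by_label(examples, train=0.7, test=0.15, seed=13):
    """Split (text, label) pairs into train, test, and holdout lists, label by label."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_set, test_set, holdout_set = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_test = round(len(items) * test)
        train_set.extend(items[:n_train])
        test_set.extend(items[n_train:n_train + n_test])
        holdout_set.extend(items[n_train + n_test:])  # whatever is left
    return train_set, test_set, holdout_set


examples = [(f"billing message {i}", "billing") for i in range(20)]
examples += [(f"bug message {i}", "bug") for i in range(20)]
train_set, test_set, holdout_set = split_by_label(examples)
print(len(train_set), len(test_set), len(holdout_set))
```

Save the holdout list to its own file and do not open it until the final check.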
Before you try fine-tuning a small classifier, build a baseline that feels almost too basic. A keyword rule, a lookup against similar past examples, or even "always pick the most common label" is enough. If your model cannot beat that baseline by a clear margin, stop and inspect the data. Most of the time, the labels are muddy or the examples do not show the boundary well.
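Two of those baselines take only a few lines. This is a minimal sketch: the keyword lists are invented for illustration, and plain accuracy is used only to get a first number to beat.

```python
from collections import Counter

# Hypothetical keyword lists; replace them with words from your own messages.
KEYWORDS = {
    "billing": ("invoice", "refund", "charged"),
    "bug report": ("crash", "error", "broken"),
}


def majority_baseline(train_labels):
    """Return a predictor that always answers with the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda text: most_common


def keyword_baseline(text, default="other"):
    """Return the first label whose keywords appear in the text, else the default."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return label
    return default


def accuracy(predict, examples):
    """Share of (text, label) examples the predictor gets right."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


test_examples = [("I got charged twice", "billing"), ("App crashes on login", "bug report")]
print(accuracy(keyword_baseline, test_examples))
print(accuracy(majority_baseline(["billing", "billing", "bug report"]), test_examples))
```

Score the fine-tuned model with the same `accuracy` call on the same examples, so the comparison is fair.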
Use one small model first. That keeps the work cheap and makes mistakes easier to explain. If you change the model, the labels, and the training recipe all at once, you will not know what helped. Keep a short log for each run with the model name, data version, label set, and results.
Do not trust one average score. A routing model can look good overall and still fail on the label that matters most. Check precision, recall, and the confusion matrix for each label on its own.
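Per-label precision and recall fall out of the confusion counts directly. The sketch below uses only the standard library and assumes two parallel lists of true and predicted labels. The sample data is invented.

```python
from collections import Counter


def per_label_report(true_labels, predicted_labels):
    """Return (confusion counts, per-label precision and recall)."""
    confusion = Counter(zip(true_labels, predicted_labels))  # (true, predicted) -> count
    labels = sorted(set(true_labels) | set(predicted_labels))
    report = {}
    for label in labels:
        hits = confusion[(label, label)]
        predicted_as = sum(confusion[(true, label)] for true in labels)
        actually = sum(confusion[(label, predicted)] for predicted in labels)
        report[label] = {
            "precision": hits / predicted_as if predicted_as else 0.0,
            "recall": hits / actually if actually else 0.0,
        }
    return confusion, report


true = ["billing", "billing", "bug", "bug", "sales", "sales"]
pred = ["billing", "bug", "bug", "bug", "sales", "billing"]
confusion, report = per_label_report(true, pred)
for label, scores in report.items():
    print(label, scores)
```

Low precision on a label means other messages are leaking into it. Low recall means its own messages are leaking out.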
Then read the wrong predictions one by one. This part is slow, but it saves the most time later. You will usually find a pattern fast: two labels overlap, one label needs clearer wording, or a few training examples point in the wrong direction.
A small example makes this obvious. Say you route messages into "sales," "bug," and "account." If the model keeps sending billing complaints to "bug," the fix might be as simple as renaming "account" to "billing and account" and adding 20 better examples. That is a better move than adding tools, agents, or extra prompts too early.
A simple routing example
A small SaaS team does not need an agent to sort incoming messages. Most of the time, it needs one clean decision: where should this message go next?
Say the team gets 60 support emails and chat messages a day. Some are about failed payments, refund requests, or invoice copies. Others describe broken buttons, login errors, or missing data. A few are messy enough that nobody should trust automation on the first pass.
This is where fine-tuning a small classifier pays off. You give it a narrow set of labels, keep the rules plain, and make one decision per message. For this team, the labels might be billing, bug report, and human review.
A message like "I got charged twice and need a refund" should be labeled billing and go to finance. A note like "The export button spins forever after I choose CSV" should be labeled bug report and go to product or engineering. If someone writes, "Your app is not working and I am upset," the model may not have enough detail. That case should go straight to human review.
Low confidence matters more than clever automation. If the classifier is only 55% sure, sending the message to a human is usually cheaper than sending it to the wrong team and creating a second round of triage. One bad handoff can cost more than ten manual reviews.
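The fallback rule is a one-line comparison once the model returns a score per label. In this sketch the 0.7 floor is an assumed starting point, not a tested value, and the scores are treated as probabilities even though raw model scores are not always well calibrated.

```python
def route(probabilities, floor=0.7, fallback="human review"):
    """Pick the top label, or the fallback when the top score is below the floor."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < floor:
        return fallback
    return label


print(route({"billing": 0.55, "bug report": 0.45}))
print(route({"billing": 0.92, "bug report": 0.08}))
```

Tune the floor on your test split: raise it if wrong handoffs hurt more than manual reviews, and lower it if the review queue fills up with easy messages.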
The team should also review misses every week. Keep a short list of messages that landed in the wrong place, plus the ones that went to human review but looked easy in hindsight. Then ask simple questions: did the label set make sense, did two labels overlap, or did the examples miss a common phrasing?
After a few rounds, the classifier usually gets sharper. Billing messages often use words like "invoice," "refund," or "card declined." Bug reports often mention a screen, a step, and an error. Those patterns are enough to solve routing and tagging for a small team without adding tool calling, long prompts, or extra moving parts.
That is a better starting point than building an agent that tries to read, reason, choose a tool, and reply all at once.
Mistakes that create extra work
Most wasted time comes from trying to make the model look smart before the task is stable. A classifier does best when the decision is plain, the labels are few, and the edge cases are visible.
One common mistake is starting with too many labels. Teams often split one fuzzy idea into six categories because it feels more complete. In practice, that usually creates overlap. If two humans argue about the right label, the model will struggle too. Start with broad buckets that people can name the same way every time, then split them later if the data proves you need it.
Synthetic samples can help fill gaps, but they should not be the whole dataset. Made-up examples are often too clean. Real user messages are messy, short, sarcastic, incomplete, and full of mixed intent. A model trained only on polished synthetic text may score well in testing and then fail on day one.
Another costly mistake is forcing every input into a label. That hides uncertainty instead of managing it. If confidence is low, send the item to review, ask a follow-up question, or place it in an "unclear" bucket. One extra step is cheaper than silent misrouting.
Jumping to agent flows too early also burns time. Tool calling, memory, and multi-step plans can look impressive, but they add new failure points before you have solved the basic decision. If the real job is "tag this request" or "pick the next queue," an agent is often extra machinery. A small classifier plus cheap evals is easier to debug and usually faster to improve.
The last mistake shows up after launch. Labels drift. Support teams rename categories. Users change how they ask for help. New product lines appear. A classifier that worked last month can slowly get worse without any obvious crash.
A simple habit helps: review mislabeled items every week, track how often people override the model, keep a small set of fresh real examples for retesting, and merge labels that stay confused. Teams that do this usually fix routing problems early, before they grow into complicated systems nobody wants to maintain.
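Tracking overrides needs nothing more than a log of what the model predicted and what a person finally chose. The sketch below assumes that log is a list of (predicted label, final label) pairs. The log format is an assumption for illustration.

```python
from collections import Counter


def override_rates(log):
    """For each predicted label, the share of predictions that people changed."""
    total = Counter()
    overridden = Counter()
    for predicted, final in log:
        total[predicted] += 1
        if predicted != final:
            overridden[predicted] += 1
    return {label: overridden[label] / total[label] for label in total}


week_log = [
    ("billing", "billing"),
    ("billing", "sales"),
    ("bug", "bug"),
    ("bug", "bug"),
]
print(override_rates(week_log))
```

A label whose override rate climbs week after week is a candidate for merging, renaming, or fresh examples.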
Checks before you add tools
A small classifier should pass a few boring tests before you let an agent call tools, change records, or send messages. If it cannot sort inputs cleanly, extra actions only make mistakes harder to spot and more expensive to fix.
Start with the labels themselves. Put every label on one page with a plain-language definition and one short example. If that page feels crowded, or two labels sound almost the same, the model will struggle too.
Then test the labels with people, not just the model. Give the same 20 to 30 samples to two people and ask them to tag them on their own. If they often disagree, the problem is not the model yet. The boundary is still fuzzy.
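To put a number on that disagreement, you can compute the raw agreement rate and, as one common extra, Cohen's kappa, which discounts agreement that would happen by chance. Kappa is a suggestion here rather than a requirement of the check itself. The sample tags are invented.

```python
from collections import Counter


def agreement_rate(tags_a, tags_b):
    """Share of samples where both people chose the same label."""
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)


def cohens_kappa(tags_a, tags_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(tags_a)
    observed = agreement_rate(tags_a, tags_b)
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:  # both people used one identical label for everything
        return 1.0
    return (observed - expected) / (1 - expected)


first = ["billing", "billing", "bug", "bug"]
second = ["billing", "bug", "bug", "bug"]
print(agreement_rate(first, second), cohens_kappa(first, second))
```

Whatever number you use, read the samples where the two people split. Those are the boundaries to rewrite.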
Before you add tool access, make sure five things hold: you can explain every label in one sentence; two people tag the same sample the same way most of the time; the model beats a simple rule set on a small test split; a wrong label causes less harm than a wrong action; and you can review misses in under 30 minutes each week.
That third point matters more than it sounds. If a few keyword rules match the model, keep the rules for now. A classifier should earn its place by catching cases the rules miss or by cutting down manual review.
The harm check matters even more. Sending a support ticket to the wrong queue is annoying. Refunding a customer, deleting data, or emailing the wrong person is a different class of mistake. If the cost of a bad action is high, keep the classifier as a routing layer and let a person approve the action.
Use a small recurring review. Pull the misses once a week, scan them, and ask one question: did the model fail, or did the labels fail? If you can do that in 20 to 30 minutes, you have a system you can actually maintain.
Next steps without overbuilding
Once the classifier makes clean decisions in testing, resist the urge to wire it into every workflow at once. Put it in front of one real queue, watch what happens for a week or two, and keep the scope small. A support inbox, lead form, or bug intake channel is enough to learn a lot.
Most teams add tools too early. If a fine-tuned small classifier already sorts requests into the right buckets, let it do that one job first. You do not need tool calling, memory, and action chains to decide whether a message is billing, product feedback, or a bug report.
A small rollout works best when you watch two things closely: confidence and volume. Low confidence shows where labels overlap or examples are thin. A sudden jump in one label tells you that traffic changed and your training set may no longer match real life.
Keep the first version simple. Send one queue through the classifier and log every prediction. Alert the team when confidence drops below your chosen floor. Watch for one label suddenly rising far above its normal share. Review misses every week and tag the new pattern, not just the old label.
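The volume check can be a comparison between each label's recent share and its normal share. In the sketch below, the baseline shares, the 2x ratio, and the minimum count are all assumptions to adjust for your traffic.

```python
from collections import Counter


def share_alerts(baseline_share, recent_labels, max_ratio=2.0, min_count=10):
    """Return labels whose recent share is far above their normal share."""
    counts = Counter(recent_labels)
    total = len(recent_labels)
    alerts = {}
    for label, count in counts.items():
        share = count / total
        normal = baseline_share.get(label, 0.0)
        # A label with no baseline at all is also worth a look.
        if count >= min_count and (normal == 0.0 or share / normal > max_ratio):
            alerts[label] = share
    return alerts


baseline = {"billing": 0.5, "bug": 0.4, "other": 0.1}
recent = ["billing"] * 40 + ["bug"] * 30 + ["other"] * 30
print(share_alerts(baseline, recent))
```

An alert does not say what changed. It only tells you which predictions to read first in the weekly review.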
Labels are not fixed forever. New customer phrasing shows up, product changes create new categories, and edge cases stop being edge cases. When you see the same confusion more than a few times, update the labels or split one crowded label into two.
Only add tools after routing stays steady. A good rule is boring consistency: the classifier should hold up through normal traffic, weird phrasing, and a few busy days before it triggers actions in other systems. If it still wobbles on simple tagging, tool calling will only hide the problem under more moving parts.
If you want a second opinion before the build gets too big, Oleg Sotnikov at oleg.is works with startups and small teams as a Fractional CTO and startup advisor. He often helps teams shape practical AI workflows, evaluate tradeoffs, and keep automation lean instead of turning a simple routing problem into an expensive agent project.
Frequently Asked Questions
What should I automate first with a classifier?
Start with one routing decision that your team already makes every day. A good first task sounds like, "Should this go to billing, bug report, or sales?" Keep version one tied to one inbox, form, or queue so you can see what works and what fails.
When does a small classifier make more sense than an agent?
Use a classifier when the job is just to sort inputs into a few clear buckets. It usually runs faster, costs less, and gives you cleaner mistakes to review. Save an agent for cases where the system must actually do something, like create a ticket or update a record.
How many labels should I start with?
Begin with as few labels as you can. Three to five labels often work well for a first version. If two labels send work to the same person or trigger the same next step, merge them instead of keeping both.
Should I include an "other" or "human review" label?
Yes, in most cases you should. An "other" or "human review" label gives the model a safe place for edge cases, mixed intent, and unclear messages. That keeps odd samples from polluting your real classes.
How many examples do I need before I test this?
For a cheap first check, label 50 to 100 real examples by hand and compare the model's guesses. That small set often shows repeated mistakes fast. For training, collect more once you know the labels actually make sense.
Where should my training examples come from?
Pull examples from the exact place where the decision already happens. Support inboxes, chat logs, forms, and triage notes usually teach the model more than polished examples someone writes from scratch. Real messages carry the messy wording the model will face later.
What if one message seems to fit two labels?
Pick the label that matches the first owner or first queue. Even if a message mentions two problems, your routing system still needs one next step. If nobody can agree on that step, send it to human review instead of forcing a shaky label.
How do I know if the classifier is actually good enough?
Check it against a dumb baseline first, like simple keyword rules or the most common label. Then look at per-label results, not just one average score, and read the wrong predictions one by one. If the model beats the baseline and the mistakes look fixable, you have something useful.
When should I let the system call tools or take actions?
Wait until routing stays steady before you add tool access. If the classifier still mixes up labels, tool calls will hide the problem and make mistakes more expensive. Let the model pick the lane first, then let a person or a separate workflow handle the action.
How do I maintain the classifier after launch?
Review misses every week and watch for label drift. Users change how they write, teams rename categories, and new products create new patterns. If people keep overriding the same label or one bucket starts growing fast, update the labels and add fresh examples.