Oct 26, 2025·7 min read

Invoice coding with AI before you change your ERP system

Test invoice coding with AI on line items, set confidence thresholds, and prove value before you open a risky ERP replacement project.

Why invoice coding becomes manual work

Finance teams do not lose time because invoices are hard to read. They lose time because each invoice creates a string of accounting decisions. One bill might need to be split across departments, projects, tax treatments, cost centers, and account codes. That takes judgment, not simple data entry.

The drag shows up in small decisions that repeat all day. Someone has to decide whether freight should use the same code as materials, whether one line belongs to two cost centers, whether a vague supplier description matches the usual account code, or whether this month's invoice is close enough to last month's to follow the same pattern.

Even invoices from the same supplier still need checks. The layout may be familiar, but the details often shift. A service period changes. A credit appears. Two charges get bundled together. A manager wants half the cost moved to another team. The invoice looks routine, yet one small difference can make the old coding wrong.

That is how the work piles up. AP codes the invoice, a manager sends it back, and finance touches the same document again near month end. Delays spread into approvals, accruals, and reporting. Then people rush. They fix issues one by one, and finance cleans up the mistakes later.

The source of the problem is often narrower than it looks. Teams sometimes blame the ERP because that is where coding happens. But many issues start earlier. The chart of accounts may be too broad. Supplier descriptions may be inconsistent. Some rules may exist only in one person's head.

That distinction matters. If the real pain sits in line-item classification and coding rules, finance can test AI on that part first instead of jumping straight into a full ERP change. A small pilot can show whether the team saves time on the hardest work without opening a long replacement project.

What AI should do first

Start with a narrow job. The first use of AI here is not to redesign the whole finance process. It is to read each invoice line and suggest the same coding choices the team already makes by hand.

For most teams, that means three fields: account code, cost center, and tax treatment. If the tool can suggest those well on repeat invoices, finance gets a clean test. You can measure time saved, error rates, and how often a person still needs to step in.

A full redesign sounds tidy, but it slows everything down. New approval rules, new payment logic, and ERP changes bring in more people, more risk, and many more meetings. For a first pilot, leave approvals and payments alone. Let AI handle the narrow part that creates the most repetitive work.

Repeat invoices from known vendors are the best place to begin. They follow patterns. The wording may change a little, but the same supplier often bills the same services, uses the same tax setup, and lands in the same cost center month after month.

A sensible first scope is simple:

  • classify line items from a short list of known suppliers
  • suggest account, cost center, and tax code
  • show a confidence score for each suggestion
  • send low-confidence items to a person for review

That gives finance a safe test bed. The ERP stays in place. The approval chain stays in place. The payment run stays in place. You learn whether the model can handle the repetitive part before you let it touch anything sensitive.
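As a rough illustration of that scope, a suggestion for one invoice line could be held in a small record like the one below. This is a minimal sketch, not a prescribed design: the supplier names, the codes, and the `LineSuggestion` and `in_pilot_scope` names are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical supplier names; replace with the short list chosen for the pilot.
PILOT_SUPPLIERS = {"Acme Telecom", "CleanCo", "SoftLicense Ltd"}


@dataclass(frozen=True)
class LineSuggestion:
    supplier: str
    description: str
    account_code: str
    cost_center: str
    tax_code: str
    confidence: float  # 0.0 to 1.0, as reported by the model


def in_pilot_scope(suggestion: LineSuggestion) -> bool:
    """Only lines from the agreed supplier list enter the pilot."""
    return suggestion.supplier in PILOT_SUPPLIERS


# Illustrative values only; the codes are made up.
line = LineSuggestion(
    "Acme Telecom", "Mobile plan, 12 users", "6100", "CC-IT", "VAT-STD", 0.97
)
print(in_pilot_scope(line))  # True
```

Keeping the record this small mirrors the pilot itself: three coded fields, one score, and nothing about approvals or payments.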

Take a telecom supplier as an example. Monthly invoices may include mobile plans, roaming, and hardware charges. If the model can split those lines and code them the way your team usually would, that already saves real time.

This is usually how practical automation projects work. Prove accuracy on a narrow task first. Expand only after finance trusts the output.

Choose a pilot finance can actually manage

Start with invoices that already look similar from month to month. When finance tests AI coding, boring data is better than messy reality. The goal is to prove that the model can sort line items well enough to save time, not to win every edge case on day one.

A good first batch usually comes from two to five vendors. Pick suppliers who send regular invoices, use stable layouts, and bill for the same kinds of goods or services. If one vendor changes wording every month and another sends scanned PDFs with handwritten notes, leave them out for now.

History matters as much as format. Choose vendors where you already have a decent record of past coding decisions. That gives the team something solid to compare against, and it makes it easier to see where the model helps and where it guesses.

One business unit is enough for the first test. If you mix several entities, tax rules, approval paths, and account codes can shift under your feet. A narrow scope keeps the review simple and cuts down on arguments about whether an error came from the model or from different local rules.

A typical pilot might include three vendors with steady monthly invoices, six to twelve months of coded history, one finance team reviewing the results, and one business unit using the same coding rules.

Skip the odd cases in round one. Credit notes, split invoices, one-off project charges, and long mixed-department bills can wait. Those cases matter later, but they can hide whether line-item classification works at all.

Picture a simple test. One office gets monthly software, telecom, and cleaning invoices from the same three suppliers. Finance has already coded the last nine months by hand. That is a clean sample. The team can compare AI suggestions with past entries, measure review time, and decide whether the result is good enough to expand.

If the first group works, add complexity in layers. Bring in one more vendor type, then another business unit. That keeps the test honest and gives finance a result it can trust.

Set confidence thresholds before rollout

A pilot falls apart when every line gets the same treatment. In practice, the rules for acting on confidence scores often matter more than the model itself.

Start with two thresholds. At or above the higher one, a line is safe enough to move forward with the suggested code. Between the two, the line goes to a reviewer. Anything below the lower mark stays manual until the team has more confidence in the results.

A simple starting point looks like this:

  • 0.95 and above: accept and move forward
  • 0.75 up to 0.95: send for human review
  • below 0.75: keep manual coding

Treat those numbers as a draft, not a law. If one supplier sends the same clean invoice every month, finance may lower the auto-accept score a little. If tax treatment changes often or the descriptions are messy, keep the bar high. Stable categories deserve more automation than unusual spend.
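The three bands, plus a looser bar for one stable supplier, might be sketched as follows. The 0.95 and 0.75 values are the draft numbers from the list above. The `route` function, the supplier name, and the 0.92 override are assumptions made up for illustration.

```python
from typing import Optional

AUTO_ACCEPT = 0.95   # draft value from the list above
REVIEW_FLOOR = 0.75  # draft value from the list above

# Hypothetical per-supplier override for a vendor with very stable invoices.
SUPPLIER_AUTO_ACCEPT = {"CleanCo": 0.92}


def route(confidence: float, supplier: Optional[str] = None) -> str:
    """Return 'accept', 'review', or 'manual' for one suggested line."""
    accept_at = SUPPLIER_AUTO_ACCEPT.get(supplier, AUTO_ACCEPT)
    if confidence >= accept_at:
        return "accept"
    if confidence >= REVIEW_FLOOR:
        return "review"
    return "manual"


print(route(0.96))             # accept
print(route(0.94))             # review
print(route(0.94, "CleanCo"))  # accept
print(route(0.61))             # manual
```

Keeping the thresholds in one place makes it easy to tighten or loosen them per supplier without touching the rest of the workflow.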

Someone also needs to own the exceptions before the first invoice enters the pilot. If nobody owns them, they sit in a queue and the pilot looks slower than the old process.

Many teams do best with a simple split. AP checks vendor details and duplicate risks. The budget owner handles unclear spend. Procurement fixes supplier item mapping when descriptions drift. The controller takes tax or policy edge cases.

Set a response target too. One business day is a good start for normal exceptions. Urgent supplier invoices may need same-day review.

Track override reasons from day one. Do not rely on a free-text comment box alone. Use a short list such as wrong GL code, wrong cost center, split line needed, tax issue, or unclear supplier description. Free text can sit beside it for extra detail.

That small step gives finance two useful things. It makes review work cleaner now, and it gives you evidence later. If most overrides come from bad supplier text, the model is not the main problem. If errors cluster around one account code, the threshold or mapping needs work.
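One way to keep override reasons countable is a fixed list with an optional note beside it. The sketch below uses the reasons named above. The log entries are made up, and `OverrideReason` is a hypothetical name.

```python
from collections import Counter
from enum import Enum


class OverrideReason(Enum):
    WRONG_GL_CODE = "wrong GL code"
    WRONG_COST_CENTER = "wrong cost center"
    SPLIT_LINE_NEEDED = "split line needed"
    TAX_ISSUE = "tax issue"
    UNCLEAR_DESCRIPTION = "unclear supplier description"


# Made-up override log: (reason, optional free-text note)
overrides = [
    (OverrideReason.UNCLEAR_DESCRIPTION, "line says only 'services'"),
    (OverrideReason.UNCLEAR_DESCRIPTION, ""),
    (OverrideReason.TAX_ISSUE, "tax code did not match the supplier setup"),
    (OverrideReason.WRONG_GL_CODE, ""),
]

# Tally reasons so the most common cause of overrides is visible at a glance.
counts = Counter(reason for reason, _note in overrides)
top_reason, top_count = counts.most_common(1)[0]
print(top_reason.value, top_count)  # unclear supplier description 2
```

In this made-up log, the top reason points at supplier text rather than the model, which is the kind of evidence the fixed list is meant to produce.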

Run the pilot step by step

A pilot works best with real invoices, real coding history, and a small finance review group. Do not start inside the ERP. Start outside it, where people can test line classification without touching posting rules or supplier master data.

Begin by pulling a few months of paid invoices along with the final account, cost center, or project code that finance approved. Keep the original line text, supplier name, amount, tax details, and any reviewer notes.

Then clean the obvious data problems. Remove duplicate rows, merge split exports, and correct labels everyone knows are wrong. If the history says the same purchase belongs to two different accounts because of old mistakes, the model will learn that confusion.
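A minimal sketch of that cleanup, assuming the history has been exported as simple rows, is shown below. The supplier, descriptions, and account codes are invented. The check drops exact duplicates and flags any line that history has coded to more than one account.

```python
from collections import defaultdict

# Made-up history rows: (supplier, line description, posted account code)
history = [
    ("OfficeMart", "A4 copy paper", "6400"),
    ("OfficeMart", "A4 copy paper", "6400"),    # exact duplicate row
    ("OfficeMart", "Toner cartridge", "6400"),
    ("OfficeMart", "Toner cartridge", "6450"),  # same purchase, different account
]

# Drop exact duplicates while keeping the original order.
deduped = list(dict.fromkeys(history))

# Group posted accounts by (supplier, normalized description).
accounts_by_line = defaultdict(set)
for supplier, description, account in deduped:
    accounts_by_line[(supplier, description.strip().lower())].add(account)

# Lines coded to more than one account need a human decision before training.
conflicts = {
    key: sorted(codes) for key, codes in accounts_by_line.items() if len(codes) > 1
}
print(len(deduped))  # 3
print(conflicts)     # {('OfficeMart', 'toner cartridge'): ['6400', '6450']}
```

The conflict list is a to-do list for finance, not something the script should resolve on its own: a person decides which account was right.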

Next, train or configure the model on your own finance data. Off-the-shelf settings rarely match your chart of accounts. Keep it simple at first: predict the most likely code for each line and return a confidence score.

Hold back one batch of invoices and do not show it to the model during setup. That gives you a fair test. Compare the suggestions with the codes finance actually posted, then measure where the model wins cleanly and where it struggles.
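The comparison on the held-back batch can be as simple as accuracy per confidence band. The sketch below uses made-up rows and assumes band edges of 0.95 and 0.75. `accuracy_by_band` is a hypothetical helper, not part of any tool.

```python
# Made-up holdout rows: (suggested code, posted code, model confidence)
holdout = [
    ("6400", "6400", 0.98),
    ("1500", "1500", 0.96),
    ("6400", "6400", 0.88),
    ("6400", "6450", 0.81),
    ("6100", "7200", 0.55),
]


def band(confidence: float) -> str:
    """Assign a confidence band using assumed edges of 0.95 and 0.75."""
    if confidence >= 0.95:
        return "high"
    if confidence >= 0.75:
        return "mid"
    return "low"


def accuracy_by_band(rows):
    """Share of suggestions that match the posted code, per band."""
    totals, hits = {}, {}
    for suggested, posted, confidence in rows:
        b = band(confidence)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + (suggested == posted)
    return {b: hits[b] / totals[b] for b in totals}


print(accuracy_by_band(holdout))  # {'high': 1.0, 'mid': 0.5, 'low': 0.0}
```

If accuracy in the high band is not clearly better than in the mid band, the scores are not yet a useful signal for routing.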

When the model is unsure, send those lines to reviewers instead of forcing an answer. Ask reviewers to correct the code and record why they changed it. Those changes become the next round of training data.

Keep the workflow boring on purpose. A spreadsheet export, a review screen, and a simple change log are enough for a first pass. Finance teams learn more from 200 messy real lines than from a polished demo with fake data.

Aim for faster review on easy invoices, not full automation. If the model codes common supplier lines correctly and flags uncertain cases, you already have proof that AI can reduce manual work before anyone opens a full ERP change project.

A simple example with one supplier

Pick a supplier that sends the same type of invoice every month. Office and hardware bills are a good starting point because the line items repeat. You might see printer paper, toner, monitors, keyboards, cables, and laptop stands again and again.

Say finance receives four invoices in one month from that supplier, with 120 total line items. Most of them look familiar. The model reads "A4 copy paper, 5 boxes" and codes it to office supplies with 0.98 confidence. It reads "24-inch monitor" and codes it to computer equipment with 0.96. It reads "wireless keyboard" and lands at 0.94, just under a 0.95 auto-accept mark, so that line goes to quick review.

Those common lines should move into a draft batch with very little friction. Finance still keeps control, but nobody needs to type the same decision 40 times. A quick review replaces line-by-line work.

Then an unusual charge appears: "On-site assembly service for 12 sit-stand desks." The supplier usually sells goods, not labor, so the model hesitates. Confidence drops to 0.61, which is below the 0.75 floor for quick review, so the line goes to full manual coding. A person picks the right code and moves on.

That exception matters. It shows that the model does not need to be perfect to save time. It only needs to handle routine work well and stop when it is unsure.

After one month, finance can measure the result with plain numbers:

  • 120 total line items received
  • 93 coded above the high-confidence threshold
  • 19 sent to a quick review queue
  • 8 routed to full manual coding
  • about 2.5 hours saved across the month

The math is straightforward. If manual coding takes about 90 seconds per line, 120 lines cost roughly three hours. If the team only reviews the low-confidence items and batch-checks the rest, that same month might take 20 to 30 minutes instead. That gives finance a clear test without changing the ERP first.
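The arithmetic can be checked in a few lines. The 120 lines, 90 seconds per line, and 20 to 30 assisted minutes come from the example above. Nothing else is assumed.

```python
LINES = 120
MANUAL_SECONDS_PER_LINE = 90

# Manual baseline: every line coded by hand.
manual_hours = LINES * MANUAL_SECONDS_PER_LINE / 3600
print(manual_hours)  # 3.0

# Assisted time from the example: 20 to 30 minutes for the same month.
for assisted_minutes in (20, 30):
    saved_hours = manual_hours - assisted_minutes / 60
    print(assisted_minutes, round(saved_hours, 2))
# 20 2.67
# 30 2.5
```

The saving is the baseline minus the assisted time, so it lands a little under the full three hours.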

Mistakes that spoil the test

Most pilot failures come from test design, not the model. Finance teams often feed the pilot messy inputs, weak review rules, and success measures that miss the real work.

One common mistake is mixing too many vendors in the first round. Every supplier labels products differently, lays out tables differently, and handles tax in its own way. Start with a handful of vendors, two to five at most, whose invoices look the same month after month. That gives the model a fair test.

Another problem is treating the confidence score as approval. A model can report 0.94 confidence and still miss the right account, tax code, or cost center. Set review rules before anyone sees results. For example, finance might auto-accept only very high scores, send mid-range scores to review, and hold low scores for manual entry. Without that structure, people end up arguing about single invoices instead of checking whether the pilot saves time.

Tax coding and split allocations also trip teams up early. An invoice may land on the right expense account but still post incorrectly because VAT was misread or one line should be split across two departments. Those cases matter because they create cleanup work later. If the pilot ignores them, the reported accuracy will look better than the real outcome.

Do not judge the test on accuracy alone. A pilot can hit 90% and still feel annoying if reviewers spend too long fixing the awkward 10%. Track a few practical measures instead: review time per invoice, the share of lines reviewers change, repeat error types, posting speed after review, and the exception rate for tax and allocations.

Another bad choice is starting with invoices that change format every week. Some suppliers send clean PDFs one month, scanned copies the next, and then credit notes in a different layout. Leave those for later. Use stable documents first, then add messy ones after the team trusts the process.

A small, boring sample is better than a broad, chaotic one. If the pilot works under clean conditions, expand it. If it fails there, a bigger test will only hide the problem.

Check the basics before you expand

A pilot earns the next budget only if the results are easy to defend. Before you add more suppliers or business units, check whether the pilot reduced work in a way finance can see every day, not only in a demo.

Use a short scorecard and review it with the people who approve, correct, and post invoices. If AP staff still spend the same amount of time cleaning up line items, the pilot is not ready for a wider rollout.

Look at review time at the invoice-line level. If a reviewer needed 90 seconds per line before and now needs 25 seconds, that is a real gain. If the time only drops on simple invoices, note that too.

Check override rates by vendor and spend category. One supplier may format invoices cleanly while another uses vague descriptions that push people to change the suggested code every time.
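A minimal sketch of that breakdown, assuming the review log records whether the reviewer changed the suggested code, might look like this. The vendors, categories, and `override_rate` helper are invented for illustration.

```python
from collections import defaultdict

# Made-up review log: (vendor, spend category, reviewer changed the code?)
review_log = [
    ("OfficeMart", "office supplies", False),
    ("OfficeMart", "office supplies", False),
    ("OfficeMart", "computer equipment", True),
    ("FastFreight", "freight", True),
    ("FastFreight", "freight", True),
    ("FastFreight", "freight", False),
]


def override_rate(log, key_index):
    """Share of reviewed lines that were changed, grouped by one column."""
    totals = defaultdict(int)
    changed = defaultdict(int)
    for row in log:
        key = row[key_index]
        totals[key] += 1
        changed[key] += row[2]  # True counts as 1, False as 0
    return {key: round(changed[key] / totals[key], 2) for key in totals}


by_vendor = override_rate(review_log, 0)
by_category = override_rate(review_log, 1)
print(by_vendor)  # {'OfficeMart': 0.33, 'FastFreight': 0.67}
print(by_category)
```

Grouping the same log two ways helps separate a supplier problem (vague descriptions) from a category problem (a mapping that needs work).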

Ask finance whether they trust the suggestions enough to move faster. Trust shows up in behavior. Reviewers stop double-checking every line when the system earns credibility.

Compare month-end close speed for the pilot group with a similar group outside the pilot. Even a small drop in late reclasses or last-minute AP cleanup is worth noticing.

Test whether your ERP accepts the coded output without awkward fixes. The import should map cleanly to fields, tax treatment, dimensions, and approval flows. If staff still need to copy and paste results by hand, expansion will stall.

A good pilot does not need perfect accuracy on day one. It needs stable rules, clear confidence thresholds, and a workflow that fits the ERP you already have.

One warning sign is uneven performance that nobody can explain. If office supplies work well but freight or contractor invoices keep going off track, pause and tighten the model, labels, or vendor rules before you expand. That is cheaper than cleaning up a messy rollout later.

What to do next

Put the pilot on paper before anyone asks for a bigger budget. One page is enough. It should say which vendors are in scope, which invoice fields matter, what counts as a correct classification, and who reviews exceptions. If the team cannot explain the test in a few plain sentences, the pilot is still too wide.

Keep four things fixed in the first round: the scope, the success target, the review rule, and the stop rule. In practice, that means naming the suppliers in scope, setting a target such as 85% correct coding above an agreed confidence threshold, deciding who checks low-confidence items and how fast, and agreeing on when to pause and fix errors instead of adding more volume.
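The success target and stop rule can be written down as a small check so nobody argues about them later. The sketch below uses the 85% example target from above. The 0.95 threshold, the sample rows, and `pilot_status` are assumptions for illustration.

```python
SUCCESS_TARGET = 0.85        # example target from the text
CONFIDENCE_THRESHOLD = 0.95  # assumed agreed threshold


def pilot_status(rows, target=SUCCESS_TARGET, threshold=CONFIDENCE_THRESHOLD):
    """rows: iterable of (confidence, was_correct). Returns (accuracy, decision)."""
    above = [correct for confidence, correct in rows if confidence >= threshold]
    if not above:
        return None, "pause"  # nothing cleared the threshold, so nothing to expand on
    accuracy = sum(above) / len(above)
    return accuracy, ("continue" if accuracy >= target else "pause")


# Made-up week of results: (confidence, was the suggested code correct?)
week = [(0.98, True), (0.97, True), (0.96, False), (0.99, True), (0.80, False)]
print(pilot_status(week))  # (0.75, 'pause')
```

Only lines at or above the threshold count toward the target, which matches how the target is phrased: correct coding above an agreed confidence threshold.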

That last point matters more than most teams expect. A small win can tempt people to expand too fast. Do not add ten more vendors because the first week looked good. Add a few only if the same error types stay under control and reviewers still trust the output.

If the results are mixed, stop and inspect the misses. Look for patterns. One supplier may use vague descriptions, one cost center may be confused with another, or tax lines may break the model. Fix those issues first. The point of running a pilot before an ERP change is to prove that finance automation can save time without creating cleanup work later.

Keep the ERP core unchanged while you test. Export invoices, classify line items, send the result to a review queue, and post only approved entries back into the existing process. That gives finance real evidence without turning a focused test into a replacement project.

Some teams can scope this on their own. Others benefit from an outside view that keeps the pilot narrow and practical. Oleg Sotnikov, through oleg.is, advises companies on AI-first software and automation projects, and that kind of input can help finance teams test invoice coding and confidence thresholds without drifting into a full system rebuild too early.