Document extraction field definitions before model tuning
Loose field definitions often derail document extraction projects before model tuning starts. Clean up labels, tolerances, and exception ownership first.

Why projects stall before model tuning
Teams often blame the model first. The extractor misses a total, swaps two dates, or drops a supplier name, and the next request is usually for better prompts, more samples, or a different vendor.
That reaction is common, but the trouble usually starts earlier. It starts when nobody has clearly defined what a field means, what counts as acceptable, and who decides the odd cases that do not fit the template.
Vague labels cause more damage than most teams expect. One reviewer uses "total" for gross amount. Another uses it for amount due. A third switches between tax-inclusive and pre-tax values depending on the invoice. The model cannot infer business intent. It learns from the labels it gets, even when those labels conflict.
That is why field definitions matter more than early model tuning. If the definition sheet is loose, the training set fills up with small contradictions. You can spend weeks tuning and still get unstable results because the target keeps changing.
The warning signs show up early. Reviewers disagree on the same document. Failed cases get fixed by guesswork. Edge cases live in chat threads without a final rule. The team keeps asking for one more model update even though nobody has settled the basics.
Edge cases make the slowdown worse. Credit notes, multi-page invoices, stamps over totals, bad scans, and missing purchase order numbers all need a decision. If nobody writes those decisions down, reviewers make them on the fly. Before long, the queue contains several versions of the truth.
Ownership is often the missing piece. Finance may know what "amount due" means for reconciliation. Operations may know when to reject a scan instead of forcing extraction. Without that input, the model team ends up acting as referee for business rules it does not own.
A better start is plain process work. Define each field in one sentence. Set the allowed mismatch before anyone tunes the model. Decide who answers exception questions within a day. Once those rules are stable, prompt changes and model updates have a fair shot.
Start with clear field meanings
Good field definitions stop arguments before they reach the model. If two reviewers read the same invoice and pick different values for the same field, tuning will only teach the system mixed rules.
Write one plain sentence for every field. It should be simple enough that a new reviewer can read it once and label the document the same way as everyone else.
Each definition needs to answer four basic questions: what the field means, what counts as a match, what similar content does not count, and whether the result must be exact or can fall within a small tolerance.
Similar fields create the most waste. "Invoice date" is the date the seller issued the invoice. "Due date" is when payment is expected. A service period is something else again. Teams often call all of them "date" at first, then spend days cleaning up labels later.
The same problem shows up with names and totals. A logo may show a brand name while the legal supplier name appears in the footer. You need to decide which one belongs in "vendor name." For amounts, define whether "total" means before tax, after tax, or the final amount due after credits and discounts.
Exactness matters too. Some fields need a perfect match because one wrong character changes the meaning. Invoice number, purchase order number, and tax ID usually belong in that group. Other fields can allow a small range. A total amount may differ by $0.01 because of rounding. Weight, quantity, or unit price may need a tolerance if the source documents vary.
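One way to make the exact-versus-tolerance split concrete is to store it next to each field. The sketch below is only an illustration, not a prescribed schema. The names `FieldRule` and `values_match` and the two sample rules are assumptions made for this example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    """One row of a hypothetical definition sheet."""
    name: str
    meaning: str
    tolerance: Optional[Decimal] = None  # None means the match must be exact


def values_match(rule: FieldRule, expected: str, extracted: str) -> bool:
    """Compare a reviewer's value with the extracted value under the field's rule."""
    if rule.tolerance is None:
        # Exact fields: one wrong character changes the meaning.
        return expected.strip() == extracted.strip()
    # Numeric fields: accept differences up to the stated tolerance.
    return abs(Decimal(expected) - Decimal(extracted)) <= rule.tolerance


invoice_number = FieldRule("invoice_number", "Identifier printed by the seller")
total_amount = FieldRule(
    "total_amount", "Final amount due after all adjustments", Decimal("0.01")
)

print(values_match(invoice_number, "INV-1186", "INV-1188"))  # False
print(values_match(total_amount, "1053.75", "1053.74"))      # True
```

Keeping the tolerance on the rule, rather than in the comparison code, means a reviewer can read the allowed gap in the same place as the field's meaning.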
There is an easy test. Give five sample documents to two people and ask them to label the same fields without talking. If they disagree, the definitions are still too loose. Fix the sheet first. Model work gets easier when people already agree on what each field is.
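If the labels from that test end up in a spreadsheet or a file, a few lines of code can show which fields the two reviewers disagree on. This is a rough sketch. The function name `agreement_by_field`, the input shape, and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def agreement_by_field(labels_a, labels_b):
    """Return, per field, the share of shared documents where both reviewers agree.

    Each argument maps document id -> {field name -> labeled value}.
    """
    same = defaultdict(int)
    seen = defaultdict(int)
    for doc_id in labels_a.keys() & labels_b.keys():
        fields = labels_a[doc_id].keys() | labels_b[doc_id].keys()
        for field in fields:
            seen[field] += 1
            # A field labeled by only one reviewer counts as a disagreement.
            if labels_a[doc_id].get(field) == labels_b[doc_id].get(field):
                same[field] += 1
    return {field: same[field] / seen[field] for field in seen}


reviewer_a = {
    "doc1": {"total": "1053.75", "invoice_date": "2025-04-03"},
    "doc2": {"total": "980.00", "invoice_date": "2025-04-10"},
}
reviewer_b = {
    "doc1": {"total": "1053.75", "invoice_date": "2025-04-03"},
    "doc2": {"total": "1053.75", "invoice_date": "2025-04-10"},
}

result = agreement_by_field(reviewer_a, reviewer_b)
print(result["total"])         # 0.5
print(result["invoice_date"])  # 1.0
```

A low score on one field points at the definition for that field, not at the reviewers.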
Set tolerances before you train
A model cannot guess what "close enough" means. If your team rejects one invoice because the total is off by $0.01 but accepts another with the same gap, tuning will not fix that. The rule is unclear.
For each field, define the pass range in plain language. Dates may allow "03/04/2025" and "2025-04-03" if both map to the same day, which means the sheet also has to say whether slash dates are day-first or month-first. Currency fields may accept a one-cent difference when tax or exchange rounding creates a small gap.
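A date rule like that can be written as a short list of accepted formats. The sketch below assumes slash dates are day-first, which is a choice your team has to make for its own documents. The names `ACCEPTED_DATE_FORMATS`, `parse_date`, and `same_day` are made up for this example.

```python
from datetime import datetime

# Assumption: slash dates are day-first. Use "%m/%d/%Y" instead if yours are month-first.
ACCEPTED_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return a date if the text matches an accepted format, otherwise None."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def same_day(a, b):
    """True only when both strings parse and point to the same calendar day."""
    day_a, day_b = parse_date(a), parse_date(b)
    return day_a is not None and day_a == day_b


print(same_day("03/04/2025", "2025-04-03"))  # True
print(same_day("03/04/2025", "2025-03-04"))  # False
```

Anything outside the accepted list returns None, so an unexpected format goes to review instead of being silently guessed.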
Do not apply one rule to every number. Unit price, quantity, tax, and grand total often need different tolerances because people verify them in different ways and vendors format them differently.
Blanks need rules too. A missing purchase order number may be normal for one supplier and a real error for another. A blank tax amount may be fine on a zero-tax invoice, but a blank invoice date usually needs review.
If you skip this step, reviewers start marking the same blank as both correct and incorrect during training. That adds noise to the dataset, and the model learns mixed signals.
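A blank rule can be as small as a lookup table with supplier overrides. The sketch below is one possible shape, not a standard. The policy values, the function name `blank_decision`, and the supplier name are all invented for illustration.

```python
# Hypothetical default policy for blank fields: "ok" or "review".
DEFAULT_BLANK_POLICY = {
    "po_number": "ok",
    "invoice_date": "review",
}

# Hypothetical supplier overrides; the supplier name is made up.
SUPPLIER_BLANK_POLICY = {
    "Example Supplier Ltd": {"po_number": "review"},
}


def blank_decision(field, supplier):
    """Say how a blank value should be treated for this field and supplier."""
    overrides = SUPPLIER_BLANK_POLICY.get(supplier, {})
    # Unknown fields default to review so that a gap in the sheet stays visible.
    return overrides.get(field, DEFAULT_BLANK_POLICY.get(field, "review"))


print(blank_decision("po_number", "Some Other Supplier"))   # ok
print(blank_decision("po_number", "Example Supplier Ltd"))  # review
```

With one table, every reviewer gives the same answer for the same blank.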
Low-confidence reads need a clear path as well. OCR quality checks help, but a confidence score does not settle the question on its own. A field with 92% confidence can still be wrong if the scan is cut off. A supplier name at 60% may still be easy for a person to confirm.
A simple definition sheet should say which formats are valid, how much numeric difference is allowed, when a blank is valid or an error, and who reviews the field when OCR or the model is unsure.
This is where field tolerances start saving time. When the team agrees on limits first, reviewers make the same call on the same data.
For invoice data extraction, even one small rule can save days of rework. If subtotal plus tax differs from total by no more than $0.01, accept it. If the due date is unreadable, send it to accounts payable for review. Clear limits like that make the exception handling workflow faster, and model tuning starts from cleaner labels.
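Those two rules fit in a few lines. The sketch below uses decimal arithmetic so the one-cent limit is not blurred by floating point. The function name `check_invoice` and the action labels are assumptions, and what happens to a mismatch beyond the limit is left as a plain flag because the rule itself does not say.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")


def check_invoice(subtotal, tax, total, due_date):
    """Return a list of review actions. An empty list means the invoice is accepted.

    Amounts are strings such as "108.25". A due_date of None means it was unreadable.
    """
    actions = []
    gap = abs(Decimal(subtotal) + Decimal(tax) - Decimal(total))
    if gap > TOLERANCE:
        actions.append("totals_mismatch")  # hypothetical label
    if due_date is None:
        actions.append("route_to_accounts_payable")  # hypothetical label
    return actions


print(check_invoice("100.00", "8.25", "108.26", "2025-05-01"))  # []
print(check_invoice("100.00", "8.25", "108.30", None))
# ['totals_mismatch', 'route_to_accounts_payable']
```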
Give every exception an owner
Most extraction queues get messy for a simple reason: nobody knows who decides. A low-confidence total lands with the ML team, then moves to finance, then back to operations. Two days pass and nothing changes.
Write the owner next to each exception type in the definition sheet. Use a real person or a real role, not "team" or "business." If three people could own an issue, nobody really owns it.
The split should be simple. Model errors go to the person managing prompts, training data, or extraction rules. Business rule disputes go to the process owner, often finance or operations. Unreadable scans go to the person who can request a new document. Recurring edge cases go to someone who can approve a permanent rule change.
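That split can live in a single lookup. The sketch below is illustrative only. The exception type names and role labels are placeholders, and your sheet should carry real people or roles.

```python
# Placeholder roles; replace each value with a real person or role.
EXCEPTION_OWNERS = {
    "model_error": "extraction lead",
    "business_rule_dispute": "process owner",
    "unreadable_scan": "document intake contact",
    "recurring_edge_case": "rule approver",
}


def owner_for(exception_type):
    """Return the single owner for an exception type, or fail loudly if none is set."""
    try:
        return EXCEPTION_OWNERS[exception_type]
    except KeyError:
        raise ValueError(
            f"No owner defined for exception type: {exception_type!r}"
        ) from None


print(owner_for("unreadable_scan"))  # document intake contact
```

Raising an error for an unknown type is deliberate. An exception with no owner should stop at the sheet, not wander between teams.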
This saves a lot of wasted tuning. If the model reads "INV-1186" as "INV-1188," that is an extraction problem. If the invoice shows two tax lines and the ERP accepts only one, that is a business rule problem. Teams often mix those up and spend weeks tuning a model that already reads the page well enough.
Unreadable scans need their own owner because they block the whole workflow. Someone has to decide whether to reject the file, ask the supplier for a better copy, or allow manual entry. If nobody owns that call, bad scans pile up and ruin the metrics. Then the model gets blamed for documents no person could read with confidence.
Recurring edge cases need a fast path. If one vendor always puts the purchase order number in the footer instead of the header, reviewers should not keep fixing it by hand. The owner should decide whether to add a supplier rule, adjust tolerances, or change the review flow.
This comes up often in real AI automation work. The model is rarely the slow part. Decision ownership is. Put names on exceptions first, and the tuning work gets smaller, faster, and easier to measure.
Build the definition sheet from real documents
Good field definitions usually start with a small batch of real files, not a whiteboard. Pull 20 to 50 documents that reflect the mess you expect in production: different vendors, layouts, scan quality, languages if needed, and a few ugly edge cases. If you only use clean samples, the process will look better than it really is.
Put every field into one shared sheet that operations and engineering can both edit. One table is enough to start. It does not need to look elegant. It needs to make every field mean one thing every time.
At the start, the sheet only needs a few columns: field name, plain-language meaning, accepted examples, rejected examples, and where the field usually appears on the document.
Accepted and rejected examples save a surprising amount of time. They show reviewers how to label difficult cases, and they expose weak definitions before the model learns them.
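If the sheet is exported as rows, a quick check can show which fields still lack a meaning or examples. This is a sketch under assumptions: the column keys and the function name `incomplete_rows` are invented here, and the sample rows are made up.

```python
REQUIRED_COLUMNS = (
    "field_name",
    "meaning",
    "accepted_examples",
    "rejected_examples",
    "usual_location",
)


def incomplete_rows(sheet):
    """Return {field name: [missing columns]} for rows that are not fully filled in."""
    problems = {}
    for row in sheet:
        missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
        if missing:
            problems[row.get("field_name") or "<unnamed>"] = missing
    return problems


sheet = [
    {
        "field_name": "invoice_date",
        "meaning": "Date the seller issued the invoice",
        "accepted_examples": "2025-04-03",
        "rejected_examples": "due date, service period",
        "usual_location": "header",
    },
    {
        "field_name": "vendor_name",
        "meaning": "",
        "accepted_examples": "Northwind Supplies LLC",
        "rejected_examples": "",
        "usual_location": "footer",
    },
]

print(incomplete_rows(sheet))
# {'vendor_name': ['meaning', 'rejected_examples']}
```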
A simple invoice example
Take an invoice from "Northwind Supplies LLC." On older invoices, the same company appears as "Northwind Supplies" or "NW Supplies." If your system treats those as three separate vendors, matching breaks quickly.
Vendor name needs alias rules in the definition sheet. Pick one approved supplier record and map the common name variations to it. That prevents fake duplicates when the header changes.
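An alias rule can start as a plain mapping from known spellings to one approved record. The sketch below uses the three names from this example. The function name `canonical_vendor` and the choice to lowercase and collapse spaces before lookup are assumptions.

```python
# Known spellings (lowercased) mapped to the one approved supplier record.
VENDOR_ALIASES = {
    "northwind supplies llc": "Northwind Supplies LLC",
    "northwind supplies": "Northwind Supplies LLC",
    "nw supplies": "Northwind Supplies LLC",
}


def canonical_vendor(raw_name):
    """Return the approved supplier name, or None if the spelling is not listed."""
    key = " ".join(raw_name.lower().split())  # ignore case and extra spaces
    return VENDOR_ALIASES.get(key)


print(canonical_vendor("NW  Supplies"))        # Northwind Supplies LLC
print(canonical_vendor("Northwind Trading"))   # None
```

Returning None for an unlisted spelling sends it to a person instead of creating a new vendor by accident.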
Now look at the amount fields. The invoice shows a subtotal of $980.00 for parts, a $30.00 discount, $25.00 shipping, and $78.75 tax. The final amount due is $1,053.75.
If the team defines "total amount" loosely, people start blaming the model when it extracts $1,053.75 instead of $980.00. The model is not the problem. The field definition is.
Each amount needs one plain meaning. Subtotal is the sum of line items before discount, shipping, and tax. Discount is the amount subtracted from the subtotal. Shipping is the delivery charge on the invoice. Tax is the sales tax or VAT shown by the supplier. Total amount is the final amount due after all adjustments.
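With those meanings fixed, the numbers on this invoice can be checked directly. The short sketch below only restates the arithmetic from the example. The variable names are chosen for illustration.

```python
from decimal import Decimal

subtotal = Decimal("980.00")   # line items before discount, shipping, and tax
discount = Decimal("30.00")    # subtracted from the subtotal
shipping = Decimal("25.00")    # delivery charge
tax = Decimal("78.75")         # tax shown by the supplier

# Total amount is the final amount due after all adjustments.
total_amount = subtotal - discount + shipping + tax
print(total_amount)  # 1053.75
```

Under this definition, an extractor that returns $1,053.75 for "total amount" is correct, and $980.00 is the subtotal.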
Purchase order fields cause another common problem. Some invoices arrive without a PO number because someone made a small one-off purchase. Others should never pass without one.
Set that rule before you tune anything. If invoices over $500 must include a PO number, or a supplier normally bills against a purchase order, a missing PO should trigger review. The extractor can still capture the rest of the invoice, but the system should route it to finance or purchasing instead of marking it complete.
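A review trigger like that is a small rule, separate from extraction. The sketch below assumes the $500 threshold from the example and a hypothetical set of suppliers that always bill against a purchase order. The names `PO_REQUIRED_ABOVE`, `PO_REQUIRED_SUPPLIERS`, and `needs_po_review` are made up.

```python
from decimal import Decimal

PO_REQUIRED_ABOVE = Decimal("500")
# Hypothetical list of suppliers that normally bill against a purchase order.
PO_REQUIRED_SUPPLIERS = {"Northwind Supplies LLC"}


def needs_po_review(total, supplier, po_number):
    """True when the PO number is missing and the rules say it should be there."""
    if po_number:
        return False
    return Decimal(total) > PO_REQUIRED_ABOVE or supplier in PO_REQUIRED_SUPPLIERS


print(needs_po_review("1053.75", "Some Other Supplier", None))    # True
print(needs_po_review("120.00", "Some Other Supplier", None))     # False
print(needs_po_review("120.00", "Northwind Supplies LLC", None))  # True
```

The extractor's output is untouched. The rule only decides whether the invoice is marked complete or routed to finance or purchasing.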
That simple invoice shows where teams lose time. They assume they need better invoice data extraction, but the real gap is usually field meaning, alias rules, and review triggers.
Mistakes that waste weeks
Most lost time comes from messy decisions, not weak models. Teams often blame OCR or the model when the real problem sits in the spreadsheet that defines the fields.
One common mistake is packing several business rules into one label. "Invoice date" sounds simple until one reviewer means issue date, another means service period start, and finance wants due date when the first one is missing. The model cannot learn a stable target if people cannot name it the same way.
Another slow problem starts when labels change in the middle of annotation. A team may begin with "vendor name," then switch to "supplier legal entity," then ask reviewers to accept trade names in some files but not others. That turns the training set into a mix of old and new rules, and the model starts looking worse than it is.
The same thing happens when reviewers correct outputs but never record why. If someone fixes "total amount" fifty times because tax was included on some invoices and excluded on others, that note matters more than another tuning run. Without a reason log, the team repeats the same argument every week.
Mixed document sets also slow teams down. Training on utility bills, invoices, receipts, and purchase orders all at once feels efficient, but it hides simple errors. A better start is one document type, one field sheet, and one review rule.
Small accuracy gains can become a trap. Teams spend days trying to move a field from 96% to 97% while reviewers still copy values by hand, route exceptions in email, and wait hours for decisions. Fix the review flow first. Saving ten minutes per exception often beats another round of model tweaks.
If progress feels stuck, pause tuning for a day. Clean the labels, freeze the rules, and make reviewers explain overrides in a few words. That reset often saves more time than a week of experiments.
Quick checks before another tuning round
Teams blame the model too early. Weak definitions cause more trouble than a mediocre model because the model learns whatever the team labels as true.
Before you spend another week tuning, run a short review on real documents. Use bad scans, odd layouts, and missing data, not just neat pilot samples.
Ask two reviewers to label the same small batch on their own. If they disagree on what a field means, fix the definition before touching the model. Write accepted formats for every field. For invoice data extraction, that means examples for dates, invoice numbers, tax IDs, totals, and notes on what counts as close enough. Check field tolerances in plain language. Decide whether "$1,250" and "$1,250.00" match, whether "03/04/25" is acceptable, and whether a supplier name can differ by one missing word.
Give each exception one owner. If totals do not match line items, one named person or team should decide what happens next. Then test the rules against messy cases such as rotated pages, stamps over text, duplicate invoices, phone photos, and documents where a field should stay blank.
That last part matters more than teams expect. A field definition is not complete until the team can explain when not to extract the field. If a document has no due date, the system should leave it empty instead of guessing.
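The format questions in this checklist can be turned into small tests. The sketch below assumes the team has decided that "$1,250" and "$1,250.00" match, and it returns nothing for a blank rather than guessing. The function names `parse_amount` and `amounts_match` are hypothetical, and the parsing assumes a comma as thousands separator, which will not suit every locale.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Return the amount as a Decimal, or None when the text is blank or not a number.

    Assumes "$" as the currency sign and "," as the thousands separator.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def amounts_match(a, b):
    """True only when both values parse and are numerically equal."""
    amount_a, amount_b = parse_amount(a), parse_amount(b)
    return amount_a is not None and amount_a == amount_b


print(amounts_match("$1,250", "$1,250.00"))  # True
print(parse_amount(""))                      # None
```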
OCR quality checks belong in the same review. If the text layer drops decimal points or reads "8" as "B," prompt tuning will not fix every downstream error. Catch those failures early, then decide whether you need preprocessing, a different OCR setup, or a manual review step.
A good exception handling workflow looks boring on paper. That is usually a good sign. When the team knows what each field means, which variations are acceptable, and who handles edge cases, the next tuning round gets faster and cheaper.
What to do next
If your team still argues about what a field means, stop changing prompts and fix the sheet first. For each field, write the label, accepted formats, source location, tolerance rules, and the person who decides edge cases.
That one document often does more for extraction quality than another week of tuning. If the rules are fuzzy, the model, reviewers, and operations team will all make different guesses.
Then run a small pilot with reviewed documents. Pick 50 to 100 real files, keep the document types narrow, and make sure a reviewer checks every result by hand. A short pilot exposes bad rules faster than a large rollout.
Add a few OCR quality checks before blaming the model. Flag blurry scans, rotated pages, missing pages, and low contrast files. Many extraction failures start much earlier.
Do not judge the pilot by accuracy alone. Track disagreements between reviewers, operators, and the model. If two people label the same invoice differently, the score can look fine while the process is still broken.
Your pilot log should answer a few plain questions: which fields cause the most disagreement, whether the issue came from OCR, the field rule, or the document layout, what tolerance the team applied, and who owned the exception and closed it.
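If the log is kept as simple rows, counting answers most of those questions. This is a minimal sketch. The row keys, the sample entries, and the function name `disagreements_by` are assumptions for illustration, not data from a real pilot.

```python
from collections import Counter

# Made-up log entries, one per disagreement or exception.
pilot_log = [
    {"field": "total_amount", "cause": "field_rule", "owner": "finance"},
    {"field": "total_amount", "cause": "field_rule", "owner": "finance"},
    {"field": "due_date", "cause": "ocr", "owner": "accounts payable"},
    {"field": "vendor_name", "cause": "layout", "owner": "operations"},
]


def disagreements_by(entries, key):
    """Count log entries per value of the given key, most frequent first."""
    return Counter(entry[key] for entry in entries).most_common()


print(disagreements_by(pilot_log, "field")[0])  # ('total_amount', 2)
print(disagreements_by(pilot_log, "cause")[0])  # ('field_rule', 2)
```

Here the top cause is a field rule, not OCR or layout, which suggests fixing the sheet before changing the model.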
Once that log is clear, change prompts or models if you still need to. Then you can test updates fairly. You will know whether a change improved invoice data extraction or whether the real problem still sits in field tolerances and exception handling.
If review loops keep eating time, it helps to bring in someone who can assess the whole workflow, not just the model. Oleg Sotnikov does that through oleg.is as a fractional CTO and startup advisor, with a strong focus on practical AI software and operations. A second pass on the field sheet, ownership rules, and process design often cuts more wasted work than another tuning cycle.