Jun 15, 2025·8 min read

Retrieval chunking by document type that improves RAG

Retrieval chunking by document type helps models pull cleaner context from contracts, manuals, tickets, and policies with fewer missed facts.

Why one chunk size fails

One chunk size assumes every document stores meaning the same way. It doesn't. That's why retrieval chunking by document type usually beats one global rule.

Contracts are dense. One short clause can change payment terms, liability, or ownership. Definitions matter too, because a word in section 12 may depend on text near the start. If you split a contract into chunks that are too small, search may find the clause but miss the definition that makes it readable.

Manuals break in the opposite way. A task often runs across steps, warnings, notes, and a table with settings or limits. If a chunk cuts between step 3 and the warning under it, the model may return the action and miss the safety note. If chunks are too large, retrieval pulls in extra material and buries the instruction that matters.

Tickets are messy. Some are a few lines with the problem and fix. Others hold a week of back-and-forth, logs, guesses, and status updates. A fixed chunk size either breaks short tickets into pieces that lose context or packs long threads into big blocks where the answer sits next to a lot of noise.

Policies depend more on structure than length. Scope, exceptions, approval rules, and update dates often live in separate parts. If search returns the rule without the exception, the answer can sound confident and still be wrong.

One size can work in a demo. It usually breaks in production.

What to inspect before you split

Before you set chunk rules, look at how people actually read each document. A contract gets read by clause, definition, and exception. A manual gets read by heading, step, and warning. A ticket gets read as a conversation that ends with a fix.

Good chunks usually follow the same boundaries a person would use. Mark the places where meaning naturally changes: section headings, numbered clauses, step blocks, reply turns, tables, and appendices. If a reader would pause there, your splitter probably should too.

Boilerplate needs special attention. Repeated headers, legal footers, signature blocks, template phrases, and copied disclaimers can flood retrieval with text that says very little. Keep them as metadata, drop them, or down-rank them. If they stay inside every chunk, search keeps finding the same noise.

You also need to decide what must stay together. A contract clause may depend on a definition from page one. A manual step may fail without the warning right above it. A support ticket often makes no sense unless the final reply stays close to the original problem. A policy section may need its scope, exception, and effective date in the same chunk, or at least in tightly linked chunks.

A quick review usually answers four questions: which boundaries readers rely on most, which repeated text adds noise, which nearby lines change the meaning, and which fields the model should cite.

That last part matters more than many teams expect. If you want answers people can trust, store clear citation fields such as document title, section number, clause ID, version, date, and ticket ID. A retrieved passage is much easier to verify when the model can point to the exact place it came from.

How to split contracts

Contracts rarely break cleanly at a fixed length. Meaning usually lives at the clause level, so split by legal structure first and token count second.

Keep each clause with its heading and numbering. If a section reads "7.2 Limitation of liability," store that full label with the text. The number matters because people ask things like "what does 7.2 allow?" and nearby clauses often look almost identical.

Definitions need their own chunks. A contract may use terms like "Confidential Information" or "Cause" in ten places, but the actual meaning sits in one definition section. Store that definition as a separate chunk, then link it to the clauses that use the term through metadata or references.

Exceptions cause a lot of bad answers. If one clause sets a rule and the next sentence says "except as provided in Section 9.4," keep the exception with the rule it changes, or create a combined retrieval unit. If you split them apart, the model may return the rule and miss the carveout.

Schedules and exhibits need their own logic. Don't glue Exhibit A to the last body section just because it comes next in the file. Split each schedule, exhibit, appendix, or pricing table by its own headings, rows, or numbered items.

For metadata, keep it simple and consistent: party names, effective date or signed date, document version or amendment number, section number and title, and document part such as body, schedule, or exhibit.

A small example makes this clear. If a service agreement says the customer can terminate on 30 days' notice, but Schedule 2 changes that for annual plans, retrieval should bring back both the termination clause and Schedule 2. That only happens when your chunks preserve structure instead of cutting the document into equal pieces.

How to split manuals

Manuals work best when each chunk helps someone finish one task from start to end. Page count is a weak guide. One page may hold three tiny actions, or one action may run for several pages.

Start with the user goal. "Replace the air filter," "reset the controller," and "calibrate the sensor" should usually live in separate chunks. That gives the model a clean unit of meaning instead of a random slice of text.

Keep setup details with the first step. If a procedure needs a tool, a part number, a shutdown step, or a safety check, attach that text to the opening chunk for the task. If you split those details away, the model may return the steps and miss the part that prevents a bad result.

Warnings should stay close to the step they affect. If step 6 says to disconnect power before opening a panel, keep that warning in the same chunk as step 6 or immediately before it. Don't move all warnings to a separate safety chunk unless the manual already treats them as general rules.

Long procedures still need smaller pieces, but the split should follow step ranges with plain labels. For example, a task might break into "steps 1-4: prepare and remove the cover," "steps 5-8: replace the filter," and "steps 9-11: reassemble and test." Labels like that help retrieval and also help the model explain where the answer came from.

Metadata matters here too. Store the product name, model number, and manual section or chapter. If two models share similar instructions, that metadata often makes the difference between a correct answer and a very confident wrong one.

How to split tickets

Design Better Chunk Rules

Map contracts, manuals, tickets, and policies to split rules that fit real use.

Start Review

Support tickets get messy fast. The newest reply often says "still broken" or "same error again," which means almost nothing on its own. Put the first problem report in the same chunk as the latest useful status so search returns both the issue and where it stands now.

That works better than slicing a thread into equal pieces. A model can answer a support question when it sees the original symptom, the current state, and the account or product details in one place.

Long threads need another rule: split them when the topic clearly changes. If a login issue turns into a billing dispute, or a bug report becomes a feature request, break the thread there. If you keep unrelated turns together, retrieval pulls noise and the answer drifts.

Logs, stack traces, and screenshot notes should stay next to the message that explains them. A raw error dump without the sentence before it is hard to use. A screenshot note that says "see attached" is useless unless the surrounding message explains what the user saw and what they clicked.

It also helps to store a few parts as separate fields instead of burying everything in one text block. Resolution, workaround, and root cause should each get their own field when you can extract them. Then retrieval can match a user asking for a quick fix without mixing it up with the deeper cause.

Clean the thread before you chunk it. Remove greetings, signatures, footer text, and repeated quote blocks from email replies. Those lines bloat the index and push the real issue farther away.

In practice, a useful ticket chunk often includes the original report, current status, relevant log note, workaround, and final resolution. That's usually enough context for the model to answer without dragging in the whole thread.

How to split policies

People usually search policies for one thing: the rule they need to follow now. Each chunk should make that rule clear without forcing the model to pull five nearby sections just to answer a simple question.

A good policy chunk keeps the scope, the rule, any stated exception, and the owner close together when they live in the same section. If a travel policy says who it applies to, the spending rule, and the exception for executives in one block, keep that block whole. Splitting those parts apart often leads to answers like "approval required" without the spending limit or the exception.

Approval history is different. Old review notes, sign-off chains, and change logs often add noise. Store them, but split them away from the active policy text so retrieval doesn't treat a past discussion as the rule itself.

Dates matter more in policies than in many other documents. Keep the effective date with the active chunk, and mark whether the document is current, draft, or superseded in metadata. If your system stores older versions, keep them searchable, but make it easy for retrieval to favor the current one first.

Don't break tables into tiny pieces. If a policy has limits, thresholds, or approval bands in a table, keep the full table with its heading and any note that explains how to use it. A chunk that contains only "up to $500" is close to useless if the department or approval path sits two chunks away.

Useful metadata for policies usually includes audience, department, policy type, effective date, and version status. Policies are not tickets or manuals. They need version control, clear ownership, and enough local context to answer "what is the rule now?" without mixing in old text.

A simple way to set rules

Guessing chunk rules usually creates a messy index. A better approach is to pick one document type first, collect about 20 real files, and study how the information actually sits on the page.

Use real contracts, real manuals, or real support tickets. Sample files show patterns quickly: short clauses, long sections, repeated headers, copied signatures, reply chains, and tables that break clean text into awkward pieces.

Before you write code, mark chunk boundaries by hand on a few documents. A plain note or spreadsheet is enough. Put the split points where a person would naturally pause and think, "this part answers one question."

Then test with searches people already ask. Keep them plain. Questions like "What ends the contract early?" "How do I reset the device after an update?" "Why was ticket 184 reopened?" and "Who approves exceptions to this policy?" tell you much more than tidy benchmark prompts.

Those searches show whether one chunk can answer the question or whether the model needs two or three nearby chunks together. That matters more than hitting a neat token count. If answers keep spilling across boundaries, your chunks are too small or split in the wrong place.

Change one setting at a time. Adjust size first, then overlap, then metadata. If you change all three at once, you won't know what fixed the result or what made it worse.

Metadata often does more work than people expect. A contract chunk needs section names and clause numbers. A ticket chunk needs ticket ID, speaker, and date. Keep the rules simple enough that someone else on your team can read them and apply them the same way next week.

That's usually enough to build a solid first version without weeks of tuning.

A simple example

Make RAG Easier to Trust

Build clearer citations and better document structure for answers people can verify.

Book Call

A customer writes to support and asks, "Can we end this contract early?" If your RAG system stores the whole agreement in one large chunk, search may return ten pages of mixed terms. The model then has to sort through payment rules, liability, definitions, and signatures. That often leads to fuzzy answers or missed details.

With document-specific chunking, the contract gets split by legal structure instead. Search can pull the early termination clause, the notice period from the notices section, and the breach definition if termination depends on default. Those pieces belong together, so the model can answer with the right section instead of dumping the whole agreement back at the user.

A useful answer is short and specific:

The contract allows early termination for material breach.
The other party gets 30 days' notice to fix the breach.
If they don't fix it in that time, the agreement can end under the termination section.

Now compare that with a product manual. A technician asks how to replace a filter. The best result isn't the full chapter. It's one task chunk with the steps, plus the warning that says to power down the unit first. If search returns only the warning, the answer is incomplete. If it returns twenty pages, the model may bury the warning or miss the exact task.

The same retrieval system can handle both questions well, but only if the chunks match the document. Contracts need clause-level splits with definitions and notices nearby. Manuals need task-level splits with warnings kept close to the action they affect. That's how search gives the model context it can actually use.

Common mistakes

Many teams build one splitter, one overlap setting, and one cleanup rule for everything. That looks tidy, but it hurts fast. Contracts, manuals, tickets, and policies don't carry meaning in the same way, so they shouldn't break apart in the same way either.

A common miss is using the same overlap for every file. Contracts often need small, careful overlaps around clause boundaries, references, and definitions. Manuals usually work better when a whole step, warning, or prerequisite stays together. Tickets are different again. Long overlaps can repeat noise and push the real fix farther down the ranking.

Another mistake is cutting numbered clauses in the middle. If you split "8.2 Limitation of liability" halfway through an exception, the model may retrieve a fragment that says the opposite of the full clause. Keep the clause number, heading, and full thought in one chunk whenever you can.

Policy retrieval often breaks for a simpler reason: teams mix active policy text with old versions in the same index. Then search returns a retired rule beside the current one, and the model blends them. Store version status, effective date, and owner in metadata. Archive old policies separately or filter them out by default.

Support tickets create a different kind of mess. Many teams keep every reply, even when half the thread says "thanks" or repeats the same failed step. That adds volume, not facts. Keep the first report, the useful troubleshooting steps, the final diagnosis, and the confirmed fix. Collapse the rest.

Metadata gets ignored too often as well. Text alone rarely tells the model whether a chunk is a draft, a current policy, a contract appendix, or a solved ticket. Add document type, version, section number, product area, and date. Without that, retrieval has to guess.

Quick checks before launch

Move Faster With AI

Use practical CTO advice to improve automation and modernize your software workflow.

Book Time

A quick test tells you whether this approach is helping or just making the index look neat. Run the test with real questions, not prompts written for a demo.

Use five real questions for each document type. For contracts, ask something like "How much notice ends this agreement?" For manuals, ask "What steps reset the device?" For tickets, ask "What fixed this bug last time?" For policies, ask "Who approves an exception?" Use the wording your team already uses in search, chat, or support.

For each answer, check the chunk that came back first. It should be the smallest useful chunk, not half the document. If the answer only makes sense after you open three nearby chunks, your split is too small or your overlap is weak. If the chunk is full of extra text before the answer appears, it's probably too large.

Two details often get lost during splitting: context labels and repeated boilerplate. Dates, versions, owners, section names, and document status need to stay visible inside the retrieved chunk or very close to it. If a policy answer appears without its version date, people may trust stale guidance.

Then look for boilerplate that keeps winning search. Repeated headers, legal footers, template intros, and copied disclaimers can drown out the useful part. Strip them, down-rank them, or keep them out of embeddings.

When you tune the system, make one change at a time: test the current setup, change one split rule, rerun the same question set, compare the top results, and keep only the changes that improve retrieval.

It's a slow afternoon. It saves weeks of chasing odd answers later.

What to do next

Pick one document type first. Start with the one that causes the most wrong answers, or the one where a bad answer costs the most time. For many teams, that's contracts or support tickets, because one bad split can hide the exact clause or fix the model needed.

Keep the rules simple enough that anyone on the team can explain them in a minute. If the rule needs a long note full of exceptions, it will drift quickly. Plain rules usually work better: split contracts by clause, manuals by task, tickets by issue and resolution, and policies by section plus exceptions.

Then check search logs every week. Look at failed queries, noisy matches, and chunks that pack together ideas that don't belong together. Small fixes to chunk boundaries often help more than swapping models.

A short review routine is enough. Collect ten bad searches from the last week, compare the top result with the chunk you wanted to see, find the split rule that caused the miss, change one rule, and test again on the same searches.

You don't need a perfect scheme on day one. You need a setup your team understands, can maintain, and can improve without guessing.

If you want a second opinion, Oleg Sotnikov at oleg.is works as a Fractional CTO and advisor on AI-first development and production retrieval systems. That kind of outside review can help you spot which chunking rules are doing real work and which ones just make the index bigger.

Do the first pass this week. One document type, one page of rules, one review session based on real search logs.

Frequently Asked Questions

Why does one chunk size fail across different document types?

Because documents carry meaning in different ways. A contract often needs a clause plus a definition or exception, while a manual needs a full task with its warning nearby.

One global size may look neat, but it often breaks real answers by cutting context in the wrong place or stuffing in too much noise.

What should stay together when I split contracts?

Keep the clause number, heading, and full clause text together. If the clause depends on a definition, notice term, or exception, keep those close through overlap or linked retrieval.

That gives search a fair chance to return the rule and the text that changes its meaning.

How should I handle definitions and exceptions in contracts?

Store definitions as their own chunks and connect them to the clauses that use those terms. For exceptions, keep the carveout with the rule it changes, or build a combined retrieval unit.

If search finds only the rule and misses the exception, the answer can sound clear and still be wrong.

What is the best way to chunk manuals?

Split manuals by task, not by page or fixed token count. A chunk should help someone finish one job, such as replacing a filter or resetting a controller.

Keep tools, setup steps, part numbers, and warnings with the steps they affect. That prevents answers that give the action but miss the safety note.

How do I split long support tickets without losing context?

Put the first problem report near the latest useful status, workaround, or resolution. That keeps the issue and the outcome in one place instead of scattering the story across random slices.

When the topic changes, split the thread. Also remove greetings, signatures, and repeated quote blocks before you index it.

How should I chunk policies with versions and exceptions?

Keep the active rule, scope, exception, owner, and effective date close together. Store old approval notes and change logs separately so search does not mix past discussion with the rule people need now.

Mark each policy as current, draft, or superseded in metadata. That simple step cuts a lot of wrong answers.

What metadata matters most for retrieval?

Start with document title, section name or number, version, date, and document type. Then add fields that fit the source, such as clause ID for contracts, model number for manuals, ticket ID for support threads, or owner for policies.

Good metadata helps the model cite the right place and helps retrieval rank the right chunk first.

How can I tell if my chunking rules actually work?

Use real questions from your team, support inbox, or search logs. Check whether the top result is the smallest chunk that answers the question without forcing you to open half the document.

If answers spill across too many chunks, your splits are too small. If the chunk buries the answer in filler, it is too large.

What boilerplate should I remove before I index documents?

Repeated headers, legal footers, signatures, template intros, copied disclaimers, greetings, and email quote chains usually add noise. They make search find the same junk again and again.

Drop them, store them outside embeddings, or down-rank them. Keep the text that changes meaning and helps answer a real question.

Where should I start if my RAG index already feels messy?

Pick one document type that causes the most bad answers and write simple rules for that type first. Mark split points by hand on a small sample, test with real searches, and change one setting at a time.

If you want a second opinion, an experienced Fractional CTO like Oleg Sotnikov can review your retrieval setup and help you fix chunk rules that add noise instead of useful context.