Knowledge base cleanup before you feed an assistant
Knowledge base cleanup helps your assistant stop citing old pages, duplicate articles, and conflicting policies. Use this practical process before reindexing.

Why bad content creates bad answers
An assistant does not read your knowledge base like a careful editor. It pulls the closest match and builds an answer from that. If the closest page is messy, old, or copied three times with small edits, the reply can sound polished and still be wrong.
That is why cleanup matters before you tune prompts or swap models. Retrieval picks what looks similar, not what is true. A polished but outdated page often beats a short, correct page because it shares more words with the question.
Conflicts make this worse. One page says customers have 14 days to cancel. Another says 30. Both look official. The assistant may choose one today and the other tomorrow, even when the question is the same.
Old policies are especially risky because they sound current. A page written two years ago rarely says, "Do not use this anymore." It reads like a live rule, so the assistant repeats it with the same confidence. Users treat the answer as policy, and support teams end up cleaning up the damage by hand.
Small content problems spread farther than most teams expect. A duplicate FAQ, a draft left in the wrong folder, or a title that says one thing while the body says another can keep showing up in retrieval. One weak page turns into the same bad answer over and over.
Bad content rarely stays isolated. It turns into inconsistent support replies, extra tickets, and internal confusion about which rule is real. If your assistant gives strange answers, the problem often started long before the model. It started in the documents you fed it.
What to review first
Start with a plain inventory. Before you delete or rewrite anything, write down every place the assistant can read today. Most teams miss at least one source, and that is how bad answers survive a cleanup.
That list usually includes help center articles, policy pages, SOPs, FAQ entries, support macros, exported PDFs, old docs in shared drives, and notes pasted into chat tools. If the assistant can reach it through search, sync, upload, or a connector, count it.
For each source, track four things:
- where it lives
- what type of content it holds
- who reads it
- whether people still use it to answer real questions
The type label matters more than it seems. Put policies together. Group help docs together. Keep SOPs in their own bucket. Do the same for FAQs, release notes, and internal troubleshooting notes. Once content is sorted this way, conflicts show up fast. A refund policy should not sit quietly beside an old FAQ answer that says something else.
Usage matters too. Some pages look official but nobody trusts them. Others are messy, yet support agents use them every day because they answer real customer questions. Mark both. The first group may be safe to retire. The second group needs a closer review before you touch it.
Keep customer-facing content separate from internal notes from the start. That one boundary saves a lot of trouble later. Internal notes often contain shortcuts, exceptions, or temporary advice that makes sense for staff but sounds wrong or incomplete when an assistant says it to a customer.
If you work with a small team, you do not need a big system. A simple sheet is enough: one row per source, one owner, one last review date, and one status such as active, stale, duplicate, or internal-only.
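As a minimal sketch, that sheet could be a small CSV built in Python. The field names, example rows, and owner labels below are hypothetical; only the four status values come from the text above.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

# Status values taken from the text; extend the set if your team needs more.
STATUSES = {"active", "stale", "duplicate", "internal-only"}


@dataclass
class Source:
    location: str      # where it lives
    content_type: str  # policy, help doc, SOP, FAQ, ...
    audience: str      # who reads it
    still_used: bool   # do people still answer real questions with it?
    owner: str
    last_review: str   # ISO date, e.g. "2024-03-01"
    status: str

    def __post_init__(self):
        # Reject typos early so the sheet keeps one shared vocabulary.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def to_csv(sources):
    """Render the inventory as CSV text, one row per source."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Source)])
    writer.writeheader()
    for source in sources:
        writer.writerow(asdict(source))
    return buffer.getvalue()


# Hypothetical example rows.
inventory = [
    Source("help-center/refunds", "policy", "customers", True,
           "billing lead", "2024-03-01", "active"),
    Source("shared-drive/refund-faq-old.pdf", "FAQ", "customers", False,
           "unassigned", "2022-01-15", "stale"),
]
sheet = to_csv(inventory)
```

The validation in `__post_init__` is a design choice, not a requirement: it keeps people from inventing a fifth status by accident.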
That first pass gives you a map. Without it, you are not cleaning the system. You are guessing.
How to find duplicates and near matches
Start with topic clusters, not titles. Two pages can answer the same question while looking different on the surface. A page called "Reset your password" and another called "Change login access" may be the same document in disguise.
This is where cleanup pays off quickly. Retrieval does not care that two pages came from different teams. If both pages cover the same task with slight differences, the assistant may pull either one and answer with confidence even when one is wrong.
The first clues are usually simple:
- the opening paragraph promises the same outcome
- the steps match, but the wording or order changes a little
- the screenshots show the same screen in different UI versions
- one page repeats most of another page and adds a small note
- two pages use different titles for the same internal process
Open likely matches side by side and compare the intro, the steps, and the screenshots. Small drift matters. If one page says "Go to Billing" and the other says "Go to Plans," someone probably copied an older page and only edited part of it.
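A rough first pass at spotting copied pages can be automated before the side-by-side review. The sketch below uses Python's standard `difflib` to score text similarity. The page bodies and the 0.8 threshold are assumptions for illustration, and this only catches copy-and-edit duplicates, not two pages that cover the same topic in different words.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so formatting does not hide a copy."""
    return " ".join(text.lower().split())


def likely_duplicates(pages, threshold=0.8):
    """Return (title_a, title_b, ratio) for page pairs whose bodies look alike.

    `pages` maps a title to its body text. The default threshold is an
    arbitrary starting point, not a recommended value.
    """
    matches = []
    for (title_a, body_a), (title_b, body_b) in combinations(pages.items(), 2):
        ratio = SequenceMatcher(None, normalize(body_a), normalize(body_b)).ratio()
        if ratio >= threshold:
            matches.append((title_a, title_b, round(ratio, 2)))
    # Most similar pairs first, so reviewers start with the obvious copies.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical pages: two near copies and one unrelated article.
pages = {
    "Reset your password":
        "Go to Billing, open Account, click Reset password, then check your email.",
    "Change login access":
        "Go to Plans, open Account, click Reset password, then check your email.",
    "Shipping times":
        "Orders ship within two business days from our warehouse.",
}
pairs = likely_duplicates(pages)
```

Treat the output as a list of pages to open and compare by hand. A high score means "look at these two," not "delete one."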
Pick one main version for each topic. Choose the page with the clearest scope, the most complete steps, and proof that someone still owns it. Move any useful detail from the extra page into the main one.
Do not keep near copies just because each has one good paragraph. Merge what helps, then archive or delete the extra page so the assistant stops treating both as equal truth.
Sometimes two similar pages should stay separate. When that happens, make the difference obvious in the title and opening lines. "For customers" and "For internal finance staff" is clear. Two vague titles are not.
If you cannot explain why both pages exist, you probably only need one.
How to handle stale pages
A stale page can do more damage than a missing page. If an assistant pulls an old setup guide or last year's support rule, it gives a wrong answer with total confidence. During cleanup, treat older content as untrusted until someone confirms it still matches current work.
Start with the clearest age signals: last updated date, product names, version numbers, and screenshots. Old screenshots are often the fastest clue. If a page shows a menu or button that no longer exists, the instructions around it probably need review too.
Pages about retired features need a separate path. Do not leave them mixed with live instructions. Move them into an archive, label them as historical, and keep them out of the retrieval set if they no longer describe current rules or workflows. A search system should not read a migration note from two years ago as today's policy.
A simple stale-page audit usually finds most of the trouble:
- articles that nobody has reviewed in the last 6 to 12 months
- pages that mention old product names or team names
- guides with screenshots from past layouts
- workflows built around tools your team already replaced
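The first two checks in that audit can be scripted. This is a sketch under assumptions: the page records, the retired names, and the 365-day cutoff are all hypothetical, and screenshots or replaced tools still need a human to look.

```python
from datetime import date

# Hypothetical retired names; fill in your own old product and team names.
RETIRED_TERMS = ["legacy admin panel", "acme classic"]


def stale_reasons(page, today, max_age_days=365):
    """Return the reasons a page needs review; an empty list means no flag."""
    reasons = []
    age_days = (today - page["last_review"]).days
    if age_days > max_age_days:
        reasons.append(f"no review in {age_days} days")
    body = page["body"].lower()
    for term in RETIRED_TERMS:
        if term in body:
            reasons.append(f"mentions retired name: {term}")
    return reasons


pages = [
    {"title": "Set up SSO", "last_review": date(2022, 5, 1),
     "body": "Open the Legacy Admin Panel and choose Security."},
    {"title": "Refund policy", "last_review": date(2024, 11, 20),
     "body": "Refunds are available for 30 days."},
]
today = date(2025, 1, 15)

# Keep only pages with at least one reason, keyed by title.
flagged = {}
for page in pages:
    reasons = stale_reasons(page, today)
    if reasons:
        flagged[page["title"]] = reasons
```

A flag here means "someone should confirm this page," in line with treating older content as untrusted until checked.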
Some old pages still help people. Changelogs, migration notes, and past incident writeups can matter for context. They just should not sit beside live policy where the assistant can mistake them for current guidance. Keep the history, but separate it clearly.
For content that changes often, add a visible review date and an owner. Pricing rules, access steps, support flows, and compliance notes go stale fast. A short line at the top is enough: who reviewed it, and when they should check it again.
A common example is a team that keeps a page for a legacy admin panel after moving to a new dashboard. Humans can tell it is old because the screen looks wrong. An assistant cannot. If that page stays in the same index as current docs, retrieval may keep surfacing it because the wording still matches the user's question.
That is how stale content stays alive long after the workflow changed.
How to fix policy conflicts
When two pages give different rules, your assistant can answer with the wrong one and sound sure about it. One page says refunds are allowed for 30 days. Another says 14. Retrieval finds both, and the model guesses.
Start by pulling every conflicting statement into one working document. Put the exact lines side by side, note where each one lives, and add the last update date if you have it. Keep everything in one place until the rule is settled.
Then assign one decision maker. Pick the person who owns the policy, not the person who uploaded the file or edited the wiki. When several people half-approve a rule, the conflict usually survives and keeps poisoning answers.
A simple process works well:
- Copy each version of the rule word for word.
- Name the team or person who owns the decision.
- Choose the current rule and write the reason in one sentence.
- Mark every page that now disagrees for removal or archive.
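Collecting the competing statements is the tedious part, and a small script can gather candidates for the working document. The sketch below assumes the conflict is about a number of days and that each source is available as plain text. The locations and sentences are made up for illustration.

```python
import re
from collections import defaultdict

# Matches "30 days", "14-day", "1 day" and captures the number.
DAY_WINDOW = re.compile(r"(\d+)[ -]days?\b", re.IGNORECASE)


def collect_statements(pages, topic):
    """Group sentences that mention `topic` by the day count they state.

    `pages` maps a location to its text. Sentences are split naively on
    ., ! and ?, which is good enough for a first pass reviewed by hand.
    """
    by_value = defaultdict(list)
    for location, text in pages.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if topic.lower() not in sentence.lower():
                continue
            for days in DAY_WINDOW.findall(sentence):
                by_value[int(days)].append((location, sentence.strip()))
    return dict(by_value)


# Hypothetical sources that disagree about the refund window.
pages = {
    "help-center/refunds": "Refunds are allowed for 30 days. Contact support to start.",
    "sales-faq": "We offer a 14-day refund window. Plans renew monthly.",
}
statements = collect_statements(pages, "refund")
conflict = len(statements) > 1  # more than one distinct value was found
```

The script only lines up the exact wording and where it lives. Choosing the current rule still belongs to the policy owner.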
Once the owner chooses the rule, rewrite it in plain language. Short sentences beat legal-sounding paragraphs. If a support agent can read it once and repeat it correctly, the wording is probably clear enough for a model too.
Watch for hidden conflicts. They often sit in FAQ pages, old PDF handbooks, help center articles, saved email templates, and internal playbooks. These leftovers cause more trouble than the obvious policy page because they look specific and trustworthy.
Do not stop after publishing the new version. Remove or archive every older page that says something else. If you leave the old page live "for reference," retrieval may still pull it in. An archive folder that is excluded from indexing is fine. A public page with outdated rules is not.
Add one short note for future editors: who owns this policy, when they reviewed it, and where changes should happen. That tiny note prevents the same conflict from coming back three months later under a different filename.
A simple cleanup workflow
Most teams make this harder than it needs to be. The best process is short, clear, and boring. That is a good thing.
Start with the inventory. List every page, PDF, help article, FAQ entry, and policy note that might enter retrieval, then give each one a status: keep, update, merge, archive, or delete.
That single label does a lot of work. It stops the usual mess where an old page, a draft, and a newer version all sit side by side and look equally trustworthy to the assistant.
Then work in order.
Delete junk first. Remove test pages, empty drafts, broken imports, and anything you would never want an assistant to quote.
Merge duplicate documents next. If two pages answer the same question, keep the clearer one and move any missing details into it.
Rewrite conflicts after that. When refund terms, access rules, or support promises do not match, choose one current rule and retire the rest.
Then pick one approved page for each common question. If people often ask about billing or cancellations, one page should carry the final answer.
After that, do a short quality pass. Fix vague titles, add a clear owner, and note when someone last reviewed the page. "Policy final v2" is a bad source name for both humans and retrieval.
Do not reindex in the middle of the cleanup. If you reindex half-fixed content, the assistant can still pull old claims with the same confidence as the corrected version.
Finish one full cleanup round, then reindex once. That gives you a cleaner baseline and makes the next round much faster.
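One way to enforce "finish, then reindex once" is a small gate in front of whatever triggers indexing. This sketch assumes a hypothetical inventory that maps each page title to one of the five statuses above, and that a page is relabeled "keep" once its update or merge is done.

```python
# Statuses from the text. Only "keep" pages should reach the index.
RESOLVED = {"keep", "archive", "delete"}
PENDING = {"update", "merge"}


def pages_to_index(inventory):
    """Return titles safe to index, or raise if the cleanup round is unfinished."""
    unknown = [title for title, status in inventory.items()
               if status not in RESOLVED | PENDING]
    if unknown:
        raise ValueError(f"pages without a valid status: {unknown}")
    pending = [title for title, status in inventory.items() if status in PENDING]
    if pending:
        # Refuse to build a half-fixed retrieval set.
        raise RuntimeError(f"finish these before reindexing: {pending}")
    return [title for title, status in inventory.items() if status == "keep"]


# Hypothetical finished round: one page survives, two are out.
finished = {
    "Refund policy": "keep",
    "Old refund FAQ": "archive",
    "Test page": "delete",
}
index_titles = pages_to_index(finished)
```

Raising an error instead of silently skipping pending pages is deliberate: a skipped page is easy to forget, while a failed run gets noticed.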
It is not glamorous work, but it pays off quickly. A small team can often sort dozens of pages in one focused session, and the assistant stops sounding certain about content nobody should have kept.
Example: refund policy confusion
A customer asks a simple question: "How long does a refund take?" The assistant searches the knowledge base and finds three different answers. The help center says refunds take 30 days. The sales FAQ says 14 days. An internal SOP tells support agents to offer store credit first.
The model does what assistants often do when retrieval hands them messy source material. It blends all three into one polished answer: "Refunds usually take 14 days, though some may take up to 30 days. Store credit may be offered first." It sounds calm and sure. It is also wrong.
The problem is not the model's tone. The problem is that each document had a different job, and nobody marked that clearly. A public help article answers customers. A sales FAQ may simplify terms to help close deals. An internal SOP tells agents what to try before they approve a cash refund. When all three sit in the same index without labels or priorities, the assistant treats them as equally valid.
That is why cleanup should happen before you turn on retrieval. When documents disagree, nothing in the text tells the assistant which one wins, so it cannot resolve the conflict on its own.
A clean fix usually looks like this:
- choose one public refund policy as the final source
- update or remove older pages that contradict it
- keep agent instructions in a separate internal collection
- add a clear owner and review date to the policy page
- reindex only after the conflicting content is gone
If you skip that work, the assistant keeps returning garbage with confidence. A customer may expect cash in 14 days, support may push store credit first, and finance may still process refunds on a 30-day timeline. That gap creates tickets, chargebacks, and angry replies.
One messy policy can poison a large part of your support flow. Refunds are just the obvious case. The same pattern shows up in shipping rules, cancellation windows, warranty terms, and account access.
Mistakes that keep bad answers alive
A weak knowledge base rarely breaks in obvious ways. It breaks quietly, when the assistant finds an old page, mixes it with a newer one, and gives a confident answer that sounds clean.
One common trap is keeping outdated pages live because someone might need them later. Retrieval does not understand office politics or nostalgia. If an old refund rule, pricing note, or security policy stays in the same searchable pool, it can still be the closest match and end up in an answer.
Store historical material outside the assistant's source set. If a team needs it for record keeping, archive it somewhere people can reach but the model cannot cite as current guidance.
Another mistake is updating one page while leaving five related pages alone. A company edits the main policy document, but the FAQ, canned replies, training notes, and old help page still repeat the earlier rule. Now the assistant sees conflict everywhere.
A short check catches this fast:
- find every page that repeats the same rule in different words
- pick one current source for the rule
- update or retire the rest
- leave a clear owner for future changes
Ownership matters more than many teams think. When support, legal, product, and sales can all publish rules, drift starts almost at once. The assistant cannot tell which team had final say. It only sees competing text.
Testing too early creates a different problem. Teams ask three easy questions before cleanup is done, get decent answers, and assume the issue is fixed. Then a customer asks the same thing with different wording, and the assistant pulls a stale chunk that nobody removed.
Finish the cleanup, reindex, and then test hard cases. Change the wording, ask follow-up questions, and try cases where two policies used to conflict. One good answer proves very little. Ten varied answers after cleanup tell you much more.
Quick checks before you reindex
Reindexing too early locks old mistakes back into the system. A short review now saves hours of confusing answers later, especially when the assistant handles support or policy questions.
Start with the questions people ask all the time. For each one, pick a single approved page or document. If "refund window," "plan limits," or "shipping times" appears in three places, retrieval will still pull mixed answers unless one page clearly wins and the others get merged, redirected, or removed.
Before you reindex, confirm four things:
- every frequent question points to one approved source, not two similar pages with slightly different wording
- archived, replaced, and draft pages do not appear in the retrieval set
- dates, product names, plan names, and policy terms match across every page that survived cleanup
- a person tests the top support questions and checks the answers against the approved page, line by line if needed
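The first two checks on that list are mechanical and could be scripted. The sketch below assumes three hypothetical inputs: a map from frequent questions to approved page titles, the list of titles about to be indexed, and a status per title. The term-matching check and the human test do not appear here because they need a person.

```python
def preindex_problems(approved_sources, retrieval_set, page_status):
    """Return a list of human-readable problems; empty means the checks pass.

    approved_sources: question -> list of approved page titles
    retrieval_set:    titles that will be indexed
    page_status:      title -> status string
    """
    problems = []
    for question, titles in approved_sources.items():
        if len(titles) != 1:
            problems.append(f"{question!r} has {len(titles)} approved sources")
        for title in titles:
            if title not in retrieval_set:
                problems.append(f"{question!r} points to unindexed page {title!r}")
    blocked = {"archived", "replaced", "draft"}
    for title in retrieval_set:
        if page_status.get(title) in blocked:
            problems.append(f"{title!r} is {page_status[title]} but still indexed")
    return problems


# Hypothetical data with three problems planted in it.
approved = {
    "refund window": ["Refund policy"],
    "plan limits": ["Plans overview", "Old plans FAQ"],
}
retrieval_set = ["Refund policy", "Plans overview", "Old plans FAQ", "Pricing draft"]
status = {
    "Refund policy": "active",
    "Plans overview": "active",
    "Old plans FAQ": "replaced",
    "Pricing draft": "draft",
}
problems = preindex_problems(approved, retrieval_set, status)
```

An empty list from this function is a precondition for reindexing, not proof that the answers are right.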
That human check matters more than many teams expect. Retrieval can look clean in a spreadsheet and still fail in practice. A quick test with 10 or 20 real questions usually exposes the last problems: old naming, half-removed pages, or a policy paragraph copied from last year.
Use a simple rule. If two pages can answer the same question, one of them probably should not be indexed. Otherwise the cleanup is cosmetic, and bad answers keep slipping through.
What to do after the cleanup
After cleanup, you need habits that stop the mess from coming back. A clean index can drift fast if nobody owns it and nobody checks what the assistant says after reindexing.
Give every topic that affects customers, money, legal terms, or support work two labels: one owner and one review date. If the billing lead owns refunds and the support lead owns account access, people know who must approve edits and who must revisit the page before it goes stale.
Your team also needs one short update rule. Keep it plain: when a policy changes, update the source page first, archive or merge the old page, and then trigger reindexing. If people post changes in chat, docs, and tickets without updating the source, retrieval will start mixing old and new answers again.
A simple routine works well:
- assign an owner for each high-impact topic
- add a review date inside the document or in your content tracker
- reindex after approved changes, not before
- test real user questions and save weak answers for review
Testing matters more than it seems. After reindexing, ask the assistant the same questions customers and staff ask every week. Look for answers that sound confident but miss limits, dates, exceptions, or newer policy wording.
Keep a small log of weak answers. You do not need a huge dashboard at first. A shared sheet with the question, the bad answer, the page it used, and the fix is enough to spot patterns such as missing context, bad chunking, or a source page that still says two different things.
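If the log outgrows a glance, a few lines of Python can show which pages keep causing trouble. The entries and field names below are invented for illustration. They mirror the four columns described above.

```python
from collections import Counter

# Hypothetical log entries: question, bad answer, page used, fix.
weak_answers = [
    {"question": "How long do refunds take?",
     "bad_answer": "14 to 30 days",
     "page_used": "sales-faq",
     "fix": "removed refund line from sales FAQ"},
    {"question": "Can I get my money back after two weeks?",
     "bad_answer": "No, the window is 14 days",
     "page_used": "sales-faq",
     "fix": "pending"},
    {"question": "Where do I reset my password?",
     "bad_answer": "Go to Billing",
     "page_used": "help-center/reset-password-old",
     "fix": "archived old page"},
]


def pages_behind_weak_answers(log):
    """Count weak answers per source page, most frequent first."""
    return Counter(entry["page_used"] for entry in log).most_common()


ranking = pages_behind_weak_answers(weak_answers)
```

A page that tops this ranking more than once is a good candidate for the next review, even if it already passed cleanup.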
If the assistant still struggles after cleanup, the issue may sit in retrieval settings, document structure, or the wider AI rollout rather than the content alone.
If you need a second set of eyes, Oleg Sotnikov at oleg.is does this kind of review as a Fractional CTO and startup advisor. That can help when the content looks clean on paper, but the assistant still gives shaky answers in production.
Frequently Asked Questions
Why should I clean the knowledge base before changing prompts or models?
Clean the source first. If your docs disagree or repeat old rules, the assistant will pull the wrong page and answer with confidence no matter how you tune the prompt.
What should I review first?
Start with an inventory of every source the assistant can read. Include help articles, PDFs, FAQs, SOPs, shared docs, chat notes, and anything connected through search or sync.
How do I find duplicates when the titles look different?
Group pages by topic, not by title. Then open likely matches side by side and compare the intro, steps, dates, and screenshots to see where the same answer appears twice with small drift.
How can I tell if a page is stale?
Treat a page as stale when the product names, screenshots, steps, or policy terms no longer match current work. An old last-updated date is a clue, but outdated instructions matter more.
Should I delete old pages or just archive them?
Keep history, but move it out of the live retrieval set if it no longer describes current rules. People may still need old incident notes or migration docs, but the assistant should not quote them as current guidance.
What is the best way to fix policy conflicts?
Pull every version of the rule into one document and put the exact wording side by side. Then ask the policy owner to choose the current rule, rewrite it clearly, and remove every page that still says something else.
Should internal notes stay in the same index as customer docs?
No. Internal notes often include shortcuts, exceptions, or temporary advice that makes sense for staff but sounds wrong when the assistant says it to customers.
When should I reindex the content?
Wait until you finish one full cleanup round. If you reindex in the middle, the assistant can still pull half-fixed content and mix old claims with the new version.
How do I test the assistant after cleanup?
Ask real support questions in different wording, not just the easy version once. Check each answer against the approved source and look for mixed dates, missing limits, or old policy language.
What if the assistant still gives bad answers after the cleanup?
Then the issue may sit in retrieval settings, chunking, document structure, or the wider AI setup. That is a good time to bring in an experienced reviewer who can trace where the answer went wrong.