Why RAG projects fail in month two after a good demo
RAG projects often fail in month two, when search misses answers, documents age, and teams skip quality checks.

Why the demo stops helping people
Most RAG tools look better in a demo than they do a month later. The reason is simple: demos happen in clean conditions. The test set is small, the files are familiar, and the questions match the wording in the documents.
Real use is rougher. People ask vague questions, use old project names, mix slang with internal terms, and leave out context they assume the bot already knows. A demo only has to survive neat prompts. Daily work never supplies them.
The first drop in quality usually comes from retrieval, not generation. If the system pulls the wrong chunks, the model can still produce a polished answer. That makes the failure harder to catch. The reply sounds right until someone follows it and finds out it was based on the wrong source.
Freshness is the next problem. Internal documents change all the time. New policies, meeting notes, runbooks, pricing rules, and product updates show up every week. If the index falls behind, the bot answers with last month's truth. Users rarely forgive that twice.
The early warning signs are easy to miss. People start asking the same question three or four ways. They paste full document titles because normal wording stops working. They open the source material before trusting any answer. Then a few wrong replies turn into a quiet habit of avoiding the tool.
Trust drops faster than accuracy. One bad answer about vacation policy annoys people. One bad answer about pricing, security steps, or a customer promise changes how they treat the whole system.
A common month-two example looks like this: a company launches a support assistant trained on product docs and old tickets. In the demo, it resolves common questions in seconds. Six weeks later, the product team has shipped updates, support has renamed a workflow, and a new billing rule sits in a PDF nobody indexed. The bot still answers fast, but it now mixes old and new rules. Agents start asking coworkers instead.
The system did not suddenly get stupid. The company changed, the questions got messier, and retrieval did not keep up. Once users lose trust, even decent answers stop mattering.
How weak retrieval shows up
Weak retrieval often looks fine in testing and wrong in daily use. People ask normal questions, get polished replies, and still walk away with the wrong action.
A common failure is near-match search. The tool finds a document that shares the same words but not the same meaning. Someone asks about the current expense approval limit, and the model pulls an older travel policy because it mentions approvals, managers, and receipts. The answer sounds close enough to pass a quick glance, which is exactly why the mistake slips through.
Chunking causes another quiet problem. One chunk ends with the rule, the next starts with the exception, and retrieval grabs only half the picture. The model fills the gap with a smooth guess. Users do not see the missing context. They only see an answer that feels certain.
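One common mitigation is to let neighboring chunks overlap, so a rule and the exception right after it are more likely to land in the same chunk. Here is a minimal sketch, assuming the document is already split into sentences. The function name and the default sizes are hypothetical, and real pipelines often split by tokens or headings instead.

```python
def chunk_with_overlap(sentences: list[str], size: int = 4, overlap: int = 1) -> list[list[str]]:
    """Group sentences into chunks of `size`, repeating the last `overlap`
    sentences of each chunk at the start of the next one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window moves each time
    chunks: list[list[str]] = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window already reached the end
    return chunks
```

Overlap does not guarantee the rule and its exception stay together, but it reduces how often a boundary falls exactly between them. It also makes the index larger, so keep the overlap small.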
Metadata filters can make this worse. Teams add filters for department, region, product line, or date so results stay tidy. That seems sensible until the filter blocks the one document that actually answers the question. Then the tool searches a smaller box and returns the best thing inside it, not the right thing.
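A cheap diagnostic is to compare the filtered results with the unfiltered ones and see which strong hits the filter threw away. The sketch below assumes each hit carries a department label and a score. The Hit class, its fields, and the function name are hypothetical and not tied to any particular vector store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    department: str  # hypothetical metadata field used by the filter
    score: float     # higher means more similar


def blocked_by_filter(hits: list[Hit], department: str, k: int = 3) -> list[Hit]:
    """Return the hits in the unfiltered top-k that a department filter would drop."""
    top = sorted(hits, key=lambda h: h.score, reverse=True)[:k]
    return [h for h in top if h.department != department]
```

If this list is regularly non-empty for real questions, the filter is probably hiding documents people need, and it is worth reading those documents before trusting the filtered answer.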
Broken internal assistants tend to show the same patterns. They answer broad questions better than specific ones. They return the right topic but the wrong version. They miss exceptions, edge cases, and recent changes. And they still sound confident when the source is a poor fit.
Similarity scores do not solve this. A score can look good because the text is lexically close while the answer still misses the point. That is why a promising demo tells you very little about month two. Handpicked questions can make retrieval look smart. Real users ask messy questions with missing context, internal jargon, and half-remembered terms.
The fastest way to spot weak retrieval is to read the source snippets, not the final answer. If the snippets feel adjacent to the question instead of directly useful, the tool is already drifting.
A simple test helps. Ask five people for real questions they asked this week. Then inspect what the system retrieved before judging the wording of the reply. If the wrong documents keep showing up, do not tune the prompt first. Fix retrieval, or the same failure will keep returning with nicer phrasing.
How stale content sneaks in
If you want to know why a RAG tool slips after launch, watch what happens to the documents. The model stays the same, but the company keeps changing. Policies, product notes, pricing rules, and support steps move every week.
That is how a stale knowledge base forms. Nobody plans it. It grows from small gaps between where the latest document lives and what the retriever can still see.
A common pattern is boring and damaging at the same time. Someone updates a policy in the shared drive, but the ingestion job only runs on one folder or fails quietly. The new file never reaches embeddings, so the bot keeps pulling chunks from last month's version. Users ask a normal question and get a confident answer from an old rule.
Old content also survives longer than teams expect. A page gets replaced, but nobody removes the old chunks from the index. A deleted handbook page still appears because the vector store kept its embeddings, or a crawler copied the page before someone archived it. The source is gone, but the answer still lives inside retrieval.
Loose versioning makes this worse when document owners skip basic labels. If two files have nearly the same title and nobody marks which one is current, retrieval may rank the older one higher because it contains more matching words. The bot is not choosing the right policy. It is choosing the chunk that looks most similar.
Most stale answers trace back to a small set of causes. Teams publish new docs but never trigger reindexing. They delete pages in the source system but keep old embeddings. They rename files without marking a clear current version. Ownership is spread across teams, so nobody checks what the bot actually reads. Or they treat the demo index as if it will stay useful on its own.
A small support example makes this obvious. Finance changes the travel reimbursement cap on Monday. The updated PDF sits in the right folder, but the connector misses it. On Wednesday, the bot still quotes the old cap from a chunk created three months ago. One wrong answer like that can erase weeks of trust.
The fix is not glamorous, which is why teams skip it. Every document in retrieval needs an owner, an effective date, a source of truth, and a rule for updates and removal. If a page changes, the index must change. If a page dies, its chunks must disappear too.
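The "page changes, index changes" rule can be checked mechanically. The sketch below compares the source system with the index and lists what to add, re-embed, or delete. It assumes stable document IDs and that the index stored a content hash for each document when it was embedded. Both are assumptions about your setup, and the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of a document's text, stored next to its embeddings."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """source maps doc_id -> current text; indexed maps doc_id -> hash at embed time."""
    to_add = sorted(d for d in source if d not in indexed)
    to_update = sorted(
        d for d in source if d in indexed and content_hash(source[d]) != indexed[d]
    )
    to_delete = sorted(d for d in indexed if d not in source)  # page died, chunks must go
    return {"add": to_add, "update": to_update, "delete": to_delete}
```

Running a plan like this on a schedule, and alerting when the lists stay non-empty, turns a quiet ingestion failure into a visible one.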
This is often the first thing an internal AI audit finds. Weak retrieval gets attention because it sounds technical. Stale content slips past people because the demo still sounds smart, right up until someone follows the wrong rule.
Set up a simple evaluation routine
A RAG demo can look good with five handpicked prompts. After launch, people ask messy questions, use old terms, and expect answers from the right policy or document version. The problem is not only bad retrieval. It is also the lack of a routine that tells you when retrieval starts slipping.
Build a small test set from real questions and run it every week. Start with 20 to 50 questions pulled from support chats, Slack threads, tickets, or search logs. Keep them plain and realistic. For each question, note the document or section the answer should use. If more than one source is acceptable, record that too. This turns vague feedback like "the bot feels worse lately" into something you can measure.
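The test set can stay very small in code as well. Here is a minimal sketch of one way to store questions with their acceptable sources and turn a weekly run into a single number. EvalCase, hit_at_k, and weekly_hit_rate are hypothetical names, and the cutoff of three results is an assumption you should adjust.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_sources: set[str]  # any one of these counts as a correct source


def hit_at_k(case: EvalCase, retrieved: list[str], k: int = 3) -> bool:
    """True if an expected source appears in the first k retrieved doc IDs."""
    return any(doc in case.expected_sources for doc in retrieved[:k])


def weekly_hit_rate(cases: list[EvalCase], results: dict[str, list[str]], k: int = 3) -> float:
    """results maps each question to the ranked doc IDs the system retrieved."""
    if not cases:
        return 0.0
    hits = sum(hit_at_k(c, results.get(c.question, []), k) for c in cases)
    return hits / len(cases)
```

Tracking this number week over week matters more than its absolute value. A drop after a content change is the signal to go read the retrieved chunks.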
Score retrieval and answers separately
Treat retrieval and answer generation as two different steps. If you lump them together, you hide the real problem.
For each question, check four things: whether retrieval fetched the right document near the top, whether the answer used that document correctly, whether any citation points to the expected source instead of a similar but wrong page, and why the system failed when it missed. The reason matters. Bad chunking, weak query rewriting, stale content, and prompt drift need different fixes.
This split saves time. If retrieval fails, prompt tuning will not help much. If retrieval works but the answer still goes wrong, the model may summarize badly, skip a condition, or blend two documents together.
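The split can be written down as a tiny triage rule, so every miss gets a label that points at the step to fix. This is a sketch under assumptions: a person judges whether the answer is correct, the cutoff of three results is arbitrary, and the label names are hypothetical.

```python
def triage(expected: set[str], retrieved: list[str], answer_correct: bool, k: int = 3) -> str:
    """Label one test result so the fix targets the right step."""
    retrieved_ok = any(doc in expected for doc in retrieved[:k])
    if not retrieved_ok:
        # Counted as a retrieval miss even if the answer happened to be right.
        return "retrieval_miss"
    if not answer_correct:
        return "generation_miss"
    return "pass"
```

A run dominated by retrieval_miss points at chunking, filters, or stale content. A run dominated by generation_miss points at the prompt or the model.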
Keep the test set current
Review misses every week. A 20-minute session is often enough. Look for patterns, not one bad output. Maybe finance questions fail because the index splits tables badly. Maybe HR answers drift because the bot still pulls last quarter's policy.
Update the test set when content changes. If your team replaces a handbook, renames a process, or adds a new pricing rule, add a few fresh questions that cover it. Remove prompts that no longer match real work. A stale test set can tell you everything is fine while the tool gets worse in daily use.
You do not need a fancy setup. A shared sheet with questions, expected sources, retrieval results, answer scores, and notes is enough to start. Think of it as a lightweight internal AI audit. If one person owns it each week, you will catch retrieval problems and stale content before users stop trusting the tool.
A realistic month-two support bot example
An internal HR bot often looks good in a demo because the test questions are clean and the source files are fresh. A month later, normal work exposes the cracks.
Picture a company that launched a bot for simple HR questions. Staff use it to ask about paid leave, carryover days, and approval rules. During the first week, the bot seems reliable. It pulls answers from the employee handbook and gives short, clear replies.
Then HR updates the paid leave policy. The company changes the carryover rule, so employees can keep fewer unused days than last quarter. HR uploads the new policy to a shared folder, but nobody checks whether the bot now retrieves that file first.
The bot keeps finding the old handbook page because that page ranks higher in search. Maybe it uses the exact words people type. Maybe it was split into cleaner chunks. Maybe the new document sits in a folder the index missed. The result is simple: the answer sounds confident, but it is wrong.
A few employees trust the bot and plan their time off around the old rule. One manager notices the mismatch only after a team member quotes the bot in a message. In other words, staff found the error before the team that owns the tool did.
That moment hurts more than the bad answer itself. People stop treating the bot like a useful shortcut and start treating it like a risk. Even after HR fixes the source and reindexes the files, many employees keep double-checking every answer with a human. Some stop using the bot altogether.
Trust drops fast because people remember the mistake in a personal way. Paid leave affects plans, family trips, and payroll questions. If the bot gets that wrong once, employees assume it might get other policy questions wrong too.
A tiny weekly check would have caught this early. Ask five real paid leave questions, inspect the retrieved chunks, confirm the newest policy appears ahead of the old handbook page, and flag any answer that cites outdated wording.
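The "newest policy ahead of the old page" part of that check is easy to automate once documents carry an effective date. The sketch below assumes every retrieved result is a version of the same policy and that effective dates are recorded, which many indexes do not do by default. The function name is hypothetical.

```python
from datetime import date


def newest_ranks_first(ranked: list[tuple[str, date]]) -> bool:
    """ranked holds (doc_id, effective_date) pairs in retrieval order, best first.
    True when the top result carries the latest effective date in the list."""
    if not ranked:
        return False
    latest = max(effective for _, effective in ranked)
    return ranked[0][1] == latest
```

A False result for a real paid leave question is exactly the month-two failure described here: the old handbook page is still winning.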
That is what month two looks like for many internal AI tools. The demo proves the bot can answer a question. It does not prove the bot can stay right after policies change, files move, and people ask the same thing in messy everyday language.
Mistakes teams make after launch
Most teams relax right after the first good demo. They tested the bot with a small set of clean questions on documents they knew were easy to retrieve, and everyone left the meeting impressed. Real employees do not ask like that.
The first mistake is grading answers by tone instead of truth. A smooth, confident reply feels correct, so teams mark it as a win. But the real check is much simpler: did the system retrieve the right source, and did it answer from that source without stretching beyond it?
A polite wrong answer still wastes time. Sometimes it creates more work than no answer at all.
The next mistake is letting content get messy. Someone adds draft docs, old meeting notes, copied policy pages, and half-finished guides into the same index as approved material. Then the bot pulls the wrong chunk and gives an answer that sounds reasonable while pointing people to a rule that was never approved.
A stale knowledge base does not always look stale on the surface. Sometimes only a small share of the content is old, but that small share hits the questions people ask most.
Ownership is another weak spot. Product assumes engineering will keep the content clean. Engineering assumes the department that owns the docs will flag changes. Nobody has a calendar reminder, a checklist, or even a named person for updates. So the index drifts for weeks.
One rule helps: every document collection needs an owner, and every answer type needs a trusted source. If nobody owns it, the bot will guess between mixed materials.
Teams also wait for complaints instead of testing on purpose. That sounds harmless, but most employees do not report bad answers. They try the tool twice, lose trust, and go back to Slack or email.
For many internal tools, a basic weekly review is enough. Run the same 20 or so real questions. Inspect the retrieved sources, not just the wording of the final reply. Remove drafts and duplicates from the index. Update documents that changed that week. Log misses and watch for repeats.
This does not need a big process. It is closer to a small internal AI audit than a formal program. If the team skips that habit, retrieval problems stay hidden until adoption drops and nobody believes the bot anymore.
Quick checks for this week
A month-two review does not need a big project plan. You need an hour, a few real questions, and the discipline to inspect what the system actually used.
Start with five questions from the last week. Pull them from support chats, ticket threads, or employee messages. Real wording matters because users rarely ask things the same way your team did during testing, and that is where retrieval problems start to show.
For each question, open the exact chunks the model retrieved. Read them without giving the model any credit for sounding confident. If the chunks are vague, half-related, or missing the policy detail that matters, the answer is already on shaky ground.
Then check the source documents behind those chunks. Two details tell you a lot: when the document changed last and who owns it now. If nobody can answer both, you probably already have a stale knowledge base.
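Those two details can be checked for every source in one pass. Here is a minimal sketch, assuming you can export an owner and a last-changed date for each document. SourceDoc and freshness_flags are hypothetical names, and the 180-day threshold is an arbitrary assumption, since an old document is not automatically a wrong one.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class SourceDoc:
    doc_id: str
    owner: Optional[str]
    last_changed: Optional[date]


def freshness_flags(doc: SourceDoc, today: date, max_age_days: int = 180) -> list[str]:
    """List the reasons a document deserves a manual look. Empty means no flags."""
    flags: list[str] = []
    if not doc.owner:
        flags.append("no owner")
    if doc.last_changed is None:
        flags.append("unknown change date")
    elif today - doc.last_changed > timedelta(days=max_age_days):
        flags.append("unchanged for a long time, confirm it is still current")
    return flags
```

Documents with both "no owner" and "unknown change date" are the ones to look at first.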
A quick pass usually exposes the same issues. A newer policy exists, but the retriever still pulls last quarter's version. Two documents say nearly the same thing with small conflicts. A retired page still ranks because it contains common words from old queries. One useful document is split into chunks so small that the answer loses context. Or nobody knows who should fix a bad source when you find one.
Clean up the obvious mess first. Remove duplicate pages that say the same thing in slightly different ways. Archive retired content so it cannot compete with current material. If a document is still active but confusing, rewrite the source before you tune prompts or swap models.
Keep one shared log for every miss. A simple sheet, issue board, or chat thread works if the whole team uses it. Each entry should capture the user question, the answer given, the chunks retrieved, and what should have happened instead. That becomes your evaluation routine, even if it starts small.
After ten or fifteen logged misses, patterns become hard to ignore. You will see whether the problem is weak retrieval, stale content, bad chunking, or missing ownership.
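If the log lives in a sheet, a few lines of code can tally it. The sketch below mirrors the fields listed for each entry and adds a cause label that a reviewer fills in by hand. That label, the Miss class, and top_causes are assumptions added for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Miss:
    question: str
    answer_given: str
    chunks_retrieved: list[str]
    expected: str  # what should have happened instead
    cause: str     # reviewer's label, e.g. "stale content" or "bad chunking"


def top_causes(misses: list[Miss]) -> list[tuple[str, int]]:
    """Count misses per cause, most frequent first."""
    return Counter(m.cause for m in misses).most_common()
```

The most frequent cause at the top of that list is usually the first thing worth fixing.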
If you do only one thing this week, do this review with real user questions. Teams often spend weeks tuning prompts when the real fix is deleting old documents and assigning one owner to each source.
What to do before the tool slips further
When a RAG assistant starts missing obvious answers, teams often reach for prompt changes first. That usually wastes time. Fix retrieval before you rewrite prompts, because even a polished prompt fails when search brings back the wrong chunks.
Pull a small set of real failures from the last one or two weeks and inspect each case by hand. Did the system find the right document? Was that document current? Did the answer follow it closely? This quick check tells you whether you have a retrieval problem, a stale knowledge problem, or both.
Give content freshness to one person, not a shared inbox. One owner should track source documents, update dates, and pages that need review. When nobody owns freshness, old policies and half-finished docs stay in the index for months.
A weekly review helps too, but keep it plain. Pick 20 common questions and score them with pass or fail rules that anyone on the team can apply in a few minutes. Pass if the right source appears near the top results and the answer matches the current document. Fail if the answer uses an old policy, cites retired steps, sounds certain when the source is missing, or repeats the same bad answer twice in one week.
Many teams fail after a good demo because they expand coverage before the basics are stable. A better path is smaller and a bit boring. Keep the bot focused on one workflow, such as HR policies or support replies, until the weekly score stays solid for several weeks in a row.
That makes debugging easier too. If accuracy drops, you can inspect one slice of content, one group of users, and one set of questions. If you feed the system every document in the company too early, every mistake takes longer to trace.
If your team is stuck, an outside review can save a lot of wasted effort. Oleg Sotnikov at oleg.is works as a Fractional CTO and startup advisor, and he helps teams review RAG architecture, evaluation habits, and AI-first internal tooling in a practical way.
Act before people stop trusting the tool. Once staff decide the bot is unreliable, they stop using it, and winning them back is much harder than fixing the system early.
Frequently Asked Questions
Why does a RAG tool look good in a demo and then slip a few weeks later?
Demos use clean questions, familiar files, and fresh content. Real users ask messy questions, use old names, and leave out context.
Month two fails when retrieval cannot keep up with changing documents. The bot still sounds polished, but it starts answering from the wrong source or an old rule.
What usually goes wrong first: retrieval or generation?
Retrieval usually breaks first. If search pulls the wrong chunk, the model can still write a smooth answer, which makes the miss harder to spot.
Check the retrieved snippets before you judge the wording. If those snippets do not answer the question directly, prompt changes will not fix the real issue.
How can I tell if retrieval is weak?
Look for behavior changes from users. People rephrase the same question several times, paste full document titles, or open the source before they trust the answer.
You can also inspect recent misses. If the system keeps finding related documents instead of the right one, retrieval is drifting.
Why does stale content damage trust so quickly?
One wrong answer about a policy or pricing rule changes how people treat the whole tool. Users remember the risk more than the average accuracy.
That is why stale content hurts so much. The bot answers fast, but it quotes last month's rule, and people stop trusting it.
Should we tune prompts first when answers start getting worse?
No. Start with retrieval and content freshness.
Open a few failed cases by hand. Confirm that the system found the right document, that the document is current, and that old chunks no longer appear. Tune prompts only after those checks pass.
How often should we test an internal RAG assistant?
Run a small review every week. Use 20 to 50 real questions from chats, tickets, or search logs, not demo prompts.
That routine catches drift early. You do not need a large setup; a shared sheet with questions, expected sources, and notes works fine.
What should we measure in a weekly RAG review?
Score retrieval and the final answer as two separate steps. First check whether the right document appears near the top. Then check whether the answer follows that document closely.
Also note why a miss happened. Weak chunking, stale content, bad filters, and poor query rewriting need different fixes.
How do chunking and filters create wrong answers?
Chunking can split a rule from its exception. Search then retrieves only half the story, and the model fills the gap with a guess.
Filters can hide the only document that answers the question. When that happens, the system returns the best result inside a small box, not the right one.
Who should own the knowledge base behind the bot?
Assign one person to each document collection. That person should track the current file, update dates, and removal of old content.
If nobody owns freshness, old pages stay in the index, new files miss reindexing, and the bot starts mixing versions.
When does it make sense to ask for an outside review?
Bring in outside help when your team keeps fixing wording while the same failures return. A fresh review can spot weak retrieval, stale sources, and missing checks much faster.
If you need that kind of review, Oleg Sotnikov can look at your RAG setup, evaluation routine, and internal AI workflow in a practical way.