I built a RAG system to reproduce a failure pattern I'd seen developers describe repeatedly.
LangChain, 500-character chunks, 50-character overlap, recursive splitting. Technical documentation in the knowledge base. I asked it for a Flask endpoint that validates webhooks. It returned code with the wrong function name, the wrong endpoint, and a missing line. So I ran the diagnostic:

- I started with the generation layer. Checked whether the model had the right context and hallucinated anyway. It didn't. The context itself was wrong. That ruled out generation.
- I checked retrieval. The retriever made the correct semantic match. Right section, right documentation. That ruled out retrieval.
- That left chunking. I went back to the indexed chunks. The function was split across three separate chunks. The model never received a complete unit of code, just fragments (a minimal reproduction is sketched at the end of this post).

Good retrieval. Wrong context. Broken chunking.

I caught this because I was running offline evaluation: ground truth, a labeled dataset, and an eval pipeline comparing outputs against expected answers (also sketched below). That setup surfaces this kind of failure immediately.

Four stages. That's how many points in a RAG system's lifecycle need evaluation: development, stress testing, production monitoring, regression testing. The diagnostic I ran above is stage one. Most teams never build stages two, three, or four.

- Offline evaluation tells you what's broken now.
- Production monitoring (online evaluation) tells you when something new breaks.

Both are necessary. Skipping either means shipping a fragile system to production.
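
To make the chunking failure concrete, here's a minimal sketch using the same splitter settings (recursive splitting, 500-character chunks, 50-character overlap). The documentation snippet and the endpoint inside it are hypothetical stand-ins for the real knowledge base; the arithmetic is the point: the function is longer than a single chunk, so no indexed chunk can contain it whole.

```python
# A minimal reproduction of the chunking failure, assuming the setup above:
# LangChain's recursive splitter, 500-character chunks, 50-character overlap.
# The documentation text and names below are hypothetical stand-ins, not the
# actual knowledge base.
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc = """\
## Validating webhooks

Verify the X-Signature-256 header before processing any payload. It is an
HMAC-SHA256 digest of the raw request body, keyed with your webhook secret.
A minimal Flask endpoint:

    @app.route("/webhooks/github", methods=["POST"])
    def validate_webhook():
        # Reject requests that do not carry a signature header at all.
        signature = request.headers.get("X-Signature-256", "")
        if not signature:
            abort(400, "missing signature header")
        body = request.get_data()
        expected = "sha256=" + hmac.new(
            WEBHOOK_SECRET, body, hashlib.sha256
        ).hexdigest()
        # Constant-time comparison, to avoid leaking timing information.
        if not hmac.compare_digest(signature, expected):
            abort(401, "signature mismatch")
        event = request.get_json(silent=True) or {}
        return handle_event(event)

If validation fails, return immediately and never parse the payload.
"""

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(doc)

# The endpoint spans well over 500 characters, so no single 500-character
# chunk can run from the def line to the return: every chunk is a fragment.
complete = [c for c in chunks
            if "def validate_webhook" in c and "return handle_event" in c]
print(f"{len(chunks)} chunks indexed, {len(complete)} contain the full endpoint")
```

That's the shape of the failure the diagnostic landed on: retrieval can only surface whichever fragment scores highest, never the complete unit of code.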
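
And here's roughly what a stage-one offline check can look like. This is a sketch under assumptions: `rag_answer` stands in for whatever callable runs your retrieval-plus-generation pipeline, and the JSONL dataset with a `must_contain` field is an illustrative ground-truth scheme, not a standard; real pipelines often score with stricter matchers or an LLM judge.

```python
# A minimal offline evaluation loop: labeled cases in, pass/fail out.
# `rag_answer` is a stand-in for your pipeline; the dataset format is assumed.
import json

def run_offline_eval(rag_answer, dataset_path="eval_cases.jsonl"):
    """Compare pipeline outputs against ground-truth expectations."""
    with open(dataset_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]

    failures = []
    for case in cases:
        answer = rag_answer(case["question"])
        # Ground truth here is a list of strings the answer must contain
        # verbatim, e.g. the correct route and function name for the
        # webhook-validation endpoint.
        missing = [s for s in case["must_contain"] if s not in answer]
        if missing:
            failures.append({"question": case["question"], "missing": missing})

    print(f"{len(cases) - len(failures)}/{len(cases)} cases passed")
    return failures
```

One labeled case asking for the webhook-validation endpoint, with the correct route and function name in `must_contain`, is enough to flag the fragment problem the moment it appears.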