LangChain, 500-character chunks, 50-character overlap, recursive splitting. Technical documentation in the knowledge base.
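For reference, here is the chunking mechanic in miniature. This is a simplified stand-in, not LangChain's implementation: the real recursive splitter also prefers paragraph and sentence boundaries, but a fixed-size sliding window shows what `chunk_size=500` and `chunk_overlap=50` actually do.

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size splitting with overlap -- a simplified sketch of the
    configuration described above (recursive splitting additionally
    respects separators like newlines)."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 1200
chunks = split_with_overlap(doc)
# Every chunk is at most 500 chars; consecutive chunks share 50 chars.
```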
I asked it to return a Flask endpoint for validating webhooks.
It returned code with the wrong function name, the wrong endpoint, and a missing line.
So I ran the diagnostic:
- I started with the generation layer: did the model have the right context and hallucinate anyway? No. The context it received was itself wrong. That ruled out generation.
- I checked retrieval. The retriever made the correct semantic match. Right section, right documentation. That ruled out retrieval.
- That left chunking. I went back to the indexed chunks. The function was split across three separate chunks. The model never received a complete unit of code. Just fragments.
Good retrieval. Wrong context. Broken chunking.
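The failure reproduces in miniature. The Flask handler below is hypothetical, and the chunk size is shrunk for illustration, but the mechanism is the same: a character-count splitter has no notion of code structure, so no single chunk contains the whole function.

```python
# Hypothetical webhook-validation endpoint, stored as indexed text.
source = '''@app.route("/webhook", methods=["POST"])
def validate_webhook():
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)
    return "", 204
'''

# Naive character splitting, tiny chunk size to make the point visible.
chunk_size, overlap = 120, 20
step = chunk_size - overlap
chunks = [source[i:i + chunk_size] for i in range(0, len(source), step)]

# No chunk contains the complete function. The retriever can match any
# fragment semantically, but the generator never sees a whole unit of code.
complete = [c for c in chunks if source.strip() in c]
```

With real 500-character chunks the effect is the same once a function body crosses a chunk boundary, just harder to spot.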
I caught this because I was running offline evaluation. I had ground truth, a labeled dataset, and an eval pipeline comparing outputs against expected answers. That setup surfaces this kind of failure immediately.
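A minimal sketch of what that offline eval loop looks like. The dataset entries, the stubbed system, and the exact-match metric are all illustrative; real pipelines for documentation answers usually score with semantic similarity or an LLM judge instead.

```python
def evaluate(rag_fn, dataset):
    """Compare system outputs against labeled ground truth.

    rag_fn:  callable query -> answer (the system under test)
    dataset: list of {"query": ..., "expected": ...} pairs
    Returns per-example results plus an aggregate pass rate.
    """
    results = []
    for example in dataset:
        answer = rag_fn(example["query"])
        # Exact match is the simplest possible metric; swap in a
        # fuzzier scorer for free-form answers.
        results.append({
            "query": example["query"],
            "passed": answer.strip() == example["expected"].strip(),
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return results, pass_rate

# Usage with a stubbed system (illustrative data):
dataset = [
    {"query": "How do I validate a webhook?", "expected": "use hmac.compare_digest"},
    {"query": "Default port?", "expected": "5000"},
]
stub = {"How do I validate a webhook?": "use hmac.compare_digest",
        "Default port?": "8080"}
results, pass_rate = evaluate(lambda q: stub[q], dataset)
# The failing example is exactly what surfaces a bug like the chunking one.
```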
Four stages. That's how many points in a RAG system's lifecycle where you need evaluation: development, stress testing, production monitoring, regression testing.
The diagnostic I ran above is stage one. Most teams never build stages two, three, or four.
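Stage four, regression testing, can be as small as a gate in CI. A sketch under my own naming and threshold; the point is that the offline eval from stage one gets rerun on every change and compared against a recorded baseline.

```python
def gate(current_pass_rate: float, baseline_pass_rate: float,
         tolerance: float = 0.02) -> bool:
    """Regression gate: block a deploy when the eval pass rate falls
    more than `tolerance` below the last known-good baseline.
    The tolerance value is illustrative."""
    return current_pass_rate >= baseline_pass_rate - tolerance

# gate(0.91, 0.92) -> small dip within tolerance: ship
# gate(0.80, 0.92) -> real regression: block
```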
- Offline evaluation tells you what's broken now.
- Production monitoring (online evaluation) tells you when something new breaks.
Both are necessary. Skipping either means shipping a system to production that isn't robust.
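The online side has no ground truth, so it watches proxy signals instead. One common signal, sketched here with hypothetical names and an illustrative threshold: flag production requests whose best retrieval similarity is low, which often means index drift or a new query pattern the knowledge base doesn't cover. Real monitoring would also track latency, groundedness, and user feedback.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def monitor(query_vec: list[float], retrieved_vecs: list[list[float]],
            threshold: float = 0.75) -> dict:
    """Online check: flag a request when its best retrieval score is
    below the threshold -- a signal that something new broke."""
    best = max(cosine(query_vec, v) for v in retrieved_vecs)
    return {"best_score": best, "flagged": best < threshold}
```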