I built a RAG system to reproduce a failure pattern I'd seen developers describe repeatedly.
LangChain, 500-character chunks, 50-char overlap, recursive splitting. Technical documentation in the knowledge base.
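The indexing setup, for reference (the import path assumes a recent LangChain install; `docs_markdown` is a stand-in for whatever raw docs you load):

```python
# Indexing config used for the reproduction.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # 500-character chunks
    chunk_overlap=50,  # 50-character overlap
)
chunks = splitter.split_text(docs_markdown)  # docs_markdown: your raw documentation text
```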
I asked it to return a Flask endpoint for validating webhooks.
It returned code with the wrong function name, the wrong endpoint, and a missing line.
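To see why that matters: a typical webhook-validation endpoint runs well past 500 characters of source. This one is hypothetical, invented for illustration; every name in it is mine, not from the indexed docs:

```python
# Hypothetical endpoint of the kind described in the docs. At roughly
# 700 characters of source, it cannot fit inside a single 500-char chunk.
import hmac
import hashlib

from flask import Flask, request, abort

app = Flask(__name__)
SECRET = b"replace-with-your-webhook-secret"  # placeholder secret

@app.route("/webhooks/validate", methods=["POST"])
def validate_webhook():
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(SECRET, request.get_data(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)  # signature mismatch: reject the payload
    return {"status": "ok"}
```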
So I ran the diagnostic:
- I started with the generation layer: did the model have the right context and hallucinate anyway? It didn't. The context it received was already wrong, so generation wasn't the culprit. That ruled it out.
- I checked retrieval. The retriever made the correct semantic match. Right section, right documentation. That ruled out retrieval.
- That left chunking. I went back to the indexed chunks (see the sketch below). The function was split across three separate chunks. The model never received a complete unit of code. Just fragments.
Good retrieval. Wrong context. Broken chunking.
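The check itself is trivial: pull the chunks indexed for that section and see whether any single one holds the whole function. A sketch, with `vectorstore` and the query string as placeholders for whatever you indexed with:

```python
# Inspect the top retrieved chunks for a complete code unit.
hits = vectorstore.similarity_search("Flask webhook validation endpoint", k=5)

for i, doc in enumerate(hits):
    text = doc.page_content
    has_def = "def " in text        # does the chunk contain the function signature?
    has_return = "return" in text   # ...and its return statement?
    print(f"chunk {i}: def={has_def} return={has_return} len={len(text)}")

# In my case no single chunk had both: the function spanned three chunks.
```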
I caught this because I was running offline evaluation: ground truth, a labeled dataset, and an eval pipeline comparing outputs against expected answers. That setup surfaces this kind of failure immediately.
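The pipeline doesn't need to be fancy. A minimal sketch of the comparison loop; the dataset format, `rag_answer`, and the substring check are my stand-ins, not a specific framework:

```python
# Minimal offline eval loop over a labeled dataset.
import json

def run_offline_eval(dataset_path: str) -> float:
    with open(dataset_path) as f:
        cases = json.load(f)  # [{"question": ..., "expected": ...}, ...]
    passed = 0
    for case in cases:
        answer = rag_answer(case["question"])  # rag_answer: your RAG entry point
        if case["expected"] in answer:  # substring match; swap in a real scorer
            passed += 1
        else:
            print(f"FAIL: {case['question']!r}")
    return passed / len(cases)  # pass rate against ground truth
```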
Four stages. That's how many points in a RAG system's lifecycle need evaluation: development, stress testing, production monitoring, regression testing.
The diagnostic I ran above is stage one. Most teams never build stages two, three, or four.
- Offline evaluation tells you what's broken now.
- Production monitoring (online evaluation) tells you when something new breaks.
Both are necessary. Skipping either means shipping a system to production that isn't robust.
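Online evaluation can be as cheap as flagging weak retrievals on live traffic. A sketch; `retrieve_with_scores`, `generate`, and the 0.75 threshold are assumptions for illustration:

```python
# Online check: flag requests whose best retrieval score is weak.
import logging

logger = logging.getLogger("rag.monitoring")

def answer_with_monitoring(question: str) -> str:
    docs_and_scores = retrieve_with_scores(question)  # [(doc, score), ...]
    best_score = max(score for _, score in docs_and_scores)
    if best_score < 0.75:
        # Low retrieval confidence: likely a content gap or a chunking problem.
        logger.warning("weak retrieval (%.2f) for %r", best_score, question)
    return generate(question, [doc for doc, _ in docs_and_scores])
```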