LangChain, 500-character chunks, 50-character overlap, recursive splitting. Technical documentation in the knowledge base.
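For reference, here is the chunking mechanic in miniature. This is a simplified stand-in, not LangChain's implementation: the real recursive splitter also prefers paragraph and sentence boundaries, but a fixed-size sliding window shows what `chunk_size=500` and `chunk_overlap=50` actually do.

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size splitting with overlap -- a simplified sketch of the
    configuration described above (recursive splitting additionally
    respects separators like newlines)."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 1200
chunks = split_with_overlap(doc)
# Every chunk is at most 500 chars; consecutive chunks share 50 chars.
```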
I asked it to return a Flask endpoint for validating webhooks.
It returned code with the wrong function name, the wrong endpoint, and a missing line.
So I ran the diagnostic:
- I started with the generation layer: did the model have the right context and hallucinate anyway? No. The context it received was itself wrong. That ruled out generation.
- I checked retrieval. The retriever made the correct semantic match. Right section, right documentation. That ruled out retrieval.
- That left chunking. I went back to the indexed chunks. The function was split across three separate chunks. The model never received a complete unit of code. Just fragments.
Good retrieval. Wrong context. Broken chunking.
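The failure reproduces in miniature. The Flask handler below is hypothetical, and the chunk size is shrunk for illustration, but the mechanism is the same: a character-count splitter has no notion of code structure, so no single chunk contains the whole function.

```python
# Hypothetical webhook-validation endpoint, stored as indexed text.
source = '''@app.route("/webhook", methods=["POST"])
def validate_webhook():
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)
    return "", 204
'''

# Naive character splitting, tiny chunk size to make the point visible.
chunk_size, overlap = 120, 20
step = chunk_size - overlap
chunks = [source[i:i + chunk_size] for i in range(0, len(source), step)]

# No chunk contains the complete function. The retriever can match any
# fragment semantically, but the generator never sees a whole unit of code.
complete = [c for c in chunks if source.strip() in c]
```

With real 500-character chunks the effect is the same once a function body crosses a chunk boundary, just harder to spot.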
I caught this because I was running offline evaluation. I had ground truth, a labeled dataset, and an eval pipeline comparing outputs against expected answers. That setup surfaces this kind of failure immediately.
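A minimal sketch of what that offline eval loop looks like. The dataset entries, the stubbed system, and the exact-match metric are all illustrative; real pipelines for documentation answers usually score with semantic similarity or an LLM judge instead.

```python
def evaluate(rag_fn, dataset):
    """Compare system outputs against labeled ground truth.

    rag_fn:  callable query -> answer (the system under test)
    dataset: list of {"query": ..., "expected": ...} pairs
    Returns per-example results plus an aggregate pass rate.
    """
    results = []
    for example in dataset:
        answer = rag_fn(example["query"])
        # Exact match is the simplest possible metric; swap in a
        # fuzzier scorer for free-form answers.
        results.append({
            "query": example["query"],
            "passed": answer.strip() == example["expected"].strip(),
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return results, pass_rate

# Usage with a stubbed system (illustrative data):
dataset = [
    {"query": "How do I validate a webhook?", "expected": "use hmac.compare_digest"},
    {"query": "Default port?", "expected": "5000"},
]
stub = {"How do I validate a webhook?": "use hmac.compare_digest",
        "Default port?": "8080"}
results, pass_rate = evaluate(lambda q: stub[q], dataset)
# The failing example is exactly what surfaces a bug like the chunking one.
```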
Four stages. That's how many points in a RAG system's lifecycle where you need evaluation: development, stress testing, production monitoring, regression testing.
The diagnostic I ran above is stage one. Most teams never build stages two, three, or four.
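Stage four, regression testing, can be as small as a gate in CI. A sketch under my own naming and threshold; the point is that the offline eval from stage one gets rerun on every change and compared against a recorded baseline.

```python
def gate(current_pass_rate: float, baseline_pass_rate: float,
         tolerance: float = 0.02) -> bool:
    """Regression gate: block a deploy when the eval pass rate falls
    more than `tolerance` below the last known-good baseline.
    The tolerance value is illustrative."""
    return current_pass_rate >= baseline_pass_rate - tolerance

# gate(0.91, 0.92) -> small dip within tolerance: ship
# gate(0.80, 0.92) -> real regression: block
```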
- Offline evaluation tells you what's broken now.
- Production monitoring (online evaluation) tells you when something new breaks.
Both are necessary. Skipping either means shipping a system to production that isn't robust.
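The online side has no ground truth, so it watches proxy signals instead. One common signal, sketched here with hypothetical names and an illustrative threshold: flag production requests whose best retrieval similarity is low, which often means index drift or a new query pattern the knowledge base doesn't cover. Real monitoring would also track latency, groundedness, and user feedback.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def monitor(query_vec: list[float], retrieved_vecs: list[list[float]],
            threshold: float = 0.75) -> dict:
    """Online check: flag a request when its best retrieval score is
    below the threshold -- a signal that something new broke."""
    best = max(cosine(query_vec, v) for v in retrieved_vecs)
    return {"best_score": best, "flagged": best < threshold}
```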