The n8n RAG Template Is Incredible. I Extended It With Document Processing So My Agent Can Read Actual Documents. 🔥


The community's 'RAG Chatbot' template is amazing, but it assumes documents are already text. My client has 200 PDF contracts, so I extended the template to parse them first.

THE BASE TEMPLATE LIMITATION:

Popular RAG templates expect clean text input. Text file → Embed → Store in Qdrant. Perfect IF your documents are already .txt files.

CLIENT'S REALITY:

200 PDF contracts. Scanned images. Multi-column layouts. Tables. Signatures. Stamps. Complex formatting. Can't just 'read file' - need actual parsing.

THE PROBLEM:

Tried feeding the PDFs straight into the embeddings step. Got garbage: partial text, wrong reading order, missing tables, mangled formatting.

The RAG agent gave terrible answers because the vector store contained broken text.

THE EXTENSION (DOCUMENT PREPROCESSING BRANCH):

ORIGINAL TEMPLATE FLOW:

Text File → Recursive Text Splitter → Embeddings Model → Qdrant Vector Store → RAG Agent

EXTENDED FLOW:

Google Drive Trigger (new PDF in folder) → Document Parser → Clean & Structure → Recursive Text Splitter → Embeddings Model → Qdrant Vector Store → RAG Agent

THE NEW NODES:

NODE 1 - GOOGLE DRIVE TRIGGER:

Watches 'Contracts to Process' folder. Triggers on new file.

NODE 2 - DOCUMENT PARSER NODE:

Handles scanned images, complex layouts, tables. Returns clean markdown with proper structure.

NODE 3 - SET NODE (CLEAN & STRUCTURE):

Takes parsed markdown, adds metadata (filename, date, document type), formats for chunking.
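
To make the Clean & Structure step concrete, here is a minimal Python sketch of the same idea outside n8n. It is not the actual Set node configuration: `ParsedDocument`, `clean_and_structure`, and the default `document_type` value are hypothetical names chosen for illustration, and the whitespace cleanup is just one assumption about what "formats for chunking" might involve.

```python
from dataclasses import dataclass, field
from datetime import date
import re


@dataclass
class ParsedDocument:
    """Hypothetical container: cleaned text plus the metadata stored with each chunk."""
    text: str
    metadata: dict = field(default_factory=dict)


def clean_and_structure(markdown: str, filename: str, processed_on: date,
                        document_type: str = "contract") -> ParsedDocument:
    # Normalize Windows line endings and strip trailing whitespace per line.
    lines = [line.rstrip() for line in markdown.replace("\r\n", "\n").split("\n")]
    text = "\n".join(lines)
    # Collapse runs of 3+ newlines into a single blank line so the
    # splitter sees consistent paragraph breaks.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return ParsedDocument(
        text=text,
        metadata={
            "filename": filename,
            "date": processed_on.isoformat(),
            "document_type": document_type,
        },
    )
```

Keeping the metadata separate from the text means the filename and document type can be stored alongside each vector instead of being embedded into the chunk itself.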

NODE 4 - RECURSIVE TEXT SPLITTER:

Same as the original template. Chunks the cleaned text into pieces sized for the embeddings model.

The output then connects to the template's existing Qdrant and RAG agent nodes.
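
For anyone curious what the recursive splitting in Node 4 does conceptually, here is a rough Python sketch. It is not the n8n node's actual implementation: the function name, the default `chunk_size`, and the separator order are assumptions, and it leaves out chunk overlap entirely. The idea is to try the coarsest separator first (paragraphs), and only fall back to finer ones (lines, then words) when a piece is still too large.

```python
def recursive_split(text: str, chunk_size: int = 1000,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring coarse separators over fine ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Piece still fits: keep merging into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    # Drop whitespace-only chunks so nothing empty gets embedded.
    return [c for c in chunks if c.strip()]
```

With clean markdown from the parser, most cuts land on paragraph boundaries, which is a big part of why parsing first improves retrieval quality.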

WHAT THIS HANDLES NOW:

- Scanned contracts (OCR built in)