The community's 'RAG Chatbot' template: amazing. But it assumes documents are already text. My client has 200 PDF contracts, so I extended the template to parse them first.
THE BASE TEMPLATE LIMITATION:
Popular RAG templates expect clean text input. Text file → Embed → Store in Qdrant. Perfect IF your documents are already .txt files.
CLIENT'S REALITY:
200 PDF contracts. Scanned images. Multi-column layouts. Tables. Signatures. Stamps. Complex formatting. Can't just 'read file' - need actual parsing.
THE PROBLEM:
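Tried feeding the PDFs straight into the embeddings step. Got garbage: partial text, wrong reading order, missing tables, corrupted formatting. Scanned pages have no text layer at all, so there was nothing usable to extract from them.
The RAG agent gave terrible answers because the vector store contained broken text.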
THE EXTENSION (DOCUMENT PREPROCESSING BRANCH):
ORIGINAL TEMPLATE FLOW:
Text File → Recursive Text Splitter → Embeddings Model → Qdrant Vector Store → RAG Agent
EXTENDED FLOW:
Google Drive Trigger (new PDF in folder) → Document Parser → Clean & Structure → Recursive Text Splitter → Embeddings Model → Qdrant Vector Store → RAG Agent
THE NEW NODES:
NODE 1 - GOOGLE DRIVE TRIGGER:
Watches 'Contracts to Process' folder. Triggers on new file.
NODE 2 - DOCUMENT PARSER NODE:
Handles scanned images, complex layouts, tables. Returns clean markdown with proper structure.
NODE 3 - SET NODE (CLEAN & STRUCTURE):
Takes parsed markdown, adds metadata (filename, date, document type), formats for chunking.
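For anyone doing the cleanup in a Code node instead of a Set node, here is a minimal Python sketch of what "clean" could mean. The function name and the specific rules (strip trailing whitespace, drop "Page N of M" lines, collapse blank-line runs) are my assumptions for illustration, not what the template ships with. Adjust the rules to whatever your parser actually emits.

```python
import re

# Hypothetical rule: lines that are only a page footer such as "Page 3 of 10".
PAGE_FOOTER = re.compile(r"^page\s+\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def clean_markdown(markdown: str) -> str:
    """Normalize parsed markdown before it goes to the text splitter."""
    # Strip trailing whitespace from every line.
    lines = [line.rstrip() for line in markdown.splitlines()]
    # Drop page-footer lines (assumed pattern, see PAGE_FOOTER above).
    lines = [line for line in lines if not PAGE_FOOTER.match(line.strip())]
    text = "\n".join(lines)
    # Collapse runs of 3+ newlines into a single blank line so paragraph
    # boundaries stay visible to the splitter without wasting chunk space.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Keeping paragraph breaks as exactly one blank line matters because the splitter in the next node cuts on those breaks first.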
NODE 4 - RECURSIVE TEXT SPLITTER:
Same as original template. Chunks the cleaned text into pieces small enough for the embeddings model.
The output then connects to the existing Qdrant and RAG agent nodes.
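If you have not looked inside a recursive splitter before, the idea is simple: try the coarsest separator first (paragraph breaks) and only fall back to finer ones (lines, sentences, words) when a piece is still too big. A rough Python sketch of that idea follows. It is not the node's actual implementation, the default `chunk_size` and separator list are assumptions, and it leaves out chunk overlap, which real splitters usually support.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int = 1000,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring parts while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # A single part is too large: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    # Drop chunks that ended up whitespace-only.
    return [chunk for chunk in chunks if chunk.strip()]
```

This is why clean markdown from the parser matters: if paragraph and table boundaries are intact, the coarse separators do most of the work and clauses rarely get cut in half.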
WHAT THIS HANDLES NOW:
- Scanned contracts (OCR built in)
- Multi-column PDFs (proper reading order)
- Documents with tables (preserves structure)
- Signatures and stamps (recognized as such, not turned into noise text)
- Word documents (converts automatically)
- Images of documents
THE RESULTS:
- 200 contracts processed: 3 hours (vs 2 weeks manual text extraction)
- RAG agent accuracy: 94% on contract queries
- Handles complex layouts: tables, signatures, stamps
- Vector store quality: massive improvement (chunks now hold complete text in the right order instead of fragments)
SPECIFIC EXAMPLE:
Query: 'Which contracts have auto-renewal clauses?'
Before extension: Agent couldn't answer (PDFs unreadable)
After extension: Agent found 23 contracts, listed them with clause details
CONFIGURATION DETAILS:
Document parser: Parse mode with LLM enhancement for complex layouts. Returns markdown format.
Metadata: Adds contract_name, signed_date, parties, document_id to every chunk. Critical for RAG retrieval: the agent can tell which contract a chunk came from and cite it.
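To make the metadata part concrete, here is a small Python sketch of attaching those four fields to each chunk before it goes into the vector store. The field names come from the config above. Everything else is an assumption for illustration: the function names, the extra `chunk_index` field, deriving `document_id` from a hash of the contract name, and the example values in the usage note.

```python
import hashlib
from typing import Dict, List


def build_payloads(
    chunks: List[str],
    contract_name: str,
    signed_date: str,
    parties: List[str],
) -> List[Dict]:
    """Pair every chunk with the metadata the RAG agent needs for retrieval."""
    # Assumption: a stable ID derived from the contract name. Use whatever
    # ID your own system already assigns if you have one.
    document_id = hashlib.sha256(contract_name.encode("utf-8")).hexdigest()[:16]
    return [
        {
            "text": chunk,
            "metadata": {
                "contract_name": contract_name,
                "signed_date": signed_date,
                "parties": list(parties),
                "document_id": document_id,
                "chunk_index": index,  # hypothetical extra field
            },
        }
        for index, chunk in enumerate(chunks)
    ]


def filter_by_party(payloads: List[Dict], party: str) -> List[Dict]:
    """Keep only chunks from contracts that involve the given party."""
    return [p for p in payloads if party in p["metadata"]["parties"]]
```

Usage with made-up values: `build_payloads(["Clause A", "Clause B"], "acme_msa.pdf", "2023-01-15", ["Acme", "Globex"])` gives two records sharing one `document_id`, and `filter_by_party(..., "Globex")` keeps both. The same kind of filter, applied in the vector store, is what lets the agent list which contracts an answer came from.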
THE LESSON:
RAG templates are incredible. But real documents need preprocessing. Adding document parsing transforms RAG from demo to production.
TEMPLATE:
Complete extended workflow. Document preprocessing branch connects seamlessly to any RAG template. Import, configure folder, deploy.
How did you solve the document format problem in your RAG workflows?