# 🔒 Build Your Own Private Multimodal RAG System: The Complete Stack
## The Problem Nobody Talks About
When you upload documents to an AI service, you're placing a LOT of trust in that company:
- Trust that they'll keep your documents secure
- Trust that they won't use them to train their models
- Trust that they won't end up exposed in a data breach
For everyday documents, that's fine. But for **sensitive documents** - legal, medical, financial, client docs - that's a much bigger ask.
**For these, you need full control.**
---
## 🎯 The Solution: Fully Local Multimodal RAG
Today I'm sharing the complete stack we use for processing private documents with AI - **no external APIs, fully air-gapped, and available on your local network**.
### What is Multimodal RAG?
Retrieval across a knowledge base with **multiple data types**:
- Text documents
- PDFs with embedded images/tables
- Audio files (meeting transcripts)
- Videos
The magic? When you process a PDF with an embedded diagram, that diagram can be **retrieved and displayed** in your chat. Most AI agents only return text - we're going beyond that.
---
## ๐Ÿ› ๏ธ The Tech Stack
| Component | Purpose | Where it Runs |
|-----------|---------|---------------|
| **Docling** | Document processing (IBM open-source) | Local Docker / Modal GPU |
| **Qdrant** | Vector database | Local Docker |
| **Static File Server** | Serve extracted images | Local Docker (nginx) |
| **Ollama** | Local LLM & VLM | Local Docker (GPU) |
| **N8N** | Orchestration | Local Docker |
| **Modal** | Scalable GPU processing | Cloud (when needed) |
---
## 📄 Why Docling is a Game-Changer
Docling is IBM's open-source document processing library. Feed it:
- PDFs
- Word docs
- PowerPoint presentations
- Images
- Audio files
And it outputs **clean structured markdown or JSON** your agent can search over.
### This isn't basic text extraction:
✅ Recognizes headers and document structure
✅ Extracts tables accurately
✅ Pulls out diagrams as images
✅ Text in diagrams is searchable
✅ Maintains semantic structure (bullet points, lists)
### Two Processing Pipelines:
**1. Standard Pipeline (Recommended)**
- Specialized non-generative AI models
- No hallucinations - copies text verbatim
- Fast and reliable
- Works offline
**2. VLM Pipeline**
- Uses Vision Language Models (like deepseek-ocr)
- Better for complex layouts
- Can describe images with AI
- Requires more GPU power
---
## 🚀 Two Deployment Options
### Option 1: Fully Local (Air-gapped)
Everything runs on your hardware via Docker Compose. Zero cloud dependencies.
**What you get:**
- Docling for document processing
- Qdrant for vector search
- Nginx for serving extracted images
- Ollama for local AI models
- N8N for workflow orchestration
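As a starting point, the Compose file might look something like the sketch below. Treat the image tags, ports, and volume paths as assumptions to adapt to whichever starter kit you use - in particular, verify the docling-serve image name against its GitHub releases.

```yaml
# Sketch of a local stack - image names, ports, and paths are illustrative.
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes: ["./qdrant_data:/qdrant/storage"]
  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia    # requires the NVIDIA container toolkit
              count: all
              capabilities: [gpu]
  docling:
    image: quay.io/docling-project/docling-serve  # assumed image name - verify
    ports: ["5001:5001"]
  nginx:
    image: nginx:alpine
    ports: ["8080:80"]
    volumes: ["./extracted_images:/usr/share/nginx/html:ro"]
  n8n:
    image: n8nio/n8n
    ports: ["5678:5678"]
```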
**Pros:**
- Complete privacy
- No recurring costs
- Works offline
**Cons:**
- Requires GPU hardware upfront ($1,600-$2,000 for RTX 4090/5090)
- Limited concurrent users
---
### Option 2: Modal for Scalable Processing
When you need GPU power without buying hardware, Modal.com is perfect for burst processing.
**How it works:**
- Deploy Docling + Ollama on Modal's A10G GPU (24GB VRAM)
- Process documents via API
- Pay only for what you use
- Scales automatically
**Pros:**
- No hardware investment
- Scales automatically
- Pay per use
**Cons:**
- Documents leave your network briefly
- Per-request costs
---
## ๐Ÿ–ผ๏ธ The Multimodal Magic: Serving Images
Here's where it gets cool. When Docling extracts images from PDFs:
1. **Images saved to a shared folder**
2. **Nginx serves them via HTTP**
3. **URLs included in the vector store**
Now when your agent retrieves a chunk about "cabinet dimensions", it can **show the actual diagram** in the chat!
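Concretely, each chunk's payload in the vector store can carry the HTTP URLs of the images Docling extracted from that section. This is a minimal sketch - the field names and the server address are illustrative, not a fixed schema:

```python
# Build a vector-store payload that links a text chunk to its extracted images.
# Field names and the base URL are illustrative; adapt them to your own schema.

NGINX_BASE_URL = "http://192.168.1.50:8080"  # assumed address of the static file server

def build_chunk_payload(chunk_text, doc_name, page, image_files):
    """Attach HTTP URLs for extracted images so the agent can render them in chat."""
    return {
        "text": chunk_text,
        "source": doc_name,
        "page": page,
        "image_urls": [f"{NGINX_BASE_URL}/{doc_name}/{img}" for img in image_files],
    }

payload = build_chunk_payload(
    "Figure 3 shows the cabinet dimensions...",
    doc_name="install-guide",
    page=12,
    image_files=["figure_3.png"],
)
```

When the agent retrieves this chunk, it can drop `image_urls` straight into a markdown image tag in its reply.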
---
## 🧠 Context Expansion: The Secret Sauce
The biggest reason RAG agents fail? **They can't see the big picture.**
They retrieve isolated chunks without understanding document structure.
### The Problem:
**Query:** "Is tennis elbow covered under this policy?"
**Retrieved chunk:** "Tennis elbow treatment includes..."
**Agent thinks:** "Yes, it's covered!" ❌
**Reality:** That chunk was under "Policy Exclusions" heading 😬
### The Solution: Document Hierarchy
When ingesting documents, we extract the **full structure** - headings, sections, and their relationships.
Now the agent can:
1. Retrieve candidate chunks
2. Fetch the document hierarchy
3. Expand context to include the **parent section**
4. Generate accurate answers
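Section expansion can be sketched in a few lines, assuming each chunk was stored with the heading path it sits under. The chunk schema here (`text`, `heading_path`) is an assumption for illustration, not the workflow's exact data model:

```python
# Sketch of section expansion: given a retrieved chunk, pull in every chunk
# that shares its parent heading so the agent sees the surrounding section.
# The chunk schema ("text", "heading_path") is an assumption for illustration.

chunks = [
    {"text": "Covered treatments include physiotherapy...", "heading_path": ["Policy", "Coverage"]},
    {"text": "Tennis elbow treatment includes...",          "heading_path": ["Policy", "Exclusions"]},
    {"text": "Cosmetic procedures are not covered.",        "heading_path": ["Policy", "Exclusions"]},
]

def expand_to_section(retrieved, all_chunks):
    """Return every chunk under the same heading path, preserving document order."""
    section = retrieved["heading_path"]
    return [c for c in all_chunks if c["heading_path"] == section]

hit = chunks[1]  # the isolated "tennis elbow" chunk
section = expand_to_section(hit, chunks)
# The expanded context carries the "Exclusions" heading path, so the agent
# can see the chunk lives under exclusions, not coverage.
```

With the heading path attached, the "tennis elbow" chunk can no longer masquerade as coverage text.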
### Expansion Strategies:
| Strategy | When to Use |
|----------|-------------|
| **Full Document** | Small docs (<10 pages) |
| **Neighbor Chunks** | Quick context boost |
| **Section Expansion** | Structured documents |
| **Parent Expansion** | Need broader context |
| **Agentic Expansion** | Multiple sections needed |
---
## 🔧 The Complete Ingestion Pipeline
1. **File Trigger** → Watch a folder for new documents
2. **Docling Processing** → OCR + structure extraction
3. **Hierarchy Extraction** → Map headings to chunks
4. **Smart Chunking** → Split by markdown, then by size
5. **Image Extraction** → Move to static server
6. **Embedding Generation** → Create vectors (Nomic)
7. **Vector Store** → Upsert to Qdrant
8. **Record Manager** → Track what's been processed
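The record-manager step can be as simple as hashing each file and skipping ones you've already seen. A stdlib sketch - storing hashes in a JSON file is an assumption here; production setups typically use a database:

```python
# Minimal record manager: hash file contents and skip files already processed.
# Storing hashes in a JSON file is an assumption; swap in a DB for production.
import hashlib
import json
import pathlib

RECORD_FILE = pathlib.Path("processed.json")

def file_hash(path):
    """SHA-256 of the file contents, so edits re-trigger processing."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def needs_processing(path):
    seen = json.loads(RECORD_FILE.read_text()) if RECORD_FILE.exists() else {}
    return seen.get(str(path)) != file_hash(path)

def mark_processed(path):
    seen = json.loads(RECORD_FILE.read_text()) if RECORD_FILE.exists() else {}
    seen[str(path)] = file_hash(path)
    RECORD_FILE.write_text(json.dumps(seen))
```

Hashing contents (rather than tracking filenames) means an edited document gets re-ingested while an untouched one is skipped.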
### Key Insights:
- **Use OCR** - Native PDF extraction doesn't preserve headings
- **Split by Markdown first** - Then recursive character split
- **Merge tiny chunks** - They pollute the vector store
- **Add contextual snippets** - "This chunk is from Section 2.3..."
- **Track page numbers** - Critical for traceability
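The "split by markdown first, then by size, then merge tiny chunks" advice can be sketched like this. The thresholds and the merge strategy are illustrative assumptions, not the workflow's exact settings:

```python
# Markdown-first chunking sketch: split on headings, enforce a max size,
# then fold undersized chunks into a neighbor. Thresholds are illustrative.
import re

MAX_CHARS = 1000   # upper bound before a character-level split kicks in
MIN_CHARS = 200    # chunks smaller than this get merged with a neighbor

def split_markdown(md):
    # Split at heading lines, keeping each heading with its section text.
    parts = re.split(r"(?m)^(?=#{1,6} )", md)
    return [p.strip() for p in parts if p.strip()]

def split_by_size(chunk, max_chars=MAX_CHARS):
    # Prefer splitting at a paragraph boundary; fall back to a hard cut.
    if len(chunk) <= max_chars:
        return [chunk]
    cut = chunk.rfind("\n\n", 0, max_chars)
    if cut == -1:
        cut = max_chars
    return [chunk[:cut].strip()] + split_by_size(chunk[cut:].strip(), max_chars)

def merge_tiny(chunks, min_chars=MIN_CHARS):
    # Carry undersized chunks forward so a lone heading joins its section body.
    merged, carry = [], ""
    for c in chunks:
        c = (carry + "\n\n" + c).strip() if carry else c
        carry = ""
        if len(c) < min_chars:
            carry = c
        else:
            merged.append(c)
    if carry:
        merged[-1] = merged[-1] + "\n\n" + carry if merged else carry
    return merged

def chunk_document(md):
    sections = split_markdown(md)
    sized = [piece for s in sections for piece in split_by_size(s) if piece]
    return merge_tiny(sized)
```

Carrying tiny chunks *forward* (rather than appending them backward) keeps a heading attached to the section it introduces.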
---
## 💻 Hardware Requirements
For running local VLMs comfortably:
| GPU | VRAM | Max Model Size | Price |
|-----|------|----------------|-------|
| RTX 4090 | 24GB | ~35B params | ~$1,600 |
| RTX 5090 | 32GB | ~40B params | ~$2,000 |
| Apple M3 Max | 48GB | ~40B params | ~$3,500 |
**Pro Tip:** You don't need this hardware to BUILD your system. Use Modal or OpenRouter for development, then deploy locally when ready.
---
## 🎬 The Workflow in Action
1. **Drop PDF into watched folder**
2. **Docling processes it** (46 seconds for 112 pages!)
3. **Images extracted to static server**
4. **Chunks embedded and stored in Qdrant**
5. **User asks: "Show me the installation diagram"**
6. **Agent retrieves chunks + fetches hierarchy**
7. **Expands context to full section**
8. **Returns answer WITH embedded images** 🎉
---
## ๐ŸŒ Making it Available on Your Network
To share with your team:
1. **Configure firewall** - Allow inbound connections on required ports
2. **Set static IP** - So the address doesn't change
3. **Update service URLs** - Point to your server
4. **Keep server running** - During office hours at minimum
Your team can then access the chat interface from any device on the network!
---
## 🔑 Key Takeaways
1. **Privacy is possible** - You CAN run powerful AI locally
2. **Docling is incredible** - IBM's gift to document processing
3. **Multimodal matters** - Images in responses = 10x better UX
4. **Context expansion is crucial** - Don't let chunks lose their meaning
5. **Hybrid approach works** - Local for production, Modal for development
---
## 📚 Resources
- **Docling GitHub**: github.com/DS4SD/docling
- **Docling Serve**: github.com/DS4SD/docling-serve
- **Qdrant**: qdrant.tech
- **Ollama**: ollama.com
---
## 🚀 Next Steps
1. Clone the starter kit
2. Run Docker Compose to spin up all services
3. Access Docling UI
4. Drop in your first PDF
5. Ask your agent about it!
**Questions? Drop them below!** 👇
#AI #RAG #Privacy #Docling #LocalAI #N8N #Modal #Qdrant
Hicham Char