🧐 Turn PDFs into Clean, LLM-Ready Data
PDFs lock content into complex layouts, making it difficult for LLMs to process text, tables, and images effectively.
Dolphin is an open-source parsing framework that converts PDFs into structured formats such as Markdown, HTML, LaTeX, and JSON.
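To make "structured formats" concrete, here is a minimal sketch of how one list of parsed elements could be emitted as both Markdown and JSON. The element schema (`type`, `text`, `rows`) and the helper names are assumptions made up for illustration; they are not Dolphin's actual output format or API.

```python
import json

# Hypothetical parsed output: one dict per element, already in reading order.
# This schema is an illustration only, not Dolphin's real format.
elements = [
    {"type": "heading", "text": "Quarterly Report"},
    {"type": "text", "text": "Revenue grew in Q2."},
    {"type": "table", "rows": [["Quarter", "Revenue"], ["Q1", "10"], ["Q2", "12"]]},
]


def table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_markdown(items):
    """Join elements into one Markdown document, keeping their order."""
    parts = []
    for el in items:
        if el["type"] == "heading":
            parts.append("# " + el["text"])
        elif el["type"] == "table":
            parts.append(table_to_markdown(el["rows"]))
        else:
            parts.append(el["text"])
    return "\n\n".join(parts)


markdown = to_markdown(elements)
as_json = json.dumps(elements, indent=2)
```

The same element list feeds both outputs, which is why a structured intermediate form is handy: Markdown goes into an LLM prompt, while JSON suits downstream code.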
🛠️ How It Works
  1. Layout analysis - Detects and sequences elements according to the document's natural reading order.
  2. Parallel parsing - Processes each element with specialized prompts tailored to different content types (text blocks, tables, figures, etc.).
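The two steps above can be sketched roughly as follows. This is a toy illustration, not Dolphin's code: the model calls are replaced by stubs, and the names `PROMPTS`, `Element`, `analyze_layout`, `parse_element`, and `parse_page` are all hypothetical. It only shows the shape of the flow: get elements in reading order, pick a prompt by element type, parse the elements concurrently, and keep the original order.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical per-type prompts; the framework's real prompts are not shown in this post.
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Convert this table to HTML.",
    "figure": "Describe this figure.",
}


@dataclass
class Element:
    order: int  # position in reading order
    kind: str   # "text", "table", "figure", ...
    crop: str   # stand-in for the cropped image region


def analyze_layout(page: str) -> list[Element]:
    """Stage 1 (stub): return the page's elements in reading order.

    A real system would ask the model for this; here a page is faked as
    "kind:content" segments separated by "|".
    """
    elements = []
    for i, segment in enumerate(page.split("|")):
        kind, _, crop = segment.partition(":")
        elements.append(Element(order=i, kind=kind, crop=crop))
    return elements


def parse_element(element: Element) -> dict:
    """Stage 2 (stub): parse one element using the prompt for its type."""
    prompt = PROMPTS.get(element.kind, PROMPTS["text"])
    # Stand-in for a model call that would receive (prompt, crop).
    return {"kind": element.kind, "prompt": prompt, "content": element.crop.strip()}


def parse_page(page: str, max_workers: int = 4) -> list[dict]:
    """Run both stages; elements are parsed concurrently."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so reading order survives.
        return list(pool.map(parse_element, elements))


result = parse_page("text:Title|table:a,b|figure:chart")
```

Because each element is parsed independently once the layout is known, the second stage parallelizes naturally, which is where the speedup in this kind of design comes from.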
🏗️ Key Features
  • Two-stage “analyze-then-parse” pipeline powered by a single vision-language model (VLM)
  • Strong performance on complex document parsing tasks
  • Reading-order-aware element sequencing
  • Specialized prompts for different document elements
  • Efficient parallel parsing for faster results
It's 100% Open Source 🙌🏻
Mišel Čupković