Automating timecard PDF → structured data (OCR + PDFVector + n8n)
Hi everyone 👋

I’m building a SaaS where users currently upload an Excel file that is manually created from daily worker timecard PDFs.

I’ve attached:
- A sample timecard PDF (hard to read, scanned, inconsistent field names)
- The final Excel output I need

Goal

Allow users to upload the PDF directly and automatically extract, per worker:
- Date, Name
- Time In
- Lunch In / Out
- Dinner In / Out
- Wrap Time
- Position
- Department

Challenges
- Poor OCR quality
- Field names vary across PDFs
- Semi-structured tables
- High accuracy required (payroll data)

Plan
- Use n8n for orchestration
- Use PDFVector for PDF parsing / structured extraction
- Add post-processing to normalize fields

Questions
1. Is PDFVector reliable for row-level timecard extraction, or better as a helper only?
2. Best OCR + extraction approach for scanned timecards?
3. How would you design this pipeline for reliability at scale?

Appreciate any guidance or real-world experience 🙏
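
For reference, a minimal sketch of what the "normalize fields" post-processing step could look like. It assumes the extraction step (PDFVector or any OCR tool) yields one dict per worker row, mapping the raw label as printed on the PDF to the raw cell text. That input shape, the alias list, and the 0.8 similarity cutoff are all assumptions made for illustration, not anything PDFVector or n8n defines. The canonical field names come from the Goal list above.

```python
import difflib
import re
from datetime import datetime

# Canonical columns, taken from the target output described in the post.
CANONICAL_FIELDS = [
    "Date", "Name", "Time In", "Lunch In", "Lunch Out",
    "Dinner In", "Dinner Out", "Wrap Time", "Position", "Department",
]
TIME_FIELDS = {"Time In", "Lunch In", "Lunch Out",
               "Dinner In", "Dinner Out", "Wrap Time"}

# Hypothetical aliases; a real list would be built from actual PDFs.
ALIASES = {"call time": "Time In", "wrap": "Wrap Time",
           "dept": "Department", "role": "Position"}


def _key(label: str) -> str:
    """Lowercase and collapse punctuation/whitespace so labels compare cleanly."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())


_LOOKUP = {_key(f): f for f in CANONICAL_FIELDS}
_LOOKUP.update({_key(a): f for a, f in ALIASES.items()})


def normalize_field(label, cutoff=0.8):
    """Map a raw label to a canonical field, or None if nothing is close enough."""
    key = _key(label)
    if key in _LOOKUP:
        return _LOOKUP[key]
    # Fuzzy fallback to absorb small OCR errors such as "0ut" for "Out".
    close = difflib.get_close_matches(key, list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[close[0]] if close else None


def parse_time(raw):
    """Return the time as 'HH:MM' (24h), or None if no known format matches."""
    text = raw.strip().upper().replace(".", ":")
    for fmt in ("%I:%M %p", "%I:%M%p", "%H:%M", "%H%M"):
        try:
            return datetime.strptime(text, fmt).strftime("%H:%M")
        except ValueError:
            continue
    return None


def normalize_row(raw_row):
    """Return (record, issues); a non-empty issues list means manual review."""
    record, issues = {}, []
    for label, value in raw_row.items():
        field = normalize_field(label)
        if field is None:
            issues.append(f"unknown field: {label!r}")
            continue
        if field in TIME_FIELDS:
            parsed = parse_time(value)
            if parsed is None:
                issues.append(f"unparsable time in {field}: {value!r}")
            value = parsed
        record[field] = value
    return record, issues


record, issues = normalize_row(
    {"NAME": "J. Doe", "TIME-IN": "7:30 AM", "Lunch 0ut": "1300", "Dept.": "Grip"}
)
print(record, issues)
```

Since this is payroll data, the sketch never guesses: any label or time it cannot resolve is reported in `issues` rather than silently filled in, so those rows can be routed to a human check (for example, a separate branch in the n8n workflow).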