Hi everyone 👋 I’m building a SaaS where users currently upload an Excel file that is created by hand from daily worker timecard PDFs.
I’ve attached:
- A sample timecard PDF (hard to read, scanned, inconsistent field names)
- The final Excel output I need
Goal
Allow users to upload the PDF directly and automatically extract, per worker:
- Date
- Name
- Time In
- Lunch In / Out
- Dinner In / Out
- Wrap Time
- Position
- Department
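To make the target concrete, here is a rough sketch of the per-worker record I have in mind. The class name `TimecardRow`, the snake_case field names, and the `missing_fields` helper are my own placeholders, not anything taken from the PDFs or from the Excel template:

```python
from dataclasses import dataclass, asdict
from datetime import date, time
from typing import List, Optional


@dataclass
class TimecardRow:
    """One extracted row per worker per day (field names are placeholders)."""
    work_date: date
    name: str
    time_in: Optional[time] = None
    lunch_in: Optional[time] = None
    lunch_out: Optional[time] = None
    dinner_in: Optional[time] = None
    dinner_out: Optional[time] = None
    wrap_time: Optional[time] = None
    position: str = ""
    department: str = ""

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty, in declaration order."""
        return [key for key, value in asdict(self).items() if value in (None, "")]
```

Keeping unread values as `None` or an empty string, instead of guessing, would let a row with a non-empty `missing_fields()` result go to manual review before it reaches payroll.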
Challenges
- Poor OCR quality
- Field names vary across PDFs
- Semi-structured tables
- High accuracy required (payroll data)
Plan
- Use n8n for orchestration
- Use PDFVector for PDF parsing / structured extraction
- Add post-processing to normalize fields
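For the post-processing step, this is roughly what I am picturing: a minimal sketch using only the Python standard library, independent of n8n or PDFVector. The alias table is made up for illustration (the real labels would have to come from the actual PDFs), and `normalize_label` and `parse_time` are hypothetical helper names:

```python
import difflib
import re
from datetime import time
from typing import Optional

# Hypothetical alias table: label text as it might appear on a timecard,
# mapped to a canonical field name. Real aliases must come from the PDFs.
ALIASES = {
    "date": "date",
    "name": "name",
    "time in": "time_in",
    "call time": "time_in",
    "lunch in": "lunch_in",
    "lunch out": "lunch_out",
    "dinner in": "dinner_in",
    "dinner out": "dinner_out",
    "wrap time": "wrap_time",
    "wrap": "wrap_time",
    "position": "position",
    "department": "department",
    "dept": "department",
}

# Accepts forms like "7:00 AM", "7.30pm", "0700", "19:30".
TIME_RE = re.compile(r"^(\d{1,2})[:.]?(\d{2})\s*(?:([ap])m?)?$", re.IGNORECASE)


def normalize_label(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a raw (possibly OCR-damaged) label to a canonical field name."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    # Fuzzy fallback for small OCR errors; returns None when nothing is close.
    close = difflib.get_close_matches(cleaned, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


def parse_time(raw: str) -> Optional[time]:
    """Parse a clock time; return None instead of guessing on bad input."""
    match = TIME_RE.match(raw.strip())
    if not match:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    meridiem = match.group(3)
    if meridiem:
        if not 1 <= hour <= 12:
            return None
        hour = hour % 12 + (12 if meridiem.lower() == "p" else 0)
    if hour > 23 or minute > 59:
        return None
    return time(hour, minute)
```

Both helpers return `None` when they cannot resolve a value, so something like `7:OO` (letter O instead of zero) gets flagged for a human rather than silently corrected. For payroll data that seems safer to me than auto-fixing OCR confusions, but I would like to hear if others handle this differently.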
Questions
- Is PDFVector reliable for row-level timecard extraction, or is it better used only as a helper step?
- Best OCR + extraction approach for scanned timecards?
- How would you design this pipeline for reliability at scale?
Appreciate any guidance or real-world experience 🙏