I am working on extracting structured legal information from judicial decisions (providencias). The dataset is mixed: some PDFs already contain selectable text, while others are scanned images and need OCR.

Our current approach is evidence-first: we upload documents privately, then extract document summaries, page-level snippets (source spans), and structured legal facts. A fact is promoted only when it can be traced back to the original document, page, and fragment, with human review where needed.

I would appreciate advice on the most accurate yet cost-efficient architecture for this. My instinct is to first detect whether a PDF has usable text, use cheap text extraction when possible, run OCR only on scanned pages, then apply a strict structured extraction schema and a second verification step against the source spans.

Are there specific OCR/layout tools, open-source pipelines, model-routing strategies, or benchmark methods you would recommend to maximize legal extraction quality while keeping costs low?
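To make the routing and verification steps concrete, here is a rough sketch of what I have in mind, not a finished implementation. It assumes the per-page text has already been pulled out by some PDF library and is passed in as plain strings. The names `needs_ocr`, `route_pages`, `Fact`, and `is_supported` are hypothetical, and the thresholds (`min_chars`, `min_alnum_ratio`) are guesses that would need tuning against a labeled sample.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Dict


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def needs_ocr(page_text: str, min_chars: int = 200,
              min_alnum_ratio: float = 0.6) -> bool:
    """Heuristic: True when the embedded text layer looks missing or garbled.

    Both thresholds are assumptions, not measured values.
    """
    stripped = "".join(page_text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True  # little or no text layer: probably a scanned page
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio  # mostly junk characters


def route_pages(pages: Dict[int, str]) -> Dict[int, str]:
    """Map each page number to 'text' (cheap extraction) or 'ocr'."""
    return {n: ("ocr" if needs_ocr(t) else "text") for n, t in pages.items()}


@dataclass
class Fact:
    name: str    # schema field, e.g. "case_number"
    value: str   # extracted value
    page: int    # 1-based page the fact was taken from
    quote: str   # source span the extractor claims supports the value


def is_supported(fact: Fact, pages: Dict[int, str],
                 require_value_in_quote: bool = True) -> bool:
    """Second-pass check: the quote must occur on the cited page.

    Matching is exact after normalization. Fuzzy matching for OCR noise is
    deliberately left out of this sketch.
    """
    page_text = normalize(pages.get(fact.page, ""))
    quote = normalize(fact.quote)
    if not quote or quote not in page_text:
        return False
    if require_value_in_quote and normalize(fact.value) not in quote:
        return False  # too strict for reformatted values such as dates
    return True
```

Facts that fail `is_supported` would go to human review instead of being promoted. The exact-match rule is the cheapest possible verifier. My open question is whether OCR noise makes it too strict in practice, in which case a fuzzy or model-based second pass might be needed only for the failures.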