pdftract

History

jedarden fce3a75526 feat(pdftract-4t0jk): implement page_type_string mapping table Implement the page_type_string(class, ocr_succeeded, has_text, has_images) function that maps PageClass to canonical page_type strings for the 6.1 JSON schema per INV-9 stable taxonomy. Mapping table: - Vector → "text" - Scanned → "scanned" - Hybrid → "mixed" - BrokenVector + ocr_succeeded=false → "broken_vector" - BrokenVector + ocr_succeeded=true → "scanned" (post-OCR recovery) - Override: !has_text && !has_images → "blank" - Override: !has_text && has_images → "figure_only" Add comprehensive unit tests covering all 32 combinations (4 classes × 2 ocr_succeeded × 2 has_text × 2 has_images). Closes: pdftract-4t0jk		2026-05-25 01:19:58 -04:00
..
pdftract-cer-diff	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
pdftract-cli	feat(pdftract-5pbkp): implement inspect subcommand with clap parsing and axum server	2026-05-24 17:13:05 -04:00
pdftract-core	feat(pdftract-4t0jk): implement page_type_string mapping table	2026-05-25 01:19:58 -04:00
pdftract-libpdftract	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter	2026-05-24 04:57:17 -04:00
pdftract-py	feat(pdftract-2nu0s): implement Python SDK contract conformance	2026-05-24 08:55:11 -04:00