pdftract/crates
jedarden fce3a75526 feat(pdftract-4t0jk): implement page_type_string mapping table
Implement the page_type_string(class, ocr_succeeded, has_text, has_images)
function that maps PageClass to canonical page_type strings for the 6.1
JSON schema per INV-9 stable taxonomy.

Mapping table:
- Vector → "text"
- Scanned → "scanned"
- Hybrid → "mixed"
- BrokenVector + ocr_succeeded=false → "broken_vector"
- BrokenVector + ocr_succeeded=true → "scanned" (post-OCR recovery)
- Override: !has_text && !has_images → "blank"
- Override: !has_text && has_images → "figure_only"

Add comprehensive unit tests covering all 32 combinations (4 classes ×
2 ocr_succeeded × 2 has_text × 2 has_images).

Closes: pdftract-4t0jk
2026-05-25 01:19:58 -04:00
..
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli feat(pdftract-5pbkp): implement inspect subcommand with clap parsing and axum server 2026-05-24 17:13:05 -04:00
pdftract-core feat(pdftract-4t0jk): implement page_type_string mapping table 2026-05-25 01:19:58 -04:00
pdftract-libpdftract feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
pdftract-py feat(pdftract-2nu0s): implement Python SDK contract conformance 2026-05-24 08:55:11 -04:00