Commit graph

1 commit

Author SHA1 Message Date
jedarden
fce3a75526 feat(pdftract-4t0jk): implement page_type_string mapping table
Implement the page_type_string(class, ocr_succeeded, has_text, has_images)
function that maps PageClass to canonical page_type strings for the 6.1
JSON schema per INV-9 stable taxonomy.

Mapping table:
- Vector → "text"
- Scanned → "scanned"
- Hybrid → "mixed"
- BrokenVector + ocr_succeeded=false → "broken_vector"
- BrokenVector + ocr_succeeded=true → "scanned" (post-OCR recovery)
- Override: !has_text && !has_images → "blank"
- Override: !has_text && has_images → "figure_only"

Add comprehensive unit tests covering all 32 combinations (4 classes ×
2 ocr_succeeded × 2 has_text × 2 has_images).

Closes: pdftract-4t0jk
2026-05-25 01:19:58 -04:00