Implement the page_type_string(class, ocr_succeeded, has_text, has_images) function that maps PageClass to canonical page_type strings for the 6.1 JSON schema per INV-9 stable taxonomy.

Mapping table:

- Vector → "text"
- Scanned → "scanned"
- Hybrid → "mixed"
- BrokenVector + ocr_succeeded=false → "broken_vector"
- BrokenVector + ocr_succeeded=true → "scanned" (post-OCR recovery)
- Override: !has_text && !has_images → "blank"
- Override: !has_text && has_images → "figure_only"

Add comprehensive unit tests covering all 32 combinations (4 classes × 2 ocr_succeeded × 2 has_text × 2 has_images).

Closes: pdftract-4t0jk

| | | |
|---|---|---|
| pdftract-cer-diff | ||
| pdftract-cli | ||
| pdftract-core | ||
| pdftract-libpdftract | ||
| pdftract-py | ||
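
The mapping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the `PageClass` definition here is a hypothetical stand-in with the four variants the message names, the `&'static str` return type is an assumption, and the sketch assumes the two overrides take precedence over the class-based mapping, which the message implies but does not state outright.

```rust
/// Hypothetical stand-in for the PageClass type named in the commit message.
/// The real definition may carry more variants or data.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageClass {
    Vector,
    Scanned,
    Hybrid,
    BrokenVector,
}

/// Maps a page classification to a page_type string.
///
/// Assumption: the `!has_text` overrides are checked first, so they win
/// over the class-based mapping for every class.
pub fn page_type_string(
    class: PageClass,
    ocr_succeeded: bool,
    has_text: bool,
    has_images: bool,
) -> &'static str {
    // Overrides: a page without text is either blank or figure-only.
    if !has_text {
        return if has_images { "figure_only" } else { "blank" };
    }

    // Class-based mapping; only BrokenVector depends on the OCR outcome.
    match (class, ocr_succeeded) {
        (PageClass::Vector, _) => "text",
        (PageClass::Scanned, _) => "scanned",
        (PageClass::Hybrid, _) => "mixed",
        (PageClass::BrokenVector, false) => "broken_vector",
        (PageClass::BrokenVector, true) => "scanned",
    }
}
```

Matching on the `(class, ocr_succeeded)` tuple keeps the match exhaustive, so the compiler would flag a newly added `PageClass` variant that has no mapping. Under the precedence assumed here, 16 of the 32 input combinations (all those with `has_text == false`) resolve through the overrides.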