pdftract/crates
jedarden d1e4631eff feat(pdftract-1ijc): implement HOCR output parsing with quick-xml
Implement HOCR XML parser for Tesseract output (Phase 5.4.3).

- Add quick-xml dependency for streaming HOCR parsing
- Implement HocrWord struct with text, bbox_px, confidence_0_100 fields
- Implement parse_hocr() using quick-xml event-driven parsing
- Handle invalid UTF-8 gracefully (U+FFFD substitution)
- Skip empty/whitespace-only words
- Parse title attribute robustly (tolerates extra fields)
- Default confidence to 50% when x_wconf missing
- Add comprehensive test suite with performance benchmark

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:26:57 -04:00
..
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli feat(pdftract-27n3): implement border padding, pipeline orchestration, and fixtures 2026-05-23 21:55:11 -04:00
pdftract-core feat(pdftract-1ijc): implement HOCR output parsing with quick-xml 2026-05-24 00:26:57 -04:00
pdftract-libpdftract feat(pdftract-juc): implement Standard 14 font metrics registry 2026-05-23 14:04:02 -04:00
pdftract-py docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00