Commit graph

1 commit

Author SHA1 Message Date
jedarden
d1e4631eff feat(pdftract-1ijc): implement HOCR output parsing with quick-xml
Implement HOCR XML parser for Tesseract output (Phase 5.4.3).

- Add quick-xml dependency for streaming HOCR parsing
- Implement HocrWord struct with text, bbox_px, confidence_0_100 fields
- Implement parse_hocr() using quick-xml event-driven parsing
- Handle invalid UTF-8 gracefully (U+FFFD substitution)
- Skip empty/whitespace-only words
- Parse title attribute robustly (tolerates extra fields)
- Default confidence to 50% when x_wconf missing
- Add comprehensive test suite with performance benchmark
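
For illustration, the title-attribute handling listed above (tolerating extra fields, defaulting confidence to 50 when x_wconf is missing) might look roughly like the sketch below. It uses only the standard library and is not the code from this commit: `TitleFields`, `parse_title`, `DEFAULT_CONFIDENCE`, and the integer field types are hypothetical names and assumptions; only the `bbox_px` and `confidence_0_100` field names come from the message.

```rust
/// Hypothetical holder for the values read from an hOCR `title` attribute.
/// The field names mirror the commit message; the types are assumptions.
#[derive(Debug, PartialEq)]
pub struct TitleFields {
    /// Bounding box in pixels: (x0, y0, x1, y1).
    pub bbox_px: (u32, u32, u32, u32),
    /// Word confidence on a 0-100 scale.
    pub confidence_0_100: u8,
}

/// Fallback used when `x_wconf` is absent (50, per the commit message).
const DEFAULT_CONFIDENCE: u8 = 50;

/// Parse a title string such as `bbox 10 20 110 60; x_wconf 93`.
/// Unknown properties are ignored; a missing or malformed bbox yields `None`.
pub fn parse_title(title: &str) -> Option<TitleFields> {
    let mut bbox = None;
    let mut conf = None;

    // Properties are separated by ';', each one is "name arg arg ...".
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        match parts.next() {
            Some("bbox") => {
                // All four coordinates must parse, otherwise the bbox is rejected.
                let nums: Option<Vec<u32>> = parts.map(|p| p.parse().ok()).collect();
                if let Some(n) = nums {
                    if n.len() == 4 {
                        bbox = Some((n[0], n[1], n[2], n[3]));
                    }
                }
            }
            Some("x_wconf") => {
                // Accept integer or decimal values and clamp them into 0..=100.
                conf = parts
                    .next()
                    .and_then(|p| p.parse::<f32>().ok())
                    .map(|c| c.clamp(0.0, 100.0).round() as u8);
            }
            // Extra fields (x_size, x_fsize, ...) and empty segments are skipped.
            _ => {}
        }
    }

    Some(TitleFields {
        bbox_px: bbox?,
        confidence_0_100: conf.unwrap_or(DEFAULT_CONFIDENCE),
    })
}
```

In this sketch a word without a usable bbox is rejected outright, while a missing x_wconf only triggers the default; whether the actual parser makes the same choice is not stated in the message.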

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:26:57 -04:00