pdftract

jedarden d1e4631eff	2026-05-24 00:26:57 -04:00
feat(pdftract-1ijc): implement HOCR output parsing with quick-xml

Implement HOCR XML parser for Tesseract output (Phase 5.4.3).

- Add quick-xml dependency for streaming HOCR parsing
- Implement HocrWord struct with text, bbox_px, confidence_0_100 fields
- Implement parse_hocr() using quick-xml event-driven parsing
- Handle invalid UTF-8 gracefully (U+FFFD substitution)
- Skip empty/whitespace-only words
- Parse title attribute robustly (tolerates extra fields)
- Default confidence to 50% when x_wconf missing
- Add comprehensive test suite with performance benchmark

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-cer-diff	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
pdftract-cli	feat(pdftract-27n3): implement border padding, pipeline orchestration, and fixtures	2026-05-23 21:55:11 -04:00
pdftract-core	feat(pdftract-1ijc): implement HOCR output parsing with quick-xml	2026-05-24 00:26:57 -04:00
pdftract-libpdftract	feat(pdftract-juc): implement Standard 14 font metrics registry	2026-05-23 14:04:02 -04:00
pdftract-py	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
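
The latest commit to pdftract-core describes parsing the hOCR `title` attribute tolerantly, skipping whitespace-only words, and defaulting confidence to 50 when `x_wconf` is missing. The actual implementation is not shown on this page, so the following is only a minimal std-only sketch of that behavior. The function names `parse_title` and `make_word`, the tuple type chosen for `bbox_px`, and the clamping of confidence are assumptions made for illustration; only the struct and field names come from the commit message. The XML event loop that the commit builds on quick-xml is deliberately left out.

```rust
/// Hypothetical mirror of the `HocrWord` fields named in the commit message.
#[derive(Debug, Clone, PartialEq)]
pub struct HocrWord {
    pub text: String,
    /// (x0, y0, x1, y1) in pixels; the tuple layout is an assumption.
    pub bbox_px: (u32, u32, u32, u32),
    pub confidence_0_100: u8,
}

/// Fallback confidence when `x_wconf` is absent, per the commit message.
const DEFAULT_CONFIDENCE: u8 = 50;

/// Parse an hOCR `title` attribute such as `bbox 10 20 110 40; x_wconf 93`.
/// Returns `None` when no well-formed bbox is present. Unknown fields are ignored.
pub fn parse_title(title: &str) -> Option<((u32, u32, u32, u32), u8)> {
    let mut bbox = None;
    let mut conf = DEFAULT_CONFIDENCE;

    // Fields are separated by ';', and each field is a name followed by values.
    for field in title.split(';') {
        let mut parts = field.split_whitespace();
        match parts.next() {
            Some("bbox") => {
                // Require exactly four unsigned integers.
                let nums: Result<Vec<u32>, _> = parts.map(|p| p.parse::<u32>()).collect();
                if let Ok(n) = nums {
                    if n.len() == 4 {
                        bbox = Some((n[0], n[1], n[2], n[3]));
                    }
                }
            }
            Some("x_wconf") => {
                // Accept integer or fractional values and clamp into 0..=100.
                if let Some(v) = parts.next().and_then(|p| p.parse::<f32>().ok()) {
                    conf = v.clamp(0.0, 100.0).round() as u8;
                }
            }
            // Tolerate extra fields such as `x_size` or `baseline`.
            _ => {}
        }
    }

    bbox.map(|b| (b, conf))
}

/// Build a word from its text and `title`, skipping empty or whitespace-only text.
pub fn make_word(text: &str, title: &str) -> Option<HocrWord> {
    let text = text.trim();
    if text.is_empty() {
        return None;
    }
    let (bbox_px, confidence_0_100) = parse_title(title)?;
    Some(HocrWord {
        text: text.to_string(),
        bbox_px,
        confidence_0_100,
    })
}
```

In this sketch, `make_word("Hello", "bbox 10 20 110 40; x_size 12")` yields a word with confidence 50 because `x_wconf` is absent, while `make_word("  ", ...)` returns `None`. Returning `None` for a missing bbox is one possible design choice; the real parser may handle that case differently.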