pdftract

jedarden ccd13f1bfa feat(pdftract-1vrxg): implement word-break normalization	2026-05-27 22:55:57 -04:00

Implement `normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32` that strips zero-width formatting characters based on script requirements.

- U+200B (zero-width space) and U+FEFF (BOM): ALWAYS stripped (never content)
- U+200C (ZWNJ) and U+200D (ZWJ): stripped unless script requires them
  - Preserved for Arabic, Hebrew, Devanagari, Bengali, Indic, Thai, Lao, Tibetan, Myanmar, Khmer, Sinhala (orthographic in complex scripts)
  - Stripped for Latin and Unknown scripts (noise in extracted text)
- `detect_script()` function identifies dominant script from Unicode codepoint ranges (threshold: >=3 matching characters)
- `Script` enum with `preserves_joiners()` method determines ZWNJ/ZWJ handling
- Returns count of stripped characters (bytes)

Acceptance criteria:
- "auto\u{200B}mation" (Latin) -> "automation" ✓
- Arabic ZWNJ/ZWJ with script_hint=Arabic -> preserved ✓
- Arabic ZWNJ/ZWJ with script_hint=None -> stripped ✓
- "\u{FEFF}hello" -> "hello" (BOM always stripped) ✓
- Devanagari ZWJ with script_hint=Devanagari -> preserved ✓

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
..
pdftract-cer-diff	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
pdftract-cli	feat(pdftract-2825c): add comparison mode support to inspector frontend	2026-05-27 22:52:15 -04:00
pdftract-core	feat(pdftract-1vrxg): implement word-break normalization	2026-05-27 22:55:57 -04:00
pdftract-libpdftract	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter	2026-05-24 04:57:17 -04:00
pdftract-py	feat(pdftract-1tswa): implement GIL release with py.allow_threads on extraction entry points	2026-05-26 21:23:00 -04:00
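
The behaviour described in the word-break normalization commit message can be sketched roughly as follows. This is a minimal illustration, not the code in pdftract-core: the `Span` struct here is a hypothetical stand-in with a single `text` field, the `Script` enum covers only a subset of the scripts the commit lists, the codepoint ranges in `detect_script` are a small assumed selection, and the return value counts stripped characters (code points), which is one reading of the commit's "characters (bytes)".

```rust
/// Hypothetical stand-in for the crate's `Span`; only the text matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
}

/// Assumed subset of the scripts named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Latin,
    Arabic,
    Hebrew,
    Devanagari,
    Unknown,
}

impl Script {
    /// ZWNJ/ZWJ are orthographic in complex scripts, noise in Latin/Unknown.
    pub fn preserves_joiners(self) -> bool {
        !matches!(self, Script::Latin | Script::Unknown)
    }
}

/// Strips U+200B and U+FEFF unconditionally, and U+200C/U+200D unless the
/// hinted script preserves joiners. With no hint, joiners are stripped.
/// Returns the number of characters removed (an assumption, see lead-in).
pub fn normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32 {
    let keep_joiners = script_hint.map_or(false, Script::preserves_joiners);
    let mut stripped = 0u32;
    span.text.retain(|c| {
        let strip = match c {
            '\u{200B}' | '\u{FEFF}' => true,
            '\u{200C}' | '\u{200D}' => !keep_joiners,
            _ => false,
        };
        if strip {
            stripped += 1;
        }
        !strip
    });
    stripped
}

/// Picks the script with the most matching characters, provided it has at
/// least 3 matches (the threshold from the commit); otherwise `Unknown`.
/// The ranges below are an illustrative selection of Unicode blocks.
pub fn detect_script(text: &str) -> Script {
    let mut counts = [
        (Script::Latin, 0u32),
        (Script::Hebrew, 0),
        (Script::Arabic, 0),
        (Script::Devanagari, 0),
    ];
    for c in text.chars() {
        let script = match c {
            'A'..='Z' | 'a'..='z' => Script::Latin,
            '\u{0590}'..='\u{05FF}' => Script::Hebrew,
            '\u{0600}'..='\u{06FF}' => Script::Arabic,
            '\u{0900}'..='\u{097F}' => Script::Devanagari,
            _ => continue,
        };
        if let Some(entry) = counts.iter_mut().find(|e| e.0 == script) {
            entry.1 += 1;
        }
    }
    counts
        .iter()
        .copied()
        .filter(|&(_, n)| n >= 3)
        .max_by_key(|&(_, n)| n)
        .map_or(Script::Unknown, |(s, _)| s)
}
```

Under these assumptions, calling `normalize_word_breaks` on a span containing "auto\u{200B}mation" with a Latin hint leaves "automation", while a span containing ZWNJ between Arabic letters is left untouched when the hint is `Some(Script::Arabic)` and loses the joiner when the hint is `None`. A caller that has no hint but wants joiners preserved could pass `Some(detect_script(&span.text))`; whether the real function does this internally is not stated in the commit message.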