jedarden
19c1fc2e84
docs(pdftract-1vrxg): verify word-break normalization implementation
All acceptance criteria PASS:
- Latin text: U+200B/U+FEFF/U+200C/U+200D stripped
- Arabic/Indic: ZWNJ/ZWJ preserved when script_hint provided
- Unknown script: all four zero-width characters stripped (safe default)
- Script auto-detection from span text working correctly
34 tests passing across normalize_word_breaks, detect_script, and preserves_joiners.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:04:44 -04:00
jedarden
ccd13f1bfa
feat(pdftract-1vrxg): implement word-break normalization
Implement `normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32`
that strips zero-width formatting characters based on script requirements.
- U+200B (zero-width space) and U+FEFF (BOM): ALWAYS stripped (never content)
- U+200C (ZWNJ) and U+200D (ZWJ): stripped unless script requires them
  - Preserved for Arabic, Hebrew, Devanagari, Bengali, Indic, Thai, Lao,
    Tibetan, Myanmar, Khmer, Sinhala (orthographic in complex scripts)
  - Stripped for Latin and Unknown scripts (noise in extracted text)
- `detect_script()` function identifies dominant script from Unicode codepoint
  ranges (threshold: >=3 matching characters)
- `Script` enum with `preserves_joiners()` method determines ZWNJ/ZWJ handling
- Returns count of stripped characters (bytes)
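For illustration, the `Script` enum and `detect_script()` described above might look roughly like the sketch below. This is not the committed code: the variant list is a hypothetical subset, the codepoint ranges cover only the basic Unicode block of each script, the helper `script_of` is invented for this example, and the tie-breaking rule (first candidate wins) is an assumption. Only the >=3 threshold and the Latin/Unknown versus complex-script split come from the description above.

```rust
/// Hypothetical subset of the `Script` enum; the real variant list may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Latin,
    Arabic,
    Hebrew,
    Devanagari,
    Thai,
    Unknown,
}

impl Script {
    /// True when ZWNJ/ZWJ are orthographic in this script and should be kept.
    pub fn preserves_joiners(self) -> bool {
        !matches!(self, Script::Latin | Script::Unknown)
    }
}

/// Classify one character by Unicode block (assumed, approximate ranges).
fn script_of(c: char) -> Option<Script> {
    match c as u32 {
        0x0041..=0x005A | 0x0061..=0x007A | 0x00C0..=0x024F => Some(Script::Latin),
        0x0590..=0x05FF => Some(Script::Hebrew),
        0x0600..=0x06FF => Some(Script::Arabic),
        0x0900..=0x097F => Some(Script::Devanagari),
        0x0E00..=0x0E7F => Some(Script::Thai),
        _ => None,
    }
}

/// Return the dominant script, or `Unknown` when fewer than three
/// characters match any single script.
pub fn detect_script(text: &str) -> Script {
    const CANDIDATES: [Script; 5] = [
        Script::Latin,
        Script::Arabic,
        Script::Hebrew,
        Script::Devanagari,
        Script::Thai,
    ];
    let mut counts = [0usize; 5];
    for c in text.chars() {
        if let Some(s) = script_of(c) {
            if let Some(i) = CANDIDATES.iter().position(|&k| k == s) {
                counts[i] += 1;
            }
        }
    }
    // Pick the script with the highest count; on a tie the earlier candidate wins.
    let mut best = (Script::Unknown, 0usize);
    for (i, &n) in counts.iter().enumerate() {
        if n > best.1 {
            best = (CANDIDATES[i], n);
        }
    }
    if best.1 >= 3 { best.0 } else { Script::Unknown }
}
```

Falling back to `Unknown` below the threshold keeps short or mixed spans on the strip-everything path, which matches the safe default described for unknown scripts.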
Acceptance criteria:
- "auto\u{200B}mation" (Latin) -> "automation" ✓
- Arabic ZWNJ/ZWJ with script_hint=Arabic -> preserved ✓
- Arabic ZWNJ/ZWJ with script_hint=None -> stripped ✓
- "\u{FEFF}hello" -> "hello" (BOM always stripped) ✓
- Devanagari ZWJ with script_hint=Devanagari -> preserved ✓
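The criteria above can be reproduced with a minimal sketch of `normalize_word_breaks`. Everything besides the function signature is an assumption made for this example: `Span` is reduced to a single hypothetical `text` field, `Script` is cut down to four variants, a `None` hint is treated as `Unknown`, and the return value counts stripped characters rather than bytes.

```rust
/// Reduced stand-in for the real `Script` enum (assumed variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Latin,
    Arabic,
    Devanagari,
    Unknown,
}

impl Script {
    /// ZWNJ/ZWJ carry orthographic meaning in these scripts.
    pub fn preserves_joiners(self) -> bool {
        matches!(self, Script::Arabic | Script::Devanagari)
    }
}

/// Hypothetical minimal span; the real `Span` presumably has more fields.
pub struct Span {
    pub text: String,
}

/// Strip zero-width formatting characters in place and return how many
/// characters were removed (this sketch counts `char`s, not bytes).
pub fn normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32 {
    // Assumption: a missing hint is handled like `Script::Unknown`.
    let keep_joiners = script_hint.unwrap_or(Script::Unknown).preserves_joiners();
    let mut stripped = 0u32;
    span.text.retain(|c| {
        let strip = match c {
            // Zero-width space and BOM are never content.
            '\u{200B}' | '\u{FEFF}' => true,
            // ZWNJ and ZWJ are kept only when the script needs them.
            '\u{200C}' | '\u{200D}' => !keep_joiners,
            _ => false,
        };
        if strip {
            stripped += 1;
        }
        !strip
    });
    stripped
}
```

With this sketch, a span holding "auto\u{200B}mation" and no hint ends up as "automation" with a return value of 1, while a Persian-style word containing U+200C is left untouched when `Some(Script::Arabic)` is passed.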
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:55:57 -04:00