feat(pdftract-1vrxg): implement word-break normalization
Implement `normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32`
that strips zero-width formatting characters based on script requirements.
- U+200B (zero-width space) and U+FEFF (BOM): ALWAYS stripped (never content)
- U+200C (ZWNJ) and U+200D (ZWJ): stripped unless script requires them
  - Preserved for Arabic, Hebrew, Devanagari, Bengali, Indic, Thai, Lao,
    Tibetan, Myanmar, Khmer, Sinhala (orthographic in complex scripts)
  - Stripped for Latin and Unknown scripts (noise in extracted text)
- `detect_script()` function identifies dominant script from Unicode codepoint
  ranges (threshold: >=3 matching characters)
- `Script` enum with `preserves_joiners()` method determines ZWNJ/ZWJ handling
- Returns count of stripped characters (bytes)
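
For reviewers, the stripping rule above can be sketched roughly as follows.
This is an illustrative approximation, not the code in this commit: it works
on a plain `String` instead of `Span`, the helper name `strip_zero_width` is
hypothetical, the `Script` enum is cut down to four variants, a missing hint
is assumed to behave like `Script::Unknown`, and the return value counts
characters, not bytes.

```rust
/// Hypothetical, reduced version of the `Script` enum (the real one has more variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Script {
    Latin,
    Arabic,
    Devanagari,
    Unknown,
}

impl Script {
    /// True when ZWNJ/ZWJ are orthographic in this script and must be kept.
    fn preserves_joiners(self) -> bool {
        matches!(self, Script::Arabic | Script::Devanagari)
    }
}

/// Hypothetical stand-in for `normalize_word_breaks`, operating on a `String`.
/// Removes zero-width formatting characters in place and returns how many
/// characters (Unicode scalar values) were removed.
fn strip_zero_width(text: &mut String, script_hint: Option<Script>) -> u32 {
    // Assumption: no hint is treated the same as `Script::Unknown`.
    let keep_joiners = script_hint
        .unwrap_or(Script::Unknown)
        .preserves_joiners();

    let mut removed = 0u32;
    text.retain(|c| {
        let strip = match c {
            // Zero-width space and BOM are never content.
            '\u{200B}' | '\u{FEFF}' => true,
            // ZWNJ and ZWJ are kept only when the script needs them.
            '\u{200C}' | '\u{200D}' => !keep_joiners,
            _ => false,
        };
        if strip {
            removed += 1;
        }
        !strip
    });
    removed
}
```

With this sketch, calling `strip_zero_width` on "auto\u{200B}mation" with
`Some(Script::Latin)` leaves "automation" and returns 1, while a string
containing U+200C is left unchanged when the hint is `Some(Script::Arabic)`.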
Acceptance criteria:
- "auto\u{200B}mation" (Latin) -> "automation" ✓
- Arabic ZWNJ/ZWJ with script_hint=Arabic -> preserved ✓
- Arabic ZWNJ/ZWJ with script_hint=None -> stripped ✓
- "\u{FEFF}hello" -> "hello" (BOM always stripped) ✓
- Devanagari ZWJ with script_hint=Devanagari -> preserved ✓
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>