jedarden/pdftract

Commit graph

Author

SHA1

Message

Date

jedarden

19c1fc2e84

docs(pdftract-1vrxg): verify word-break normalization implementation

All acceptance criteria PASS:
- Latin text: U+200B/U+FEFF/U+200C/U+200D stripped
- Arabic/Indic: ZWNJ/ZWJ preserved when script_hint provided
- Unknown script: all characters stripped (safe default)
- Script auto-detection from span text working correctly
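The script auto-detection mentioned above could look roughly like the following sketch. It assumes the ">=3 matching characters" threshold described in the feat commit; the reduced `Script` enum, the approximate codepoint ranges, and the helper `script_index` are illustrative assumptions, not the repository's actual code.

```rust
/// Reduced, hypothetical version of the `Script` enum (the real one lists more scripts).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Script {
    Latin,
    Arabic,
    Devanagari,
    Unknown,
}

/// Scripts tracked by this sketch, in the same order as the counters below.
const SCRIPTS: [Script; 3] = [Script::Latin, Script::Arabic, Script::Devanagari];

/// Hypothetical helper: map a character to a counter index.
/// The ranges are approximate Unicode blocks, chosen for illustration only.
fn script_index(c: char) -> Option<usize> {
    match c {
        'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Some(0),
        '\u{0600}'..='\u{06FF}' => Some(1),
        '\u{0900}'..='\u{097F}' => Some(2),
        _ => None,
    }
}

/// Return the script with the most matching characters,
/// or `Script::Unknown` if fewer than 3 characters match it.
pub fn detect_script(text: &str) -> Script {
    let mut counts = [0usize; 3];
    for i in text.chars().filter_map(script_index) {
        counts[i] += 1;
    }
    let (best, &n) = counts
        .iter()
        .enumerate()
        .max_by_key(|&(_, &n)| n)
        .unwrap(); // the array is never empty
    if n >= 3 { SCRIPTS[best] } else { Script::Unknown }
}
```

Falling back to `Script::Unknown` below the threshold matches the "safe default" noted above: short or mixed spans are treated as having no script that needs joiners.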

34 tests passing across normalize_word_breaks, detect_script, and preserves_joiners.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 23:04:44 -04:00

jedarden

ccd13f1bfa

feat(pdftract-1vrxg): implement word-break normalization

Implement `normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32`
that strips zero-width formatting characters based on script requirements.

- U+200B (zero-width space) and U+FEFF (BOM): ALWAYS stripped (never content)
- U+200C (ZWNJ) and U+200D (ZWJ): stripped unless script requires them
  - Preserved for Arabic, Hebrew, Devanagari, Bengali, Indic, Thai, Lao,
    Tibetan, Myanmar, Khmer, Sinhala (orthographic in complex scripts)
  - Stripped for Latin and Unknown scripts (noise in extracted text)

- `detect_script()` function identifies dominant script from Unicode codepoint
  ranges (threshold: >=3 matching characters)
- `Script` enum with `preserves_joiners()` method determines ZWNJ/ZWJ handling
- Returns count of stripped characters (bytes)
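As a minimal sketch of the stripping rules listed above (not the actual implementation), the function could be written as follows. The `Span` struct with a single `text` field and the reduced `Script` enum are assumptions made for illustration, and this version counts removed characters rather than bytes, since the message above leaves the unit ambiguous.

```rust
/// Hypothetical stand-in for the crate's span type; only the text is modeled.
pub struct Span {
    pub text: String,
}

/// Reduced, hypothetical version of the `Script` enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Script {
    Latin,
    Arabic,
    Devanagari,
    Unknown,
}

impl Script {
    /// True for scripts in which ZWNJ/ZWJ carry orthographic meaning.
    pub fn preserves_joiners(self) -> bool {
        matches!(self, Script::Arabic | Script::Devanagari)
    }
}

/// Strip zero-width formatting characters in place and return how many
/// characters were removed (assumption: characters, not bytes).
pub fn normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32 {
    // Without a hint, joiners are treated as noise and stripped.
    let keep_joiners = script_hint.map_or(false, Script::preserves_joiners);
    let mut stripped = 0u32;
    span.text.retain(|c| {
        let drop = match c {
            '\u{200B}' | '\u{FEFF}' => true,          // ZWSP and BOM: always stripped
            '\u{200C}' | '\u{200D}' => !keep_joiners, // ZWNJ and ZWJ: script-dependent
            _ => false,
        };
        if drop {
            stripped += 1;
        }
        !drop
    });
    stripped
}
```

With this sketch, a span holding `"auto\u{200B}mation"` and a hint of `Some(Script::Latin)` ends up as `"automation"` with a return value of 1, while a ZWNJ in a span hinted as `Some(Script::Arabic)` is left untouched.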

Acceptance criteria:
- "auto\u{200B}mation" (Latin) -> "automation" ✓
- Arabic ZWNJ/ZWJ with script_hint=Arabic -> preserved ✓
- Arabic ZWNJ/ZWJ with script_hint=None -> stripped ✓
- "\u{FEFF}hello" -> "hello" (BOM always stripped) ✓
- Devanagari ZWJ with script_hint=Devanagari -> preserved ✓

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 22:55:57 -04:00

2 commits