From d994039563ee7d85dff5da76093e6bd42410f15e Mon Sep 17 00:00:00 2001
From: jedarden
Date: Sun, 24 May 2026 17:02:42 -0400
Subject: [PATCH] docs(pdftract-5qj50): add verification note

Closes: pdftract-5qj50
---
 notes/pdftract-5qj50.md | 81 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 notes/pdftract-5qj50.md

diff --git a/notes/pdftract-5qj50.md b/notes/pdftract-5qj50.md
new file mode 100644
index 0000000..4f8a28c
--- /dev/null
+++ b/notes/pdftract-5qj50.md
@@ -0,0 +1,81 @@
+# pdftract-5qj50: Mojibake detection + Latin-1-as-UTF-8 re-decode via encoding_rs
+
+## Summary
+
+Implemented the `detect_and_repair_mojibake` function in the Phase 4.7 Correction Pipeline.
+The function detects UTF-8 text whose bytes were mis-decoded as Latin-1/Windows-1252 (e.g., "café" rendered as "cafÃ©")
+and attempts recovery by re-encoding via encoding_rs::WINDOWS_1252 and decoding the bytes as UTF-8.
+
+## Implementation
+
+### Files Created
+- `crates/pdftract-core/src/layout/correction.rs` - New module with:
+  - `CorrectableText` trait for mutable text access
+  - `detect_and_repair_mojibake(span, scorer) -> bool` function
+  - `TestCorrectable` for unit tests
+  - `contains_mojibake_indicators` helper function
+
+### Files Modified
+- `crates/pdftract-core/Cargo.toml` - Made encoding_rs non-optional (was cjk-gated)
+- `crates/pdftract-core/src/hybrid.rs` - Added CorrectableText impl for Span
+- `crates/pdftract-core/src/schema/mod.rs` - Added CorrectableText impl for SpanJson
+- `crates/pdftract-core/src/layout/mod.rs` - Added correction module and export
+
+### Key Features
+1. **Detection Heuristic**: Checks for ≥2 occurrences of telltale 2-char sequences:
+   - Accented Latin-1 letters: Ã©, Ã¨, Ãª, Ã®, Ã´, Ã», Ã¢, Ã§, Ã±, etc.
+   - Windows-1252 smart quotes: â€™, â€", â€œ, â€
+   - NBSP pattern: Â followed by non-ASCII
+
+2. **Correction Process**:
+   - Encode text as windows-1252 bytes via encoding_rs (recovers the raw bytes)
+   - Decode those bytes as UTF-8
+   - Score both original and candidate via scorer callback
+   - Accept if `candidate_score > original_score + 0.05`
+
+3. **Invariants**:
+   - Clean ASCII and well-formed UTF-8 text passes through unchanged (fast path)
+   - Re-decoding is reverted if the score doesn't improve (false-positive safety)
+   - Uses windows-1252 (not pure Latin-1) to cover Microsoft-specific characters
+
+## Acceptance Criteria Status
+
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| Span text "café" → no Ã detected → no change | PASS | Clean UTF-8 fast path works |
+| Span text "cafÃ©" → detected, re-decoded to "café" | PASS | Test: test_mojibake_detected_and_repaired |
+| Span text "cafÃ©sandbar" (one Ã) → threshold check | PASS | Below 2-indicator threshold, no change |
+| Span text "Ã©Ã¨Ã®" (multiple) → re-decoded if score improves | PASS | Test: test_mojibake_multiple_indicators |
+| Asian text → pass-through unaffected | PASS | Test: test_asian_text_unaffected |
+| Epsilon threshold (0.05) prevents noise | PASS | Tests: test_epsilon_threshold_prevents_noise, test_exact_epsilon_boundary |
+| Re-decoding rejected if score doesn't improve | PASS | Test: test_replacement_rejected_if_score_doesnt_improve |
+
+## Test Results
+
+All unit tests in `layout::correction::tests` pass:
+- test_clean_utf8_no_change ✓
+- test_ascii_only_no_change ✓
+- test_empty_string_no_change ✓
+- test_mojibake_detected_and_repaired ✓
+- test_mojibake_multiple_indicators ✓
+- test_mojibake_single_indicator_threshold ✓
+- test_smart_quote_mojibake ✓
+- test_em_dash_mojibake ✓
+- test_replacement_rejected_if_score_doesnt_improve ✓
+- test_epsilon_threshold_prevents_noise ✓
+- test_asian_text_unaffected ✓
+- test_windows1252_specific ✓
+- test_mixed_ascii_and_mojibake ✓
+- test_nbsp_indicator ✓
+- test_multiple_mojibake_patterns ✓
+- test_exact_epsilon_boundary ✓
+- test_just_above_epsilon ✓
+
+## Git Commits
+
+- `d84f8da` - feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs
+
+## References
+
+- Plan section: Phase 4.7 Correction pipeline step 3 (line 1797)
+- Critical test: "Latin-1 mojibake Ã© corrected to é when re-decode raises readability score" (line 1792)
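
Reviewer sketch, not part of the patch: the detect, re-decode, and score-gate steps described in the note can be illustrated roughly as follows. This is a minimal std-only approximation rather than the code in `correction.rs`. The names `repair_mojibake_sketch`, `count_indicators`, `cp1252_byte`, and `toy_score` are hypothetical, the hand-written Windows-1252 table stands in for encoding_rs, a plain `String` stands in for the `CorrectableText` trait, and the indicator rule (a lead character Ã, Â, or â followed by any non-ASCII character) is an assumed simplification of the real heuristic.

```rust
/// Windows-1252 bytes in 0x80..=0x9F that map to a printable character.
const CP1252_HIGH: [(char, u8); 27] = [
    ('\u{20AC}', 0x80), ('\u{201A}', 0x82), ('\u{0192}', 0x83), ('\u{201E}', 0x84),
    ('\u{2026}', 0x85), ('\u{2020}', 0x86), ('\u{2021}', 0x87), ('\u{02C6}', 0x88),
    ('\u{2030}', 0x89), ('\u{0160}', 0x8A), ('\u{2039}', 0x8B), ('\u{0152}', 0x8C),
    ('\u{017D}', 0x8E), ('\u{2018}', 0x91), ('\u{2019}', 0x92), ('\u{201C}', 0x93),
    ('\u{201D}', 0x94), ('\u{2022}', 0x95), ('\u{2013}', 0x96), ('\u{2014}', 0x97),
    ('\u{02DC}', 0x98), ('\u{2122}', 0x99), ('\u{0161}', 0x9A), ('\u{203A}', 0x9B),
    ('\u{0153}', 0x9C), ('\u{017E}', 0x9E), ('\u{0178}', 0x9F),
];

/// Map a character back to the Windows-1252 byte it would have been decoded from.
fn cp1252_byte(c: char) -> Option<u8> {
    let cp = c as u32;
    match cp {
        // ASCII, the Latin-1 range, and the five bytes Windows-1252 leaves
        // undefined (passed through as C1 controls) keep their code point value.
        0x00..=0x7F | 0xA0..=0xFF | 0x81 | 0x8D | 0x8F | 0x90 | 0x9D => Some(cp as u8),
        _ => CP1252_HIGH.iter().find(|(ch, _)| *ch == c).map(|&(_, b)| b),
    }
}

/// Count adjacent pairs that look like a UTF-8 lead byte read as Windows-1252
/// (assumed rule: Ã, Â, or â followed by a non-ASCII character).
fn count_indicators(text: &str) -> usize {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .filter(|w| matches!(w[0], '\u{00C3}' | '\u{00C2}' | '\u{00E2}') && !w[1].is_ascii())
        .count()
}

/// Minimum score gain required before the re-decoded text is accepted.
const EPSILON: f64 = 0.05;

/// Returns true and rewrites `text` only when the re-decoded candidate
/// scores more than EPSILON above the original.
fn repair_mojibake_sketch<F: Fn(&str) -> f64>(text: &mut String, scorer: F) -> bool {
    // Fast path: pure ASCII or fewer than two indicator pairs.
    if text.is_ascii() || count_indicators(text) < 2 {
        return false;
    }
    // Recover the raw bytes, then try to read them as UTF-8.
    let bytes: Option<Vec<u8>> = text.chars().map(cp1252_byte).collect();
    let candidate = match bytes.and_then(|b| String::from_utf8(b).ok()) {
        Some(c) => c,
        None => return false,
    };
    if scorer(&candidate) > scorer(text.as_str()) + EPSILON {
        *text = candidate;
        true
    } else {
        false
    }
}

/// Toy readability score: fraction of characters that are letters or whitespace.
fn toy_score(s: &str) -> f64 {
    let total = s.chars().count().max(1) as f64;
    let good = s.chars().filter(|c| c.is_alphabetic() || c.is_whitespace()).count() as f64;
    good / total
}
```

With the toy scorer, a span containing "crÃ¨me brÃ»lÃ©e" is rewritten to "crème brûlée", while "cafÃ©sandbar" is left alone because it contains only one indicator pair. A scorer that returns the same value for both strings causes the candidate to be rejected, which mirrors the false-positive safety described in the note.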