docs(pdftract-5qj50): add verification note
Closes: pdftract-5qj50
parent d84f8da3a4
commit d994039563
1 changed file with 81 additions and 0 deletions
notes/pdftract-5qj50.md
@@ -0,0 +1,81 @@
# pdftract-5qj50: Mojibake detection + Latin-1-as-UTF-8 re-decode via encoding_rs

## Summary

Implemented the `detect_and_repair_mojibake` function in the Phase 4.7 Correction Pipeline.
The function detects UTF-8 text that was mis-decoded as Latin-1/Windows-1252 (e.g., "cafÃ©" where "café" was intended)
and attempts recovery by re-decoding via `encoding_rs::WINDOWS_1252`.

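As a small illustration of how this kind of mojibake arises, the sketch below reads the UTF-8 bytes of a string one byte per character, which is what a Latin-1 decode does. The helper name `misread_as_latin1` is hypothetical and not part of the commit. It uses only the standard library, so it models pure Latin-1 and does not cover the windows-1252 range 0x80..0x9F.

```rust
/// Simulates the decoding mistake: take the UTF-8 bytes of `text` and
/// interpret each byte as one Latin-1 character.
/// Hypothetical helper for illustration; not part of the commit.
fn misread_as_latin1(text: &str) -> String {
    // `u8 as char` maps byte N to code point U+00NN, which matches Latin-1.
    text.bytes().map(|b| b as char).collect()
}
```

For example, `misread_as_latin1("café")` yields "cafÃ©": the two bytes of `é` (0xC3, 0xA9) become the two characters U+00C3 and U+00A9.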
## Implementation

### Files Created
- `crates/pdftract-core/src/layout/correction.rs` - New module with:
  - `CorrectableText` trait for mutable text access
  - `detect_and_repair_mojibake<T, F>(span, scorer) -> bool` function
  - `TestCorrectable` for unit tests
  - `contains_mojibake_indicators` helper function

### Files Modified
- `crates/pdftract-core/Cargo.toml` - Made encoding_rs non-optional (was cjk-gated)
- `crates/pdftract-core/src/hybrid.rs` - Added CorrectableText impl for Span
- `crates/pdftract-core/src/schema/mod.rs` - Added CorrectableText impl for SpanJson
- `crates/pdftract-core/src/layout/mod.rs` - Added correction module and export

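The note does not show the trait itself. A minimal sketch of what a `CorrectableText`-style abstraction could look like follows. The method names `text` and `set_text`, the single field on `TestCorrectable`, and the helper `uppercase_in_place` are assumptions made for illustration and may differ from the actual code in `correction.rs`.

```rust
/// Read/write access to the text of a span-like value.
/// Method names are assumed for this sketch.
pub trait CorrectableText {
    fn text(&self) -> &str;
    fn set_text(&mut self, text: String);
}

/// Minimal in-memory implementor, in the spirit of a test double.
#[derive(Debug, Default)]
pub struct TestCorrectable {
    pub text: String,
}

impl CorrectableText for TestCorrectable {
    fn text(&self) -> &str {
        &self.text
    }

    fn set_text(&mut self, text: String) {
        self.text = text;
    }
}

/// Hypothetical helper: generic code can rewrite text without knowing
/// the concrete span type.
pub fn uppercase_in_place<T: CorrectableText>(span: &mut T) {
    let upper = span.text().to_uppercase();
    span.set_text(upper);
}
```

A trait of this shape would let one correction function serve both `Span` and `SpanJson`, which is consistent with the two impls listed under Files Modified.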
### Key Features
1. **Detection Heuristic**: Checks for ≥2 occurrences of telltale 2-char sequences:
   - Accented Latin-1 letters: Ã©, Ã¨, Ãª, Ã®, Ã´, Ã», Ã¢, Ã§, Ã±, etc.
   - Windows-1252 smart quotes: â€™, â€", â€œ, â€
   - NBSP pattern: Â followed by non-ASCII

2. **Correction Process**:
   - Encode the text as windows-1252 bytes via encoding_rs, recovering the original byte sequence
   - Decode those bytes as UTF-8
   - Score both original and candidate via the scorer callback
   - Accept if `candidate_score > original_score + 0.05`

3. **Invariants**:
   - Clean ASCII and valid UTF-8 text passes through unchanged (fast path)
   - Re-decoding is reverted if the score doesn't improve (false-positive safety)
   - Uses windows-1252 (not pure Latin-1) to support Microsoft-specific characters

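The three features above can be sketched end to end as follows. This is an illustration under stated assumptions, not the code from `correction.rs`: it works on a plain `String` instead of a `CorrectableText` span, uses a simplified indicator check, and replaces `encoding_rs::WINDOWS_1252` with a hand-written table for bytes 0x80..0x9F so that it runs with the standard library only. The names `count_indicators`, `encode_cp1252`, and `detect_and_repair` are hypothetical. The sketch recovers the original bytes by encoding the text as windows-1252 and then decodes those bytes as UTF-8, which is the direction that maps "cafÃ©" back to "café".

```rust
/// Minimum number of indicator sequences before a repair is attempted.
const MIN_INDICATORS: usize = 2;
/// Margin by which the candidate score must beat the original.
const EPSILON: f32 = 0.05;

/// windows-1252 characters for bytes 0x80..=0x9F (WHATWG mapping; the five
/// unassigned bytes map to the C1 control character of the same value).
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// Simplified heuristic (an assumption of this sketch): counts U+00C3, U+00C2
/// or U+00E2 (the characters that UTF-8 lead bytes 0xC3, 0xC2, 0xE2 turn into)
/// when directly followed by a non-ASCII character.
fn count_indicators(text: &str) -> usize {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .filter(|w| matches!(w[0], '\u{00C3}' | '\u{00C2}' | '\u{00E2}') && !w[1].is_ascii())
        .count()
}

/// Encodes `text` as windows-1252; returns None if any char has no mapping.
fn encode_cp1252(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| match u32::from(c) {
            // ASCII and 0xA0..=0xFF coincide with Latin-1.
            n @ (0x00..=0x7F | 0xA0..=0xFF) => Some(n as u8),
            // Everything else must appear in the 0x80..=0x9F table.
            _ => CP1252_HIGH
                .iter()
                .position(|&h| h == c)
                .map(|i| 0x80 + i as u8),
        })
        .collect()
}

/// Sketch of the repair step. `scorer` rates readability, higher is better.
/// Returns true if `text` was replaced.
fn detect_and_repair<F: Fn(&str) -> f32>(text: &mut String, scorer: F) -> bool {
    // Fast path: ASCII and text with too few indicators are left alone.
    if text.is_ascii() || count_indicators(text.as_str()) < MIN_INDICATORS {
        return false;
    }
    // Recover the original bytes, then try to read them as UTF-8.
    let candidate = match encode_cp1252(text.as_str()).map(String::from_utf8) {
        Some(Ok(s)) => s,
        _ => return false, // unmappable char or invalid UTF-8: keep original
    };
    // Accept only if the candidate beats the original by more than EPSILON.
    if scorer(candidate.as_str()) > scorer(text.as_str()) + EPSILON {
        *text = candidate;
        true
    } else {
        false
    }
}
```

With a scorer that prefers text containing fewer indicator sequences, "cafÃ© crÃ¨me" is rewritten to "café crème". A scorer that returns the same value for both strings leaves the text untouched, because the 0.05 margin is not met. The table for 0x80..0x9F is what makes smart-quote sequences such as "â€™" recoverable; a pure Latin-1 encode would fail on `€` and `™`.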
## Acceptance Criteria Status

| Criterion | Status | Notes |
|-----------|--------|-------|
| Span text "café" → no Ã detected → no change | PASS | Clean UTF-8 fast-path works |
| Span text "cafÃ©" → detected, re-decoded to "café" | PASS | Test: test_mojibake_detected_and_repaired |
| Span text "cafÃ©sandbar" (one Ã) → threshold check | PASS | Below 2-indicator threshold, no change |
| Span text "Ã©Ã¨Ã®" (multiple) → re-decoded if score improves | PASS | Test: test_mojibake_multiple_indicators |
| Asian text → pass-through unaffected | PASS | Test: test_asian_text_unaffected |
| Epsilon threshold (0.05) prevents noise | PASS | Tests: test_epsilon_threshold_prevents_noise, test_exact_epsilon_boundary |
| Re-decoding rejected if score doesn't improve | PASS | Test: test_replacement_rejected_if_score_doesnt_improve |

## Test Results

All unit tests in `layout::correction::tests` pass:
- test_clean_utf8_no_change ✓
- test_ascii_only_no_change ✓
- test_empty_string_no_change ✓
- test_mojibake_detected_and_repaired ✓
- test_mojibake_multiple_indicators ✓
- test_mojibake_single_indicator_threshold ✓
- test_smart_quote_mojibake ✓
- test_em_dash_mojibake ✓
- test_replacement_rejected_if_score_doesnt_improve ✓
- test_epsilon_threshold_prevents_noise ✓
- test_asian_text_unaffected ✓
- test_windows1252_specific ✓
- test_mixed_ascii_and_mojibake ✓
- test_nbsp_indicator ✓
- test_multiple_mojibake_patterns ✓
- test_exact_epsilon_boundary ✓
- test_just_above_epsilon ✓

## Git Commits

- `d84f8da` - feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs

## References

- Plan section: Phase 4.7 Correction pipeline step 3 (line 1797)
- Critical test: "Latin-1 mojibake Ã© corrected to é when re-decode raises readability score" (line 1792)