pdftract/crates
jedarden d84f8da3a4 feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs
Implements Phase 4.7 Correction Pipeline step 3: mojibake detection
and repair for UTF-8 bytes misinterpreted as Windows-1252 (Latin-1).

Changes:
- Add layout::correction module with detect_and_repair_mojibake function
- Implement CorrectableText trait for mutable text access
- Add trait implementations for hybrid::Span and schema::SpanJson
- Make encoding_rs a non-optional dependency (was cjk-gated)
- Detection heuristic: 2+ occurrences of telltale sequences (é, è, ’, etc.)
- Re-decode via encoding_rs::WINDOWS_1252 when detected
- Accept the repair only if the readability score improves by more than 0.05 (epsilon)
- Fast-path pass-through for ASCII-only and clean UTF-8 text
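
The flow described in the list above can be sketched roughly as follows. This is a std-only illustration, not the code from this commit: the commit uses encoding_rs::WINDOWS_1252, whereas this sketch swaps in a small hand-written Windows-1252 reverse table. The names `repair_mojibake_sketch`, `to_cp1252_byte`, and `readability`, along with the scoring rule itself, are assumptions made for the example.

```rust
// Hypothetical sketch: names and the scoring rule are illustrative only.

/// Telltale sequences taken from the commit message (UTF-8 read as Windows-1252).
const TELLTALES: [&str; 3] = ["é", "è", "’"];

/// Minimum readability gain required to accept a repair.
const EPSILON: f64 = 0.05;

fn count_telltales(text: &str) -> usize {
    TELLTALES.iter().map(|t| text.matches(t).count()).sum()
}

/// Map a char back to the Windows-1252 byte it was decoded from.
/// Only a handful of 0x80..=0x9F entries are listed (assumption: enough for a demo).
fn to_cp1252_byte(c: char) -> Option<u8> {
    match c {
        '\u{0000}'..='\u{007F}' | '\u{00A0}'..='\u{00FF}' => Some(c as u8),
        '\u{20AC}' => Some(0x80), // euro sign
        '\u{2018}' => Some(0x91), // left single quote
        '\u{2019}' => Some(0x92), // right single quote
        '\u{201C}' => Some(0x93), // left double quote
        '\u{201D}' => Some(0x94), // right double quote
        '\u{2122}' => Some(0x99), // trade mark sign
        _ => None,
    }
}

/// Toy readability score: fraction of chars that are letters, digits,
/// whitespace, or ASCII punctuation. The real scorer is not shown in the commit.
fn readability(text: &str) -> f64 {
    let total = text.chars().count();
    if total == 0 {
        return 1.0;
    }
    let good = text
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace() || c.is_ascii_punctuation())
        .count();
    good as f64 / total as f64
}

/// Re-encode to Windows-1252 bytes, then reinterpret those bytes as UTF-8.
fn try_repair(text: &str) -> Option<String> {
    let bytes: Option<Vec<u8>> = text.chars().map(to_cp1252_byte).collect();
    String::from_utf8(bytes?).ok()
}

fn repair_mojibake_sketch(text: &str) -> String {
    // Fast path: ASCII-only text, or fewer than two telltale sequences.
    if text.is_ascii() || count_telltales(text) < 2 {
        return text.to_string();
    }
    match try_repair(text) {
        // Accept only when the score improves by more than EPSILON.
        Some(fixed) if readability(&fixed) - readability(text) > EPSILON => fixed,
        _ => text.to_string(),
    }
}
```

Under these assumptions, `repair_mojibake_sketch("café crème")` yields `"café crème"`, while a string with a single telltale sequence is passed through unchanged because it falls below the 2+ threshold.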

Closes: pdftract-5qj50

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 17:01:53 -04:00
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli feat(pdftract-64p5): implement classify CLI subcommand and --auto flag 2026-05-24 15:16:56 -04:00
pdftract-core feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs 2026-05-24 17:01:53 -04:00
pdftract-libpdftract feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
pdftract-py feat(pdftract-2nu0s): implement Python SDK contract conformance 2026-05-24 08:55:11 -04:00