pdftract/crates/pdftract-core/src
jedarden d84f8da3a4 feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs
Implements Phase 4.7 Correction Pipeline step 3: mojibake detection
and repair for UTF-8 bytes misinterpreted as Latin-1 (Windows-1252).

Changes:
- Add layout::correction module with detect_and_repair_mojibake function
- Implement CorrectableText trait for mutable text access
- Add trait implementations for hybrid::Span and schema::SpanJson
- Make encoding_rs a non-optional dependency (was cjk-gated)
- Detection heuristic: 2+ occurrences of telltale sequences (Ã©, Ã¨, â€™, etc.)
- Re-decode via encoding_rs::WINDOWS_1252 when detected
- Accept repair only if readability score improves by more than 0.05 (epsilon)
- Fast-path pass-through for ASCII-only and clean UTF-8 text
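
The detect, re-decode, and accept-if-better flow listed above can be sketched roughly as follows. This is a std-only illustration, not the code in layout::correction: the function name `repair_mojibake_sketch`, its signature, the `readability` scorer, and the three-entry telltale list are all hypothetical, and a hand-written Windows-1252 table stands in for encoding_rs::WINDOWS_1252 so the snippet compiles without external crates.

```rust
/// Windows-1252 characters for bytes 0x80..=0x9F (WHATWG mapping: the five
/// unassigned bytes map to the C1 control with the same value).
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// Assumed telltale list: UTF-8 encodings of e-acute, e-grave, and the right
/// single quote as they look after being decoded as Windows-1252.
const TELLTALES: [&str; 3] = ["\u{C3}\u{A9}", "\u{C3}\u{A8}", "\u{E2}\u{20AC}\u{2122}"];

/// Acceptance margin taken from the commit message.
const EPSILON: f64 = 0.05;

/// Map text back to Windows-1252 bytes; None if any char has no such byte.
fn encode_cp1252(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let cp = c as u32;
            if cp < 0x80 || (0xA0..=0xFF).contains(&cp) {
                Some(cp as u8) // ASCII and the Latin-1 upper half map 1:1
            } else {
                CP1252_HIGH.iter().position(|&h| h == c).map(|i| 0x80 + i as u8)
            }
        })
        .collect()
}

/// Hypothetical stand-in scorer: share of chars that are alphanumeric,
/// whitespace, or ASCII punctuation.
fn readability(text: &str) -> f64 {
    let total = text.chars().count();
    if total == 0 {
        return 1.0;
    }
    let good = text
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace() || c.is_ascii_punctuation())
        .count();
    good as f64 / total as f64
}

/// Returns Some(repaired) only when the text looks like mojibake and the
/// round-tripped version scores better by more than EPSILON.
fn repair_mojibake_sketch(text: &str) -> Option<String> {
    if text.is_ascii() {
        return None; // fast path: ASCII cannot contain mojibake
    }
    let hits: usize = TELLTALES.iter().map(|&t| text.matches(t).count()).sum();
    if hits < 2 {
        return None; // fewer than two telltales: treat as clean UTF-8
    }
    let bytes = encode_cp1252(text)?; // undo the wrong Windows-1252 decode
    let repaired = String::from_utf8(bytes).ok()?; // re-read the bytes as UTF-8
    if readability(&repaired) - readability(text) > EPSILON {
        Some(repaired)
    } else {
        None
    }
}
```

Under these assumptions, a caller would replace a span's text only when the function returns Some, and leave it untouched on None. Requiring both the telltale count and the score margin keeps legitimate text that happens to contain a character such as Ã from being rewritten.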

Closes: pdftract-5qj50

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 17:01:53 -04:00
annotation feat(pdftract-3r77): implement non-link annotation extractor with subtype-specific fields 2026-05-24 16:52:51 -04:00
attachment feat(pdftract-3lir): implement Filespec dict + EF stream decoder 2026-05-24 13:54:27 -04:00
cache feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
fingerprint feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
font feat(pdftract-2iur): implement nearest-neighbor scanner with Hamming distance and frequency tie-break 2026-05-24 06:57:27 -04:00
forms feat(pdftract-5qca): implement form_fields JSON output + schema integration 2026-05-24 14:36:03 -04:00
layout feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs 2026-05-24 17:01:53 -04:00
ocr/preprocessing feat(pdftract-5xyjv): implement 3x3 median-filter denoising for OCR preprocessing 2026-05-24 16:09:08 -04:00
parser feat(pdftract-66ykq): implement CCITTFaxDecode passthrough with diagnostics 2026-05-24 13:20:25 -04:00
profiles feat(pdftract-64p5): implement classify CLI subcommand and --auto flag 2026-05-24 15:16:56 -04:00
receipts feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
render feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
schema feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs 2026-05-24 17:01:53 -04:00
signature feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
table feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
atomic_file_writer.rs feat(pdftract-68wfa): implement AtomicFileWriter for atomic file writes 2026-05-24 13:02:37 -04:00
classify.rs feat(pdftract-5v1l9): implement BrokenVector escalation for low-readability pages 2026-05-24 16:16:51 -04:00
content_stream.rs feat(pdftract-3r77): implement non-link annotation extractor with subtype-specific fields 2026-05-24 16:52:51 -04:00
diagnostics.rs feat(pdftract-4dmp): implement text state operators Tc Tw Tz TL Ts Tr 2026-05-24 16:37:39 -04:00
document.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
dpi.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
extract.rs feat(pdftract-5tvv1): implement Tagged-PDF fast-path stub with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic 2026-05-24 16:28:10 -04:00
graphics_state.rs feat(pdftract-4dmp): implement text state operators Tc Tw Tz TL Ts Tr 2026-05-24 16:37:39 -04:00
hybrid.rs feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs 2026-05-24 17:01:53 -04:00
lib.rs feat(pdftract-46qa): implement 7.6.1 annotation walker dispatcher 2026-05-24 15:30:45 -04:00
markdown.rs feat(pdftract-5qca): implement form_fields JSON output + schema integration 2026-05-24 14:36:03 -04:00
ocr.rs feat(pdftract-6dki1): implement histogram stretch contrast normalization 2026-05-24 10:30:20 -04:00
options.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
preprocess.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
render.rs feat(pdftract-axcri): record inline images as ImageXObject entries 2026-05-24 07:41:50 -04:00
semaphore.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
span_flags.rs feat(pdftract-cbrbg): implement span flag detector for Phase 4.1 2026-05-24 07:28:25 -04:00
url_validation.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
word_boundary.rs feat(pdftract-h2s0z): implement adaptive word boundary detector 2026-05-24 06:06:56 -04:00