pdftract/docs/research/text-readability-validation.md
jedarden a7673c906f Add 12 research documents covering full PDF extraction surface
2026-05-16 15:05:42 -04:00


# Text Readability Validation
## Overview
Extracting bytes from a PDF font stream and producing a sequence of Unicode codepoints is necessary but not sufficient. A PDF can encode every character correctly at the byte level while still emitting text that is semantically unreadable — because the font has no ToUnicode map, because a custom encoding overlaps with a standard encoding at the wrong offset, or because the renderer selected the wrong code path. This document defines the algorithms and data structures that `pdftract` uses to detect and remediate unreadable output before it reaches a caller.
---
## 1. Failure Modes: What "Unreadable" Looks Like
Unreadable extraction output falls into several distinct categories, each with a different root cause and remediation path.
**Mojibake** occurs when bytes are decoded with the wrong code page. The classic form is UTF-8 bytes interpreted as Latin-1 or Windows-1252 (or vice versa), producing sequences like `Ã©` for `é` or `â€™` for `’`. These are valid Unicode codepoints, but they are wrong ones.
**Replacement characters** (U+FFFD) appear when a decoder encounters byte sequences that are invalid in the target encoding. A high density of U+FFFD is an unambiguous signal of encoding mismatch.
**Private Use Area codepoints** (U+E000–U+F8FF) are legitimately used in some PDF fonts to encode glyphs that have no standard Unicode assignment, but a prose span where more than a small fraction of codepoints are PUA almost certainly reflects a missing or incorrect ToUnicode map.
**Control characters** in the range U+0000–U+001F (excluding U+0009 TAB and U+000A LF) should never appear in prose extracted from a document. Their presence indicates that glyph IDs are being emitted directly without Unicode mapping.
**Symbol font bleed-through** happens when a font that uses Zapf Dingbats, Symbol, or a custom pi font is decoded as if it were a text font. The result is runs of symbols — ♦ ♣ ♥ ♠ — where letters should be.
**Impossible character sequences** for the detected language include strings like `xzqbvw` in English or `aeiouaeiou` in Czech. Natural languages have strong constraints on consonant/vowel alternation and on which n-grams can appear adjacently.
**Mixed-directionality fragments** without a Unicode Bidirectional Algorithm context marker produce visually disordered text when a span mixes Arabic or Hebrew runs with Latin runs and the bidi embedding levels are absent.
**Zero-width characters** — U+200B ZERO WIDTH SPACE, U+200C/D ZWNJ/ZWJ, U+FEFF BOM used mid-stream — should be rare in extracted prose; dense runs of them indicate malformed CMap output.
---
## 2. Character-Level Validity Checks
Character-level checks are the first filter in the validation pipeline. They operate per-span in O(n) time with no external data dependencies.
**U+FFFD density:** compute `replacement_ratio = fffd_count / total_codepoints`. Flag the span as `"garbled"` if `replacement_ratio > 0.10` and `"low"` quality if `replacement_ratio > 0.02`.
**PUA density:** compute `pua_ratio` over the BMP Private Use Area (U+E000–U+F8FF) and the Supplementary Private Use Areas (U+F0000–U+FFFFD and U+100000–U+10FFFD). Flag as `"garbled"` if `pua_ratio > 0.40`. A small PUA ratio (< 0.05) may be acceptable for documents using custom ligature glyphs.
**Control character scan:** iterate codepoints; any codepoint in U+0000–U+0008 or U+000B–U+001F (that is, any C0 control other than TAB and LF) in a prose span is an immediate `"low"` flag and adds `"control_chars"` to `quality_signals`.
**Combining character orphans:** a sequence of combining characters (Unicode category M) not preceded by a base character (category L, N, or P) indicates CMap corruption. Detect runs of three or more consecutive combining characters.
**Anomalous Unicode block concentration:** compute the fraction of codepoints falling in Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) or Enclosed Alphanumerics (U+2460–U+24FF). Values above 0.15 in a prose context indicate symbol font confusion.
**Unicode category distribution:** for a valid English paragraph, the dominant categories are `Ll` (lowercase letter), `Lu` (uppercase letter), `Nd` (decimal digit), `Po` (other punctuation), and `Zs` (space separator). Compute category histograms and compare against expected priors. A span where `Lo` (other letter: CJK, Arabic, etc.) exceeds 0.60 but the document language is detected as Latin-script warrants a `"medium"` flag.
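
The density checks in this section reduce to a single pass over the span. The following is a minimal sketch of that pass, not pdftract's actual code: `CharStats`, `char_stats`, and the helper names are hypothetical, and the PUA test assumes the standard Unicode ranges for the BMP and supplementary Private Use Areas. Category histograms and combining-character checks are left out because they need Unicode property tables that the standard library does not provide.

```rust
/// Counts gathered in one pass over a span. `CharStats` and `char_stats`
/// are illustrative names, not part of pdftract's API.
#[derive(Debug, Default, PartialEq)]
pub struct CharStats {
    pub total: usize,
    pub fffd: usize,
    pub pua: usize,
    pub control: usize,
}

impl CharStats {
    fn ratio(&self, count: usize) -> f64 {
        if self.total == 0 { 0.0 } else { count as f64 / self.total as f64 }
    }
    /// fffd_count / total_codepoints
    pub fn replacement_ratio(&self) -> f64 { self.ratio(self.fffd) }
    /// pua_count / total_codepoints
    pub fn pua_ratio(&self) -> f64 { self.ratio(self.pua) }
}

/// True for BMP and supplementary Private Use Area codepoints.
fn is_pua(c: char) -> bool {
    matches!(c as u32, 0xE000..=0xF8FF | 0xF0000..=0xFFFFD | 0x10_0000..=0x10_FFFD)
}

/// True for C0 controls other than TAB (U+0009) and LF (U+000A).
fn is_disallowed_control(c: char) -> bool {
    matches!(c as u32, 0x00..=0x08 | 0x0B..=0x1F)
}

/// Single O(n) pass collecting the counts used by the density checks.
pub fn char_stats(span: &str) -> CharStats {
    let mut s = CharStats::default();
    for c in span.chars() {
        s.total += 1;
        if c == '\u{FFFD}' { s.fffd += 1; }
        if is_pua(c) { s.pua += 1; }
        if is_disallowed_control(c) { s.control += 1; }
    }
    s
}
```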
---
## 3. Word-Level Validity
Word-level checks require tokenizing the span on whitespace and punctuation boundaries, then evaluating each token.
**Bloom filter word list:** maintain a Bloom filter over a ~500,000-word corpus (one per supported language). At a 0.1% false positive rate with 10 hash functions, the optimal filter needs about 14.4 bits per word, or approximately 0.9 MB per language. Membership queries are probabilistic and cost O(k) for k hash functions, independent of dictionary size. In Rust, the `bloomfilter` crate or a hand-rolled implementation over `xxhash` works well. Load the filter lazily per detected language.
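
A hand-rolled filter can be quite small. The sketch below is one possible shape, under stated assumptions: the `Bloom` type and its methods are hypothetical, the bit count and probe count come from the standard sizing formulas `m = -n ln p / (ln 2)^2` and `k = (m / n) ln 2`, and probes are derived from two base hashes by double hashing. It uses the standard library's `DefaultHasher` only to stay dependency-free; that hasher's algorithm is not guaranteed stable across Rust releases, so a filter that is serialized to disk would need a fixed hash such as xxHash instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative Bloom filter; not pdftract's actual type.
pub struct Bloom {
    pub bits: Vec<u64>,
    pub m: u64, // number of bits
    pub k: u32, // number of probes per item
}

impl Bloom {
    /// Size the filter for `n` items at target false positive rate `p`.
    pub fn with_rate(n: usize, p: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil().max(64.0) as u64;
        let k = ((m as f64 / n.max(1) as f64) * ln2).round().max(1.0) as u32;
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }

    /// Two base hashes; probe i uses a + i * b (double hashing).
    fn hashes(word: &str) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        word.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (word, 0x9E37_79B9u32).hash(&mut h2);
        (h1.finish(), h2.finish() | 1) // odd stride so probes differ
    }

    pub fn insert(&mut self, word: &str) {
        let (a, b) = Self::hashes(word);
        for i in 0..self.k as u64 {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.m;
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// False means definitely absent; true means present or a false positive.
    pub fn contains(&self, word: &str) -> bool {
        let (a, b) = Self::hashes(word);
        (0..self.k as u64).all(|i| {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.m;
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }

    pub fn size_bytes(&self) -> usize { self.bits.len() * 8 }
}
```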
**Real-word ratio:** `real_word_ratio = (dictionary_hits + numeric_tokens) / total_tokens`. Require `real_word_ratio >= 0.60` for `"high"` quality. Values in [0.35, 0.60) map to `"medium"`. Below 0.35, flag `"low"`.
**Consonant/vowel ratio:** for Latin-script languages, compute the ratio of consonant letters to vowel letters in the span. English prose clusters around 1.4–1.8. A ratio above 5.0 or below 0.3 is anomalous. This check catches both garbled encoding and accidental extraction of phoneme tables.
**Character n-gram plausibility:** build a bigram or trigram presence set from a reference corpus (compactly encoded as a sorted array of 16-bit hashes). For each character trigram in the extracted text, check membership. If more than 20% of trigrams are absent from the reference set, add `"ngram_anomaly"` to `quality_signals`. Character sequences like `fsqz`, `bxwk`, or `qzjv` are built from trigrams with near-zero frequency in English, and their presence is diagnostic.
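
The two ratio checks above can be sketched as follows. This is a simplified illustration rather than the project's implementation: the function names are hypothetical, a `HashSet` stands in for the per-language Bloom filter, tokenization is a naive split on non-alphanumeric characters (so `don't` becomes two tokens), and the vowel set is the ASCII one, which would need extending for accented Latin letters.

```rust
use std::collections::HashSet;

/// (dictionary_hits + numeric_tokens) / total_tokens.
/// A `HashSet` stands in for the per-language Bloom filter here.
pub fn real_word_ratio(span: &str, dict: &HashSet<&str>) -> f64 {
    let tokens: Vec<String> = span
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect();
    if tokens.is_empty() {
        return 0.0;
    }
    let hits = tokens
        .iter()
        .filter(|t| t.chars().all(|c| c.is_ascii_digit()) || dict.contains(t.as_str()))
        .count();
    hits as f64 / tokens.len() as f64
}

/// Consonant-to-vowel ratio over ASCII letters; `None` when the span has
/// no vowels, which callers can treat as anomalous in its own right.
pub fn cv_ratio(span: &str) -> Option<f64> {
    let (mut consonants, mut vowels) = (0usize, 0usize);
    for ch in span.chars().filter(|ch| ch.is_ascii_alphabetic()) {
        if "aeiouAEIOU".contains(ch) { vowels += 1 } else { consonants += 1 }
    }
    if vowels == 0 { None } else { Some(consonants as f64 / vowels as f64) }
}

/// Threshold mapping for the real-word ratio: "high", "medium", or "low".
pub fn word_quality(ratio: f64) -> &'static str {
    if ratio >= 0.60 { "high" } else if ratio >= 0.35 { "medium" } else { "low" }
}
```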
---
## 4. Entropy-Based Detection
Shannon entropy provides a language-agnostic, O(n) garble detector. Compute character-level entropy over a span as:
```
H = -Σ p(c) * log2(p(c))
```
where `p(c)` is the empirical frequency of codepoint `c` in the span.
Expected entropy ranges:
- English prose: 4.0–5.0 bits/char
- Random Unicode glyphs: 7.0–8.0 bits/char
- Repeated patterns or single-glyph runs: < 1.5 bits/char
- Base64 or hex strings: 5.0–6.0 bits/char
Spans with `H > 6.5` are likely garbled; spans with `H < 1.5` are likely repeated/template noise. Both conditions add `"entropy_anomaly"` to `quality_signals` and reduce quality to at most `"medium"`.
Per-block entropy scoring: divide each page block into 128-codepoint windows and compute entropy per window. A bimodal distribution within a single block (some windows normal, some high entropy) indicates interleaved readable and garbled content, which may call for span-level rather than block-level remediation.
Entropy alone cannot distinguish high-entropy valid content (technical identifiers, URLs, code snippets) from garble. It is a necessary but not sufficient signal; always pair with word-level and n-gram checks.
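
For reference, the entropy formula and the windowed variant can be written in a few lines. This is a sketch with hypothetical function names. One caveat follows directly from the formula: a span of n codepoints can never exceed log2(n) bits, so a 128-codepoint window tops out at 7 bits and very short spans will always look low-entropy. A real pipeline would presumably skip the check below some minimum length, which is an assumption and not something stated above.

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per codepoint; 0.0 for an empty slice.
pub fn shannon_entropy(chars: &[char]) -> f64 {
    if chars.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<char, usize> = HashMap::new();
    for &c in chars {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = chars.len() as f64;
    counts
        .values()
        .map(|&k| {
            let p = k as f64 / n; // empirical frequency p(c)
            -p * p.log2()
        })
        .sum()
}

/// Entropy of each consecutive `window`-codepoint slice of `text`.
/// The final window may be shorter than `window`.
pub fn windowed_entropy(text: &str, window: usize) -> Vec<f64> {
    let chars: Vec<char> = text.chars().collect();
    chars.chunks(window.max(1)).map(shannon_entropy).collect()
}

/// True when the span should receive the entropy anomaly signal.
pub fn entropy_anomaly(h: f64) -> bool {
    h > 6.5 || h < 1.5
}
```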
---
## 5. Language Model Perplexity Scoring
A character-level n-gram language model assigns a probability to each character given its context, enabling perplexity scoring without a word boundary assumption.
**Model choice:** a 4-gram character model trained on language-specific Common Crawl shards. Store log-probabilities in a compact trie or a flat sorted array of (n-gram hash, log-prob) pairs. A 4-gram model for English requires approximately 8–20 MB in this encoding. The `whichlang` crate provides language identification but not perplexity; build or embed a separate compact model.
**Perplexity computation:** for a span of length N, perplexity is `PP = exp(-1/N * Σ log P(c_i | c_{i-3}..c_{i-1}))`. Valid English text has perplexity roughly in [5, 30] under a well-trained model. Garbled text commonly exceeds 200.
**Threshold:** flag spans with perplexity > 100 as `"low"` quality; above 300 as `"garbled"`. Spans below 10 sit at the bottom of the normal range; they may indicate repeated boilerplate and are worth a separate low-entropy check.
**Runtime tradeoff:** perplexity scoring is more expensive than entropy. Apply it only to spans that pass character-level checks but fail word-level checks — treating it as a second-pass arbiter rather than a first-line filter.
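
The perplexity formula can be illustrated with a toy model. Everything in this sketch is an assumption made for the example: `CharModel` is a hypothetical type, a `HashMap` keyed by the four-character window replaces the compact trie or sorted array, and unseen n-grams fall back to a fixed `floor` log-probability where a real model would use smoothing or backoff. Here N counts the scored characters, meaning those with a full three-character context.

```rust
use std::collections::HashMap;

/// Toy character 4-gram model. Each key is a 4-character window and the
/// value is ln P(last char | first three chars).
pub struct CharModel {
    pub log_probs: HashMap<[char; 4], f64>,
    /// Log-probability assumed for unseen n-grams (illustrative stand-in
    /// for proper smoothing).
    pub floor: f64,
}

impl CharModel {
    /// PP = exp(-(1/N) * sum of ln P(c_i | c_{i-3}..c_{i-1})).
    /// Returns `None` when the text is too short to form one window.
    pub fn perplexity(&self, text: &str) -> Option<f64> {
        let chars: Vec<char> = text.chars().collect();
        if chars.len() < 4 {
            return None;
        }
        let mut sum = 0.0;
        let mut n = 0usize;
        for w in chars.windows(4) {
            let key = [w[0], w[1], w[2], w[3]];
            sum += self.log_probs.get(&key).copied().unwrap_or(self.floor);
            n += 1;
        }
        Some((-sum / n as f64).exp())
    }
}

/// Threshold mapping from the text: > 300 garbled, > 100 low.
pub fn perplexity_flag(pp: f64) -> Option<&'static str> {
    if pp > 300.0 {
        Some("garbled")
    } else if pp > 100.0 {
        Some("low")
    } else {
        None
    }
}
```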
---
## 6. Cross-Validation Between Extraction Paths
When both vector text extraction and OCR output are available for the same page region (e.g., a page with embedded text on which OCR was also run as a confidence check), compare the two using normalized edit distance (Levenshtein distance divided by the length of the longer string).
**Agreement criterion:** if `normalized_edit_distance < 0.15`, both paths agree and confidence is high regardless of individual quality signals. If `0.15 ≤ distance < 0.40`, flag for review but prefer the vector path. If `distance ≥ 0.40`, the paths disagree significantly; use OCR output as a spell-check oracle — compute per-word overlap between OCR and vector output, and prefer whichever achieves higher real-word ratio.
This cross-validation is also useful for detecting symbol font bleed-through: OCR on a symbol font region will produce incoherent results too, which confirms the region is non-textual, whereas OCR on correctly encoded text that the vector path garbled will produce coherent text that diverges significantly from the vector output.
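
The agreement criterion can be sketched with a standard two-row Levenshtein implementation. The names `normalized_edit_distance`, `Agreement`, and `classify_agreement` are hypothetical, and distances are measured in Unicode scalar values rather than grapheme clusters, which is a simplification. For long regions the quadratic cost may matter, so comparing per line or per block is probably more practical than comparing whole pages.

```rust
/// Levenshtein distance over Unicode scalar values (two-row dynamic programming).
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        cur[0] = i;
        for j in 1..=b.len() {
            let substitute = prev[j - 1] + if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = substitute.min(prev[j] + 1).min(cur[j - 1] + 1);
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// Levenshtein distance divided by the length of the longer string.
pub fn normalized_edit_distance(a: &str, b: &str) -> f64 {
    let longer = a.chars().count().max(b.chars().count());
    if longer == 0 { 0.0 } else { levenshtein(a, b) as f64 / longer as f64 }
}

#[derive(Debug, PartialEq)]
pub enum Agreement {
    Agree,    // distance < 0.15: both paths agree
    Review,   // 0.15 <= distance < 0.40: flag, prefer the vector path
    Disagree, // distance >= 0.40: arbitrate by real-word ratio
}

pub fn classify_agreement(distance: f64) -> Agreement {
    if distance < 0.15 {
        Agreement::Agree
    } else if distance < 0.40 {
        Agreement::Review
    } else {
        Agreement::Disagree
    }
}
```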
---
## 7. Symbol Font Detection and Recovery
Symbol fonts are the most common source of coherent-looking but semantically wrong text. Detection combines font metadata with codepoint analysis.
**Font-level signal:** inspect the `Flags` entry of the font's `FontDescriptor` dictionary. Bit 3 (`Symbolic`, mask `0x04`) set and bit 6 (`Nonsymbolic`, mask `0x20`) clear indicates the font self-declares as symbolic; the PDF specification numbers flag bits from 1, so these are not zero-based shift counts. Additionally, check the font name against known symbol font names: `Symbol`, `ZapfDingbats`, `Wingdings`, `Webdings`, and variants.
**Codepoint-level signal:** compute the fraction of output codepoints in Unicode Dingbats (U+2700–U+27BF), Miscellaneous Symbols (U+2600–U+26FF), Mathematical Operators (U+2200–U+22FF), and Box Drawing (U+2500–U+257F). A combined fraction above 0.30 in a body-text span is strongly indicative.
**Remediation:** do not emit these spans as prose. Annotate them with `readable: false`, `quality: "garbled"`, and add `"symbol_font"` to `quality_signals`. If the caller has requested exhaustive extraction, emit the raw codepoints under a `raw_glyphs` field. Do not attempt character correction on symbol font output — the mapping is fundamentally wrong at the encoding level, not the decoding level.
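
Both detection signals are cheap to compute. The sketch below shows one way to do it; the function names are hypothetical, the flag masks follow the PDF specification's 1-based bit numbering (bit 3 is `0x04`, bit 6 is `0x20`), and the name check is a deliberately loose prefix match that strips a subset tag such as `ABCDEF+` first. A prefix match will also catch unrelated fonts whose names merely begin with one of these strings, so it should be treated as a hint that needs the codepoint-level signal to confirm it.

```rust
/// FontDescriptor /Flags masks. The PDF specification numbers flag bits
/// from 1, so bit 3 is mask 0x04 and bit 6 is mask 0x20.
const SYMBOLIC: u32 = 1 << 2;
const NONSYMBOLIC: u32 = 1 << 5;

/// True when the font self-declares as symbolic.
pub fn declares_symbolic(flags: u32) -> bool {
    (flags & SYMBOLIC) != 0 && (flags & NONSYMBOLIC) == 0
}

/// Loose name check against the known symbol font families.
pub fn is_known_symbol_font(name: &str) -> bool {
    // Drop a subset prefix such as "ABCDEF+" before matching.
    let base = name.rsplit('+').next().unwrap_or(name).to_ascii_lowercase();
    ["symbol", "zapfdingbats", "wingdings", "webdings"]
        .iter()
        .any(|known| base.starts_with(*known))
}

/// Fraction of codepoints in Mathematical Operators, Box Drawing,
/// Miscellaneous Symbols, and Dingbats.
pub fn symbol_fraction(span: &str) -> f64 {
    let total = span.chars().count();
    if total == 0 {
        return 0.0;
    }
    let hits = span
        .chars()
        .filter(|&c| {
            matches!(
                c as u32,
                0x2200..=0x22FF | 0x2500..=0x257F | 0x2600..=0x26FF | 0x2700..=0x27BF
            )
        })
        .count();
    hits as f64 / total as f64
}
```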
---
## 8. Post-Detection Remediation
When a span fails validation, the remediation decision tree is:
1. **Try font encoding recovery** (see `glyph-recognition-and-unicode-recovery.md`). If the font has a usable glyph outline and the issue is a missing ToUnicode map, heuristic name-based mapping or shape similarity to a reference font may recover the correct codepoints. Re-run validation on the recovered span.
2. **Re-run OCR on the page region** if encoding recovery fails or if the span is flagged `"garbled"` and the page has raster content at sufficient DPI. OCR is slow but authoritative on the visual content. Store the OCR result under `ocr_text` alongside the vector extraction.
3. **Emit with degraded quality metadata** if neither recovery path succeeds or is available. Set `quality: "low"` or `quality: "garbled"` and `readable: false`. Populate `quality_signals` with the list of triggered checks. This allows callers to filter, log, or surface the spans without crashing on unexpected content.
4. **Character-level correction** using edit distance to the nearest dictionary word is a last resort, applicable only to short tokens (≤ 12 characters) that fail the real-word check by a small margin. Compute Levenshtein distance to candidates within distance 2 using a BK-tree over the word list. Apply correction only if a unique nearest neighbor exists at distance 1 and the corrected span passes n-gram validation.
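
The acceptance rule in step 4 (short token, exactly one dictionary neighbor at distance 1) can be sketched as follows. To keep the example self-contained it scans a word slice linearly instead of querying a BK-tree, so it demonstrates the rule and not the index. `correct_token` is a hypothetical name, the word list is assumed to be deduplicated, and the caller is assumed to invoke it only on tokens that already failed the real-word check and to re-run n-gram validation on the result.

```rust
/// Levenshtein distance over chars (two-row dynamic programming).
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        cur[0] = i;
        for j in 1..=b.len() {
            let substitute = prev[j - 1] + if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = substitute.min(prev[j] + 1).min(cur[j - 1] + 1);
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// Returns a correction only when the token is at most 12 characters long
/// and exactly one word in `words` lies at edit distance 1.
/// A linear scan stands in for the BK-tree lookup.
pub fn correct_token<'a>(token: &str, words: &[&'a str]) -> Option<&'a str> {
    let t: Vec<char> = token.chars().collect();
    if t.is_empty() || t.len() > 12 {
        return None;
    }
    let mut found: Option<&'a str> = None;
    for &w in words {
        let wc: Vec<char> = w.chars().collect();
        // A length difference above 1 already rules out distance 1.
        if (wc.len() as isize - t.len() as isize).abs() > 1 {
            continue;
        }
        if levenshtein(&t, &wc) == 1 {
            if found.is_some() {
                return None; // two neighbors at distance 1: ambiguous
            }
            found = Some(w);
        }
    }
    found
}
```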
---
## 9. Span-Level Quality Metadata
Each extracted `TextSpan` carries the following readability fields:
```rust
pub struct TextSpan {
    pub text: String,
    pub quality: SpanQuality,                // High, Medium, Low, Garbled
    pub readable: bool,                      // true iff quality is High or Medium
    pub quality_signals: Vec<QualitySignal>, // which checks triggered
    pub confidence: f32,                     // 0.0–1.0 composite score
}

pub enum SpanQuality { High, Medium, Low, Garbled }

pub enum QualitySignal {
    ReplacementChars,
    PuaCodepoints,
    ControlChars,
    EntropyAnomaly,
    NgramAnomaly,
    LowRealWordRatio,
    SymbolFont,
    CvRatioAnomaly,
    CombiningOrphan,
}
```
`quality: "high"` requires: `real_word_ratio ≥ 0.60`, `replacement_ratio < 0.02`, `pua_ratio < 0.05`, entropy in [3.5, 6.5], no `quality_signals` triggered.
`quality: "medium"` requires: at most two non-critical signals triggered, `real_word_ratio ≥ 0.35`, no garble-level entropy.
`quality: "low"` means the span may contain recoverable text but significant anomalies are present.
`quality: "garbled"` means the span almost certainly does not contain readable prose in its current form.
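
One way to turn these tier definitions into code is a small decision function. The sketch below is an interpretation, not the project's logic: `SpanMetrics` and `classify` are hypothetical, the enum is redeclared so the example compiles on its own, and because the text does not say which signals count as "non-critical", the sketch assumes every signal that has not already forced `Garbled` or `Low` is non-critical. The garble and low conditions are taken from the character-level and symbol font thresholds given earlier.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SpanQuality { High, Medium, Low, Garbled }

/// Inputs gathered by the earlier checks. Field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct SpanMetrics {
    pub real_word_ratio: f64,
    pub replacement_ratio: f64,
    pub pua_ratio: f64,
    pub entropy: f64,
    pub signal_count: usize, // number of quality_signals triggered
    pub symbol_font: bool,
    pub control_chars: bool,
}

pub fn classify(m: &SpanMetrics) -> SpanQuality {
    // Hard garble conditions: U+FFFD density, PUA density, symbol font.
    if m.replacement_ratio > 0.10 || m.pua_ratio > 0.40 || m.symbol_font {
        return SpanQuality::Garbled;
    }
    // Immediate "low" conditions from the character-level checks.
    if m.control_chars || m.replacement_ratio > 0.02 {
        return SpanQuality::Low;
    }
    let entropy_ok = m.entropy >= 3.5 && m.entropy <= 6.5;
    if m.signal_count == 0
        && m.real_word_ratio >= 0.60
        && m.pua_ratio < 0.05
        && entropy_ok
    {
        return SpanQuality::High;
    }
    // Assumption: all remaining signals are treated as "non-critical".
    if m.signal_count <= 2 && m.real_word_ratio >= 0.35 && m.entropy <= 6.5 {
        return SpanQuality::Medium;
    }
    SpanQuality::Low
}

/// `readable` is true iff quality is High or Medium.
pub fn readable(q: SpanQuality) -> bool {
    matches!(q, SpanQuality::High | SpanQuality::Medium)
}
```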
---
## 10. Block-Level Readability Score
Aggregate span quality into a block-level score using a weighted mean:
```
block_score = Σ (span_confidence * span_char_count) / Σ span_char_count
```
Map `SpanQuality` to a base confidence: `High → 1.0`, `Medium → 0.65`, `Low → 0.30`, `Garbled → 0.0`. Adjust by the `confidence` field if finer-grained scoring is available from perplexity.
**Page-level readability score:** compute the character-weighted mean of block scores across the page. A score below 0.50 on a nominally vector page should trigger automatic OCR fallback for that page. Expose both block and page scores in the output:
```rust
pub struct PageReadability {
    pub score: f32,                        // 0.0–1.0
    pub ocr_recommended: bool,             // score < threshold
    pub block_scores: Vec<(BlockId, f32)>,
}
```
The threshold for `ocr_recommended` is configurable, defaulting to 0.50. Callers building pipelines that prioritize accuracy over speed can raise this to 0.70 so that more pages fall back to OCR; callers that trust vector extraction for well-formed documents can lower it to 0.35 or disable the check entirely.
The page-level score also serves as a signal for the block-level zone labeling pipeline (see `document-classification-and-zone-labeling.md`): a page with a score below 0.30 is a candidate for whole-page OCR rather than incremental span recovery.
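
The scoring in this section comes down to a character-weighted mean applied twice, once over spans and once over blocks. A minimal sketch under stated assumptions: the function names are hypothetical, the enum is redeclared so the example stands alone, and the base confidences are used directly with no perplexity adjustment. An empty block or page yields `None` rather than a score, which leaves the caller to decide how to treat pages with no text at all.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SpanQuality { High, Medium, Low, Garbled }

/// Base confidence per quality tier.
pub fn base_confidence(q: SpanQuality) -> f32 {
    match q {
        SpanQuality::High => 1.0,
        SpanQuality::Medium => 0.65,
        SpanQuality::Low => 0.30,
        SpanQuality::Garbled => 0.0,
    }
}

/// Character-weighted mean of (score, char_count) pairs.
/// Used for spans within a block and for blocks within a page.
pub fn weighted_score(items: &[(f32, usize)]) -> Option<f32> {
    let total: usize = items.iter().map(|&(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    let weighted: f32 = items.iter().map(|&(score, n)| score * n as f32).sum();
    Some(weighted / total as f32)
}

/// Block score from (quality, char_count) spans, using base confidences only.
pub fn block_score(spans: &[(SpanQuality, usize)]) -> Option<f32> {
    let items: Vec<(f32, usize)> = spans
        .iter()
        .map(|&(q, n)| (base_confidence(q), n))
        .collect();
    weighted_score(&items)
}

/// OCR is recommended when the page score falls below the threshold.
pub fn ocr_recommended(page_score: f32, threshold: f32) -> bool {
    page_score < threshold
}
```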