jedarden 31e715633d Add four research documents on text quality and document-type handling

- text-readability-validation: character/word/entropy/perplexity checks,
  symbol font detection, remediation decision tree, span quality metadata
- post-ocr-text-correction: error taxonomy, confusable tables, noisy channel
  n-gram model, regex patterns, hyphenation, layout-based correction pipeline
- presentation-and-spreadsheet-pdfs: detection heuristics, slide structure,
  bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries,
  cell type inference, Rust output schema
- semantic-text-reconstruction: beam search n-gram reconstruction, NER
  validation, domain lexicons, cross-span consistency, abbreviation expansion,
  citation repair, coherence scoring, ReconstructedSpan output schema

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:07:30 -04:00

15 KiB

Raw Permalink Blame History

Semantic Text Reconstruction in PDF Extraction

Character-level Unicode recovery and word-level normalization handle the majority of extraction errors, but a class of failures resists both: cases where the raw bytes decode to plausible characters that nonetheless form meaningless or ambiguous text. Fixing these requires semantic context — understanding what the text is about, not just how individual glyphs encode. This document describes the algorithms, data structures, and Rust engineering concerns for implementing a semantic reconstruction layer in pdftract.

1. When Semantic Reconstruction Is Needed

Several classes of content systematically evade character- and word-level correction:

Multi-word technical terms split across encoding boundaries. A term like "polymerase chain reaction" may straddle a font-switch boundary mid-phrase, producing one half in correct encoding and the other in a shifted code page. Each half looks like a valid English word; only the joined phrase reveals the error.

Scientific names garbled by font substitution. Italic species names (Escherichia coli, Drosophila melanogaster) are often typeset in a separate italic font with its own broken ToUnicode CMap. The substitution produces character-by-character errors that a spell-checker cannot catch because the garbled form may accidentally be a common word.

Proper nouns and acronyms. A person's name, an organization acronym, or a product name has no entry in a general dictionary. Extraction errors in these spans go undetected by dictionary lookup. NER provides the discriminating signal.

Cross-lingual content. A Latin phrase in an English document (inter alia, habeas corpus) may be typeset in a decorative font without a ToUnicode entry, producing garbled output. The correct text is not in an English dictionary, but it is in a Latin lexicon and is highly recognizable by n-gram models trained on legal Latin.

Hyphenated German compounds. Words like Verschlüsselungsalgorithmus or Haftpflichtversicherung do not appear in compact dictionaries. Hyphenated splits (Haftpflicht-versicherung) add ambiguity: is the hyphen intentional or a line-break artifact? Compound-aware morphological analysis is the only reliable path.

Mathematical notation in prose. A formula like "O(n log n)" contains parentheses, letters, and operators whose font encodings frequently diverge. The span must be recognized as mathematical before any correction is applied; applying word-level normalization to a formula destroys it.

2. N-gram Context Reconstruction

When a span is flagged as low-confidence (below a configurable confidence_threshold: f32), the reconstructor enumerates alternative interpretations and scores each using a language model.

Character n-gram scoring. A character 5-gram language model, trained on a representative corpus in the target language, assigns a log-probability to each candidate string. For each low-confidence character position, enumerate substitutions constrained by the character confusable set — characters whose glyph shapes are known to be confused by common encoding bugs (e.g., l↔1, O↔0, rn↔m, cl↔d). The confusable set is stored as a HashMap<char, SmallVec<[char; 4]>> for O(1) lookup.

Word n-gram scoring. After candidate character strings are generated, score them as word sequences using a compressed word bigram or trigram model (ARPA format, loaded as a finite-state acceptor). This promotes candidates that form fluent phrases over candidates that score well character-by-character but form nonsense word sequences.

Beam search. Enumerate alternatives using beam search over the character lattice of the low-confidence span. At each position, retain the top-k partial hypotheses by cumulative log-probability. Typical values: beam width k = 16 for character-level search, k = 8 after word n-gram rescoring. Wider beams improve recall at quadratic cost; the tradeoff is configurable via ReconstructorConfig { beam_width: usize, max_span_chars: usize }. For spans longer than max_span_chars (default 40), fall back to greedy decoding to bound computation.

Pruning. Before scoring, prune hypotheses that violate hard constraints: the candidate must not contain characters outside the Unicode category set observed in surrounding context; the edit distance from the raw decoded form must not exceed a per-character threshold (default 1.5 edits per 10 characters). This eliminates the combinatorial explosion from considering arbitrary substitutions.

3. Named Entity Recognition for Validation

A lightweight NER model — a CRF or a small transformer quantized to 8-bit integers — classifies spans as Person, Organization, Location, Date, Number, or Other. NER serves two roles: identifying what kind of thing a span is, and then applying entity-type-specific validation.

Type-specific validation. A span classified as Date must parse as a valid calendar date under the document's locale. A span classified as Number must be parseable as an integer, decimal, or scientific notation value. A span classified as Organization is checked against a pre-loaded organization lexicon. When validation fails, the span is flagged for reconstruction; when it passes, the raw extraction is accepted regardless of character-level confidence.

Context-conditioned classification. The entity classifier uses a sliding window of surrounding tokens as context. A span surrounded by financial terminology (ticker symbols, "EBITDA", "basis points") is classified as a financial entity before the span itself is inspected. This reduces false positives where a garbled span accidentally resembles a common word but is semantically a proper noun.

4. Domain-Specific Lexicons

A general English dictionary misses the vast majority of domain vocabulary. pdftract loads supplementary lexicons based on a document classification step that runs before reconstruction.

Domain lexicon types: legal (Latin maxims, case citation formats, court names), medical (ICD-10 codes, drug generic and brand names, anatomical terms, lab value abbreviations), financial (ticker symbols, CUSIP/ISIN patterns, ratio names), scientific (IUPAC chemical names, species binomials, journal abbreviations per ISO 4).

Document classification trigger. A lightweight bag-of-words classifier (multinomial naive Bayes, ~50 KB model) classifies the document into one or more domains after the first-pass extraction. Domains with posterior probability above 0.6 trigger loading the corresponding lexicon. Multiple domains are possible (a clinical trial report is both medical and statistical).

Bloom filter storage. Each domain lexicon is stored as a Bloom filter for O(1) membership queries with bounded false positive rate. A 16-bit cuckoo filter (using the cuckoofilter crate or a hand-rolled implementation) achieves a 0.1% false positive rate at 12 bits per entry. Term lookup: normalize the candidate string (lowercase, NFC), query the filter; on a positive hit, optionally verify against a sorted &[u8] slice for exact confirmation. Total storage for a 500,000-term medical lexicon is approximately 750 KB.

5. Cross-Span Consistency

The same visual glyph sequence must extract consistently across the document. Inconsistent extraction — where the same term appears as "photosynthesis" on page 3 and "ph0tosynthesis" on page 17 due to differing font embedding quality — is a common failure mode in multi-section PDFs assembled from separately authored chapters.

Canonical form selection. After per-page extraction, group textual spans by their character-level fingerprint: the sequence of Unicode general categories and ASCII characters, with non-ASCII collapsed to a category placeholder. Within each group, select the canonical form as the extraction with the highest summed confidence score; if confidence is tied, use the most frequent string. Write the canonical form to a HashMap<SpanFingerprint, Arc<str>>.

Normalization pass. In a second pass, any span whose extracted string differs from the canonical form for its fingerprint is replaced with the canonical form and its reconstruction_method set to CrossSpan. The replacement is only performed when the edit distance between the variant and the canonical form is below a threshold (default: 15% of canonical length), preventing spurious merging of genuinely distinct terms.

6. Abbreviation Expansion

Abbreviations break sentence boundary detection, inflate vocabulary, and reduce readability. The reconstruction layer handles three kinds:

Standard abbreviations. A trie-based lookup (AhoCorasick automaton from the aho-corasick crate) matches spans against a compiled list of standard abbreviations ("e.g.", "i.e.", "cf.", "op. cit.", "et al."). On a match, the span is tagged with its expansion in the output metadata; the text field retains the abbreviated form by default (expansion is opt-in via ReconstructorConfig { expand_abbreviations: bool }).

Document-internal definitions. The pattern <full form> (<short form>) defines a document-local abbreviation. Detected using a regex over the token stream: \b([A-Z][a-z]+(?: [A-Z][a-z]+)+)\s+\(([A-Z]{2,8})\). On detection, insert into a per-document HashMap<String, String> mapping short form to long form. All subsequent occurrences of the short form are expanded using this map, with reconstruction_method set to DocumentAbbrev.

Ambiguous period handling. A period following a known abbreviation is not a sentence boundary. This table is shared between the abbreviation expander and the sentence boundary detector.

7. Reference and Citation Reconstruction

Bibliography sections have distinct extraction failure patterns: author names are frequently reordered or truncated, journal titles run together with volume numbers, and DOIs are broken by line-wrap hyphenation.

Zone detection. The bibliography zone is identified by a combination of signals: high density of year-pattern tokens (four-digit sequences in the range 1900–2050), high density of capitalized multi-word sequences, and a section header matching a list of known bibliography heading strings. Once identified, reconstruction rules specific to the reference zone are applied.

Structural validation. A DOI must match 10\.\d{4,9}/[-._;()/:A-Za-z0-9]+. An ISSN must match \d{4}-\d{3}[\dX]. A year must parse as an integer in the range 1700–2100. When a span in the reference zone partially matches one of these patterns but contains obvious character substitutions (e.g., l0.1038/... where l0 should be 10), apply a targeted correction using the pattern as a template.

DOI normalization. Hyphens inserted by PDF line-wrapping inside a DOI are detected by splitting on hyphen and checking whether reassembly yields a valid DOI regex match. If so, remove the hyphen.

8. Sentence Boundary Detection

Periods are the most ambiguous character in prose text. A period may end a sentence, terminate an abbreviation, separate decimal digits, form an ellipsis, or appear inside a URL or DOI. A rule-based sentence boundary detector resolves ambiguity in order:

If the preceding token is in the abbreviation table, the period is not a sentence boundary.
If the preceding token is a single uppercase letter (initials), the period is not a sentence boundary.
If the following token begins with a lowercase letter, the period is not a sentence boundary.
If the period is followed by another period (ellipsis), it is not a sentence boundary.
Otherwise, the period is a sentence boundary.

PDF-specific complications. A line break in a PDF content stream does not imply a sentence break. After geometry-based line joining (handled in an earlier normalization stage), the sentence detector operates on joined paragraphs, not raw lines. Hyphenated line-end tokens that have been rejoined must not present a spurious word boundary to the detector.

9. Coherence Scoring for Reconstruction Candidates

When multiple reconstructions of a passage are plausible, select the best using a composite score:

(a) Word n-gram perplexity. Lower perplexity is better. Computed using a trigram model with Kneser-Ney smoothing.
(b) Part-of-speech sequence probability. Tag the candidate with a fast POS tagger (e.g., averaged perceptron); score the POS sequence under a bigram POS language model. Promotes syntactically coherent candidates.
(c) Entity consistency. Count entity type conflicts between the candidate and the surrounding 200-token context window. A candidate that introduces an unexpected entity type (e.g., a location in a paragraph that is otherwise about persons) is penalized.
(d) Semantic similarity. Encode the candidate and the surrounding paragraph using a compact sentence embedding model (e.g., a 4-layer distilled transformer, ~25 MB) and compute cosine similarity. Candidates that are semantically distant from their context are penalized.

The composite score is a weighted sum: score = w_ppl * ppl + w_pos * pos_cost + w_entity * entity_penalty - w_sem * cos_sim. Weights are configurable; default values are calibrated on a mixed-domain PDF corpus.

10. Output and Confidence

Each reconstructed span in the pdftract output carries the following fields:

pub struct ReconstructedSpan {
    /// Raw bytes as decoded before reconstruction, percent-encoded if non-UTF-8.
    pub original_raw: String,
    /// Final reconstructed string, normalized to NFC.
    pub text: String,
    /// True if any reconstruction algorithm modified `text` relative to `original_raw`.
    pub reconstruction_applied: bool,
    /// Confidence in the reconstructed text, in [0.0, 1.0].
    pub reconstruction_confidence: f32,
    /// The primary algorithm responsible for the reconstruction.
    pub reconstruction_method: ReconstructionMethod,
}

pub enum ReconstructionMethod {
    None,
    Dictionary,
    Ngram,
    Entity,
    CrossSpan,
    DomainLexicon,
    DocumentAbbrev,
}

Page-level metrics. Each ExtractedPage carries a reconstruction_rate: f32 — the fraction of spans on that page for which reconstruction_applied is true. A page with reconstruction_rate > 0.3 should be flagged in the caller's output as a low-quality extraction, potentially warranting an OCR fallback. The ReconstructionMethod distribution across a page (accessible via page.reconstruction_method_histogram()) gives the caller a breakdown of why reconstruction was applied, which is useful for diagnosing systematic problems in a PDF batch.

When reconstruction_applied is false, reconstruction_confidence reflects the confidence of the original extraction, not of the reconstruction pass. This preserves the meaning of the field: it always represents the extractor's confidence in text, not in the decision to reconstruct.

15 KiB Raw Permalink Blame History Unescape Escape