- text-readability-validation: character/word/entropy/perplexity checks, symbol font detection, remediation decision tree, span quality metadata - post-ocr-text-correction: error taxonomy, confusable tables, noisy channel n-gram model, regex patterns, hyphenation, layout-based correction pipeline - presentation-and-spreadsheet-pdfs: detection heuristics, slide structure, bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries, cell type inference, Rust output schema - semantic-text-reconstruction: beam search n-gram reconstruction, NER validation, domain lexicons, cross-span consistency, abbreviation expansion, citation repair, coherence scoring, ReconstructedSpan output schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
153 lines
15 KiB
Markdown
153 lines
15 KiB
Markdown
# Semantic Text Reconstruction in PDF Extraction
|
||
|
||
Character-level Unicode recovery and word-level normalization handle the majority of extraction errors, but a class of failures resists both: cases where the raw bytes decode to plausible characters that nonetheless form meaningless or ambiguous text. Fixing these requires semantic context — understanding what the text is *about*, not just how individual glyphs encode. This document describes the algorithms, data structures, and Rust engineering concerns for implementing a semantic reconstruction layer in `pdftract`.
|
||
|
||
---
|
||
|
||
## 1. When Semantic Reconstruction Is Needed
|
||
|
||
Several classes of content systematically evade character- and word-level correction:
|
||
|
||
**Multi-word technical terms split across encoding boundaries.** A term like "polymerase chain reaction" may straddle a font-switch boundary mid-phrase, producing one half in correct encoding and the other in a shifted code page. Each half looks like a valid English word; only the joined phrase reveals the error.
|
||
|
||
**Scientific names garbled by font substitution.** Italic species names (*Escherichia coli*, *Drosophila melanogaster*) are often typeset in a separate italic font with its own broken ToUnicode CMap. The substitution produces character-by-character errors that a spell-checker cannot catch because the garbled form may accidentally be a common word.
|
||
|
||
**Proper nouns and acronyms.** A person's name, an organization acronym, or a product name has no entry in a general dictionary. Extraction errors in these spans go undetected by dictionary lookup. NER provides the discriminating signal.
|
||
|
||
**Cross-lingual content.** A Latin phrase in an English document (*inter alia*, *habeas corpus*) may be typeset in a decorative font without a ToUnicode entry, producing garbled output. The correct text is not in an English dictionary, but it is in a Latin lexicon and is highly recognizable by n-gram models trained on legal Latin.
|
||
|
||
**Hyphenated German compounds.** Words like *Verschlüsselungsalgorithmus* or *Haftpflichtversicherung* do not appear in compact dictionaries. Hyphenated splits (*Haftpflicht-versicherung*) add ambiguity: is the hyphen intentional or a line-break artifact? Compound-aware morphological analysis is the only reliable path.
|
||
|
||
**Mathematical notation in prose.** A formula like "O(n log n)" contains parentheses, letters, and operators whose font encodings frequently diverge. The span must be recognized as mathematical before any correction is applied; applying word-level normalization to a formula destroys it.
|
||
|
||
---
|
||
|
||
## 2. N-gram Context Reconstruction
|
||
|
||
When a span is flagged as low-confidence (below a configurable `confidence_threshold: f32`), the reconstructor enumerates alternative interpretations and scores each using a language model.
|
||
|
||
**Character n-gram scoring.** A character 5-gram language model, trained on a representative corpus in the target language, assigns a log-probability to each candidate string. For each low-confidence character position, enumerate substitutions constrained by the *character confusable set* — characters whose glyph shapes are known to be confused by common encoding bugs (e.g., `l`↔`1`, `O`↔`0`, `rn`↔`m`, `cl`↔`d`). The confusable set is stored as a `HashMap<char, SmallVec<[char; 4]>>` for O(1) lookup.
|
||
|
||
**Word n-gram scoring.** After candidate character strings are generated, score them as word sequences using a compressed word bigram or trigram model (ARPA format, loaded as a finite-state acceptor). This promotes candidates that form fluent phrases over candidates that score well character-by-character but form nonsense word sequences.
|
||
|
||
**Beam search.** Enumerate alternatives using beam search over the character lattice of the low-confidence span. At each position, retain the top-`k` partial hypotheses by cumulative log-probability. Typical values: beam width `k = 16` for character-level search, `k = 8` after word n-gram rescoring. Wider beams improve recall at quadratic cost; the tradeoff is configurable via `ReconstructorConfig { beam_width: usize, max_span_chars: usize }`. For spans longer than `max_span_chars` (default 40), fall back to greedy decoding to bound computation.
|
||
|
||
**Pruning.** Before scoring, prune hypotheses that violate hard constraints: the candidate must not contain characters outside the Unicode category set observed in surrounding context; the edit distance from the raw decoded form must not exceed a per-character threshold (default 1.5 edits per 10 characters). This eliminates the combinatorial explosion from considering arbitrary substitutions.
|
||
|
||
---
|
||
|
||
## 3. Named Entity Recognition for Validation
|
||
|
||
A lightweight NER model — a CRF or a small transformer quantized to 8-bit integers — classifies spans as `Person`, `Organization`, `Location`, `Date`, `Number`, or `Other`. NER serves two roles: identifying *what kind of thing* a span is, and then applying entity-type-specific validation.
|
||
|
||
**Type-specific validation.** A span classified as `Date` must parse as a valid calendar date under the document's locale. A span classified as `Number` must be parseable as an integer, decimal, or scientific notation value. A span classified as `Organization` is checked against a pre-loaded organization lexicon. When validation fails, the span is flagged for reconstruction; when it passes, the raw extraction is accepted regardless of character-level confidence.
|
||
|
||
**Context-conditioned classification.** The entity classifier uses a sliding window of surrounding tokens as context. A span surrounded by financial terminology (ticker symbols, "EBITDA", "basis points") is classified as a financial entity before the span itself is inspected. This reduces false positives where a garbled span accidentally resembles a common word but is semantically a proper noun.
|
||
|
||
---
|
||
|
||
## 4. Domain-Specific Lexicons
|
||
|
||
A general English dictionary misses the vast majority of domain vocabulary. `pdftract` loads supplementary lexicons based on a document classification step that runs before reconstruction.
|
||
|
||
**Domain lexicon types:** legal (Latin maxims, case citation formats, court names), medical (ICD-10 codes, drug generic and brand names, anatomical terms, lab value abbreviations), financial (ticker symbols, CUSIP/ISIN patterns, ratio names), scientific (IUPAC chemical names, species binomials, journal abbreviations per ISO 4).
|
||
|
||
**Document classification trigger.** A lightweight bag-of-words classifier (multinomial naive Bayes, ~50 KB model) classifies the document into one or more domains after the first-pass extraction. Domains with posterior probability above 0.6 trigger loading the corresponding lexicon. Multiple domains are possible (a clinical trial report is both medical and statistical).
|
||
|
||
**Bloom filter storage.** Each domain lexicon is stored as a Bloom filter for O(1) membership queries with bounded false positive rate. A 16-bit cuckoo filter (using the `cuckoofilter` crate or a hand-rolled implementation) achieves a 0.1% false positive rate at 12 bits per entry. Term lookup: normalize the candidate string (lowercase, NFC), query the filter; on a positive hit, optionally verify against a sorted `&[u8]` slice for exact confirmation. Total storage for a 500,000-term medical lexicon is approximately 750 KB.
|
||
|
||
---
|
||
|
||
## 5. Cross-Span Consistency
|
||
|
||
The same visual glyph sequence must extract consistently across the document. Inconsistent extraction — where the same term appears as "photosynthesis" on page 3 and "ph0tosynthesis" on page 17 due to differing font embedding quality — is a common failure mode in multi-section PDFs assembled from separately authored chapters.
|
||
|
||
**Canonical form selection.** After per-page extraction, group textual spans by their character-level fingerprint: the sequence of Unicode general categories and ASCII characters, with non-ASCII collapsed to a category placeholder. Within each group, select the canonical form as the extraction with the highest summed confidence score; if confidence is tied, use the most frequent string. Write the canonical form to a `HashMap<SpanFingerprint, Arc<str>>`.
|
||
|
||
**Normalization pass.** In a second pass, any span whose extracted string differs from the canonical form for its fingerprint is replaced with the canonical form and its `reconstruction_method` set to `CrossSpan`. The replacement is only performed when the edit distance between the variant and the canonical form is below a threshold (default: 15% of canonical length), preventing spurious merging of genuinely distinct terms.
|
||
|
||
---
|
||
|
||
## 6. Abbreviation Expansion
|
||
|
||
Abbreviations break sentence boundary detection, inflate vocabulary, and reduce readability. The reconstruction layer handles three kinds:
|
||
|
||
**Standard abbreviations.** A trie-based lookup (`AhoCorasick` automaton from the `aho-corasick` crate) matches spans against a compiled list of standard abbreviations ("e.g.", "i.e.", "cf.", "op. cit.", "et al."). On a match, the span is tagged with its expansion in the output metadata; the `text` field retains the abbreviated form by default (expansion is opt-in via `ReconstructorConfig { expand_abbreviations: bool }`).
|
||
|
||
**Document-internal definitions.** The pattern `<full form> (<short form>)` defines a document-local abbreviation. Detected using a regex over the token stream: `\b([A-Z][a-z]+(?: [A-Z][a-z]+)+)\s+\(([A-Z]{2,8})\)`. On detection, insert into a per-document `HashMap<String, String>` mapping short form to long form. All subsequent occurrences of the short form are expanded using this map, with `reconstruction_method` set to `DocumentAbbrev`.
|
||
|
||
**Ambiguous period handling.** A period following a known abbreviation is not a sentence boundary. This table is shared between the abbreviation expander and the sentence boundary detector.
|
||
|
||
---
|
||
|
||
## 7. Reference and Citation Reconstruction
|
||
|
||
Bibliography sections have distinct extraction failure patterns: author names are frequently reordered or truncated, journal titles run together with volume numbers, and DOIs are broken by line-wrap hyphenation.
|
||
|
||
**Zone detection.** The bibliography zone is identified by a combination of signals: high density of year-pattern tokens (four-digit sequences in the range 1900–2050), high density of capitalized multi-word sequences, and a section header matching a list of known bibliography heading strings. Once identified, reconstruction rules specific to the reference zone are applied.
|
||
|
||
**Structural validation.** A DOI must match `10\.\d{4,9}/[-._;()/:A-Za-z0-9]+`. An ISSN must match `\d{4}-\d{3}[\dX]`. A year must parse as an integer in the range 1700–2100. When a span in the reference zone partially matches one of these patterns but contains obvious character substitutions (e.g., `l0.1038/...` where `l0` should be `10`), apply a targeted correction using the pattern as a template.
|
||
|
||
**DOI normalization.** Hyphens inserted by PDF line-wrapping inside a DOI are detected by splitting on hyphen and checking whether reassembly yields a valid DOI regex match. If so, remove the hyphen.
|
||
|
||
---
|
||
|
||
## 8. Sentence Boundary Detection
|
||
|
||
Periods are the most ambiguous character in prose text. A period may end a sentence, terminate an abbreviation, separate decimal digits, form an ellipsis, or appear inside a URL or DOI. A rule-based sentence boundary detector resolves ambiguity in order:
|
||
|
||
1. If the preceding token is in the abbreviation table, the period is not a sentence boundary.
|
||
2. If the preceding token is a single uppercase letter (initials), the period is not a sentence boundary.
|
||
3. If the following token begins with a lowercase letter, the period is not a sentence boundary.
|
||
4. If the period is followed by another period (ellipsis), it is not a sentence boundary.
|
||
5. Otherwise, the period is a sentence boundary.
|
||
|
||
**PDF-specific complications.** A line break in a PDF content stream does not imply a sentence break. After geometry-based line joining (handled in an earlier normalization stage), the sentence detector operates on joined paragraphs, not raw lines. Hyphenated line-end tokens that have been rejoined must not present a spurious word boundary to the detector.
|
||
|
||
---
|
||
|
||
## 9. Coherence Scoring for Reconstruction Candidates
|
||
|
||
When multiple reconstructions of a passage are plausible, select the best using a composite score:
|
||
|
||
- **(a) Word n-gram perplexity.** Lower perplexity is better. Computed using a trigram model with Kneser-Ney smoothing.
|
||
- **(b) Part-of-speech sequence probability.** Tag the candidate with a fast POS tagger (e.g., averaged perceptron); score the POS sequence under a bigram POS language model. Promotes syntactically coherent candidates.
|
||
- **(c) Entity consistency.** Count entity type conflicts between the candidate and the surrounding 200-token context window. A candidate that introduces an unexpected entity type (e.g., a location in a paragraph that is otherwise about persons) is penalized.
|
||
- **(d) Semantic similarity.** Encode the candidate and the surrounding paragraph using a compact sentence embedding model (e.g., a 4-layer distilled transformer, ~25 MB) and compute cosine similarity. Candidates that are semantically distant from their context are penalized.
|
||
|
||
The composite score is a weighted sum: `score = w_ppl * ppl + w_pos * pos_cost + w_entity * entity_penalty - w_sem * cos_sim`. Weights are configurable; default values are calibrated on a mixed-domain PDF corpus.
|
||
|
||
---
|
||
|
||
## 10. Output and Confidence
|
||
|
||
Each reconstructed span in the `pdftract` output carries the following fields:
|
||
|
||
```rust
|
||
pub struct ReconstructedSpan {
|
||
/// Raw bytes as decoded before reconstruction, percent-encoded if non-UTF-8.
|
||
pub original_raw: String,
|
||
/// Final reconstructed string, normalized to NFC.
|
||
pub text: String,
|
||
/// True if any reconstruction algorithm modified `text` relative to `original_raw`.
|
||
pub reconstruction_applied: bool,
|
||
/// Confidence in the reconstructed text, in [0.0, 1.0].
|
||
pub reconstruction_confidence: f32,
|
||
/// The primary algorithm responsible for the reconstruction.
|
||
pub reconstruction_method: ReconstructionMethod,
|
||
}
|
||
|
||
pub enum ReconstructionMethod {
|
||
None,
|
||
Dictionary,
|
||
Ngram,
|
||
Entity,
|
||
CrossSpan,
|
||
DomainLexicon,
|
||
DocumentAbbrev,
|
||
}
|
||
```
|
||
|
||
**Page-level metrics.** Each `ExtractedPage` carries a `reconstruction_rate: f32` — the fraction of spans on that page for which `reconstruction_applied` is true. A page with `reconstruction_rate > 0.3` should be flagged in the caller's output as a low-quality extraction, potentially warranting an OCR fallback. The `ReconstructionMethod` distribution across a page (accessible via `page.reconstruction_method_histogram()`) gives the caller a breakdown of *why* reconstruction was applied, which is useful for diagnosing systematic problems in a PDF batch.
|
||
|
||
When `reconstruction_applied` is false, `reconstruction_confidence` reflects the confidence of the original extraction, not of the reconstruction pass. This preserves the meaning of the field: it always represents the extractor's confidence in `text`, not in the decision to reconstruct.
|