Add four research documents on text quality and document-type handling
- text-readability-validation: character/word/entropy/perplexity checks, symbol font detection, remediation decision tree, span quality metadata
- post-ocr-text-correction: error taxonomy, confusable tables, noisy channel n-gram model, regex patterns, hyphenation, layout-based correction pipeline
- presentation-and-spreadsheet-pdfs: detection heuristics, slide structure, bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries, cell type inference, Rust output schema
- semantic-text-reconstruction: beam search n-gram reconstruction, NER validation, domain lexicons, cross-span consistency, abbreviation expansion, citation repair, coherence scoring, ReconstructedSpan output schema

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent a7673c906f
commit 31e715633d
3 changed files with 555 additions and 0 deletions
201
docs/research/post-ocr-text-correction.md
Normal file
@@ -0,0 +1,201 @@
# Post-OCR Text Correction
Even after a careful font decoding and OCR pipeline, extracted text carries residual errors. Some are systematic — a misencoded font maps the same glyph to the wrong Unicode point on every page. Others are stochastic — noise in a scanned image tips the classifier toward a wrong character. Still others arise from the document structure itself: merged ligatures, broken hyphenation, swapped reading order. Correcting these errors requires layered strategies, applied in the right sequence, with full traceability.

---

## 1. Classes of Errors

**Systematic errors** stem from a consistent encoding mismatch or a trained bias in the OCR classifier. Every occurrence of a glyph produces the same wrong character. Example: a font's `/f_i` ligature slot should map to the two characters `fi`, but the ToUnicode CMap is absent, so every occurrence is emitted as the single ligature code point `ﬁ` (U+FB01) or, worse, as two separate characters with a wrong split point. These errors are highly correctable once the pattern is identified.
**Random errors** are noise-induced. The classifier assigns a plausible but wrong character with some probability per token. `e` → `c`, `h` → `b`, `a` → `o`. Distribution is roughly uniform over visually similar character pairs; no single substitution dominates.
**Context errors** place the right characters in the wrong positions within a word: `teh` for `the`, `recieve` for `receive`. The characters are individually plausible, just misordered.
**Deletion errors** drop a character: `hous` for `house`. Common at word edges where ink density drops near the margin.
**Insertion errors** introduce a spurious character: `hoouse`. Often from double-ink artifacts or speckling.
**Transposition errors** swap adjacent characters. Distinct from context errors in that only two positions are affected.
Each class responds to a different primary corrector. Systematic errors are best caught by confusable substitution tables. Random errors need dictionary + language model ranking. Context and transposition errors benefit from sequence-level Viterbi correction. Deletion and insertion errors are handled by edit-distance candidate generation.

---

## 2. Dictionary-Based Correction

For each extracted token that fails a dictionary lookup, generate correction candidates within edit distance 1 or 2 using the four standard operations: single-character substitution, deletion, insertion, and transposition. At edit distance 1, the candidate space is `O(|alphabet| * |word|)` substitutions plus `O(|word|)` deletions, `O(|alphabet| * |word|)` insertions, and `O(|word|)` transpositions — manageable. At edit distance 2, enumerate by composing two distance-1 steps; prune candidates not in the dictionary.
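
The distance-1 enumeration described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the names `edits1` and `known_candidates` are hypothetical, and the alphabet is restricted to lowercase ASCII as a simplifying assumption.

```rust
use std::collections::HashSet;

/// All strings within one edit (substitution, deletion, insertion, or
/// adjacent transposition) of `word`. Assumes a lowercase ASCII alphabet.
pub fn edits1(word: &str) -> HashSet<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out: HashSet<String> = HashSet::new();
    for i in 0..=chars.len() {
        let (left, right) = chars.split_at(i);
        let prefix: String = left.iter().collect();
        // Deletion: drop the first character of `right`.
        if let Some((_, tail)) = right.split_first() {
            let mut s = prefix.clone();
            s.extend(tail);
            out.insert(s);
        }
        // Transposition: swap the first two characters of `right`.
        if right.len() >= 2 {
            let mut s = prefix.clone();
            s.push(right[1]);
            s.push(right[0]);
            s.extend(&right[2..]);
            out.insert(s);
        }
        for c in 'a'..='z' {
            // Substitution: replace the first character of `right` with `c`.
            if let Some((_, tail)) = right.split_first() {
                let mut s = prefix.clone();
                s.push(c);
                s.extend(tail);
                out.insert(s);
            }
            // Insertion: place `c` between `left` and `right`.
            let mut s = prefix.clone();
            s.push(c);
            s.extend(right);
            out.insert(s);
        }
    }
    out.remove(word); // substituting a character with itself recreates the input
    out
}

/// Distance-1 candidates that survive the dictionary filter, sorted for stable output.
pub fn known_candidates(token: &str, dict: &HashSet<String>) -> Vec<String> {
    let mut found: Vec<String> = edits1(token)
        .into_iter()
        .filter(|w| dict.contains(w))
        .collect();
    found.sort();
    found
}
```

Distance-2 candidates can be produced by applying `edits1` to each distance-1 result and filtering again, at the cost of a much larger intermediate set.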
Rank candidates by a composite score, where a higher score is better:

```
score(candidate) = w_freq * log P(candidate)                         // corpus frequency
                 - w_ed   * edit_distance(token, candidate)
                 - w_vis  * visual_confusion_cost(token, candidate)
```
`visual_confusion_cost` is low when the substitution involves a known confusable pair (see section 3), high otherwise. This means `l` → `1` is cheaper than `l` → `z`.
**When to correct vs. flag:** Apply a correction automatically when the top candidate's score exceeds a confidence threshold and the edit distance is 1. For distance-2 corrections or low-frequency candidates, emit a flagged suggestion instead. Proper nouns, domain terms, and tokens containing digits are better flagged than blindly corrected.
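
The ranking and the correct-versus-flag rule can be sketched together. This sketch assumes the higher-is-better orientation implied by the auto-correct threshold: corpus frequency raises the score, while edit distance and visual cost lower it. The type and function names (`Weights`, `Decision`, `score`, `decide`) are hypothetical, and the proper-noun and domain-term checks mentioned above are omitted.

```rust
/// Weights for the composite ranking score. The values are tuning parameters.
pub struct Weights {
    pub w_ed: f64,
    pub w_freq: f64,
    pub w_vis: f64,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    AutoCorrect,
    Flag,
}

/// Composite score for one candidate; `log_prob` is log P(candidate).
pub fn score(edit_distance: u32, log_prob: f64, visual_cost: f64, w: &Weights) -> f64 {
    w.w_freq * log_prob - w.w_ed * f64::from(edit_distance) - w.w_vis * visual_cost
}

/// Auto-correct only distance-1 candidates above the threshold, and never
/// tokens that contain digits; everything else becomes a flagged suggestion.
pub fn decide(token: &str, edit_distance: u32, score: f64, threshold: f64) -> Decision {
    let has_digit = token.chars().any(|c| c.is_ascii_digit());
    if edit_distance == 1 && score > threshold && !has_digit {
        Decision::AutoCorrect
    } else {
        Decision::Flag
    }
}
```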

---

## 3. Confusable Character Tables

Visual confusion in OCR and glyph decoding follows well-known patterns. Build a weighted directed graph where an edge `a → b` carries the empirical probability that OCR produces `b` when the true character is `a`.
Key confusable pairs:

| OCR output | True character | Notes |
|------------|----------------|-------|
| `0` | `O` / `o` | Circular glyph; context (numeric vs. alpha) disambiguates |
| `1` | `l` / `I` | Vertical stroke; `I` lacks serifs in sans-serif fonts |
| `5` | `S` | Curved top; common in low-resolution scans |
| `6` | `b` | Mirrored loop; rare but appears in degraded text |
| `8` | `B` | Closed loops; highly context-dependent |
| `rn` | `m` | Two-stroke sequence visually merges |
| `cl` | `d` | Left stroke + ascender confusion |
| `ii` | `u` | Dot placement ambiguity |
| `vv` | `w` | Double-stroke width |
| `ﬁ` ligature (U+FB01) | `fi` | CMap absent; also appears as `f1` |
Rather than exhaustive edit-distance enumeration, use the confusable graph to generate targeted candidates first. For each token, walk the graph: for every character (or digraph) in the token, emit a candidate with the confusable substitution applied. These targeted candidates receive a lower visual cost penalty in the ranking formula, producing higher final scores and reducing false positives relative to generic edit-distance candidates.
Digraph confusables (`rn` → `m`) require special handling: the substitution changes the token length. Track this as a deletion-substitution pair rather than a single-character operation.
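
Targeted candidate generation from the confusable table might look like the sketch below. The pairs are copied from the table in OCR-output to true-character direction; the constant `CONFUSABLES` and the function `confusable_candidates` are hypothetical names, and edge weights are left out for brevity.

```rust
/// (observed OCR sequence, candidate true sequence), taken from the table above.
const CONFUSABLES: &[(&str, &str)] = &[
    ("0", "O"), ("0", "o"), ("1", "l"), ("1", "I"), ("5", "S"), ("6", "b"),
    ("8", "B"), ("rn", "m"), ("cl", "d"), ("ii", "u"), ("vv", "w"),
];

/// Emit one candidate per occurrence of each confusable sequence in `token`.
/// Digraph entries change the token length, which slicing handles naturally.
pub fn confusable_candidates(token: &str) -> Vec<String> {
    let mut out = Vec::new();
    for &(observed, truth) in CONFUSABLES {
        // `match_indices` yields the byte offset of each non-overlapping match.
        for (start, _) in token.match_indices(observed) {
            let mut cand = String::with_capacity(token.len());
            cand.push_str(&token[..start]);
            cand.push_str(truth);
            cand.push_str(&token[start + observed.len()..]);
            out.push(cand);
        }
    }
    out
}
```

Each candidate applies exactly one substitution. Tokens with several independent errors need either repeated application or the sequence-level search from the next section.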

---

## 4. Context-Aware Correction with N-gram Language Models

Single-token correction ignores context. The words `sail` and `tail` are both valid English; only the surrounding words disambiguate them. A bigram or trigram language model provides a prior over word sequences.
**Noisy channel model:** The corrected sequence `W*` maximizes:

```
W* = argmax_W P(W) * P(T | W)
```
where `P(W)` is the language model prior and `P(T | W)` is the channel model — the probability that the OCR system produced the observed token sequence `T` given the true text `W`. The channel model is estimated from character-level confusable probabilities: for each position, what is the probability that the true character was substituted, deleted, or inserted to yield the observed character?
**Viterbi algorithm for sequence correction:** Model the correction problem as a hidden Markov model. States are candidate words at each position; transitions are bigram probabilities; emissions are channel model probabilities. Viterbi finds the maximum-probability path in `O(N * K^2)` time where `N` is the token count and `K` is the candidate count per position. In practice, prune candidates to the top 5–10 per position before running Viterbi to keep runtime acceptable.
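
A compact sketch of the bigram Viterbi pass over a candidate lattice follows. It works in log space, so the products in the noisy channel objective become sums. The `Candidate` struct, the `viterbi` signature, and the flat `unseen_logp` fallback are assumptions made for illustration; a real model would use proper back-off and a sentence-start prior.

```rust
use std::collections::HashMap;

/// One candidate word at a lattice position, with its channel score
/// log P(observed token | word).
pub struct Candidate {
    pub word: String,
    pub channel_logp: f64,
}

/// Return the highest-scoring word sequence through `lattice`.
/// `bigram_logp` maps (previous, current) to log P(current | previous).
pub fn viterbi(
    lattice: &[Vec<Candidate>],
    bigram_logp: &HashMap<(String, String), f64>,
    unseen_logp: f64,
) -> Vec<String> {
    if lattice.is_empty() || lattice.iter().any(|pos| pos.is_empty()) {
        return Vec::new();
    }
    // best[i][k]: best log-score of any path ending at candidate k of position i.
    let mut best: Vec<Vec<f64>> = vec![lattice[0].iter().map(|c| c.channel_logp).collect()];
    let mut back: Vec<Vec<usize>> = vec![vec![0; lattice[0].len()]];
    for i in 1..lattice.len() {
        let mut row = Vec::with_capacity(lattice[i].len());
        let mut ptr = Vec::with_capacity(lattice[i].len());
        for cur in &lattice[i] {
            let mut best_score = f64::NEG_INFINITY;
            let mut best_prev = 0;
            for (j, prev) in lattice[i - 1].iter().enumerate() {
                // Cloning keys keeps the sketch simple; word IDs would avoid it.
                let key = (prev.word.clone(), cur.word.clone());
                let trans = bigram_logp.get(&key).copied().unwrap_or(unseen_logp);
                let s = best[i - 1][j] + trans;
                if s > best_score {
                    best_score = s;
                    best_prev = j;
                }
            }
            row.push(best_score + cur.channel_logp);
            ptr.push(best_prev);
        }
        best.push(row);
        back.push(ptr);
    }
    // Backtrack from the best final state.
    let last = lattice.len() - 1;
    let mut k = (0..best[last].len())
        .max_by(|&a, &b| best[last][a].total_cmp(&best[last][b]))
        .unwrap_or(0);
    let mut path = vec![String::new(); lattice.len()];
    for i in (0..=last).rev() {
        path[i] = lattice[i][k].word.clone();
        k = back[i][k];
    }
    path
}
```

The two nested candidate loops inside the position loop are where the `O(N * K^2)` cost comes from, which is why pruning `K` before this step matters.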
For trigram models, use a beam search over the lattice rather than exact Viterbi; a beam width of 20–50 balances accuracy against throughput.
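**Implementation note for Rust:** Use a pre-built binary n-gram model stored as a lookup table from word tuples, `(word_1, word_2)` for bigrams and `(word_1, word_2, word_3)` for trigrams, to log-probability. A good corpus baseline is a count-based model built from Wikipedia or CommonCrawl, smoothed with Kneser-Ney discounting. Serialize the model as a flat binary file and memory-map it at startup; this requires a layout designed for in-place lookup (for example, sorted or open-addressed fixed-width records keyed by word IDs), since a standard `HashMap` cannot be memory-mapped directly. The table fits in roughly 2–4 GB for a 3-gram model covering a 300k-word vocabulary.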

---

## 5. Regex-Based Pattern Correction

Domain-specific OCR garbles follow predictable patterns that regex handles efficiently, before dictionary lookup ever runs.
Key patterns to encode:

- **Month names:** `1anuary` → `January`, `0ctober` → `October`. Match `[0-9][a-z]{2,8}` at a word boundary (a misread initial followed by the remaining two to eight letters, covering `May` through `September`); check the digit-substituted form against the 12 month names.
- **Year fields:** `20l8`, `l998`. Match `[12l][0-9l]{3}` at word boundaries, require at least one real digit in the match, and apply `l` → `1` within it.
- **Large numbers:** `l00,000`, `1,23l,456`. Match sequences of digits, commas, and the letters `l` or `O` that contain at least one digit; apply digit confusable substitution within the match.
- **Currency amounts:** `$l.5B1llion`. Tokenize currency prefix + numeric + scale suffix; correct each segment independently.
- **Email addresses:** Validate structure `local@domain.tld`; apply confusable correction only within `local` and `domain` segments, preserving `@` and `.`.
- **URLs:** Match `https?://[^\s]+`; within the matched span, map `O` → `0` and `l` → `1` only when surrounded by other digits.
Regex corrections are deterministic and produce no confidence ambiguity. Apply them before any probabilistic corrector to reduce the token space the language model must consider.
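
Two of the patterns above, month names and year fields, are sketched here with hand-written matching instead of a regex crate, purely to keep the example dependency-free. The function names `fix_month` and `fix_year` are hypothetical, and the requirement of at least two real digits in a year is an added assumption that keeps ordinary words from matching.

```rust
const MONTHS: [&str; 12] = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
];

/// Repair a month name whose initial letter was read as a digit,
/// e.g. `1anuary` or `0ctober`. Any leading digit is accepted.
pub fn fix_month(token: &str) -> Option<&'static str> {
    let mut chars = token.chars();
    let first = chars.next()?;
    if !first.is_ascii_digit() {
        return None;
    }
    let rest = chars.as_str();
    // Month names are ASCII, so slicing off the first byte is safe.
    MONTHS.iter().copied().find(|m| &m[1..] == rest)
}

/// Repair `l` -> `1` inside a four-character year such as `20l8` or `l998`.
pub fn fix_year(token: &str) -> Option<String> {
    let chars: Vec<char> = token.chars().collect();
    let shape_ok = chars.len() == 4
        && matches!(chars[0], '1' | '2' | 'l')
        && chars[1..].iter().all(|c| c.is_ascii_digit() || *c == 'l');
    let digits = chars.iter().filter(|c| c.is_ascii_digit()).count();
    let has_l = chars.contains(&'l');
    // Assumption: require two real digits so that words are not rewritten.
    if !(shape_ok && has_l && digits >= 2) {
        return None;
    }
    Some(chars.iter().map(|&c| if c == 'l' { '1' } else { c }).collect())
}
```

Both functions return `None` when nothing needs fixing, which lets the caller record a `RegexPattern` correction only when the text actually changed.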

---

## 6. Hyphenation Artifact Removal

PDF text extraction frequently encounters end-of-line hyphens that are typographic (indicating a broken word) rather than semantic (indicating a compound). OCR inherits this artifact from the rasterized layout.
Detection pattern: a span ending in `-` is the last span on a line, and the next line begins with a lowercase token. Steps:

1. Strip the trailing `-` from the first fragment and concatenate with the second fragment.
2. Look up the concatenated form in the dictionary.
3. If found, replace both fragments with the single joined token and update the bounding box to span both original boxes.
4. If not found, apply the Liang-Knuth hyphenation algorithm in reverse: verify that the hyphenation point is a valid break point for the concatenated word. Accept the join if it is; retain the hyphen if it is not.
Preserve explicit hyphens in compound words (`self-referential`) by requiring that the candidate concatenation exists in the dictionary as a non-hyphenated form. When the dictionary is ambiguous, emit both forms as alternatives with confidence scores.
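
Steps 1 to 3 and the compound-word safeguard can be sketched as below. The hyphenation-pattern check of step 4 is deliberately left out, so this version keeps the hyphen whenever the joined form is missing from the dictionary. `HyphenDecision` and `resolve_line_end_hyphen` are hypothetical names.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum HyphenDecision {
    /// The hyphen was a line-break artifact; use the joined word.
    Join(String),
    /// The hyphen is kept, as in a genuine compound.
    Keep(String),
}

/// `first` is the last token on a line, `second` the first token on the next.
/// Returns `None` when the pair does not match the detection pattern.
pub fn resolve_line_end_hyphen(
    first: &str,
    second: &str,
    dict: &HashSet<String>,
) -> Option<HyphenDecision> {
    let stem = first.strip_suffix('-')?;
    let starts_lower = second.chars().next().map_or(false, |c| c.is_lowercase());
    if stem.is_empty() || !starts_lower {
        return None;
    }
    let joined = format!("{stem}{second}");
    if dict.contains(&joined.to_lowercase()) {
        Some(HyphenDecision::Join(joined))
    } else {
        Some(HyphenDecision::Keep(format!("{first}{second}")))
    }
}
```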

---

## 7. Number and Unit Normalization

Numeric regions in tables and financial text are high-value targets for correction. Apply digit-specific confusable correction after detecting numeric context.
**Numeric context detection:** A span is numeric if it matches `[0-9OolISB,. ]+` and is adjacent to a currency symbol (`$`, `€`, `£`, `¥`) or a unit suffix (`%`, `kg`, `mm`, `MHz`). Within a detected numeric span, apply the digit-confusable map: `O` → `0`, `o` → `0`, `l` → `1`, `I` → `1`, `S` → `5`, `B` → `8`.
**Locale-aware separator handling:** In US/UK locale, comma is the thousands separator and period is the decimal point. In many European locales the roles are reversed. Detect locale from document metadata or from the pattern of separators in the numeric span (a period followed by exactly three digits before another separator is a thousands separator, not a decimal). Apply locale-consistent normalization before emitting the corrected token.
**Formatting preservation:** After correction, re-emit the number in its original format (including comma grouping and decimal precision), not as a normalized float. Callers that need numeric values should parse the corrected string; callers that need the original text appearance should receive the corrected but format-preserving string.
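
The digit-confusable map and the separator heuristic can be sketched as follows. The names `fix_digit_confusables`, `DecimalSep`, and `infer_decimal_separator` are hypothetical. Mapping lowercase `o` to `0` and reporting a lone separator followed by exactly three digits as `Unknown` are assumptions of this sketch; in the `Unknown` case the caller would fall back to document metadata.

```rust
/// Apply the digit-confusable map inside an already-detected numeric span.
/// Separators and real digits pass through, so the original format is kept.
pub fn fix_digit_confusables(span: &str) -> String {
    span.chars()
        .map(|c| match c {
            'O' | 'o' => '0',
            'l' | 'I' => '1',
            'S' => '5',
            'B' => '8',
            other => other,
        })
        .collect()
}

#[derive(Debug, PartialEq)]
pub enum DecimalSep {
    Period,
    Comma,
    Unknown,
}

/// Guess which character acts as the decimal separator in one number.
pub fn infer_decimal_separator(num: &str) -> DecimalSep {
    let as_sep = |c: char| if c == '.' { DecimalSep::Period } else { DecimalSep::Comma };
    let other = |c: char| if c == '.' { DecimalSep::Comma } else { DecimalSep::Period };
    let seps: Vec<(usize, char)> = num
        .char_indices()
        .filter(|&(_, c)| c == '.' || c == ',')
        .collect();
    let (idx, last) = match seps.last() {
        Some(&pair) => pair,
        None => return DecimalSep::Unknown,
    };
    // Both kinds present: the rightmost one is the decimal separator.
    if seps.iter().any(|&(_, c)| c != last) {
        return as_sep(last);
    }
    // One kind repeated: it groups thousands, so the decimal is the other kind.
    if seps.len() > 1 {
        return other(last);
    }
    // Single separator: exactly three trailing digits is ambiguous.
    let digits_after = num[idx + 1..].chars().take_while(|c| c.is_ascii_digit()).count();
    if digits_after == 3 { DecimalSep::Unknown } else { as_sep(last) }
}
```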

---

## 8. Structural Correction Using Layout

Bounding box metadata enables a class of corrections that pure text analysis cannot reach.
**Reading-order transposition:** When the PDF rendering order does not match visual left-to-right order, extracted words on the same line appear in the wrong sequence. Detect this by comparing `x_min` values of consecutive spans on the same baseline. If the `x_min` of span `n` is greater than the `x_min` of span `n+1` and both spans share the same `y` band, the spans are out of order for left-to-right text. Swap them and record a `PositionTransposition` correction.
**Duplicate word detection:** Double-rendering artifacts (a glyph painted twice at slightly offset coordinates) produce duplicate tokens in the same position. Detect pairs of spans with identical text and bounding boxes that overlap by more than 80% of their area. Discard the duplicate; record a `DuplicateRemoval` correction.
**Column boundary errors:** In multi-column layouts, OCR occasionally merges the rightmost word of column 1 with the leftmost word of column 2. Detect by checking whether a token's bounding box crosses a detected column boundary. Split at the boundary and re-tokenize each part.
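
The first two layout corrections are sketched below. The `Span` struct here is a trimmed, hypothetical stand-in for the real span type, and measuring overlap against the smaller of the two boxes is one possible reading of "80% of their area". Column-boundary splitting is not covered.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub x_min: f32,
    pub y_min: f32,
    pub x_max: f32,
    pub y_max: f32,
}

fn area(s: &Span) -> f32 {
    (s.x_max - s.x_min).max(0.0) * (s.y_max - s.y_min).max(0.0)
}

/// Intersection area divided by the smaller box's area (0.0 for empty boxes).
fn overlap_fraction(a: &Span, b: &Span) -> f32 {
    let w = (a.x_max.min(b.x_max) - a.x_min.max(b.x_min)).max(0.0);
    let h = (a.y_max.min(b.y_max) - a.y_min.max(b.y_min)).max(0.0);
    let smaller = area(a).min(area(b));
    if smaller > 0.0 { w * h / smaller } else { 0.0 }
}

/// Drop spans that repeat an earlier span's text at nearly the same position.
pub fn remove_duplicates(spans: &[Span]) -> Vec<Span> {
    let mut kept: Vec<Span> = Vec::new();
    for s in spans {
        let dup = kept
            .iter()
            .any(|k| k.text == s.text && overlap_fraction(k, s) > 0.8);
        if !dup {
            kept.push(s.clone());
        }
    }
    kept
}

/// Sort the spans of one line by `x_min`; returns how many adjacent pairs
/// were out of order beforehand, i.e. how many transpositions to record.
pub fn restore_left_to_right(line: &mut [Span]) -> usize {
    let inversions = line.windows(2).filter(|w| w[0].x_min > w[1].x_min).count();
    line.sort_by(|a, b| a.x_min.total_cmp(&b.x_min));
    inversions
}
```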

---

## 9. Correction Pipeline Ordering

The order of correction stages determines correctness. A wrong order produces cascading errors.

```
1. Character-level encoding recovery (font CMap / ToUnicode pipeline)
2. Structural layout correction (position transpositions, duplicates)
3. Regex patterns for known formats (dates, currencies, URLs)
4. Digit confusable substitution in numeric regions
5. Confusable graph candidates for systematic glyph errors
6. Dictionary lookup + edit-distance candidates
7. N-gram context scoring (Viterbi or beam search)
8. Hyphenation artifact joining
9. Number and unit normalization (final formatting pass)
```
**Why this order matters:**

- Dictionary lookup before numeric correction causes `l23` to be treated as a misspelled word and matched against candidates such as `leg`; the numeric intent is invisible to the word-level corrector.
- Hyphenation joining after dictionary lookup ensures the joined form is verified against the same dictionary already loaded in memory.
- N-gram scoring after confusable expansion gives the language model a rich but targeted candidate set rather than generic edit-distance noise.
- Structural corrections must precede all text-level corrections; spans left in the wrong order feed nonsense to every downstream stage.

---

## 10. Correction Metadata

Every correction must be traceable. Expose correction records on each span:

```rust
pub struct Correction {
    pub original: String,
    pub corrected: String,
    pub correction_type: CorrectionType,
    pub confidence: f32, // 0.0–1.0
    pub span_index: usize,
}

pub enum CorrectionType {
    ConfusableSubstitution,
    DictionaryReplacement,
    RegexPattern(String), // pattern name
    HyphenJoin,
    NumberNormalization,
    PositionTransposition,
    DuplicateRemoval,
}
```
Each `Span` in the extraction result carries a `corrections: Vec<Correction>` field. Page metadata includes a `correction_count: usize` and a `low_confidence_count: usize` (corrections with `confidence < 0.6`) for callers that want a quick quality signal without walking every span.
**Caller policy options:** The extraction API should allow callers to configure a `CorrectionPolicy`:

- `AutoAccept`: apply all corrections above a confidence threshold in the output text.
- `FlagOnly`: return original text with corrections annotated but not applied.
- `ReviewThreshold(f32)`: auto-accept high-confidence corrections, flag the rest.
This lets a downstream LLM pipeline receive clean text directly while a human-review pipeline sees the original alongside the suggestions. The correction metadata is sufficient to reconstruct either representation from the other without re-running extraction.
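
One way the policy could drive the emitted text is sketched here. The `Correction` struct is trimmed to the fields this example needs, `emitted_text` and `low_confidence_count` are hypothetical helpers, and using 0.6 as the `AutoAccept` threshold is an assumption borrowed from the low-confidence cutoff mentioned above.

```rust
pub enum CorrectionPolicy {
    AutoAccept,
    FlagOnly,
    ReviewThreshold(f32),
}

/// Trimmed version of the correction record, enough for this sketch.
pub struct Correction {
    pub original: String,
    pub corrected: String,
    pub confidence: f32,
}

/// Assumed default threshold for `AutoAccept`.
pub const DEFAULT_ACCEPT_THRESHOLD: f32 = 0.6;

/// Text to place in the output for one correction under `policy`.
/// The record itself is always kept, so the other form stays recoverable.
pub fn emitted_text<'a>(c: &'a Correction, policy: &CorrectionPolicy) -> &'a str {
    let accept = match policy {
        CorrectionPolicy::AutoAccept => c.confidence >= DEFAULT_ACCEPT_THRESHOLD,
        CorrectionPolicy::FlagOnly => false,
        CorrectionPolicy::ReviewThreshold(t) => c.confidence >= *t,
    };
    if accept { &c.corrected } else { &c.original }
}

/// Page-level quality signal: corrections with confidence below 0.6.
pub fn low_confidence_count(corrections: &[Correction]) -> usize {
    corrections.iter().filter(|c| c.confidence < 0.6).count()
}
```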
201
docs/research/presentation-and-spreadsheet-pdfs.md
Normal file
@@ -0,0 +1,201 @@
# Presentation and Spreadsheet PDFs

## Overview
PDFs produced by presentation tools (PowerPoint, Keynote, Google Slides) and spreadsheet tools (Excel, Google Sheets, LibreOffice Calc) are structurally distinct from document PDFs. They share a common deficiency: neither was designed for linear reading. A presentation arranges text for visual impact across a large canvas; a spreadsheet arranges text for data inspection across a grid. Generic extraction — concatenating text in top-to-bottom, left-to-right scan order — produces unusable output for both types. This document describes the structural characteristics of each, detection heuristics, and extraction algorithms suited to each.

---

## 1. Presentation PDF Characteristics

Presentation PDFs exhibit a consistent set of structural traits regardless of authoring tool.
**Page geometry.** Slides are exported at fixed aspect ratios. The traditional 4:3 ratio maps to 10×7.5 inches, or 720×540 pt at 72 pt per inch. Widescreen 16:9 maps to 13.33×7.5 inches (960×540 pt) or 10×5.625 inches (720×405 pt) depending on the application. A page whose width/height ratio is within 2% of 4/3 (≈1.333) or 16/9 (≈1.778) is a strong presentation signal.
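**Text density.** Slides carry very little text relative to page area. Body text set at 10–12pt packs on the order of 60–100 characters per square inch of text area, so a full page holds a few thousand characters. A slide may contain 40–200 characters across the entire page. Characters-per-square-point (csp) below roughly 0.08 is used here as the low-density indicator, but treat that figure as a placeholder: a 720×540 pt slide with 200 characters comes to about 0.0005 csp and a full page of body text to roughly 0.005–0.01 csp, so the working threshold must be tuned against a corpus.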
**Font sizes.** Title text is typically 28–54pt. Body bullets are 18–32pt. Captions and fine print may drop to 12pt but rarely lower. The median font size across all glyph runs on a slide is almost always above 18pt. A document with median font size above 18pt and low character density is almost certainly a presentation.
**Short, disconnected text runs.** Each text box is an independent content stream fragment. Unlike document paragraphs, slide text boxes are spatially isolated and not connected by semantic flow. A single page may contain 4–12 discrete text clusters with large whitespace gaps between them. The fraction of page area not covered by glyph bounding boxes gives a sparsity coefficient; values above 0.80 are characteristic of slides.
**Heavy XObject usage.** Slides embed many images, icons, and vector graphics as XObjects. A page with more than three Form or Image XObjects and fewer than 300 glyphs is likely a slide. Decorative background shapes — filled rectangles, gradient regions, logos — are rendered as graphics, not text.
**No reading flow.** Text on a slide is positioned for visual composition, not for sequential reading. There is no implicit reading order between text boxes. The order in which text appears in the content stream (painting order) says nothing about semantic order.

---

## 2. Detecting Presentation PDFs

Detection operates at two levels: document metadata and per-page geometry.
**Producer metadata.** The `/Creator` and `/Producer` entries in the document's `/Info` dictionary and the `pdf:Producer` / `xmp:CreatorTool` fields in XMP metadata identify the authoring application and the library that wrote the PDF. Relevant substrings to match (case-insensitive):

- `"Microsoft PowerPoint"`: PowerPoint on Windows/macOS
- `"Keynote"`: Apple Keynote
- `"Google Slides"`: Google Slides via Chromium-based export
- `"LibreOffice Impress"`: LibreOffice Impress
A metadata match alone is sufficient to set `document_type = "presentation"` with high confidence, though page-level heuristics should still run to detect mixed-type documents.
**Page-level heuristics.** When metadata is absent or ambiguous, aggregate the following signals across all pages:

1. `aspect_ratio_score`: fraction of pages whose width/height ratio is within 2% of 4/3 or 16/9.
2. `low_density_score`: fraction of pages with character density below 0.08 csp.
3. `large_font_score`: fraction of pages with median glyph font size above 18pt.
4. `sparse_text_score`: fraction of pages with more than 4 discrete text clusters and fewer than 300 total glyphs.
5. `xobject_score`: fraction of pages with XObject count exceeding glyph run count.
Combine scores with weights (e.g., aspect ratio 0.35, large font 0.25, low density 0.20, sparse text 0.15, xobject 0.05). A weighted sum above 0.60 triggers presentation mode. Store the raw score as `detection_confidence` in output metadata.
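
The weighted combination can be sketched as below. `PageStats` is a hypothetical struct standing in for whatever the page-analysis pass produces; the weights and thresholds are the example values from the text, and the aspect-ratio test uses a 2% relative tolerance. All of them are starting points for tuning, not calibrated constants.

```rust
/// Per-page measurements assumed to come from an earlier analysis pass.
pub struct PageStats {
    pub width_pt: f32,
    pub height_pt: f32,
    pub glyph_count: usize,
    pub median_font_pt: f32,
    pub text_clusters: usize,
    pub xobject_count: usize,
    pub glyph_run_count: usize,
}

/// True when `ratio` is within 2% (relative) of `target`.
fn near(ratio: f32, target: f32) -> bool {
    ((ratio - target) / target).abs() <= 0.02
}

/// Fraction of pages for which `pred` holds.
fn fraction(pages: &[PageStats], pred: impl Fn(&PageStats) -> bool) -> f32 {
    pages.iter().filter(|&p| pred(p)).count() as f32 / pages.len() as f32
}

/// Weighted presentation score in [0, 1]; above 0.60 triggers presentation mode.
pub fn presentation_score(pages: &[PageStats]) -> f32 {
    if pages.is_empty() {
        return 0.0;
    }
    let aspect = fraction(pages, |p| {
        let r = p.width_pt / p.height_pt;
        near(r, 4.0 / 3.0) || near(r, 16.0 / 9.0)
    });
    let low_density = fraction(pages, |p| {
        (p.glyph_count as f32) / (p.width_pt * p.height_pt) < 0.08
    });
    let large_font = fraction(pages, |p| p.median_font_pt > 18.0);
    let sparse = fraction(pages, |p| p.text_clusters > 4 && p.glyph_count < 300);
    let xobject = fraction(pages, |p| p.xobject_count > p.glyph_run_count);
    0.35 * aspect + 0.25 * large_font + 0.20 * low_density + 0.15 * sparse + 0.05 * xobject
}
```

Because each signal is a fraction of pages, a single stray slide in a long report moves the score only slightly, which is what the mixed-document handling later relies on.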

---

## 3. Slide Structure Extraction

Once a page is identified as a slide, text runs are classified into roles: `title`, `subtitle`, `bullet`, `caption`, `decorative`.
**Title identification.** Among all text runs on the page, select the run with the largest font size. If multiple runs share the largest size, prefer the topmost (the highest y-coordinate in native PDF space, where the origin is bottom-left; the lowest y value if coordinates have been flipped to a top-left origin). The title run is almost always within the top 30% of the slide height. A run whose bounding box top lies more than 40% of the page height below the top edge is unlikely to be a title regardless of font size.
**Bullet detection.** Runs with font size 0.55–0.85× the title font size, located below the title box, are body bullet candidates. Within a bullet cluster, hierarchical levels are inferred from two signals:

- **X-indent offset**: each level adds a consistent horizontal indent, typically 12–24pt per level. Compute the leftmost x-coordinate of each run; runs that share a left-edge (within 2pt tolerance) belong to the same level. Runs indented further are child levels.
- **Font size reduction**: level 2 bullets are often 2–4pt smaller than level 1. Track font size alongside indent to resolve ambiguous cases.
Bullet markers (•, –, ▸, numerals followed by `.` or `)`) should be detected and stripped from the text content but recorded in the `bullet_marker` field to allow downstream reconstruction.
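
Indent-based level assignment and marker stripping can be sketched as follows. Both function names are hypothetical. The sketch uses only the x-indent signal, not font size, and recognises only the three symbol markers listed above; numeral markers such as `1.` are left for a fuller implementation.

```rust
/// Assign a hierarchy level to each run from its left edge. Left edges
/// within `tol` points of a level's first (leftmost) edge share that level.
pub fn bullet_levels(left_edges: &[f32], tol: f32) -> Vec<u8> {
    let mut sorted: Vec<f32> = left_edges.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // One anchor per distinct indent, in increasing x order.
    let mut anchors: Vec<f32> = Vec::new();
    for x in sorted {
        let starts_new = anchors.last().map_or(true, |&a| (x - a).abs() > tol);
        if starts_new {
            anchors.push(x);
        }
    }
    left_edges
        .iter()
        .map(|&x| {
            anchors
                .iter()
                .position(|&a| (x - a).abs() <= tol)
                .unwrap_or(0) as u8
        })
        .collect()
}

/// Split a leading bullet marker from the text, returning both parts.
pub fn split_marker(text: &str) -> (Option<char>, &str) {
    let trimmed = text.trim_start();
    let mut chars = trimmed.chars();
    match chars.next() {
        Some(c) if matches!(c, '•' | '–' | '▸') => (Some(c), chars.as_str().trim_start()),
        _ => (None, trimmed),
    }
}
```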
**Decorative text filtering.** Text that meets any of the following criteria is marked `decorative` and excluded from the logical output:

- Single Unicode characters in the Private Use Area or Wingdings/Symbol encoding (icon fonts).
- Font size below 8pt (watermarks, slide number labels in corners).
- Bounding box overlapping a large filled rectangle or image XObject (background text).
- Opacity below 0.30, as set by the `ca` (fill) and `CA` (stroke) entries of the ExtGState dictionary selected with the `gs` operator.

---

## 4. Text Box Reading Order for Slides

Slides have no canonical reading order. The content-stream painting order reflects z-ordering (background to foreground), not reading sequence. A viable reading-order heuristic:

1. Assign the title run rank 0.
2. For remaining non-decorative runs, compute a sort key: `sort_key = (y_band * 1000) + x_position`, where `y_band = floor(y_from_top / (page_height * 0.15))` and `y_from_top = page_height - y_center` in native PDF coordinates (origin bottom-left). This groups runs into horizontal bands of 15% page height each, numbered from the top, then sorts left-to-right within a band. The multiplier 1000 assumes `x_position` stays below 1000 pt, which holds for the slide sizes listed in Section 1.
3. The title always leads; bands are ordered top-to-bottom.
When two text boxes overlap (their bounding rectangles intersect), prefer the one with larger font size as earlier in reading order. If font sizes match, prefer the one with greater area.
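
The band-based ordering can be sketched as below. `Run` and `sort_reading_order` are hypothetical names. The sketch takes `y_center` in native PDF coordinates and converts it to a distance from the top edge, and it compares a tuple key instead of packing the band and x position into one number. The overlap tie-break is not implemented.

```rust
pub struct Run {
    pub text: String,
    pub x_min: f32,
    /// Vertical centre in native PDF coordinates (origin bottom-left).
    pub y_center: f32,
    pub is_title: bool,
}

/// Index of the 15%-of-page-height band containing the run, counted from the top.
fn band(run: &Run, page_height: f32) -> i32 {
    let y_from_top = page_height - run.y_center;
    (y_from_top / (page_height * 0.15)).floor() as i32
}

/// Order runs: title first, then top-to-bottom by band, left-to-right within a band.
pub fn sort_reading_order(runs: &mut [Run], page_height: f32) {
    runs.sort_by(|a, b| {
        // `false < true`, so negating `is_title` puts the title first.
        let ka = (!a.is_title, band(a, page_height));
        let kb = (!b.is_title, band(b, page_height));
        ka.cmp(&kb).then(a.x_min.total_cmp(&b.x_min))
    });
}
```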

---

## 5. Speaker Notes

PowerPoint and Keynote support per-slide speaker notes. PDF export behavior varies:
**Notes Pages layout.** Some exporters include notes by appending a "Notes Page" after each slide: a second page (or second half of a landscape-split page) containing the slide thumbnail in the top half and notes text in the bottom half. Detection: if a page is portrait and roughly twice as tall as the slide it contains (for a stacked 4:3 landscape slide, a height of about 1.5× the page width), the bottom half below the midpoint likely contains notes text.
**In-page notes region.** Some exporters render notes in a visually distinct region on the same page: smaller font (typically 10–12pt), wider margins, and often a thin horizontal rule separating it from the slide content. Detect by: font size drop below 14pt in a run cluster located in the bottom 35% of the page, with horizontal extent spanning more than 70% of page width (wider than typical slide content).
**Labeling.** Notes text must be extracted separately from slide content and tagged `role: "notes"` in the output. Notes should not participate in bullet hierarchy reconstruction or reading-order sorting.

---

## 6. Spreadsheet PDF Characteristics

Spreadsheet PDFs are visually dominated by a regular grid of cells. Characteristic traits:
**Dense, small text.** Cell content is typically 8–11pt. Character density is very high — often 0.4–1.2 csp, comparable to dense body text but distributed uniformly across the page rather than in paragraph blocks.
**Thin border lines.** Cell borders are hairline rules: 0.25–0.5pt stroke width, drawn as horizontal and vertical path segments forming a grid. These are far thinner than the ruled lines typical of document tables (usually 0.75–1.5pt). Stroke width at or below 0.5pt is a strong spreadsheet indicator.
**Cell alignment patterns.** Number columns are right-aligned; label columns are left-aligned; headers are often centered. This alignment is consistent within a column across all rows — a much stronger regularity than in document tables.
**Multi-sheet exports.** Excel and LibreOffice Calc export multiple sheets as page sequences. Each sheet's pages share a running header or footer containing the sheet name. Sheet boundaries are not otherwise marked in the PDF structure.

---

## 7. Spreadsheet Table Extraction

The grid-detection algorithm from general table extraction applies, but with calibration specific to spreadsheet hairlines.
**Grid construction.** Collect all horizontal and vertical line segments with stroke width ≤ 0.5pt. Cluster horizontals by y-coordinate (tolerance 1pt) and verticals by x-coordinate (tolerance 1pt). The resulting grid defines cell bounding boxes as the rectangles formed by adjacent horizontal and vertical pairs.
**Merged cell detection.** A merged cell is identified by the absence of an interior grid line where one would be expected. For a cell spanning columns c1 through c2, the vertical line at x-coordinate between c1 and c2 is missing in the row range occupied by the merged cell. Build the full grid skeleton and flag every missing interior segment; the corresponding cell region is a merge candidate, confirmed if a single text run's bounding box spans the merged region.
**Multi-line cell content.** A cell may contain line-wrapped text. Multiple glyph runs within the same cell bounding box, at different y-coordinates, belong to the same cell. Join consecutive runs with a newline when their baselines differ by more than 1.2× the font size (hard wrap) and with a space when the difference is smaller (soft wrap, including small baseline shifts from kerning artifacts).
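
The grid-construction step of this section can be sketched as follows. `Segment`, `cluster`, `grid_lines`, and `cell_boxes` are hypothetical names, segments are assumed to be already axis-aligned, and merged-cell detection is not covered: every pair of adjacent grid lines yields a cell.

```rust
/// A stroked line segment taken from the page's path operators.
pub struct Segment {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
    pub stroke_width: f32,
}

/// Merge sorted coordinates that lie within `tol` of their predecessor
/// and return the mean of each cluster.
pub fn cluster(mut values: Vec<f32>, tol: f32) -> Vec<f32> {
    values.sort_by(|a, b| a.total_cmp(b));
    let mut clusters: Vec<Vec<f32>> = Vec::new();
    for v in values {
        let extend = clusters
            .last()
            .map_or(false, |c| v - c[c.len() - 1] <= tol);
        if extend {
            if let Some(c) = clusters.last_mut() {
                c.push(v);
            }
        } else {
            clusters.push(vec![v]);
        }
    }
    clusters
        .iter()
        .map(|c| c.iter().sum::<f32>() / c.len() as f32)
        .collect()
}

/// Clustered x positions of vertical hairlines and y positions of horizontal ones.
pub fn grid_lines(segments: &[Segment]) -> (Vec<f32>, Vec<f32>) {
    let hair: Vec<&Segment> = segments.iter().filter(|s| s.stroke_width <= 0.5).collect();
    let xs: Vec<f32> = hair.iter().filter(|s| (s.x0 - s.x1).abs() < 0.01).map(|s| s.x0).collect();
    let ys: Vec<f32> = hair.iter().filter(|s| (s.y0 - s.y1).abs() < 0.01).map(|s| s.y0).collect();
    (cluster(xs, 1.0), cluster(ys, 1.0))
}

/// Cell rectangles `[x_min, y_min, x_max, y_max]` between adjacent grid lines.
pub fn cell_boxes(xs: &[f32], ys: &[f32]) -> Vec<[f32; 4]> {
    let mut cells = Vec::new();
    for y in ys.windows(2) {
        for x in xs.windows(2) {
            cells.push([x[0], y[0], x[1], y[1]]);
        }
    }
    cells
}
```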

---

## 8. Sheet Boundaries in Multi-Sheet Exports

Running headers in spreadsheet PDFs typically contain the sheet name, the file name, or both. Detection algorithm:

1. Extract all text in the top 8% and bottom 8% of each page (header/footer zones).
2. Collect unique header strings across all pages; the string that changes between page groups is the sheet name candidate.
3. Group consecutive pages sharing an identical header string into a sheet run.
4. The first occurrence of a new header string marks a sheet boundary.
If no header is present, fall back to detecting a column-count change or a significant shift in the leftmost column x-position between consecutive pages.
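
Steps 3 and 4 amount to run-length grouping of per-page header strings, sketched below. `sheet_runs` is a hypothetical name; the input is assumed to hold one already-extracted header string per page (`None` when the header zone is empty), and the column-count fallback is not implemented.

```rust
use std::ops::Range;

/// Group consecutive pages that share the same header string.
/// Returns (sheet name, 0-indexed half-open page range) per sheet run.
pub fn sheet_runs(headers: &[Option<String>]) -> Vec<(Option<String>, Range<u32>)> {
    let mut runs: Vec<(Option<String>, Range<u32>)> = Vec::new();
    for (i, header) in headers.iter().enumerate() {
        let page = i as u32;
        let same_sheet = matches!(runs.last(), Some((name, _)) if name == header);
        if same_sheet {
            // Extend the current run to include this page.
            if let Some(last) = runs.last_mut() {
                last.1.end = page + 1;
            }
        } else {
            // A new header string marks a sheet boundary.
            runs.push((header.clone(), page..page + 1));
        }
    }
    runs
}
```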

---

## 9. Data Type Inference for Spreadsheet Cells

After cell text is extracted, infer the data type of each cell:

- **Integer**: matches `^-?\d{1,3}(,\d{3})*$` or `^-?\d+$` (locale-aware thousands separator).
- **Float**: matches `^-?\d+[.,]\d+$` after normalizing decimal separator.
- **Currency**: leading or trailing currency symbol (`$`, `€`, `£`, `¥`) with numeric body; strip symbol, parse as float.
- **Percentage**: trailing `%`; parse the numeric body and store as a float in [0, 1] (divide by 100).
- **Date**: attempt parsing against a priority list: ISO 8601 (`YYYY-MM-DD`), US (`M/D/YYYY`), EU (`D.M.YYYY`), short year variants. Store as an ISO 8601 string.
- **Boolean**: exact match against `TRUE`/`FALSE`, `Yes`/`No`, `✓`/`✗`, `1`/`0` (in cells where the column appears boolean-dominant).
- **Text**: fallback for anything not matched above.
Locale inference: if more than 30% of numeric cells use a comma as the decimal separator (values like `1.234,56`), set `locale = "eu"` and swap separator roles before parsing.
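
A dependency-free sketch of the classification pass follows. It mirrors the patterns above with hand-written checks instead of a regex crate. Two simplifications are assumptions of this sketch: only ISO 8601 dates are recognised, and `1`/`0` are never treated as booleans because that decision needs column context. `infer_cell_type` and its helpers are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum CellType { Integer, Float, Currency, Percentage, Date, Boolean, Text }

fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// `^-?\d+$` or `^-?\d{1,3}(,\d{3})*$`
fn is_integer(s: &str) -> bool {
    let body = s.strip_prefix('-').unwrap_or(s);
    if all_digits(body) {
        return true;
    }
    let mut groups = body.split(',');
    let first = groups.next().unwrap_or("");
    let rest: Vec<&str> = groups.collect();
    all_digits(first)
        && first.len() <= 3
        && !rest.is_empty()
        && rest.iter().all(|g| g.len() == 3 && all_digits(g))
}

/// `^-?\d+[.,]\d+$`
fn is_float(s: &str) -> bool {
    let body = s.strip_prefix('-').unwrap_or(s);
    let mut parts = body.split(|c: char| c == '.' || c == ',');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(a), Some(b), None) => all_digits(a) && all_digits(b),
        _ => false,
    }
}

/// `YYYY-MM-DD` shape check only; field ranges are not validated.
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| {
            if i == 4 || i == 7 { *c == b'-' } else { c.is_ascii_digit() }
        })
}

fn is_currency_symbol(c: char) -> bool {
    matches!(c, '$' | '€' | '£' | '¥')
}

pub fn infer_cell_type(raw: &str) -> CellType {
    let t = raw.trim();
    let numeric = |s: &str| is_integer(s) || is_float(s);
    if matches!(t, "TRUE" | "FALSE" | "Yes" | "No" | "✓" | "✗") {
        return CellType::Boolean;
    }
    if let Some(body) = t.strip_suffix('%') {
        if numeric(body.trim()) {
            return CellType::Percentage;
        }
    }
    let money = t
        .strip_prefix(is_currency_symbol)
        .or_else(|| t.strip_suffix(is_currency_symbol));
    if let Some(body) = money {
        if numeric(body.trim()) {
            return CellType::Currency;
        }
    }
    if is_integer(t) {
        return CellType::Integer;
    }
    if is_float(t) {
        return CellType::Float;
    }
    if is_iso_date(t) {
        return CellType::Date;
    }
    CellType::Text
}
```

The integer check runs before the float check, so `1,234` is classified as an integer with a thousands separator and not as a comma-decimal float; the locale pass described above decides whether that reading is right for a given document.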

---

## 10. Output Representation

### Presentation Output

```
PresentationDocument {
    document_type: "presentation",
    detection_confidence: f32,   // 0.0–1.0
    producer: Option<String>,
    slides: Vec<Slide>,
}

Slide {
    slide_number: u32,
    title: Option<String>,
    subtitle: Option<String>,
    bullets: Vec<Bullet>,        // hierarchical
    body_text: Vec<String>,      // non-bullet body runs
    notes: Option<String>,
}

Bullet {
    level: u8,                   // 0 = top level
    marker: Option<String>,      // bullet character or numeral
    text: String,
    children: Vec<Bullet>,
}
```
### Spreadsheet Output

```
SpreadsheetDocument {
    document_type: "spreadsheet",
    detection_confidence: f32,
    producer: Option<String>,
    sheets: Vec<Sheet>,
}

Sheet {
    sheet_name: Option<String>,
    page_range: Range<u32>,      // 0-indexed page numbers
    table: Table,                // reuses table schema from table-structure-reconstruction
}
```
The `Table` type is defined in `table-structure-reconstruction` and carries rows, cells, column spans, and merge annotations. Each `Cell` gains an additional `inferred_type` field (`CellType` enum: `Integer`, `Float`, `Currency`, `Percentage`, `Date`, `Boolean`, `Text`) populated by the data type inference pass.
The top-level `document_type` field uses the discriminant `"presentation" | "spreadsheet" | "document" | "form" | "mixed"`. A `"mixed"` classification applies when page-level heuristics disagree across more than 20% of pages — for example, a document that embeds a slide or a report that opens with a data table. In the mixed case, per-page classification is stored in a `page_classifications` array alongside the top-level type.

153
docs/research/semantic-text-reconstruction.md
Normal file

@ -0,0 +1,153 @@

# Semantic Text Reconstruction in PDF Extraction

Character-level Unicode recovery and word-level normalization handle the majority of extraction errors, but a class of failures resists both: cases where the raw bytes decode to plausible characters that nonetheless form meaningless or ambiguous text. Fixing these requires semantic context: understanding what the text is *about*, not just how individual glyphs encode. This document describes the algorithms, data structures, and Rust engineering concerns for implementing a semantic reconstruction layer in `pdftract`.

---

## 1. When Semantic Reconstruction Is Needed

Several classes of content systematically evade character- and word-level correction:

**Multi-word technical terms split across encoding boundaries.** A term like "polymerase chain reaction" may straddle a font-switch boundary mid-phrase, producing one half in correct encoding and the other in a shifted code page. Each half looks like a valid English word; only the joined phrase reveals the error.

**Scientific names garbled by font substitution.** Italic species names (*Escherichia coli*, *Drosophila melanogaster*) are often typeset in a separate italic font with its own broken ToUnicode CMap. The substitution produces character-by-character errors that a spell-checker cannot catch because the garbled form may accidentally be a common word.

**Proper nouns and acronyms.** A person's name, an organization acronym, or a product name has no entry in a general dictionary. Extraction errors in these spans go undetected by dictionary lookup. NER provides the discriminating signal.

**Cross-lingual content.** A Latin phrase in an English document (*inter alia*, *habeas corpus*) may be typeset in a decorative font without a ToUnicode entry, producing garbled output. The correct text is not in an English dictionary, but it is in a Latin lexicon and is highly recognizable by n-gram models trained on legal Latin.

**Hyphenated German compounds.** Words like *Verschlüsselungsalgorithmus* or *Haftpflichtversicherung* do not appear in compact dictionaries. Hyphenated splits (*Haftpflicht-versicherung*) add ambiguity: is the hyphen intentional or a line-break artifact? Compound-aware morphological analysis is the only reliable path.

**Mathematical notation in prose.** A formula like "O(n log n)" contains parentheses, letters, and operators whose font encodings frequently diverge. The span must be recognized as mathematical before any correction is applied; applying word-level normalization to a formula destroys it.

---

## 2. N-gram Context Reconstruction

When a span is flagged as low-confidence (below a configurable `confidence_threshold: f32`), the reconstructor enumerates alternative interpretations and scores each using a language model.

**Character n-gram scoring.** A character 5-gram language model, trained on a representative corpus in the target language, assigns a log-probability to each candidate string. For each low-confidence character position, enumerate substitutions constrained by the *character confusable set*: characters whose glyph shapes are known to be confused by common encoding bugs (e.g., `l`↔`1`, `O`↔`0`, `rn`↔`m`, `cl`↔`d`). Single-character confusions are stored as a `HashMap<char, SmallVec<[char; 4]>>` for expected O(1) lookup; multi-character pairs such as `rn`↔`m` do not fit a `char`-keyed map and need a separate string-keyed table.

**Word n-gram scoring.** After candidate character strings are generated, score them as word sequences using a compressed word bigram or trigram model (ARPA format, loaded as a finite-state acceptor). This promotes candidates that form fluent phrases over candidates that score well character-by-character but form nonsense word sequences.

**Beam search.** Enumerate alternatives using beam search over the character lattice of the low-confidence span. At each position, retain the top-`k` partial hypotheses by cumulative log-probability. Typical values: beam width `k = 16` for character-level search, `k = 8` after word n-gram rescoring. Wider beams improve recall, but cost grows with `k` at every position (each step scores `k × b` expansions for branching factor `b`); the tradeoff is configurable via `ReconstructorConfig { beam_width: usize, max_span_chars: usize }`. For spans longer than `max_span_chars` (default 40), fall back to greedy decoding to bound computation.

**Pruning.** Before scoring, prune hypotheses that violate hard constraints: the candidate must not contain characters outside the Unicode category set observed in surrounding context; the edit distance from the raw decoded form must not exceed a per-character threshold (default 1.5 edits per 10 characters). This eliminates the combinatorial explosion from considering arbitrary substitutions.
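The character-level search can be sketched as follows. This is an illustrative simplification under several assumptions: `beam_search` is a hypothetical function, only single-character confusables are handled, the `score` closure stands in for the 5-gram model and is called on each full prefix rather than accumulated incrementally, and the pruning constraints and word-level rescoring are omitted.

```rust
use std::collections::HashMap;

/// Beam search over single-character confusable substitutions.
/// `score` returns a log-probability-like value for a partial candidate
/// (higher is better); it is a stand-in for the character n-gram model.
fn beam_search(
    raw: &str,
    confusables: &HashMap<char, Vec<char>>,
    beam_width: usize,
    score: impl Fn(&str) -> f32,
) -> Vec<(String, f32)> {
    let mut beam: Vec<(String, f32)> = vec![(String::new(), 0.0)];
    for ch in raw.chars() {
        // Options at this position: the raw character plus its confusables.
        let mut options = vec![ch];
        if let Some(alts) = confusables.get(&ch) {
            options.extend(alts.iter().copied());
        }
        // Expand every hypothesis by every option and rescore the new prefix.
        let mut next = Vec::with_capacity(beam.len() * options.len());
        for (prefix, _) in &beam {
            for &opt in &options {
                let mut cand = prefix.clone();
                cand.push(opt);
                let s = score(&cand);
                next.push((cand, s));
            }
        }
        // Keep the top-k hypotheses, best first.
        next.sort_by(|a, b| b.1.total_cmp(&a.1));
        next.truncate(beam_width);
        beam = next;
    }
    beam
}
```

With a toy scorer that penalizes every character that is not a lowercase ASCII letter, the raw span `ph0tosynthesis` and a confusable entry `'0' → ['o', 'O']` produce `photosynthesis` as the top hypothesis.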

---

## 3. Named Entity Recognition for Validation

A lightweight NER model (a CRF or a small transformer quantized to 8-bit integers) classifies spans as `Person`, `Organization`, `Location`, `Date`, `Number`, or `Other`. NER serves two roles: identifying *what kind of thing* a span is, and then applying entity-type-specific validation.

**Type-specific validation.** A span classified as `Date` must parse as a valid calendar date under the document's locale. A span classified as `Number` must be parseable as an integer, decimal, or scientific notation value. A span classified as `Organization` is checked against a pre-loaded organization lexicon. When validation fails, the span is flagged for reconstruction; when it passes, the raw extraction is accepted regardless of character-level confidence.
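A minimal sketch of the validation step for two of the entity types. The names `EntityType`, `valid_iso_date`, `valid_number`, and `needs_reconstruction` are hypothetical, and the date check assumes ISO `YYYY-MM-DD` only; locale-aware date formats and the organization lexicon lookup are out of scope here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum EntityType {
    Date,
    Number,
    Other,
}

/// Accepts only `YYYY-MM-DD` with a real calendar day (assumed format).
fn valid_iso_date(s: &str) -> bool {
    let parts: Vec<&str> = s.split('-').collect();
    if parts.len() != 3 || parts[0].len() != 4 {
        return false;
    }
    let nums: Vec<u32> = parts.iter().filter_map(|p| p.parse().ok()).collect();
    if nums.len() != 3 {
        return false;
    }
    let (y, m, d) = (nums[0], nums[1], nums[2]);
    let leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    let days = match m {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 => if leap { 29 } else { 28 },
        _ => return false,
    };
    (1..=days).contains(&d)
}

/// Integer, decimal, or scientific notation; rejects "inf" and "NaN".
fn valid_number(s: &str) -> bool {
    s.trim().parse::<f64>().map_or(false, |v| v.is_finite())
}

/// True when the span fails its type-specific check and should be reconstructed.
fn needs_reconstruction(span: &str, ty: EntityType) -> bool {
    match ty {
        EntityType::Date => !valid_iso_date(span),
        EntityType::Number => !valid_number(span),
        EntityType::Other => false,
    }
}
```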

**Context-conditioned classification.** The entity classifier uses a sliding window of surrounding tokens as context. A span surrounded by financial terminology (ticker symbols, "EBITDA", "basis points") is classified as a financial entity before the span itself is inspected. This reduces false positives where a garbled span accidentally resembles a common word but is semantically a proper noun.

---

## 4. Domain-Specific Lexicons

A general English dictionary misses the vast majority of domain vocabulary. `pdftract` loads supplementary lexicons based on a document classification step that runs before reconstruction.

**Domain lexicon types:** legal (Latin maxims, case citation formats, court names), medical (ICD-10 codes, drug generic and brand names, anatomical terms, lab value abbreviations), financial (ticker symbols, CUSIP/ISIN patterns, ratio names), scientific (IUPAC chemical names, species binomials, journal abbreviations per ISO 4).

**Document classification trigger.** A lightweight bag-of-words classifier (multinomial naive Bayes, ~50 KB model) classifies the document into one or more domains after the first-pass extraction. Domains with posterior probability above 0.6 trigger loading the corresponding lexicon. Multiple domains are possible (a clinical trial report is both medical and scientific).

**Bloom filter storage.** Each domain lexicon is stored in a probabilistic membership filter (a Bloom filter or a close relative) for O(1) membership queries with a bounded false positive rate. One option is a cuckoo filter (using the `cuckoofilter` crate or a hand-rolled implementation) sized at about 12 bits per entry, which targets a false positive rate on the order of 0.1%. Term lookup: normalize the candidate string (lowercase, NFC) and query the filter; on a positive hit, optionally verify by binary search over a sorted term table backed by a `&[u8]` slice for exact confirmation. At 12 bits per entry, a 500,000-term medical lexicon occupies approximately 750 KB (500,000 × 12 / 8 bytes).
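To make the membership-filter idea concrete, here is a std-only Bloom filter sketch. It is not the cuckoo filter mentioned above and not the `cuckoofilter` crate's API: `BloomFilter`, `with_rate`, `insert`, and `may_contain` are hypothetical names, the two hashes come from `DefaultHasher` with double hashing, and normalization is reduced to lowercasing (NFC is omitted because it needs an external crate).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    /// Sizes the filter for `n` entries at target false-positive rate `p`,
    /// using m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.
    fn with_rate(n: usize, p: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil().max(64.0) as u64;
        let k = ((m as f64 / n.max(1) as f64) * ln2).round().max(1.0) as u32;
        BloomFilter { bits: vec![0; ((m + 63) / 64) as usize], num_bits: m, num_hashes: k }
    }

    /// Bit positions for a term via double hashing: h1 + i * h2 (mod m).
    fn indexes(&self, term: &str) -> Vec<u64> {
        let norm = term.to_lowercase(); // NFC normalization omitted in this sketch
        let mut h1 = DefaultHasher::new();
        norm.hash(&mut h1);
        let a = h1.finish();
        let mut h2 = DefaultHasher::new();
        (norm.as_str(), 0x9e37_79b9u32).hash(&mut h2);
        let b = h2.finish() | 1; // odd step so successive probes differ
        (0..self.num_hashes as u64)
            .map(|i| a.wrapping_add(i.wrapping_mul(b)) % self.num_bits)
            .collect()
    }

    fn insert(&mut self, term: &str) {
        for i in self.indexes(term) {
            self.bits[(i / 64) as usize] |= 1u64 << (i % 64);
        }
    }

    /// False means definitely absent; true means present or a false positive.
    fn may_contain(&self, term: &str) -> bool {
        self.indexes(term)
            .into_iter()
            .all(|i| self.bits[(i / 64) as usize] & (1u64 << (i % 64)) != 0)
    }
}
```

A plain Bloom filter needs roughly 14.4 bits per entry for a 0.1% false positive rate, which is one reason a fingerprint-based filter is attractive when the storage budget is fixed.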

---

## 5. Cross-Span Consistency

The same visual glyph sequence must extract consistently across the document. Inconsistent extraction, where the same term appears as "photosynthesis" on page 3 and "ph0tosynthesis" on page 17 due to differing font embedding quality, is a common failure mode in multi-section PDFs assembled from separately authored chapters.

**Canonical form selection.** After per-page extraction, group textual spans by their character-level fingerprint: the sequence of Unicode general categories and ASCII characters, with non-ASCII collapsed to a category placeholder. For variants such as "ph0tosynthesis" and "photosynthesis" to share a group, the fingerprint must also fold each confusable class (for example `O`/`0`/`o` and `l`/`1`/`I`) to a single representative; otherwise the differing ASCII characters yield distinct fingerprints. Within each group, select the canonical form as the extraction with the highest summed confidence score; if confidence is tied, use the most frequent string. Write the canonical form to a `HashMap<SpanFingerprint, Arc<str>>`.

**Normalization pass.** In a second pass, any span whose extracted string differs from the canonical form for its fingerprint is replaced with the canonical form and its `reconstruction_method` set to `CrossSpan`. The replacement is only performed when the edit distance between the variant and the canonical form is below a threshold (default: 15% of canonical length), preventing spurious merging of genuinely distinct terms.
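The two passes can be sketched as below. Everything here is illustrative: `Span`, `fingerprint`, `canonical_forms`, and `normalize` are hypothetical names, the fingerprint is a deliberate simplification that folds a handful of confusable characters and collapses non-ASCII to one placeholder, and plain `String` keys stand in for `SpanFingerprint` and `Arc<str>`.

```rust
use std::collections::HashMap;

struct Span {
    text: String,
    confidence: f32,
}

/// Simplified fingerprint: fold a few confusable classes, collapse non-ASCII.
fn fingerprint(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '0' | 'O' | 'o' => 'o',
            '1' | 'l' | 'I' => 'l',
            c if c.is_ascii() => c,
            _ => '\u{FFFD}',
        })
        .collect()
}

/// Pass 1: per fingerprint, pick the variant with the highest summed confidence,
/// breaking ties by frequency.
fn canonical_forms(spans: &[Span]) -> HashMap<String, String> {
    let mut groups: HashMap<String, HashMap<&str, (f32, usize)>> = HashMap::new();
    for s in spans {
        let e = groups
            .entry(fingerprint(&s.text))
            .or_default()
            .entry(s.text.as_str())
            .or_insert((0.0, 0));
        e.0 += s.confidence;
        e.1 += 1;
    }
    groups
        .into_iter()
        .map(|(fp, variants)| {
            let best = variants
                .into_iter()
                .max_by(|(_, (ca, na)), (_, (cb, nb))| ca.total_cmp(cb).then(na.cmp(nb)))
                .map(|(text, _)| text.to_string())
                .unwrap();
            (fp, best)
        })
        .collect()
}

fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let sub = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            cur[j] = sub.min(prev[j] + 1).min(cur[j - 1] + 1);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Pass 2: replace variants within 15% edit distance of the canonical form.
/// Returns the number of spans rewritten.
fn normalize(spans: &mut [Span], canon: &HashMap<String, String>) -> usize {
    let mut replaced = 0;
    for s in spans.iter_mut() {
        if let Some(c) = canon.get(&fingerprint(&s.text)) {
            let limit = 0.15 * c.chars().count() as f32;
            if *c != s.text && (levenshtein(&s.text, c) as f32) < limit {
                s.text = c.clone();
                replaced += 1;
            }
        }
    }
    replaced
}
```

A real implementation would also set `reconstruction_method` to `CrossSpan` on each rewritten span; the sketch only counts replacements.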

---

## 6. Abbreviation Expansion

Abbreviations break sentence boundary detection, inflate vocabulary, and reduce readability. The reconstruction layer handles three kinds:

**Standard abbreviations.** A trie-based lookup (`AhoCorasick` automaton from the `aho-corasick` crate) matches spans against a compiled list of standard abbreviations ("e.g.", "i.e.", "cf.", "op. cit.", "et al."). On a match, the span is tagged with its expansion in the output metadata; the `text` field retains the abbreviated form by default (expansion is opt-in via `ReconstructorConfig { expand_abbreviations: bool }`).

**Document-internal definitions.** The pattern `<full form> (<short form>)` defines a document-local abbreviation. It is detected with a regex over the token stream: `\b([A-Z][a-z]+(?: [A-Z][a-z]+)+)\s+\(([A-Z]{2,8})\)`. Note that this pattern only matches full forms written in title case. On detection, insert into a per-document `HashMap<String, String>` mapping short form to long form. All subsequent occurrences of the short form are expanded using this map, with `reconstruction_method` set to `DocumentAbbrev`.

**Ambiguous period handling.** A period following a known abbreviation is not a sentence boundary. The abbreviation table is shared between the abbreviation expander and the sentence boundary detector, so both apply the same rule.

---

## 7. Reference and Citation Reconstruction

Bibliography sections have distinct extraction failure patterns: author names are frequently reordered or truncated, journal titles run together with volume numbers, and DOIs are broken by line-wrap hyphenation.

**Zone detection.** The bibliography zone is identified by a combination of signals: high density of year-pattern tokens (four-digit sequences in the range 1900–2050), high density of capitalized multi-word sequences, and a section header matching a list of known bibliography heading strings. Once identified, reconstruction rules specific to the reference zone are applied.

**Structural validation.** A DOI must match `10\.\d{4,9}/[-._;()/:A-Za-z0-9]+`. An ISSN must match `\d{4}-\d{3}[\dX]`. A year must parse as an integer in the range 1700–2100. When a span in the reference zone partially matches one of these patterns but contains obvious character substitutions (e.g., `l0.1038/...` where `l0` should be `10`), apply a targeted correction using the pattern as a template.

**DOI normalization.** Hyphens inserted by PDF line-wrapping inside a DOI are detected by splitting on the hyphen at the line break and checking whether the reassembled string still matches the DOI regex. If so, remove the hyphen. Because hyphens are themselves legal DOI characters, the regex check alone cannot prove that the hyphen was inserted, so this correction should carry a lower confidence than a template repair.
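A std-only sketch of DOI validation and template repair. The functions `is_valid_doi` and `repair_doi` are hypothetical; the validator is a hand-written equivalent of the DOI regex above (anchored at both ends), and the repair step assumes that only the part before the first `/` contains letter-for-digit confusions. The DOI strings in the usage are made-up examples, and line-wrap hyphen handling is not covered.

```rust
/// Hand-written equivalent of `^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$`.
fn is_valid_doi(s: &str) -> bool {
    let rest = match s.strip_prefix("10.") {
        Some(r) => r,
        None => return false,
    };
    let (registrant, suffix) = match rest.split_once('/') {
        Some(pair) => pair,
        None => return false,
    };
    (4..=9).contains(&registrant.len())
        && registrant.bytes().all(|b| b.is_ascii_digit())
        && !suffix.is_empty()
        && suffix
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b"-._;()/:".contains(&b))
}

/// Template repair: the prefix before the first '/' must be digits and a dot,
/// so letter look-alikes there are mapped back to digits.
fn repair_doi(raw: &str) -> Option<String> {
    if is_valid_doi(raw) {
        return Some(raw.to_string());
    }
    let (head, tail) = raw.split_once('/')?;
    let fixed: String = head
        .chars()
        .map(|c| match c {
            'l' | 'I' | '|' => '1',
            'O' | 'o' => '0',
            other => other,
        })
        .collect();
    let candidate = format!("{}/{}", fixed, tail);
    if is_valid_doi(&candidate) {
        Some(candidate)
    } else {
        None
    }
}
```

The suffix is left untouched on purpose: letters are legal there, so a substitution cannot be inferred from the pattern alone.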

---

## 8. Sentence Boundary Detection

Periods are the most ambiguous character in prose text. A period may end a sentence, terminate an abbreviation, separate decimal digits, form an ellipsis, or appear inside a URL or DOI. A rule-based sentence boundary detector resolves ambiguity in order:

1. If the preceding token is in the abbreviation table, the period is not a sentence boundary.
2. If the preceding token is a single uppercase letter (initials), the period is not a sentence boundary.
3. If the following token begins with a lowercase letter, the period is not a sentence boundary.
4. If the period is followed by another period (ellipsis), it is not a sentence boundary.
5. Otherwise, the period is a sentence boundary.
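The five rules above translate almost directly into code. In this sketch the function name `is_sentence_boundary` and the token representation are assumptions: `prev` is the token before the period without the period itself, `next` is the following token if any, and the abbreviation table stores lowercase entries without their trailing period.

```rust
use std::collections::HashSet;

/// Applies the ordered rules to the period between `prev` and `next`.
fn is_sentence_boundary(prev: &str, next: Option<&str>, abbreviations: &HashSet<&str>) -> bool {
    // Rule 1: known abbreviation before the period.
    if abbreviations.contains(prev.to_lowercase().as_str()) {
        return false;
    }
    // Rule 2: a single uppercase letter (an initial).
    let mut chars = prev.chars();
    if let (Some(c), None) = (chars.next(), chars.next()) {
        if c.is_uppercase() {
            return false;
        }
    }
    if let Some(n) = next {
        match n.chars().next() {
            // Rule 3: the next token starts in lowercase.
            Some(c) if c.is_lowercase() => return false,
            // Rule 4: another period follows (ellipsis).
            Some('.') => return false,
            _ => {}
        }
    }
    // Rule 5: everything else is a boundary.
    true
}
```

Decimal numbers, URLs, and DOIs are assumed to arrive as single tokens from the tokenizer, so their internal periods never reach this function.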

**PDF-specific complications.** A line break in a PDF content stream does not imply a sentence break. After geometry-based line joining (handled in an earlier normalization stage), the sentence detector operates on joined paragraphs, not raw lines. Hyphenated line-end tokens that have been rejoined must not present a spurious word boundary to the detector.

---

## 9. Coherence Scoring for Reconstruction Candidates

When multiple reconstructions of a passage are plausible, select the best using a composite score:

- **(a) Word n-gram perplexity.** Lower perplexity is better. Computed using a trigram model with Kneser-Ney smoothing.
- **(b) Part-of-speech sequence probability.** Tag the candidate with a fast POS tagger (e.g., averaged perceptron); score the POS sequence under a bigram POS language model. Promotes syntactically coherent candidates.
- **(c) Entity consistency.** Count entity type conflicts between the candidate and the surrounding 200-token context window. A candidate that introduces an unexpected entity type (e.g., a location in a paragraph that is otherwise about persons) is penalized.
- **(d) Semantic similarity.** Encode the candidate and the surrounding paragraph using a compact sentence embedding model (e.g., a 4-layer distilled transformer, ~25 MB) and compute cosine similarity. Candidates that are semantically distant from their context are penalized.

The composite score is a weighted sum: `score = w_ppl * ppl + w_pos * pos_cost + w_entity * entity_penalty - w_sem * cos_sim`. Every term is oriented so that lower is better (similarity is subtracted), and the candidate with the lowest score is selected. Weights are configurable; default values are calibrated on a mixed-domain PDF corpus.
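The weighted sum and the selection step, as a minimal sketch. `Candidate`, `Weights`, `composite`, and `best` are hypothetical names, the four component values are assumed to be computed elsewhere, and the candidate with the lowest composite score is taken as the winner since the formula adds costs and subtracts similarity. No default weights are given in the text, so the caller supplies them.

```rust
struct Candidate {
    text: String,
    ppl: f32,            // (a) word n-gram perplexity
    pos_cost: f32,       // (b) cost of the POS sequence
    entity_penalty: f32, // (c) entity type conflicts
    cos_sim: f32,        // (d) similarity to the surrounding paragraph
}

struct Weights {
    ppl: f32,
    pos: f32,
    entity: f32,
    sem: f32,
}

/// score = w_ppl * ppl + w_pos * pos_cost + w_entity * entity_penalty - w_sem * cos_sim
fn composite(c: &Candidate, w: &Weights) -> f32 {
    w.ppl * c.ppl + w.pos * c.pos_cost + w.entity * c.entity_penalty - w.sem * c.cos_sim
}

/// Picks the candidate with the lowest composite score, if any.
fn best<'a>(candidates: &'a [Candidate], w: &Weights) -> Option<&'a Candidate> {
    candidates
        .iter()
        .min_by(|a, b| composite(a, w).total_cmp(&composite(b, w)))
}
```

Because raw perplexity can dwarf a cosine similarity bounded by 1, a practical calibration would likely rescale the components or use log-perplexity; that choice is left open here.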

---

## 10. Output and Confidence

Each reconstructed span in the `pdftract` output carries the following fields:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ReconstructedSpan {
    /// Raw bytes as decoded before reconstruction, percent-encoded if non-UTF-8.
    pub original_raw: String,
    /// Final reconstructed string, normalized to NFC.
    pub text: String,
    /// True if any reconstruction algorithm modified `text` relative to `original_raw`.
    pub reconstruction_applied: bool,
    /// Confidence in `text`, in [0.0, 1.0].
    pub reconstruction_confidence: f32,
    /// The primary algorithm responsible for the reconstruction.
    pub reconstruction_method: ReconstructionMethod,
}

/// Algorithm that produced `text`; `None` means the raw extraction was kept.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ReconstructionMethod {
    None,
    Dictionary,
    Ngram,
    Entity,
    CrossSpan,
    DomainLexicon,
    DocumentAbbrev,
}
```

**Page-level metrics.** Each `ExtractedPage` carries a `reconstruction_rate: f32`, the fraction of spans on that page for which `reconstruction_applied` is true. A page with `reconstruction_rate > 0.3` should be flagged in the caller's output as a low-quality extraction, potentially warranting an OCR fallback. The `ReconstructionMethod` distribution across a page (accessible via `page.reconstruction_method_histogram()`) gives the caller a breakdown of *why* reconstruction was applied, which is useful for diagnosing systematic problems in a PDF batch.

When `reconstruction_applied` is false, `reconstruction_confidence` reflects the confidence of the original extraction, not of the reconstruction pass. This preserves the meaning of the field: it always represents the extractor's confidence in `text`, not in the decision to reconstruct.
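The page-level metrics can be computed from the span fields alone. The sketch below is an assumption about how that might look, not the `ExtractedPage` implementation: `SpanSummary`, `reconstruction_rate`, and `method_histogram` are hypothetical free functions and types, the enum is redeclared locally so the block stands alone, and the histogram counts only spans where reconstruction was applied.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ReconstructionMethod {
    None,
    Dictionary,
    Ngram,
    Entity,
    CrossSpan,
    DomainLexicon,
    DocumentAbbrev,
}

/// Reduced view of a span: just the two fields the metrics need.
struct SpanSummary {
    reconstruction_applied: bool,
    reconstruction_method: ReconstructionMethod,
}

/// Fraction of spans on the page that were modified by reconstruction.
fn reconstruction_rate(spans: &[SpanSummary]) -> f32 {
    if spans.is_empty() {
        return 0.0;
    }
    let applied = spans.iter().filter(|s| s.reconstruction_applied).count();
    applied as f32 / spans.len() as f32
}

/// Counts reconstructed spans per method.
fn method_histogram(spans: &[SpanSummary]) -> HashMap<ReconstructionMethod, usize> {
    let mut hist = HashMap::new();
    for s in spans.iter().filter(|s| s.reconstruction_applied) {
        *hist.entry(s.reconstruction_method).or_insert(0) += 1;
    }
    hist
}
```

A caller could then flag the page when `reconstruction_rate(&spans) > 0.3` and inspect the histogram to see which method dominates.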