Add four research documents focused on readable text production

- type3-font-extraction: CharProcs stream parsing, TeX/dvips naming
  conventions, dHash shape fingerprinting, nested font stacks, OCR fallback
- watermark-and-background-separation: five PDF watermark mechanisms,
  transparency tracking, cross-page repetition, WCAG contrast detection,
  raster inpainting, diagonal watermark removal pipeline
- historical-and-degraded-document-extraction: eight degradation categories,
  bleed-through removal, illumination correction, Sauvola binarization,
  stroke reconstruction, Fraktur/long-s handling, confidence-gated output
- complex-layout-reading-order: baseline clustering, XY-cut, Docstrum,
  RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering,
  perplexity-based confidence with natural_order fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-16 15:13:10 -04:00
parent 31e715633d
commit f805e52fa3
4 changed files with 655 additions and 0 deletions


@ -0,0 +1,146 @@
# Complex Layout Reading Order Reconstruction
## The Fundamental Problem
PDF content streams encode painting order, not reading order. When an authoring tool renders a two-column academic paper, it may emit all text runs in the left column first, then the right column — or it may interleave them by y-coordinate, painting each horizontal band across both columns before advancing downward. A newspaper layout with three columns and a pull quote may serialize its content in any order the compositor chose. The PDF specification makes no guarantee.
The consequence is direct: even when every glyph is decoded correctly with perfect Unicode mapping, assembling text runs in content-stream order produces output that is unreadable. A reader sees: the first paragraph of column A, then the first paragraph of column B, then the second paragraph of column A. Sentences from unrelated paragraphs abut each other. The information is present but the text is noise.
For mixed-layout pages — a full-width title, a two-column abstract and body, a full-width footnote zone — the problem compounds. No single sorting heuristic handles all three layout regions correctly. A naïve y-descending, x-ascending sort works for single-column documents but produces interleaved text for any multi-column region.
Reading order reconstruction must therefore operate as a distinct post-extraction phase that groups raw glyph streams into spatial regions and imposes a linguistically correct traversal order over those regions.
---
## Baseline Clustering into Lines
The first stage collapses individual glyph bounding boxes into text lines. A glyph box is characterized by its baseline y-coordinate, its left and right x-extent, and its advance width. Glyphs belong to the same line when their baseline y-coordinates fall within a tolerance window:
```
|baseline_a - baseline_b| <= line_height * 0.3
```
where `line_height` is estimated from the median cap-height of the font size in use. The 0.3 factor accommodates minor baseline drift from kerning and glyph descent variation without merging adjacent lines.
Superscripts and subscripts complicate this threshold. A superscript glyph sits above the baseline of its host span and has a reduced font size; it visually belongs to the line it annotates but will fail the baseline proximity test. Detection heuristic: if a glyph's font size is less than 0.7× the modal font size on the line and its baseline is within one line-height of the line, classify it as a super/subscript and attach it to the nearest enclosing span rather than starting a new line.
Rotated text (common in table headers and figure labels, encoded via the text matrix `Tm`) requires separate handling. Extract the rotation angle from the text matrix, cluster rotated glyphs by their rotated baseline, and treat each rotation group as an independent line set. Rotated lines are assigned to their spatial bounding box for zone assignment but are not merged into the main reading order flow; they are emitted as annotated spans within whichever zone contains them.
The output of line clustering is an ordered list of `TextLine` structs, each carrying a bounding box (union of all constituent glyph boxes), a dominant font size, a baseline y-coordinate, and an ordered list of `Span` entries sorted by x-ascending.
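The baseline tolerance test can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Glyph` type and `cluster_lines` function are hypothetical names, each glyph is compared against the first glyph of the current line, and super/subscript and rotation handling are left out.

```rust
/// A glyph reduced to the fields the clustering step needs (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Glyph {
    pub ch: char,
    pub x: f32,        // left edge
    pub baseline: f32, // baseline y-coordinate (PDF space, y grows upward)
}

/// Group glyphs into lines. A glyph joins the current line when its baseline
/// is within `0.3 * line_height` of that line's first glyph.
pub fn cluster_lines(mut glyphs: Vec<Glyph>, line_height: f32) -> Vec<Vec<Glyph>> {
    let tol = line_height * 0.3;
    // Visit glyphs top to bottom: descending baseline in PDF coordinates.
    glyphs.sort_by(|a, b| b.baseline.total_cmp(&a.baseline));

    let mut lines: Vec<Vec<Glyph>> = Vec::new();
    for g in glyphs {
        let start_new = match lines.last() {
            Some(line) => (line[0].baseline - g.baseline).abs() > tol,
            None => true,
        };
        if start_new {
            lines.push(vec![g]);
        } else {
            lines.last_mut().unwrap().push(g);
        }
    }
    // Within each line, order glyphs left to right.
    for line in &mut lines {
        line.sort_by(|a, b| a.x.total_cmp(&b.x));
    }
    lines
}
```

Comparing against the first glyph of the line keeps the sketch short; a running median of the line's baselines would probably tolerate drift better on long lines.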
---
## Line Merging and Column Assignment via Gap Analysis
With lines established, column detection operates on their x-extents. For each line, record the set of horizontal gaps, that is, intervals of x-space not covered by any glyph in that line. Aggregate gap histograms across a sliding window of consecutive lines (typically 5–10 lines). A gap position that recurs across multiple lines and exceeds `median_word_space × 3` is a column separator candidate.
`median_word_space` is estimated from the modal inter-glyph spacing within lines at the dominant font size. For 12pt Times New Roman this is approximately 3.5pt; the column-gap threshold becomes roughly 10.5pt, which cleanly separates two-column academic layouts (gap ≈ 18–24pt) from inter-word spaces.
Column count inference: sort candidate separator x-positions; the number of columns equals the number of separators plus one. Validate by checking that each column band contains at least `min_lines_per_column` (default: 3) lines. A single separator that only spans two or three lines is more likely a paragraph indent or a figure caption offset than a true column boundary.
Each line is assigned to a column index based on which column band its x-centroid falls into. Lines whose bounding boxes span multiple column bands (full-width lines) are assigned to a synthetic "full-width" zone, which is handled during layout merging.
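A rough sketch of the gap-voting step, under simplifying assumptions: each line is given as the x-intervals of its words, gaps are bucketed by their midpoint with a bucket width equal to the threshold, and the sliding window is replaced by a single pass over all lines passed in. The function name `column_separators` and its parameters are illustrative only.

```rust
use std::collections::BTreeMap;

/// Return the x-positions of column separator candidates, left to right.
/// `lines` holds, for each text line, the (start, end) x-intervals of its
/// words sorted by start. A gap counts when it is wider than
/// `3 * median_word_space` and recurs in at least `min_lines` lines.
pub fn column_separators(
    lines: &[Vec<(f32, f32)>],
    median_word_space: f32,
    min_lines: usize,
) -> Vec<f32> {
    let threshold = median_word_space * 3.0;
    // bucket index -> (sum of gap midpoints, number of lines voting)
    let mut votes: BTreeMap<i32, (f32, usize)> = BTreeMap::new();
    for words in lines {
        for pair in words.windows(2) {
            let (gap_start, gap_end) = (pair[0].1, pair[1].0);
            if gap_end - gap_start > threshold {
                let mid = (gap_start + gap_end) / 2.0;
                let bucket = (mid / threshold).round() as i32;
                let entry = votes.entry(bucket).or_insert((0.0, 0));
                entry.0 += mid;
                entry.1 += 1;
            }
        }
    }
    votes
        .values()
        .filter(|&&(_, n)| n >= min_lines)
        .map(|&(sum, n)| sum / n as f32)
        .collect()
}
```

The column count is then `separators.len() + 1`. Fixed buckets can split votes for a gap that sits on a bucket boundary, so a real implementation would likely merge neighbouring buckets or cluster the midpoints instead.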
---
## Recursive XY-Cut Algorithm
XY-cut is the classical divide-and-conquer approach to layout segmentation. Given a set of text bounding boxes occupying a rectangular page region:
1. Project all boxes onto the y-axis. Find the widest horizontal whitespace gap — a y-interval containing no box. This becomes the horizontal cut point, splitting the region into a top half and a bottom half.
2. Within each half, project onto the x-axis. Find the widest vertical whitespace gap. This becomes the vertical cut, splitting into left and right sub-regions.
3. Recurse on each sub-region until no further cuts are possible (the region contains a single column of text or a single text block).
4. The reading order is a depth-first left-to-right, top-to-bottom traversal of the resulting binary tree: for a horizontal cut, top before bottom; for a vertical cut, left before right.
The algorithm is elegant and handles the common cases — two-column academic papers, three-column newsletters — reliably. Its failure modes are:
- **Ambiguous cuts**: when a horizontal gap and a vertical gap have nearly equal widths, the cut order is uncertain. Heuristic: prefer the horizontal cut when gap sizes are within 20% of each other, since reading order is more frequently top-to-bottom than left-to-right.
- **Non-rectangular regions**: a figure that bleeds into the text column creates a non-rectangular text region that a rectangular cut cannot correctly isolate. Pre-detect figures by their bounding boxes and remove them from the text box set before applying XY-cut.
- **Close column gaps**: when the inter-column gap is narrow (common in three-column tabloid layouts), small descenders or accented capitals may bridge the gap, causing the algorithm to fail to find a clean cut. Apply a minimum gap threshold and fall back to Docstrum if no valid vertical cut is found.
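The recursion can be sketched as below. This is a variant under stated assumptions rather than the exact procedure above: instead of strictly alternating axes it picks, at each level, whichever axis has the wider gap, preferring the horizontal cut unless the vertical gap is more than 20% wider. `Rect`, `widest_gap`, `xy_cut`, and the `min_gap` parameter are hypothetical names, and figures are assumed to have been removed already.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect { pub x0: f32, pub y0: f32, pub x1: f32, pub y1: f32 } // y grows upward

/// Widest empty interval between projected spans: (gap width, cut position).
fn widest_gap(mut spans: Vec<(f32, f32)>) -> Option<(f32, f32)> {
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f32, f32)> = None;
    let mut reach = spans.first()?.1;
    for &(lo, hi) in &spans[1..] {
        if lo > reach && best.map_or(true, |(w, _)| lo - reach > w) {
            best = Some((lo - reach, (lo + reach) / 2.0));
        }
        reach = reach.max(hi);
    }
    best
}

/// Append the indices in `idx` to `out` in reading order.
pub fn xy_cut(boxes: &[Rect], idx: Vec<usize>, min_gap: f32, out: &mut Vec<usize>) {
    let h = widest_gap(idx.iter().map(|&i| (boxes[i].y0, boxes[i].y1)).collect())
        .filter(|g| g.0 >= min_gap);
    let v = widest_gap(idx.iter().map(|&i| (boxes[i].x0, boxes[i].x1)).collect())
        .filter(|g| g.0 >= min_gap);
    let horizontal = match (h, v) {
        // Vertical wins only when its gap is more than 20% wider.
        (Some(h), Some(v)) => v.0 <= h.0 * 1.2,
        (Some(_), None) => true,
        (None, Some(_)) => false,
        (None, None) => {
            // Leaf: no valid cut, emit top to bottom, then left to right.
            let mut leaf = idx;
            leaf.sort_by(|&a, &b| {
                boxes[b].y1.total_cmp(&boxes[a].y1).then(boxes[a].x0.total_cmp(&boxes[b].x0))
            });
            out.extend(leaf);
            return;
        }
    };
    let cut = if horizontal { h.unwrap().1 } else { v.unwrap().1 };
    // Horizontal cut: boxes above the cut come first. Vertical cut: boxes left of it.
    let (first, second): (Vec<usize>, Vec<usize>) = idx
        .into_iter()
        .partition(|&i| if horizontal { boxes[i].y0 > cut } else { boxes[i].x1 < cut });
    xy_cut(boxes, first, min_gap, out);
    xy_cut(boxes, second, min_gap, out);
}
```

Setting `min_gap` above the normal inter-line spacing keeps the recursion from slicing a column into individual lines before the column gap has been found.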
---
## Docstrum Algorithm
Docstrum reconstructs reading order from nearest-neighbor relationships rather than whitespace gaps, making it more robust for skewed pages, curved text, and layouts with narrow inter-column margins.
For each text component (a glyph or short span), compute the k nearest neighbors by Euclidean centroid distance, typically k = 5. Classify each neighbor pair by the angle of the connecting vector:
- **Within-line pair**: the connecting vector is near-horizontal (angle within ±45° of 0°/180°) and the distance is less than `2 × char_width`. These pairs become edges in a within-line graph.
- **Between-line pair**: the vector is near-vertical (angle within ±45° of 90°/270°) and the distance is less than `2 × line_height`. These become between-line edges.
Connected components of within-line edges form text lines. Connected components of between-line edges, applied to those lines, form text blocks (paragraphs and columns).
The dominant within-line angle across all pairs gives the page skew; the dominant between-line distance gives the line spacing. Both are valuable for quality validation.
Docstrum's weakness is computational: O(n²) neighbor computation for n components, though a k-d tree reduces this to O(n log n) in practice. It also struggles when text density is very low (wide inter-word gaps that exceed the within-line distance threshold), which can fragment lines incorrectly.
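A minimal sketch of the neighbor search and pair classification, assuming components are reduced to centroid points. It uses the brute-force O(n²) search rather than a k-d tree, and the names `k_nearest`, `classify_pair`, and `PairKind` are illustrative.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PairKind { WithinLine, BetweenLine, Unrelated }

fn dist2(a: (f32, f32), b: (f32, f32)) -> f32 {
    (a.0 - b.0).powi(2) + (a.1 - b.1).powi(2)
}

/// Indices of the `k` nearest centroids to `points[i]` (brute force).
pub fn k_nearest(points: &[(f32, f32)], i: usize, k: usize) -> Vec<usize> {
    let mut others: Vec<usize> = (0..points.len()).filter(|&j| j != i).collect();
    others.sort_by(|&a, &b| dist2(points[i], points[a]).total_cmp(&dist2(points[i], points[b])));
    others.truncate(k);
    others
}

/// Classify a neighbor pair by the angle and length of the connecting vector.
pub fn classify_pair(a: (f32, f32), b: (f32, f32), char_width: f32, line_height: f32) -> PairKind {
    let (dx, dy) = (b.0 - a.0, b.1 - a.1);
    let dist = dx.hypot(dy);
    // Fold the direction into [0°, 90°]: 0° is horizontal, 90° is vertical.
    let angle = dy.abs().atan2(dx.abs()).to_degrees();
    if angle < 45.0 && dist < 2.0 * char_width {
        PairKind::WithinLine
    } else if angle > 45.0 && dist < 2.0 * line_height {
        PairKind::BetweenLine
    } else {
        PairKind::Unrelated
    }
}
```

The sketch assumes an unskewed page. On a skewed scan the angle would first be measured relative to the dominant within-line angle.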
---
## Smearing and Connected-Component Approaches
Projection-based smearing converts the 2D layout problem into 1D histogram analysis. Rasterize all text bounding boxes onto a 1D horizontal projection: for each y-row, count the number of covered pixels. Smooth with a Gaussian kernel (σ ≈ line_height / 4). Peaks correspond to text rows; valleys correspond to inter-line gaps. Apply a threshold to produce a binary row mask.
Similarly, compute the vertical projection profile: for each x-column, count the occupied pixels. Peaks are text columns; valleys are column gaps or margins.
The RLSA (Run-Length Smoothing Algorithm) variant works in binary image space: apply a horizontal smearing operator that closes gaps shorter than a threshold C_h (typically 3× the average character width), then a vertical smearing operator with threshold C_v (typically 3× the line height). The resulting connected components are text blocks. RLSA is fast and works well for typewritten or OCR-processed documents.
Smearing approaches fail when column gaps are narrower than the smoothing kernel or when text blocks have irregular densities (justified text with variable inter-word spacing creates misleading projection valleys).
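The horizontal RLSA pass can be sketched on a single row of a binary image. This is an illustration only: `rlsa_row` is a hypothetical name, `true` stands for a foreground pixel, and gaps are closed only when they are strictly shorter than `c_h` and bounded by foreground on both sides.

```rust
/// Horizontal RLSA on one row: fill background runs shorter than `c_h`
/// pixels that lie between two foreground pixels.
pub fn rlsa_row(row: &[bool], c_h: usize) -> Vec<bool> {
    let mut out = row.to_vec();
    let mut last_fg: Option<usize> = None;
    for (x, &px) in row.iter().enumerate() {
        if px {
            if let Some(prev) = last_fg {
                let gap = x - prev - 1;
                if gap > 0 && gap < c_h {
                    for cell in &mut out[prev + 1..x] {
                        *cell = true;
                    }
                }
            }
            last_fg = Some(x);
        }
    }
    out
}
```

The vertical pass is the same operation applied to each column with `C_v`. The two smeared images are then combined before connected-component labelling.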
---
## Mixed-Layout Pages
A mixed-layout page contains horizontal bands of different column structures: a full-width title block, a two-column body, a full-width footer with page number. Correct reading order requires detecting these transitions.
Detection: scan the line set from top to bottom. For each horizontal band of lines (grouped by proximity in y), compute the x-spread. A band whose x-spread exceeds 85% of the page width is a full-width zone. A band whose lines cluster into distinct x-groups is a multi-column zone.
Column-count transitions (from full-width to two-column and back) define zone boundaries. The correct reading order is:
1. Full-width top zone (title, authors, abstract label) — top to bottom.
2. Multi-column body — column by column, left to right, reading each column fully before advancing to the next.
3. Full-width bottom zone (acknowledgements, references header if full-width) — top to bottom.
Figures that interrupt column flow (a figure spanning both columns mid-body) are detected by their bounding boxes crossing the column separator. They are extracted as `Figure` zones at their y-position in the document and inserted into the reading order at the point where the figure y-position occurs within the column being read.
---
## Sidebar and Inset Handling
A sidebar is a narrow text region adjacent to the main body that is not part of the primary reading flow. Detection criteria: bounding box width less than 40% of the page text width; x-position abutting the page margin; and either a visually distinct font family/size or a surrounding rule line (a `re` + `S` sequence in the content stream at the sidebar boundary coordinates).
Insets are text boxes whose bounding boxes overlap with body text — common in magazine layouts and promotional callouts. Detect by checking whether any text block's bounding box intersects the body text zone with an overlap ratio exceeding 10%.
Policy for both: extract sidebar and inset content after the main body text of the page. Tag output spans with `zone: "sidebar"` or `zone: "inset"` so downstream consumers can suppress or separately process them. Do not attempt to interleave sidebar content with body text at the word level — the reading orders are independent.
---
## Footnote Ordering
Footnotes occupy a horizontal band at the bottom of the page, below a separator rule (typically a short horizontal line element), in a font size smaller than the body (usually 0.7–0.8× body size). Detection: find horizontal rule elements in the lower 25% of the page text area; text blocks below the topmost such rule with font size below 0.85× modal body font size constitute the footnote zone.
Correct ordering: footnotes are emitted after all body text on that page. For multi-column pages, footnotes may span the full column width or be column-specific (column-specific footnotes appear in the same x-band as their host column). Order column-specific footnotes within their column's output; order full-width footnotes after all columns.
Footnote reference marks in the body text (superscript numerals or symbols) can be matched to the corresponding footnote by their textual label. Expose a `footnote_refs` map in page metadata linking body-text span positions to footnote block IDs for consumers that wish to inline them.
---
## Confidence Scoring and Fallback
Reading order reconstruction can fail silently — the output text is syntactically plausible but semantically wrong. Detecting this requires a language-model signal:
- **Character n-gram perplexity**: score the reconstructed text sequence against a character 4-gram model trained on natural language (English default; fall back to script-detected language model). Threshold: if perplexity exceeds 3× the baseline for clean prose, flag the reading order as suspect.
- **Word boundary coherence**: count the fraction of word boundaries that fall at natural break points (space, punctuation) versus mid-word. A high mid-word break rate indicates incorrect line concatenation or wrong reading order.
When confidence falls below threshold, apply the alternate algorithm: if XY-cut was primary, retry with Docstrum; if Docstrum was primary, retry with XY-cut. Accept whichever produces lower perplexity.
Expose in output metadata:
```rust
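#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadingOrderAlgorithm { XyCut, Docstrum, Smearing, NaturalOrder }

#[derive(Debug, Clone)]
pub struct ReadingOrderMetadata {
    pub algorithm: ReadingOrderAlgorithm, // which algorithm produced the order
    pub confidence: f32,                  // 0.0–1.0
    pub fallback_used: bool,              // true if the alternate algorithm was used
}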
```
Provide a `natural_order` fallback mode that sorts text lines strictly by `(y_descending, x_ascending)`: deterministic, fast, correct for single-column documents, and predictably wrong for multi-column ones. Callers who prefer reproducible output, even if it may be incorrect, can opt into this mode explicitly.
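A minimal sketch of the `natural_order` sort, assuming each line is reduced to an anchor point in PDF coordinates (y grows upward). The `LinePos` type is a hypothetical stand-in for whatever line structure the extractor uses.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct LinePos {
    pub x: f32, // left edge of the line
    pub y: f32, // baseline y-coordinate
    pub text: String,
}

/// Sort lines top to bottom, then left to right. `sort_by` is stable and
/// `total_cmp` gives a total order on f32, so the result is deterministic.
pub fn natural_order(lines: &mut [LinePos]) {
    lines.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));
}
```

On a two-column page this interleaves the columns line by line, which is exactly the predictable failure described above.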


@ -0,0 +1,143 @@
# Historical and Degraded Document Extraction
## Overview
Scanned historical documents, microfilm reproductions, low-quality photocopies, and physically degraded originals sit at the difficult end of the OCR spectrum. Each degradation type triggers a different failure mode in the extraction pipeline. Treating them all with a single generic filter produces consistently poor results. This document defines the degradation taxonomy, the algorithms to address each category, and the confidence policy that prevents garbage text from propagating silently into structured output.
---
## 1. Degradation Categories
**Salt-and-pepper noise** — random isolated black or white pixels scattered across the image. Origin: CCD sensor noise during scanning, dirty scanner glass, or film grain on microfilm. These pixels disrupt connected-component analysis and produce spurious characters.
**Background bleed-through** — text printed on the reverse side of thin paper (newspaper stock, onion-skin) transmits light and appears as a faint, laterally-mirrored ghost image. The secondary ink signal overlaps character frequency bands, making simple threshold separation unreliable.
**Uneven illumination** — gradient luminance across the scan: darker corners from a flatbed lid that does not press fully, a bright hotspot at image center from an overhead copy stand, or a gradient from left to right caused by angled ambient light. Otsu-style global thresholding collapses under this condition.
**Physical distortion** — page curl at the binding margin, keystoning from a camera held off-axis, rotational skew up to several degrees, and binding shadow (a darkening gradient toward the spine). Each produces geometric errors that break word and line segmentation.
**Ink spread or fading** — over-inked originals produce strokes that bleed together and merge adjacent characters; under-inked or aged originals produce strokes that are too light or discontinuous. Both extremes harm connected-component character recognition.
**Staining and foxing** — brown ferrous oxidation spots (foxing), water tide-marks, and adhesive residue produce high-contrast blobs in the same intensity range as ink. A naive binarizer classifies them as characters.
**Resolution too low** — below approximately 150 DPI, a lowercase `e` is fewer than 10 pixels tall. Individual stroke features are not resolved; the pixel grid is the limiting factor, not the algorithm.
**Mixed degradation** — a single page may exhibit three or four of the above simultaneously. A 19th-century newspaper scan can have bleed-through, salt-and-pepper noise, and a binding shadow on the same column.
---
## 2. Noise Reduction
Gaussian blur attenuates high-frequency noise but smears edge information, degrading thin character strokes. The **median filter** is the standard choice for salt-and-pepper noise: for each pixel, replace its value with the median of an N×N neighborhood. A 3×3 kernel removes isolated single-pixel noise; 5×5 handles heavier speckle while still preserving stroke edges because the median operation is nonlinear and resists the influence of outlier pixels.
For images with noise density above roughly 20% of pixels, the standard median filter degrades because the median itself may be drawn from noise pixels. The **adaptive median filter** (AMF) solves this by dynamically expanding the kernel size until the local median falls in a plausible range, capping at a maximum window (typically 7×7 or 9×9) before accepting the result.
After median filtering, **morphological opening** (erosion followed by dilation with a small structuring element, typically a 2×2 or 3×3 square) removes any remaining isolated foreground blobs smaller than the structuring element. Because erosion removes thin protrusions and isolated pixels, and the subsequent dilation restores objects that survived erosion, object-sized structures survive while noise pixels do not.
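A 3×3 median filter is small enough to sketch directly. The version below assumes a row-major 8-bit grayscale buffer and copies border pixels unchanged, which is only one of several reasonable border policies. `median3x3` is an illustrative name.

```rust
/// 3×3 median filter over a row-major 8-bit grayscale image.
/// Border pixels are copied through unchanged.
pub fn median3x3(img: &[u8], width: usize, height: usize) -> Vec<u8> {
    assert_eq!(img.len(), width * height);
    let mut out = img.to_vec();
    if width < 3 || height < 3 {
        return out;
    }
    for y in 1..height - 1 {
        for x in 1..width - 1 {
            // Gather the 9 neighbours, sort them, and take the middle value.
            let mut window = [0u8; 9];
            let mut n = 0;
            for wy in y - 1..=y + 1 {
                for wx in x - 1..=x + 1 {
                    window[n] = img[wy * width + wx];
                    n += 1;
                }
            }
            window.sort_unstable();
            out[y * width + x] = window[4];
        }
    }
    out
}
```

A single dark pixel surrounded by paper is replaced by the paper value, while a stroke two or more pixels wide keeps a majority inside the window and survives.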
Recommended sequence for a noisy scan:
```
grayscale → median filter (5×5) → Sauvola binarization → morphological opening (3×3)
```
---
## 3. Bleed-Through Removal
Bleed-through is detectable by computing the **normalized cross-correlation** of the grayscale recto image with the horizontally mirrored scan of its verso, when one is available. High correlation (empirically above 0.15–0.25 depending on paper thickness) indicates bleed-through is present.
Removal relies on the density difference: the primary text is darker than the bleed signal. A locally-adaptive binarization threshold computed on the primary text's ink-density distribution should be tuned to exclude the lighter bleed-through layer. In practice, Sauvola thresholding with `k` pushed toward 0.4–0.5 (higher than the default 0.2) lowers the local threshold, so only the darker primary ink falls below it and the lighter bleed pixels are rejected.
For severe bleed-through, the **Wiener filter** simultaneously denoises and deblurs in the frequency domain. Given an estimate of the noise power spectrum (from a blank region of the scan) and an assumed point-spread function for bleed (a Gaussian with σ ≈ 1.5–2.0 px representing paper diffusion), the Wiener filter minimizes the mean-squared error between the restored signal and the true primary text image. This is computationally heavier but appropriate when the bleed is dense enough that Sauvola alone misclassifies it.
---
## 4. Uneven Illumination Correction
The standard approach is **background estimation by large-kernel Gaussian blur**: apply a Gaussian with radius 50–100 pixels to the grayscale image. At that radius, all text is blurred away; what remains is an estimate of the smoothly-varying background luminance field. Divide each pixel by its corresponding background estimate, then rescale to [0, 255]. This is the core of homomorphic filtering adapted for reflective (not transmissive) illumination.
An alternative for scans with abrupt luminance changes (such as a shadow edge from a warped page): sample background intensity at a grid of points identified as non-text by their local standard deviation (low σ indicates no texture), fit a polynomial surface (degree 2 or 3) through those sample points using least-squares, and use the polynomial surface as the background estimate.
Both methods must run before binarization. Applying Sauvola to the illumination-corrected image is markedly more reliable than applying Sauvola directly to the raw scan, even though Sauvola is itself local — Sauvola's window cannot span the scale of a full-page gradient.
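The divide-and-rescale step can be sketched as below, assuming the background estimate has already been computed (by the large-kernel blur or the polynomial fit) and is passed in as a buffer of the same size. `flatten_illumination` is a hypothetical name, and clamping the ratio at 1.0 is one possible rescaling choice.

```rust
/// Divide each pixel by its background estimate and rescale to [0, 255].
/// Pixels at or above the background level map to white (255).
pub fn flatten_illumination(img: &[u8], background: &[u8]) -> Vec<u8> {
    assert_eq!(img.len(), background.len());
    img.iter()
        .zip(background)
        .map(|(&p, &b)| {
            // Guard against division by zero in fully black background cells.
            let ratio = f32::from(p) / f32::from(b.max(1));
            (ratio.min(1.0) * 255.0).round() as u8
        })
        .collect()
}
```

Ink that is half as bright as the surrounding paper ends up at the same output value whether it sits in a shadowed corner or under a hotspot, which is what lets a later threshold behave uniformly across the page.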
---
## 5. Geometric Correction
**Deskew** removes rotational skew. Two reliable approaches:
- *Hough transform*: detect line segments in the binary image, cluster their angles, take the dominant angle as the skew, and rotate the image by its negation.
- *Projection profile maximization*: rotate the binarized image in 0.1° steps over ±5°, compute the horizontal projection (row-wise pixel sum), and take the angle that maximizes the variance of that projection. At the correct angle, text lines produce sharp peaks; at other angles, the distribution flattens.
**Page curl** causes text baselines to follow a curve rather than a line. Detect curved baselines by fitting a polynomial (degree 2 or 3) through the centroid positions of connected components in each text line. Warp the image using a mesh warp (bicubic interpolation on a control-point grid) to map the curved baselines to horizontal lines.
**Perspective correction** applies to camera captures. Detect the four corners of the document (Hough lines on the document boundary, or corner-specific feature detectors), compute the projective transform that maps those four corners to a rectangle, and apply the transform with bilinear or bicubic resampling.
**Binding shadow** manifests as a darkening gradient toward the spine. After illumination correction (Section 4), this gradient is largely removed. If residual darkening remains, detect the gradient direction from the background luminance field estimate and apply a compensating brightness ramp along that axis.
---
## 6. Adaptive Binarization for Degraded Images
Global Otsu thresholding computes a single intensity threshold for the entire image. It fails catastrophically under uneven illumination because the optimal threshold for a dark region differs from the optimal threshold for a bright region.
**Sauvola thresholding** computes a local threshold for each pixel:
```
T(x,y) = μ(x,y) · [1 - k · (1 - σ(x,y) / R)]
```
where `μ` and `σ` are the local mean and standard deviation in a window of size W×W, `R` is the dynamic range of the standard deviation (typically 128 for 8-bit images), and `k ∈ [0.2, 0.5]` is a sensitivity parameter. Lower `k` accepts more pixels as foreground; higher `k` rejects lighter pixels.
Window size W should be approximately 2–3× the height of a typical character stroke in pixels. At 300 DPI, a standard printed character stroke is 3–5 px wide, so W = 25–51 is appropriate. At 150 DPI, W = 15–25.
**Wolf-Jolion modification** extends Sauvola to handle documents where ink is very light across the entire page (e.g., faded typewriter output). It normalizes the standard deviation term to the maximum standard deviation observed in the image, preventing the threshold from collapsing when global contrast is low.
**Niblack thresholding** is the predecessor to Sauvola: `T = μ + k·σ`. It tends to introduce more noise in background regions and is generally superseded by Sauvola, but may be useful as a reference baseline.
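The Sauvola formula translates directly into code. The sketch below computes the local statistics for one window and evaluates the threshold. `window_stats`, `sauvola_threshold`, and `is_foreground` are illustrative names, the window is assumed to be non-empty, and a practical implementation would use integral images rather than recomputing the statistics per pixel.

```rust
/// Mean and (population) standard deviation of a non-empty window.
pub fn window_stats(window: &[u8]) -> (f64, f64) {
    let n = window.len() as f64;
    let mean = window.iter().map(|&p| f64::from(p)).sum::<f64>() / n;
    let var = window.iter().map(|&p| (f64::from(p) - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// T = mean * (1 - k * (1 - std_dev / r)), with r = 128 for 8-bit images.
pub fn sauvola_threshold(mean: f64, std_dev: f64, k: f64, r: f64) -> f64 {
    mean * (1.0 - k * (1.0 - std_dev / r))
}

/// A pixel is foreground (ink) when it is darker than the local threshold.
pub fn is_foreground(pixel: u8, mean: f64, std_dev: f64, k: f64) -> bool {
    f64::from(pixel) < sauvola_threshold(mean, std_dev, k, 128.0)
}
```

For a window with mean 200 and standard deviation 32, `k = 0.2` gives a threshold of about 170 and `k = 0.5` gives 125, so a mid-gray pixel of value 150 counts as ink under the lower `k` and is rejected under the higher one.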
---
## 7. Stroke Reconstruction for Faded Ink
Faded ink may produce pixel values in the range 180–220 (light gray on a 0–255 scale with 255 = white), too close to the paper background for Sauvola to classify as foreground. Pre-processing with **CLAHE** (contrast-limited adaptive histogram equalization) redistributes the local intensity histogram, amplifying low-contrast regions while clipping the redistribution to avoid over-amplifying noise. Apply CLAHE with a tile size of 8×8 or 16×16 and a clip limit of 2.0–4.0 before binarization.
For strokes that are binarized but broken (gap pixels within a stroke due to uneven fading), **morphological closing** (dilation followed by erosion) reconnects gaps up to the size of the structuring element. A horizontal structuring element (1×3 or 1×5) closes horizontal stroke gaps without merging characters vertically.
For severe cases, **skeleton-based reconstruction** extracts the stroke skeleton (Zhang-Suen or Guo-Hall thinning), which reduces each stroke to a 1-px-wide centerline even if the original stroke was intermittent. The skeleton is then dilated to a standard stroke width, producing a normalized binary image suitable for OCR even if the original was patchy.
---
## 8. Low-Resolution Handling
At 150 DPI, a typical lowercase character is 15–20 px tall. At 100 DPI, it is 10–13 px. Tesseract's documented minimum for reliable recognition is 300 DPI; it ships with a `--dpi` flag that accepts an override, but the underlying character models are trained at 300 DPI and degrade sharply below 150 DPI.
**Bicubic upsampling** to 300 DPI before OCR is the minimum intervention — it does not recover lost detail but gives the recognizer familiar feature dimensions. For moderate quality gain, **ESRGAN-class super-resolution models** (Real-ESRGAN or a document-specific fine-tune) trained on document imagery can synthesize plausible high-frequency detail. These models are not appropriate for legal or archival use where fabrication of detail is unacceptable, but for readability-oriented extraction they can recover legible characters from 150 DPI inputs.
When the computed DPI is below 100 and the image shows no recoverable character features (assessed by measuring the variance of the horizontal projection profile — very low variance indicates characters are not distinguishable), the pipeline should emit a `low_quality_page` warning and still return the best-effort text, rather than silently inserting high-confidence garbled output.
---
## 9. Script and Typeface Detection for Historical Documents
Historical documents may be typeset in scripts and conventions no longer current:
- **Blackletter** (Fraktur, Schwabacher, Textura): dominant in German-language printing until the early 1940s. Recognizable by the high angle of oblique strokes (typically 40–60° from horizontal, compared to 10–20° for Roman). A histogram of local gradient orientations in the binary image distinguishes blackletter from Roman with high reliability. Tesseract provides a `script/Fraktur` language pack trained on 19th-century German texts; recognition quality is significantly below Latin for degraded inputs and improves with pre-processing.
- **Long s** (`ſ`, U+017F): used in early modern printing for non-final `s`. OCR models trained on modern text misclassify it as `f`. Post-processing rules can correct `ſ→f` substitutions in known-context positions (not at word-final positions, not before another `s`).
- **Typewriter fonts**: monospaced, lighter ink density than letterpress, often on thin paper with higher bleed-through risk. The uniform character width is an asset for segmentation but the lighter ink requires lower Sauvola `k`.
- **Ligatures**: fi, fl, ffi, ffl, ct, st, and the long-s ligatures ſi, ſl are common in 18th–19th-century setting. These are single glyphs occupying the width of two characters; models that segment character-by-character before recognition will fail on them. Tesseract's LSTM engine handles ligatures at the word level and is preferred over the legacy mode for historical documents.
---
## 10. Confidence-Gated Fallback
Tesseract's C++ API exposes `ResultIterator::Confidence()`, which returns a confidence in [0, 100] for the requested iterator level; at word level (`RIL_WORD`) this is a per-word confidence. Aggregate to the **block level** by taking the mean confidence across all words in a block, and to the **page level** by taking the mean across all blocks (weighted by block word count).
Output policy:
- **Page-level confidence ≥ 60**: emit text normally.
- **40 ≤ page-level confidence < 60**: emit text with a `degraded_quality` annotation in the extraction metadata. The text is usable but should be treated as approximate.
- **Page-level confidence < 40**: emit a `low_quality_page` warning in the structured output. Include the best-effort text — do not discard it — but mark it explicitly so that downstream consumers (e.g., LLM pipelines) can weight it appropriately or skip it.
Never silently emit garbled text without confidence metadata. A word recognition confidence below 20 should be individually flagged; the extraction output format should support per-word confidence annotation, not just per-page. This allows downstream consumers to apply their own threshold rather than receiving binary pass/fail decisions from the extractor.
The confidence gating applies after all pre-processing. Running the full degradation-correction pipeline before measuring confidence ensures that the confidence score reflects true unrecoverability rather than a correctable image quality issue.
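The aggregation and gating policy can be sketched as below, assuming the per-word confidences have already been read from the OCR engine and grouped by block. The type and function names (`PageQuality`, `page_confidence`, `classify`, `flagged_words`) are illustrative, not part of any existing API.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PageQuality { Normal, DegradedQuality, LowQualityPage }

/// Page-level confidence. Each block is a list of per-word confidences in
/// [0, 100]. The mean of block means weighted by word count equals the mean
/// over all words, so the words are averaged directly.
pub fn page_confidence(blocks: &[Vec<f32>]) -> Option<f32> {
    let words: usize = blocks.iter().map(Vec::len).sum();
    if words == 0 {
        return None; // no recognized words, nothing to score
    }
    let total: f32 = blocks.iter().flatten().sum();
    Some(total / words as f32)
}

/// Apply the 60 / 40 thresholds from the output policy.
pub fn classify(page_conf: f32) -> PageQuality {
    if page_conf >= 60.0 {
        PageQuality::Normal
    } else if page_conf >= 40.0 {
        PageQuality::DegradedQuality
    } else {
        PageQuality::LowQualityPage
    }
}

/// Indices of words whose confidence is below 20 and should be flagged.
pub fn flagged_words(block: &[f32]) -> Vec<usize> {
    block.iter().enumerate().filter(|(_, &c)| c < 20.0).map(|(i, _)| i).collect()
}
```

Returning `None` for a page with no words keeps an empty page from being reported as either high or low quality.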


@ -0,0 +1,168 @@
# Type 3 Font Extraction
Type 3 fonts are the most specification-compliant yet practically difficult font type in the PDF format. Unlike Type 1, TrueType, or CFF fonts — which encode glyph outlines in standardized binary formats — Type 3 fonts define each glyph as an arbitrary PDF content stream. This makes them maximally flexible but maximally opaque to text extraction. A Rust implementation must treat Type 3 handling as its own sub-pipeline.
## 1. Type 3 Font Dictionary Structure
A Type 3 font dictionary (PDF spec §9.6.5) contains the following mandatory and commonly present entries:
- **`FontBBox`**: A rectangle (in glyph space) that encompasses all glyphs in the font. Used for rasterization clipping.
- **`FontMatrix`**: A six-element transformation matrix mapping glyph space to text space. For Type 3, this is typically `[0.001 0 0 0.001 0 0]` (same as Type 1) but is frequently used for scaling in TeX-generated fonts (e.g., `[1 0 0 1 0 0]` when the glyph streams work directly in text units).
- **`CharProcs`**: A dictionary whose keys are glyph names (e.g., `/A`, `/uni0041`, `/cmr10-a`) and whose values are indirect references to content stream objects. Each stream is a self-contained glyph program.
- **`Encoding`**: Either a predefined encoding name or an Encoding dictionary with a `Differences` array. Maps 1-byte character codes (0–255) to glyph names. This is the first hop in code resolution.
- **`FirstChar`** / **`LastChar`**: Integer bounds of the character code range covered by the `Widths` array.
- **`Widths`**: Array of advance widths in glyph space units for character codes `FirstChar` through `LastChar`. A code outside this range or with a width of zero is not encoded.
- **`Resources`**: A resource dictionary shared by all CharProcs streams in the font. Can contain sub-fonts, XObjects, color spaces, and graphics state parameters.
**Character code resolution chain:**
```
character code (u8)
→ Encoding dictionary → glyph name (e.g., "/hyphen")
→ CharProcs dictionary → content stream (indirect ref)
```
Missing any link in this chain means the character is not renderable via the font's own mechanism. Record which link broke for downstream fallback routing.
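A minimal sketch of the two-hop lookup, with an error type that records which link broke. It assumes the `Encoding` and `CharProcs` dictionaries have already been parsed into maps; `Type3Font`, `ResolveError`, and the use of a bare object number for the stream reference are illustrative simplifications.

```rust
use std::collections::HashMap;

/// Which link of the resolution chain failed.
#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    NoEncodingEntry,
    NoCharProc { glyph_name: String },
}

pub struct Type3Font {
    pub encoding: HashMap<u8, String>,    // character code -> glyph name
    pub char_procs: HashMap<String, u32>, // glyph name -> stream object number
}

impl Type3Font {
    /// Resolve a character code to (glyph name, CharProcs stream object number).
    pub fn resolve(&self, code: u8) -> Result<(&str, u32), ResolveError> {
        let name = self.encoding.get(&code).ok_or(ResolveError::NoEncodingEntry)?;
        let obj = self
            .char_procs
            .get(name)
            .ok_or_else(|| ResolveError::NoCharProc { glyph_name: name.clone() })?;
        Ok((name.as_str(), *obj))
    }
}
```

Keeping the glyph name inside `NoCharProc` matters for fallback routing: a code with a known name but no glyph program can still go through name-based Unicode recovery, while a code with no encoding entry cannot.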
## 2. What Type 3 Glyph Streams Contain
Each CharProcs value is a content stream parsed identically to a page content stream, but with two additional operators:
- **`wx wy d0`**: Declares the advance width `(wx, wy)` in glyph space (operands precede the operator, as in every PDF content stream). No bounding box is declared; caching is disabled. The glyph appearance may be empty (whitespace glyph) or rendered without cache.
- **`wx wy llx lly urx ury d1`**: Declares advance width and glyph bounding box. The viewer may cache the rendered result. This is the standard form for non-whitespace glyphs.
`d0` or `d1` must be the first operator in every CharProcs stream. After it, the stream may contain:
- **Path construction and painting**: `m`, `l`, `c`, `h`, `f`, `S`, `B`, etc. for vector glyph shapes. Most Type 3 fonts used for math symbols or decorative purposes are vector-only.
- **Image XObjects**: `Do` referencing an image XObject in the font's `Resources`. Common in scanned-font Type 3 fonts or bitmap glyph sets.
- **Text operators**: `BT`/`ET` blocks with `Tf`/`Tj`/`TJ` — a CharProcs stream can itself paint text using another font, including another Type 3 font. This is the nested Type 3 scenario.
- **Graphics state changes**: `q`/`Q`, `cm`, `w`, `J`, color operators. These affect only the glyph's internal coordinate system and should not escape it.
**The core text-extraction problem**: the content stream encodes appearance, not identity. There is no intrinsic Unicode codepoint stored in the stream. Identity must be recovered through external mappings.
## 3. Unicode Recovery: Priority Chain
Implement Unicode recovery in this strict priority order:
### (a) ToUnicode CMap
If the Type 3 font dictionary includes a `ToUnicode` entry referencing a CMap stream, parse it exactly as for any other font type. This is authoritative and should short-circuit all other recovery paths. It is rare in hand-crafted Type 3 fonts but appears in PDF generators that auto-embed it.
### (b) Glyph Name via Adobe Glyph List (AGL)
The glyph name from `CharProcs` is the primary recovery path in practice. Apply the AGL algorithm (Adobe Glyph List for New Fonts, specification version 1.7):
1. If the name is in the AGL table directly, map it.
2. If the name starts with `uni`, parse the hex suffix as one or more UTF-16BE codepoints.
3. If the name starts with `u` followed by 4–6 hex digits, parse as a single codepoint.
4. If the name contains a period (e.g., `A.sc`, `hyphen.alt`), use only the base component before the period for lookup.
5. Otherwise, the name is unrecognized — proceed to the next fallback.
Store the AGL as a static sorted array of `(&'static str, u32)` pairs and binary-search by name at runtime.
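The lookup rules above might be sketched as follows. This is a hedged illustration: `AGL_EXCERPT` is a four-entry stand-in for the real table, `glyph_name_to_unicode` is a hypothetical name, and the sketch strips the period suffix up front so the other three rules only ever see the base name. It also assumes hex digits must be uppercase.

```rust
/// Tiny illustrative excerpt of the AGL, sorted by name for binary search.
static AGL_EXCERPT: &[(&str, u32)] = &[
    ("A", 0x0041),
    ("a", 0x0061),
    ("hyphen", 0x002D),
    ("space", 0x0020),
];

/// True when every byte is 0-9 or A-F (assumption: lowercase hex is rejected).
fn is_upper_hex(s: &str) -> bool {
    s.bytes().all(|b| b.is_ascii_digit() || (b'A'..=b'F').contains(&b))
}

pub fn glyph_name_to_unicode(name: &str) -> Option<String> {
    // Rule 4: keep only the base component before the first period.
    let base = name.split('.').next().unwrap_or(name);
    if base.is_empty() {
        return None; // e.g. ".notdef"
    }
    // Rule 1: direct table lookup.
    if let Ok(i) = AGL_EXCERPT.binary_search_by(|(n, _)| n.cmp(&base)) {
        return char::from_u32(AGL_EXCERPT[i].1).map(String::from);
    }
    // Rule 2: "uni" + groups of four hex digits, read as UTF-16 code units.
    if let Some(hex) = base.strip_prefix("uni") {
        if !hex.is_empty() && hex.len() % 4 == 0 && is_upper_hex(hex) {
            let units: Option<Vec<u16>> = hex
                .as_bytes()
                .chunks(4)
                .map(|c| std::str::from_utf8(c).ok().and_then(|s| u16::from_str_radix(s, 16).ok()))
                .collect();
            return units.and_then(|u| String::from_utf16(&u).ok());
        }
    }
    // Rule 3: "u" + 4 to 6 hex digits, read as one codepoint.
    if let Some(hex) = base.strip_prefix('u') {
        if (4..=6).contains(&hex.len()) && is_upper_hex(hex) {
            return u32::from_str_radix(hex, 16).ok().and_then(char::from_u32).map(String::from);
        }
    }
    // Rule 5: unrecognized; the caller moves on to the next fallback.
    None
}
```

A name such as `cmr10-a` falls through every rule and returns `None`, which is the signal to try the TeX encoding tables.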
### (c) TeX Encoding Heuristics
When the font name matches a TeX Computer Modern pattern (see §4), use the known encoding vector for that font's TeX encoding scheme to resolve glyph names that AGL does not cover. TeX glyph names in Type 3 often do not follow AGL conventions and require a separate lookup table.
### (d) Shape Fingerprinting
Render the CharProcs content stream to a small raster and compare against a precomputed database of Unicode glyph hashes (see §5–6).
### (e) Context-Based Inference
In a sequence of resolved glyphs with one unknown, contextual n-gram analysis over the resolved neighbors can sometimes disambiguate with reasonable confidence. This is a last resort before emitting U+FFFD.
## 4. TeX/dvips Type 3 Fonts
TeX documents compiled via `dvips` or similar tools embed Type 3 fonts for Computer Modern and related math fonts. These fonts follow predictable conventions:
**Font name pattern**: TeX-generated Type 3 font names are typically a 6-character uppercase prefix (a subset checksum, e.g., `ABCDEF`) followed by a plus sign and the Metafont name: `ABCDEF+CMR10`, `GHIJKL+CMMI10`, `MNOPQR+CMSY10`, `STUVWX+CMEX10`.
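**Detection heuristic**: if the font's name (the optional `Name` entry or the `FontName` of its `FontDescriptor`, since Type 3 dictionaries carry no `BaseFont`) matches `^[A-Z]{6}\+CM`, classify as TeX Type 3. Also check for `MSBM` (AMS blackboard bold), `EUFM` (Euler Fraktur), and `WASY` (Wasy symbol set) prefixes.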
**Encoding vectors**: TeX uses non-standard 8-bit encodings. The relevant ones for glyph name resolution:
- **OT1** (original TeX text encoding): remaps standard glyph positions; `\quotedblleft` at 0x22, ligatures at positions standard fonts leave empty.
- **OML** (math italic): slots 0x00–0x7F hold lowercase Greek and math italic Latin.
- **OMS** (math symbol, CMSY): contains operators like `\cdot`, `\times`, `\ast`, `\pm` at known positions.
- **OMX** (math extension, CMEX): large delimiters, integral signs, extensible arrows — stored as multi-part glyph sequences.
Embed these encoding vectors as static lookup tables keyed on `(encoding_name, glyph_position)` → `char`. When the font name identifies a TeX font family, cross-reference the CharProcs glyph names against these tables before falling through to shape matching.
## 5. Glyph Rendering for Shape Matching
When name-based recovery fails, implement a minimal PDF graphics interpreter to rasterize the CharProcs content stream:
1. **Coordinate system**: Apply `FontMatrix` to establish glyph-to-user space. Use `FontBBox` as the clip region.
2. **Operators to support**: path construction (`m l c v y h`), path painting (`f F S s B B* b b* n`), `cm` (CTM update), `q`/`Q` (graphics state stack), `Do` (image XObjects only — do not recurse into form XObjects for shape matching).
3. **Target raster**: 64×64 pixels is sufficient for shape fingerprinting. Use 8-bit grayscale. Rasterize filled paths as white-on-black.
4. **Normalization**:
- Compute the center of mass of foreground pixels and translate so it aligns with the raster center.
- Scale the bounding box of foreground pixels to fill ~80% of the raster extent.
- Apply mild Gaussian blur (σ ≈ 1.0) to suppress sub-pixel sensitivity.
5. **Hash computation**: Compute a difference hash (dHash) over the 64×64 raster: downsample to 9×8 (nine columns, eight rows), compare each pixel with its right-hand neighbor, and pack the resulting 8 × 8 = 64 comparison bits into a 64-bit integer. Store as `u64`.
6. **Matching**: Compare the query hash against all entries in the glyph hash database using Hamming distance. A distance ≤ 8 (out of 64 bits) is a confident match; 9–15 is a weak match worth flagging with reduced confidence; > 15 is a non-match.
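The hashing and matching steps can be sketched as below. The sketch assumes the normalized raster has already been downsampled to a thumbnail of 9 columns by 8 rows, the size that yields exactly 64 left-to-right comparisons. `dhash_9x8`, `MatchQuality`, and `classify_match` are illustrative names only.

```rust
/// dHash over a 9-column x 8-row grayscale thumbnail stored row-major.
/// Each bit records whether a pixel is brighter than its right neighbor.
pub fn dhash_9x8(thumb: &[u8; 72]) -> u64 {
    let mut hash = 0u64;
    for row in 0..8 {
        for col in 0..8 {
            let left = thumb[row * 9 + col];
            let right = thumb[row * 9 + col + 1];
            hash = (hash << 1) | u64::from(left > right);
        }
    }
    hash
}

#[derive(Debug, PartialEq)]
pub enum MatchQuality {
    Confident, // Hamming distance 0..=8
    Weak,      // Hamming distance 9..=15
    NoMatch,   // Hamming distance above 15
}

/// Classify a query hash against one reference hash by Hamming distance.
pub fn classify_match(query: u64, reference: u64) -> MatchQuality {
    match (query ^ reference).count_ones() {
        0..=8 => MatchQuality::Confident,
        9..=15 => MatchQuality::Weak,
        _ => MatchQuality::NoMatch,
    }
}
```

XOR followed by `count_ones` is the whole distance computation, so a linear scan over a few thousand reference hashes stays cheap.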
## 6. Building the Unicode Glyph Hash Database
The hash database must be precomputed offline and bundled with the library as a binary asset.
**Reference fonts**: render glyphs from DejaVu Serif, DejaVu Sans, Liberation Serif, Liberation Sans, GNU FreeFont (FreeSerif, FreeSans, FreeMono). Use multiple point sizes (12pt, 24pt, 48pt) and average or union the hash sets to reduce size-sensitivity.
**Coverage targets**: Basic Latin (U+0020–U+007E), Latin-1 Supplement (U+00A0–U+00FF), Latin Extended-A/B for common accented forms, Greek (U+0370–U+03FF), Cyrillic (U+0400–U+04FF), General Punctuation (U+2000–U+206F), Mathematical Operators (U+2200–U+22FF), Letterlike Symbols (U+2100–U+214F), Arrows (U+2190–U+21FF).
**Collision handling**: Multiple codepoints may hash identically (e.g., `l` vs `I` in some fonts). Store collisions as a small `Vec<u32>` per hash bucket. When a query matches a collision bucket, emit the first codepoint with `confidence: 0.5` and annotate the span with `ambiguous: true`.
**Database format**: a sorted `Vec<(u64, u32)>` (hash, codepoint) serialized with `bincode` or as a flat binary array. At query time, binary-search by hash; if not found exactly, scan neighbors within Hamming distance 8 using a BK-tree or linear scan over the sorted list.
**Stroke width variation**: vector glyphs in Type 3 fonts may be thicker or thinner than reference fonts. Normalize stroke width by morphologically thinning foreground pixels to 1-pixel skeletons before hashing both query and reference glyphs, or generate multiple reference hashes per codepoint at varying simulated stroke widths.
## 7. Nested Type 3 Fonts
A CharProcs stream may invoke another font via `BT /FontName sz Tf ... Tj ... ET`. The nested font is resolved from the Type 3 font's own `Resources` dictionary, not the page's resource dictionary.
**Font stack tracking**: maintain a `Vec<FontRef>` during CharProcs stream execution. When `Tf` is encountered inside a CharProcs stream, push the new font onto the stack. When `ET` closes the text block, pop. Cap depth at 8 to prevent pathological recursion (though the PDF specification does not permit loops, malformed files may contain them).
**Nested encoding resolution**: resolve the nested font's character codes independently through its own encoding and CharProcs chain. Concatenate the resulting Unicode spans from the nested text into the parent glyph's output as if they were a single logical character sequence.
**Width accounting**: the outer glyph's advance width (from `d0`/`d1`) takes precedence over the sum of nested glyph widths for layout purposes.
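A depth-capped font stack along the lines described above could look like this. It is a sketch under assumptions: fonts are identified by a bare `u32` object number rather than a real `FontRef`, and `FontStack`, `DepthExceeded`, and `MAX_NESTING` are hypothetical names.

```rust
/// Nesting cap taken from the text above.
pub const MAX_NESTING: usize = 8;

/// Returned when a push would exceed the cap (likely a malformed, looping file).
#[derive(Debug, PartialEq)]
pub struct DepthExceeded;

#[derive(Default)]
pub struct FontStack {
    fonts: Vec<u32>, // object numbers stand in for FontRef in this sketch
}

impl FontStack {
    /// Called when `Tf` is seen inside a CharProcs stream.
    pub fn push(&mut self, font_obj: u32) -> Result<(), DepthExceeded> {
        if self.fonts.len() >= MAX_NESTING {
            return Err(DepthExceeded);
        }
        self.fonts.push(font_obj);
        Ok(())
    }

    /// Called when `ET` closes the text block.
    pub fn pop(&mut self) -> Option<u32> {
        self.fonts.pop()
    }

    /// The font whose encoding resolves the next character code.
    pub fn current(&self) -> Option<u32> {
        self.fonts.last().copied()
    }
}
```

On `DepthExceeded` the interpreter can abandon the glyph and route it to shape matching instead of recursing further.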
## 8. Width-Only Glyphs (d0)
Glyphs declared with `d0` provide an advance width but no bounding box. Their appearance is never cached and may be blank (used for whitespace) or may produce visible ink that is still useful for shape matching.
Even when rendering fails entirely, the advance width is available. Use it for:
- **Whitespace detection**: if `wx` matches a known word-space width for the current font size, emit U+0020.
- **Width-profile matching**: build a width vector for a sequence of unknown glyphs and compare against frequency distributions of English letter widths. This is probabilistic but can disambiguate `i`/`l`/`1` or `m`/`w` when used with context.
Record width in the output span regardless of whether Unicode was recovered. Downstream layout reconstruction depends on it.
## 9. OCR Fallback
When all preceding methods fail to recover a Unicode mapping with acceptable confidence:
1. **Compute glyph bounds in page space**: use the text matrix, font size, and advance width to determine the bounding rectangle of the glyph on the page.
2. **Crop the rendered page**: if a rasterized page image is available (e.g., from a prior rasterization pass), extract the crop at the computed bounds, padded by 20% on each side.
3. **Run OCR**: pass the crop to a Tesseract instance (via `leptess` or a raw FFI binding) configured for single-character recognition (`--psm 10`). Limit the character whitelist to printable ASCII plus any script detected elsewhere on the page.
4. **Align OCR output**: Tesseract returns a string; for a single-character crop this should be 0–2 characters. Accept a single character result; reject multi-character results as likely noise.
5. **Confidence threshold**: Tesseract provides a mean confidence score (0–100). Accept results above 70; mark 50–70 as low confidence; reject below 50 and emit U+FFFD.
OCR on individual glyphs is expensive. Gate it behind a per-page budget (e.g., at most 50 OCR crops per page) to avoid pathological performance on pages that are entirely Type 3 text with no recoverable names.
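The acceptance rules and the per-page budget can be sketched without touching Tesseract itself. The function below takes the recognized string and mean confidence as plain inputs; `OcrOutcome`, `gate_ocr_result`, and `OcrBudget` are hypothetical names, and trimming whitespace before counting characters is an assumption of this sketch.

```rust
#[derive(Debug, PartialEq)]
pub enum OcrOutcome {
    Accepted(char),      // mean confidence above 70
    LowConfidence(char), // mean confidence in 50..=70
    Rejected,            // below 50, empty, or more than one character
}

/// Apply the single-character rule and the confidence thresholds.
pub fn gate_ocr_result(text: &str, mean_conf: f32) -> OcrOutcome {
    let mut chars = text.trim().chars();
    let c = match (chars.next(), chars.next()) {
        (Some(c), None) => c,
        _ => return OcrOutcome::Rejected,
    };
    if mean_conf > 70.0 {
        OcrOutcome::Accepted(c)
    } else if mean_conf >= 50.0 {
        OcrOutcome::LowConfidence(c)
    } else {
        OcrOutcome::Rejected
    }
}

/// Per-page cap on OCR crops.
pub struct OcrBudget {
    remaining: u32,
}

impl OcrBudget {
    pub fn per_page(limit: u32) -> Self {
        OcrBudget { remaining: limit }
    }

    /// Returns true and consumes one unit if a crop may still be sent to OCR.
    pub fn try_spend(&mut self) -> bool {
        if self.remaining == 0 {
            return false;
        }
        self.remaining -= 1;
        true
    }
}
```

A caller would check `try_spend()` before cropping, and emit U+FFFD directly once the budget is exhausted.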
## 10. Output Representation
Every span derived from Type 3 glyph extraction carries the following metadata fields:
- **`font_type: "type3"`**: always set for Type 3 derived spans.
- **`unicode_source`**: one of:
- `"to_unicode_cmap"` — recovered from an explicit ToUnicode CMap entry.
- `"glyph_name_agl"` — recovered via the Adobe Glyph List algorithm from the CharProcs key.
- `"tex_encoding"` — recovered from a TeX OT1/OML/OMS/OMX encoding table.
- `"shape_fingerprint"` — recovered by rasterizing the glyph and matching against the hash database.
- `"ocr_fallback"` — recovered by OCR on the rendered page crop.
- `"unknown"` — all methods exhausted without a confident match.
- **`confidence`**: a `f32` in `[0.0, 1.0]`. `to_unicode_cmap` and `glyph_name_agl` emit `1.0`. `tex_encoding` emits `0.95`. `shape_fingerprint` maps Hamming distance linearly: distance 0 → `1.0`, distance 8 → `0.75`. `ocr_fallback` maps Tesseract confidence divided by 100.
- **`readable: bool`**: `false` when `unicode_source == "unknown"`. Spans with `readable: false` emit the replacement character (U+FFFD, `'\u{FFFD}'`) into the text output and are excluded from readability scoring.
This structure allows downstream consumers to filter by confidence, audit the recovery chain, and make informed decisions about whether to invoke additional post-processing (e.g., a full-page OCR pass) when `unknown` spans exceed a threshold fraction of the page.
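One way to encode the confidence rules is a single function over an enum of recovery sources. This is a sketch, not the project's type definitions: the enum layout is hypothetical, the linear Hamming mapping is extrapolated past distance 8 for weak matches, and `Unknown` is assigned `0.0`. Both of those last choices are assumptions, since the text fixes only the 0 and 8 endpoints.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum UnicodeSource {
    ToUnicodeCmap,
    GlyphNameAgl,
    TexEncoding,
    ShapeFingerprint { hamming: u32 },
    OcrFallback { tesseract_conf: f32 }, // Tesseract mean confidence, 0..=100
    Unknown,
}

/// Map a recovery source to a confidence in [0.0, 1.0].
pub fn confidence(src: UnicodeSource) -> f32 {
    match src {
        UnicodeSource::ToUnicodeCmap | UnicodeSource::GlyphNameAgl => 1.0,
        UnicodeSource::TexEncoding => 0.95,
        // Linear through (0, 1.0) and (8, 0.75): each differing bit costs 0.03125.
        UnicodeSource::ShapeFingerprint { hamming } => {
            (1.0 - hamming as f32 * (0.25 / 8.0)).max(0.0)
        }
        UnicodeSource::OcrFallback { tesseract_conf } => {
            (tesseract_conf / 100.0).clamp(0.0, 1.0)
        }
        UnicodeSource::Unknown => 0.0, // assumption: not specified in the text
    }
}

/// `readable` is false only when every recovery method was exhausted.
pub fn readable(src: UnicodeSource) -> bool {
    src != UnicodeSource::Unknown
}
```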

@ -0,0 +1,198 @@
# Watermark and Background Separation
## Purpose
Watermarks, background images, decorative graphics, and repeating patterns degrade text extraction in two distinct ways: they inject unwanted strings into the text stream when rendered as PDF text operators, and they reduce OCR accuracy on scanned pages by overlapping with real characters. This document describes how each mechanism manifests in the PDF specification, how to detect each variant, and what suppression policy to apply.
---
## 1. How Watermarks Appear in PDFs
Five distinct mechanisms produce watermark or background content:
**(a) Semi-transparent text via ExtGState `ca`.**
A graphics state dictionary in the page's `ExtGState` resource can set `ca` (fill alpha) to a value between 0 and 1. The content stream loads this state with `gs`, then renders text normally with `BT`/`ET` operators. The rendered text appears faded on screen but is fully present in the content stream. Detection requires tracking the current alpha during parsing, not inspecting the visual output.
**(b) Large image XObject behind page content.**
The content stream places a full-page or near-full-page image using `Do` before any text operators appear. The image is an indirect reference to an XObject of subtype `Image` in the page's `Resources` dictionary. Background images placed this way are ordering-dependent: the `Do` precedes `BT`, which is the positional signal.
**(c) Form XObject repeated via Resources.**
A single Form XObject (XObject subtype `Form`) defined once in the PDF and referenced from the `Resources` of multiple pages. On each page the content stream invokes it with `Do` as one of the first operations. Because the form is defined once and shared, its content stream is parsed independently of each page's content stream. Detection requires cross-referencing which XObjects appear in the Resources of many pages and which are invoked early in each content stream.
**(d) OCG layer marked as background.**
An Optional Content Group with a `Name` of "Background", "Watermark", or similar, referenced via a marked-content sequence (`/OC /name BDC ... EMC`, where `name` resolves through the `Properties` resource dictionary). The OCG's `Intent` entry, or the `Print` and `View` entries of its `Usage` dictionary, may carry a `PrintState` or `ViewState` set to `OFF`. Content inside this marked region is background by declaration. The OCG name and intent are the primary signals; see the optional-content-groups research document for the full OCG traversal algorithm.
**(e) Low-contrast color text.**
Text rendered in light gray (e.g., RGB `0.85 0.85 0.85`) against a white background, or very light tint of any hue. No alpha involved; the graphics state fill color set by `rg` or `g` operators carries the signal. The contrast ratio between the text color and the background estimate determines whether the text is decorative.
---
## 2. Transparency-Based Detection
During content stream parsing, maintain a graphics state stack mirroring what `q`/`Q` operators push and pop. Each stack frame carries:
```
struct GState {
    fill_alpha: f32,    // ca, default 1.0
    stroke_alpha: f32,  // CA, default 1.0
    blend_mode: BlendMode,
    ctm: Matrix3x3,
    fill_color: Color,
}
```
When a `gs` operator references an ExtGState dictionary, extract `ca`, `CA`, and `BM` from that dictionary and update the current frame. When a text span or image `Do` is encountered, annotate it with the current `fill_alpha`.
**Alpha threshold:** spans or images with `fill_alpha < 0.5` are watermark candidates. The threshold accounts for watermarks typically rendered between 0.1 and 0.4 alpha.
**Blend mode signal:** blend modes `Multiply`, `Screen`, `Overlay`, and `Luminosity` are structurally typical for watermarks. A span with alpha between 0.5 and 0.8 but a non-Normal blend mode should be escalated to a watermark candidate. Normal blend mode at alpha = 1.0 is never a watermark by this signal alone.
**Area weighting:** a single character at low alpha is not a watermark. A text element whose bounding box covers more than 30% of the page area at low alpha is a strong watermark candidate.
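The three signals above (alpha threshold, blend-mode escalation, area weighting) combine into a small classifier. The sketch below is illustrative: `Painted`, `Verdict`, and `classify_transparency` are hypothetical names, and `min_coverage` is an assumed parameter for discarding single characters, since the text gives no numeric cutoff for that case.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BlendMode {
    Normal,
    Multiply,
    Screen,
    Overlay,
    Luminosity,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    NotWatermark,
    Candidate,
    StrongCandidate,
}

/// A text span or image annotated with the graphics state it was painted under.
pub struct Painted {
    pub fill_alpha: f32,
    pub blend_mode: BlendMode,
    /// Bounding-box area divided by page area.
    pub coverage: f32,
}

pub fn classify_transparency(e: &Painted, min_coverage: f32) -> Verdict {
    let low_alpha = e.fill_alpha < 0.5;
    // Alpha in 0.5..=0.8 only counts when paired with a non-Normal blend mode.
    let escalated =
        e.fill_alpha >= 0.5 && e.fill_alpha <= 0.8 && e.blend_mode != BlendMode::Normal;
    if !(low_alpha || escalated) || e.coverage < min_coverage {
        return Verdict::NotWatermark;
    }
    if low_alpha && e.coverage > 0.30 {
        Verdict::StrongCandidate
    } else {
        Verdict::Candidate
    }
}
```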
---
## 3. Positional Repetition Detection
Some watermarks are rendered at full opacity (alpha = 1.0) but appear at a fixed position on every page. Detection requires a cross-page pass.
Build a normalized position inventory during the first parse pass. For each text span and image `Do`, record:
```
(normalized_x, normalized_y, width_fraction, height_fraction, content_hash)
```
where coordinates are divided by the page's `MediaBox` dimensions. After parsing all pages, count how often each `(normalized_x, normalized_y, content_hash)` tuple appears. Elements present on more than 80% of pages at the same normalized position are watermark candidates regardless of alpha.
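The counting pass might be sketched as follows. Floating-point coordinates cannot serve as hash keys directly, so this sketch quantizes normalized positions to 1/1000 of the page dimension. That resolution, along with the names `Observation` and `repeated_elements`, is an assumption made for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// (quantized x, quantized y, content hash)
pub type PositionKey = (u32, u32, u64);

/// One text span or image occurrence, with coordinates already
/// divided by the page's MediaBox dimensions.
pub struct Observation {
    pub page: usize,
    pub x: f32,
    pub y: f32,
    pub content_hash: u64,
}

/// Return keys seen on more than 80% of pages, in sorted order.
pub fn repeated_elements(obs: &[Observation], page_count: usize) -> Vec<PositionKey> {
    let mut pages_by_key: HashMap<PositionKey, HashSet<usize>> = HashMap::new();
    for o in obs {
        let key = (
            (o.x * 1000.0).round() as u32,
            (o.y * 1000.0).round() as u32,
            o.content_hash,
        );
        // A set, so an element painted twice on one page counts once.
        pages_by_key.entry(key).or_default().insert(o.page);
    }
    let mut hits: Vec<PositionKey> = pages_by_key
        .into_iter()
        // Integer form of pages / page_count > 0.8.
        .filter(|(_, pages)| pages.len() * 10 > page_count * 8)
        .map(|(key, _)| key)
        .collect();
    hits.sort();
    hits
}
```

Counting distinct pages rather than raw occurrences keeps a heavily repeated in-page ornament from masquerading as a cross-page watermark.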
**Diagonal watermarks:** a diagonal "CONFIDENTIAL" or "DRAFT" watermark is typically centered on the page with a rotated CTM. The CTM rotation angle (extracted from the `cm` operator or inherited via Form XObject) of ±45° combined with a bounding box centered near (0.5, 0.5) normalized is a diagnostic pattern. The positional repetition check applies equally — the normalized center position and rotation angle form the key.
**Recto/verso patterns:** for duplex-printed documents, a watermark may appear only on odd or only on even pages. Such a watermark is present on roughly 50% of all pages and can never clear the 80% threshold when all pages are counted together, so run the check separately on the odd and even page sets.
---
## 4. Form XObject Reuse as Background
A Form XObject used as a background is parsed once but `Do`-invoked on every page. The detection algorithm:
1. For each page, record the order in which XObjects are invoked relative to `BT` operators. An XObject invoked before any `BT` on the page gets a `pre_text = true` flag.
2. Count how many pages invoke each XObject with `pre_text = true`.
3. Any XObject invoked with `pre_text = true` on more than 80% of pages is a background Form XObject candidate.
4. Parse the Form XObject's own content stream. If it contains `BT`/`ET` sequences, the background carries text (common in letterhead watermarks). If it contains only path operators (`m`, `l`, `c`, `re`, `f`, `S`) or image `Do` operators, it is a purely graphic background.
This classification determines suppression: text-carrying Form XObjects need text-level filtering; graphic Form XObjects are suppressed at the render level.
---
## 5. Color-Based Filtering
Track the current fill color during parsing. For `g` (grayscale), `rg` (RGB), `k` (CMYK), and their stroke equivalents, maintain the current color in the graphics state.
Compute the WCAG relative luminance for each text span's fill color:
```
L = 0.2126 * linearize(R) + 0.7152 * linearize(G) + 0.0722 * linearize(B)
// linearize(c) = c/12.92 if c <= 0.04045, else ((c+0.055)/1.055)^2.4
```
Assuming a white background (`L_bg = 1.0`), the contrast ratio is `(L_bg + 0.05) / (L_text + 0.05)`. Text with contrast ratio below 2.0 is likely decorative or a watermark. Text with contrast between 2.0 and 3.0 is ambiguous and should be labeled but not suppressed by default.
For non-white backgrounds, the background luminance must be estimated. If the page contains a background image, use the median luminance of the region beneath the text span. If no background image exists, assume white.
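The luminance and contrast formulas translate directly into code. In this sketch the function names and the `ContrastClass` enum are illustrative; color components are assumed to be sRGB values in `[0, 1]` as set by `rg` or `g`.

```rust
/// sRGB component -> linear light, per the formula above.
fn linearize(c: f64) -> f64 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// WCAG relative luminance of an sRGB color with components in [0, 1].
pub fn relative_luminance(r: f64, g: f64, b: f64) -> f64 {
    0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)
}

/// Contrast ratio between two luminances; order of arguments does not matter.
pub fn contrast_ratio(l1: f64, l2: f64) -> f64 {
    let (lighter, darker) = if l1 >= l2 { (l1, l2) } else { (l2, l1) };
    (lighter + 0.05) / (darker + 0.05)
}

#[derive(Debug, PartialEq)]
pub enum ContrastClass {
    Decorative, // ratio below 2.0: likely watermark
    Ambiguous,  // 2.0 to 3.0: label, do not suppress by default
    Body,       // above 3.0
}

pub fn classify_contrast(ratio: f64) -> ContrastClass {
    if ratio < 2.0 {
        ContrastClass::Decorative
    } else if ratio <= 3.0 {
        ContrastClass::Ambiguous
    } else {
        ContrastClass::Body
    }
}
```

For the light gray from section 1(e), `0.85 0.85 0.85` on white, the ratio comes out at roughly 1.4, well inside the decorative band.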
---
## 6. OCR Preprocessing: Raster Watermark Removal
For scanned PDFs, the watermark is baked into the raster image. Two detection approaches apply before passing the page image to Tesseract:
**Connected components approach:** binarize the page image (Otsu threshold). Run connected-component labeling. Very large connected components that span more than 20% of the page width or height, are not rectangular (i.e., shaped like text glyphs), and whose pixel color deviates from the local background by less than 30 gray levels are watermark region candidates. Inpaint these regions by replacing pixels with the local median background color (sampled from a 16-pixel border around the component's bounding box).
**Frequency domain approach:** periodic watermarks (repeating logos or patterns) appear as discrete peaks in the 2D discrete Fourier transform of the page image. Apply a notch filter centered on those peaks, then invert the DFT. This is effective for grid or tiling patterns but less targeted than connected-component inpainting for text watermarks.
Inpainting is applied regardless of the output suppression policy — the OCR input must be clean even if the caller has requested `include_watermarks: true`.
---
## 7. Diagonal Text Watermarks on Scans
"CONFIDENTIAL", "DRAFT", and "COPY" watermarks typically appear at 45° rotation, large font, spanning the page diagonally. Detection on the rasterized image:
1. **Hough line transform** on the binarized image restricted to angles 40°–50°. A strong response in this range with lines passing through the page center signals a diagonal text watermark.
2. **Large connected components at 45° orientation:** compute the principal axis of each large connected component (PCA on pixel coordinates). Components whose principal axis is within 5° of 45° and whose bounding box area exceeds 5% of the page are candidates.
3. **Confirmation by OCR in the rotated region:** rotate the candidate region by 45° and run Tesseract on the sub-image. If the recognized text matches a known watermark vocabulary ("CONFIDENTIAL", "DRAFT", "COPY", "SAMPLE", "VOID") with confidence > 0.7, the region is confirmed.
4. Mask the confirmed region with the local background estimate before the main OCR pass.
---
## 8. Background Images vs. Content Images
Both appear as XObjects of subtype `Image`, but their roles differ:
| Signal | Background image | Content figure |
|---|---|---|
| Rendered area / page area | > 80% | < 60% |
| Position in content stream | Before `BT` | After `BT` or between text blocks |
| Image content entropy | Low (solid color, gradient) | High (photograph, chart) |
| Proximity to text | Text overlaps the image | Text is adjacent, not overlapping |
Compute image entropy as the Shannon entropy of the pixel value histogram (8-bit grayscale, 256 bins). A solid-color image has entropy near 0; a photograph typically has entropy above 5 bits. Threshold at 3.0 bits: below is background, above is content.
The content stream ordering check is the highest-confidence signal and should gate the entropy check. An image placed after all text operators on a page cannot be a background by definition.
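The entropy signal can be computed from a 256-bin histogram in a few lines. This sketch assumes the image has already been decoded to 8-bit grayscale samples; `shannon_entropy` and `is_background_by_entropy` are illustrative names, and the ordering check described above is expected to have run first.

```rust
/// Shannon entropy, in bits, of an 8-bit grayscale pixel histogram.
pub fn shannon_entropy(pixels: &[u8]) -> f64 {
    if pixels.is_empty() {
        return 0.0;
    }
    let mut hist = [0u64; 256];
    for &p in pixels {
        hist[p as usize] += 1;
    }
    let n = pixels.len() as f64;
    hist.iter()
        .filter(|&&count| count > 0)
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Apply the 3.0-bit threshold from the text: below is background.
pub fn is_background_by_entropy(pixels: &[u8]) -> bool {
    shannon_entropy(pixels) < 3.0
}
```

A solid-color image yields 0 bits and an image using all 256 gray levels equally yields 8 bits, which brackets the 3.0-bit threshold on both sides.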
---
## 9. Suppression Policy
Three disposition options apply per detected watermark element:
**(a) Exclude from text output entirely.** Default for pure decorative elements (graphic Form XObjects, background images, transparent non-text spans). No representation in the output text stream.
**(b) Include with `zone: "watermark"` label.** The watermark text span is included in the main text stream but tagged so callers can filter it. Useful when the caller needs to be aware of what the document says (e.g., "DRAFT") without mistaking it for body text.
**(c) Include with `visible: false`.** The span is present in the structured output but excluded from any plain-text serialization. Callers querying the structured representation can access it; plain-text users cannot.
The caller controls behavior via:
```rust
pub struct ExtractionOptions {
    pub include_watermarks: bool,   // default: false
    pub watermark_zone_label: bool, // default: true (when include_watermarks = true)
}
```
For scanned pages, inpainting is unconditional — it happens before OCR regardless of the output policy.
---
## 10. Output Structure
Each page's output includes a `watermarks` array:
```rust
pub struct WatermarkRecord {
    pub kind: WatermarkKind,        // Text | Image | FormXObject
    pub text: Option<String>,       // populated for text watermarks
    pub bbox: Rect,
    pub alpha: Option<f32>,         // None if detected by repetition or color
    pub detection_method: DetectionMethod,
    pub page_indices: Vec<usize>,   // pages where this watermark was detected
}

pub enum DetectionMethod {
    Transparency,     // ca < 0.5
    Repetition,       // same position on > 80% of pages
    ColorContrast,    // WCAG contrast < 2.0
    OcgLayer,         // marked inside a background OCG
    RasterDetection,  // connected component or Hough on scan
}
```
Text spans that are included in the main stream despite being watermarks carry:
```rust
pub struct TextSpan {
    // ...
    pub zone: Option<ZoneLabel>,  // Some(ZoneLabel::Watermark) when applicable
    pub visible: bool,
}
```
The `watermarks` array is populated even when `include_watermarks: false` — callers can always inspect what was suppressed without requesting its inclusion in the text stream.