From 116db89c952cf27ef76136e2717abbd86cc47708 Mon Sep 17 00:00:00 2001 From: jedarden Date: Sat, 16 May 2026 15:22:08 -0400 Subject: [PATCH] Add three research documents on routing and text reconstruction - word-boundary-reconstruction: expected position formula with Tc/Tw/Tz, TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold strategies including adaptive histogram, multi-column gap discrimination - scanned-vs-vector-page-classification: four-category taxonomy, fast pre-checks, image coverage AABB computation, character density ratio, validity rate, glyph bbox plausibility, region routing map, confidence scoring with cost-aware OCR threshold - pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP pdfaid detection, Level B/U/A guarantee implications for extraction, font embedding requirements, artifact tagging, PDF/A-3 embedded files, PdfaLevel enum with per-level fast-path branching Co-Authored-By: Claude Sonnet 4.6 --- .../pdfa-compliance-and-extraction.md | 223 ++++++++++++++++++ .../scanned-vs-vector-page-classification.md | 203 ++++++++++++++++ docs/research/word-boundary-reconstruction.md | 202 ++++++++++++++++ 3 files changed, 628 insertions(+) create mode 100644 docs/research/pdfa-compliance-and-extraction.md create mode 100644 docs/research/scanned-vs-vector-page-classification.md create mode 100644 docs/research/word-boundary-reconstruction.md diff --git a/docs/research/pdfa-compliance-and-extraction.md b/docs/research/pdfa-compliance-and-extraction.md new file mode 100644 index 0000000..62bc262 --- /dev/null +++ b/docs/research/pdfa-compliance-and-extraction.md @@ -0,0 +1,223 @@ +# PDF/A Compliance and Extraction + +PDF/A (ISO 19005) is the ISO archival subset of PDF. Its structural guarantees are not merely administrative — they directly eliminate the major failure modes in text extraction. A compliant PDF/A document removes uncertainty about font encoding, reading order, and content accessibility. 
This document enumerates those guarantees, explains how to detect them from the document catalog, and defines the optimized extraction path for each conformance level. + +--- + +## 1. PDF/A Variants Overview + +PDF/A has four published parts, each tied to a specific PDF specification version: + +- **PDF/A-1** (ISO 19005-1, 2005): based on PDF 1.4. Conformance levels **a** (accessible) and **b** (basic). Level A requires full tagging; Level B requires only font embedding and device-independent color. +- **PDF/A-2** (ISO 19005-2, 2011): based on PDF 1.7 (ISO 32000-1). Adds level **u** (Unicode), which mandates ToUnicode mappings for every character. Also permits JPEG2000, transparency, and optional content groups that PDF/A-1 forbids. +- **PDF/A-3** (ISO 19005-3, 2012): based on PDF 1.7. Identical to PDF/A-2 in conformance levels (a, b, u) but lifts the restriction on embedded file formats: arbitrary file attachments are permitted with a declared relationship. +- **PDF/A-4** (ISO 19005-4, 2020): based on PDF 2.0 (ISO 32000-2). Drops the a/b/u levels. The base profile carries no conformance letter; **f** (files) additionally permits embedded files of any format, and **e** (engineering) additionally permits 3D and rich-media content for technical documents. The Unicode requirement from level U is folded into the PDF/A-4 baseline, so it also holds for PDF/A-4f and PDF/A-4e. + +For text extraction, the relevant capability gradient is: B → U → A, where each step adds a stronger structural guarantee that eliminates a class of heuristics. + +--- + +## 2. Detection + +### XMP Metadata Declaration + +PDF/A conformance is self-declared in an XMP metadata stream attached to the document catalog (the root object). 
The relevant namespace is: + +``` +http://www.aiim.org/pdfa/ns/id/ +``` + +Two properties carry the conformance claim: + +| XMP Property | Value Type | Examples | +|----------------------|------------|----------------| +| `pdfaid:part` | Integer | `1`, `2`, `3`, `4` | +| `pdfaid:conformance` | String | `A`, `B`, `U`, `F`, `E` | + +The XMP stream is located via `Catalog -> Metadata` (a stream object with `Subtype /XML`). Parse the raw XML (it is serialized RDF/XML) and extract the two properties from the `pdfaid:` namespace. Producers may write them either as child elements or as attributes on `rdf:Description`; accept both forms. + +A document declaring PDF/A-2u will have: + +```xml +<pdfaid:part>2</pdfaid:part> +<pdfaid:conformance>U</pdfaid:conformance> +``` + +### Corroborating Signals + +For level A, the `MarkInfo` dictionary in the catalog provides an independent signal: + +``` +Catalog -> MarkInfo -> /Marked true +``` + +If `Marked` is `true` but `pdfaid:conformance` is `B` or absent, the document is tagged but makes no Level A claim; treat the tagging as opportunistic rather than guaranteed correct. The XMP declaration is the authoritative source; `MarkInfo` is confirmatory. + +### When to Trust the Declaration + +PDF/A validation is an external concern (validators such as veraPDF implement the full rule set). For extraction purposes, treat the XMP declaration as sufficient if: + +1. The XMP stream is present and parseable. +2. The `pdfaid:part` and `pdfaid:conformance` values are valid. +3. The document was produced by a known-good authoring tool (check `xmp:CreatorTool` as a heuristic; PostScript distillers and document converters frequently produce non-compliant PDFs that falsely declare PDF/A). + +For documents with suspicious provenance, verify independently: confirm that `FontDescriptor` entries contain `FontFile`/`FontFile2`/`FontFile3`, that no `Encrypt` dictionary is present, and that `StructTreeRoot` is present for level A claims. + +--- + +## 3. Level B Guarantees + +Level B (basic) is the minimum conformance tier. 
It establishes the structural preconditions that make reliable extraction possible at all: + +- **No encryption**: the `Encrypt` dictionary must be absent. Content is always accessible without a password. The extraction engine can skip the decryption path entirely. +- **All fonts embedded**: every font referenced in a content stream must have its data embedded in `FontDescriptor.FontFile`, `FontDescriptor.FontFile2` (TrueType), or `FontDescriptor.FontFile3` (CFF/OpenType/Type1C). Partial embedding is not permitted if the missing glyphs appear in the document. +- **No external content references**: no `URI` actions that load remote resources, no external graphic imports. The document is self-contained. +- **No JavaScript or launch actions**: `AA` and `OpenAction` entries must not contain `JavaScript` or `Launch` actions. +- **Device-independent color**: all colors are expressed in ICC-profiled or device-independent spaces, or the document declares an `OutputIntent` ICC profile that gives device-dependent operators (`RG`, `rg`, `K`, `k`) a defined meaning. + +The practical consequence for extraction: **font data is always present**. The fallback path that handles missing font files (glyph shape fingerprinting, width heuristics, external font databases) is unnecessary for Level B and above. + +--- + +## 4. Level U Guarantees + +PDF/A-2u and PDF/A-3u add a single critical requirement on top of Level B: + +**Every character code in every content stream must have a ToUnicode mapping.** + +The `ToUnicode` CMap stream must be present in every `Font` dictionary, and the mapping must cover every code point that appears in the document's text operators (`Tj`, `TJ`, `'`, `"`). There are no gaps, no unmapped ranges, and no reliance on glyph name heuristics. + +This is the most important guarantee for text extraction. 
The two-stage encoding resolution process — first attempt ToUnicode, fall back to glyph name normalization, fall back to shape fingerprinting — collapses to a single step: read the ToUnicode CMap and apply it directly. + +The extraction engine can skip: +- Glyph name to Unicode inference (Adobe Glyph List lookups, `/uni`-prefixed name parsing). +- Shape fingerprinting against reference glyph databases. +- Width-based character disambiguation. +- Encoding difference array fallback for Type1 fonts. + +Implement a fast path: if `pdfaid:part` is `2` or `3` and `pdfaid:conformance` is `U` or `A`, assert that every `Font` object has a `ToUnicode` entry and decode all text exclusively through those CMaps. If a `ToUnicode` entry is missing on a Level U document, the document is non-conformant — log a warning and fall back to standard recovery, but do not silently proceed as if it were guaranteed correct. + +--- + +## 5. Level A Guarantees + +Level A (accessible) adds full logical structure on top of Level U: + +- **Tagged content**: `Catalog.StructTreeRoot` is present. Every page's content stream elements are either tagged (associated with a structure element via marked content sequences `BDC`/`EMC` with an `MCID`) or explicitly marked as artifacts. +- **`MarkInfo /Marked true`**: declared in the catalog. +- **Reading order encoded**: the `StructTreeRoot` tree encodes the logical reading order of all tagged content. The leaf nodes (`Span`, `P`, `Figure`, etc.) appear in the tree in document logical order, not in page painting order. +- **Role mapping**: `Catalog.StructTreeRoot.RoleMap` maps custom element types to standard PDF structure types (defined in ISO 32000 Table 333). + +The extraction consequence: reading order is already solved. The heuristic reading order algorithm — column detection, bounding-box sorting, gap analysis — is unnecessary. 
Walk the structure tree in document order, collect the MCIDs at each leaf, resolve them to marked content sequences on the page, and emit text in that order. + +Zone labeling is also resolved: the structure tree distinguishes headings (`H`, `H1`–`H6`), paragraphs (`P`), list items (`LI`), table cells (`TD`, `TH`), figures (`Figure`), and artifacts (headers, footers, page numbers). No heuristic zone classifier is needed. + +--- + +## 6. Font Embedding Requirements + +PDF/A's font embedding rule is absolute: if a glyph is painted in the document, its outline must be embedded. This applies to all font types: + +| Font Type | Required Key | +|--------------|---------------------| +| Type1 | `FontDescriptor.FontFile` | +| TrueType | `FontDescriptor.FontFile2` | +| CFF/OpenType | `FontDescriptor.FontFile3` | +| Type0 (CID) | Descendant font's `FontDescriptor.FontFile2` or `FontFile3` | + +Subsetting is allowed but must not remove glyphs that appear in the content stream. The subset tag (a six-uppercase-letter prefix in the `BaseFont` name, e.g., `ABCDEF+TimesNewRoman`) identifies subsetted fonts, but all used glyphs are present by definition. + +For extraction, this means: if outline-based fingerprinting is ever needed (e.g., diagnosing a non-conformant Level B document with a broken ToUnicode), the font data is always present to fingerprint against. + +--- + +## 7. Color Space Requirements + +PDF/A forbids bare device-dependent color operators without an `OutputIntent`. Specifically: + +- Operators `RG`/`rg` (DeviceRGB), `K`/`k` (DeviceCMYK), and `G`/`g` (DeviceGray) are only valid if `Catalog.OutputIntents` contains an ICC-based output intent profile. +- All ICC profiles referenced via `ICCBased` color spaces must be embedded. + +For text visibility detection — determining whether text is rendered in a color that contrasts with its background — this simplification means color comparisons always operate in a well-defined space. 
Converting text and background colors to a common space (via the declared ICC profile) is unambiguous. There are no undefined device-dependent color values that require producer-specific interpretation. + +--- + +## 8. Artifacts and Tagging + +In a Level A document, the artifact mechanism makes the distinction between content and decoration explicit at the byte level. Page elements that are not part of the logical document flow are wrapped in artifact marked-content sequences: + +``` +/Artifact << /Type /Pagination >> BDC + BT ... ET % page number or running header +EMC +``` + +Standard artifact subtypes: ISO 32000-1 defines `Header`, `Footer`, and `Watermark`; ISO 32000-2 adds `PageNum`, `Bates`, `LineNum`, and `Redaction`. Custom types are permitted with `RoleMap` entries. + +This means: page headers, footers, and decorative rules are identified by the document itself. The extraction engine does not need to infer their status from position or font size. Skip all `Artifact`-tagged content when building the logical text output; include it only if the caller requests full-page text (e.g., for header/footer metadata extraction). + +--- + +## 9. PDF/A-3 Embedded Files + +PDF/A-1 forbids embedded files and PDF/A-2 permits only embedded PDF/A files; PDF/A-3 lifts that restriction. Every embedded file must declare an `AFRelationship` value in its file specification dictionary: + +| `AFRelationship` | Meaning | +|------------------|---------| +| `Source` | The embedded file is the source from which the PDF was generated | +| `Data` | Structured data related to the document | +| `Alternative` | A machine-readable alternative rendition of the document content | +| `Supplement` | Supplementary information not contained in the PDF | +| `Unspecified` | Relationship not declared | + +The `Alternative` relationship is significant for extraction: the embedded file may contain the full document text in a structured format (XML, JSON, plain text). 
The most common real-world case is **ZUGFeRD / Factur-X**: a PDF/A-3 invoice with an embedded XML file (Factur-X XML, typically `AFRelationship: Alternative` or `Data` depending on the profile) that contains all invoice fields in machine-readable form. Extracting the embedded XML from a Factur-X document is more reliable than parsing the PDF text layer. + +Enumerate embedded files via `Catalog.Names.EmbeddedFiles` (a name tree) or `Catalog.AF` (an array of file specification dictionaries). Check `AFRelationship` and extract the embedded stream when `Alternative` or `Data` is present. + +--- + +## 10. Extraction Strategy by Conformance Level + +Detect conformance early, immediately after parsing the catalog and before any content stream processing, and branch into the appropriate extraction path. + +### Level B (`pdfaid:part` 1/2/3/4, `pdfaid:conformance` B or F or E) + +- Font data always present: skip external font database lookups. +- ToUnicode not guaranteed: run standard staged encoding resolution (ToUnicode → glyph name → shape fingerprint). +- Reading order: use heuristic column/block sort. +- Artifacts: not reliably identified; apply heuristic header/footer detection. + +### Level U (`pdfaid:part` 2 or 3, `pdfaid:conformance` U) + +- Assert ToUnicode present on every font; error-log if absent. +- Decode all text exclusively via ToUnicode CMaps. Skip glyph name resolution and fingerprinting. +- Reading order: still heuristic (no StructTree guarantee). +- Performance gain: eliminates the most expensive fallback path. + +### Level A (`pdfaid:part` 1/2/3, `pdfaid:conformance` A) + +- All Level U guarantees apply. +- Walk `StructTreeRoot` in tree order to determine reading order and zone labels. +- Skip the heuristic reading-order algorithm entirely. +- Skip heuristic header/footer detection: artifacts are explicitly marked. +- Emit text in structure-tree order; annotate output with structure element types. 
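
The per-level branching above can be sketched as a single lookup from the two XMP values to a set of pipeline switches. This is a minimal sketch, not the project's implementation: `ExtractionPlan`, its fields, and `plan_for` are hypothetical names introduced here, and the sketch assumes level A is only meaningful for parts 1 to 3 because PDF/A-4 defines no level A.

```rust
/// Hypothetical set of pipeline switches derived from the declared level.
#[derive(Debug, PartialEq, Eq)]
pub struct ExtractionPlan {
    /// Fonts are guaranteed embedded, so external font lookups can be skipped.
    pub skip_font_database: bool,
    /// Decode text exclusively through ToUnicode CMaps.
    pub tounicode_only: bool,
    /// Take reading order and zone labels from the structure tree.
    pub use_struct_tree: bool,
}

/// Map `pdfaid:part` / `pdfaid:conformance` to a plan.
/// Returns `None` for combinations that are not valid PDF/A declarations,
/// in which case the caller should run the full heuristic pipeline.
pub fn plan_for(part: u8, conformance: &str) -> Option<ExtractionPlan> {
    let level = conformance.to_ascii_uppercase();
    match (part, level.as_str()) {
        // Level A: tagging plus Unicode guarantees (assumed parts 1-3 only).
        (1..=3, "A") => Some(ExtractionPlan {
            skip_font_database: true,
            tounicode_only: true,
            use_struct_tree: true,
        }),
        // Level U exists only in parts 2 and 3.
        (2 | 3, "U") => Some(ExtractionPlan {
            skip_font_database: true,
            tounicode_only: true,
            use_struct_tree: false,
        }),
        // Level B, and the conservative treatment of PDF/A-4 F and E.
        (1..=3, "B") | (4, "F") | (4, "E") => Some(ExtractionPlan {
            skip_font_database: true,
            tounicode_only: false,
            use_struct_tree: false,
        }),
        _ => None,
    }
}
```

Returning `None` for an invalid pair such as part 1 with level U keeps a malformed declaration from silently enabling a fast path.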
+ +### Output Metadata + +Report `pdfa_level` in the extraction output metadata: + +```rust +pub enum PdfaLevel { + None, + Part1B, Part1A, + Part2B, Part2U, Part2A, + Part3B, Part3U, Part3A, + Part4F, Part4E, +} +``` + +This allows callers to know the confidence level of the extracted text and to request the fast path explicitly when processing large batches of known-compliant archival documents. + +### Trust Hierarchy + +When the declared conformance level implies a guarantee (e.g., ToUnicode always present for Level U), verify the assumption on the first font encountered. If the document is non-conformant, downgrade the active level, emit a diagnostic, and continue with the full fallback pipeline. Never assume compliance is infallible — archival workflows do produce non-conformant files that declare PDF/A. diff --git a/docs/research/scanned-vs-vector-page-classification.md b/docs/research/scanned-vs-vector-page-classification.md new file mode 100644 index 0000000..8c2e1ba --- /dev/null +++ b/docs/research/scanned-vs-vector-page-classification.md @@ -0,0 +1,203 @@ +# Scanned vs. Vector Page Classification + +## Overview + +Before `pdftract` can extract text from a PDF page, it must decide which pipeline to invoke. Routing a scanned page through vector extraction yields zero or garbled output. Routing a born-digital page through OCR wastes CPU and often produces lower-accuracy text than the embedded encoding. The classifier runs before any extraction work and produces a routing decision — one of `vector`, `ocr`, `hybrid`, or `assisted_ocr` — along with per-signal evidence that callers can inspect. + +This document specifies the classification algorithm in full. + +--- + +## 1. Page Type Taxonomy + +Four categories require distinct extraction pipelines. + +**Pure vector.** All text is encoded in content stream operators (`Tj`, `TJ`, `'`, `"`) with valid character-to-Unicode mappings. 
Fonts have `ToUnicode` CMaps or their encodings resolve fully through standard lookups. No full-page raster images are present. This is the fast path. + +**Scanned image.** The page is a full-page raster image XObject. Text operators are absent or consist only of an invisible overlay (rendering mode `Tr 3`), which is the PDF/A pattern left by a prior OCR pass. Extraction requires rendering the page to a raster and running OCR. + +**Hybrid.** Some regions contain valid vector text; other regions contain embedded image XObjects carrying text visible only as pixels. The classifier must produce a per-region routing map rather than a single page-level decision. + +**Broken vector.** Text operators are present and font metrics are non-degenerate, but the font encoding is wrong or missing. Extracted character codes do not map to readable Unicode. The page looks like a vector page to a shallow scan but produces garbled or empty output. The classifier must detect the failure without running a full extraction pass, using density signals and partial decoding probes. + +--- + +## 2. Fast Pre-Checks + +Before committing to expensive image analysis or font decoding probes, evaluate three fast signals in order. Any definitive match short-circuits the rest of the classifier. + +**No text operators.** Parse the content stream and count `BT`/`ET` block pairs that contain at least one text-showing operator (`Tj`, `TJ`, `'`, `"`). If no `BT` blocks exist, or every `BT`/`ET` pair is empty, the page has no vector text. With an image XObject present, classify as `scanned`; with neither, classify as `empty`. + +**Invisible text only.** Parse every `Tr` (text rendering mode) operator within `BT`/`ET` blocks. If all text operators are guarded by `Tr 3` (invisible mode) and at least one full-page image XObject is present, the page is a scanned image with an OCR overlay. Classify as `scanned` and mark `has_ocr_layer: true`. 
This overlay should be used as a fallback hint in the `assisted_ocr` pipeline, not treated as authoritative. + +**High-confidence vector.** If extracted character count exceeds the density floor (section 4), no image XObject covers more than 30% of page area, and character validity rate (section 5) exceeds 0.85, classify immediately as `vector`. This branch exits the classifier early for the majority of born-digital pages. + +--- + +## 3. Image Coverage Analysis + +For pages that do not exit via fast pre-checks, compute the fraction of page area covered by image XObjects. + +For each `Do` operator that references an image XObject, retrieve the Current Transformation Matrix (CTM) at the point of invocation. An image XObject is defined in its own coordinate space as a unit square `[0,0,1,1]`. Transform this unit square through the CTM to obtain the rendered bounding box in page coordinates. Use the four corners of the transformed parallelogram to compute an axis-aligned bounding box (AABB). Sum the AABB areas across all image XObjects (handling overlaps by clipping against the union). + +``` +image_coverage_fraction = clipped_image_area / page_area +``` + +If `image_coverage_fraction > 0.80`, the page is predominantly image-based. Combine with text operator signals: + +- Coverage > 0.80, no text operators → `scanned` +- Coverage > 0.80, invisible text only → `scanned` (with OCR layer) +- Coverage > 0.80, valid text operators → `hybrid` candidate, proceed to region analysis (section 7) + +To distinguish full-page background images from content images, evaluate position and aspect ratio. A full-page background image has an AABB whose top-left corner is within 5% of page width/height of the page origin and whose dimensions are within 10% of the page dimensions. Images that do not match this geometry are content images and do not contribute to the background coverage fraction used for routing. + +--- + +## 4. 
Text Operator Density + +Count the number of character codes emitted by all text-showing operators on the page. Normalize by an expected character count derived from page dimensions and a reference text density. + +For a standard A4 page (595 × 842 points) at 10pt font size with normal line spacing and margins, expected character count is approximately 3000–4000. Scale linearly for non-standard page sizes. Compute `density_ratio = extracted_char_count / expected_char_count(page_area)`. + +A `density_ratio < 0.05` despite the presence of text operators is diagnostic of broken vector: the font is present but almost no characters decode. A ratio in `[0.05, 0.30)` is ambiguous — sparse typesetting or partial encoding failure. A ratio above 0.30 is consistent with valid vector text. + +This check must operate on raw character codes from the content stream before ToUnicode mapping, measuring how many characters the font machinery attempts to emit rather than how many produce valid Unicode. Validity is assessed separately in section 5. + +--- + +## 5. Character Validity Rate + +After decoding character codes through the font's encoding chain (Encoding dictionary, ToUnicode CMap, CIDToGIDMap), apply the readability checks defined in `text-readability-validation.md` to compute a per-page validity rate. + +The three primary signals: + +**Replacement character density.** `replacement_ratio = fffd_count / total_codepoints > 0.10` flags that font as broken. + +**PUA density.** `pua_ratio > 0.40` over U+E000–U+F8FF and U+F0000–U+FFFFF indicates a missing ToUnicode map. + +**Symbol font bleed-through.** Codepoints mapping to Zapf Dingbats or Symbol block ranges appear when a non-text font is decoded using a text font's encoding. 
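
The replacement-character and PUA ratios described above can be computed in one pass over the decoded text. The following is a minimal sketch, assuming the text for one font has already been decoded into a Rust `&str`; `ValiditySignals` and `validity_signals` are hypothetical names, and the PUA ranges are the ones given in the text.

```rust
/// Hypothetical per-font readability ratios.
#[derive(Debug, PartialEq)]
pub struct ValiditySignals {
    pub replacement_ratio: f32,
    pub pua_ratio: f32,
}

/// Count U+FFFD and Private Use Area codepoints as fractions of all codepoints.
pub fn validity_signals(text: &str) -> ValiditySignals {
    let mut total = 0u32;
    let mut fffd = 0u32;
    let mut pua = 0u32;
    for ch in text.chars() {
        total += 1;
        match ch as u32 {
            0xFFFD => fffd += 1,
            // Ranges as stated in the text: BMP PUA and plane 15.
            0xE000..=0xF8FF | 0xF0000..=0xFFFFF => pua += 1,
            _ => {}
        }
    }
    if total == 0 {
        // No decoded text: report zero ratios rather than dividing by zero.
        return ValiditySignals { replacement_ratio: 0.0, pua_ratio: 0.0 };
    }
    ValiditySignals {
        replacement_ratio: fffd as f32 / total as f32,
        pua_ratio: pua as f32 / total as f32,
    }
}

/// Apply the thresholds from the text: more than 10% U+FFFD or more than 40% PUA.
pub fn font_looks_broken(s: &ValiditySignals) -> bool {
    s.replacement_ratio > 0.10 || s.pua_ratio > 0.40
}
```

Symbol-font bleed-through is left out of this sketch because detecting it needs knowledge of which font produced each codepoint, not just the decoded string.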
+ +Compute these per-font and aggregate to a page-level `character_validity_rate`: + +``` +character_validity_rate = valid_chars / total_chars +``` + +where `valid_chars` excludes U+FFFD, PUA codepoints above the 5% tolerance, and control characters outside U+0009 and U+000A. + +If `character_validity_rate < 0.70`, classify the page as `broken_vector` and route to OCR fallback. If `character_validity_rate` falls in `[0.70, 0.85)`, route to `assisted_ocr`: use the partial vector output as position hints for OCR alignment but do not trust the decoded text as final output. + +--- + +## 6. Glyph Bounding Box Plausibility + +For each extracted glyph, compute its bounding box in page coordinates by applying the text matrix, text line matrix, and CTM. Plausibility bounds: + +- **Width:** `[0.01 × font_size, 2.0 × font_size]`. Glyphs narrower than 1% of the font size are likely zero-width artifacts from a dummy text layer. Glyphs wider than twice the font size are likely the result of a corrupt text matrix. +- **Height:** `[0.3 × font_size, 3.0 × font_size]`. Heights below 30% suggest degenerate scaling; heights above 300% suggest an incorrect CTM. + +Compute `implausible_bbox_fraction = implausible_glyph_count / total_glyph_count`. If this exceeds 0.20, the font matrix or CTM is corrupted. Route to OCR fallback. + +Additionally, for consecutive glyphs on the same text line, compute the intersection-over-union (IoU) of their bounding boxes. If more than 50% of adjacent pairs have IoU > 0.50, glyphs are stacked at a single position — a common pattern in dummy text layers from low-quality OCR tools. Flag as `broken_vector` and suppress the text layer. + +--- + +## 7. Region-Level Hybrid Detection + +When `image_coverage_fraction` is in `[0.20, 0.80)` and text operators are present, the page requires per-region analysis. + +Partition the page into rectangular regions using the AABBs of image XObjects as boundaries. For each region: + +1. 
Collect all text operator glyph bboxes whose center falls within the region. +2. If glyph count is zero, the region is image-only → assign `ocr`. +3. If glyph count is non-zero, apply character validity rate and bbox plausibility checks to glyphs within the region. If `character_validity_rate >= 0.85` and `implausible_bbox_fraction < 0.20`, assign `vector`. Otherwise assign `ocr`. + +The result is a region routing map: + +```rust +pub struct RegionRoute { + pub bbox: [f32; 4], // [x0, y0, x1, y1] in page coordinates + pub method: ExtractionMethod, +} +``` + +The extraction engine uses this map to run vector extraction over `vector` regions and rasterize + OCR only the image-covered regions, avoiding a full-page OCR pass when much of the page is efficiently extractable via vector. + +--- + +## 8. Confidence Scoring and Routing Decision + +Each classification signal contributes evidence toward four confidence scores: `vector_confidence`, `scan_confidence`, `hybrid_confidence`, and `broken_vector_confidence`. Signals are weighted additively, and the resulting scores select a route: + +| Condition | Route | +|---|---| +| `vector_confidence > 0.80` | `vector` | +| `scan_confidence > 0.80` and `character_validity_rate < 0.20` | `ocr` | +| `broken_vector_confidence > 0.60` | `ocr` (fallback) | +| Both vector and scan confidence moderate | `hybrid` | +| `character_validity_rate ∈ [0.70, 0.85)` | `assisted_ocr` | + +Signal weights (raw additive contributions; each dimension's total is normalized to the range [0, 1] before the thresholds above are applied): + +- No text operators: +0.9 scan +- Invisible text only (`Tr 3`): +0.7 scan, +0.3 scan OCR-layer +- `image_coverage_fraction > 0.80`: +0.6 scan +- `density_ratio > 0.30` and `character_validity_rate > 0.85`: +0.8 vector +- `character_validity_rate < 0.70`: +0.7 broken-vector +- `implausible_bbox_fraction > 0.20`: +0.5 broken-vector +- Adjacent glyph IoU > 0.50 on majority: +0.6 broken-vector, +0.4 scan + +--- + +## 9. Cost-Aware Routing + +OCR is expensive: a Tesseract pass on a 300 DPI A4 raster typically takes 1–3 seconds. 
Cost gating rules: + +If `vector_confidence > 0.80`, skip OCR even on hybrid pages for vector regions. Invoke OCR only for regions where vector extraction failed or produced low-confidence output. For `assisted_ocr` pages (`character_validity_rate ∈ [0.70, 0.85)`), pass partial vector output as bounding-box hints to the OCR engine to reduce search space. + +The `ocr_threshold` configuration parameter sets the `character_validity_rate` below which OCR is attempted. Default is `0.85`; archival document pipelines with predictably broken encodings should lower it to `0.60` or below. + +```toml +[extraction] +ocr_threshold = 0.85 +``` + +--- + +## 10. Output Metadata + +The classifier emits structured metadata for every page, available in the extraction output regardless of which pipeline ran: + +```rust +pub struct PageClassificationMeta { + pub extraction_method: ExtractionMethod, // vector | ocr | hybrid | assisted_ocr + pub vector_confidence: f32, + pub ocr_confidence: f32, + pub classification_signals: Vec<ClassificationSignal>, + pub image_coverage_fraction: f32, + pub text_operator_count: u32, + pub character_validity_rate: f32, + pub region_routes: Option<Vec<RegionRoute>>, // populated for hybrid pages +} + +pub enum ClassificationSignal { + NoTextOperators, + InvisibleTextOnly, + HighImageCoverage, + LowDensityRatio { density_ratio: f32 }, + LowCharacterValidity { rate: f32 }, + ImplausibleGlyphBboxes { fraction: f32 }, + AdjacentGlyphOverlap { fraction: f32 }, + FullPageBackgroundImage, + OcrLayerDetected, +} +``` + +`classification_signals` is an ordered array of the signals that fired, enabling callers to audit classification decisions, build quality dashboards, and tune the `ocr_threshold` based on observed validity rates across their document corpus. + +--- + +## Implementation Notes + +The classifier runs as a single-pass content stream parse. Image XObject AABBs are computed during the same parse as text operator collection, sharing the CTM stack. 
No font decoding is required for fast pre-checks or image coverage stages; font probes for character validity are lazy and terminate early once sufficient evidence accumulates. The region routing map for hybrid pages is computed in a second pass over the accumulated evidence, keeping the first pass allocation-free beyond the glyph accumulator buffer. diff --git a/docs/research/word-boundary-reconstruction.md b/docs/research/word-boundary-reconstruction.md new file mode 100644 index 0000000..e144cb2 --- /dev/null +++ b/docs/research/word-boundary-reconstruction.md @@ -0,0 +1,202 @@ +# Word Boundary Reconstruction + +## Problem Statement + +A substantial fraction of real-world PDFs — especially those produced by TeX/LaTeX toolchains, legacy CAD exporters, and older desktop publishing systems — contain no explicit space characters (U+0020) in their content streams. The visual whitespace between words is produced entirely through glyph positioning arithmetic. When a text extractor naively concatenates glyph-to-Unicode mappings without accounting for positional gaps, every word runs together and the output is unreadable. Reconstructing word boundaries is therefore one of the highest-impact correctness problems in PDF text extraction. + +--- + +## 1. Why Spaces Are Missing + +The PDF content stream model does not require producers to emit space characters. The spec defines word spacing (`Tw`) and character spacing (`Tc`) as graphics state parameters precisely because positioning is expected to substitute for literal space glyphs. + +**TeX/dvips and pdfTeX** operate character-by-character. Each glyph is placed at an absolute or relative position computed by TeX's box-and-glue model. Inter-word glue is converted to a `Td` offset or a positive numeric element inside a `TJ` array; no 0x20 byte ever appears in the string arguments. 
This is by design: TeX fonts often lack a space glyph entirely, and the Type 1 / Type 2 charstring for character code 0x20, if present, has zero advance width. + +**Advance-width substitution** is the general pattern: rather than encoding a space glyph, authoring tools advance the text position by a computed amount equal to the intended inter-word gap, then begin the next word. The result is visually identical to a space but structurally absent from the character stream. + +--- + +## 2. Glyph Advance Width and Position + +Every glyph has an advance width defined in the font's metric tables. In PDF: + +- **Type 1 / TrueType fonts**: the `Widths` array in the font dictionary maps character codes to glyph widths in 1/1000 of the font's em unit. +- **CIDFonts**: the `DW` key provides a default advance width; the `W` key provides per-glyph overrides as a compact run-length encoding. + +After rendering glyph `g` whose advance width is `w_g` (in glyph units), the text position advances to: + +``` +x_next_expected = x_current + (w_g * font_size / 1000) +``` + +If the actual x-position of the following glyph deviates positively from `x_next_expected` by more than a threshold, a gap exists. The magnitude of that gap determines its semantic: a small gap is likely a word space; a larger gap may indicate a sentence boundary, a tab stop, or a column separator. + +--- + +## 3. Computing Expected Position Accurately + +The simplified formula above omits three graphics state parameters that the PDF spec requires to be applied: + +``` +x_next_expected = + x_current + + (w_g / 1000 * font_size + Tc + Tw_if_space) * Tz / 100 +``` + +Where: + +- **`Tc`** (character spacing, set by the `Tc` operator): added to the advance of every glyph. +- **`Tw`** (word spacing, set by the `Tw` operator): added after any single-byte glyph whose character code is 0x20 only. For multi-byte encodings this term never applies. 
+- **`Tz`** (horizontal scaling percentage, set by the `Tz` operator, default 100): scales the entire horizontal advance.
+
+Failure to apply `Tc` and `Tz` causes systematic over- or under-estimation of expected positions and produces false gap detections. The text matrix (set by `Tm`, `Td`, and related operators), combined with the current transformation matrix, must be applied to convert text-space expected positions into device space before comparing with the next glyph's actual device-space coordinates.
+
+---
+
+## 4. The Gap Threshold
+
+The central parameter is the minimum gap magnitude that triggers space insertion. Several strategies exist; an adaptive combination is most robust:
+
+**Fixed fraction of font size.** A gap exceeding `0.2 * font_size` is commonly cited. This works for typical roman typefaces at body text sizes but breaks for narrow condensed faces or for documents that mix font sizes.
+
+**Fraction of average glyph width.** Compute the mean advance width of the glyphs observed on the current text line (excluding outliers). A gap exceeding `0.3 * mean_advance` adapts better to condensed or wide typefaces.
+
+**Font space glyph width.** If the font's `Widths` array contains an entry for character code 0x20, that width (converted to device units as `w_space * font_size / 1000`) is the canonical space reference. This is the most accurate signal when available.
+
+**Fallback half-em.** When no space glyph is defined, use 500 glyph units (half the em) as the reference width: `0.5 * font_size`.
+
+**Adaptive histogram method.** Collect all observed inter-glyph gaps on a page. The distribution is typically bimodal: a sharp peak near zero (tight kerning pairs) and a broader peak near the space width. Fit or locate these two peaks; use the valley between them as the threshold. This requires a sufficient number of gaps (at least ~50) to be reliable and can be computed incrementally per font-size class.
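The histogram method described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `histogram_threshold`, the constants `MIN_GAPS`, `EXCLUDE_BINS`, and `MAX_BINS`, and the choice to take the middle of the least-populated bins as the valley are all assumptions made for the example.

```rust
/// Estimate a space-insertion threshold from observed inter-glyph gaps
/// (all in the same unit, e.g. device-space points).
/// Returns `None` when there are too few samples or no second peak exists.
pub fn histogram_threshold(gaps: &[f32], bin_width: f32) -> Option<f32> {
    const MIN_GAPS: usize = 50; // assumed minimum sample count
    const EXCLUDE_BINS: usize = 3; // bins masked around the first peak
    const MAX_BINS: usize = 4096; // assumed cap to bound memory use

    if gaps.len() < MIN_GAPS || !(bin_width > 0.0) {
        return None;
    }
    // Negative gaps are backtracks, not candidate spaces; drop them and NaNs.
    let usable: Vec<f32> = gaps
        .iter()
        .copied()
        .filter(|g| g.is_finite() && *g >= 0.0)
        .collect();
    let max_gap = usable.iter().copied().fold(0.0_f32, f32::max);
    let n_bins = ((max_gap / bin_width) as usize)
        .saturating_add(1)
        .min(MAX_BINS);

    let mut hist = vec![0u32; n_bins];
    for g in &usable {
        let bin = ((g / bin_width) as usize).min(n_bins - 1);
        hist[bin] += 1;
    }

    // First peak: the global maximum (usually the near-zero kerning peak).
    let first = argmax(&hist)?;
    // Mask the first peak and its neighbours, then look for the second peak.
    let mut masked = hist.clone();
    let lo = first.saturating_sub(EXCLUDE_BINS);
    let hi = (first + EXCLUDE_BINS).min(n_bins - 1);
    for count in &mut masked[lo..=hi] {
        *count = 0;
    }
    let second = argmax(&masked)?;

    // Valley: the middle of the least-populated bins between the two peaks.
    let (a, b) = if first < second { (first, second) } else { (second, first) };
    let min_count = (a + 1..b).map(|i| hist[i]).min()?;
    let lows: Vec<usize> = (a + 1..b).filter(|&i| hist[i] == min_count).collect();
    let valley = lows[lows.len() / 2];
    Some((valley as f32 + 0.5) * bin_width)
}

/// Index of the largest non-zero bin, if any.
fn argmax(hist: &[u32]) -> Option<usize> {
    let (idx, &count) = hist.iter().enumerate().max_by_key(|&(_, count)| *count)?;
    if count == 0 { None } else { Some(idx) }
}
```

A caller would typically feed this one page's worth of gaps for a single font-size class and fall back to a metric-based threshold whenever it returns `None`.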
+
+In practice, use the font space glyph width when available, fall back to the adaptive histogram when sufficient data exists, and use `0.25 * font_size` otherwise.
+
+---
+
+## 5. TJ Operator Kerning Arrays
+
+The `TJ` operator accepts an array whose elements are byte strings and numbers. Per the PDF spec, a numeric element is expressed in thousandths of a text space unit and is subtracted from the current position: the displacement is `-(offset / 1000) * font_size` in text space, further scaled by `Tz` for horizontal writing. A **negative** numeric element therefore moves the position forward (adds a gap); a **positive** element moves it backward (tightens the pair). TeX-generated PDFs use small offsets of either sign for kerning between adjacent letters and large negative offsets (typically below −250 in 1000-unit space) to implement word separation.
+
+The space-detection rule for `TJ` numeric elements:
+
+```
+if offset < -space_threshold_in_glyph_units {
+    insert_space()
+}
+```
+
+Where `space_threshold_in_glyph_units` maps the device-space threshold back to 1000-unit glyph space: `threshold_device * 1000 / font_size`. TeX-generated PDFs commonly use offsets around −250 to −350 to represent a normal inter-word space in a 1000-unit font. Treat each numeric element that sits between two string elements as a potential gap site.
+
+---
+
+## 6. Td/TD/Tm Positioning
+
+When the PDF content stream transitions between text positioning commands, the text matrix changes. Relevant operators:
+
+- **`Td tx ty`**: moves the text line position by `(tx, ty)` in text space.
+- **`TD tx ty`**: same as `Td` but also sets `TL = -ty`.
+- **`Tm a b c d e f`**: sets the text matrix directly.
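The effect of the three positioning operators listed above on the text matrix and text line matrix can be sketched as below. The type and method names (`Matrix`, `TextPos`, `pre_translate`, `td_upper`, `set_tm`) are hypothetical and chosen for this example only; the arithmetic follows the PDF convention in which `Td` pre-multiplies the line matrix by a translation.

```rust
/// A PDF affine matrix [a b c d e f] (row-vector convention).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Matrix {
    pub a: f32,
    pub b: f32,
    pub c: f32,
    pub d: f32,
    pub e: f32,
    pub f: f32,
}

impl Matrix {
    pub const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Returns `translate(tx, ty) x self`: the offset is interpreted in
    /// this matrix's own (text) space, as `Td` requires.
    pub fn pre_translate(self, tx: f32, ty: f32) -> Matrix {
        Matrix {
            e: tx * self.a + ty * self.c + self.e,
            f: tx * self.b + ty * self.d + self.f,
            ..self
        }
    }
}

/// Text matrix, text line matrix, and leading for one text object.
pub struct TextPos {
    pub tm: Matrix,
    pub tlm: Matrix,
    pub leading: f32,
}

impl TextPos {
    /// State at `BT`: both matrices reset to identity.
    /// (Leading is really part of the text state and persists across
    /// text objects; it starts at 0 here only to keep the sketch small.)
    pub fn begin_text() -> Self {
        TextPos { tm: Matrix::IDENTITY, tlm: Matrix::IDENTITY, leading: 0.0 }
    }

    /// `Td tx ty`: move to the start of the next line, offset by (tx, ty).
    pub fn td(&mut self, tx: f32, ty: f32) {
        self.tlm = self.tlm.pre_translate(tx, ty);
        self.tm = self.tlm;
    }

    /// `TD tx ty`: same as `Td`, but also sets the leading to `-ty`.
    pub fn td_upper(&mut self, tx: f32, ty: f32) {
        self.leading = -ty;
        self.td(tx, ty);
    }

    /// `Tm a b c d e f`: replace both matrices outright.
    pub fn set_tm(&mut self, m: Matrix) {
        self.tm = m;
        self.tlm = m;
    }
}
```

Comparing `tm.e` (after mapping through the current transformation matrix) with the end position of the previous glyph yields the horizontal jump that the rules in this section classify.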
+
+Between consecutive text-showing operators (`Tj`, `TJ`, `'`, `"`), if the text matrix changes such that the new horizontal position in device space exceeds `x_last_glyph_end` by more than the space threshold, insert a space.
+
+Rules:
+
+- **Positive horizontal jump** (new x > expected x by threshold): insert a space.
+- **Negative horizontal jump** (new x < expected x): do not insert a space; this is a backtrack, indicating overlapping text, a correction, a superscript/subscript, or right-to-left text reordering. Log as a `backtrack` event in debug metadata.
+- **Jump between `BT`/`ET` blocks**: treat the start of each new text object as a potential word boundary using the same threshold rule, comparing the new block's starting position to the ending position of the last glyph from the previous block.
+
+---
+
+## 7. Vertical Gap Interpretation
+
+A change in the y-coordinate of the text position signals a line change rather than a word gap. The threshold:
+
+```
+if abs(delta_y) > 0.5 * line_height {
+    emit_line_break()
+}
+```
+
+Where `line_height` is the explicit leading `TL` when one is set, and otherwise is approximated as `1.2 * font_size`. A vertical gap exceeding approximately 1.5× the line height with no intervening content suggests a paragraph break.
+
+Output conventions:
+
+- **Line break**: emit `\n`.
+- **Paragraph break**: emit `\n\n`.
+- **Continuation on same line after vertical micro-adjustment** (|Δy| < 0.1 × font_size): treat as same line, no break; this covers subscript/superscript corrections.
+
+Avoid inserting a horizontal space when a vertical line break is also emitted, as the two are mutually exclusive for a given gap event.
+
+---
+
+## 8. Font-Specific Space Width
+
+The space threshold must be font-local. A narrow condensed typeface may have an inter-word space of only 150 glyph units (15% of em), while a wide serif face may use 350 units (35%).
Using a global threshold produces both false positives (splitting words at loosely spaced letter pairs in wide faces) and false negatives (missing spaces in condensed faces).
+
+Resolution strategy (in priority order):
+
+1. Look up character code 0x20 in the font's `Widths` array. If present and nonzero, use it.
+2. For CIDFonts, look up the CID that the font's CMap assigns to the space character in the `W` array, then fall back to `DW`. The space CID is not necessarily 0x0020; it depends on the character collection.
+3. Consult the font's `FontDescriptor` for `MissingWidth`; if the space glyph is absent, this is the width assigned to character codes missing from `Widths` (often useful as a lower bound).
+4. If all metrics are absent, use 500 glyph units as the default half-em heuristic.
+5. When steps 1 and 2 yield no usable width, prefer the adaptive histogram estimate over the defaults in steps 3 and 4, provided ≥50 inter-glyph gaps are available for the current font at the current nominal size.
+
+Cache the resolved space width per `(font_object_id, font_size)` pair to avoid redundant lookups per glyph. Resource names such as `/F1` are scoped to a resource dictionary and may refer to different fonts on different pages, so they are not a safe cache key on their own.
+
+---
+
+## 9. Multi-Column Gap vs. Word Gap
+
+A horizontal gap exceeding approximately `2 * font_size` in device space on the same baseline is not a word gap; it is a tab stop, column separator, or layout gutter. Inserting a space at such a site produces a run of text that incorrectly merges content from separate columns.
+
+Detection heuristic: if `delta_x > 2.0 * font_size` and `abs(delta_y) < 0.1 * font_size`, classify the gap as a **layout gap** rather than a word gap. The appropriate response depends on the layout mode:
+
+- In single-column mode: preserve as a sequence of tab characters or whitespace (extractor-configuration-dependent).
+- In multi-column mode: treat as a column boundary and do not concatenate the two spans into the same text run at this point; defer ordering to the reading-order algorithm.
+
+This decision point integrates with the column detection logic described in `complex-layout-reading-order.md`.
The word-boundary reconstructor should expose the gap classification (`word_gap`, `layout_gap`, `line_break`, `paragraph_break`) in its span metadata so that the layout stage can consume it without re-deriving it.
+
+---
+
+## 10. Output and Configuration
+
+**Inferred space tagging.** Explicitly encoded space glyphs (character code 0x20 present in the stream) and inferred spaces (inserted by gap detection) must be distinguishable in the intermediate representation. Each inferred space span carries `inferred: true` in its debug metadata. This enables downstream consumers to audit false positives without reprocessing the PDF.
+
+**Configuration parameter: `space_detection_threshold`.** Expose a per-extractor configuration value:
+
+```rust
+pub enum SpaceThreshold {
+    /// Automatically select per font using the priority strategy above.
+    Auto,
+    /// Fixed fraction of font size (e.g., 0.25).
+    FractionOfFontSize(f32),
+    /// Absolute value in device-space points.
+    AbsolutePoints(f32),
+}
+```
+
+Default: `SpaceThreshold::Auto`. When `Auto`, the extractor uses font metric lookups with adaptive histogram fallback. Callers processing documents where every inter-word gap is explicit can set `SpaceThreshold::AbsolutePoints(f32::MAX)` to disable inference entirely.
+
+**Per-page statistics.** The `PageOutput` structure exposes:
+
+```rust
+pub struct PageSpaceStats {
+    pub explicit_space_count: u32,
+    pub inferred_space_count: u32,
+    pub backtrack_event_count: u32,
+    pub layout_gap_count: u32,
+}
+```
+
+A high `inferred_space_count` relative to `explicit_space_count` (ratio > 5:1) is a strong signal that the document was produced by a TeX toolchain or a similarly space-omitting authoring system. This signal can inform downstream heuristics such as ligature normalization and hyphenation handling.
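One way the configuration value and the per-page statistics above might be consumed is sketched here. The helpers `resolve_threshold` and `looks_space_omitting`, and the `auto_estimate` parameter standing in for whatever the `Auto` strategy produces for the current font, are assumptions for illustration; the two types are repeated only so the example is self-contained.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SpaceThreshold {
    Auto,
    FractionOfFontSize(f32),
    AbsolutePoints(f32),
}

pub struct PageSpaceStats {
    pub explicit_space_count: u32,
    pub inferred_space_count: u32,
    pub backtrack_event_count: u32,
    pub layout_gap_count: u32,
}

/// Resolve the configured threshold to device-space points.
/// `auto_estimate` is the per-font result of the `Auto` strategy, if any
/// (hypothetical parameter); without it, fall back to 0.25 * font_size.
pub fn resolve_threshold(
    cfg: SpaceThreshold,
    font_size: f32,
    auto_estimate: Option<f32>,
) -> f32 {
    match cfg {
        SpaceThreshold::Auto => auto_estimate.unwrap_or(0.25 * font_size),
        SpaceThreshold::FractionOfFontSize(frac) => frac * font_size,
        SpaceThreshold::AbsolutePoints(pts) => pts,
    }
}

/// True when inferred spaces outnumber explicit ones by more than 5:1.
/// Written without division so that zero explicit spaces is handled.
pub fn looks_space_omitting(stats: &PageSpaceStats) -> bool {
    let inferred = u64::from(stats.inferred_space_count);
    let explicit = u64::from(stats.explicit_space_count);
    inferred > 0 && inferred > 5 * explicit
}
```

With `AbsolutePoints(f32::MAX)` the resolved threshold is never exceeded by a finite gap, which matches the "disable inference" usage described above.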
+
+---
+
+## Implementation Notes for Rust
+
+- Maintain a `TextState` struct that tracks `Tc`, `Tw`, `Tz`, `font_size`, `text_matrix`, and `line_matrix` as mutable text state, updated by the corresponding PDF operators.
+- After each glyph is rendered, record `glyph_end_x` (device space) as `glyph_start_x + advance_device`.
+- Before rendering the next glyph, compute `expected_x` from the full formula including `Tc` and `Tz`; compare actual x to `expected_x`; classify and emit gap events.
+- For `TJ` arrays, iterate elements in order; accumulate string runs and emit a gap event at each numeric element that is negative beyond the threshold, before consuming the next string run.
+- Store the font space width cache in a `HashMap<(ObjectId, OrderedFloat<f32>), f32>` keyed by font object ID and nominal font size to handle fonts used at multiple sizes.
+- The adaptive histogram should bucket gaps into bins of width `0.01 * font_size` and perform a simple two-peak scan (find the global maximum, zero out ±3 bins, find the second maximum) to locate the space-width peak without a full GMM fit.
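As a companion to these notes, the advance formula from section 3 and the `TJ` rule from section 5 can be sketched together. Everything here is illustrative: the reduced `TextState` carries only the fields the arithmetic needs, `TjElement` holds already-decoded text rather than raw bytes, and the function names are hypothetical.

```rust
/// Reduced text state: just the parameters the advance formula needs.
pub struct TextState {
    pub tc: f32,        // character spacing (Tc)
    pub tw: f32,        // word spacing (Tw)
    pub tz: f32,        // horizontal scaling in percent (Tz), default 100
    pub font_size: f32, // font size from Tf
}

impl TextState {
    /// Horizontal advance in text space for one glyph:
    /// (w / 1000 * font_size + Tc + Tw_if_space) * Tz / 100.
    /// `width_units` is the glyph width in 1/1000 em.
    pub fn advance(&self, width_units: f32, is_single_byte_space: bool) -> f32 {
        let tw = if is_single_byte_space { self.tw } else { 0.0 };
        (width_units / 1000.0 * self.font_size + self.tc + tw) * self.tz / 100.0
    }

    /// Displacement caused by a numeric `TJ` element:
    /// -(n / 1000) * font_size * Tz / 100. Negative `n` moves forward.
    pub fn tj_displacement(&self, n: f32) -> f32 {
        -(n / 1000.0) * self.font_size * self.tz / 100.0
    }
}

/// One element of a `TJ` array. Text is assumed to be decoded already,
/// which a real extractor would do through the font's encoding.
pub enum TjElement<'a> {
    Text(&'a str),
    Offset(f32),
}

/// Concatenate the strings of a `TJ` array, inserting a space wherever a
/// numeric element is more negative than the threshold (in glyph units).
pub fn tj_to_text(elements: &[TjElement<'_>], space_threshold_units: f32) -> String {
    let mut out = String::new();
    for el in elements {
        match el {
            TjElement::Text(s) => out.push_str(s),
            TjElement::Offset(n) => {
                let opens_gap = *n < -space_threshold_units;
                // Skip leading gaps and avoid doubling an existing space.
                if opens_gap && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

A full implementation would also map positions through the text and transformation matrices and tag each inserted space as inferred; this sketch only shows the per-element arithmetic and the sign convention.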