From eac323529146f8fc848ad2b1e767ba7ad1e5b41e Mon Sep 17 00:00:00 2001 From: jedarden Date: Sat, 16 May 2026 15:35:48 -0400 Subject: [PATCH] Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs Four new extraction research documents covering text rendering modes (Tr 0-7 including invisible OCR layers), legal/financial document extraction patterns, character-level confidence aggregation with output schema, and PDF/E engineering document handling (CAD, GD&T, schematics). Co-Authored-By: Claude Sonnet 4.6 --- .../confidence-scoring-and-aggregation.md | 163 ++++++++++++++++++ .../engineering-document-extraction.md | 63 +++++++ .../legal-and-financial-pdf-patterns.md | 89 ++++++++++ docs/research/stroke-and-outlined-text.md | 91 ++++++++++ 4 files changed, 406 insertions(+) create mode 100644 docs/research/confidence-scoring-and-aggregation.md create mode 100644 docs/research/engineering-document-extraction.md create mode 100644 docs/research/legal-and-financial-pdf-patterns.md create mode 100644 docs/research/stroke-and-outlined-text.md diff --git a/docs/research/confidence-scoring-and-aggregation.md b/docs/research/confidence-scoring-and-aggregation.md new file mode 100644 index 0000000..c974c48 --- /dev/null +++ b/docs/research/confidence-scoring-and-aggregation.md @@ -0,0 +1,163 @@ +# Character-Level Confidence Scoring and Span Aggregation + +## Why Per-Character Confidence Matters + +PDF text extraction is not a uniform process. Within a single span — a contiguous run of characters sharing the same font, size, and rendering mode — individual characters may originate from completely different recovery paths. One glyph resolves cleanly through a ToUnicode entry. The adjacent glyph has no ToUnicode mapping and is recovered by AGL lookup from the glyph name. A third glyph in the same word has no name and is matched by shape fingerprint with a similarity of 0.71. A fourth falls through entirely to OCR. 
+ +If confidence is tracked only at the word or span level, this heterogeneity is invisible to consumers. A span reported as "medium confidence" may contain a mix of high-confidence characters and completely guessed ones. A word-level score papers over exactly the glyphs most likely to contain extraction errors. Per-character confidence preserves the information needed to reconstruct which specific positions in the output are reliable and which are not, enabling downstream consumers — search indexers, entity recognizers, document QA systems — to weight their processing accordingly. + +## Confidence Sources and Their Native Granularity + +Each extraction path exposes confidence at a different granularity, and the aggregation strategy must account for the mismatch between the native signal and the character level at which pdftract operates. + +**ToUnicode** is binary: a code point either has a ToUnicode entry or it does not. When a mapping is present and valid, the character is assigned a confidence of `1.0`. When absent, the character falls to the next recovery path. The confidence is per code point, which maps cleanly to per character. + +**AGL (Adobe Glyph List) recovery** uses the glyph name embedded in the font. A successful AGL lookup — for example, `fi` resolving to U+FB01 or `Agrave` resolving to U+00C0 — is assigned `0.95`. The name lookup is deterministic and the AGL is exhaustively specified, so near-full confidence is warranted. The small discount from 1.0 accounts for fonts that reuse AGL names for custom glyphs (a known pathology in older PDF producers). + +**Shape fingerprint matching** produces a continuous similarity score in `[0.0, 1.0]`. The score is derived from the cosine similarity of normalized contour feature vectors, optionally combined with aspect ratio and stroke width penalties. This score is the most informative raw signal in the pipeline and maps directly to per-character confidence without transformation. 
+
+**Tesseract HOCR** reports `x_wconf` at the word level, as an integer in `[0, 100]`, normalized to `[0.0, 1.0]` by dividing by 100. Character-level confidence is not available in Tesseract's default HOCR output. Per-character confidence within an OCR-recovered word is therefore uniform: every character in the word receives the same score as the word. This is a known limitation and is flagged in the `confidence_source` field so consumers can interpret the score accordingly.
+
+**Synthetic characters** (spaces inserted at gap thresholds, hyphens inferred from line geometry, soft hyphens suppressed during normalization) are assigned `1.0` if the structural inference is deterministic, or a configurable value (default `0.85`) when heuristic.
+
+## Aggregating Character Confidence to Word Confidence
+
+Given per-character confidences `c_1, c_2, ..., c_n` for the `n` characters in a word, three aggregation strategies are worth considering: minimum, arithmetic mean, and harmonic mean.
+
+The **minimum** is maximally conservative: the word is only as reliable as its least reliable character. This is appropriate for applications where a single wrong character invalidates a token (entity recognition, numeric extraction).
+
+The **arithmetic mean** is the conventional choice and gives equal weight to each character position. For a five-character word with four ToUnicode characters (`c = 1.0`) and one shape match (`c = 0.71`), the mean is `4.71 / 5 = 0.942`.
+
+The **harmonic mean** is defined as:
+
+```
+H(c_1..n) = n / Σ(1/c_i)
+```
+
+For the same example: `5 / (4×(1/1.0) + 1/0.71) = 5 / (4 + 1.408) = 5 / 5.408 ≈ 0.924`.
+
+The harmonic mean penalizes outliers more aggressively than the arithmetic mean. One very low confidence character has an outsized downward pull because `1/c_i` grows rapidly as `c_i` approaches zero. This property is desirable: a word where one glyph is a shape-match guess with similarity `0.40` should not receive a word confidence close to `1.0` merely because the other characters are clean. pdftract uses the harmonic mean as its default word-level aggregation, with minimum and arithmetic mean available as configuration options.
+
+## Aggregating Word Confidence to Span and Block Confidence
+
+Word confidence scores are aggregated to span confidence using a character-count-weighted mean, not a word-count-weighted mean. Words vary in length, and a two-character word and a twelve-character word should not contribute equally to the span score. The formula is:
+
+```
+span_confidence = Σ(word_confidence_i × char_count_i) / Σ(char_count_i)
+```
+
+Block confidence applies the same formula across all spans in the block, weighted by the character count of each span.
+
+## The `confidence` Field on a Span
+
+The `confidence` field on a span is an `f32` in `[0.0, 1.0]`. It represents the character-count-weighted harmonic-mean aggregation of per-character confidence scores across all words in the span. A value of `1.0` means every character in the span was recovered via ToUnicode. A value near `0.0` means the span is effectively unextractable by non-OCR paths and OCR itself returned low confidence.
+
+Downstream consumers should treat this as an estimate of the probability that the extracted text accurately represents the source glyphs, not as a strict probability of correctness. The field is always present; it is never `null` or omitted.
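As a rough illustration of the two aggregation steps described above, the following Rust sketch computes a harmonic-mean word score and a character-count-weighted span score. The function names `harmonic_mean` and `span_confidence`, the `Option` return for empty input, and the handling of a zero confidence are assumptions made for this example, not pdftract's actual API.

```rust
/// Harmonic mean of per-character confidences: n / Σ(1/c_i).
/// Returns `None` for an empty word (assumption of this sketch).
/// A confidence of zero forces the result to 0.0, which also avoids
/// dividing by zero.
pub fn harmonic_mean(confs: &[f32]) -> Option<f32> {
    if confs.is_empty() {
        return None;
    }
    if confs.iter().any(|&c| c <= 0.0) {
        return Some(0.0);
    }
    let reciprocal_sum: f32 = confs.iter().map(|&c| 1.0 / c).sum();
    Some(confs.len() as f32 / reciprocal_sum)
}

/// Character-count-weighted mean of word confidences.
/// Each item is `(word_confidence, char_count)`.
/// Returns `None` when the span has no characters (assumption).
pub fn span_confidence(words: &[(f32, usize)]) -> Option<f32> {
    let total_chars: usize = words.iter().map(|&(_, n)| n).sum();
    if total_chars == 0 {
        return None;
    }
    let weighted: f32 = words.iter().map(|&(c, n)| c * n as f32).sum();
    Some(weighted / total_chars as f32)
}
```

With the five-character example from the text, `harmonic_mean(&[1.0, 1.0, 1.0, 1.0, 0.71])` evaluates to roughly 0.924. In `span_confidence`, a two-character word contributes one sixth of the weight of a twelve-character word.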
+
+```rust
+pub struct Span {
+    pub text: String,
+    /// Aggregated confidence in `[0.0, 1.0]`; always present.
+    pub confidence: f32,
+    /// Extraction path behind the score (`Mixed` if more than one).
+    pub confidence_source: ConfidenceSource,
+    pub bbox: Rect,
+    pub font_name: Option<String>,
+    pub font_size: f32,
+}
+```
+
+## Confidence Tiers
+
+Four tiers are defined for reporting and CLI output:
+
+| Tier | Range | Typical extraction path |
+|---|---|---|
+| High | ≥ 0.95 | ToUnicode or full AGL coverage |
+| Medium | 0.70 to < 0.95 | Partial AGL, shape fingerprint matches ≥ 0.80 |
+| Low | 0.40 to < 0.70 | Shape fingerprint matches 0.40–0.79, OCR on clean scans |
+| Unextractable | < 0.40 | OCR on degraded scans, no viable shape match |
+
+These boundaries are not arbitrary. The `0.95` high-confidence floor matches the confidence assigned to AGL recovery, so a span recovered entirely through ToUnicode and AGL always qualifies as high-confidence, while spans that rely substantially on shape matching or OCR fall below it. The `0.40` unextractable floor corresponds to shape match similarity below which empirical error rates exceed 30% in validation against ground-truth corpora.
+
+## The `confidence_source` Enum
+
+The `confidence` scalar alone is insufficient for downstream interpretation. A score of `0.85` on a span that is mostly ToUnicode (which is binary, so this would indicate a word with some non-mapped characters that fell to lower-confidence recovery paths) means something different from `0.85` from Tesseract HOCR, where the score is word-level and characters within may vary unpredictably. The `confidence_source` field identifies the dominant source:
+
+```rust
+pub enum ConfidenceSource {
+    ToUnicode,
+    AGL,
+    ShapeMatch,
+    OCR,
+    Synthetic,
+    Mixed, // multiple sources within the span
+}
+```
+
+`Mixed` is reported when a span contains characters from more than one source. Consumers that require uniform provenance can split or filter on `Mixed` spans. The field is a string enum in JSON output: `"to_unicode"`, `"agl"`, `"shape_match"`, `"ocr"`, `"synthetic"`, `"mixed"`.
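The tier table and the `Mixed` rule can be sketched together as follows. This is a minimal, self-contained illustration: the `Tier` enum and the `tier_for`, `span_source`, and `json_name` names are hypothetical, the enum is redeclared here with derives so the block compiles on its own, and the tier boundaries are read as half-open ranges, which is an assumption about how values between `0.94` and `0.95` should be classified.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    High,
    Medium,
    Low,
    Unextractable,
}

/// Map a confidence score to a reporting tier (half-open ranges assumed).
pub fn tier_for(confidence: f32) -> Tier {
    if confidence >= 0.95 {
        Tier::High
    } else if confidence >= 0.70 {
        Tier::Medium
    } else if confidence >= 0.40 {
        Tier::Low
    } else {
        Tier::Unextractable
    }
}

// Local copy of the enum from the text, with derives added for comparison.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceSource {
    ToUnicode,
    AGL,
    ShapeMatch,
    OCR,
    Synthetic,
    Mixed,
}

impl ConfidenceSource {
    /// String form used in JSON output, as listed in the text.
    pub fn json_name(self) -> &'static str {
        match self {
            ConfidenceSource::ToUnicode => "to_unicode",
            ConfidenceSource::AGL => "agl",
            ConfidenceSource::ShapeMatch => "shape_match",
            ConfidenceSource::OCR => "ocr",
            ConfidenceSource::Synthetic => "synthetic",
            ConfidenceSource::Mixed => "mixed",
        }
    }
}

/// Span-level source: the shared source if every character agrees,
/// `Mixed` otherwise, `None` for an empty span (assumption).
pub fn span_source(chars: &[ConfidenceSource]) -> Option<ConfidenceSource> {
    let first = *chars.first()?;
    if chars.iter().all(|&s| s == first) {
        Some(first)
    } else {
        Some(ConfidenceSource::Mixed)
    }
}
```

A consumer that needs uniform provenance could call `span_source` on the per-character sources and skip any span that comes back as `Mixed`.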
+ +## Propagating Confidence Through Normalization + +Text normalization — ligature expansion, hyphenation rejoining, diacritic composition — transforms the extracted character sequence after per-character confidence is established. The confidence propagation rule is conservative: the output character inherits the minimum confidence of all input characters that contributed to it. + +When `fi` (U+FB01) is expanded to `f` + `i`, both output characters inherit the ligature's confidence. When two lines joined by a soft hyphen are rejoined into a single token (`con-\nfidence` → `confidence`), the rejoined word's confidence is `min(line1_word_conf, line2_word_conf)`. When precomposed diacritics are synthesized from base character plus combining mark, the composed character's confidence is the minimum of the two components. + +This conservative rule ensures normalization never inflates confidence. A consumer relying on the confidence score receives a lower bound that accounts for all transformations applied to produce the final text. + +## Confidence in JSON Output + +The JSON output schema includes confidence fields at three levels: + +```json +{ + "pages": [ + { + "page_number": 1, + "confidence_summary": { + "mean": 0.91, + "min": 0.43, + "high_pct": 0.72, + "medium_pct": 0.18, + "low_pct": 0.08, + "unextractable_pct": 0.02 + }, + "blocks": [ + { + "confidence": 0.94, + "spans": [ + { + "text": "example", + "confidence": 0.94, + "confidence_source": "to_unicode" + } + ] + } + ] + } + ], + "document_confidence": { + "mean": 0.89, + "estimated_cer": 0.03 + } +} +``` + +`confidence` at the span level is the primary field. Block-level `confidence` is the character-count-weighted mean of its spans. Page-level `confidence_summary` contains `mean`, `min`, and tier percentage breakdowns (`high_pct`, `medium_pct`, `low_pct`, `unextractable_pct`), each representing the fraction of characters (by count) falling into that tier. 
Document-level `document_confidence` includes an `estimated_cer` (Character Error Rate estimate) derived from the inverse of mean confidence with an empirical calibration factor. + +## Using Confidence for Extraction Quality Reporting + +The CLI `--report` flag emits a structured quality summary after extraction. At the page level, a histogram of confidence bins (10 bins from 0.0 to 1.0) provides a visual distribution of extraction quality. Pages dominated by the 0.0–0.40 bin signal heavy OCR reliance on degraded content and should trigger a warning. + +The document-level CER estimate is computed as: + +``` +estimated_cer = 1.0 - document_mean_confidence × calibration_factor +``` + +where `calibration_factor` is `0.92` by default, derived from validation against documents with ground-truth transcriptions. This estimate is informational and carries a disclaimer in CLI output. + +Threshold-based warnings are emitted when: +- Any page has `unextractable_pct > 0.10` (more than 10% of characters unextractable) +- Document mean confidence falls below `0.70` (Medium tier boundary) +- Any span has `confidence_source = "ocr"` and `confidence < 0.50` + +These warnings are machine-readable in JSON report mode and human-readable in plain text mode, giving integrators a clear signal to route documents through enhanced OCR pipelines or flag them for manual review. diff --git a/docs/research/engineering-document-extraction.md b/docs/research/engineering-document-extraction.md new file mode 100644 index 0000000..eb59f97 --- /dev/null +++ b/docs/research/engineering-document-extraction.md @@ -0,0 +1,63 @@ +# Engineering Document PDF Extraction + +## PDF/E and the Engineering PDF Landscape + +PDF/E-1 (ISO 24517-1) is a PDF 1.6 conformance level designed specifically for the exchange of engineering documents. Beyond the baseline PDF 1.6 feature set, PDF/E-1 mandates or restricts several capabilities relevant to extraction. 
It requires that all fonts be embedded, eliminating the ambiguity of system font substitution that plagues general-purpose PDF extraction. It prohibits encryption that would prevent conforming readers from rendering content, which means a conforming PDF/E file should always be extractable without decryption barriers. It also defines a formal attachment model that permits embedded 3D content streams in either U3D (Universal 3D) or PRC (Product Representation Compact) format, attached via `RichMedia` annotations or the `3D` annotation type introduced in PDF 1.6. + +The critical distinction for an extraction library is that 3D geometry embedded in these annotations is binary format geometry — vertices, surfaces, B-rep topology, material properties — not text. The annotation itself may carry a text component: an `AP` (appearance stream) that renders a 2D projection or placeholder, a `Contents` entry with a label, and `Measure` dictionaries that can include numeric values and unit strings. These annotation-level text components are legitimate extraction targets. The geometry data stream itself is not. A correct extraction strategy treats 3D annotation content entries and their associated measurement labels as first-class text, while explicitly ignoring the binary 3D stream payload. + +PRC-embedded metadata warrants a separate note. PRC files may contain a product structure tree with assembly names, part names, and attribute strings. When PRC data is embedded as a file attachment (rather than an inline stream), the attachment filename and any `/EmbeddedFile` metadata fields are extractable as document metadata, though the internal PRC tree requires a PRC parser outside the scope of text extraction. + +## Engineering Drawing Structure as an Extraction Model + +A well-structured engineering drawing follows conventions that, when understood, transform extraction from a spatial guessing game into a structured parse. 
The title block — universally located in the lower-right corner of the sheet — contains a bounded set of labeled fields: document or drawing number, sheet number, revision level, scale, drawn-by, checked-by, approved-by, and date. These fields are vector text rendered in a fixed spatial region. An extraction pass that identifies the lower-right quadrant of a landscape page and groups text clusters within it can reliably reconstruct the title block as structured key-value pairs rather than a stream of isolated glyphs. + +Notes and callouts are positioned throughout the drawing field. Callout text typically appears at the endpoint of a leader line — a graphical path with an arrowhead at the geometry end and text at the annotation end. The text endpoint is the extraction target. Leader line paths in PDF are drawn as graphics operators (`m`, `l`, curve operators) and carry no inherent connection to the text they point to. Spatial proximity is the only available signal: the text cluster nearest to the non-arrowhead end of a leader path is the callout label for that leader. Extraction must preserve these as spatially-associated pairs rather than treating the text as free-floating. + +The bill of materials (BOM) table and the revision history block are the two most structured text regions in a typical drawing. The BOM lists item numbers, part numbers, quantities, descriptions, and often material specifications in a tabular grid. The revision block records revision letter, date, description, and approval initials in a separate table, usually stacked in the lower-right corner above or beside the title block. Both must be extracted as tables — row and column structure intact — not as linear text streams. Line segment detection (horizontal and vertical strokes forming cell boundaries) combined with text clustering within each cell provides the correct reconstruction. 
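One way to picture the grid reconstruction for a BOM or revision table is the sketch below. It assumes the cell boundaries have already been recovered from horizontal and vertical strokes and are given as sorted coordinate lists, and that y grows downward (PDF user space would need flipping first). `TextCluster`, `fill_grid`, and `interval_index` are hypothetical names for illustration only.

```rust
/// A clustered run of text with the center point of its bounding box.
pub struct TextCluster {
    pub text: String,
    pub cx: f32,
    pub cy: f32,
}

/// Index of the half-open interval [bounds[i], bounds[i+1]) containing `v`.
fn interval_index(bounds: &[f32], v: f32) -> Option<usize> {
    bounds.windows(2).position(|w| v >= w[0] && v < w[1])
}

/// Place each cluster into the cell that contains its center.
/// `col_bounds` and `row_bounds` are sorted ascending; clusters falling
/// outside the grid are ignored. Multiple clusters in one cell are joined
/// with a space in input order.
pub fn fill_grid(
    col_bounds: &[f32],
    row_bounds: &[f32],
    clusters: &[TextCluster],
) -> Vec<Vec<String>> {
    let cols = col_bounds.len().saturating_sub(1);
    let rows = row_bounds.len().saturating_sub(1);
    let mut grid = vec![vec![String::new(); cols]; rows];
    for c in clusters {
        let row = interval_index(row_bounds, c.cy);
        let col = interval_index(col_bounds, c.cx);
        if let (Some(r), Some(k)) = (row, col) {
            let cell = &mut grid[r][k];
            if !cell.is_empty() {
                cell.push(' ');
            }
            cell.push_str(&c.text);
        }
    }
    grid
}
```

The output keeps row and column structure intact, so a header row can later be treated as the schema and each following row as a record.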
+
+## CAD-to-PDF Conversion Artifacts
+
+CAD systems produce PDF through an internal rendering pipeline that converts model annotations, dimensions, and symbols to PDF content streams. This conversion is frequently lossy in ways that complicate extraction. Exploded dimension text is the most common artifact: a linear dimension that appears to a human as a single object (say, `24.500 ±0.005`) may be stored in the PDF as three separate text objects at three separate positions: the nominal value, the tolerance value, and the unit string, each placed relative to the dimension line geometry. An extraction that simply serializes glyphs in reading order may interleave these fragments with other nearby text, producing output like `24.500 R0.375 ±0.005 [4×]`.
+
+Recovering exploded dimension text requires recognizing that dimension annotation components cluster tightly around a dimension line path, that their bounding boxes often overlap in one axis, and that the reading order within a dimension cluster is determined by the dimension type (linear horizontal, linear vertical, radial, angular) rather than by absolute x/y position. The correct approach is grouping logic that detects these clusters and serializes them as a unit before the global reading-order sort.
+
+GD&T symbols present a character-level challenge. GD&T uses a defined symbol vocabulary: ⌀ (diameter), ⌖ (position), ○ (circularity), ◎ (concentricity), ⌯ (symmetry), ⏥ (flatness), and others. In well-produced PDFs, these appear as Unicode characters (U+2300, U+2316, U+25CB, etc.) embedded in a symbol font with correct ToUnicode mappings. In poorly-produced PDFs, they appear as glyphs in a proprietary font with no ToUnicode table, mapping to arbitrary code points. Extraction must attempt ToUnicode lookup first, fall back to glyph-name-to-Unicode mapping using the AGL (Adobe Glyph List) and the engineering symbol extensions, and for truly unmapped glyphs, use glyph outline shape matching against a reference set of GD&T symbols to identify and substitute the correct Unicode code point. Silently dropping unmapped glyphs means the page shows `⌀0.010` while the extracted text reads only `0.010`, which is invisible damage to safety-critical specifications.
+
+## Technical Manuals: Procedures and Safety Callouts
+
+Technical manual PDFs share structural features with legal documents but carry safety-critical content that makes extraction fidelity non-negotiable. Numbered procedures are hierarchically structured: step 1., substep 1.1, action 1.1.a. The indentation level and numbering scheme together define the hierarchy. PDF does not encode this hierarchy; it must be inferred from x-position (indentation depth) and the numeric prefix pattern.
+
+Warning, Caution, and Note callout boxes are a distinctive feature of technical manuals following ANSI Z535 or MIL-STD-38784 conventions. These appear as bordered boxes, often with the label in a distinct font weight or color (red or orange for WARNING, yellow for CAUTION, blue or black for NOTE). The bordered box is a graphics element; the label and body text inside are separate text streams. Extraction must identify these box-and-text composites and tag the resulting text with its callout type, rather than merely serializing the word "WARNING" along with the body text as if it were paragraph prose. A WARNING that loses its semantic marking becomes invisible in downstream processing.
+
+Figure references (`See Figure 3-4`, `refer to Detail B`) and parts list references (`P/N 45-8812-002`) appear throughout manual text and link across pages.
These are text extraction targets with no special handling required at the extraction layer, but they must survive with their alphanumeric content intact — dashes, slashes, and dots in part numbers are frequently dropped by naive tokenizers. + +## Schematic PDFs: Spatial Context for Text Labels + +Electrical schematics and P&ID (Piping and Instrumentation Diagram) PDFs present the spatial-grouping problem in its most extreme form. Every text element — component reference designators (R1, C47, U3), wire labels (net names, voltage rails), tag numbers (FV-101, TIC-204) — is positioned relative to a symbol or wire graphic with no structural link in the PDF content stream. The symbol is a set of vector paths; the label is a nearby text object; the association is purely spatial. + +Extraction strategy must segment a schematic page into spatial neighborhoods, cluster text within each neighborhood around its parent symbol or wire segment, and emit the text with its spatial context preserved. For P&ID specifically, ISA 5.1 tag numbers follow a structured format (instrument function letters followed by loop number) that can be validated post-extraction to catch OCR or encoding errors. + +## Tolerance Notation and Special Characters + +Tolerance and specification notation in engineering PDFs depends on correct Unicode round-tripping. The ± symbol (U+00B1) must survive extraction as a single character, not as a `+` followed by a `-` stacked via vertical offset. Superscript and subscript characters — common in unit expressions like `N/m²` (U+00B2) or `10⁶` — may be rendered in PDF as normal-size characters with a vertical baseline offset rather than as Unicode superscript code points. Extraction must detect the baseline offset pattern and, where the character is in the range that has a defined Unicode superscript equivalent (digits 0–9, n, i), substitute the correct Unicode code point. 
Where no Unicode superscript exists, the text should be emitted with a markup convention (e.g., `^{text}`) rather than silently dropped or merged with adjacent baseline text.
+
+Fractions are similarly fragile. A fraction like `3/8` may be a single Unicode vulgar fraction (⅜, U+215C) or three separate characters. A mixed number like `1 3/8"` may be six separate characters or a combination of a regular `1`, a Unicode vulgar fraction, and an inch symbol. Both representations must extract to the same canonical form.
+
+## Multi-Sheet Documents and Sheet Metadata
+
+Large engineering documents are multi-sheet PDFs where each page corresponds to a numbered drawing sheet. Sheet metadata (sheet number, total sheet count, drawing number, revision) appears in the title block of each sheet and must be extracted per-page, not aggregated. A drawing index sheet (often sheet 1 of N) lists all sheet numbers with their titles and may be structured as a table. Cross-sheet references (`See Sheet 4`, `Cont. on Sh. 7`) appear as text and must be preserved with their sheet number targets intact.
+
+## Revision Tracking and Delta Clouds
+
+Revision tables record the change history of the document. Each row contains a revision identifier (A, B, C, or 01, 02, 03 depending on convention), a date, a brief change description, and approval initials. These are tabular data and must be extracted as such.
+
+Delta clouds, the irregular closed-curve annotations that enclose changed areas in revised drawings, are graphical elements (annotations such as `Polygon` with a cloudy border effect or `Ink`, or path graphics rendered in the content stream) with no inherent text content. However, a revision letter or ECO (Engineering Change Order) number is typically placed adjacent to the delta cloud boundary. Extraction should identify these isolated alphanumeric labels adjacent to closed irregular paths and tag them as revision markers associated with the region the cloud encloses.
+ +## Parts and Materials Tables + +Parts lists and material specifications are tabular data that must never collapse into running text. A five-column BOM with 40 line items, if extracted as a text stream, becomes 200 sequential values with no row or column structure — useless for downstream processing. Correct extraction detects the table grid (either from cell boundary line segments or from text alignment in columns), identifies the header row, and emits each row as a structured record. Column headers — ITEM NO., PART NUMBER, QTY, DESCRIPTION, MATERIAL — are the schema; each data row is an instance. Material specification strings (`ASTM A36`, `6061-T6 ALUM`, `316 SS`) must be preserved verbatim, including the alphanumeric codes and their formatting, as these are references to external standards that require exact string matching. + +## 3D Annotation Text Components + +PDF/E's `Measure3D` annotation type and related 3D annotation subtypes carry measurement values as text in their `Contents` and `RC` (rich content) entries. A `Measure3D` annotation marking the distance between two faces might have `Contents` equal to `42.375 mm`. This text is the extractable output; the 3D coordinates that define the measurement endpoints are geometry. Extraction should treat all annotation `Contents` entries as first-class text, regardless of annotation type, while skipping the binary payload of `RichMedia` and `3D` annotation streams. The result is that measurement labels, view names, and assembly notes embedded as 3D annotation metadata surface in extraction output alongside the 2D drawing text, providing a complete picture of the document's informational content without requiring a 3D geometry parser. 
diff --git a/docs/research/legal-and-financial-pdf-patterns.md b/docs/research/legal-and-financial-pdf-patterns.md new file mode 100644 index 0000000..88145f8 --- /dev/null +++ b/docs/research/legal-and-financial-pdf-patterns.md @@ -0,0 +1,89 @@ +# Legal and Financial Document PDF Extraction Patterns + +PDF text extraction in legal and financial contexts is categorically harder than general document extraction. The document types produced by law firms, courts, accounting firms, and financial institutions share a set of structural conventions that interact poorly with naive bounding-box or stream-order extraction. This document catalogs the patterns pdftract must handle to produce readable, semantically coherent text from these sources. + +## Legal Document Structure + +Legal documents impose spatial zones that carry semantic meaning independent of their visual appearance. A complaint or contract typically opens with a **caption block** — a formatted header containing party names, court or jurisdiction, and case identifiers — set apart from the body by borders or whitespace. The caption is not prose; it is a structured field cluster. pdftract must recognize caption geometry (centered or left-aligned multi-line blocks in the top third of the first page) and flag the region so downstream consumers can treat it as metadata rather than flowing text. + +Numbered paragraphs are the backbone of most legal instruments. Body paragraphs carry explicit numbering (¶ 1, ¶ 2, or bare integers) that defines reading order independently of x/y position. When columns or marginal annotations are present, the paragraph number anchors reconstruction of logical order. pdftract must preserve these numbers in the extracted stream rather than stripping them as decoration. + +**Defined terms** appear in two conventions: ALL CAPS (e.g., `AGREEMENT`, `EFFECTIVE DATE`) and bold-faced title case (e.g., **Indemnified Party**). 
Both signal that the term has a formal definition elsewhere in the document. Extraction must preserve the casing and emphasis signals — stripping ALL CAPS to mixed case or discarding bold metadata silently destroys the term's identity. + +Exhibit references (`See Exhibit A`, `attached hereto as Schedule 3.2(b)`) appear inline in body text and at the tail of numbered paragraphs. They are forward pointers into attached or appended documents. pdftract should surface these references with their surrounding context intact so the extraction output carries the logical link. + +Signature blocks appear at the end of agreements and at the end of each amendment or addendum. Their spatial form — a grid of underscored lines paired with labels (`By:`, `Name:`, `Title:`, `Date:`) — is distinct from body text and must be flagged as a signature region rather than normalized as prose (see the dedicated section below). + +## Court Filing PDFs + +US federal and state court filings introduce a margin convention that directly attacks stream-order extraction. California and many other jurisdictions require numbered lines running from 1 to 28 down the left margin of every page. These line numbers are typeset as a separate text column with x-coordinates left of the body text column. A naive extractor reading by x/y order will interleave margin numbers with body words: `1 PLAINTIFF`, `2 respectfully`, `3 submits` becomes `1 PLAINTIFF 2 respectfully 3 submits` — syntactically broken. + +pdftract must detect the line-number column by identifying a narrow strip of monotonically increasing integers (1–28) occupying the leftmost 0.5–0.75 inches of each page and exclude it from the primary reading stream. The column can be extracted separately as line-number metadata for applications that need it (e.g., citation tools that reference "line 14 of page 3"), but it must not pollute the prose extraction. 
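The line-number column detection described above might look roughly like this in Rust. The sketch assumes tokens carry a left-edge x position in points and a y position that increases down the page, and it uses the upper end of the stated strip width (0.75 inch, 54 pt). `Token`, `line_number_column`, and the `min_run` parameter are hypothetical names, not part of pdftract.

```rust
/// A positioned text token; `x` is the left edge in points and
/// `y` grows downward (assumption of this sketch).
pub struct Token {
    pub text: String,
    pub x: f32,
    pub y: f32,
}

/// Width of the margin strip to inspect: 0.75 inch at 72 pt per inch.
const MARGIN_STRIP_PT: f32 = 0.75 * 72.0;

/// Return the indices of tokens that form a pleading-paper line-number
/// column: integers in 1..=28, inside the left margin strip, strictly
/// increasing from top to bottom. Returns an empty vector when fewer than
/// `min_run` candidates exist or the sequence is not increasing.
pub fn line_number_column(tokens: &[Token], min_run: usize) -> Vec<usize> {
    // Collect (index, y, value) for every candidate in the margin strip.
    let mut cands: Vec<(usize, f32, u32)> = tokens
        .iter()
        .enumerate()
        .filter(|(_, t)| t.x < MARGIN_STRIP_PT)
        .filter_map(|(i, t)| t.text.trim().parse::<u32>().ok().map(|n| (i, t.y, n)))
        .filter(|&(_, _, n)| (1..=28).contains(&n))
        .collect();
    // Order candidates top to bottom.
    cands.sort_by(|a, b| a.1.total_cmp(&b.1));
    let increasing = cands.windows(2).all(|w| w[0].2 < w[1].2);
    if cands.len() >= min_run && increasing {
        cands.into_iter().map(|(i, _, _)| i).collect()
    } else {
        Vec::new()
    }
}
```

The returned indices can be removed from the primary reading stream and kept separately as line-number metadata. Integers that appear in the body column are left alone because they fall outside the margin strip.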
+ +Page headers and footers in court filings carry case numbers, party names abbreviated to fit a single line, and docket identifiers. These repeat on every page and should be extracted once (from the first occurrence) and flagged as repeating header/footer metadata, not as flowing body text. Deduplication across pages is essential; legal briefs can run 50–200 pages with identical headers on every page, and a naive extraction will produce 50 copies of the case caption interleaved into the text. + +## Contract Clause Numbering and Hierarchy + +Modern commercial contracts use hierarchical numbering schemes that encode the document's logical structure: `1.`, `1.1`, `1.1.1`, then `(a)`, `(b)`, `(a)(i)`, `(a)(ii)`. Some instruments mix Arabic and Roman numerals, parenthetical letters, and unnumbered indented sub-clauses. The extraction challenge is twofold: preserving the numbering tokens themselves, and inferring the nesting depth from indentation so that a consumer can reconstruct the hierarchy. + +pdftract must capture the indentation level of each clause by computing the left-margin offset relative to the document's base margin. An increase in left offset combined with a change in numbering style signals a deeper nesting level. The extracted text for a clause like `(a)(i)` should carry metadata indicating it is two levels below its parent section `1.1`, even if the raw character stream contains only the token `(a)(i)` followed by prose. + +Clause cross-references (`as defined in Section 8.2(c)`, `subject to Section 4.1.3(b)(ii)`) are high-value in legal extraction. pdftract should preserve these tokens intact — no normalization or abbreviation — because they are the connective tissue of the document's logic. + +## Redline and Tracked-Changes PDFs + +Redline documents represent negotiation state. They show both the prior text (struck through, typically in red) and the proposed replacement text (inserted, typically in a contrasting color or underlined). 
When a redline PDF is generated from a word processor, the two versions coexist spatially on the same page. + +pdftract must handle redline extraction in at least two modes. In **clean extraction** mode, only the inserted (accepted) text is emitted, and struck-through runs are discarded. In **both-versions** mode, the output interleaves deletion markers and insertion markers so the full negotiation delta is preserved: `[-old text-]{+new text+}` or an equivalent structured representation. Detecting which runs are struck-through requires inspecting text rendering flags or, in tagged PDFs, structure element attributes. For untagged redlines (the majority in practice), horizontal strikethrough lines overlapping text runs are the signal — pdftract must correlate line annotation objects with the text they cross rather than treating them as independent graphical decoration. + +Color is a supporting signal but not a reliable primary detector. Firms use different color conventions; some redlines show deletions in red and insertions in blue, others use magenta for one party's changes and green for another's in multi-party negotiations. pdftract should surface color-tagged text runs with their RGB values so downstream logic can apply firm-specific or document-specific color mapping. + +## Financial Statement PDFs + +Annual reports, audited financial statements, and interim filings are table-dominated. A balance sheet may span three columns (current year, prior year, notes reference) with a header row that spans all three. Income statements carry subtotals, blank separator rows, and grand totals that repeat across column groups. + +pdftract's table extraction for financial statements must handle **spanning headers**: a single cell whose text covers two or more columns below it. The physical PDF representation typically places the header text once, horizontally centered over the columns it spans, with no explicit cell boundary in the character stream. 
Reconstructing the spanning relationship requires measuring the header text's bounding box against the column grid inferred from the data rows below. + +Negative numbers in financial statements appear in parentheses: `(1,234,567)`. This convention is distinct from prose parenthetical remarks and must be preserved exactly — converting to a minus sign or stripping the parentheses changes the semantic value. Currency symbols (`$`, `€`, `£`, `¥`) may appear in the first row of a column only, with subsequent rows implying the currency. pdftract should not drop currency symbols or normalize them to a generic marker. + +## SEC Filing Patterns + +EDGAR filings (10-K, 10-Q, 8-K, S-1, and registration statements) are submitted as HTML or iXBRL and then rendered to PDF by EDGAR's viewer or by the filer. This pipeline introduces conversion artifacts: fonts embedded as image tiles rather than text glyphs, table cells that overflow their bounding boxes and overlap adjacent cells, and hyperlinks rendered as visible URL text that breaks line flow. + +Inline XBRL tags (`ix:nonFraction`, `ix:nonNumeric`) do not appear visually in the PDF but may survive as invisible text runs in the PDF character stream if the conversion preserved them. pdftract must strip these zero-width or hidden-layer text fragments from the extraction output rather than treating them as content. + +Table of contents pages in SEC filings use dot leaders — rows of periods connecting a section title to a page number. The dots are typeset as a repeating character sequence, not as a tab or graphic rule. pdftract must recognize the dot-leader pattern (a run of `.` or `·` characters spanning most of the line width, followed by a page number) and collapse the run to a single tab-equivalent rather than emitting hundreds of period characters into the text stream. 
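The dot-leader rule can be sketched on already-extracted text lines. This is a minimal illustration under stated assumptions, not pdftract's implementation: the function name `collapse_dot_leader` and the threshold `MIN_DOTS` are hypothetical, and a minimum run length stands in for the "most of the line width" test, which would need glyph geometry that a plain string does not carry.

```python
import re

# Hypothetical threshold: shortest run of leader characters treated as a dot
# leader. A geometric implementation would compare the run's width with the
# line width instead.
MIN_DOTS = 4

# Section title, then a run of '.' or '·' (optionally space-separated), then a
# page number (Arabic or Roman) at the end of the line.
_LEADER = re.compile(
    r"^(?P<title>.*?\S)\s*(?:[.·]\s?){%d,}\s*(?P<page>[0-9]+|[ivxlcdm]+)\s*$" % MIN_DOTS,
    re.IGNORECASE,
)


def collapse_dot_leader(line: str) -> str:
    """Replace a dot-leader run with a single tab; leave other lines unchanged."""
    match = _LEADER.match(line)
    if match is None:
        return line
    return f"{match.group('title')}\t{match.group('page')}"


if __name__ == "__main__":
    print(repr(collapse_dot_leader("Risk Factors ........ 12")))
```

Because the title is matched lazily and the leader must contain at least `MIN_DOTS` characters, periods inside a title such as `Item 1A.` or inside a number such as `$1.2` are not mistaken for a leader.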
+ +## Prospectus and Offering Documents + +Prospectuses (S-1, S-11, prospectus supplements) use multi-level nested tables to present use-of-proceeds summaries, capitalization tables, and summary financial data. Tables may be nested three levels deep, with outer tables controlling layout and inner tables holding data. Extraction must detect the logical data table within the layout scaffolding and not flatten all cells into a single indistinguishable stream. + +Tombstone blocks — the formatted announcement of a securities offering showing issuer, amount, bookrunners, and offering date in a bordered box — appear on cover pages and in marketing materials. Their spatial isolation and internal structure (stacked centered text, often in varying font sizes) distinguish them from body prose. pdftract should flag tombstone geometry as a cover block rather than attempting to integrate it into reading-order prose. + +Footnote networks in prospectuses are dense. A single table may carry a dozen footnote markers, with footnotes running across multiple pages. pdftract must associate each footnote marker in the body with its corresponding footnote text, preserving the numeric or alphabetic marker for cross-reference, and must handle footnotes that continue across a page break. + +## Invoice and Purchase Order PDFs + +Invoices and POs are semi-structured forms with fields occupying fixed regions. Key fields include vendor name and address, customer name and address, invoice number, invoice date, due date, line items (description, quantity, unit price, extended amount), subtotal, tax amount, shipping, and total due. These fields may be laid out in two- or three-column grids with labels left-aligned and values right-aligned or in labeled boxes. + +pdftract must extract these as key-value pairs rather than flowing prose. 
The extraction challenge is that field labels and values are spatially adjacent but may not share a text run: they occupy separate bounding boxes, often with no character-stream relationship. Associating `Invoice No.:` with `INV-2024-00891` requires spatial proximity analysis, not just stream-order reading. + +Line item tables in invoices follow a standard grid: each row is one billable item. pdftract's table detection must handle right-aligned numeric columns (where the decimal points or right edges of numbers align, not the left edges of cells) and compute correct column association even when column borders are absent. + +## Check and Payment Voucher PDFs + +Check images embedded in PDFs or check-layout PDFs present two parallel representations of the payment amount: the numeric amount (`$1,234.56`) and the legal amount in words (`One Thousand Two Hundred Thirty-Four and 56/100 Dollars`). Both must be extracted and surfaced as separate, linked fields, because they serve different verification purposes and must not be merged into one value. + +The MICR line at the bottom of a check encodes routing number, account number, and check number in MICR E-13B or CMC-7 font. When rendered in a PDF from a scan or a check-printing application, MICR characters may be typeset in a non-standard font that maps to unusual Unicode code points or that requires font-specific glyph remapping. pdftract must handle MICR font substitution and normalize the output to the corresponding digit characters and, for E-13B, the Unicode MICR delimiter symbols (`⑆` U+2446 for transit/routing, `⑇` U+2447 for amount, `⑈` U+2448 for on-us). + +## Signature Block Detection + +Signature blocks follow a spatial template that is nearly universal across legal document types. They appear at the end of the document (or at the end of each signatory section) and consist of one or more parallel columns, each containing an underscored blank line for the actual signature followed by labeled fields: `By:`, `Name:`, `Title:`, `Date:`, and sometimes `Address:` or `Email:`. 
The underscored line is typically rendered as a sequence of underscore characters or as a drawn horizontal rule. + +pdftract must flag signature block regions so that downstream consumers can distinguish them from content. An unfilled signature block should not be extracted as body text at all — the blank lines carry no information. A filled signature block (where names and dates have been typed or handwritten and scanned) presents the fields as labeled key-value pairs and should be extracted as structured metadata: `signatory_name`, `signatory_title`, `signature_date`. + +Detection heuristics: a cluster of labels matching the canonical set (`By:`, `Name:`, `Title:`, `Date:`) within a spatial proximity of roughly 1–2 inches, preceded by a horizontal rule or a run of underscores, is a signature block with high confidence. Multiple such clusters arranged side by side indicate multiple signatories. pdftract should emit a signature block record for each cluster rather than treating the region as unstructured text. + +--- + +Together, these patterns define the minimum surface area pdftract must cover to be useful in legal and financial workflows. None of the required behaviors are edge cases — they appear in the majority of documents produced by practitioners in these fields. Correct handling of margin line numbers, clause hierarchy, redline deltas, MICR fonts, dot leaders, and signature regions separates a general PDF extractor from a tool that legal and financial teams can trust. diff --git a/docs/research/stroke-and-outlined-text.md b/docs/research/stroke-and-outlined-text.md new file mode 100644 index 0000000..932b73c --- /dev/null +++ b/docs/research/stroke-and-outlined-text.md @@ -0,0 +1,91 @@ +# Stroke-based and Outlined Text in PDFs + +## Overview + +The `Tr` operator controls whether glyph outlines are filled, stroked, both, neither, or used to define a clip path. 
Mode 0 (fill) covers most PDF text, but the remaining modes appear regularly enough that pdftract must handle all eight. The critical insight is that rendering mode does not alter encoding or Unicode mapping — a character code maps to a codepoint through the same ToUnicode CMap, Differences array, or built-in encoding regardless of visual rendering. `Tr` affects paint, not identity. + +--- + +## 1. Text Rendering Modes (Tr 0–7) + +The PDF specification (ISO 32000-2 §9.3.6) defines eight text rendering modes indexed 0 through 7. The `Tr` operator sets the mode within the text state, which is part of the graphics state and subject to `q`/`Q` save-restore semantics. + +| Mode | Name | Fill | Stroke | Clip added | +|------|------|------|--------|------------| +| 0 | Fill | yes | no | no | +| 1 | Stroke only | no | yes | no | +| 2 | Fill then stroke | yes | yes | no | +| 3 | Invisible | no | no | no | +| 4 | Fill + clip | yes | no | yes | +| 5 | Stroke + clip | no | yes | yes | +| 6 | Fill + stroke + clip | yes | yes | yes | +| 7 | Clip only | no | no | yes | + +`Tr` defaults to 0 at the start of each page content stream and must be tracked in the graphics state stack. Every text-showing operator — `Tj`, `TJ`, `'`, `"` — emits glyphs under the active Tr. pdftract records the rendering mode into each extracted span when the operator is processed. + +--- + +## 2. Rendering Mode 1 — Stroke Only + +In mode 1, the renderer traces the glyph outline and strokes it without filling the interior, producing hollow or wireframe letterforms. This appears in display typography, decorative headings, and logo-embedded PDFs. + +From an extraction standpoint, mode 1 text is fully accessible. Character codes are present in the content stream in exactly the same form as mode 0 text. The font's encoding and ToUnicode CMap apply identically, and advance widths drive glyph positioning unchanged. pdftract reads mode 1 spans through the same decoding path with no divergence. 
+ +One nuance is bounding box computation. A stroked glyph visually occupies more space than a filled one by half the stroke width on each side. If pdftract computes tight bounds for layout analysis, it should account for the stroke width when the rendered boundary matters for reading-order or column detection. For Unicode output this is irrelevant. Confidence for mode 1 is equivalent to mode 0: the character data is unambiguously present and the rendering mode does not indicate OCR-derived or erroneous content. + +--- + +## 3. Rendering Mode 2 — Fill Then Stroke + +Mode 2 applies both a fill and a stroke pass to each glyph, producing bold or outlined letterforms with a colored border around a filled body. It is common in slide decks and documents where letter contrast over a complex background is needed. + +Extraction from mode 2 is identical to mode 0. Both paint passes use the same character codes, advance widths, and Unicode mapping. Fill color, stroke color, and stroke width differ, but none affect text identity. One error to avoid is double-counting: each glyph appears exactly once in the content stream, inside a single text-showing operator, regardless of Tr. The rendering mode governs paint passes, not stream events. pdftract emits one span per text-showing operator with the rendering mode recorded in metadata. + +--- + +## 4. Rendering Mode 3 — Invisible Text + +Mode 3 applies no fill and no stroke. The glyph produces no visible marks, but the text engine still processes it: advance widths accumulate, the text matrix advances, and spacing operators apply. This is the most consequential rendering mode for extraction. + +### 4.1 The PDF/A Scan-plus-OCR Pattern + +The dominant use of mode 3 is the searchable scan. A scanner captures a raster image of the page; OCR software recognizes the text and embeds the results as invisible glyphs positioned to overlay the image. The resulting PDF has a visible image layer and an invisible text layer carrying all machine-readable content. 
PDF/A-3b and PDF/UA both permit this pattern, and it is the standard output of commercial document scanning and archiving pipelines. + +pdftract must extract mode 3 text without exception. Suppressing it would silently discard the only machine-readable content in a large fraction of real-world documents — archived records, legal filings, and government materials. The extraction path is identical to mode 0: character codes are read from the content stream, mapped through font encoding, and output as Unicode. + +### 4.2 Confidence Scoring for Mode 3 + +pdftract applies a two-tier confidence model for mode 3 spans. When the font carries an explicit ToUnicode CMap, confidence is high — the OCR engine wrote recognized Unicode directly and the text is as reliable as the OCR pass itself. When the font lacks a ToUnicode CMap and relies on a Differences array, built-in encoding, or glyph-name inference, confidence is moderate — an indirect path seen in older scan workflows where the recovered Unicode may benefit from downstream validation. + +### 4.3 Detecting Mode 3 Abuse + +OCR errors propagate silently through invisible text layers. A visible image may show "exhibit" while the invisible layer encodes "exh1bit" because the OCR engine misread a numeral. pdftract cannot correct OCR errors at extraction time, but it should tag mode 3 spans with `ocr_layer: true` when a raster image covers the same page region. If the font's ToUnicode CMap is incomplete — glyph codes unmapped, or mapping to U+FFFD — pdftract adds `encoding_incomplete: true` and lowers confidence. These tags give downstream systems enough signal to flag suspect spans without requiring pdftract to adjudicate correctness. + +--- + +## 5. Rendering Modes 4–7 — Clipping Combinations + +Modes 4 through 7 each accumulate the glyph outline into the current clipping path in addition to their paint behavior. Mode 4 fills and clips; mode 5 strokes and clips; mode 6 fills, strokes, and clips; mode 7 clips only. 
In all cases the glyph outlines are accumulated while the text object is open. At the end of the text object (`ET`) the accumulated outlines are combined into a single path and intersected with the current clipping path, and subsequent graphics operations are clipped to the accumulated glyph shapes until the graphics state is restored with `Q`. This produces the typographic masking effect where imagery is visible only through letterforms. + +For Unicode extraction, clip modes are handled identically to their non-clip counterparts: mode 4 as mode 0, mode 5 as mode 1, mode 6 as mode 2, mode 7 as mode 3. The clip path accumulation has no bearing on text content. What pdftract must track correctly is the graphics state mutation: the clip path built from mode 4–7 glyphs takes effect at `ET` and persists until `Q`, and if pdftract models clip paths for image region inference it must accumulate outlines glyph by glyph and apply them at `ET`. Failing to do so corrupts clip state for every subsequent operation until the enclosing `Q`. + +--- + +## 6. Rendering Mode and Font Type Interaction + +Rendering mode does not interact with font type in any way that affects Unicode extraction. Type 1, TrueType, CFF, OpenType, CIDFont Type 0, CIDFont Type 2, and Type 3 fonts all expose character codes through the same encoding mechanisms regardless of Tr. The `Tr` operator controls paint operations applied to the glyph outline; it has no effect on how the outline is retrieved from the font program or how the character code is resolved to a name or codepoint. + +pdftract's font decoding layer (resolving `Encoding`, `ToUnicode`, `Differences`, glyph names, and CID-to-Unicode mappings) is invoked identically for every rendering mode. There is no branch point on Tr in the decoding path. + +--- + +## 7. Output Metadata and Confidence Policy + +Every span produced by pdftract carries a `rendering_mode` integer whose value is the `Tr` active when the span's text-showing operator was processed. The field defaults to 0 when no explicit `Tr` operator has appeared. Downstream consumers should interpret it as follows: + +- **Modes 0, 1, 2**: Text is visually rendered. Confidence is determined by encoding quality, not rendering mode. +- **Mode 3**: Text is invisible. If `ocr_layer: true`, treat as OCR-derived content. Confidence is high when the font has a complete ToUnicode CMap; moderate when it does not. +- **Modes 4, 5, 6**: Text is rendered and modifies the clip path. Confidence follows modes 0, 1, 2 respectively. +- **Mode 7**: Text is invisible and modifies the clip path. Confidence follows mode 3 rules. + +pdftract must not suppress or omit text from any rendering mode in its primary output; filtering by rendering mode is a caller policy. The library surfaces all text present in the content stream with accurate metadata. The graphics state tracker must save and restore `Tr` on every `q`/`Q` pair, reset it to 0 at each page boundary, and propagate it into every emitted span. Tracking `Tr` as a first-class part of the graphics state is what makes correct extraction across all eight modes possible.
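The tracking requirements above (save and restore on `q`/`Q`, reset at each page boundary, stamp every span) can be sketched as a small state machine. This is an illustrative sketch only, assuming the content stream has already been tokenized into `(operator, operands)` pairs. The class name `TrTracker` and the span keys `visible` and `clips` are hypothetical; only `rendering_mode` comes from the text above.

```python
# (fill, stroke, clip) flags for each text rendering mode, Tr 0 through 7.
MODE_FLAGS = {
    0: (True, False, False),
    1: (False, True, False),
    2: (True, True, False),
    3: (False, False, False),
    4: (True, False, True),
    5: (False, True, True),
    6: (True, True, True),
    7: (False, False, True),
}

TEXT_SHOWING = {"Tj", "TJ", "'", '"'}


class TrTracker:
    """Tracks Tr across q/Q and stamps it onto spans (hypothetical helper)."""

    def __init__(self):
        self.tr = 0
        self._stack = []

    def begin_page(self):
        # Tr starts at 0 for each page; no saved state carries over.
        self.tr = 0
        self._stack.clear()

    def handle(self, operator, operands=()):
        """Update state; return span metadata for text-showing operators."""
        if operator == "q":
            self._stack.append(self.tr)
        elif operator == "Q":
            if self._stack:  # tolerate an unbalanced Q in a malformed stream
                self.tr = self._stack.pop()
        elif operator == "Tr":
            mode = int(operands[0])
            if mode not in MODE_FLAGS:
                raise ValueError(f"invalid text rendering mode: {mode}")
            self.tr = mode
        elif operator in TEXT_SHOWING:
            fill, stroke, clip = MODE_FLAGS[self.tr]
            return {
                "rendering_mode": self.tr,
                "visible": fill or stroke,
                "clips": clip,
            }
        return None


if __name__ == "__main__":
    tracker = TrTracker()
    tracker.begin_page()
    ops = [("q", ()), ("Tr", (3,)), ("Tj", ("hidden",)), ("Q", ()), ("Tj", ("shown",))]
    spans = [s for op, args in ops if (s := tracker.handle(op, args)) is not None]
    print([s["rendering_mode"] for s in spans])  # prints [3, 0]
```

The sketch never filters by mode: every text-showing operator yields a span, and the `visible` flag is left for the caller to act on, in line with the policy that filtering belongs to the caller.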