pdftract/docs/research/engineering-document-extraction.md
jedarden eac3235291 Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs
Four new extraction research documents covering text rendering modes
(Tr 0-7 including invisible OCR layers), legal/financial document
extraction patterns, character-level confidence aggregation with output
schema, and PDF/E engineering document handling (CAD, GD&T, schematics).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:35:48 -04:00


# Engineering Document PDF Extraction
## PDF/E and the Engineering PDF Landscape
PDF/E-1 (ISO 24517-1) is a PDF 1.6 conformance level designed specifically for the exchange of engineering documents. Beyond the baseline PDF 1.6 feature set, PDF/E-1 mandates or restricts several capabilities relevant to extraction. It requires that all fonts be embedded, eliminating the ambiguity of system font substitution that plagues general-purpose PDF extraction. It prohibits encryption that would prevent conforming readers from rendering content, which means a conforming PDF/E file should always be extractable without decryption barriers. It also defines a formal attachment model that permits embedded 3D content streams in either U3D (Universal 3D) or PRC (Product Representation Compact) format, attached via `RichMedia` annotations or the `3D` annotation type introduced in PDF 1.6.
The critical distinction for an extraction library is that 3D geometry embedded in these annotations is binary format geometry — vertices, surfaces, B-rep topology, material properties — not text. The annotation itself may carry a text component: an `AP` (appearance stream) that renders a 2D projection or placeholder, a `Contents` entry with a label, and `Measure` dictionaries that can include numeric values and unit strings. These annotation-level text components are legitimate extraction targets. The geometry data stream itself is not. A correct extraction strategy treats 3D annotation content entries and their associated measurement labels as first-class text, while explicitly ignoring the binary 3D stream payload.
PRC-embedded metadata warrants a separate note. PRC files may contain a product structure tree with assembly names, part names, and attribute strings. When PRC data is embedded as a file attachment (rather than an inline stream), the attachment filename and any `/EmbeddedFile` metadata fields are extractable as document metadata, though the internal PRC tree requires a PRC parser outside the scope of text extraction.
## Engineering Drawing Structure as an Extraction Model
A well-structured engineering drawing follows conventions that, when understood, transform extraction from a spatial guessing game into a structured parse. The title block, conventionally located in the lower-right corner of the sheet, contains a bounded set of labeled fields: document or drawing number, sheet number, revision level, scale, drawn-by, checked-by, approved-by, and date. These fields are vector text rendered in a fixed spatial region. An extraction pass that identifies the lower-right quadrant of a landscape page and groups text clusters within it can reliably reconstruct the title block as structured key-value pairs rather than a stream of isolated glyphs.
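The title-block pass described above can be sketched as follows. This is a minimal illustration, assuming text has already been extracted as positioned spans in PDF user space (origin at lower-left). The `TextSpan` type, the `LABELS` set, the `title_block_fields` function, and the region fractions are all hypothetical; real title blocks vary by company and drafting standard.

```python
from dataclasses import dataclass

@dataclass
class TextSpan:
    text: str
    x: float  # left edge, PDF user space (origin at lower-left)
    y: float  # baseline

# Hypothetical label set; real title blocks use many variants.
LABELS = {"DWG NO", "SHEET", "REV", "SCALE", "DRAWN", "CHECKED", "APPROVED", "DATE"}

def _norm(text):
    return text.strip(" .:").upper()

def title_block_fields(spans, page_w, page_h, region=(0.6, 0.3)):
    """Pair each label in the lower-right region with its nearest value span."""
    x_min, y_max = page_w * region[0], page_h * region[1]
    block = [s for s in spans if s.x >= x_min and s.y <= y_max]
    labels = [s for s in block if _norm(s.text) in LABELS]
    values = [s for s in block if _norm(s.text) not in LABELS]
    fields = {}
    for lab in labels:
        # Assumption: the value sits to the right of or below its label.
        cands = [v for v in values if v.x >= lab.x - 1 and v.y <= lab.y + 1]
        if cands:
            best = min(cands, key=lambda v: abs(v.x - lab.x) + abs(v.y - lab.y))
            fields[_norm(lab.text)] = best.text
    return fields
```

The Manhattan-distance pairing is only one plausible heuristic; a production pass would more likely use the cell borders of the title block itself.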
Notes and callouts are positioned throughout the drawing field. Callout text typically appears at the endpoint of a leader line — a graphical path with an arrowhead at the geometry end and text at the annotation end. The text endpoint is the extraction target. Leader line paths in PDF are drawn as graphics operators (`m`, `l`, curve operators) and carry no inherent connection to the text they point to. Spatial proximity is the only available signal: the text cluster nearest to the non-arrowhead end of a leader path is the callout label for that leader. Extraction must preserve these as spatially-associated pairs rather than treating the text as free-floating.
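A small sketch of the proximity rule, assuming leader paths have already been reduced to an arrowhead endpoint and a text endpoint and that text has been clustered. `Leader`, `Cluster`, `pair_callouts`, and the `max_dist` threshold are illustrative names and values, not part of any existing library.

```python
import math
from dataclasses import dataclass

@dataclass
class Leader:
    arrow_end: tuple  # (x, y) at the geometry being annotated
    text_end: tuple   # (x, y) at the annotation side

@dataclass
class Cluster:
    text: str
    center: tuple     # (x, y) centre of the text cluster

def pair_callouts(leaders, clusters, max_dist=50.0):
    """Attach to each leader the text cluster nearest its non-arrowhead end."""
    pairs = []
    for leader in leaders:
        nearest = min(
            clusters,
            key=lambda c: math.dist(c.center, leader.text_end),
            default=None,
        )
        # Reject matches beyond max_dist so distant text is not mis-paired.
        if nearest is not None and math.dist(nearest.center, leader.text_end) <= max_dist:
            pairs.append((nearest.text, leader.arrow_end))
    return pairs
```

Each pair keeps the callout label together with the point on the geometry it refers to, so the association survives later reading-order sorting.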
The bill of materials (BOM) table and the revision history block are the two most structured text regions in a typical drawing. The BOM lists item numbers, part numbers, quantities, descriptions, and often material specifications in a tabular grid. The revision block records revision letter, date, description, and approval initials in a separate table; depending on the drafting standard, it sits in the upper-right corner of the sheet or is stacked above or beside the title block. Both must be extracted as tables, with row and column structure intact, not as linear text streams. Line segment detection (horizontal and vertical strokes forming cell boundaries) combined with text clustering within each cell provides the correct reconstruction.
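The grid reconstruction can be illustrated with a short sketch. It assumes the cell boundaries have already been reduced to the x-positions of vertical strokes and the y-positions of horizontal strokes, and that the table is a regular grid with no merged cells. The function name `cells_from_grid` and the `(text, x, y)` span tuples are hypothetical.

```python
from bisect import bisect_right

def cells_from_grid(v_lines_x, h_lines_y, spans):
    """spans: iterable of (text, x, y). Returns rows, top to bottom, of cell strings."""
    xs, ys = sorted(set(v_lines_x)), sorted(set(h_lines_y))
    n_cols, n_rows = len(xs) - 1, len(ys) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit spans top-to-bottom, then left-to-right, so text within a cell stays ordered.
    for text, x, y in sorted(spans, key=lambda s: (-s[2], s[1])):
        col = bisect_right(xs, x) - 1
        row = bisect_right(ys, y) - 1
        if 0 <= col < n_cols and 0 <= row < n_rows:
            # PDF y grows upward, so the top row has the largest y.
            grid[n_rows - 1 - row][col].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]
```

Spans that fall outside the outermost strokes are ignored rather than forced into a cell, which keeps stray nearby text out of the table.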
## CAD-to-PDF Conversion Artifacts
CAD systems produce PDF through an internal rendering pipeline that converts model annotations, dimensions, and symbols to PDF content streams. This conversion is frequently lossy in ways that complicate extraction. Exploded dimension text is the most common artifact: a linear dimension that appears to a human as a single object, say `24.500 ±0.005`, may be stored in the PDF as two or three separate text objects at separate positions: the nominal value, the tolerance value, and, where present, a unit string, each placed relative to the dimension line geometry. An extraction that simply serializes glyphs in reading order may interleave these fragments with other nearby text, producing output like `24.500 R0.375 ±0.005 [4×]`.
Recovering exploded dimension text requires recognizing that dimension annotation components cluster tightly around a dimension line path, that their bounding boxes often overlap in one axis, and that the reading order within a dimension cluster is determined by the dimension type (linear horizontal, linear vertical, radial, angular) rather than by absolute x/y position. Grouping logic that detects these clusters and serializes them as a unit — before the global reading-order sort — is the correct approach.
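A minimal sketch of that grouping step, assuming each dimension line has been reduced to an anchor point and a type, and each text fragment to a `(text, x, y)` tuple. The function `group_dimensions`, the `radius` value, and the two dimension types handled here are simplifying assumptions; radial and angular dimensions would need their own ordering rules.

```python
import math

def group_dimensions(fragments, dim_lines, radius=15.0):
    """fragments: (text, x, y); dim_lines: (kind, x, y), kind 'horizontal' or 'vertical'.
    Returns (dimension strings, leftover fragments)."""
    clusters = {i: [] for i in range(len(dim_lines))}
    leftover = []
    for frag in fragments:
        _, fx, fy = frag
        dists = [math.dist((fx, fy), (dx, dy)) for _, dx, dy in dim_lines]
        i = min(range(len(dists)), key=dists.__getitem__, default=None)
        if i is not None and dists[i] <= radius:
            clusters[i].append(frag)
        else:
            leftover.append(frag)
    out = []
    for i, (kind, _, _) in enumerate(dim_lines):
        # Horizontal dimensions read left to right; vertical ones are assumed
        # to be rotated text that reads bottom to top.
        key = (lambda f: f[1]) if kind == "horizontal" else (lambda f: f[2])
        if clusters[i]:
            out.append(" ".join(f[0] for f in sorted(clusters[i], key=key)))
    return out, leftover
```

Only the leftover fragments would then go through the global reading-order sort, so a dimension such as `24.500 ±0.005` is emitted as one unit.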
GD&T symbols present a character-level challenge. GD&T uses a defined symbol vocabulary: ⌀ (diameter, U+2300), ⌖ (position, U+2316), ○ (circularity, U+25CB), ◎ (concentricity, U+25CE), ⌯ (symmetry, U+232F), ⏥ (flatness, U+23E5), and others. Look-alike characters such as ∅ (U+2205, empty set), ⊕ (U+2295, circled plus), and ⊙ (U+2299, circled dot operator) are different code points and are not interchangeable with the drafting symbols. In well-produced PDFs, the symbols appear as glyphs in a symbol font with correct ToUnicode mappings. In poorly-produced PDFs, they appear as glyphs in a proprietary font with no ToUnicode table, mapping to arbitrary code points. Extraction must attempt ToUnicode lookup first, fall back to glyph-name-to-Unicode mapping using the AGL (Adobe Glyph List) and the engineering symbol extensions, and for truly unmapped glyphs, use glyph outline shape matching against a reference set of GD&T symbols to identify and substitute the correct Unicode code point. Silently dropping unmapped glyphs turns `⌀0.010` into ` 0.010`: invisible damage to safety-critical specifications.
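The three-stage fallback can be sketched as below. The dictionaries stand in for a font's ToUnicode CMap and its code-to-glyph-name table; `resolve_glyph`, the glyph names in `GLYPH_NAME_TO_UNICODE`, and the `shape_match` callback are hypothetical, since proprietary CAD fonts use their own naming and outline matching is a separate problem. The sketch emits U+FFFD for unresolved glyphs so that a lost symbol stays visible in the output instead of disappearing.

```python
# Hypothetical glyph names mapped to Unicode drafting symbols:
# U+2300 diameter sign, U+2316 position indicator, U+23E5 flatness.
GLYPH_NAME_TO_UNICODE = {"diameter": "\u2300", "position": "\u2316", "flatness": "\u23e5"}
UNMAPPED = "\ufffd"  # replacement character: marks a glyph that could not be resolved

def resolve_glyph(code, to_unicode, glyph_names, shape_match=None):
    """Return (text, source), where source records which stage succeeded."""
    if code in to_unicode:
        return to_unicode[code], "tounicode"
    name = glyph_names.get(code)
    if name in GLYPH_NAME_TO_UNICODE:
        return GLYPH_NAME_TO_UNICODE[name], "glyph-name"
    if shape_match is not None:
        guess = shape_match(code)  # assumed to return a str or None
        if guess:
            return guess, "shape"
    return UNMAPPED, "unmapped"
```

Recording the source alongside the text lets downstream confidence scoring treat a shape-matched symbol as less certain than one that came from a ToUnicode table.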
## Technical Manuals: Procedures and Safety Callouts
Technical manual PDFs share structural features with legal documents but carry safety-critical content that makes extraction fidelity non-negotiable. Numbered procedures are hierarchically structured: step 1., substep 1.1, action 1.1.a. The indentation level and numbering scheme together define the hierarchy. PDF does not encode this hierarchy; it must be inferred from x-position (indentation depth) and the numeric prefix pattern.
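As a rough illustration of the numbering half of that inference, the sketch below derives a depth from the numeric prefix alone. `STEP_RE` and `step_depth` are hypothetical names, and the pattern covers only the `1.`, `1.1`, `1.1.a` scheme mentioned above. In practice the x-position of the line should corroborate the result, since a sentence that merely starts with a number would also match.

```python
import re

# Dotted number run, optional trailing letter level, optional closing "." or ")".
STEP_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)(?:\.([a-z]))?[.)]?\s+(.*)$")

def step_depth(line):
    """Return (depth, body) for a numbered step, or None if no prefix is found."""
    m = STEP_RE.match(line)
    if not m:
        return None
    numbers, letter, body = m.groups()
    depth = numbers.count(".") + 1 + (1 if letter else 0)
    return depth, body
```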
Warning, Caution, and Note callout boxes are a distinctive feature of technical manuals following ANSI Z535 or MIL-STD-38784 conventions. These appear as bordered boxes, often with the label in a distinct font weight or color (red or orange for WARNING, yellow for CAUTION, blue or black for NOTE). The bordered box is a graphics element; the label and body text inside are separate text streams. Extraction must identify these box-and-text composites and tag the resulting text with its callout type, not merely serialize the word "WARNING" along with the body text as if it were paragraph prose. A WARNING that loses its semantic marking becomes invisible in downstream processing.
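A minimal sketch of the box-and-text composite, assuming bordered rectangles have already been detected from the graphics operators and that spans arrive in reading order. `tag_callouts`, the `(x0, y0, x1, y1)` box tuples, and the rule that the first span inside the box is the label are illustrative assumptions.

```python
CALLOUT_TYPES = ("WARNING", "CAUTION", "NOTE")

def tag_callouts(boxes, spans):
    """boxes: (x0, y0, x1, y1); spans: (text, x, y) in reading order."""
    tagged = []
    for x0, y0, x1, y1 in boxes:
        inside = [t for t, x, y in spans if x0 <= x <= x1 and y0 <= y <= y1]
        if not inside:
            continue
        label = inside[0].strip(" :!").upper()
        # Only boxes whose first span is a recognised label become callouts.
        if label in CALLOUT_TYPES:
            tagged.append({"type": label, "text": " ".join(inside[1:])})
    return tagged
```

Emitting the callout type as a separate field, rather than as a leading word in the body, is what keeps the marking available to downstream processing.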
Figure references (`See Figure 3-4`, `refer to Detail B`) and parts list references (`P/N 45-8812-002`) appear throughout manual text and link across pages. These are text extraction targets with no special handling required at the extraction layer, but they must survive with their alphanumeric content intact — dashes, slashes, and dots in part numbers are frequently dropped by naive tokenizers.
## Schematic PDFs: Spatial Context for Text Labels
Electrical schematics and P&ID (Piping and Instrumentation Diagram) PDFs present the spatial-grouping problem in its most extreme form. Every text element — component reference designators (R1, C47, U3), wire labels (net names, voltage rails), tag numbers (FV-101, TIC-204) — is positioned relative to a symbol or wire graphic with no structural link in the PDF content stream. The symbol is a set of vector paths; the label is a nearby text object; the association is purely spatial.
Extraction strategy must segment a schematic page into spatial neighborhoods, cluster text within each neighborhood around its parent symbol or wire segment, and emit the text with its spatial context preserved. For P&ID specifically, ISA 5.1 tag numbers follow a structured format (instrument function letters followed by loop number) that can be validated post-extraction to catch OCR or encoding errors.
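The post-extraction validation can be sketched with a deliberately simplified pattern. `TAG_RE` and `invalid_tags` are hypothetical, and the pattern is an assumption about tag shape only (function letters, hyphen, loop number, optional suffix). ISA 5.1 also constrains which letter combinations are meaningful, which this sketch does not check, and site conventions for separators and loop-number length vary.

```python
import re

# Simplified assumption: 1 to 4 uppercase function letters, a hyphen,
# a 2 to 5 digit loop number, and an optional single-letter suffix.
TAG_RE = re.compile(r"[A-Z]{1,4}-\d{2,5}[A-Z]?")

def invalid_tags(labels):
    """Return labels that fail the tag pattern and need review."""
    return [t for t in labels if not TAG_RE.fullmatch(t)]
```

A label such as `TIC-2O4`, where a letter O has replaced a zero, fails the pattern and is flagged for review instead of passing through silently.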
## Tolerance Notation and Special Characters
Tolerance and specification notation in engineering PDFs depends on correct Unicode round-tripping. The ± symbol (U+00B1) must survive extraction as a single character, not as a `+` followed by a `-` stacked via vertical offset. Superscript and subscript characters, common in unit expressions like `N/m²` (U+00B2) or `10⁶`, may be rendered in PDF as normal-size characters with a vertical baseline offset rather than as Unicode superscript code points. Extraction must detect the baseline offset pattern and, where the character is in the range that has a defined Unicode superscript equivalent (digits 0 to 9, n, i), substitute the correct Unicode code point. Where no Unicode superscript exists, the text should be emitted with a markup convention (e.g., `^{text}`) rather than silently dropped or merged with adjacent baseline text.
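The superscript rule can be sketched as follows, assuming text runs arrive with their baseline y-coordinate and that the first run sits on the normal baseline. `merge_baseline_runs`, `render_raised`, and the `min_rise` threshold are hypothetical; subscripts would follow the same pattern with a negative offset and are omitted here.

```python
RAISABLE = "0123456789ni"
# Unicode superscripts for 0-9, n, i (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079, U+207F, U+2071).
SUPERSCRIPTS = str.maketrans(
    RAISABLE,
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207f\u2071",
)

def render_raised(text):
    """Use Unicode superscripts where every character has one, else ^{...} markup."""
    if all(ch in RAISABLE for ch in text):
        return text.translate(SUPERSCRIPTS)
    return "^{" + text + "}"

def merge_baseline_runs(runs, min_rise=2.0):
    """runs: (text, baseline_y) in reading order; the first run sets the baseline."""
    base_y = runs[0][1]
    return "".join(render_raised(t) if y - base_y >= min_rise else t for t, y in runs)
```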
Fractions are similarly fragile. A fraction like `3/8` may be a single Unicode vulgar fraction (⅜, U+215C) or three separate characters. A mixed number like `1 3/8"` may be stored entirely as individual ASCII characters or as a combination of a regular `1`, a Unicode vulgar fraction, and an inch symbol. Both representations must extract to the same canonical form.
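One way to reach a single canonical form is to expand vulgar fractions through Unicode compatibility normalization, which decomposes them into digits around U+2044 FRACTION SLASH. The sketch below takes the ASCII form as canonical, which is an assumption since the document does not name a target form. `canonical_fraction` is a hypothetical helper, and treating U+2033 and U+201D as inch marks is likewise an illustrative choice.

```python
import re
import unicodedata

def canonical_fraction(text):
    """Expand vulgar fractions to ASCII 'n/d' and normalise inch marks to '"'."""
    out = []
    for ch in text:
        decomp = unicodedata.normalize("NFKC", ch)
        if "\u2044" in decomp and ch != "\u2044":
            # e.g. U+215C decomposes to '3', U+2044, '8'.
            if out and out[-1].isdigit():
                out.append(" ")  # separate the whole-number part: 1 3/8
            out.append(decomp.replace("\u2044", "/"))
        else:
            out.append(ch)
    return re.sub("[\u2033\u201d]", '"', "".join(out))
```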
## Multi-Sheet Documents and Sheet Metadata
Large engineering documents are multi-sheet PDFs where each page corresponds to a numbered drawing sheet. Sheet metadata — sheet number, total sheet count, drawing number, revision — appears in the title block of each sheet and must be extracted per-page, not aggregated. A drawing index sheet (often sheet 1 of N) lists all sheet numbers with their titles and may be structured as a table. Cross-sheet references (`See Sheet 4`, `Cont. on Sh. 7`) appear as text and must be preserved with their sheet number targets intact.
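A short sketch of per-page sheet parsing, assuming the title-block text and body text are available as plain strings. The two patterns and the function names are hypothetical and cover only the phrasings quoted above (`SHEET n OF m`, `See Sheet n`, `Cont. on Sh. n`); real drawings use more variants.

```python
import re

SHEET_OF_RE = re.compile(r"\bSH(?:EE)?T\.?\s*(\d+)\s+OF\s+(\d+)\b", re.I)
XREF_RE = re.compile(r"\b(?:SEE|CONT\.?(?:INUED)?\s+ON)\s+SH(?:EET|T)?\.?\s*(\d+)", re.I)

def parse_sheet_marker(title_block_text):
    """Return (sheet_number, total_sheets) from one page's title block, or None."""
    m = SHEET_OF_RE.search(title_block_text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def find_sheet_refs(body_text):
    """Return the target sheet numbers of cross-sheet references, in order."""
    return [int(n) for n in XREF_RE.findall(body_text)]
```

Running `parse_sheet_marker` once per page keeps the metadata per-sheet, and the extracted reference targets can be checked against the set of sheet numbers actually present.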
## Revision Tracking and Delta Clouds
Revision tables record the change history of the document. Each row contains a revision identifier (A, B, C, or 01, 02, 03 depending on convention), a date, a brief change description, and approval initials. These are tabular data and must be extracted as such.
Delta clouds, the irregular closed-curve annotations that enclose changed areas in revised drawings, are graphical elements (typically a `Polygon` annotation with a cloudy border effect, an `Ink` annotation, or path graphics in the content stream) with no inherent text content. However, a revision letter or ECO (Engineering Change Order) number is typically placed adjacent to the delta cloud boundary. Extraction should identify these isolated alphanumeric labels adjacent to closed irregular paths and tag them as revision markers associated with the spatial region they bound.
## Parts and Materials Tables
Parts lists and material specifications are tabular data that must never collapse into running text. A five-column BOM with 40 line items, if extracted as a text stream, becomes 200 sequential values with no row or column structure — useless for downstream processing. Correct extraction detects the table grid (either from cell boundary line segments or from text alignment in columns), identifies the header row, and emits each row as a structured record. Column headers — ITEM NO., PART NUMBER, QTY, DESCRIPTION, MATERIAL — are the schema; each data row is an instance. Material specification strings (`ASTM A36`, `6061-T6 ALUM`, `316 SS`) must be preserved verbatim, including the alphanumeric codes and their formatting, as these are references to external standards that require exact string matching.
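The header-as-schema idea reduces to a few lines once the table rows have been recovered. `rows_to_records` is a hypothetical helper; it assumes the first row is the header and raises on ragged rows instead of guessing, since a shifted column in a BOM is worse than a reported failure.

```python
def rows_to_records(rows):
    """Turn [header, row, row, ...] into a list of dicts keyed by column header."""
    header, *data = rows
    keys = [h.strip() for h in header]
    records = []
    for i, row in enumerate(data, start=1):
        if len(row) != len(keys):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(keys)}")
        # Cell values are kept verbatim: no case folding, no whitespace collapsing.
        records.append(dict(zip(keys, row)))
    return records
```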
## 3D Annotation Text Components
PDF/E's `Measure3D` annotation type and related 3D annotation subtypes carry measurement values as text in their `Contents` and `RC` (rich content) entries. A `Measure3D` annotation marking the distance between two faces might have `Contents` equal to `42.375 mm`. This text is the extractable output; the 3D coordinates that define the measurement endpoints are geometry. Extraction should treat all annotation `Contents` entries as first-class text, regardless of annotation type, while skipping the binary payload of `RichMedia` and `3D` annotation streams. The result is that measurement labels, view names, and assembly notes embedded as 3D annotation metadata surface in extraction output alongside the 2D drawing text, providing a complete picture of the document's informational content without requiring a 3D geometry parser.
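The annotation rule can be sketched against plain dictionaries standing in for parsed annotation dictionaries. `annotation_text` is a hypothetical helper, and the input shape (string keys without the leading slash, text values already decoded to `str`) is an assumption about what the PDF parser hands over. The function reads only `Contents` and `RC`, so stream-valued entries such as the 3D data under `3DD` are never touched.

```python
TEXT_KEYS = ("Contents", "RC")

def annotation_text(annots):
    """Collect text entries from annotation dicts, regardless of subtype."""
    out = []
    for annot in annots:
        for key in TEXT_KEYS:
            value = annot.get(key)
            # Only decoded strings are text; bytes and stream objects are skipped.
            if isinstance(value, str) and value.strip():
                out.append({
                    "subtype": annot.get("Subtype"),
                    "key": key,
                    "text": value.strip(),
                })
    return out
```

Keeping the subtype on each record lets later stages distinguish a measurement label from an ordinary note without re-reading the PDF.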