diff --git a/docs/research/error-handling-and-robustness.md b/docs/research/error-handling-and-robustness.md new file mode 100644 index 0000000..5d67364 --- /dev/null +++ b/docs/research/error-handling-and-robustness.md @@ -0,0 +1,90 @@ +# Error Handling, Robustness, and Graceful Degradation in PDF Extraction + +## Overview + +PDF extraction in the real world is not a clean parsing problem. Documents produced by hundreds of different authoring tools over three decades accumulate every imaginable deviation from the ISO 32000 specification: truncated files written by crashed processes, streams whose declared `/Length` disagrees with their actual byte count, circular indirect object references, content streams that open a `BT` block and forget to close it. A production extraction library that fails on any of these inputs is a liability. pdftract is designed on the principle that a document should never be fully unextractable due to a localized structural defect, and that every degraded extraction must come with a precise diagnostic record so callers can make informed decisions about the output they receive. + +--- + +## Error Taxonomy + +PDF parsing errors divide into five distinct classes, each demanding a different recovery posture. + +**Structural errors** are defects in the file-level framing: a malformed or missing cross-reference table, an absent `%%EOF` marker, or a file that simply ends mid-stream. These affect the parser's ability to locate objects at all. pdftract handles structural errors through a two-pass xref strategy: the primary pass attempts normal xref/trailer parsing; if that fails, a linear scan of the byte stream recovers `obj` markers directly, rebuilding a synthetic object table from first principles. Pages whose object offsets are recovered via linear scan are flagged with `xref_reconstructed` in their per-page diagnostics. 
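The linear-scan fallback described above can be sketched roughly as follows. This is a minimal Python illustration and not pdftract's implementation: the name `scan_object_offsets` and the regular expression are assumptions made for the example, and a production scanner would also have to deal with object streams and with false matches inside binary stream data.

```python
import re

# "<num> <gen> obj" at a token boundary. Illustrative pattern only: it can
# still produce false positives inside binary stream data.
OBJ_MARKER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")


def scan_object_offsets(data):
    """Rebuild a synthetic object table from raw bytes.

    Returns {(object_number, generation): byte_offset}. When the same
    object appears more than once, the last occurrence wins, on the
    assumption that incremental updates append newer revisions.
    """
    table = {}
    for match in OBJ_MARKER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        table[key] = match.start(1)
    return table


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
offsets = scan_object_offsets(sample)
for (num, gen), offset in sorted(offsets.items()):
    print(num, gen, offset)
```

Keeping the last occurrence of each object is a design choice for this sketch; a scanner could equally record every occurrence and let a later pass pick the revision to trust.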
+ +**Object errors** occur when an individual indirect object has malformed syntax: mismatched delimiters, a type tag that contradicts the object's actual content, or a stream object missing its `endstream` keyword. When pdftract encounters an unparseable object, it records the object number and offset, emits an `object_parse_error` diagnostic, and substitutes a null value for that object in the object table. Callers that reference the failed object receive null rather than an exception. + +**Stream errors** arise when the bytes of a stream cannot be decoded: a FlateDecode stream whose zlib data is corrupt, an LZWDecode stream with a truncated bitstream, or a declared `/Length` that runs past the actual `endstream` marker. Each stream is decoded independently. A decode failure emits a `stream_decode_error` diagnostic scoped to the specific stream's object number and page, and processing continues with the next stream on the same page. The raw (undecoded) bytes are never surfaced as text content. + +**Content stream errors** live one level deeper: the decoded stream bytes parse as valid PDF syntax, but the operators or operands within the content stream are malformed, for example an unknown operator name, a `Tf` call with a missing operand, or a numeric value where a name is required. These are handled by pdftract's content stream state machine, described in detail below. + +**Semantic errors** arise in structurally valid objects that violate the logical requirements of the PDF spec: a page dictionary lacking a required `/MediaBox`, a font dictionary missing `/Encoding`, a `/Resources` dictionary that references a font name not defined anywhere in the document. These are handled through fallback defaults rather than aborts, also detailed below. + +--- + +## The Permissive Parsing Principle + +PDF viewing software has historically tolerated extraordinary levels of spec deviation to avoid frustrating end users with unrenderable documents. 
The result is that a large fraction of PDFs in the wild contain violations that a strict parser would reject. pdftract adopts a formally permissive posture: parsing never hard-fails on a syntax deviation unless the deviation is so severe that no coherent interpretation of the surrounding bytes exists. Whitespace tolerance, delimiter mismatches, and minor keyword misspellings are handled silently. Every permissive decision is logged internally but only surfaces in the output diagnostics when it crosses a threshold that affects extraction quality. The goal is to extract as much text as the data permits, not to validate the document. + +--- + +## Truncated Files + +A truncated PDF — one where the writing process was interrupted before `%%EOF` was emitted, or where the xref section was never completed — is detected during the structural parsing phase. pdftract considers a file truncated when the xref table is absent or incomplete and linear scan recovers fewer object offsets than the trailer's `/Size` field declares. + +When truncation is detected, extraction proceeds against the objects that were successfully recovered. Pages whose object graph is fully reconstructable are extracted normally. Pages that reference objects beyond the truncation point are emitted as empty pages with a `page_truncated` diagnostic. The document-level output carries a `truncated: true` flag in its metadata block, and the `extraction_quality` field reflects the proportion of pages that yielded content. + +--- + +## Corrupt Streams + +Stream decode errors are isolated at the stream boundary. When FlateDecode, LZWDecode, CCITTFaxDecode, or any other filter raises a decode exception, pdftract catches the exception at the stream decompressor boundary, records a `stream_decode_error` diagnostic with the object number, filter name, and byte offset where decompression failed, and moves on. The page is not aborted. 
If a page has three content streams and one fails to decode, the text from the other two is still extracted and reported. This per-stream isolation is enforced by the architecture: each stream is decoded in a sandboxed context with no shared mutable state that a failure could corrupt. + +--- + +## Content Stream Error Recovery + +The content stream interpreter is implemented as a pushdown state machine. Each token — operand or operator — is consumed one at a time. When the interpreter encounters an unknown operator name, it discards the pending operand stack, emits an `unknown_operator` diagnostic with the operator name and stream position, and resumes consuming tokens. When an operand is malformed (a string that cannot be decoded, a number outside representable range), the bad token is skipped and the state machine continues from the next well-formed token. + +BT/ET balance is tracked with a counter. If an `ET` is encountered with no preceding `BT`, or if end-of-stream is reached with an open text block, the text state is reset to a clean initial state and a `bt_et_mismatch` diagnostic is emitted. Text extracted before the mismatch point is preserved; the reset ensures subsequent operators are interpreted against a coherent state rather than stale matrix and font references. + +--- + +## Missing Required Keys and Semantic Fallbacks + +When a page dictionary has no `/Contents` key, pdftract emits an empty page with no error — this is a valid degenerate page per the spec, and some documents use empty pages intentionally. When `/Contents` references an object that cannot be resolved, a `missing_contents` diagnostic is emitted and the page is empty. + +Font references that appear in a content stream but are absent from `/Resources` enter a glyph fallback pipeline: pdftract attempts to resolve the font by name against a built-in metrics table covering the 14 standard PDF fonts and common Adobe variants. 
If resolution fails, character codes are passed through as Unicode replacement characters (U+FFFD), and a `font_not_found` diagnostic records the missing font name. The extraction continues; consumers can decide whether unresolved font references are acceptable for their use case. + +When `/MediaBox` is absent from a page dictionary — and cannot be inherited from parent nodes in the page tree — pdftract defaults to US Letter dimensions (612×792 points). This default has no effect on text extraction, which is geometry-independent, but it ensures that coordinate normalization for spatial layout features produces coherent results rather than division-by-zero failures. + +--- + +## Circular References + +Indirect object references can form cycles: object A resolves to a dictionary containing a reference to object B, which resolves to a dictionary referencing A. Such cycles arise from corrupted or adversarially crafted documents. pdftract tracks the resolution path for each lookup using a thread-local visit set. Before dereferencing any indirect reference, the target object number is checked against the current visit set. If the object number is already present, the lookup returns null and a `circular_reference` warning is logged with the full cycle path. Resolution then unwinds normally. The visit set is cleared between top-level page extractions so that legitimate repeated references (a font shared across many pages) are not falsely flagged. + +--- + +## Stack Overflows in Deeply Nested Structures + +Page trees with thousands of intermediate nodes, Form XObjects that nest dozens of levels deep, and arrays or dictionaries containing nested structures that exceed practical recursion depth are handled with explicit stack data structures rather than native call stack recursion. The page tree walker maintains an explicit deque of pending nodes. The content stream interpreter's Form XObject descent uses an explicit execution stack rather than recursive calls. 
Array and dictionary parsing uses an iterative token accumulator. This design eliminates the class of stack overflow crashes that recursive descent parsers encounter on pathological inputs, and it makes the maximum nesting depth a configurable parameter rather than a property of the platform's call stack size. + +--- + +## Memory Limits for Pathological Inputs + +Some malformed PDFs contain arrays with millions of entries or xref tables that claim millions of objects. pdftract enforces practical extraction limits: a maximum of 100,000 indirect objects per document, a maximum of 1,000 Form XObject nesting levels per page, and a maximum array or dictionary size of 65,536 entries per object. When a limit is reached, further items are dropped, a `limit_exceeded` diagnostic is emitted with the limit name and the actual count encountered, and processing continues with the truncated structure. These limits are exposed as configurable parameters so callers with specific workload characteristics can tune them without recompiling. + +--- + +## Output Quality Signaling + +Every extraction response carries an `extraction_quality` field drawn from a four-value enum: + +- **Complete**: all pages extracted without diagnostics affecting content. +- **Partial**: one or more pages are empty due to structural or stream errors, but the majority of the document was extracted successfully. Threshold: fewer than 20% of pages have content-affecting errors. +- **Degraded**: significant extraction failures; 20% or more of pages have content-affecting errors, or the xref required full linear reconstruction. +- **Failed**: no content could be extracted from any page. + +The `errors` array in the output contains one entry per diagnostic event. Each entry carries a `code` (e.g., `stream_decode_error`), a human-readable `message`, a `severity` (`warning`, `error`, or `fatal`), and a `location` block with `page`, `object`, and `stream` fields populated where applicable. 
This structure gives callers a machine-readable audit trail: a document processing pipeline can decide to retry with a different extraction strategy, quarantine the document, or simply log the errors and proceed, based on the specific codes and counts present. The intent is that pdftract never silently degrades — every compromise in extraction quality has a corresponding record in the output. diff --git a/docs/research/extraction-output-schema.md b/docs/research/extraction-output-schema.md new file mode 100644 index 0000000..c9ec4ca --- /dev/null +++ b/docs/research/extraction-output-schema.md @@ -0,0 +1,81 @@ +# Extraction Output Schema, API Surface, and Structured Output Design + +## Overview + +The pdftract extraction output schema is designed around a single, governing principle: every downstream use case — RAG ingestion, full-text search indexing, accessibility auditing, forensic analysis, and archival preservation — must be satisfiable from a single extraction pass without requiring re-processing. This demands a schema that is simultaneously comprehensive and layered, exposing fine-grained atomic data at lower levels while assembling semantic structure at higher levels, with clean separation between what belongs at document scope and what belongs at page scope. + +--- + +## Document-Level Structure + +The root JSON object is the document envelope. 
It carries everything that is not inherently per-page: document `metadata`, the navigation `outline` (bookmark tree derived from the PDF `/Outlines` dictionary), `threads` (article thread chains linking related content across non-contiguous pages), `attachments` (embedded files and portfolio entries), `signatures` (digital signature fields with their coverage and validation state), document-scoped `links` (cross-document or external URI targets that span multiple pages or resolve at document level), `form_fields` (AcroForm and XFA field definitions with their values), an `extraction_quality` aggregate summarizing confidence across the entire document, and an `errors` array holding all diagnostic events from the extraction run. + +The `pages` array is also at document level, each entry being a self-contained page object. Placing `pages` at the root allows consumers to address any page by index without traversing nested structures, and makes NDJSON streaming (described below) a natural projection of the same schema rather than a separate format. + +Fields that are inherently global — metadata, signatures, the outline tree, embedded attachments — must not be duplicated inside page objects. Conversely, anything that varies per-page (geometry, content blocks, annotations) must not be flattened to the document level. This division is what keeps both the full-document JSON and per-page NDJSON frames self-consistent. + +--- + +## Page-Level Structure + +Each page object carries `page_index` (zero-based integer, matching array position), `page_label` (the human-readable label from the PDF `/PageLabels` number tree, e.g. "iv", "A-3", "1"), `width` and `height` in points (1/72 inch), and `rotation` in degrees clockwise (0, 90, 180, or 270). The `page_type` field is a hint produced by the classifier: `text`, `scanned`, `mixed`, `blank`, or `figure_only`. 
This hint is informational; it does not gate access to content but signals to consumers how much confidence to assign to the extracted text. + +Within a page, content is represented at two granularities: `spans` and `blocks`. Spans are the atomic unit: individual sequences of characters sharing identical rendering properties. Blocks are semantic groupings assembled from one or more spans. Both arrays coexist on the page object. This dual representation is deliberate: applications that need character-level font and position data (accessibility auditing, forensic comparison) operate on spans directly; applications that need paragraph flow (RAG chunking, search indexing) operate on blocks. Each block references its member spans by index, so block-level consumers can always descend to span-level data when needed. + +Page-level `annotations` are distinct from block content. They include highlights, stamps, sticky notes, and ink annotations, each with their own `bbox`, `subtype`, `author`, `created`, `modified`, and `contents` fields. Links (URI and internal-destination) appear in annotations as `subtype: link` with a `uri` or `dest` field rather than being mixed into the text stream. + +--- + +## Span Schema + +A span is the smallest unit of extraction output. 
Its fields are: `text` (the decoded Unicode string), `bbox` as a four-element array `[x0, y0, x1, y1]` in points with the coordinate origin at the lower-left of the page (PDF default), `font` (the font name as declared in the resource dictionary), `size` (the rendered glyph size in points, combining the font matrix and CTM), `color` (the fill color as a CSS hex string like `"#1a1a1a"`, or `null` if the color is not expressible as RGB, for example a spot color), `rendering_mode` (an integer 0–7 matching the PDF `Tr` operator: 0 = fill, 3 = invisible, etc.), `confidence` (a float 0.0–1.0), `confidence_source` (one of `"native"`, `"ocr"`, `"heuristic"`), `lang` (a BCP-47 language tag if detected, otherwise `null`), and `flags` (a set of strings: `"bold"`, `"italic"`, `"smallcaps"`, `"subscript"`, `"superscript"`). + +The `confidence` and `confidence_source` pair allows consumers to apply their own filtering thresholds. A span with `confidence_source: "native"` and high confidence came from decoded font mapping with no ambiguity. A span with `confidence_source: "ocr"` was produced by the raster OCR pipeline and warrants lower trust. The `rendering_mode` field is critical for invisible-text detection: text placed with `Tr 3` is present in the stream but was never intended to be visible — forensic and accessibility consumers need this distinction. + +--- + +## Block Schema + +A block aggregates spans into a semantic unit. Its fields are: `kind` (one of `paragraph`, `heading`, `table`, `list`, `figure`, `header`, `footer`, `caption`, `code`, `formula`), `text` (the concatenated plain text of all member spans, with whitespace normalized), `bbox` (the union bounding box of all member spans), `spans` (an array of span indices referencing the page-level `spans` array), `level` (an integer 1–6 for `kind: heading`, matching h1–h6 semantics, omitted for all other kinds), and `confidence` (the minimum confidence across member spans, representing the weakest link). 
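As a rough illustration of how the block-level fields follow from the member spans, the sketch below derives `text`, `bbox`, and `confidence` for a block. The helper name `summarize_block` is hypothetical, and joining span text with a single space before normalizing whitespace is an assumption of this example, not a statement of how pdftract concatenates spans.

```python
def summarize_block(page_spans, span_indices):
    """Derive block fields from the page-level spans it references.

    `page_spans` is the page-level `spans` array and `span_indices` is the
    block's `spans` array of indices, using the field names defined above.
    """
    members = [page_spans[i] for i in span_indices]
    boxes = [s["bbox"] for s in members]
    # Assumption: spans are joined with one space, then whitespace is collapsed.
    text = " ".join(" ".join(s["text"] for s in members).split())
    return {
        "text": text,
        # Union bounding box in [x0, y0, x1, y1] order.
        "bbox": [
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        ],
        "spans": list(span_indices),
        # Weakest-link confidence: the minimum across member spans.
        "confidence": min(s["confidence"] for s in members),
    }


page_spans = [
    {"text": "Hello ", "bbox": [72.0, 700.0, 110.0, 712.0], "confidence": 1.0},
    {"text": "  world", "bbox": [112.0, 699.0, 150.0, 712.5], "confidence": 0.8},
    {"text": "unrelated", "bbox": [72.0, 100.0, 130.0, 112.0], "confidence": 0.5},
]
block = summarize_block(page_spans, [0, 1])
print(block)
```

Because the block stores indices rather than copies, a consumer that needs font or rendering data can look up `page_spans[i]` for each entry in `block["spans"]`.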
+ +The `kind` field is the primary classification signal. Consumers building a table of contents use `kind: heading` with `level`. Consumers extracting body text filter to `kind: paragraph`. The `header` and `footer` kinds identify repeated page-margin content that should typically be excluded from body-text flows. The `figure` kind marks regions where no extractable text is present but a visual element occupies the bbox — useful for flagging gaps in extraction coverage. + +--- + +## Table Output + +Tables are represented with two complementary structures. The block with `kind: table` gives the bounding box and concatenated text for downstream consumers that do not need cell structure. For consumers that do, a parallel `table` object at page level (keyed to the block index) provides the full nested structure: `rows` is an array of row objects, each containing a `cells` array. Each cell carries `text`, `bbox`, `rowspan` (default 1), `colspan` (default 1), and `is_header` (boolean, derived from tagged PDF structure or heuristic header-row detection). This separation ensures that table-aware consumers get machine-readable structure while table-unaware consumers still receive coherent concatenated text from the block. + +--- + +## Metadata Schema + +The document `metadata` object surfaces all standard PDF document information dictionary fields: `/Title`, `/Author`, `/Subject`, `/Keywords`, `/Creator`, `/Producer`, `/CreationDate`, and `/ModDate` (ISO-8601 strings). It also carries derived signals: `page_count`, `pdf_version` (e.g. 
`"1.7"`), `is_tagged` (boolean, true if a `/MarkInfo` dictionary with `Marked: true` is present), `is_encrypted` (boolean), `conformance` (one of `"none"`, `"PDF-A-1a"`, `"PDF-A-1b"`, `"PDF-A-2a"`, `"PDF-A-2b"`, `"PDF-A-2u"`, `"PDF-A-3a"`, `"PDF-A-3b"`, `"PDF-A-3u"`, `"PDF-UA-1"`, `"PDF-UA-2"`, `"PDF-X-1a"` — validated, not merely declared), `contains_javascript` (boolean), `contains_xfa` (boolean), and `generator` (a heuristic string identifying the producing application inferred from `/Creator` and `/Producer` patterns). XMP metadata is normalized into these same fields where it provides richer values than the document information dictionary. + +--- + +## Plain Text Output Mode + +When invoked with `--text`, pdftract emits a single UTF-8 string rather than JSON. Reading order within each page serializes blocks in top-to-bottom, left-to-right order after rotation normalization. Paragraphs are separated by double newlines. Page breaks are represented as a form feed character (`\f`) placed between pages, which is the standard convention recognized by text processing tools. Headers and footers are excluded by default; `--include-headers-footers` re-enables them. The plain text mode is a lossy projection — bbox, font, confidence, and structure are all discarded — intended for indexing pipelines that require only content, not provenance. + +--- + +## NDJSON Streaming Mode + +For large documents, the `--stream` flag activates NDJSON output: one JSON object per line, emitted as each page completes extraction. The first line is a document header frame containing the `schema_version`, `metadata`, `outline`, and a `total_pages` count. Each subsequent page frame contains a single page object in the same schema as the `pages` array entries in full-document mode, plus a `frame: "page"` discriminator field. 
The final line is a document footer frame (`frame: "footer"`) carrying `extraction_quality`, `errors`, `threads`, `attachments`, `signatures`, `form_fields`, and document-scoped `links` — all the fields that can only be finalized after all pages have been processed. This design allows consumers to begin processing page one while pages two through N are still being extracted, which is essential for large documents and server-side streaming APIs. + +--- + +## Error and Diagnostic Schema + +Every diagnostic event from the extraction pipeline is recorded in the `errors` array at document level. Each entry has: `code` (a stable string identifier like `"FONT_CMAP_MISSING"`, `"GLYPH_UNMAPPED"`, `"OCR_FALLBACK"`, `"XREF_REPAIRED"`, `"ENCRYPTION_UNSUPPORTED"`), `message` (a human-readable description), `page_index` (integer or `null` for document-level events), `severity` (one of `"error"`, `"warning"`, `"info"`), and `location` (an optional object with `object_number` and `generation_number` identifying the PDF indirect object where the issue originated). Error codes are namespaced by area: `FONT_*` for encoding failures, `OCR_*` for raster fallback events, `STRUCT_*` for structure tree problems, `XREF_*` for cross-reference repairs. Integration developers can key on codes programmatically rather than parsing messages, which remain subject to wording changes between releases. + +--- + +## Versioning and Stability + +The root document object carries `schema_version: "1.0"`. All fields documented here are stable in the 1.x series: their names, types, and semantics will not change in a breaking way. New fields may be added to any object in minor releases; consumers must ignore unknown fields. Fields marked with `"experimental": true` in the specification are exempt from the stability guarantee and may be removed or renamed between minor versions. + +The `extensions` object at the root level is reserved for non-breaking additions that have not yet graduated to stable status. 
Extension fields use a namespaced key format (`"pdftract.ocr.engine_version"`) to avoid collision with future stable fields. Consumers that rely on extension fields must treat them as experimental regardless of the version in which they appear. + +This schema is designed so that a consumer written against version 1.0 will continue to function correctly when processing output from any 1.x release, receiving richer data it may ignore rather than encountering structural incompatibilities. Major version increments (2.0, 3.0) signal breaking changes and require explicit consumer migration. diff --git a/docs/research/pdf-generator-quirks.md b/docs/research/pdf-generator-quirks.md new file mode 100644 index 0000000..8961598 --- /dev/null +++ b/docs/research/pdf-generator-quirks.md @@ -0,0 +1,151 @@ +# PDF Generator Identification and Per-Generator Extraction Quirks + +Different PDF generators leave distinctive fingerprints in the files they produce, and those fingerprints predict the extraction problems pdftract will encounter. Knowing which tool created a PDF allows the pipeline to apply targeted workarounds rather than generic fallbacks. This document covers how to identify the generator and exactly what extraction behavior to expect from each major source. + +--- + +## 1. Generator Identification + +Every PDF may carry two generator strings in its `/Info` dictionary: `/Creator` and `/Producer`. These serve distinct purposes. `/Creator` names the authoring application — the tool the human used to compose the document (Microsoft Word, Adobe InDesign, LibreOffice Writer). `/Producer` names the PDF conversion engine — the component that rendered the final byte stream (Acrobat PDFMaker, pdfTeX, Ghostscript, Quartz PDFContext). In workflows with a single tool, both fields may name the same application. In multi-step workflows (for example, Word → Distiller, or LaTeX → dvips → Ghostscript), they diverge and reveal the pipeline. 
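A minimal sketch of reading the two fields side by side is shown below, assuming the `/Info` dictionary has already been decoded into a plain Python dict with `Creator` and `Producer` keys. The marker substrings are taken from producer strings named in this document; the engine labels, the `PRODUCER_MARKERS` table, and the `classify_generator` helper are all hypothetical, and the lookup uses case-insensitive substring matching so that version numbers do not interfere.

```python
# Hypothetical marker table: (substring to look for, invented engine label).
PRODUCER_MARKERS = [
    ("pdftex", "pdftex"),
    ("ghostscript", "ghostscript"),
    ("quartz pdfcontext", "quartz"),
    ("skia/pdf", "skia"),
    ("adobe pdf library", "adobe-pdf-library"),
]


def classify_generator(info):
    """Normalize /Creator and /Producer and name the conversion engine."""
    creator = (info.get("Creator") or "").strip().lower()
    producer = (info.get("Producer") or "").strip().lower()
    engine = "unknown"
    for needle, label in PRODUCER_MARKERS:
        if needle in producer:
            engine = label
            break
    return {"creator": creator, "producer": producer, "engine": engine}


result = classify_generator(
    {"Creator": "Microsoft Word", "Producer": "GPL Ghostscript 9.56"}
)
print(result)
```

In this example the authoring application and the conversion engine differ, which is the multi-step case the paragraph above describes.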
+ +XMP metadata in the `/Metadata` stream duplicates much of this information using `xmp:CreatorTool` and `pdf:Producer`, often with more detail than the `/Info` dictionary allows. When `/Info` strings are truncated or absent, XMP is the fallback. + +pdftract should extract and normalize both strings early in the parsing phase, before any text extraction begins, and use the normalized values to select generator-specific processing modes. Matching should be case-insensitive substring search, not exact equality, because version numbers and build identifiers vary. + +--- + +## 2. Microsoft Word (PDFMaker and Save as PDF) + +Word-produced PDFs carry `/Creator` values such as `Microsoft Word`, `Microsoft Office Word`, or simply `Word`, and `/Producer` values of `Adobe PDF Library` (when PDFMaker is used) or `Microsoft: Print To PDF` (when using the built-in Save as PDF driver introduced in Office 2013). + +**Encoding:** Word embeds ToUnicode CMaps reliably. Character identity is rarely the problem. + +**Character spacing:** Older Word versions (pre-2013) inconsistently apply the `Tc` (character spacing) operator. A non-zero `Tc` in the graphics state may persist across text objects where it should have been reset, causing pdftract to miscalculate inter-character gaps when reconstructing word boundaries. The workaround is to honor `Tc` only within the immediately enclosing `BT/ET` block and treat carry-over as a bug rather than intent. + +**Word spacing in TJ arrays:** Word frequently uses TJ arrays to encode text with embedded kerning values. These values are expressed in thousandths of a unit of text space and are subtracted from the current text position, so the small positive values typical of kerning close gaps. Negative values whose magnitude exceeds a threshold (commonly 250 units at 1000 units/em) widen the gap enough to represent intentional word breaks and should be treated as spaces even when no explicit space character appears in the string operand. + +**Structure tree:** Word documents prior to Office 365 (version 16.0) almost never include a StructTree. 
Logical reading order must be inferred from geometry. Word 2016 and later can produce accessible PDFs with a partial StructTree when the author uses the Accessibility Checker and exports with the `Document structure tags for accessibility` option enabled. These StructTrees are shallower than InDesign output and may omit figure alt-text even when accessibility options are on. + +--- + +## 3. LibreOffice Writer + +LibreOffice Writer sets `/Creator` to `Writer` and `/Producer` to `LibreOffice N.N` or `OpenOffice.org N.N` for older releases. + +**ToUnicode:** Generally present and correct for Latin scripts. The failure mode is ligatures. The glyphs `fi`, `fl`, `ff`, `ffi`, and `ffl` are sometimes encoded as single-slot glyphs in the font's private encoding without a corresponding ToUnicode entry and without the `ActualText` attribute in a StructTree (which LibreOffice does not produce). The extraction result is a missing ligature character. The workaround is to identify single-glyph operands in known ligature codepoint slots and substitute the Unicode decomposition based on glyph name. + +**Word spacing:** LibreOffice sometimes omits explicit space characters between words when the inter-word gap is encoded entirely as a large negative TJ kerning value. The threshold for interpreting TJ gaps as spaces is the same as for Word, but the frequency is higher. pdftract's span-merging pass must apply this heuristic consistently to avoid run-together words in LibreOffice output. + +--- + +## 4. Adobe InDesign + +InDesign is the highest-quality PDF generator in common use. `/Creator` is `Adobe InDesign` with a version number; `/Producer` is `Adobe PDF Library`. + +**Encoding and structure:** Accessible InDesign exports (File → Export → PDF → with `Create Tagged PDF` enabled) produce well-formed StructTrees with `ActualText` on ligature spans, role maps for custom tag names, and article threads for multi-column reading order. ToUnicode CMaps are always present and correct. 
+ +**Optical kerning:** InDesign's optical kerning algorithm produces large numbers of small TJ adjustments — often individual character pairs with sub-5-unit corrections. These are legitimate and should not be misinterpreted as word breaks. pdftract's gap threshold logic must use a higher threshold (around 500–600 units at 1000 units/em) when it detects InDesign output to avoid false word-break insertions between tightly-set glyphs. + +**Spot colors:** InDesign preserves spot color separations (Pantone, custom inks) in DeviceN and Separation color spaces. This is irrelevant for text extraction but can cause confusion if the pipeline attempts to rasterize pages for OCR confidence scoring — the DeviceN color values will not render correctly without the spot color lookup table. + +**Article threads:** Older InDesign exports (pre-CS6) encode reading order for multi-column layouts as article threads in the `/Threads` array rather than in the StructTree. pdftract should extract article threads as a fallback reading-order source when the StructTree is absent or incomplete. + +--- + +## 5. LaTeX (pdflatex, LuaLaTeX, XeLaTeX, and dvips) + +LaTeX generator detection is covered in depth in `latex-and-scientific-pdf-patterns.md`. The relevant `/Producer` strings are: `pdfTeX-N.N` for pdflatex, `XeTeX` for xelatex, `LuaTeX` for lualatex, and `GPL Ghostscript` or `Acrobat Distiller` for the legacy dvips pipeline. + +**dvips artifacts:** The `latex` → `dvips` → `ps2pdf` pipeline produces PDFs with no ToUnicode CMaps. Ghostscript does not synthesize them from the PostScript source. Character identity must be recovered entirely from glyph names and font encoding vectors. Very old dvips output may also include Type 3 fonts built from PK bitmap rasterizations of Metafont glyphs; these have no outline and no reliable glyph name. pdftract must fall back to raster OCR for pages dominated by such fonts. 
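Recovering character identity from glyph names, as the dvips case requires, might look roughly like the sketch below. The `GLYPH_NAMES` table is a tiny hand-picked excerpt for illustration (a real implementation would load a full glyph list), and `glyph_name_to_text` is a hypothetical helper. The `uniXXXX` and `uXXXX` forms follow the Adobe glyph naming convention.

```python
import re

# Small excerpt for illustration only; not a complete glyph list.
GLYPH_NAMES = {
    "A": "A",
    "a": "a",
    "space": " ",
    "fi": "\ufb01",
    "fl": "\ufb02",
    "alpha": "\u03b1",
    "endash": "\u2013",
}

UNI_FORM = re.compile(r"^uni((?:[0-9A-F]{4})+)$")  # e.g. uni03B1, uni00410042
U_FORM = re.compile(r"^u([0-9A-F]{4,6})$")         # e.g. u1F600


def glyph_name_to_text(name):
    """Map a glyph name to Unicode text, or None if it cannot be resolved."""
    base = name.split(".", 1)[0]  # drop suffixes such as ".alt" or ".sc"
    if base in GLYPH_NAMES:
        return GLYPH_NAMES[base]
    match = UNI_FORM.match(base)
    if match:
        digits = match.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )
    match = U_FORM.match(base)
    if match:
        code = int(match.group(1), 16)
        if code <= 0x10FFFF:
            return chr(code)
    return None  # caller decides whether to fall back to OCR


for glyph in ("fi", "uni03B1", "a.sc", "g123"):
    print(glyph, repr(glyph_name_to_text(glyph)))
```

Names that resolve to `None`, such as the numeric placeholders often seen in Type 3 fonts, are the cases where the raster OCR fallback mentioned above becomes necessary.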
+ +**hyperref metadata:** When the hyperref package is loaded, it populates `/Info` fields (`/Title`, `/Author`, `/Subject`, `/Keywords`) and creates a PDF outline (bookmarks) from section headings. This is useful for extraction: bookmarks can supplement or replace geometric heading detection. However, hyperref also emits PDF destinations for every `\label`, which multiplies the number of named destinations in the document catalog (the `/Dests` dictionary or the `/Names` tree); pdftract should not attempt to extract those destinations as meaningful text. + +--- + +## 6. Google Docs and Google Slides + +Google Docs exports carry `/Creator` of `Google Docs` or `Google Slides` and a `/Producer` string beginning with `Skia/PDF` followed by a build milestone number (for example, `Skia/PDF m128`). This overlaps with Chrome's producer string; the `/Creator` field disambiguates. + +**Unicode and encoding:** Google's export engine produces correct ToUnicode CMaps. Character identity is reliable. + +**Header and footer duplication:** Google Docs repeats header and footer content on every page as independent text streams with no structural marker distinguishing them from body text. The text appears at the top and bottom of each page at consistent Y coordinates. pdftract should detect repeated text blocks at fixed page-relative positions across three or more consecutive pages and classify them as headers or footers, suppressing duplicates in continuous extraction output. + +**Inline images:** Images in Google Docs PDFs are always converted to JPEG and inlined in the content stream. They are not referenced as Image XObject resources. This means image extraction must scan inline image operators (`BI`/`EI`) in addition to `Do` operators. + +**Structure tree:** Google Docs and Slides do not emit StructTrees. Reading order is entirely geometry-driven. + +--- + +## 7. macOS Print to PDF (Core Graphics / Quartz) + +macOS system-level PDF generation sets `/Producer` to `Mac OS X N.N.N Quartz PDFContext`. 
The `/Creator` is the application that initiated the print job. + +**Font handling:** Core Graphics subsets fonts aggressively, retaining only the glyphs used in the document. Subset names carry the standard six-character uppercase prefix. ToUnicode CMaps are present and correct for all text. + +**Page thumbnails:** Quartz-generated PDFs frequently embed page thumbnails as JPEG images in the `/Thumb` entry of each page dictionary. These are rendering artifacts and should not be processed as content. + +**Quality:** Quartz output is generally clean. The main extraction challenge arises when the printing application does not expose logical text to the PDF layer — for example, when a canvas-based web application prints via WebKit, the output may be paths rather than text operators. + +--- + +## 8. Ghostscript + +Ghostscript (`/Producer` beginning with `GPL Ghostscript N.N`) typically appears as a downstream converter, transforming PostScript into PDF. It may also appear as the engine in Linux print-to-PDF and in some server-side document conversion pipelines. + +**Encoding errors:** When converting PostScript that uses Symbol or Dingbats fonts, Ghostscript sometimes misidentifies glyph slots during re-encoding, producing incorrect character substitutions. A Symbol font encoded with the standard Symbol encoding should map slot 0x61 to the alpha character (U+03B1); Ghostscript has been observed mapping some slots to their Latin equivalents instead. pdftract should treat any text run in a font named `Symbol` or `ZapfDingbats` as suspect and apply the canonical encoding table rather than trusting the embedded ToUnicode. + +**Type 3 promotion:** When Ghostscript converts PostScript Type 1 fonts it cannot fully resolve, it may re-emit them as Type 3 fonts with charproc streams. These Type 3 glyphs do not carry glyph names and require shape-based recovery. Detection: font `/Subtype` is `Type3` and `/FontMatrix` is not the identity matrix. 
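
The Symbol re-encoding workaround described in this section can be sketched as follows. This is a hedged illustration: `SYMBOL_SUBSET` lists only a handful of slots from the standard Symbol encoding, and `is_suspect_font` and `decode_symbol_run` are hypothetical helper names, not an existing API. The canonical table wins over the embedded ToUnicode map, which is consulted only for slots the table does not cover.

```python
# Assumption: a partial table of the standard Symbol encoding, for illustration.
SYMBOL_SUBSET = {
    0x61: "\u03b1",  # a -> alpha
    0x62: "\u03b2",  # b -> beta
    0x64: "\u03b4",  # d -> delta
    0x67: "\u03b3",  # g -> gamma
    0x70: "\u03c0",  # p -> pi
}

SUSPECT_BASE_FONTS = {"Symbol", "ZapfDingbats"}


def is_suspect_font(base_font):
    """True for Symbol/ZapfDingbats, with or without a six-letter subset prefix."""
    name = base_font.lstrip("/")
    if len(name) > 7 and name[6] == "+" and name[:6].isupper():
        name = name[7:]  # strip a prefix such as "ABCDEF+"
    return name in SUSPECT_BASE_FONTS


def decode_symbol_run(codes, tounicode=None):
    """Decode byte codes, preferring the canonical table over embedded ToUnicode."""
    tounicode = tounicode or {}
    out = []
    for code in codes:
        if code in SYMBOL_SUBSET:
            out.append(SYMBOL_SUBSET[code])
        elif code in tounicode:
            out.append(tounicode[code])
        else:
            out.append("\ufffd")  # unknown slot: emit the replacement character
    return "".join(out)
```

Passing a ToUnicode map that wrongly sends 0x61 to a Latin `a` still yields alpha, which is the behavior the paragraph on encoding errors asks for.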
+ +**No ToUnicode synthesis:** Ghostscript does not add ToUnicode CMaps to PostScript-derived content that lacked them. If the upstream PostScript had no encoding information, the PDF will not either. dvips-to-Ghostscript output is the canonical case. + +--- + +## 9. Browser Print-to-PDF (Chrome/Chromium and Firefox) + +**Chrome:** `/Producer` is `Skia/PDF mXX` where XX is a Chromium milestone number. `/Creator` is absent or set to the page title. Chrome's PDF renderer is based on the Skia graphics library and produces clean ToUnicode CMaps. + +**Firefox:** `/Producer` is `Mozilla/N.N` with a version number. + +**Fragmented text runs:** Both browsers may decompose text into single-character `Tj` operations in some rendering paths, particularly for complex CSS typography (letter-spacing, text-shadow, mixed bidirectional content). A paragraph that reads as one logical span in the DOM becomes dozens of individual positioned glyphs in the PDF. pdftract's span-merging pass must reconstruct these into word and line sequences by clustering glyphs whose inter-character gaps fall within a font-size-relative threshold. The merge step should apply before any word-boundary heuristics, not after. + +**Baseline variation:** Web pages with inline SVG or mixed font sizes can produce text runs with small vertical offsets within a single visual line. The line-grouping pass should use a tolerance band of roughly 20% of the dominant font size when assigning characters to the same text line. + +--- + +## 10. Scanning Software and OCR Layers + +Scanned PDFs produced by NAPS2, Adobe Scan, Microsoft Office Lens, and similar tools carry an invisible text layer rendered with text rendering mode 3 (written `3 Tr` in the content stream: neither filled nor stroked). The background is a raster image; the text layer is OCR output aligned to the image. 
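
One way to recognize such a layer is to track the text rendering mode while walking a decoded content stream and measure how many text-showing operators run under mode 3. The sketch below is deliberately naive: the tokenizer ignores nested parentheses in strings, the function does not model `q`/`Q` graphics-state save and restore, and `invisible_text_ratio` is a hypothetical name.

```python
import re

# Naive content-stream tokenizer; good enough to keep string contents
# from being mistaken for operators.
TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string, no nested parentheses
  | <[0-9A-Fa-f\s]*>          # hex string
  | [\[\]]                    # array delimiters
  | /[^\s\[\]()<>/]*          # name
  | [^\s\[\]()<>/]+           # number or operator
""", re.VERBOSE | re.DOTALL)

TEXT_SHOWING = {b"Tj", b"TJ", b"'", b'"'}


def invisible_text_ratio(stream):
    """Fraction of text-showing operators executed under rendering mode 3."""
    mode = 0          # the default text rendering mode is 0 (fill)
    prev = None
    shown = 0
    invisible = 0
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok == b"Tr" and prev is not None and prev.isdigit():
            mode = int(prev)          # operand precedes the operator: "3 Tr"
        elif tok in TEXT_SHOWING:
            shown += 1
            if mode == 3:
                invisible += 1
        prev = tok
    return invisible / shown if shown else 0.0
```

A page whose ratio is close to 1.0 and whose only other content is a full-page image is a strong candidate for the scan-plus-OCR treatment described in this section.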
+ +**Producer strings:** +- NAPS2: `/Producer` is `NAPS2` with a version +- Adobe Scan: `/Creator` contains `Adobe Scan` +- Office Lens: `/Creator` contains `Microsoft Office Lens` +- Generic OCR pipelines using ABBYY FineReader may report `/Producer` as `ABBYY FineReader` +- Tesseract-based pipelines (including some open-source scan apps) report `/Producer` as `tesseract N.N.N` + +**OCR engine quality:** ABBYY FineReader output typically has higher character-level accuracy and better word-spacing reconstruction than Tesseract, particularly for non-Latin scripts and degraded print. Apple Vision (used in iOS scan apps) is competitive with ABBYY for English. Tesseract output requires more aggressive post-OCR normalization. + +**Confidence signals:** The invisible text layer is not a reliable carrier of OCR confidence. Tesseract reports per-word confidence in its hOCR and TSV outputs rather than in the PDF it renders, and other engines are no more consistent. pdftract should therefore compute its own confidence signal for scan layers: the ratio of recognizable Unicode characters to total character count in the `3 Tr` layer, cross-checked against the visual character density in the corresponding raster region. A high-confidence scan layer can be used directly; a low-confidence one should trigger a re-OCR pass using pdftract's internal raster pipeline. + +**Text alignment:** OCR-placed text in scan PDFs is positioned to match the corresponding raster glyph but may use a single monospace font regardless of the original typeface. Inter-word gaps are encoded as explicit space characters rather than TJ kerning, which makes word boundary reconstruction straightforward because the OCR engine has already done it. The primary extraction task is simply reading the `3 Tr` text stream, treating the invisible rendering mode as extractable text, and normalizing the spacing. 
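
The character-ratio half of the confidence signal described under "Confidence signals" might look like the sketch below. The definition of "recognizable" used here (not U+FFFD, and not a control, unassigned, private-use, or surrogate code point) is an assumption of this example, as is the name `recognizable_ratio`; the raster cross-check is out of scope.

```python
import unicodedata

# Assumption: general categories treated as unrecognizable in OCR output.
UNRECOGNIZABLE = {"Cc", "Cn", "Co", "Cs"}


def recognizable_ratio(text):
    """Share of non-whitespace characters that look like real text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0  # an empty layer carries no evidence either way
    good = sum(
        1 for c in chars
        if c != "\ufffd" and unicodedata.category(c) not in UNRECOGNIZABLE
    )
    return good / len(chars)
```

The cut-off between using the layer directly and triggering re-OCR is a tuning decision that this sketch leaves to the caller.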
+ +--- + +## Summary: Generator Detection to Extraction Mode + +| `/Producer` or `/Creator` pattern | Likely source | Key extraction concerns | +|---|---|---| +| `Adobe PDF Library` + `/Creator` Word | Word via PDFMaker | Tc carry-over, TJ word gaps, no StructTree pre-365 | +| `Microsoft: Print To PDF` | Word Save as PDF | As above, lighter kerning | +| `LibreOffice N.N` | LibreOffice Writer | Ligature gaps, TJ space encoding | +| `Adobe PDF Library` + `/Creator` InDesign | InDesign | Optical kerning threshold, article threads | +| `pdfTeX-N.N` | pdflatex | OT1 encoding, partial ToUnicode | +| `XeTeX` / `LuaTeX` | xelatex / lualatex | Good Unicode, math block mapping | +| `GPL Ghostscript` | Ghostscript / dvips | No ToUnicode, Type 3 promotion, Symbol re-encoding | +| `Skia/PDF mXX` + no Creator | Chrome | Fragmented single-char Tj runs | +| `Mozilla/N.N` | Firefox | Fragmented runs, baseline variation | +| `Mac OS X N.N.N Quartz PDFContext` | macOS print | Clean output, thumbnail noise | +| `Google Docs` / `Google Slides` (Creator) | Google export | Header/footer dedup, inline JPEG | +| `ABBYY FineReader` / `tesseract N.N` | Scan + OCR | `3 Tr` layer, confidence scoring | + +Detecting the generator is a one-time operation at parse time and adds negligible overhead. The payoff is that every subsequent heuristic (gap thresholds, ligature substitution, StructTree reliance, span-merging aggressiveness) can be tuned to the actual source rather than a generic average. pdftract should treat `/Producer` detection as a first-class preprocessing step, not an optional diagnostic. 
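
The summary table translates fairly directly into an ordered rule list. The sketch below is one possible shape for that dispatcher: the labels and the function name `classify_generator` are hypothetical, the patterns are taken from the table, and matching is case-insensitive as a defensive assumption.

```python
import re

# (label, producer pattern, creator pattern); the first matching rule wins.
# Creator-qualified rules come first so they can disambiguate shared producers.
RULES = [
    ("google_export", None, r"Google (Docs|Slides)"),
    ("indesign", r"Adobe PDF Library", r"InDesign"),
    ("word_pdfmaker", r"Adobe PDF Library", r"Word"),
    ("word_save_as", r"Microsoft: Print To PDF", None),
    ("libreoffice", r"LibreOffice", None),
    ("pdflatex", r"pdfTeX", None),
    ("xelatex_lualatex", r"XeTeX|LuaTeX", None),
    ("ghostscript", r"GPL Ghostscript", None),
    ("chrome", r"Skia/PDF", None),
    ("firefox", r"Mozilla/", None),
    ("macos_quartz", r"Quartz PDFContext", None),
    ("scan_ocr", r"ABBYY FineReader|tesseract", None),
]


def classify_generator(producer, creator):
    """Map /Producer and /Creator strings to a generator label."""
    producer = producer or ""
    creator = creator or ""
    for label, producer_pat, creator_pat in RULES:
        if producer_pat and not re.search(producer_pat, producer, re.IGNORECASE):
            continue
        if creator_pat and not re.search(creator_pat, creator, re.IGNORECASE):
            continue
        return label
    return "unknown"
```

Keeping the rules as data rather than branching code makes it cheap to add a generator or reorder precedence when a new overlap such as the Skia case turns up.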
diff --git a/docs/research/pdfa-archival-extraction-guarantees.md b/docs/research/pdfa-archival-extraction-guarantees.md new file mode 100644 index 0000000..25a6f91 --- /dev/null +++ b/docs/research/pdfa-archival-extraction-guarantees.md @@ -0,0 +1,111 @@ +# PDF/A Archival Format Extraction Guarantees and Fast-Path Optimization + +## Overview + +PDF/A is a family of ISO standards that constrain the PDF specification to ensure long-term document preservation and reproducibility. For a text extraction library, PDF/A conformance is not merely a metadata curiosity: it is a contractual statement about what the document contains and how it is encoded. Each PDF/A level carries specific structural guarantees that pdftract can exploit to choose faster, more confident extraction paths, skipping heuristics and fallbacks that are only necessary for unconstrained PDFs. + +Understanding what each conformance level actually guarantees, and what it does not, is the foundation for building a reliable fast-path dispatcher. + +--- + +## PDF/A-1: The Baseline Contract + +PDF/A-1 (ISO 19005-1), based on PDF 1.4, defines two conformance levels with meaningfully different extraction implications. + +**Level B (Basic)** establishes the minimum floor: all fonts must be embedded (subsets are permitted), including the font program and font descriptor, so that rendering is never dependent on a system-installed substitute. PDF encryption is prohibited. Transparency is disallowed, and all color must be specified device-independently, either through ICC-based or other CIE-based color spaces or through device color spaces backed by an output intent profile. XMP metadata is required in the document catalog. Level B does not require a logical structure tree, does not mandate ToUnicode CMaps beyond what is needed for rendering, and does not require language tagging. + +For extraction, Level B means that font substitution cannot occur, so any ToUnicode CMap that is present describes the same embedded font program a renderer would use. 
However, because structure trees are optional at Level B, reading order must still be inferred from the glyph position stream. Layout heuristics remain necessary, but the encoding layer is reliable wherever a ToUnicode CMap is present. + +**Level A (Accessible)** adds a full set of structural requirements on top of Level B. All fonts must include complete ToUnicode CMaps covering every glyph present in the document. A logical structure tree must be present and must reflect the natural reading order of the content. Every text run in the content stream must be associated with a structure element. Non-standard glyphs (ligatures, decorative characters, or any glyph whose Unicode mapping is ambiguous) must carry ActualText attributes in the structure tree. Language tags at both the document and element level are required. + +Level A is the gold standard for extraction. The structure tree provides reading order directly, the ToUnicode CMaps provide authoritative character mapping, and ActualText resolves all encoding ambiguities without glyph shape analysis. At Level A, pdftract has everything it needs to perform extraction via pure structure traversal, with zero reliance on layout geometry. + +--- + +## PDF/A-2: Extended Capabilities, Familiar Guarantees + +PDF/A-2 (ISO 19005-2, PDF 1.7) extends the Level A/B distinction and introduces several PDF 1.7 features that were not available in PDF/A-1. + +JPEG2000 (JPX) image compression is now permitted, alongside the compression filters already allowed in PDF/A-1. Optional Content Groups (OCGs) are allowed, with restrictions on the optional content configuration intended to make the default view unambiguous. Transparency and blending modes are permitted, as is the use of PDF 1.7 digital signatures. Arbitrary embedded file attachments are not permitted at this level; the one exception is that PDF/A-1 and PDF/A-2 compliant documents may be embedded as attachments, enabling composite archival bundles. 
+ +For extraction, PDF/A-2 adds one important consideration: OCGs. Although the standard constrains how optional content is configured, individual groups can still be off in the default configuration, so pdftract must be aware that content may be wrapped in optional content markers (`/OC` marked-content sequences and `/OC` entries on XObjects). When traversing structure elements, pdftract should resolve OCG visibility using the default configuration (`/D` entry in the `/OCProperties` dictionary) and skip any content marked as off by default. Ignoring this layer means extracting text that users would not see in a standard rendering. + +The Level A and Level B guarantees carry forward unchanged from PDF/A-1. A PDF/A-2a document still guarantees a complete structure tree, ToUnicode coverage, and ActualText where needed. A PDF/A-2b document still guarantees font embedding without requiring structure. PDF/A-2 also introduces Level U, which adds a Unicode-mapping requirement for all text to Level B without requiring structure. + +--- + +## PDF/A-3: Arbitrary Attachments and the ZUGFeRD Pattern + +PDF/A-3 (ISO 19005-3, PDF 1.7) is structurally identical to PDF/A-2 with one significant addition: it permits arbitrary file attachments via the embedded file mechanism, with any MIME type, provided that each attachment carries an `AFRelationship` key in its file specification dictionary. + +The primary use case driving PDF/A-3 adoption is hybrid invoice formats such as ZUGFeRD and Factur-X, where a human-readable PDF invoice is paired with a machine-readable XML attachment (typically `factur-x.xml` or `ZUGFeRD-invoice.xml`) carrying the same financial data in a structured electronic form. The `AFRelationship` value in these documents is typically `/Alternative`, indicating that the attachment is a full-fidelity alternative representation of the visual content. + +For pdftract, PDF/A-3 introduces an extraction opportunity beyond plain text: when an embedded file with `AFRelationship /Alternative` or `/Source` is detected, the structured data in the attachment may be more semantically rich than what can be extracted from the visual layer. 
pdftract should surface embedded file metadata (file name, MIME type, and `AFRelationship` value) alongside the text extraction result so that callers can decide whether to consume the attachment directly. + +The Level A and Level B extraction guarantees for the visual layer are identical to PDF/A-2. + +--- + +## PDF/A-4: A Restructured Conformance Hierarchy + +PDF/A-4 (ISO 19005-4, PDF 2.0) abandons the Level A/B distinction. It defines a base level with no conformance letter and two optional profiles aligned with PDF 2.0 capabilities. Encryption remains prohibited throughout, as in every earlier part. + +**Base level (PDF/A-4)** requires PDF 2.0 conformance throughout and does not mandate a logical structure tree. For extraction purposes it is analogous to Level B in earlier versions, without the carryover of the A/B vocabulary. + +**Level F (PDF/A-4f)** additionally permits embedded files of any format, similar to PDF/A-3. It does not add structure tree requirements beyond what the base level establishes. + +**Level E (PDF/A-4e)** is intended for engineering and technical documents. It adds allowances specific to technical workflows, such as 3D and rich media content, but does not fundamentally change the extraction guarantee set compared to the base level. + +Notably, PDF/A-4 does not have a dedicated accessibility level equivalent to the old Level A. Accessibility requirements in PDF 2.0 are addressed by the PDF/UA-2 standard (ISO 14289-2) rather than being embedded in the archival standard. A PDF/A-4 document that also satisfies PDF/UA-2 carries both conformance claims in its XMP metadata, and that combination is the PDF/A-4 equivalent of the old Level A for extraction purposes. + +For pdftract, this means that detecting full extraction confidence for a PDF/A-4 document requires checking for both the PDF/A-4 conformance claim and a PDF/UA-2 conformance claim in the XMP metadata. 
A PDF/A-4 document without the PDF/UA-2 pairing should be treated like Level B: font embedding is reliable, but structure-tree extraction cannot be assumed. + +--- + +## Conformance Detection in pdftract + +PDF/A conformance is signalled in two places, and pdftract must inspect both. + +The `/OutputIntents` array in the document catalog contains one or more output intent dictionaries. A PDF/A document that declares an output intent uses the subtype `/S /GTS_PDFA1`. The same subtype is used by PDF/A-2 and PDF/A-3, so it cannot distinguish parts or levels, and a document that uses only device-independent color may omit the output intent entirely. Its presence is a fast structural hint of PDF/A intent, not the authoritative source. + +The authoritative source is the XMP packet in the document catalog's `/Metadata` stream. A conforming PDF/A document must include `pdfaid:part` (the integer version number: 1, 2, 3, or 4) and, where a level applies, `pdfaid:conformance` (a single uppercase letter: A or B for part 1; A, B, or U for parts 2 and 3; E or F, or no letter at all, for part 4) in the XMP namespace `http://www.aiim.org/pdfa/ns/id/`. pdftract should parse these two fields directly from the raw XMP XML rather than relying on an intermediate metadata abstraction, since encoding errors in higher-level parsers can silently misreport conformance. + +When both sources are present, they should be consistent. A `GTS_PDFA1` output intent with no `pdfaid` claim in the XMP is itself an indicator of a non-conformant or incompletely converted document. + +--- + +## The Level A Fast Path + +For a document confirmed to be PDF/A-1a, PDF/A-2a, or PDF/A-3a (or PDF/A-4 with PDF/UA-2), pdftract can activate the structure-tree fast path: + +1. Traverse the structure tree using the `/StructTreeRoot` entry in the document catalog. +2. Walk the tree in document order, collecting `Span`, `P`, `H`, `L`, `Table`, and other structure elements. +3. For each marked content reference (`/MCID`), resolve the corresponding content stream segment and decode characters using the ToUnicode CMap of the active font. +4. 
For any element carrying an `ActualText` attribute, use the ActualText value directly rather than decoding from glyphs. +5. Use the language tags at each element to annotate the extracted text spans. + +This path bypasses glyph shape matching entirely, bypasses OCR (since all text is already encoded), and bypasses all layout heuristics for reading order. In practice, the structure-tree path is dramatically faster — typically an order of magnitude or more — compared to a geometry-based extraction pipeline, because it operates on the logical tree rather than the dense coordinate space of the content stream. + +--- + +## Level B Extraction and Confidence Calibration + +For Level B documents, pdftract takes a partially accelerated path. Font embedding guarantees mean that character decoding via ToUnicode CMaps is reliable — there is no risk that a missing font causes systematic encoding failure. However, the absence of a structure tree means reading order must be reconstructed from the glyph position stream using standard layout analysis: line grouping by vertical proximity, column detection, and reading-order sorting. + +The extraction confidence score for a Level B document should be set lower than for Level A, reflecting the fact that reading order is inferred rather than specified. The character-level accuracy can still be very high, but structural accuracy (paragraph boundaries, column order, footnote placement) is heuristic. + +--- + +## Validating Conformance Claims + +A significant minority of production PDFs claim PDF/A conformance through the `/OutputIntents` or XMP mechanism but would fail validation by a conformance checker. These documents may have been produced by tools that stamp a conformance claim without verifying the underlying document structure. + +pdftract should treat the conformance claim as a hypothesis to be partially verified rather than a fact to be accepted. 
The key checks are: (1) every font used to render text on a page has an embedded font program (`/FontFile`, `/FontFile2`, or `/FontFile3` in its font descriptor); (2) no `/Encrypt` entry present in the file trailer; (3) a `/Metadata` stream present with parseable XMP; and (4) for Level A, a `/StructTreeRoot` entry present in the document catalog. + +If any of these checks fails on a document claiming PDF/A Level A, pdftract should downgrade the extraction path, falling back to Level B treatment if the structure tree is absent, or to unconstrained PDF treatment if fonts appear unembedded. The downgrade should be recorded in the extraction result's metadata so that callers can understand why the fast path was not taken and investigate the source document if needed. + +--- + +## Summary + +PDF/A conformance levels form a spectrum of structural guarantees that pdftract can translate directly into extraction strategy. Level A across all version families (PDF/A-1a through PDF/A-3a, and PDF/A-4 paired with PDF/UA-2) provides the complete extraction contract: structure tree, ToUnicode CMaps, ActualText, and language tags. This enables a pure structure-traversal fast path that is significantly faster and more accurate than geometry-based extraction. Level B and its PDF/A-4 equivalents guarantee font embedding and encoding reliability but require layout heuristics for reading order. Non-conformant documents claiming PDF/A status must be detected through structural cross-checks and routed to the appropriate fallback path rather than silently receiving a fast path they have not earned.
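
The claim parsing and downgrade logic described in the detection and validation sections can be sketched as follows. This is a minimal illustration under stated assumptions: `read_pdfa_claim` and `choose_path` are hypothetical names, the path labels are invented for the example, the structural facts (fonts embedded, encryption, structure tree, PDF/UA-2 claim) are assumed to have been gathered elsewhere, and treating an encrypted file as unconstrained is a choice of this sketch rather than something the text prescribes.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def read_pdfa_claim(xmp_xml):
    """Return (part, conformance) from raw XMP; either may be None."""
    try:
        root = ET.fromstring(xmp_xml)
    except ET.ParseError:
        return None, None  # unparseable XMP: no usable claim
    found = {}
    for element in root.iter():
        for key in ("part", "conformance"):
            qname = "{%s}%s" % (PDFAID_NS, key)
            # XMP allows both element form and attribute form.
            if element.tag == qname and element.text:
                found.setdefault(key, element.text.strip())
            elif qname in element.attrib:
                found.setdefault(key, element.attrib[qname].strip())
    return found.get("part"), found.get("conformance")


def choose_path(claim, fonts_embedded, encrypted, has_struct_tree,
                claims_pdfua2=False):
    """Pick an extraction path, downgrading when the claim is not borne out."""
    part, conformance = claim
    if part is None or encrypted or not fonts_embedded:
        return "unconstrained"
    wants_fast_path = conformance == "A" or (part == "4" and claims_pdfua2)
    if wants_fast_path and has_struct_tree:
        return "structure_tree"
    return "level_b"
```

A caller would typically record both the claimed level and the path actually chosen, so that a downgrade remains visible in the extraction result's metadata.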