Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms
Four new extraction research documents covering PDF portfolio and attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental update structure and xref chaining, PDF/UA tagged PDF deep dive with all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA field extraction without script execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 006dfb286c
commit 5ff918b178
4 changed files with 440 additions and 0 deletions
docs/research/accessibility-and-tagged-pdf-deep-dive.md (81 additions, normal file)
@@ -0,0 +1,81 @@
# Accessibility, Tagged PDF, and PDF/UA Compliance — Deep Technical Dive

## Overview

PDF/UA-1 (ISO 14289-1) defines the minimum technical requirements for universally accessible PDF documents. Where PDF/A concerns itself primarily with archival fidelity and self-containment, PDF/UA imposes a much stricter contract on the logical structure of the document: every piece of content must be accounted for in the structure tree, reading order must be derivable from that tree without ambiguity, and non-text elements must carry textual descriptions. For pdftract, a PDF/UA-conformant document represents the ideal input — a document that provides a complete, authoritative map from structure to content stream bytes, eliminating the need for heuristic reconstruction of reading order or semantic grouping.

---

## 1. PDF/UA Requirements and What They Give pdftract

Under PDF/UA-1, every content item in a page's content stream must either be tagged — meaning it appears as the marked content of a structure element reachable from the structure tree — or explicitly marked as an artifact. Nothing is allowed to be anonymous. This requirement transforms the structure tree from an optional annotation layer into a complete index of all page content.
For extraction, this has three direct consequences. First, the structure tree can be traversed depth-first to yield a guaranteed logical reading order that is independent of content stream ordering. In a non-tagged PDF, the content stream order reflects the painter's model (back-to-front rendering order), which rarely matches human reading order, particularly in multi-column layouts, footnoted pages, or documents with sidebars. PDF/UA eliminates this ambiguity. Second, artifacts can be identified and excluded categorically rather than heuristically. Third, the combination of ActualText, Alt, and E attributes on structure elements provides machine-readable text alternatives for content that would otherwise require glyph-to-Unicode mapping or OCR. pdftract should treat PDF/UA conformance as a capability flag that, when set, unlocks a higher-confidence extraction path.

---

## 2. Standard Structure Types and Semantic Extraction

The PDF specification defines a fixed set of standard structure types. pdftract must recognize all of them and map each to an appropriate extraction result. Grouping elements — Document, Part, Art, Sect, Div — establish the document hierarchy and produce no direct text output but define scope for attribute inheritance and section boundary detection. BlockQuote signals an indented quotation; extraction should preserve it as a distinct block with a semantic role annotation. Caption, when it is a child of Table or Figure, binds a text string to a non-text element. TOC and TOCI represent the table of contents and its individual entries; pdftract can reconstruct a structured outline from these without parsing page numbers from the visual layout.
Index and NonStruct are notable edge cases. Index groups index entries but does not itself constitute body text. NonStruct is a structure element with no semantic role — it exists purely as a grouping convenience and should be treated transparently, passing its children's content through without adding semantic meaning. Private is similar but signals proprietary structure; extraction should recurse into it without assuming any meaning.
Among the inline and block content types: P is a paragraph; H is a generic heading whose level is implied by its nesting depth, while H1–H6 are headings at an explicit outline level; L, LI, Lbl, and LBody form the list model, where Lbl holds the bullet or number and LBody holds the list item's content; Table, TR, TH, and TD implement the table model, with optional THead, TBody, and TFoot groupings for header, body, and footer row groups. Span groups inline content; Quote marks an inline quotation; Note is a footnote or endnote and should be extracted separately from the paragraph that references it; Reference is a citation; BibEntry is a bibliographic entry. Code marks programmatic text. Figure, Formula, and Form are non-text elements for which Alt text is the primary extraction target.
For every element type, pdftract's structure tree walker must map the tag name to one of these categories, apply the appropriate block or inline model, and produce a typed output node rather than a flat text string.

---

## 3. MCID: The Link from Structure Tree to Content Stream

Marked Content Identifiers (MCIDs) are the mechanism that connects structure tree leaf nodes to the actual bytes in a content stream. A structure element's /K entry may contain MCID references, either as bare integers or as marked-content reference dictionaries (/Type /MCR with an /MCID key and an optional /Pg key naming the page). On the content stream side, the operators BDC (Begin Marked Content with a property list) and BMC (Begin Marked Content without properties) open marked content sequences, and EMC closes them. A BDC operator with an /MCID entry in its property list creates an identified marked content sequence; the MCID value must match a reference in the structure tree and is unique only within its own content stream.
pdftract's extraction pipeline must build a two-way index: a forward map from (page, MCID) to the content stream byte range, and a reverse map from that byte range back to the structure element. The actual text data (glyph codes, glyph widths, font encoding) is extracted from the content stream in the usual manner, but the order in which it is assembled is determined by the structure tree traversal, not by content stream position. This is the core inversion that distinguishes tagged PDF extraction from untagged PDF extraction. For each leaf structure element, pdftract collects all MCID references, resolves each to a content stream segment, extracts the text from each segment using the active font's encoding and ToUnicode CMap, and concatenates the results in the order the references appear in the element's /K array, which need not be ascending MCID order.
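
The collection step can be sketched as follows. This is a minimal illustration, not pdftract's actual code: structure elements are modeled as plain Python dicts, and `iter_mcids`, `element_text`, and the `segments` forward map are hypothetical names introduced here. The sketch assumes references are assembled in the order they appear in /K.

```python
from typing import Dict, Iterator, Optional, Tuple

Key = Tuple[Optional[int], int]  # (page, MCID)


def iter_mcids(elem: dict, page: Optional[int] = None) -> Iterator[Key]:
    """Yield (page, mcid) for an element's direct content, in /K order."""
    page = elem.get("Pg", page)          # /Pg on the element sets the default page
    kids = elem.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]                    # /K may hold a single kid instead of an array
    for kid in kids:
        if isinstance(kid, int):         # bare integer MCID on the default page
            yield (page, kid)
        elif isinstance(kid, dict) and kid.get("Type") == "MCR":
            yield (kid.get("Pg", page), kid["MCID"])
        # child structure elements are left to the tree walker


def element_text(elem: dict, segments: Dict[Key, str]) -> str:
    """Concatenate already-decoded segment text in /K order (assumed map)."""
    return "".join(segments.get(key, "") for key in iter_mcids(elem))
```

Keying the forward map by (page, MCID) rather than by MCID alone reflects that an MCID only identifies a sequence within one content stream.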

---

## 4. ActualText: Overriding Character Codes

The ActualText attribute, when present on a structure element or on a marked content sequence property dictionary, provides a verbatim Unicode string that replaces the decoded character sequence from the content stream. pdftract must check for ActualText before performing any glyph-to-Unicode mapping on a segment. If present, the stream bytes are treated as opaque rendering instructions, and the ActualText value is the extracted text.
ActualText is critical for ligatures (the ﬁ ligature may be drawn as a single glyph whose code maps to U+FB01 or to no Unicode value at all; ActualText ensures that the two characters "fi" appear in extraction), for accessible mathematics (equation renderers often encode symbols in private-use areas and provide ActualText with the correct Unicode representation), and for stylized text (decorative fonts with non-standard encodings). pdftract's MCID resolver should apply an ActualText check as the first step before falling through to encoding-based extraction.
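
A minimal sketch of that precedence rule, assuming the property list has already been parsed into a dict and that glyph decoding is supplied as a callable. Both `segment_text` and `decode_glyphs` are hypothetical names used only for illustration.

```python
from typing import Callable


def segment_text(props: dict, decode_glyphs: Callable[[], str]) -> str:
    """Return ActualText when present; otherwise decode the glyph codes."""
    actual = props.get("ActualText")
    if actual is not None:
        # The stream bytes are treated as opaque rendering instructions.
        return actual
    return decode_glyphs()
```

Passing the decoder as a callable means the potentially expensive glyph-to-Unicode mapping is never run for segments that carry ActualText.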

---

## 5. Alt Text: Extraction from Non-Text Elements

The Alt attribute on Figure, Formula, Form, and other non-text structure elements provides a text alternative intended for screen reader users. For pdftract, Alt text is a first-class extraction target. When a Figure element is encountered, the extraction result should include the Alt value as a text node annotated with a role of "alt-text" rather than silently dropping it or treating the element as empty. This enables downstream consumers — search indexers, summarizers, accessibility auditors — to include figure descriptions in their text model.

---

## 6. E (Expansion) Attribute: Abbreviation Resolution

The E attribute on Span elements provides an expansion for an abbreviation or acronym. If a Span element containing the text "WHO" carries /E "World Health Organization", the expansion is the semantically correct text for extraction contexts that prioritize meaning over surface form. pdftract should expose both: the surface text (from the content stream or ActualText) and the expansion (from E), allowing callers to choose which to use. A technical extraction mode might return "WHO" for exact-match indexing; a semantic mode would substitute or append "World Health Organization".

---

## 7. RoleMap: Resolving Custom Structure Types

PDF allows documents to define custom structure element names through the /RoleMap dictionary in the structure tree root (the /StructTreeRoot entry of the document catalog), mapping each custom name to a standard type. A document might use /Section mapped to /Sect or /Callout mapped to /Note. pdftract must resolve all structure element names through the RoleMap before applying extraction semantics. The resolution algorithm is: if the element name appears in RoleMap, substitute the mapped name and repeat until a standard type is reached or a cycle is detected. Cycle detection is required because malformed documents can create circular RoleMap entries. Unresolvable names should be treated as NonStruct, that is, as transparent grouping with no semantic role.
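
The resolution loop might look like the sketch below. `resolve_role` and `STANDARD_TYPES` are illustrative names; the set lists only the types named in section 2 and is not a complete table. The sketch also assumes that a name which is already standard is returned without consulting the RoleMap.

```python
STANDARD_TYPES = frozenset({
    "Document", "Part", "Art", "Sect", "Div", "BlockQuote", "Caption",
    "TOC", "TOCI", "Index", "NonStruct", "Private", "P", "H",
    "H1", "H2", "H3", "H4", "H5", "H6", "L", "LI", "Lbl", "LBody",
    "Table", "TR", "TH", "TD", "THead", "TBody", "TFoot", "Span",
    "Quote", "Note", "Reference", "BibEntry", "Code",
    "Figure", "Formula", "Form",
})


def resolve_role(name: str, role_map: dict) -> str:
    """Follow RoleMap entries until a standard type, a dead end, or a cycle."""
    seen = set()
    while name not in STANDARD_TYPES:
        if name in seen or name not in role_map:
            return "NonStruct"   # cycle or unmapped name: transparent grouping
        seen.add(name)
        name = role_map[name]
    return name
```

The `seen` set bounds the loop at one pass per distinct name, so a circular map terminates instead of spinning.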

---

## 8. Artifact Marking: Excluding Non-Body Content

Content marked with the /Artifact tag (using BMC or BDC with /Artifact as the tag name) falls outside the structure tree by definition. ISO 32000-1 defines the artifact types Pagination, Layout, Page, and Background through the /Type entry of the artifact's property list; Pagination artifacts may additionally carry a /Subtype of Header, Footer, or Watermark. pdftract must detect artifact-marked content sequences and route them to a separate extraction bucket, not the body text stream. For most extraction use cases, headers and footers are noise; providing them as optional annotated output rather than suppressing them entirely gives callers the flexibility to include or exclude them. Background, Layout, and Page artifacts should be excluded by default since they represent decorative or layout elements with no textual value.
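
Routing can be sketched with a stack of open marked content tags. The token format here, a list of `(operator, operands)` tuples with text already decoded, is an assumption made for illustration, as is the name `route_text`; only the `Tj` operator is handled to keep the example short.

```python
from typing import List, Tuple


def route_text(ops: List[Tuple[str, list]]) -> Tuple[List[str], List[str]]:
    """Split shown text into (body, artifacts) by enclosing marked content."""
    stack: List[str] = []              # tags of the currently open sequences
    body: List[str] = []
    artifacts: List[str] = []
    for op, operands in ops:
        if op in ("BMC", "BDC"):
            stack.append(operands[0])  # first operand is the tag name
        elif op == "EMC":
            if stack:                  # tolerate an unbalanced EMC
                stack.pop()
        elif op == "Tj":
            target = artifacts if "Artifact" in stack else body
            target.append(operands[0])
    return body, artifacts
```

Checking the whole stack rather than only its top means text nested inside an artifact stays in the artifact bucket even when an inner sequence carries another tag.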

---

## 9. Attribute Inheritance in the Structure Tree

PDF structure attributes propagate from parent elements to descendants unless overridden. The /Lang attribute is the most consequential for extraction: a document in English with a single /Lang "en-US" on the Document root propagates that language to every element. A /Sect element in French within an English document carries /Lang "fr-FR", which propagates to all /P and /Span descendants within that section. pdftract must implement attribute inheritance as a stack-based operation during structure tree traversal, pushing the active attribute set when entering an element and popping it on exit. Inherited attributes that matter for extraction include Lang (for language-aware text processing and hyphenation), WritingMode (for right-to-left or vertical text assembly), and any custom attributes conveyed via class maps.
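
A small sketch of inheritance during traversal, using the same dict model of structure elements as an assumption. `walk` and `INHERITED_KEYS` are hypothetical names, and the recursion's call stack plays the role of the attribute stack: each call works on its own copy, so leaving an element implicitly pops its overrides.

```python
from typing import List, Optional, Tuple

INHERITED_KEYS = ("Lang", "WritingMode")  # illustrative subset


def walk(elem: dict, inherited: Optional[dict] = None,
         out: Optional[list] = None) -> List[Tuple[str, dict]]:
    """Depth-first walk returning (tag, effective attributes) per element."""
    out = [] if out is None else out
    attrs = dict(inherited or {})        # "push": copy the active set
    for key in INHERITED_KEYS:
        if key in elem:
            attrs[key] = elem[key]       # local value overrides the inherited one
    out.append((elem["S"], attrs))
    kids = elem.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]
    for kid in kids:
        if isinstance(kid, dict) and "S" in kid:
            walk(kid, attrs, out)        # returning from the call is the "pop"
    return out
```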

---

## 10. Fallback for Partially-Tagged PDFs

Many PDFs claim PDF/UA conformance but deliver incomplete structure trees. Common failure modes include: structure elements with no MCID references (orphaned nodes with no content), content stream segments with MCIDs that have no corresponding structure tree entry (orphaned content), and content that is neither marked as an artifact in the content stream nor tagged in the structure tree.
pdftract's fallback strategy must be layered. The first pass attempts full structure-tree-driven extraction: resolve all MCIDs, collect all content, and verify that all content stream text operators are accounted for. If unaccounted content remains (text-drawing operators not associated with any MCID or artifact), the fallback activates. Untagged text segments are extracted using a geometric ordering heuristic: sort by vertical position (descending) and then horizontal position (ascending), grouping by proximity into synthetic paragraph blocks. These blocks are emitted with a provenance annotation indicating they originated from the fallback path, allowing callers to treat them with reduced confidence. If the structure tree is so incomplete that fewer than a threshold percentage of content stream text is accounted for, pdftract should demote the document to untagged-extraction mode entirely rather than producing a mixed output where structured and unstructured content is interleaved without clear boundaries.
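
The two decisions in this strategy, how to order untagged fragments and which pipeline to engage, can be sketched as below. The function names, the fragment tuple layout, and the 0.5 default threshold are all assumptions for illustration; the text above does not fix a threshold value.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text) in PDF user space, y grows upward


def order_untagged(fragments: List[Fragment]) -> List[Fragment]:
    """Sort top-to-bottom, then left-to-right."""
    return sorted(fragments, key=lambda f: (-f[1], f[0]))


def choose_mode(tagged_ops: int, total_ops: int, threshold: float = 0.5) -> str:
    """Pick a pipeline from the share of text operators covered by MCIDs."""
    if total_ops == 0 or tagged_ops >= total_ops:
        return "structure"           # everything accounted for
    if tagged_ops / total_ops >= threshold:
        return "hybrid"              # structure tree plus annotated fallback blocks
    return "untagged"                # demote to heuristic extraction
```

A production version would also need the proximity grouping into paragraph blocks, which is omitted here.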
The practical implication is that pdftract maintains two extraction pipelines sharing a common content stream reader: a structure-tree-driven pipeline for tagged documents and a heuristic pipeline for untagged documents, with a validation pass that determines which pipeline to engage and whether a hybrid fallback is appropriate.

docs/research/incremental-updates-and-versioning.md (67 additions, normal file)
@@ -0,0 +1,67 @@
# PDF Incremental Updates, Object Revisions, and Version-Aware Extraction

## Overview

PDF documents are not always written in a single pass. The format was designed to support non-destructive modification: content can be appended to the end of a file without touching the bytes that precede it. This mechanism, known as an incremental update, is how form fills, annotation additions, and digital signatures are layered onto an existing document. A parser that treats the first `%%EOF` as the end of meaningful content will silently discard everything after it — form data, reviewer comments, correction overlays, and signature metadata included. pdftract must parse the complete file as a sequence of revisions and merge all update layers into a coherent object table before any extraction begins.

## Incremental Update Structure

Each PDF revision consists of three parts appended sequentially: a body section of new or modified objects, a cross-reference (xref) section describing their byte offsets, and a trailer dictionary followed by `startxref` and `%%EOF`. When a file is incrementally updated, the original body, xref, and trailer are left intact; the new revision is concatenated at the end. A file may therefore contain multiple `%%EOF` markers, each marking the end of one revision.
Each appended trailer carries a `/Prev` key pointing to the previous revision's xref. This creates a backward-pointing chain from the most recent revision to the original. Walking the chain from the final `startxref` and collecting all object entries — with later definitions superseding earlier ones — yields the complete, current object table.

## Why This Matters for Extraction

Interactive PDF forms store field values as objects. When a user fills out a form in a reader application, the reader does not rewrite the original document; it appends an incremental update containing the modified AcroForm field objects. Annotations — highlights, sticky notes, ink marks — are added in exactly the same way. Revision corrections submitted by document management systems follow the same pattern.
A parser that halts at the first `%%EOF` will see the blank form as it was distributed, with no field values populated. It will see none of the annotations. Any extraction output will be incomplete, and because the document will otherwise appear structurally valid, the failure will be silent. pdftract must locate all `%%EOF` markers in the file, enumerate every appended revision, and resolve the full xref chain before building the object map that drives content extraction.

## Cross-Reference Table Chaining

The traditional cross-reference table is a plain-text section beginning with the keyword `xref`, containing one or more subsections headed by a first-object-number and count, followed by 20-byte entries encoding byte offset, generation number, and an in-use or free flag.
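
As a concrete illustration, one 20-byte entry can be decoded with fixed slices: ten digits of offset, a space, five digits of generation, a space, the `n` or `f` keyword, and a two-byte end-of-line. `parse_xref_entry` is a hypothetical helper name.

```python
from typing import Tuple


def parse_xref_entry(entry: bytes) -> Tuple[int, int, bool]:
    """Decode one classic xref entry into (offset, generation, in_use)."""
    if len(entry) != 20:
        raise ValueError("xref entries are exactly 20 bytes long")
    offset = int(entry[0:10])        # 10-digit byte offset (or next free object)
    generation = int(entry[11:16])   # 5-digit generation number
    flag = entry[17:18]              # b"n" = in use, b"f" = free
    if flag not in (b"n", b"f"):
        raise ValueError("unexpected entry keyword")
    return offset, generation, flag == b"n"
```

Real files sometimes violate the fixed width, so a tolerant parser would likely fall back to whitespace splitting when the length check fails.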
Reconstructing the complete object table requires merging all xref sections in reverse chronological order. The algorithm starts at the final `startxref` offset, reads the xref table there, records its entries, reads the `/Prev` offset from its trailer, and repeats until `/Prev` is absent. Entries from a newer revision are never overwritten. The result is a map from object number to the byte offset of its most recent — and therefore correct — definition.
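
The merge can be sketched as a loop over already-parsed sections. The `sections` mapping (xref offset to a pair of entries and `/Prev` value) and the name `merge_xref_chain` are assumptions for illustration. The visited-offset guard is an addition not stated above: it protects against a malformed `/Prev` chain that loops back on itself.

```python
from typing import Dict, Optional, Tuple

Entry = Tuple[int, int]                          # (byte offset, generation)
Section = Tuple[Dict[int, Entry], Optional[int]]  # (entries, /Prev offset or None)


def merge_xref_chain(sections: Dict[int, Section], start: int) -> Dict[int, Entry]:
    """Walk /Prev from the newest section; the first definition seen wins."""
    table: Dict[int, Entry] = {}
    visited = set()
    offset: Optional[int] = start
    while offset is not None:
        if offset in visited:
            raise ValueError("circular /Prev chain")
        visited.add(offset)
        entries, prev = sections[offset]
        for objnum, entry in entries.items():
            table.setdefault(objnum, entry)  # newer entries are never overwritten
        offset = prev
    return table
```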

## Cross-Reference Streams (PDF 1.5 and Later)

Beginning with PDF 1.5, the cross-reference table may be replaced by a cross-reference stream: a stream object with `/Type /XRef`. This compresses well and is used by most modern authoring tools. pdftract must handle both formats and be prepared for a mix within the same file — an original with traditional xref tables may be updated by a tool that appends a compressed xref stream.
Parsing a cross-reference stream requires reading three dictionary fields. The `/W` array contains three integers specifying the byte widths of each field in the binary stream entries: the entry type, the offset or object-stream index, and the generation number or index within an object stream. The `/Index` array, if present, specifies the object number ranges described by the stream entries; its absence implies a single range starting at zero with length equal to the `/Size` value. Entry types are defined as: 0 for a free object (the offset field holds the next free object number), 1 for an uncompressed in-use object (the offset field is the byte offset of the object in the file), and 2 for an object compressed inside an object stream (the offset field is the object number of the containing stream, and the generation field is the object's index within that stream). The `/Prev` chaining mechanism is identical to traditional xref tables.
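
A sketch of the decoding step, operating on the already-decompressed stream data. `parse_xref_stream` is a hypothetical name. One detail beyond the description above: when the first `/W` width is zero the type field is absent, and the sketch treats such entries as type 1, which is the default the PDF specification gives for that case.

```python
from typing import Dict, List, Optional, Tuple


def parse_xref_stream(data: bytes, w: List[int], size: int,
                      index: Optional[List[int]] = None) -> Dict[int, Tuple[int, int, int]]:
    """Decode binary xref stream entries into {object number: (type, f2, f3)}."""
    if index is None:
        index = [0, size]                 # default: one range covering 0..Size-1
    entries: Dict[int, Tuple[int, int, int]] = {}
    pos = 0
    for first, count in zip(index[0::2], index[1::2]):
        for objnum in range(first, first + count):
            if pos + sum(w) > len(data):
                raise ValueError("xref stream is shorter than /Index implies")
            fields = []
            for width in w:
                # big-endian integer; a zero-width field decodes to 0
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
                pos += width
            if w[0] == 0:
                fields[0] = 1             # absent type field defaults to type 1
            entries[objnum] = (fields[0], fields[1], fields[2])
    return entries
```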

## Object Streams (ObjStm)

PDF 1.5 also introduced object streams, stream objects of type `/ObjStm`, which pack multiple PDF objects into a single compressed stream. This reduces file size but adds a layer of indirection for any parser that needs to locate an individual object.
An object stream's dictionary contains `/N` (count of packed objects) and `/First` (byte offset within the decoded stream where the first object's content begins). Before `/First` is a plain-text index of N pairs, each giving an object number and a byte offset relative to `/First`. pdftract decompresses the stream, parses this index, and extracts any individual object on demand. Objects inside an object stream carry no generation number: their generation is implicitly zero, and a type 2 xref entry has no generation field, since its third field holds the object's index within the stream.
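
The index parse can be sketched as follows, working on the decoded stream bytes. `objstm_index` is a hypothetical name, and the sketch assumes the offsets in the index are in increasing order so that each object ends where the next begins.

```python
from typing import Dict


def objstm_index(decoded: bytes, n: int, first: int) -> Dict[int, bytes]:
    """Map each packed object number to its raw (unparsed) bytes."""
    numbers = decoded[:first].split()
    if len(numbers) < 2 * n:
        raise ValueError("object stream index is shorter than /N implies")
    pairs = [(int(numbers[2 * i]), int(numbers[2 * i + 1])) for i in range(n)]
    objects: Dict[int, bytes] = {}
    for i, (objnum, rel) in enumerate(pairs):
        start = first + rel                       # offsets are relative to /First
        end = first + pairs[i + 1][1] if i + 1 < n else len(decoded)
        objects[objnum] = decoded[start:end].strip()
    return objects
```

An on-demand variant would keep only the `(objnum, start, end)` triples and slice the bytes when a specific object is requested.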

## Tracking Object Revisions

Because xref entries record both object number and generation number, pdftract can determine not just the current state of an object but the sequence in which it changed. Each time an object is deleted and its number recycled, the generation number increments. By collecting xref entries from all revisions — not only the most recent — pdftract can present a complete revision history: which revision introduced each generation, its byte offset, and when the object was freed.
This per-object revision tracking is particularly useful for form field extraction. A document distributed to a reviewer, filled, annotated by a second reviewer, and corrected by a third will contain at least three incremental updates. Each form field's value object will appear in the xref of the revision that last modified it. By cross-referencing object revision data with update timestamps from the document's XMP metadata or Info dictionary, pdftract can report when each field was filled and distinguish original content from subsequent corrections.

## Deleted Objects and the Free List

When an object is deleted in an incremental update, the xref gains a free-list entry with an incremented generation number. pdftract must tolerate indirect references to deleted objects without crashing — the correct behavior is to return a null object, consistent with the PDF specification's treatment of free-object references. The content model built on the raw object map must handle null gracefully wherever an optional indirect reference may resolve to a deleted object.

## Signature ByteRange and Unsigned Content Regions

A PDF digital signature uses the `/ByteRange` entry in the signature dictionary to specify which bytes of the file the cryptographic digest covers. ByteRange is an array of four integers: offset and length of the range before the signature value, and offset and length of the range after it. The signature value itself — a hex-encoded blob — occupies the gap between those two ranges and is excluded from the digest.
When a signed PDF receives an incremental update, the appended bytes fall entirely outside the original signature's ByteRange. pdftract identifies which byte ranges are covered by each signature and which content was added afterward, allowing callers to distinguish content that was signed at a point in history from content added post-signing — a distinction critical in legal, compliance, and forensic extraction contexts.
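
The arithmetic behind that classification is small enough to sketch directly. `signature_regions` is a hypothetical name, and ranges are returned as half-open `(start, end)` byte intervals.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end)


def signature_regions(byte_range: List[int],
                      file_size: int) -> Tuple[List[Span], Span, Optional[Span]]:
    """Return (covered spans, signature-value gap, bytes appended after signing)."""
    a_off, a_len, b_off, b_len = byte_range
    covered = [(a_off, a_off + a_len), (b_off, b_off + b_len)]
    gap = (a_off + a_len, b_off)          # holds the hex-encoded signature value
    signed_end = b_off + b_len
    appended = (signed_end, file_size) if signed_end < file_size else None
    return covered, gap, appended
```

An object whose byte offset falls inside the `appended` span was added by a later incremental update and is not covered by this signature.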

## Repair Parsing for Broken Incremental Updates

Incremental updates can be malformed. A partially written update may have a corrupted xref section or a `startxref` pointing to the wrong offset. Network truncation, file system corruption, or authoring tool bugs can produce files where the normal chain-following algorithm fails before reaching all revisions.
pdftract's repair strategy is to fall back to a full-file object scan. The scanner reads linearly, identifying all byte sequences matching the pattern `N G obj`, recording each object's byte offset, and similarly locating all `xref` and `startxref` markers. From this inventory it reconstructs a best-effort object table, applying the same later-definition-wins rule. This recovers content from all update layers even when xref metadata is unreliable, at the cost of a full sequential read.
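
A minimal version of the object scan, using a byte regular expression. `scan_objects` and `OBJ_RE` are illustrative names. The sketch ignores one real hazard: the `N G obj` pattern can also occur by chance inside stream data, so a robust scanner would need to skip over stream bodies or validate each hit.

```python
import re
from typing import Dict, Tuple

# digits, whitespace, digits, whitespace, "obj"; the lookbehind avoids
# matching the tail of a longer number such as the "1" in "11 0 obj"
OBJ_RE = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def scan_objects(data: bytes) -> Dict[int, Tuple[int, int]]:
    """Map object number to (generation, byte offset); later definitions win."""
    table: Dict[int, Tuple[int, int]] = {}
    for match in OBJ_RE.finditer(data):
        objnum, generation = int(match.group(1)), int(match.group(2))
        table[objnum] = (generation, match.start())  # overwrite earlier hits
    return table
```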

## Version History Extraction

pdftract can expose document history as a linear sequence of revisions, each containing the objects modified or added in that update. This is useful in forensic extraction — determining what a document said at a specific point in time — and in contract comparison workflows where counterparties exchanged multiple revisions of the same file rather than using a diff-capable format.
The performance tradeoff is real: constructing the complete object table is O(total xref entries across all revisions), which is manageable, but materializing each individual revision requires re-running the page content extraction pipeline once per revision. pdftract exposes revision history as a lazy iterator: callers can request a specific revision's content without materializing all others. For forensic use cases where every revision must be inspected, full materialization of all revisions is available but documented as an expensive operation on deeply revised files.

## Summary

Correct extraction from incrementally updated PDFs requires following the complete `startxref` chain, merging all xref sections and xref streams in reverse chronological order, locating objects at raw byte offsets and inside compressed object streams, handling free-list entries without errors, and identifying byte regions covered by digital signatures. pdftract implements all of these mechanisms as prerequisites to content extraction, ensuring that form fills, annotations, and corrections appended as incremental updates are never silently omitted from output.

docs/research/javascript-and-interactive-pdf-extraction.md (95 additions, normal file)
@@ -0,0 +1,95 @@
# JavaScript, Interactive Elements, and Dynamic Content in PDFs

## Overview

Interactive PDFs present a unique extraction challenge: they contain content whose visible form depends on runtime behavior — JavaScript execution, form field state, and in some cases entirely XML-driven rendering. pdftract must correctly extract static field values and form content from these documents without executing any code and without misrepresenting computed or absent values as extracted text.

---

## PDF JavaScript: Structure and Extraction Implications

PDF supports JavaScript at multiple levels of the document structure, governed by ISO 32000 and the Adobe Acrobat JavaScript API extensions. At the document level, named JavaScript functions are registered in the document catalog's `/Names` tree under the `/JavaScript` key. Each entry maps a name string to a script object containing the source text. These functions may be called by event-driven actions elsewhere in the document.
JavaScript actions can also appear as `/OpenAction` in the document catalog, which triggers script execution when the file is opened. Individual pages carry `/AA` (Additional Actions) dictionaries with entries for page open (`/O`) and page close (`/C`) events. Form fields carry their own `/AA` dictionaries covering events such as keystroke (`/K`), validation (`/V` within `/AA`, distinct from the value key `/V` in the field dictionary), format (`/F`), and calculate (`/C`).
From pdftract's perspective, JavaScript has no effect on the byte-level content of the PDF. The page content streams, font tables, glyph sequences, and form field dictionaries are static data structures laid down at write time. JavaScript manipulates the viewer's runtime state — it can set field values, trigger document actions, or produce alert dialogs — but it cannot alter the bytes that pdftract reads during extraction. This means pdftract's correct posture is to treat JavaScript as inert: never execute it, never interpret it as extractable text, and never depend on it to understand field values.
However, pdftract must not crash when encountering JavaScript objects. A document containing `/JavaScript` in its names tree or `/OpenAction` pointing to a script action should be flagged as containing JavaScript (a metadata annotation on the extraction result), and those script objects should be silently skipped during content traversal.

---

## AcroForm Field Extraction

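AcroForm is the traditional PDF interactive form model. Form fields are described by field dictionaries reachable from the `/Fields` array of the `/AcroForm` dictionary in the document catalog; each terminal field is displayed through one or more widget annotations, and a field with a single widget may merge the two dictionaries into one. Each field has a `/FT` (field type) key and a `/V` (value) key, either of which may be inherited from an ancestor field. The `/V` key holds the current field value as a PDF object (string, name, or array, depending on field type).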

### Field Types

**Text fields** (`/Tx`) store their value as a UTF-16BE or PDFDocEncoding string in `/V`. Multiline fields may contain embedded newline characters. pdftract reads `/V` directly.
**Choice fields** (`/Ch`) represent dropdowns and listboxes. The selected value is stored in `/V` as the export value string, while the display label lives in the `/Opt` array. Each entry in `/Opt` is either a string (export value equals display label) or a two-element array `[export_value, display_label]`. pdftract should extract both the raw `/V` export value and the resolved display label by looking up `/Opt`.
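
The `/Opt` lookup can be sketched as below, with PDF strings modeled as Python `str` and option pairs as two-element lists. `choice_label` is a hypothetical helper name.

```python
from typing import List, Optional, Union

Option = Union[str, List[str]]  # "label" or [export_value, display_label]


def choice_label(value: str, opt: List[Option]) -> Optional[str]:
    """Resolve a /V export value to its display label, or None if not listed."""
    for entry in opt:
        if isinstance(entry, (list, tuple)):
            export, label = entry       # two-element form
        else:
            export = label = entry      # export value doubles as the label
        if export == value:
            return label
    return None
```

Returning `None` for an unlisted value lets the caller still report the raw export value, which can legitimately happen with editable combo boxes.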
**Button fields** (`/Btn`) encompass three distinct subtypes determined by the `/Ff` (field flags) bitmask. Bit 17 marks the field as a pushbutton. Bit 16 marks it as radio. Bit positions are 1-based, so the corresponding masks are 0x10000 and 0x8000. A button with neither bit set is a checkbox. Pushbuttons carry no intrinsic value; they trigger actions when clicked, and pdftract should record them as action elements with no value. Radio button groups store the name of the currently selected option in `/V` as a PDF Name object. Checkboxes store their state as a Name: `/Yes` (or a custom on-state name defined in the `/AP` (Appearance) dictionary) when checked, and `/Off` when unchecked. To determine what "on" means for a given checkbox, pdftract inspects the keys of the `/AP/N` (Normal appearance) sub-dictionary: the key that is not `/Off` is the on-state name.
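
The flag test and the on-state lookup can be sketched as follows. PDF bit positions are 1-based, so bit 16 corresponds to `1 << 15` and bit 17 to `1 << 16`. The function and constant names are hypothetical, and PDF Name objects are modeled as plain strings without the leading slash.

```python
from typing import Iterable, Optional

FLAG_RADIO = 1 << 15        # bit position 16
FLAG_PUSHBUTTON = 1 << 16   # bit position 17


def button_kind(ff: int) -> str:
    """Classify a /Btn field from its /Ff bitmask."""
    if ff & FLAG_PUSHBUTTON:
        return "pushbutton"
    if ff & FLAG_RADIO:
        return "radio"
    return "checkbox"


def checkbox_on_state(ap_n_keys: Iterable[str]) -> Optional[str]:
    """Return the /AP/N key that is not Off, i.e. the on-state name."""
    for key in ap_n_keys:
        if key != "Off":
            return key
    return None


def checkbox_checked(value: Optional[str]) -> bool:
    """A checkbox is checked when /V is present and is not Off."""
    return value is not None and value != "Off"
```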
**Signature fields** (`/Sig`) hold a signature dictionary in `/V` when signed; when the field is unsigned, `/V` is absent. pdftract should note the presence or absence of a signature value without attempting to validate cryptographic content.

### Field Hierarchy and Name Construction

AcroForm fields are organized in a tree. Each field dictionary may contain a `/Kids` array pointing to child fields, and each field has a `/T` (partial name) key. The fully qualified field name is constructed by concatenating `/T` values from ancestor to descendant, separated by periods. Parent fields may omit `/FT` and serve only as containers for organizing children.
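
Name construction can be sketched by walking `/Parent` links upward from a field. The dict model and the name `qualified_name` are assumptions for illustration; the identity-based guard is an extra safeguard against a malformed parent loop.

```python
def qualified_name(field: dict) -> str:
    """Join /T partial names from the root ancestor down to this field."""
    parts = []
    seen = set()
    node = field
    while node is not None and id(node) not in seen:
        seen.add(id(node))               # stop if /Parent links form a loop
        if "T" in node:                  # widget-only kids have no /T
            parts.append(node["T"])
        node = node.get("Parent")
    return ".".join(reversed(parts))
```

Skipping nodes without `/T` means a widget annotation that is a kid of a terminal field reports the same qualified name as that field.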
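Certain attributes are inherited downward through the hierarchy, notably `/FT`, `/Ff`, `/V`, and `/DA` (the default appearance string, which specifies font and color for text rendering). pdftract must walk the ancestor chain to resolve these values when they are absent from a leaf field dictionary. If no ancestor supplies `/DA`, the document-wide default in the `/AcroForm` dictionary applies; `/DR` (the default resource dictionary) is likewise defined on the `/AcroForm` dictionary rather than on individual fields.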

### Rich Text in Text Fields

When a text field contains formatted content, the `/RV` (rich value) key holds rich text as an XML string conforming to a subset of XHTML. This XML uses `<span>` elements with `style` attributes to encode bold, italic, font size, and color. When `/RV` is present, pdftract should extract text from the `/RV` XML rather than from `/V`, since `/V` in this case is a plain-text fallback that loses all inline formatting. The `/RV` payload should be parsed as XML, with text content extracted from element nodes and formatting annotations preserved in the extraction metadata.
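
A sketch of the `/RV` walk using the standard library's `xml.etree.ElementTree`. `rich_text_runs` is a hypothetical name, and the style handling is simplified: a nested element's `style` attribute replaces the inherited one rather than being merged with it. Since `/RV` comes from untrusted files, a production parser would likely want a hardened XML library.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Run = Tuple[str, Optional[str]]  # (text, style attribute in effect)


def rich_text_runs(rv: str) -> List[Run]:
    """Return text runs in document order with the style that applies to each."""
    runs: List[Run] = []

    def visit(node: ET.Element, style: Optional[str]) -> None:
        style = node.get("style", style)        # own style, else inherited
        if node.text:
            runs.append((node.text, style))
        for child in node:
            visit(child, style)
            if child.tail:                       # text after the child's end tag
                runs.append((child.tail, style))

    visit(ET.fromstring(rv), None)
    return runs
```

Joining the first element of each run yields the plain text, while the second carries the formatting annotation for the extraction metadata.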

---

## Calculated Fields and the computed_empty State

Fields with a `/AA/C` (calculate) action have their displayed value determined by JavaScript at runtime. The static `/V` in the field dictionary reflects the last value that was written to disk — which may be the result of a previous rendering session, or may be entirely absent if the document was generated programmatically and never opened in an interactive viewer.
pdftract's extraction policy for calculated fields:
- If `/V` is present and non-empty, extract it and annotate the field as `calculated` in the output metadata.
- If `/V` is absent or is an empty string, annotate the field as `computed_empty` and emit no value. This distinction is important for downstream consumers: a `computed_empty` field is not a blank field the user left unfilled — it is a field whose value requires JavaScript execution to produce.
pdftract must never attempt to evaluate the JavaScript in `/AA/C`. The calculation logic may depend on other field values, external data sources, or viewer-specific globals that are unavailable at extraction time.
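
The policy above can be sketched as a pure function over a field dictionary. `extract_calculated` is a hypothetical name, and the `"static"` label for fields without a calculate action is an assumption added so the function has a result for every field; the script itself is never read or evaluated.

```python
from typing import Optional


def extract_calculated(field: dict) -> dict:
    """Apply the calculated-field policy without touching the script."""
    has_calc = "C" in field.get("AA", {})      # presence of /AA/C only
    value: Optional[str] = field.get("V")
    if not has_calc:
        return {"value": value, "state": "static"}
    if value is None or value == "":
        return {"value": None, "state": "computed_empty"}
    return {"value": value, "state": "calculated"}
```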

---

## XFA: Static vs. Dynamic Forms

XFA (XML Forms Architecture) is an alternative form model that represents the form in an embedded XML stream. The `/XFA` key in the AcroForm dictionary points to this stream (or an array of name/stream pairs). XFA exists in two operational modes that have fundamentally different implications for extraction.
|
||||
|
||||
**Static XFA** renders the form layout to conventional PDF page content. The page content streams contain actual glyph sequences and positioned text, exactly as in a non-XFA document. pdftract extracts static XFA documents through the normal content stream pipeline, with no special handling required. The XFA XML is present but the page content is self-contained.
|
||||
|
||||
**Dynamic XFA** does not render to PDF page content at all. The page streams may be entirely empty or contain only placeholder content. The actual form fields, their layout, and their current values exist solely within the XFA XML. A document using dynamic XFA is essentially a PDF container around an XFA application. If pdftract attempts normal content-stream extraction on a dynamic XFA document, the result will be empty or misleading.
|
||||
|
||||
Detection: pdftract checks whether the document has an XFA stream. If it does, it examines whether the PDF page content streams contain substantive operator sequences. If the pages are empty and XFA is present, the document is classified as dynamic XFA. pdftract then parses the XFA XML directly, locating field nodes and their values within the XFA namespace. The XFA field value elements (typically `<field>` nodes with child `<value>` elements) are extracted and mapped to their XFA name paths, producing a structured field output analogous to AcroForm extraction.
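The detection step might look like the sketch below. It is deliberately naive and rests on stated assumptions: page content streams are already decoded to text, "substantive" is approximated as "contains a `Tj` or `TJ` text-showing operator", and a placeholder page that itself draws text would be misread as static XFA. `FormModel` and `classify_form` are hypothetical names; a real implementation would tokenize the content stream properly rather than splitting on whitespace.

```rust
/// Illustrative classification of a document's form model.
#[derive(Debug, PartialEq)]
pub enum FormModel {
    NoXfa,
    StaticXfa,
    DynamicXfa,
}

/// Naive check: does the decoded content stream show any text?
/// `ends_with` also catches operators glued to an array, as in `[(A)(B)]TJ`.
fn page_has_text_ops(content: &str) -> bool {
    content
        .split_whitespace()
        .any(|tok| tok.ends_with("Tj") || tok.ends_with("TJ"))
}

/// XFA present and no page draws text: treat the document as dynamic XFA.
pub fn classify_form(has_xfa: bool, decoded_page_streams: &[&str]) -> FormModel {
    if !has_xfa {
        return FormModel::NoXfa;
    }
    if decoded_page_streams.iter().any(|s| page_has_text_ops(s)) {
        FormModel::StaticXfa
    } else {
        FormModel::DynamicXfa
    }
}
```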

---

## Submit and Reset Actions

Many interactive forms include submit actions (`/S` with value `/SubmitForm`, or `/JavaScript`) and reset actions (`/S` with value `/ResetForm`) attached to pushbutton fields or to `/AA` entries. These actions do not affect the extracted field values: they describe what happens when the form is submitted or reset, not what values the fields contain.

pdftract detects submit actions and annotates the document-level metadata to indicate that the form was designed for electronic submission. This is useful for downstream classification: a document flagged as submission-oriented may be a fillable form that was never completed, which informs how empty fields should be interpreted. No attempt is made to follow submit action targets or reconstruct submission payloads.

---
## Extraction Policy Summary

The following policy governs pdftract's handling of interactive PDF content:

- **JavaScript** is never executed. Documents containing JavaScript in `/JavaScript` names, `/OpenAction`, or field `/AA` entries are flagged with `contains_javascript: true` in extraction metadata.
- **Static field values** (`/V`) are always extracted for all field types that carry meaningful values: text, choice, radio, and checkbox.
- **Calculated fields** with a non-empty `/V` are extracted and annotated as `calculated`. Calculated fields with empty or absent `/V` are annotated as `computed_empty` with no value emitted.
- **Rich text fields** with `/RV` use the XML payload as the authoritative text source.
- **Choice fields** resolve export values against `/Opt` to provide display labels.
- **Pushbutton fields** produce no value output; they are recorded as action widgets.
- **Signature fields** are noted as signed or unsigned; cryptographic content is not extracted.
- **Static XFA** is handled through normal page content extraction.
- **Dynamic XFA** triggers XFA XML parsing; field values are extracted from XFA element nodes.
- **JavaScript source text** is never emitted as extracted content, regardless of where it appears in the document structure.

This policy ensures that pdftract produces accurate, reproducible extraction results from interactive PDFs: static values are faithfully reported, absent computed values are honestly flagged, and the extraction pipeline remains deterministic regardless of what JavaScript the document author intended to run.

197
docs/research/pdf-portfolio-and-attachments.md
Normal file
@ -0,0 +1,197 @
# PDF Portfolios, Collections, and Embedded File Extraction

**Project:** pdftract, a Rust PDF text extraction library
**Scope:** Portfolio detection, component enumeration, recursive extraction, ZUGFeRD/Factur-X, PDF/A-3 constraints, associated files, and output schema

---
## 1. Portfolio Detection and the `/Collection` Dictionary

A PDF Portfolio is a container document whose catalog carries a `/Collection` dictionary. This key is the definitive distinguisher between a plain PDF with attachments and a Portfolio: the presence of `Catalog → /Collection` signals that the embedded files are first-class component documents organized into a navigable collection, not supplementary attachments to a single document.

The `/Collection` dictionary contains several keys that describe the Portfolio's structure and presentation. `/Schema` defines the metadata columns displayed in the portfolio navigator UI. `/D` names the default component to open on launch: either a string key into the `EmbeddedFiles` name tree or the string `"__COVER_SHEET__"` indicating the cover page. `/View` specifies the preferred initial layout (`D` for details list, `T` for tile, `H` for hidden). `/Navigator` holds an indirect reference to a Filespec wrapping a separate Navigator PDF that provides the shell UI. `/Sort` carries the default sort column and order.

The cover page (also called the navigator page) is a fully rendered PDF page that viewers display when no component is active. It is rendered from the Portfolio PDF's own page tree, not from any embedded file. For text extraction purposes, this page must be processed identically to any other PDF page: parse the content streams, resolve fonts, and extract glyph sequences. Its text contributes to the top-level document output, distinct from the extracted text of component files.

A PDF that lacks `/Collection` but contains an `EmbeddedFiles` name tree is a regular PDF with attachments. The extraction logic is similar, but the semantic framing differs: without `/Collection`, embedded files are supplementary to the parent document; within a Portfolio, they are the primary content.

---
## 2. Component File Enumeration via the `EmbeddedFiles` Name Tree

Regardless of whether `/Collection` is present, all document-level attachments are registered in the `EmbeddedFiles` name tree, reached via `Catalog → /Names → /EmbeddedFiles`. This is a PDF name tree: a B-tree-like structure, not guaranteed to be balanced, whose leaf nodes contain `(key, value)` pairs mapping string keys to indirect references to Filespec dictionaries.

Walking the tree requires handling two node types. An intermediate node carries `/Limits` (a two-element array with the first and last key in the subtree) and `/Kids` (an array of indirect references to child nodes). A leaf node carries `/Names` (a flat array alternating string keys and indirect references) together with its own `/Limits`. The root node has no `/Limits` and may carry either `/Kids` or `/Names`. The traversal is depth-first; collect all key/value pairs from every leaf.
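The depth-first walk can be sketched as below, assuming the tree has already been resolved into an in-memory form. `NameTreeNode` and `collect_names` are hypothetical names; values are reduced to the object number of the referenced Filespec, and the `depth_left` budget is an added safeguard (not something the PDF format prescribes) against pathologically deep or self-referencing trees in malformed files.

```rust
/// Hypothetical resolved name tree node.
pub enum NameTreeNode {
    /// Node with /Kids.
    Intermediate { kids: Vec<NameTreeNode> },
    /// Node with /Names: (key, Filespec object number) pairs.
    Leaf { names: Vec<(String, u32)> },
}

/// Depth-first traversal that appends every leaf pair to `out` in tree order.
/// Subtrees deeper than `depth_left` levels are silently skipped.
pub fn collect_names(node: &NameTreeNode, depth_left: usize, out: &mut Vec<(String, u32)>) {
    match node {
        NameTreeNode::Leaf { names } => out.extend(names.iter().cloned()),
        NameTreeNode::Intermediate { kids } => {
            if depth_left == 0 {
                return;
            }
            for kid in kids {
                collect_names(kid, depth_left - 1, out);
            }
        }
    }
}
```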

Each value resolved from the tree is a Filespec dictionary. The fields relevant to enumeration are:

- `/F`: filename in PDFDocEncoding (legacy; present in nearly all files, but not guaranteed)
- `/UF`: Unicode filename as a PDF text string, typically UTF-16BE with a byte-order mark (preferred when present; use over `/F`)
- `/Desc`: human-readable description string
- `/Type` (value `/Filespec`): confirms the object type
- `/CI`: collection item dictionary carrying per-column metadata values for Portfolio display
- `/EF`: the embedded file stream sub-dictionary

The `/CI` dictionary maps column field names (as defined in `/Collection/Schema`) to their values for this component. For example, a Portfolio with a "Size" column and a "Date Modified" column will have corresponding entries in each component's `/CI`. These values are structured metadata that pdftract should capture as part of the attachment record, since they carry author-supplied organizational context.

MIME type is not stored in the Filespec but in the EmbeddedFile stream dictionary itself, described in §3 below. Date fields (creation and modification) appear in the EmbeddedFile stream's `/Params` sub-dictionary.

---
## 3. Component File Access via the `/EF` Stream Dictionary

The `/EF` (embedded file) key within a Filespec maps platform filename variants to indirect references to EmbeddedFile stream objects. Modern PDFs use `/F` and `/UF` pointing to the same stream object; the legacy platform-specific keys (`/DOS`, `/Mac`, `/Unix`) should be handled for compatibility but are rarely present in contemporary portfolios.

The EmbeddedFile stream dictionary carries:

- `/Subtype`: the MIME type, stored as a PDF name object in which the `/` separator is escaped as `#2F` (e.g., `/application#2Fpdf` decodes to `application/pdf`; other common decoded values are `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`, `application/xml`, and `text/csv`). This field is optional but common; absent values require MIME detection from content.
- `/Filter`: the filter or array of filters that must be applied to decode the stream body. `FlateDecode` is nearly universal; multi-stage chains like `[/ASCII85Decode /FlateDecode]` appear in older files.
- `/Length`: the encoded (compressed) byte count of the stream data within the file.
- `/Params`: a sub-dictionary carrying `/Size` (decompressed byte count, usable as a sanity check), `/CreationDate`, `/ModDate` (PDF date strings in the format `D:YYYYMMDDHHmmSSOHH'mm'`), and `/CheckSum` (16-byte MD5 digest of the uncompressed content).

To extract raw bytes: locate the stream object, apply the `/Filter` chain in sequence (each filter in array order operates on the output of the preceding one), and the resulting byte sequence is the uncompressed file payload. The decompressed length should equal `/Params/Size`; a mismatch indicates corruption or a misapplied filter chain.
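The following sketch illustrates applying a filter chain in array order and checking the result against `/Params/Size`. To stay within the standard library it implements only `ASCIIHexDecode` and `RunLengthDecode`; `FlateDecode`, which real attachments almost always use, would need a decompression crate and is omitted here. The `Filter` enum and `decode_stream` are illustrative names rather than an existing API.

```rust
/// The two filters this sketch supports.
#[derive(Clone, Copy)]
pub enum Filter {
    AsciiHex,
    RunLength,
}

fn ascii_hex_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut digits = Vec::new();
    for &b in input {
        if b == b'>' {
            break; // end-of-data marker
        }
        if b.is_ascii_whitespace() {
            continue;
        }
        let d = (b as char)
            .to_digit(16)
            .ok_or_else(|| format!("invalid hex byte 0x{b:02x}"))?;
        digits.push(d as u8);
    }
    if digits.len() % 2 == 1 {
        digits.push(0); // an odd final digit is padded with zero
    }
    Ok(digits.chunks(2).map(|p| (p[0] << 4) | p[1]).collect())
}

fn run_length_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let len = input[i] as usize;
        i += 1;
        match len {
            128 => break, // end-of-data marker
            0..=127 => {
                // Copy the next len + 1 bytes literally.
                let end = i + len + 1;
                out.extend_from_slice(input.get(i..end).ok_or("truncated literal run")?);
                i = end;
            }
            _ => {
                // Repeat the next byte 257 - len times.
                let b = *input.get(i).ok_or("truncated repeat run")?;
                out.extend(std::iter::repeat(b).take(257 - len));
                i += 1;
            }
        }
    }
    Ok(out)
}

/// Applies `filters` in array order, then compares against /Params/Size if known.
pub fn decode_stream(
    raw: &[u8],
    filters: &[Filter],
    expected_size: Option<usize>,
) -> Result<Vec<u8>, String> {
    let mut data = raw.to_vec();
    for filter in filters {
        data = match filter {
            Filter::AsciiHex => ascii_hex_decode(&data)?,
            Filter::RunLength => run_length_decode(&data)?,
        };
    }
    match expected_size {
        Some(n) if n != data.len() => {
            Err(format!("size mismatch: /Params/Size {n}, decoded {}", data.len()))
        }
        _ => Ok(data),
    }
}
```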

File types typically embedded in Portfolios include PDF documents (nested portfolios or standalone reports), Office Open XML formats (Word `.docx`, Excel `.xlsx`, PowerPoint `.pptx`), legacy Office formats (`.doc`, `.xls`), XML data files, CSV spreadsheets, and plain text. All non-PDF types should be surfaced in the output with their bytes available for caller retrieval; PDF types trigger recursive processing (§5).

---
## 4. Portfolio Schema: Extractable Structured Metadata

The `/Collection/Schema` dictionary defines the columns that the portfolio viewer displays. Each entry maps a field name (a PDF name object) to a collection field dictionary with these keys:

- `/N`: the display label, a text string (e.g., `"File Name"`, `"Description"`, `"Date Created"`)
- `/Subtype`: the field type: `/S` (string), `/D` (date), `/N` (number), or `/F` (filename), plus file-related values such as `/Desc`, `/ModDate`, `/CreationDate`, and `/Size` whose data comes from the Filespec and EmbeddedFile parameters rather than from `/CI`
- `/O`: display order (integer; lower values appear first in the UI column list)
- `/V`: visibility flag (boolean; `false` means the field exists but is hidden in the default view)
- `/E`: edit flag (boolean; whether the viewer should allow the user to edit the field value)

This schema is machine-readable structured metadata that pdftract can surface as part of the portfolio-level output. A caller processing a portfolio of financial reports can use the schema to understand what metadata columns exist, then read each component's `/CI` dictionary values against those column definitions to construct a structured table of all component metadata without opening any embedded files.

The schema extraction path is: `Catalog → /Collection → /Schema → iterate each key/value pair → record field name, label, type, order, and visibility`.
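Once the schema and each component's collection item values have been read, assembling the metadata table is a simple join. The sketch below assumes both are already available as plain Rust values; `SchemaField` and `build_table` are hypothetical names, and all values are flattened to strings for brevity. Columns are ordered by the schema's display order, and a component that lacks a value for a column yields an empty cell.

```rust
use std::collections::BTreeMap;

/// Hypothetical, flattened view of one schema entry.
pub struct SchemaField {
    /// Field name (the key in the schema dictionary).
    pub key: String,
    /// Display label shown in the viewer.
    pub label: String,
    /// Display order; lower values come first.
    pub order: i64,
}

/// Returns a header row followed by one row per component.
pub fn build_table(
    schema: &[SchemaField],
    items: &[BTreeMap<String, String>],
) -> Vec<Vec<String>> {
    let mut cols: Vec<&SchemaField> = schema.iter().collect();
    cols.sort_by_key(|f| f.order);

    let header: Vec<String> = cols.iter().map(|f| f.label.clone()).collect();
    let mut rows = vec![header];
    for item in items {
        let row = cols
            .iter()
            .map(|f| item.get(&f.key).cloned().unwrap_or_default())
            .collect();
        rows.push(row);
    }
    rows
}
```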

---

## 5. Recursive Portfolio Extraction: Depth Limiting and Cycle Detection

When an embedded component's MIME type is `application/pdf` or its first four bytes are `%PDF`, the component is itself a PDF and must be parsed as a standalone document. This recursion is essential for portfolios that bundle other portfolios as components, a pattern found in document packages where a top-level portfolio indexes sub-portfolios grouped by topic or date.

pdftract must enforce a configurable recursion depth limit, with a default of three levels. At the limit, the component is recorded in the attachment list with `extraction_status: "skipped"` and a `recursion_limit_reached` flag, but its bytes are not parsed. This prevents stack exhaustion and memory overconsumption from adversarially nested PDFs.

Cycle detection requires tracking the MD5 or SHA-256 digest of each PDF payload encountered during a single extraction job. Before recursing into a component, compute the digest of its decompressed bytes and check against the seen-digests set for the current traversal. If the digest is already present, record the component as `extraction_status: "skipped"` with a `cycle_detected` flag. The digest set must be passed down through recursive calls, not maintained as global state, so that independent top-level extraction jobs do not share state.

Each recursively parsed PDF is a fully independent document: it has its own cross-reference table, object numbering, and name trees. Do not share any object cache or font cache across recursion levels.
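The depth limit and digest check can be sketched together. Several simplifications are assumptions of this example: `Component` is a hypothetical pre-parsed stand-in (a real implementation would discover children by parsing the bytes), the top-level document is depth 0, and `DefaultHasher` from the standard library replaces MD5 or SHA-256 purely to keep the example dependency-free. It is a 64-bit non-cryptographic hash and should not be used for this purpose in production. The seen set is passed by mutable reference rather than stored globally, as the text requires.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
pub enum Status {
    Extracted,
    SkippedRecursionLimit,
    SkippedCycle,
    NotPdf,
}

/// Hypothetical stand-in for an embedded file and the files embedded in it.
pub struct Component {
    pub bytes: Vec<u8>,
    pub children: Vec<Component>,
}

/// Stand-in digest; swap in SHA-256 from a crypto crate for real use.
fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Records (depth, status) for `c` and, when it is extracted, for its children.
pub fn walk(
    c: &Component,
    depth: usize,
    max_depth: usize,
    seen: &mut HashSet<u64>,
    out: &mut Vec<(usize, Status)>,
) {
    if !c.bytes.starts_with(b"%PDF") {
        out.push((depth, Status::NotPdf));
    } else if depth > max_depth {
        // Past the limit: record the component but do not parse it.
        out.push((depth, Status::SkippedRecursionLimit));
    } else if !seen.insert(digest(&c.bytes)) {
        // Same payload already visited during this traversal.
        out.push((depth, Status::SkippedCycle));
    } else {
        out.push((depth, Status::Extracted));
        for child in &c.children {
            walk(child, depth + 1, max_depth, seen, out);
        }
    }
}
```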

---

## 6. ZUGFeRD and Factur-X Invoice PDFs

ZUGFeRD (Germany) and Factur-X (France/EU) are electronic invoicing profiles built on PDF/A-3 (ISO 19005-3). The document is simultaneously a human-readable PDF invoice and a machine-readable structured data package. The XML payload embedded within conforms to EN 16931 (the European e-invoicing standard) using the UN/CEFACT Cross Industry Invoice (CII) data model.

Detection requires checking multiple indicators in combination:

1. `Catalog → /AF` array is present (mandatory in PDF/A-3).
2. The `EmbeddedFiles` name tree contains a Filespec whose `/UF` or `/F` value matches `factur-x.xml` (Factur-X), `ZUGFeRD-invoice.xml` (ZUGFeRD 1.0), or `zugferd-invoice.xml` (ZUGFeRD 2.0). From version 2.1 onward, ZUGFeRD aligns with Factur-X and uses `factur-x.xml`.
3. The matching Filespec has `AFRelationship /Data`.
4. The EmbeddedFile stream's `/Subtype` is `application/xml` or `text/xml`.
5. The XMP metadata stream on the catalog contains `pdfaid:part = 3` confirming PDF/A-3 conformance.
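The five indicators can be combined into a single predicate, as in this sketch. It follows the criteria exactly as listed above and assumes the relevant facts have already been gathered into plain structs; `Attachment`, `DocInfo`, and `find_invoice_xml` are hypothetical names. Filenames are compared case-insensitively so that both ZUGFeRD spellings match, which is a choice of this example rather than something the profiles mandate.

```rust
/// Hypothetical summary of one embedded file.
pub struct Attachment {
    pub filename: String,
    pub af_relationship: Option<String>,
    pub mime: Option<String>,
}

/// Hypothetical summary of the document-level facts needed for detection.
pub struct DocInfo {
    pub has_catalog_af: bool,
    /// Value of pdfaid:part from the XMP metadata, if declared.
    pub pdfa_part: Option<u8>,
    pub attachments: Vec<Attachment>,
}

/// Lower-cased invoice filenames recognized by this sketch.
const INVOICE_NAMES: [&str; 2] = ["factur-x.xml", "zugferd-invoice.xml"];

/// Returns the embedded invoice XML when all five indicators hold.
pub fn find_invoice_xml(doc: &DocInfo) -> Option<&Attachment> {
    if !doc.has_catalog_af || doc.pdfa_part != Some(3) {
        return None;
    }
    doc.attachments.iter().find(|a| {
        let name = a.filename.to_ascii_lowercase();
        INVOICE_NAMES.contains(&name.as_str())
            && a.af_relationship.as_deref() == Some("Data")
            && matches!(a.mime.as_deref(), Some("application/xml" | "text/xml"))
    })
}
```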

For these documents, pdftract has two distinct extraction targets: the visual text of the PDF pages (the human-readable invoice rendition) and the raw XML bytes of the embedded file (the machine-readable invoice data). Both targets should appear in the output. The XML bytes should be exposed in the `attachments` array entry for the embedded file. Callers processing invoices in bulk will often prefer the XML path, but the page text remains valuable for validation and fallback.

---
## 7. PDF/A-3 Attachment Constraints and `AFRelationship` Prioritization

PDF/A-3 (ISO 19005-3) was the first part of the PDF/A standard to permit embedding arbitrary file formats. PDF/A-1 prohibits embedded files entirely, and PDF/A-2 permits them only when the embedded file is itself PDF/A-conformant. When a document declares PDF/A-3 conformance in its XMP metadata (`pdfaid:part = 3`), all attachments must carry an `AFRelationship` value; `Unspecified` is the fallback for attachments without a declared semantic role.

The `AFRelationship` value directly informs extraction priority:

- `Data` and `Source` indicate the attachment is structured data either generated from or used to generate this PDF. These are the highest-priority extraction targets because they carry non-redundant information unavailable from the page text.
- `Alternative` indicates a different representation of the document content, which is useful when the PDF page text is degraded or encoded with poor font mapping.
- `Supplement` indicates ancillary information that augments the document.
- `Unspecified` is the lowest priority; the attachment's value must be inferred from MIME type and filename.

pdftract should sort the `attachments` array by this priority order when presenting results, and should tag each attachment record with its `af_relationship` string for caller-side filtering.
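A possible shape for the priority sort is shown below. Attachments are reduced to `(filename, relationship)` pairs for illustration, and `priority` and `sort_by_relationship` are hypothetical names. Treating a missing or unrecognized relationship the same as `Unspecified` is an assumption of this sketch. Because `sort_by_key` is stable, attachments with equal priority keep their original enumeration order.

```rust
/// Maps an AFRelationship value to a sort rank; lower ranks sort first.
fn priority(relationship: Option<&str>) -> u8 {
    match relationship {
        Some("Data") | Some("Source") => 0,
        Some("Alternative") => 1,
        Some("Supplement") => 2,
        // Unspecified, unknown, or undeclared values all rank last.
        _ => 3,
    }
}

/// Sorts (filename, af_relationship) pairs in place by extraction priority.
pub fn sort_by_relationship(attachments: &mut [(String, Option<String>)]) {
    attachments.sort_by_key(|(_, rel)| priority(rel.as_deref()));
}
```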

---

## 8. ISO 32000-2 Associated Files on Pages, Fields, and XObjects

PDF 2.0 (ISO 32000-2) generalizes the association between files and document objects via the `/AF` (associated files) array. This array can appear on the document catalog, on individual page dictionaries, on form field objects, on XObject dictionaries, and on structure elements in tagged PDFs.

Each entry in an `/AF` array is an indirect reference to a Filespec dictionary. When `/AF` appears on a page, the associated file relates specifically to that page's content: for example, a transcript of audio described on that page, or a data table whose values are visualized in a chart on that page. When `/AF` appears on an XObject, the association is with a specific figure or image element. When `/AF` appears on a form field, it carries data submitted with or relevant to that field.

During page iteration for text extraction, pdftract must collect `/AF` entries from each page dictionary and merge them with any document-level `/AF` entries. During XObject resolution, if the XObject dictionary carries `/AF`, those Filespecs should be recorded with the containing page number and XObject name as context. Deduplication by PDF object number is required since the same Filespec can be referenced from multiple `/AF` arrays across the document.
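The merge-and-deduplicate step might be sketched as follows. Inputs are simplified to lists of Filespec object numbers, and `merge_associated_files` is a hypothetical name. In this sketch the first reference encountered wins, with document-level references taking precedence over page-level ones; that ordering is an assumption of the example, and XObject-level context is left out to keep it short.

```rust
use std::collections::HashSet;

/// Merges document-level and page-level /AF references, keeping one entry per
/// Filespec object number. Each result pairs the object number with the page
/// index of its first page-level occurrence, or `None` for document-level.
pub fn merge_associated_files(
    doc_level: &[u32],
    page_level: &[(usize, Vec<u32>)],
) -> Vec<(u32, Option<usize>)> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for &obj in doc_level {
        // `insert` returns false when the object number was already recorded.
        if seen.insert(obj) {
            merged.push((obj, None));
        }
    }
    for (page, objs) in page_level {
        for &obj in objs {
            if seen.insert(obj) {
                merged.push((obj, Some(*page)));
            }
        }
    }
    merged
}
```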

The practical impact on text extraction: a page with an associated file carrying `AFRelationship /Alternative` may contain image-only content where the associated file is the text alternative. Surfacing this relationship allows callers to fall back to the associated text when OCR is unavailable or unreliable.

---
## 9. Cover Page Text Extraction

The cover or navigator page of a PDF Portfolio is a regular PDF page drawn from the containing PDF's own page tree. It is not an embedded file. Viewers display it as the initial landing page of the portfolio; it typically contains the portfolio title, a description, and branding elements.

From pdftract's perspective, the cover page is structurally identical to any other PDF page. Its content streams must be parsed, its fonts resolved, and glyph sequences mapped to Unicode following the standard extraction pipeline. The resulting text contributes to the top-level document's page output, tagged with its page index.

The only Portfolio-specific consideration is that when `/Collection/D` equals `"__COVER_SHEET__"` or a similar sentinel, the intent is that the cover page is the default view. This is a presentation hint only and does not affect extraction. Extract all pages in the parent PDF's page tree regardless of `/Collection/D`.

---
## 10. Output Schema for Portfolios

The pdftract JSON output for a portfolio document must surface both the parent document's text and the structured attachment list. For embedded PDFs processed recursively, the nested extraction result appears inline.

```json
{
  "pages": [ { "page": 0, "text": "Portfolio cover page text..." } ],
  "portfolio": true,
  "attachments": [
    {
      "filename": "Q1-Report.pdf",
      "mime_type": "application/pdf",
      "size_bytes": 204800,
      "description": "Q1 Financial Report",
      "af_relationship": "Data",
      "extraction_status": "extracted",
      "nested_result": {
        "pages": [ { "page": 0, "text": "..." } ],
        "portfolio": false,
        "attachments": []
      }
    },
    {
      "filename": "factur-x.xml",
      "mime_type": "application/xml",
      "size_bytes": 14230,
      "description": "Factur-X structured invoice",
      "af_relationship": "Data",
      "extraction_status": "extracted",
      "nested_result": null
    },
    {
      "filename": "archive.pdf",
      "mime_type": "application/pdf",
      "size_bytes": 10485760,
      "description": null,
      "af_relationship": "Unspecified",
      "extraction_status": "skipped",
      "skip_reason": "recursion_limit_reached",
      "nested_result": null
    }
  ]
}
```
Field definitions:

| Field | Type | Notes |
|-------|------|-------|
| `portfolio` | boolean | `true` if `Catalog → /Collection` was present. |
| `filename` | string | From `/UF`; falls back to `/F`. |
| `mime_type` | string or null | From EmbeddedFile `/Subtype`; null if absent. |
| `size_bytes` | integer or null | From `EmbeddedFile/Params/Size`; null if absent. |
| `description` | string or null | From Filespec `/Desc`. |
| `af_relationship` | string or null | String value of `AFRelationship`; null if not declared. |
| `extraction_status` | string | `"extracted"`, `"skipped"`, or `"error"`. |
| `skip_reason` | string or null | Present when `extraction_status` is `"skipped"`; values: `"recursion_limit_reached"`, `"cycle_detected"`, `"size_limit_exceeded"`. |
| `nested_result` | object or null | Full extraction result for embedded PDFs when `recursive: true`; null for non-PDF attachments or skipped entries. |

The `portfolio` boolean at the top level allows callers to distinguish a portfolio response from a regular document response without inspecting the `attachments` array. When `portfolio` is `true`, callers should treat the top-level `pages` text as the cover/navigator content and the `attachments` entries as the primary documents.