From a89fef64fc1537ac55249f83f3ffeec7f3128ccc Mon Sep 17 00:00:00 2001 From: jedarden Date: Sat, 16 May 2026 16:04:00 -0400 Subject: [PATCH] Add research: article threads, resource dictionaries, catalog, hyperlinks Four new extraction research documents covering PDF article thread traversal for multi-flow magazine layouts, resource dictionary inheritance and ResourceStack semantics for nested Form XObjects, document catalog and page tree structure (UserUnit, Contents array, page inheritance), and hyperlink/named destination extraction with QuadPoints anchor text and link density classification. Co-Authored-By: Claude Sonnet 4.6 --- .../article-threads-and-reading-order.md | 77 ++++++++++++ .../document-catalog-and-structure.md | 117 ++++++++++++++++++ .../hyperlinks-and-named-destinations.md | 75 +++++++++++ .../resource-dictionary-and-inheritance.md | 86 +++++++++++++ 4 files changed, 355 insertions(+) create mode 100644 docs/research/article-threads-and-reading-order.md create mode 100644 docs/research/document-catalog-and-structure.md create mode 100644 docs/research/hyperlinks-and-named-destinations.md create mode 100644 docs/research/resource-dictionary-and-inheritance.md diff --git a/docs/research/article-threads-and-reading-order.md b/docs/research/article-threads-and-reading-order.md new file mode 100644 index 0000000..ef701c9 --- /dev/null +++ b/docs/research/article-threads-and-reading-order.md @@ -0,0 +1,77 @@ +# Article Threads, Reading Order Override, and Multi-Page Flow + +## Overview + +PDF article threads are one of the oldest and least-implemented features of the format, introduced in PDF 1.1 as a mechanism to guide readers through content that flows across non-contiguous page regions. 
For tools focused on extracting text in logical reading order, article threads represent both an underused signal and a practical necessity when processing magazine-style, newsletter, or column-heavy documents where page-by-page extraction produces fragmented, incoherent output. + +## The /Threads Structure in the Document Catalog + +The document catalog — the root object of every PDF — may contain a `/Threads` array. Each entry in this array is a Thread dictionary, representing a single logical article or content flow. The Thread dictionary has two keys: `/F`, which points to the first Bead dictionary in the thread's linked list, and an optional `/I` (info) dictionary that carries metadata such as the article's title and an identifier. + +Bead dictionaries are the atomic units of a thread. Each Bead has five relevant keys: + +- `/T` — a reference back to the enclosing Thread dictionary (used for traversal integrity checks) +- `/N` — the next Bead in sequence; this is a direct object reference, not an index +- `/V` — the previous Bead (doubly linked for bidirectional traversal) +- `/P` — a reference to the Page object on which this bead's content appears +- `/R` — a rectangle in page default user space, specifying exactly which region of that page contains this bead's content + +To traverse a thread, pdftract starts at `/F` and follows the `/N` chain until it circles back to the first bead (the list is circular) or until `/N` is null, depending on the authoring tool. The termination condition must be implemented defensively: track visited bead object numbers to avoid infinite loops in malformed files, and treat a null `/N` or a bead already seen as the end of the sequence. + +## Why Threads Matter: Magazine Flow Across Non-Consecutive Columns + +In a typical magazine layout, a single article might begin in columns one and two on page three, jump to a sidebar region on page seven, and conclude on page nine's inner column. 
A page-by-page extractor — even one that correctly orders text by reading direction within a single page — will interleave that article with every other article that appears on pages three, seven, and nine. The result is unusable for downstream processing, summarization, or indexing. + +Article threads encode the editorial intent directly in the file. When a publisher's layout application writes threads, it is asserting that the text within bead rectangle R1 on page P1, followed by the text within bead rectangle R2 on page P2, is a single coherent unit. This is information that cannot be recovered from glyph positions alone without a full layout analysis engine. pdftract must extract and honor this signal when it is present. + +## Extracting Text from Bead Rectangles + +For each bead, the extraction process is as follows. First, resolve the `/P` reference to obtain the Page object. Then, apply the page's coordinate transform — accounting for `/MediaBox`, `/CropBox`, any `/Rotate` value, and the CTM established by the page's content stream — to bring the bead's `/R` rectangle into the same space used during text operator processing. The `/R` value is specified in page default user space before rotation, so the same coordinate normalization logic used for text glyph positions must be applied. A bead rectangle specified as `[x1 y1 x2 y2]` in user space must be tested against the bounding box of each text span on the page after both the text matrix and the current transformation matrix have been applied to the glyph's origin. + +For each text span or glyph cluster on the page, test whether the glyph's position falls within the transformed bead rectangle. Collect all qualifying spans, sort them by the normal reading order for that page (top-to-bottom, then left-to-right for LTR scripts), and concatenate them into the bead's text contribution. Repeat for every bead in thread order. 
The concatenation of bead texts, in `/N`-chain sequence, produces the reconstructed article. + +Word boundaries that straddle the bead rectangle edge require care. If a glyph cluster partially overlaps the rectangle, pdftract should include it if the glyph's reference point (typically the lower-left of the advance rectangle) falls within the bounds. Half-pixel tolerance may be appropriate to handle floating-point imprecision in coordinate storage. + +## Multiple Threads per Document and Overlapping Regions + +A single PDF document may contain many independent threads, each representing a separate article. pdftract must iterate the entire `/Threads` array and process each thread independently. The output structure should maintain strict separation between thread content and page-body content. + +A single page region may be covered by beads from more than one thread — a common pattern for pull quotes, sidebars, and boxed callouts that are simultaneously part of the surrounding article flow and their own thematic thread. When a glyph falls within rectangles belonging to multiple threads, pdftract should assign it to all matching threads. The downstream consumer — not the extractor — is better positioned to decide how to handle the overlap, whether to deduplicate or to preserve the multiple associations. + +Text on a page that does not fall within any bead rectangle is classified as page body content and is extracted under the normal page extraction pathway. This partitioning allows pdftract to produce both a threads-based view and a page-based view from the same document without discarding content that the thread definitions do not cover. + +## SpiderInfo and Thread Metadata + +The optional `/I` dictionary attached to a Thread — sometimes called the SpiderInfo dictionary — may carry a `/Title` string and an `/ID` string. These were originally intended to support web spider crawlers during the era when PDF files were indexed by content-aware search agents. 
Today, they provide a practical mechanism for extracting article metadata: a `/Title` of "Feature: The Architecture of Memory" attached to a thread lets pdftract label the extracted content without heuristic guessing. + +When `/I` is present and `/Title` exists, pdftract should use it as the `title` field in the thread output object. When absent, the field should be null rather than a synthesized value. The `/ID` field, when present, is a byte string and may be used as a stable identifier if the document is re-extracted; it can populate the `thread_id` field in the output schema. + +## Output Schema for Thread Content + +pdftract's JSON output should include a top-level `threads` array, present when the document contains at least one thread. Each entry in the array is an object with the following fields: + +- `thread_id` — the `/ID` value from `/I` if present, otherwise the zero-based index of the thread in the `/Threads` array +- `title` — the `/Title` value from `/I` if present, otherwise null +- `bead_text` — an array of strings, one per bead in traversal order, each containing the extracted text for that bead's rectangle + +This schema makes the bead structure visible to consumers who need to understand the original flow segmentation, while also allowing simple concatenation of `bead_text` entries for consumers that only want the article as a flat string. Page and line metadata for each bead can be added as parallel arrays without breaking existing consumers. + +## Priority: Tagged PDF vs. Article Threads + +When a document carries both a tag tree (a Structure Tree Root with semantic structure tags) and article threads, the structure tree is the authoritative source for reading order. Tagged PDF was designed precisely to encode logical document structure in a machine-readable form, and modern accessibility-compliant exports rely on it. Article threads are a parallel mechanism that predates tagged PDF by several versions and carries less semantic granularity. + +pdftract's extraction pipeline should therefore apply the following priority: if a document is tagged, use the structure tree as the primary reading order source and treat article threads as supplementary metadata. If a document is untagged but has article threads, elevate the threads to the primary ordering mechanism for multi-flow content. If a document has neither, fall back to geometric heuristics for column detection and reading order reconstruction. + +This fallback chain should be detectable and exposed in pdftract's output as an `extraction_strategy` field at the document level, allowing consumers to understand which mechanism was used and calibrate their confidence accordingly. + +## Coordinate Transforms and Precision + +The `/R` rectangle in a Bead dictionary is expressed in page default user space. This is the coordinate system that exists before the page's `/Rotate` entry is applied. When a page has a 90-degree or 270-degree rotation — common in landscape-oriented magazine pages — the bead rectangles must be rotated by the same angle before being compared against glyph positions, which are computed in the post-rotation display space. Failing to apply this transform will cause bead regions to miss all the text they should capture. + +The page's `/MediaBox` origin must also be subtracted before comparison if it does not start at `[0 0]`. Some authoring tools produce non-zero media box origins, and bead rectangles are specified relative to the media box origin. 
The same normalization applied to glyph coordinates in the main extraction path must be applied identically to bead rectangle coordinates to ensure consistent intersection testing. + +## Graceful Fallback for Legacy and Modern Documents + +Article threads are a PDF 1.1 feature with a long history, but modern layout applications such as InDesign have largely abandoned them in favor of tagged PDF for accessibility compliance. A document produced by a current version of InDesign will typically contain a rich structure tree and no `/Threads` array at all. pdftract must handle the absent-threads case without error: if the document catalog does not contain a `/Threads` key, or if the array is empty, the `threads` field in the output is either omitted or set to an empty array, and extraction proceeds through the normal pathways. + +For documents that do contain threads — primarily older magazine PDFs, documents from legacy publishing workflows, and some scanned-and-OCR'd publications where threads were added post-hoc by a PDF processing tool — pdftract's thread extraction provides the only reliable way to recover the intended reading order without reimplementing a full layout analysis engine. Detecting presence, traversing the linked list safely, applying correct coordinate transforms, and partitioning page content by bead coverage are the four implementation requirements that make thread-aware extraction work correctly for this class of document. diff --git a/docs/research/document-catalog-and-structure.md b/docs/research/document-catalog-and-structure.md new file mode 100644 index 0000000..3a31179 --- /dev/null +++ b/docs/research/document-catalog-and-structure.md @@ -0,0 +1,117 @@ +# PDF Document Catalog, Page Tree, and Document-Level Structure + +## Overview + +Before pdftract can extract a single glyph, it must correctly parse the document catalog and traverse the page tree. 
These structures are the authoritative source of truth for every resource, every page coordinate system, and every inherited property that governs text extraction. A failure at this stage propagates silently into incorrect bounding boxes, missing pages, or misinterpreted coordinate spaces. This document describes what the PDF specification defines at the catalog and page-tree level, and what pdftract must correctly handle in each area. + +--- + +## The Document Catalog + +The document catalog is the root object of the PDF document's logical structure. The cross-reference table's trailer dictionary contains a `/Root` key whose value is an indirect reference to the catalog object. The catalog is always of type `/Catalog`. + +The only strictly required keys in the catalog are `/Type` (which must equal `/Catalog`) and `/Pages` (an indirect reference to the root of the page tree). Every other key is optional, but several are extraction-relevant. + +**`/Outlines`** is a reference to the document outline (bookmarks). pdftract uses this to reconstruct hierarchical section structure when named destinations within the outline correspond to page ranges, which assists in logical document segmentation. + +**`/PageLabels`** is a number tree mapping page indices to label dictionaries. Each label can specify a numbering style (Arabic, Roman, alphabetic), a label prefix, and a starting value. pdftract records these to produce human-readable page identifiers in its output that match the labels visible in a viewer, rather than raw zero-based indices. + +**`/Names`** is a dictionary of named trees covering multiple categories, detailed separately below. It is one of the most complex optional structures in the catalog. + +**`/Dests`** is a legacy dictionary of named destinations that maps name strings directly to destination arrays. This was superseded by `/Names /Dests` in PDF 1.2, but pdftract must handle both forms since many documents were produced by older generators. 
Named destinations that target specific pages are informational for extraction and are recorded as metadata. + +**`/AcroForm`** is a reference to the interactive form dictionary. For text extraction purposes, pdftract processes form field values, particularly text field appearances, as potential content streams that supplement the page's `/Contents`. + +**`/Metadata`** is a reference to a metadata stream in XMP format. pdftract reads this stream to populate its document-level metadata output (title, author, creation date, language) before any page content is processed. + +**`/StructTreeRoot`** is a reference to the document's structure tree. pdftract only attempts to use this when `/MarkInfo` indicates the document is properly tagged, as detailed below. + +**`/Lang`** specifies the natural language of the document as a BCP 47 language tag. pdftract propagates this as the default language for text spans that do not override it at the structure element or marked-content level. + +**`/OutputIntents`** is an array of output intent dictionaries used in PDF/X and PDF/A documents. While primarily a color management concern, pdftract records the conformance level (e.g., PDF/A-2b) in its output metadata because it constrains what content structures are legal in the file. + +--- + +## The /MarkInfo Dictionary + +The `/MarkInfo` dictionary in the catalog signals whether the document has been authored as a tagged PDF. Its three keys are `/Marked`, `/UserProperties`, and `/Suspects`. + +**`/Marked`** is the critical flag. When `true`, the document author asserts that all content is associated with marked-content sequences and that a complete structure tree exists in `/StructTreeRoot`. pdftract uses this flag as the gate for structure-tree extraction: if `/Marked` is `false` or absent, pdftract falls back to heuristic reading-order analysis rather than traversing the structure tree, because an incomplete structure tree is worse than no tree at all. 
+ +**`/Suspects`**, when `true`, means the document author acknowledges that the structure tree may be incorrect or incomplete, even though `/Marked` is `true`. pdftract treats a document with both `/Marked true` and `/Suspects true` as untrustworthy for structure-guided extraction, applying the same heuristic fallback used for untagged documents. + +**`/UserProperties`**, when `true`, indicates that structure elements may carry user-defined property lists in their `/A` attribute dictionaries. pdftract records these as opaque metadata on the corresponding output spans. + +--- + +## The Page Tree + +The `/Pages` entry in the catalog references the root of a balanced tree of page objects. This tree has two kinds of nodes. Intermediate nodes have `/Type /Pages` and contain a `/Kids` array of references to child nodes (which may themselves be intermediate or leaf nodes) and a `/Count` integer giving the total number of leaf page nodes in the subtree. Leaf nodes have `/Type /Page` and contain the actual page attributes. + +Traversal is a straightforward depth-first walk: start at the root node, and for each node encountered, if its `/Type` is `/Pages`, recurse into each entry in `/Kids` in order. If its `/Type` is `/Page`, emit it as the next page in the enumeration sequence. The `/Count` value on intermediate nodes is used by pdftract to pre-allocate output structures and to validate that the traversal found the expected number of leaves, but is never used to skip subtrees. pdftract always performs a full traversal; it does not trust `/Count` to be accurate in malformed documents. + +--- + +## Page Dictionary Keys + +Each leaf page node carries a set of keys governing its geometry and content. + +**`/MediaBox`** is the only required geometry key on a page (or an ancestor node via inheritance). It defines the full physical extent of the page in default user space units. All content coordinates are interpreted relative to this space. 
+ +**`/CropBox`**, **`/BleedBox`**, **`/TrimBox`**, and **`/ArtBox`** define progressively tighter regions within the media box. For text extraction, pdftract clips output text positions to the `/CropBox` if present, discarding glyphs whose positions fall outside it, since those glyphs are not visible in normal viewing. + +**`/Rotate`** is an integer multiple of 90, specifying a clockwise rotation applied to the page before display. pdftract applies this rotation when computing the final bounding boxes of extracted text runs so that reported coordinates are in the oriented (viewer-facing) coordinate space. + +**`/Resources`** is a dictionary of resource dictionaries (fonts, XObjects, color spaces, and so on) available to the page's content streams. This is the entry point for font resolution during text extraction. + +**`/Contents`** is either a single indirect reference to a content stream or an array of such references. See the dedicated section below. + +**`/Annots`** is an array of annotation dictionaries. For extraction, pdftract processes widget annotations (form fields) and link annotations (to record destination metadata), but ignores display-only annotation types. + +**`/StructParents`** is an integer key that pdftract uses when walking the structure tree in reverse: it identifies which entry in the parent tree corresponds to this page, allowing structure elements to be matched to their page. + +--- + +## /UserUnit + +Introduced in PDF 1.6, `/UserUnit` is a positive real number on a page dictionary that scales the default user space. The default user space unit is 1/72 of an inch. A `/UserUnit` value of `2.0` means each user space unit represents 2/72 of an inch. When `/UserUnit` is absent, its value is exactly `1.0`. 
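As a rough illustration of the scale described above, the sketch below converts user space values to points and inches. It assumes the page dictionary has already been parsed into a plain Python `dict` keyed by PDF names; that shape and the helper names (`page_user_unit`, `scale_box`, `to_inches`) are hypothetical and are not pdftract's actual API.

```python
POINTS_PER_INCH = 72.0  # the default user space unit is 1/72 inch


def page_user_unit(page):
    """Return the page's /UserUnit, defaulting to 1.0 when the key is absent."""
    value = float(page.get("/UserUnit", 1.0))
    if value <= 0:
        raise ValueError("/UserUnit must be a positive number")
    return value


def scale_box(box, user_unit):
    """Scale a [llx, lly, urx, ury] box from user space units to points."""
    return [coord * user_unit for coord in box]


def to_inches(length, user_unit):
    """Convert a user space length to inches."""
    return length * user_unit / POINTS_PER_INCH


# Hypothetical parsed page: a Letter-sized box with /UserUnit 2.0.
page = {"/MediaBox": [0, 0, 612, 792], "/UserUnit": 2.0}
unit = page_user_unit(page)
media_box_points = scale_box(page["/MediaBox"], unit)
width_inches = to_inches(612, unit)
```

With a `/UserUnit` of `2.0`, a box that is numerically Letter-sized in user space is physically twice as wide and twice as tall, which is why the factor has to reach every reported coordinate.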
+ +pdftract must apply this scale factor to every coordinate derived from the page, including text positions extracted from content streams, glyph bounding boxes computed from font metrics, and the geometry boxes (`/MediaBox`, `/CropBox`, etc.). Failure to apply `/UserUnit` produces coordinates that are numerically correct in PDF user space but wrong once converted to physical units. The scale is applied after coordinate extraction, as a final multiplication before output. + +--- + +## /Contents: Single Stream vs. Array + +A page's content is described by one or more PDF streams referenced from `/Contents`. When the value is a single indirect reference, pdftract decodes and processes that stream. When the value is an array, each element is an indirect reference to a separate stream. The specification requires that these streams be treated as if concatenated into a single stream before parsing, meaning the graphics state is continuous across stream boundaries — an operator in one stream can depend on state established in a preceding stream. + +pdftract decodes each stream separately (to handle per-stream compression filters independently) but feeds them to a single graphics state machine in order. The specification permits stream boundaries only between lexical tokens, but some generators end one stream without trailing whitespace, so naive concatenation would fuse the last token of one stream with the first token of the next. pdftract therefore inserts a single ASCII space between decoded stream buffers before tokenizing. + +--- + +## The /Names Dictionary + +The `/Names` entry in the catalog is a dictionary of name trees organized by category. Each value is a name tree (a balanced B-tree of key-value pairs with string keys). The categories relevant to pdftract are: + +**`/Dests`** maps destination names to destination arrays or dictionaries. pdftract resolves these when it encounters named-destination references in annotations or outlines. + +**`/EmbeddedFiles`** maps names to file specification dictionaries. 
pdftract records these as attachments in its document metadata output and, when configured, can extract embedded file content alongside the main text. + +**`/JavaScript`** is informational only; pdftract records the presence of JavaScript but does not execute it. + +--- + +## /ViewerPreferences and /OpenAction + +The `/ViewerPreferences` dictionary controls how a viewer renders the document. Most of its keys (toolbar visibility, window fitting) are irrelevant to extraction. The one key pdftract reads is **`/Direction`**: when set to `/R2L`, it signals that the document's page progression and text base direction are right-to-left. pdftract uses this as a document-level hint for bidirectional text processing. + +The `/OpenAction` specifies an action or destination to activate when the document is opened. For extraction, this is purely informational: pdftract records the action type and any associated destination as document metadata (for example, to note that the document opens at a specific named destination), but it does not execute the action. + +--- + +## Page Inheritance + +The PDF page tree supports property inheritance: the keys `/MediaBox`, `/CropBox`, `/Resources`, and `/Rotate` may be defined on any intermediate `/Pages` node and are inherited by all descendant page nodes that do not define the key themselves. + +When pdftract processes a leaf page, it resolves each inheritable key by walking up the parent chain — using the `/Parent` reference present on every non-root node — until it finds the nearest ancestor that defines the key. The walk stops at the page tree root if no ancestor defines the key; in that case the key's default applies: an absent `/CropBox` defaults to the `/MediaBox`, and an absent `/Rotate` defaults to 0. The resolved values are cached per page during the initial tree traversal so the parent walk is not repeated during content stream processing. + +This inheritance mechanism means pdftract cannot resolve a page's complete attribute set from the leaf node alone. 
The full tree traversal that enumerates pages must simultaneously accumulate inherited values, threading them down through intermediate nodes and recording the resolved set at each leaf. pdftract performs this accumulation in a single top-down pass over the page tree, storing the resolved page descriptor before any content decoding begins. diff --git a/docs/research/hyperlinks-and-named-destinations.md b/docs/research/hyperlinks-and-named-destinations.md new file mode 100644 index 0000000..4c0e7b4 --- /dev/null +++ b/docs/research/hyperlinks-and-named-destinations.md @@ -0,0 +1,75 @@ +# Hyperlinks, Named Destinations, and Internal Navigation Structure + +## Overview + +PDF documents support a rich hyperlink model built on top of the annotation and action systems. Extracting hyperlinks faithfully requires understanding three distinct but interlocking subsystems: the Link annotation that marks a clickable region on the page, the action or destination dictionary that specifies the link target, and the document catalog's named destination index that resolves symbolic names to physical page locations. pdftract must implement all three layers to produce accurate link records for every annotation type used for navigation in real-world PDF files. + +## Link Annotations and Spatial Extraction of Anchor Text + +A PDF hyperlink begins with a `/Subtype /Link` annotation object in the page's `/Annots` array. The annotation carries a `/Rect` entry — an array of four numbers in page user space `[llx lly urx ury]` — that defines the bounding box of the clickable region. The coordinate system is the standard PDF one: origin at lower-left, y increasing upward, units in points. + +To recover anchor text, pdftract intersects the annotation's `/Rect` with the text spans extracted from the page's content stream. 
A text span intersects if its bounding box overlaps the annotation rectangle with sufficient coverage — in practice a threshold of roughly 50% overlap by area is reliable for single-line links, though more precise matching uses the span's individual glyph positions. The concatenated text of all intersecting spans, in reading order, forms the anchor text reported in the output. + +Additional annotation keys shape how the annotation presents visually but are not required for text extraction. `/Border` (a three-element array: horizontal corner radius, vertical corner radius, line width) and the newer `/BS` (border style dictionary with `/W` width and `/S` style) control the visible border; pdftract records border presence as metadata. `/H` is the highlight mode — `/I` (invert), `/O` (outline), `/P` (push), or `/N` (none) — which determines the visual response to a click and is otherwise ignored during extraction. + +## URI Actions + +When the annotation's `/A` dictionary has `/S /URI`, the link points to an external URL. The `/URI` key holds a byte string containing the URL, encoded as ASCII or UTF-8 depending on the producer. pdftract decodes it as UTF-8 with a fallback to Latin-1 for legacy files. The optional `/IsMap` boolean, when true, signals that the annotation is part of a server-side image map; the coordinates of the click are appended to the URL as a query string by the viewer. pdftract records this flag in the output but does not modify the extracted URL, since the coordinate-appending behavior is a viewer responsibility. For URI actions, the output `link_type` is `uri` and the `url` field carries the decoded string; anchor text comes from the spatial intersection described above. + +## GoTo Actions and Internal Link Resolution + +When `/A` carries `/S /GoTo`, or when the annotation has a direct `/Dest` key instead of an action dictionary, the link is internal — it targets a page within the same document. 
The destination is identified either by an explicit destination array or by a named destination string or name object that must be resolved through the catalog. + +An explicit destination array has the form `[page_ref /XYZ left top zoom]`, `[page_ref /Fit]`, `[page_ref /FitH top]`, `[page_ref /FitV left]`, `[page_ref /FitR left bottom right top]`, `[page_ref /FitB]`, `[page_ref /FitBH top]`, or `[page_ref /FitBV left]`. For a GoTo destination, the first element is always an indirect reference to the target page object. pdftract resolves this reference against the cross-reference table, identifies the page's zero-based index from the page tree, and emits `target_page` as that index. The destination type keyword governs how the viewport fits to the target but is not needed for link extraction beyond noting the page number. + +For GoTo actions, `link_type` is `internal` in the output schema, `target_page` holds the zero-indexed page number, and `url` is null. + +## GoToR Actions and Cross-Document Links + +The `/S /GoToR` action type points to a page in a different PDF file. The `/F` key is a file specification — either a plain string path or a file specification dictionary with `/F` (the file specification string), `/UF` (Unicode path), and `/FS` (file system name). pdftract extracts the most specific path available, preferring `/UF` over `/F`. The `/Dest` key follows nearly the same syntax as a GoTo destination — either a named string or an explicit array, whose first element is a page number in the remote document rather than an indirect page reference — but pdftract cannot resolve a named destination without reading the remote file, so it extracts the destination as a raw label string and marks `link_type` as `external`. The `url` field is populated with the file path, and a separate `destination_label` field carries the raw destination value for downstream consumers that can open the referenced file. + +## Named Destination Resolution + +Named destinations decouple the link reference from the physical page number, allowing documents to be reorganized without updating every annotation. 
A named destination in a GoTo or GoToR action is expressed either as a PDF name object (`/SomeName`) or as a PDF string (`(SomeName)` or a hex string). Both forms must be supported — name objects are limited to printable ASCII without spaces, while string-valued names can carry arbitrary characters. + +Resolution follows a two-path strategy. The modern mechanism is the `/Names` dictionary in the document catalog, which contains a `/Dests` entry pointing to a name tree. The legacy mechanism is a `/Dests` dictionary directly under the catalog — a flat mapping of name to destination. pdftract checks the name tree first, then falls back to the flat catalog dictionary. + +## The /Names→/Dests Name Tree + +A PDF name tree is a balanced B-tree structure stored as a graph of dictionary objects. Interior nodes carry a `/Kids` array of indirect references to child nodes, and a `/Limits` array of two strings giving the lexicographically smallest and largest names in the subtree. Leaf nodes omit `/Kids` and instead carry a `/Names` array of alternating name/destination pairs: `[(name1) dest1_array (name2) dest2_array ...]`. + +pdftract's traversal algorithm starts at the root node referenced by `/Names→/Dests`. If the root has `/Names`, it is already a leaf; iterate the array and build a hash map of name to destination. If the root has `/Kids`, examine each child's `/Limits` to prune branches that cannot contain the target name, then recurse into matching children. Because names in each node are lexicographically sorted, a binary search over the `/Names` array finds the target in O(log n) time per leaf. For bulk extraction — needed to annotate all links in a document in one pass — pdftract performs a full tree walk once and caches the resulting map, avoiding repeated traversals per annotation. 
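The traversal described above can be sketched as follows. This is a minimal illustration, not pdftract's implementation: it assumes every indirect reference has already been resolved, so each node is a plain Python `dict` with `/Kids`, `/Limits`, and `/Names` entries, and it uses `str` keys where a real file would carry byte strings. The function names `flatten_name_tree` and `lookup_name` are hypothetical.

```python
from bisect import bisect_left


def flatten_name_tree(root):
    """Walk the whole tree once and return a {name: destination} map."""
    result = {}
    stack = [root]
    seen = set()  # guards against cycles in malformed files
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        pairs = node.get("/Names", [])
        for i in range(0, len(pairs) - 1, 2):
            result.setdefault(pairs[i], pairs[i + 1])  # first definition wins
        stack.extend(reversed(node.get("/Kids", [])))
    return result


def lookup_name(node, target, _seen=None):
    """Find one name, pruning subtrees whose /Limits exclude the target."""
    seen = _seen if _seen is not None else set()
    if id(node) in seen:
        return None
    seen.add(id(node))
    if "/Names" in node:
        pairs = node["/Names"]
        pairs = pairs[: len(pairs) // 2 * 2]  # ignore a dangling key
        keys = pairs[0::2]
        i = bisect_left(keys, target)  # keys are sorted within a leaf
        if i < len(keys) and keys[i] == target:
            return pairs[2 * i + 1]
        return None
    for kid in node.get("/Kids", []):
        limits = kid.get("/Limits")
        if limits and len(limits) == 2 and not (limits[0] <= target <= limits[1]):
            continue
        found = lookup_name(kid, target, seen)
        if found is not None:
            return found
    return None


# Hypothetical two-leaf tree; destinations use page indices for brevity.
tree = {"/Kids": [
    {"/Limits": ["a", "c"], "/Names": ["a", [0, "/Fit"], "c", [2, "/Fit"]]},
    {"/Limits": ["m", "z"], "/Names": ["m", [5, "/XYZ", 0, 700, None]]},
]}
all_dests = flatten_name_tree(tree)
one_dest = lookup_name(tree, "c")
```

For bulk link extraction the flattened map is the simpler choice, since it is built once and reused for every annotation; the pruned lookup is only worth it when a handful of names must be resolved in a very large tree.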
+ +After resolving a named destination to an explicit destination array, the page reference in the array is resolved to a zero-based page index using the page tree, exactly as for direct GoTo actions. The `target_page_label` field in the output is populated from the page labels subsystem (the `/PageLabels` number tree in the catalog), giving the human-readable label such as "iv" or "A-3" alongside the numeric index. + +## Annotation /QuadPoints for Non-Rectangular Links + +When a hyperlink spans multiple lines (for example, a two-line heading that serves as a table-of-contents entry), the rectangular `/Rect` is an enclosing bounding box that may include significant whitespace between lines. The `/QuadPoints` array provides a per-quadrilateral clickable region that precisely tracks the actual text geometry. Each group of eight numbers defines one quadrilateral as four corner points in the order: lower-left, lower-right, upper-right, upper-left (or the alternate order used by some producers; both must be handled). + +When `/QuadPoints` is present, pdftract uses those quadrilaterals for anchor text extraction rather than the coarser `/Rect`. For each quadrilateral, compute its axis-aligned bounding box, intersect with text spans, and collect matching glyphs. Because only the bounding box of each quadrilateral is used, the corner order a producer chose does not affect the result. The union of text from all quadrilaterals, de-duplicated and sorted by reading order, is the anchor text. This produces significantly cleaner results for multi-line links in structured documents such as academic papers and reference manuals.
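
A rough sketch of the quadrilateral-based anchor text step follows. It assumes text spans arrive as `(text, bbox)` tuples already sorted in reading order, with boxes given as `(x0, y0, x1, y1)` in the same user space as the annotation; the function names are illustrative only.

```python
def quad_bboxes(quad_points):
    """Split a flat /QuadPoints array into axis-aligned boxes (x0, y0, x1, y1)."""
    boxes = []
    usable = len(quad_points) - len(quad_points) % 8  # drop a trailing partial quad
    for i in range(0, usable, 8):
        xs = quad_points[i:i + 8:2]        # x1, x2, x3, x4
        ys = quad_points[i + 1:i + 8:2]    # y1, y2, y3, y4
        # min/max makes the result independent of corner ordering.
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def overlaps(a, b):
    """True when two (x0, y0, x1, y1) boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def anchor_text(quad_points, spans):
    """Join the text of every span that intersects at least one quadrilateral.

    Each span is visited once, so a span touching two quads is not duplicated.
    """
    boxes = quad_bboxes(quad_points)
    picked = [text for text, bbox in spans
              if any(overlaps(bbox, quad) for quad in boxes)]
    return " ".join(picked)
```

A span that sits inside the enclosing `/Rect` but outside every quadrilateral, such as text to the left of a link that starts mid-line, is excluded by this test, which is the practical benefit over plain `/Rect` intersection.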
+ +## Output Schema for Links + +Each extracted link is represented as an object in the top-level `links` array of the pdftract JSON output, keyed by source page: + +```json +{ + "anchor_text": "Section 4.2 — Data Types", + "url": null, + "target_page": 17, + "target_page_label": "18", + "link_type": "internal", + "source_page": 2, + "source_rect": [72.0, 611.5, 310.25, 624.0] +} +``` + +For URI links, `url` carries the decoded URL string, `target_page` and `target_page_label` are null, and `link_type` is `uri`. For GoToR external links, `url` holds the target file path, `link_type` is `external`, and a `destination_label` field carries the raw named or array destination. The `source_rect` is always reported in the same coordinate space as the page's `/MediaBox` origin, with the y-axis unflipped (PDF default lower-left origin), so consumers can map the rect back to extracted text spans using the same coordinate frame pdftract uses internally. + +## Link Density as a Document Signal + +Link density — the ratio of `/Link` annotation count to total text span count on a page — is a lightweight signal for classifying page function. Pages whose link density exceeds a threshold (empirically around 0.3 links per span works well for most document types) are likely tables of contents, indices, or navigation pages, not body text. pdftract annotates such pages with a `page_type: navigation` hint in the per-page metadata block. This hint is advisory: downstream consumers can use it to skip navigation pages in full-text extraction pipelines, route them to outline reconstruction logic, or flag them for special handling. The threshold is configurable via the extraction profile so that documents with dense inline citation links — common in legal briefs and academic papers — are not incorrectly classified. + +## Implementation Priorities + +Handling the full annotation and action type matrix requires disciplined fallback logic. 
Not all annotations carry an `/A` action dictionary — some use the annotation-level `/Dest` key directly, which must be checked when `/A` is absent. Not all named destinations live in the name tree — the flat catalog `/Dests` dictionary is widely used in documents produced by older toolchains and must be consulted as a fallback. Not all link annotations have `/QuadPoints` — the implementation must detect presence before attempting quadrilateral-based text extraction and silently fall back to `/Rect` intersection when the key is absent. With these fallbacks in place, pdftract covers the complete range of link annotation patterns found in production PDF files across publishing, legal, academic, government, and technical documentation domains. diff --git a/docs/research/resource-dictionary-and-inheritance.md b/docs/research/resource-dictionary-and-inheritance.md new file mode 100644 index 0000000..ec8772e --- /dev/null +++ b/docs/research/resource-dictionary-and-inheritance.md @@ -0,0 +1,86 @@ +# PDF Resource Dictionaries, Resource Inheritance, and Namespace Isolation + +## Overview + +Every operator in a PDF content stream that references a named resource — a font, an image, a graphics state, a pattern — resolves that name through a layered lookup mechanism defined by the PDF specification. Implementing this mechanism correctly is a prerequisite for accurate text extraction, because a single character of text depends on identifying the right font object, and font naming in PDF is strictly local, not global. This document describes the full resource resolution semantics that pdftract must implement, from the structure of resource dictionaries through inheritance traversal, namespace isolation in Form XObjects, and graceful handling of malformed references. + +## Resource Dictionary Structure + +The `/Resources` key on a page dictionary (or Form XObject dictionary) holds a resource dictionary that organizes all named resources available to that object's content stream. 
The PDF specification defines several typed sub-dictionaries within a resource dictionary. Six of them name graphics and text resources: + +- `/Font` — maps local font names to font object references +- `/XObject` — maps local names to image or Form XObjects +- `/ExtGState` — maps local names to graphics state parameter dictionaries +- `/ColorSpace` — maps local names to color space definitions +- `/Pattern` — maps local names to tiling pattern or shading pattern dictionaries +- `/Shading` — maps local names to shading dictionaries used by the `sh` operator + +Each sub-dictionary is a PDF dictionary whose keys are name objects (the local resource names used in the content stream) and whose values are either inline dictionaries or indirect object references. A seventh sub-dictionary, `/Properties`, maps local names to the property list dictionaries referenced by the marked-content operators `BDC` and `DP`. A final entry, `/ProcSet`, is an array that historically listed the procedure sets required to interpret the stream. Modern PDF processors ignore `/ProcSet` entirely; it carries no semantic weight and pdftract must accept its presence without acting on it. + +## Resource Inheritance Through the Page Tree + +PDF page dictionaries are organized in a tree structure under the document catalog's `/Pages` node. Interior nodes of this tree (page tree nodes) may carry a `/Resources` entry that applies to all their descendant pages. A page dictionary that omits `/Resources` inherits the nearest ancestor's resource dictionary, found by traversing the `/Parent` chain upward until a node with `/Resources` is encountered. + +pdftract must implement this traversal explicitly. When resolving a resource name during content stream processing, the lookup begins with the page's own dictionary. If `/Resources` is absent, the processor walks the `/Parent` references (each `/Parent` pointer leads to a page tree node) and checks each ancestor in turn until a `/Resources` dictionary is found or the root is reached. The first `/Resources` encountered wins outright: it replaces, rather than merges with, any `/Resources` dictionaries further up the chain.
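
The inheritance walk can be illustrated with a short sketch. It assumes page and page tree nodes are plain dicts with the slash dropped from keys, and uses a hypothetical `resolve` callback in place of real indirect-reference handling; the visited set is an added defensive measure for malformed `/Parent` chains, not something the text above requires.

```python
def resolve_page_resources(page, resolve=lambda obj: obj):
    """Return the nearest /Resources dictionary on the /Parent chain.

    The first dictionary found wins as a whole; nothing is merged from
    nodes further up. An empty dict is returned when none exists.
    """
    node = resolve(page)
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # stop if a malformed /Parent chain loops
        resources = node.get("Resources")
        if resources is not None:
            return resolve(resources)
        parent = node.get("Parent")
        node = resolve(parent) if parent is not None else None
    return {}
```

Note that a page with an explicitly empty `/Resources` dictionary stops the walk immediately, which is why the check is `is not None` rather than a truthiness test.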
+ +This has a practical implication for implementation: the resource dictionary associated with a page must be resolved before content stream processing begins, not lazily during operator dispatch. pdftract should perform the full inheritance walk at page-load time and cache the resolved resource dictionary for the lifetime of content stream processing for that page. + +## Font Resource Lookup + +Within the `/Font` sub-dictionary, each entry maps a local name (such as `/F1`, `/TT0`, or any arbitrary PDF name) to an indirect reference to a font object. The `Tf` operator in a content stream selects the current font by supplying one of these local names along with a size. Because content streams use postfix notation, the operands precede the operator: `/F1 12 Tf`. + +The critical property of font naming is that local names are scoped to the resource dictionary that contains them. The name `/F1` in page A's `/Font` sub-dictionary and the name `/F1` in page B's `/Font` sub-dictionary are entirely independent and may reference different font objects: different encodings, different glyph sets, different CIDFont definitions. There is no global font namespace in PDF. pdftract must never assume that a font name seen on one page carries any meaning on another page. Each page's font lookup must go through that page's (or its inherited) `/Font` sub-dictionary. + +## XObject Resource Lookup + +The `/XObject` sub-dictionary maps local names to XObject streams. An XObject is either an image (subtype `/Image`) or a Form XObject (subtype `/Form`). The `Do` operator in a content stream takes a single name operand, looks it up in the current context's `/XObject` sub-dictionary, and either renders the image or recursively processes the Form XObject's content stream. During text extraction, image XObjects are irrelevant to the character stream, but encountering a `Do` operator referencing a Form XObject requires that pdftract recursively enter and process that Form XObject's content stream to capture any text it contains.
+ +## Form XObject Resource Isolation + +Form XObjects are the primary source of complexity in resource resolution. A Form XObject has its own `/Resources` dictionary, embedded in its stream dictionary, that is entirely separate from the invoking page's resource dictionary. When pdftract enters a Form XObject via a `Do` operator, the resource context must switch completely to the Form XObject's own resources. Operators inside the Form XObject's content stream — including `Tf` operators that select fonts — resolve all names against the Form XObject's `/Resources`, not the page's. + +This isolation is absolute: the Form XObject's `/Font` sub-dictionary is the authoritative namespace for all `Tf` operators inside that Form XObject. A font named `/F1` inside the Form XObject refers to the object listed under `/F1` in the Form XObject's `/Font` sub-dictionary, regardless of what `/F1` means in the enclosing page's resources. + +## ExtGState Lookup + +The `/ExtGState` sub-dictionary maps local names to graphics state parameter dictionaries. The `gs` operator takes a local name, looks it up in `/ExtGState`, and applies the parameters in the referenced dictionary to the current graphics state. Several entries within an ExtGState dictionary are relevant to text extraction or rendering state: + +- `/Font` — a two-element array `[font-object size]` that sets the current font and size, equivalent to a `Tf` operation. pdftract must handle this the same way it handles an explicit `Tf` operator. +- `/ca` — non-stroking (fill) opacity; relevant for determining whether text is visually transparent. +- `/CA` — stroking opacity. +- `/BM` — blend mode; affects compositing but rarely changes text extraction logic. +- `/SMask` — soft mask; may affect visibility but is secondary to text position extraction. 
+ +When a `gs` operator is encountered, pdftract must resolve the name through the current context's `/ExtGState` sub-dictionary (respecting the resource stack described below) and process any `/Font` entry within the resulting dictionary. + +## Nested Form XObjects + +A Form XObject's content stream may itself contain `Do` operators, referencing further Form XObjects listed in the Form XObject's own `/XObject` sub-dictionary. Nesting depth is unlimited by the specification. This means resource context switching is recursive: entering a second-level Form XObject switches to that object's `/Resources`, and returning from it restores the first-level Form XObject's resources. + +Although the PDF specification prohibits cycles in the XObject reference graph, malformed PDFs may include them. A Form XObject that directly or indirectly contains a `Do` reference to itself would cause infinite recursion if pdftract does not detect and break the cycle. pdftract must maintain a set of currently-active Form XObject object numbers during recursive processing. Before entering a Form XObject, its object number is checked against this set. If it is already present, pdftract skips the `Do` operation and records a warning. On return from a Form XObject, its object number is removed from the set. + +## Pattern and ColorSpace Resources + +The `/Pattern` sub-dictionary names tiling patterns and shading patterns, referenced by painting operators when a pattern color space is active. The `/ColorSpace` sub-dictionary names color spaces such as `DeviceN` or `Separation` that can be referenced by name in color selection operators. Neither of these resource types directly contributes to text extraction. However, pdftract must not crash when content streams reference pattern or color space names. A lookup that returns a pattern or color space object should be handled gracefully: apply no text-relevant state change, continue processing, and do not surface an error to the caller. 
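
Returning to the nested Form XObject case, the active-set cycle guard might look roughly like the sketch below. The data model is purely hypothetical: each Form XObject is reduced to a list of `("text", value)` and `("do", object_number)` operations so the control flow is visible without a real content stream parser.

```python
def walk_form(obj_num, forms, active, out, warnings):
    """Collect text from a Form XObject and everything it invokes via Do.

    `forms` maps object numbers to simplified operation lists (an assumption
    of this sketch). `active` holds the object numbers currently on the
    recursion path, so a form is only rejected while it is still open.
    """
    if obj_num in active:
        warnings.append(f"cycle: form {obj_num} already active, Do skipped")
        return
    active.add(obj_num)
    try:
        for op, arg in forms.get(obj_num, []):
            if op == "text":
                out.append(arg)
            elif op == "do":
                walk_form(arg, forms, active, out, warnings)
    finally:
        # Remove on return so the same form may be invoked again later.
        active.remove(obj_num)
```

Because membership is released on return, a form drawn twice in sequence is processed twice; only a form that reappears while it is still being processed is treated as a cycle. Very deep but acyclic nesting could still exhaust Python's recursion limit, so a production version would probably want an explicit depth cap as well.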
+ +## ResourceStack Implementation + +The correct abstraction for resource context during content stream processing is a stack of resource dictionaries. pdftract should define a `ResourceStack` structure that supports three operations: + +1. **Push** — called when entering a Form XObject; pushes the Form XObject's `/Resources` dictionary onto the top of the stack. +2. **Pop** — called when returning from a Form XObject; removes the top entry. +3. **Lookup(type, name)** — resolves a resource name within a given sub-dictionary type (`/Font`, `/XObject`, `/ExtGState`, etc.) by searching from the top of the stack downward, returning the first match found. + +At the start of content stream processing for a page, the stack is initialized with a single entry: the page's resolved (inheritance-applied) resource dictionary. Each `Do` operator that invokes a Form XObject pushes that Form XObject's `/Resources` onto the stack before recursing into the Form XObject's content stream, and pops it on return. + +The top-to-bottom search order in `Lookup` ensures that Form XObject-local names shadow any identically named resources in enclosing contexts. This is the correct behavior: a Form XObject's author controls the names within that Form XObject's scope without any obligation to avoid conflicts with names in the enclosing page. The fall-through to enclosing contexts is a leniency rather than the normal resolution path: a Form XObject that carries its own `/Resources` is expected to list every name its content stream uses, so only legacy forms that omit `/Resources`, or malformed ones that leave entries out, should ever reach the page's dictionary. + +## Missing Resources and Graceful Degradation + +Malformed PDFs may contain content streams that reference resource names absent from the applicable `/Resources` dictionary. A `Tf` operator naming `/F3` when `/F3` does not appear in the current context's `/Font` sub-dictionary is not a fatal error; it is a recoverable condition.
pdftract should treat a missing font lookup as an unknown font: the current font state is set to an unresolved sentinel, glyph-to-character mapping falls through to the fallback pipeline (Unicode inference from glyph names, ToUnicode CMap if a font object is eventually identified, or raw codepoint passthrough), and processing continues. The missing reference should be recorded in the extraction diagnostic log. + +Similarly, a `Do` operator referencing an XObject name absent from `/XObject` should log the missing reference and skip the operation rather than halting extraction. Pattern, color space, and ExtGState lookup failures follow the same pattern: log and continue. + +## Summary + +Correct PDF resource resolution requires implementing three interlocking mechanisms: inheritance traversal up the page tree to find the applicable `/Resources` dictionary, per-page (and per-Form-XObject) namespace isolation that prevents any cross-page or cross-object name aliasing, and a resource stack that tracks context switches as Form XObjects are entered and exited during recursive content stream processing. Font name resolution is the most critical path for text extraction; every font name given to `Tf`, and every ExtGState name given to `gs` (whose dictionary may carry its own `/Font` entry), must resolve through the stack's current top context. Cycle detection prevents malformed inputs from causing unbounded recursion. Missing resources must degrade gracefully into the fallback pipeline rather than surfacing as hard failures. Together these behaviors allow pdftract to process content streams from simple single-page documents and deeply nested Form XObject hierarchies alike, extracting text accurately regardless of the resource structure the PDF author chose.
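
As a closing illustration, the resource stack and the log-and-continue policy for missing names could be sketched like this. Resource dictionaries are assumed to be plain nested dicts with the slash dropped from keys, and the `diagnostics` list is a stand-in for whatever diagnostic log pdftract actually uses.

```python
class ResourceStack:
    """Stack of resource dictionaries searched from the top downward."""

    def __init__(self, page_resources):
        # The bottom frame is the page's inheritance-resolved dictionary.
        self._frames = [page_resources or {}]
        self.diagnostics = []

    def push(self, form_resources):
        """Enter a Form XObject; a form without /Resources pushes an empty frame."""
        self._frames.append(form_resources or {})

    def pop(self):
        """Leave a Form XObject; the page's own frame is never removed."""
        if len(self._frames) > 1:
            self._frames.pop()

    def lookup(self, kind, name):
        """Return the first match for `name` under sub-dictionary `kind`.

        A miss is recorded and reported as None so the caller can fall
        back instead of aborting extraction.
        """
        for frame in reversed(self._frames):
            entry = frame.get(kind, {}).get(name)
            if entry is not None:
                return entry
        self.diagnostics.append(f"missing /{kind} resource /{name}")
        return None
```

Pushing an empty frame for a form that lacks `/Resources` keeps push and pop symmetric, so the caller never has to remember whether a given `Do` actually changed the stack.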