Add research: page labels, government forms, book publishing, filter decoding

Four new extraction research documents covering page label/PageLabels
number tree and outline/bookmark tree extraction, government form PDF
patterns (IRS, USCIS, court filings, classification markings), book and
publishing PDF structure (running heads, footnotes, index extraction),
and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global
segments, CCITTFax, JPX, error boundaries).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-16 15:55:08 -04:00
parent 5ff918b178
commit 516ca154aa
4 changed files with 359 additions and 0 deletions


@ -0,0 +1,81 @@
# Book and Publishing PDF Extraction Patterns
## Overview
Book and publishing PDFs represent one of the most structurally complex document types that a text extraction library must handle. Unlike business documents or forms, books are designed first as physical artifacts — with front matter, body chapters, and back matter serving distinct purposes — and only later digitized into PDF form. Extracting clean, ordered text from these documents requires understanding both the physical layout conventions of printed books and the typographic idioms that publishers use to communicate structure to human readers. pdftract must map these conventions accurately to produce usable output.
---
## 1. Book Structure and Logical Organization
A typeset book follows a predictable physical sequence that does not always map cleanly to PDF page order. Front matter is paginated separately in lowercase Roman numerals (i, ii, iii...) and includes the half title page (title only, no subtitle), the full title page (title, subtitle, author, publisher), the copyright page, any dedication, the table of contents, lists of figures or tables, a foreword (written by a third party), and a preface (written by the author). Body chapters begin at Arabic page 1. Back matter — appendices, bibliography, glossary, and index — continues the Arabic numbering.
pdftract must track two parallel page number streams when present: the typeset page number embedded in the layout (extracted from running footers or explicit text) and the PDF page index (0-based). These diverge at the front matter boundary and must be reconciled to allow consumers to reference extracted text by canonical book page rather than PDF position. A chapter opening on PDF page 14 may carry the typeset page number 1.
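The reconciliation of the two page number streams can be illustrated with a minimal sketch. It assumes the simplest case, where front matter occupies a known number of leading PDF pages and the body restarts at Arabic 1; the function names and the `front_matter_pages` parameter are hypothetical and not part of pdftract.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def typeset_label(pdf_index: int, front_matter_pages: int) -> str:
    """Map a 0-based PDF page index to the printed page label.

    Assumption: the first `front_matter_pages` PDF pages are numbered
    i, ii, iii, ... and the body restarts at Arabic 1 immediately after.
    """
    if pdf_index < front_matter_pages:
        return to_roman(pdf_index + 1)
    return str(pdf_index - front_matter_pages + 1)


# A chapter opening at PDF index 14 after 14 front matter pages is page "1".
print(typeset_label(14, 14))
print(typeset_label(3, 14))
```

A real document may restart or skip numbering more than once, so a production mapping would more likely be a list of labelled ranges than a single offset.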
## 2. Running Headers and Footers
Professional typeset books place running headers (also called running heads) at the top of body pages. The verso (left/even) running head typically carries the book title or author name; the recto (right/odd) running head carries the current chapter title. Page numbers appear in the footer, often at the outer margin — left on verso pages, right on recto pages — though some designs place them in the header alongside the running head text.
These elements must be extracted and classified as artifact types rather than body text. pdftract should emit `page_header` and `page_footer` artifact records that carry the extracted text and the page position, but exclude them from the primary text stream. The challenge is distinguishing running heads from genuine content: they appear at fixed vertical positions, use a consistent smaller font (often 8–9pt italic), and repeat across pages. Heuristic detection should compare font size against the median body font size, check vertical position against a band threshold (top 8% or bottom 8% of the text area), and verify repetition across consecutive pages before classifying a text block as an artifact.
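The three checks (font size, vertical band, repetition) could be combined roughly as follows. This is a sketch under stated assumptions: `TextBlock`, `classify_margin_block`, and the digit-collapsing normalisation are hypothetical, and `nearby_blocks` is assumed to hold margin blocks from neighbouring pages, including pages two away so that mirrored verso and recto heads can match.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TextBlock:
    text: str
    font_size: float  # points
    y_frac: float     # vertical centre as a fraction of the text area: 0.0 top, 1.0 bottom


def _normalise(text: str) -> str:
    # Collapse digit runs so that folios such as "46" and "48" compare equal.
    return re.sub(r"\d+", "#", text.strip().lower())


def classify_margin_block(
    block: TextBlock,
    median_body_size: float,
    nearby_blocks: Sequence[TextBlock],
    band: float = 0.08,
) -> Optional[str]:
    """Return "page_header", "page_footer", or None for body content."""
    in_top = block.y_frac <= band
    in_bottom = block.y_frac >= 1.0 - band
    if not (in_top or in_bottom):
        return None
    if block.font_size >= median_body_size:
        return None
    key = _normalise(block.text)
    repeated = any(
        _normalise(other.text) == key and abs(other.y_frac - block.y_frac) <= 0.01
        for other in nearby_blocks
    )
    if not repeated:
        return None
    return "page_header" if in_top else "page_footer"
```

Requiring all three signals keeps a short epigraph at the top of a chapter opening from being discarded, since it fails the repetition test.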
Chapter opening pages (rectos) and section break pages typically suppress the running header entirely. pdftract must handle missing headers gracefully without treating the absence as an anomaly.
## 3. Verso/Recto Layout and Reading Order
Even/odd page mirroring affects margins, header position, and chapter opening conventions. In traditional book design, chapters open on recto (odd, right-hand) pages. This means that if a chapter ends on a recto, the next verso is intentionally left blank — a "blank verso" — before the new chapter begins. PDF generators from InDesign and QuarkXPress typically include these blank pages in the page stream, contributing a page with no content or only a minimal footer.
pdftract must detect and handle blank versos without misinterpreting them as extraction failures. A page containing zero text runs, or only a page-number footer, should be classified as a structural spacer and noted in metadata. Reading order reconstruction must skip these pages in the content flow while preserving their presence in the page index mapping. Content consumers that are building a chapter-by-chapter text corpus need to know that page 47 was blank and not incorrectly infer a merge between the text on pages 46 and 48.
## 4. Typography: Small Caps, Drop Caps, and Ornamental Dividers
Book typography uses several conventions that challenge naive character extraction.
**Small caps** are used for author names on title pages, for chapter headings in some traditions, and for acronyms in body text. In PDF, small caps may be represented as a dedicated small-caps font variant (e.g., `MinionPro-SmallCaps`) or synthesized by scaling uppercase glyphs. When a dedicated font is used, extraction is straightforward: the characters decode to their Unicode values normally. When small caps are synthesized, the glyphs are uppercase letters rendered at a reduced point size, and extraction may produce a mix of apparent font sizes within a single word. pdftract should detect synthesized small caps by identifying runs of uppercase characters at a size approximately 70–80% of the surrounding text's cap height and normalize them.
**Drop caps** (also called versals or initial caps) span two to three lines of text. In PDF layout, a drop cap is typically a separate text object positioned to the left of a text block, with the text block indented to accommodate it. Extraction that processes objects in top-to-bottom order will often place the drop cap character after the indented paragraph text, garbling the opening of the chapter. pdftract must detect drop caps by identifying oversized single-character text objects whose bounding box intersects the vertical extent of an adjacent paragraph, then prepend the character to that paragraph's text.
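A minimal sketch of the drop cap test described above, assuming each text object exposes a bounding box and font size. The `TextObject` type, the `merge_drop_cap` name, and the 2× size ratio are illustrative assumptions, and the coordinate system is assumed to have y increasing downward.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextObject:
    text: str
    font_size: float
    x0: float
    x1: float
    y_top: float     # assumption: y increases downward
    y_bottom: float


def merge_drop_cap(
    candidate: TextObject,
    paragraph: TextObject,
    body_size: float,
    min_ratio: float = 2.0,
) -> Optional[str]:
    """Return the paragraph text with the drop cap prepended, or None."""
    glyph = candidate.text.strip()
    if len(glyph) != 1:
        return None  # only single-character objects qualify in this sketch
    if candidate.font_size < min_ratio * body_size:
        return None  # not oversized relative to body text
    overlaps = candidate.y_top < paragraph.y_bottom and paragraph.y_top < candidate.y_bottom
    left_of = candidate.x0 < paragraph.x0
    if not (overlaps and left_of):
        return None
    return glyph + paragraph.text.lstrip()


cap = TextObject("O", 36.0, 72.0, 96.0, 200.0, 236.0)
para = TextObject("nce upon a time", 11.0, 100.0, 400.0, 202.0, 240.0)
print(merge_drop_cap(cap, para, body_size=11.0))
```

A drop cap preceded by an opening quotation mark would fail the single-character check here; handling that case is left out of the sketch.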
**Ornamental dividers** separate sections within a chapter. Common forms include centered asterisks (`* * *`), em-dash sequences (`———`), floral ornaments rendered as glyph characters in symbol fonts, or decorative rule images. Text-based dividers extract cleanly if the font encoding is standard. Ornamental glyphs in symbol fonts require mapping via the font's ToUnicode table or, failing that, classification as a divider artifact based on position (vertically centered in white space, horizontally centered on the text block). pdftract should emit these as `section_break` structural markers rather than body text.
## 5. Chapter and Section Headers
Headers in typeset books are distinguished by font size, weight, alignment, and surrounding white space. Chapter titles are typically 18–24pt, often centered, preceded by several lines of vertical space (the chapter opening sink) and followed by additional space before body text begins. Section headers (A-heads) are 12–14pt bold or small caps; subsection headers (B-heads) may be 11pt bold italic run-in at the start of a paragraph.
pdftract must classify headers by analyzing font size relative to body text, font weight flags from the PDF font descriptor, and the vertical gap before and after the text block. A block whose font size is more than 1.4× the modal body size, centered horizontally, and preceded by more than 2× the normal line spacing is a strong candidate for a chapter or section header. These should be emitted with a structural tag (`h1` for chapter title, `h2` for A-head, `h3` for B-head) so consumers can reconstruct the document outline without post-processing heuristics.
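The classification rule can be sketched as a small decision function. The `h1` test uses the thresholds stated above (1.4× body size, centered, more than 2× line spacing before the block); the `h2` and `h3` tests are assumptions added for illustration, as are the `Block` type and its fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    font_size: float   # points
    bold: bool
    italic: bool
    centered: bool
    gap_before: float  # vertical space above the block, in points


def classify_heading(block: Block, body_size: float, line_spacing: float) -> Optional[str]:
    """Return "h1", "h2", "h3", or None for ordinary body text."""
    ratio = block.font_size / body_size
    if ratio > 1.4 and block.centered and block.gap_before > 2.0 * line_spacing:
        return "h1"  # chapter title
    if block.bold and ratio >= 1.1:
        return "h2"  # A-head; the 1.1 ratio is an assumed threshold
    if block.bold and block.italic:
        return "h3"  # run-in B-head at roughly body size (assumed rule)
    return None


print(classify_heading(Block(20.0, False, False, True, 60.0), body_size=10.5, line_spacing=13.0))
```

Small caps A-heads are not covered because detecting them depends on the small-caps handling described in the previous section.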
## 6. Index Extraction
The index presents a unique extraction challenge. Index pages are typically set in two columns at a smaller font size (8–9pt) with indentation encoding hierarchy. A top-level entry consists of a term followed by page numbers. Sub-entries are indented 1 em. Page number ranges use an en-dash (23–48), and multiple references are separated by commas (23, 45–48, 112).
Correct extraction requires column detection before line reconstruction. pdftract must identify the gutter between columns and process each column as an independent text stream before merging. Within each column, indentation level (measured in points from the left column margin) distinguishes top-level entries from sub-entries. The page number list at the end of each entry must be parsed separately to support structured output — a consumer building an inverted index needs term, hierarchy level, and page number array, not a flat string.
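Once a column has been split into lines, the term, hierarchy level, and page list can be separated along these lines. This is a sketch only: `parse_index_line` and its parameters are hypothetical, the em width is assumed equal to a 9pt font size, and abbreviated ranges such as 145–48 are returned as written rather than expanded.

```python
import re
from typing import Dict, List, Tuple, Union

# One page reference: a single page, or a range joined by an en dash or hyphen.
_PAGE_REF = re.compile(r"(\d+)(?:\s*[–-]\s*(\d+))?")
# The term, then the first comma that is followed by a digit, then the page list.
_ENTRY = re.compile(r"^(.*?),\s*(\d.*)$")

Entry = Dict[str, Union[str, int, List[Tuple[int, int]]]]


def parse_index_line(text: str, indent_pt: float, em_pt: float = 9.0) -> Entry:
    """Split one index line into term, hierarchy level, and page ranges."""
    level = int(round(indent_pt / em_pt))
    match = _ENTRY.match(text.strip())
    if match is None:
        # No page list, e.g. a heading entry whose pages sit on sub-entries.
        return {"term": text.strip(), "level": level, "pages": []}
    pages: List[Tuple[int, int]] = []
    for part in match.group(2).split(","):
        ref = _PAGE_REF.fullmatch(part.strip())
        if ref is None:
            continue  # cross-references and other non-numeric parts are skipped here
        start = int(ref.group(1))
        end = int(ref.group(2)) if ref.group(2) else start
        pages.append((start, end))
    return {"term": match.group(1).strip(), "level": level, "pages": pages}


print(parse_index_line("ligatures, 23, 45–48, 112", indent_pt=0.0))
```

Representing every reference as a (start, end) pair keeps single pages and ranges uniform for a consumer building an inverted index.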
## 7. Footnotes and Endnotes
Footnotes appear at the bottom of the page, separated from the body text by a short rule (the footnote divider) that spans approximately one-third of the text width. Footnote text is set in a smaller font (typically 8–9pt) and keyed to inline markers in the body: superscript numerals, symbols (`*`, `†`, `‡`), or letters, depending on the publisher's style.
pdftract must associate each footnote block with its inline marker. The inline marker is a superscript character (identifiable by a vertical offset above the baseline and a smaller font size) at a specific position in the body text. The footnote block at the page bottom carries a matching marker at its start. Association is by marker identity within the page scope. For endnotes — collected at chapter end or at the book's back matter — the association must span pages: the inline marker in chapter 3 body text refers to a note in the endnotes section 50 pages later. pdftract should preserve marker identity and emit note references so that downstream consumers can perform cross-page linking.
## 8. Copyright Page and ISBN Extraction
The copyright page (typically the verso of the title page) contains structured bibliographic data: publisher name and address, copyright year, ISBN-10 and ISBN-13, edition number, printing history, Library of Congress Cataloging data, and legal notices. This data is valuable for bibliographic enrichment and must be extracted as structured fields, not flat text.
pdftract should apply pattern matching for ISBN-13 (978- or 979- prefix, 13 digits with optional hyphens), ISBN-10 (10 digits or 9 digits plus X), copyright year (© followed by 4-digit year), and edition markers ("First edition", "Second printing"). The copyright page is identifiable by its position (verso of title page, Roman-numeral page iv or the equivalent) and the density of these patterns. Structured output should separate publisher, year, ISBNs, and edition into discrete fields.
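One way to sketch this pattern matching is with regular expressions followed by check-digit validation, which filters out digit runs that merely look like ISBNs. The function name and the returned field names are hypothetical; the check-digit rules are the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) schemes.

```python
import re
from typing import Dict, List, Optional, Union

_ISBN13 = re.compile(r"(?<!\d)97[89][- ]?(?:\d[- ]?){9}\d(?!\d)")
_ISBN10 = re.compile(r"(?<![\dXx-])(?:\d[- ]?){9}[\dXx](?![\d-])")
_YEAR = re.compile(r"©\s*(\d{4})")


def _valid_isbn13(digits: str) -> bool:
    # Weights alternate 1, 3, 1, 3, ...; the total must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def _valid_isbn10(chars: str) -> bool:
    # Weights run 10 down to 1; "X" stands for 10; the total must be divisible by 11.
    values = [10 if c == "X" else int(c) for c in chars]
    return sum(v * (10 - i) for i, v in enumerate(values)) % 11 == 0


def extract_copyright_fields(text: str) -> Dict[str, Union[List[str], Optional[int]]]:
    """Pull validated ISBNs and the copyright year out of copyright-page text."""
    isbn13 = []
    for m in _ISBN13.finditer(text):
        digits = re.sub(r"[- ]", "", m.group())
        if _valid_isbn13(digits):
            isbn13.append(digits)
    isbn10 = []
    for m in _ISBN10.finditer(text):
        chars = re.sub(r"[- ]", "", m.group()).upper()
        if _valid_isbn10(chars):
            isbn10.append(chars)
    year = _YEAR.search(text)
    return {
        "isbn13": isbn13,
        "isbn10": isbn10,
        "copyright_year": int(year.group(1)) if year else None,
    }


page = "Copyright © 1980 Example Press\nISBN 978-0-306-40615-7 (hardcover)\nISBN 0-306-40615-2"
print(extract_copyright_fields(page))
```

Publisher name and edition markers are omitted because they need looser, publisher-specific patterns.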
## 9. Publisher PDF Generator Profiles
The `/Producer` entry in the PDF document information dictionary reveals the toolchain used to create the file, and this has predictable implications for extraction quality.
InDesign exports via Direct Export or via Distiller produce well-formed font encoding with complete ToUnicode maps, accurate glyph widths, and predictable text object ordering. These are the cleanest source for extraction. QuarkXPress PDFs are similarly clean but may use older Type 1 fonts with encoding vectors that require manual mapping for ligatures (fi, fl, ff, ffi, ffl). TeX-based PDFs from academic publishers using pdfTeX or XeTeX embed Type 1 or OpenType fonts with full Unicode mapping when compiled with modern packages; older TeX-to-PostScript-to-PDF workflows may produce PDFs with encoding gaps for common ligatures. pdftract should detect the generator string and apply generator-specific heuristics: for TeX sources, apply ligature normalization; for QuarkXPress sources, verify encoding against known encoding vectors.
## 10. EPUB-to-PDF Conversion Artifacts
A significant share of ebook-era books are produced as EPUBs first and converted to PDF via Calibre, Pandoc, or browser print-to-PDF. These conversions produce structurally different PDFs from native InDesign output.
EPUB-sourced PDFs typically have clean Unicode throughout — no encoding gaps, no ligature issues — because the source is HTML. However, they introduce other artifacts: chapter break pages that are blank or contain only a decorative image, inconsistent heading font sizes (because CSS heading styles do not map cleanly to fixed typographic conventions), and ornamental chapter separators as inline images rather than text. Page margins are often uniform (no verso/recto mirroring), and there are no running headers, so the header/footer detection heuristics that work for InDesign output do not apply.
pdftract should detect EPUB-sourced PDFs by checking for common generator strings (Calibre, wkhtmltopdf, headless Chrome, Prince) and adjust extraction strategy accordingly: skip header/footer artifact detection, treat blank pages as chapter breaks rather than structural spacers, and apply a more permissive heading-detection threshold given the flatter typographic hierarchy common in reflowable EPUB layouts.
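Generator detection for both native and EPUB-converted sources might be organised as a lookup from producer substrings to a strategy record. Everything here is a hedged sketch: the substrings are taken from the tool names mentioned in this document, real `/Producer` and `/Creator` values vary by version and should be verified against sample files, and the strategy field names and the 1.2 heading ratio are assumptions.

```python
from typing import Dict, List, Tuple, Union

Strategy = Dict[str, Union[bool, float, str]]

# Default strategy for unknown generators (field names are hypothetical).
_DEFAULT: Strategy = {
    "profile": "generic",
    "detect_running_heads": True,
    "normalize_ligatures": False,
    "verify_encoding_vectors": False,
    "blank_page_role": "structural_spacer",
    "heading_ratio": 1.4,
}

# (substring, overrides) pairs; substrings are illustrative, not verified strings.
_PROFILES: List[Tuple[str, Strategy]] = [
    ("indesign", {"profile": "indesign"}),
    ("quarkxpress", {"profile": "quarkxpress", "verify_encoding_vectors": True}),
    ("pdftex", {"profile": "tex", "normalize_ligatures": True}),
    ("xetex", {"profile": "tex", "normalize_ligatures": True}),
]
_EPUB: Strategy = {
    "profile": "epub_converted",
    "detect_running_heads": False,
    "blank_page_role": "chapter_break",
    "heading_ratio": 1.2,  # more permissive threshold (assumed value)
}
for _name in ("calibre", "wkhtmltopdf", "chrome", "prince"):
    _PROFILES.append((_name, _EPUB))


def strategy_for(producer: str) -> Strategy:
    """Return extraction settings for a /Producer (or /Creator) string."""
    lowered = producer.lower()
    for needle, overrides in _PROFILES:
        if needle in lowered:
            return {**_DEFAULT, **overrides}
    return dict(_DEFAULT)


print(strategy_for("calibre 6.4")["blank_page_role"])
```

Keeping the overrides as data rather than branching code makes it cheap to add a profile when a new generator shows up in a test corpus.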
---
## Summary
Clean text extraction from book and publishing PDFs requires handling two distinct layers: the physical layout conventions of printed books (running heads, verso/recto mirroring, chapter sinks, blank pages) and the typographic idioms that encode structure (drop caps, small caps, ornamental dividers, header hierarchies). pdftract must detect and classify these elements as structural artifacts or semantic markers rather than treating all text objects as body content. The generator-detection approach allows the extraction pipeline to apply source-appropriate heuristics, producing well-ordered, structurally annotated output that faithfully represents the logical organization of the original work.


@ -0,0 +1,103 @@
# Government Form and Regulatory PDF Extraction Patterns
## Overview
Government-origin PDFs represent one of the most structurally diverse and extraction-challenging categories a PDF library will encounter. Unlike commercial documents produced by a single authoring toolchain, government forms span decades of software, print-and-scan workflows, AcroForm interactivity, security paper, barcodes, and classification markings. pdftract must handle each of these patterns correctly to produce complete, usable text output rather than partial or silently incorrect extractions.
---
## IRS Tax Form PDFs
IRS forms such as the 1040, W-2, and the Schedule series are AcroForm PDFs with named field annotations. Each numbered line — Line 1, Line 7a, Line 22b — corresponds to a distinct AcroForm widget. pdftract must extract both the field name (as a structured label) and the field value, preserving the line numbering as a key into the extracted record. Checkbox fields for filing status (Single, Married Filing Jointly, Head of Household, and so on) carry a Boolean value in the field annotation and must not be confused with adjacent label text.
A critical edge case arises when a taxpayer prints a partially completed e-file form, fills in handwritten amounts, and rescans it. In this case the AcroForm values are absent — the form is now a scanned image — and computed totals that would otherwise appear in widget values are only recoverable through OCR of the scanned pixel layer. pdftract must detect the absence of AcroForm data on a form that structurally resembles a known AcroForm template and escalate to OCR rather than returning empty fields. Line-position heuristics (vertical Y coordinate buckets aligned to IRS layout grids) can recover labeled numeric values even when OCR confidence is imperfect.
Schedule attachments (Schedule B, Schedule D, Schedule SE, and so on) are typically embedded as additional pages within the same PDF file. pdftract should preserve page-level provenance — attaching each extracted field to the page index from which it came — so callers can distinguish Form 1040 page 1 data from Schedule D capital gains tables.
---
## Immigration Form PDFs (USCIS I-Series and N-400)
USCIS forms such as the I-130, I-485, I-765, and N-400 follow a predictable multi-part section structure with numbered parts (Part 1, Part 2, Part 3, and so on), each subdivided into numbered items. pdftract should recognize this section hierarchy and expose it in extraction output as a nested structure keyed on part and item number, not merely as a flat ordered list of field values.
Checkbox fields in USCIS forms carry high semantic weight. A checkbox for "Yes" or "No" in response to a question about criminal history, prior immigration violations, or membership in prohibited organizations is legally significant. pdftract must preserve checkbox state — checked or unchecked — and associate it unambiguously with the parent question text. When multiple checkboxes appear within a single question (for example, "check all that apply"), each must be individually annotated.
Signature pages present a distinct challenge. The signature itself is typically an image or a user-drawn annotation; pdftract should flag the signature field as `signature_field` in metadata and extract the surrounding attestation text (the printed legal declaration above the signature line) as normal text. It is never correct to suppress or skip signature pages.
Barcode pages are appended to USCIS forms when generated through the USCIS online filing system or certain immigration software packages. These pages contain a PDF417 or similar 2D barcode encoding the entire form submission as a binary payload. pdftract detects such pages, flags each detected barcode region as `barcode_detected` with its bounding box coordinates, and does not attempt to decode the binary payload as text. Any human-readable data printed adjacent to the barcode — a confirmation number, applicant name, or form identifier — is in the normal vector text layer and is fully extractable.
---
## US Passport and Visa Application Forms
The DS-11 (passport application) and DS-160 (nonimmigrant visa application) follow a biographic data field pattern: surname, given name, date of birth, place of birth, Social Security Number, and travel document details. These fields are either AcroForm widgets or pre-labeled grid cells depending on the version and whether the form was electronically generated or pre-printed.
A photograph placeholder occupies a designated rectangular region on these forms. The placeholder is an image container, not text. pdftract must recognize photograph placeholders by their aspect ratio and position within the form layout and annotate them as `photograph_placeholder` in extraction metadata rather than attempting to interpret the image content as text. Checkbox responses — for questions about criminal history, dual nationality, or prior visa refusals — follow the same extraction rules as immigration forms: preserve state and parent question association.
---
## Government Procurement Forms (SF-86, SF-1449, DD-254)
Federal procurement and security clearance forms are dense structured tables. The SF-86 (Questionnaire for National Security Positions) contains over 120 pages of field tables with text, checkbox, and date inputs. The SF-1449 (Solicitation/Contract/Order) and DD-254 (Contract Security Classification Specification) similarly use tabular grid layouts where cell boundaries delineate field scope.
pdftract must use cell boundary geometry — detected from vector path segments or whitespace analysis — to associate field values with their labels correctly. In multi-column procurement forms, naive reading-order extraction produces garbled output by interleaving column A and column B content. Geometric table detection must take precedence over Unicode reading order for these forms.
Classification markings in headers and footers appear as boldface centered text on procurement forms (discussed further below). Certification blocks — contractor signature, date, and Contracting Officer Representative fields — should be preserved with their structural context.
---
## Court Filing Cover Sheets and Civil Cover Sheets (JS-44)
The JS-44 civil cover sheet filed with federal district courts contains a checkbox array for nature-of-suit codes, jurisdiction basis, and origin of the action. Each checkbox corresponds to a category code (for example, 422 for Bankruptcy Appeal, 110 for Insurance). pdftract must extract both the checkbox state and the numeric code, since the code — not the label text — is the authoritative data element consumed by court filing systems.
Party information fields (plaintiff name, defendant name, attorneys of record, county of residence) are typically AcroForm fields or typed-text overlays on a pre-printed form background. Case category codes in the nature-of-suit section appear in dense two-column checkbox arrays; geometric layout analysis ensures the correct code is associated with each checked entry.
---
## Government-Generated Flat PDF Reports
Not all government PDFs are interactive forms. FedBizOpps and SAM.gov opportunity PDFs, FOIA response packets, and regulatory docket documents published by agencies such as the EPA, SEC, or FTC are typically flat PDFs generated by report engines or document management systems. These contain no AcroForm fields whatsoever. The entire content is vector text organized in paragraphs, tables, and headers.
pdftract's extraction path for these documents is straightforward text and table extraction without any field-detection logic. However, FOIA response packets often combine generated cover letters (vector text) with scanned exhibit pages (rasterized images), requiring pdftract to handle mixed-mode PDFs on a per-page basis — applying OCR only to pages where no selectable text layer exists, and using the vector text layer directly on pages where it is present.
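The per-page decision could look like the sketch below. The inputs (a count of extractable characters and the fraction of the page covered by images) and both thresholds are assumptions chosen for illustration; the function name is hypothetical.

```python
from typing import List, Tuple


def page_mode(
    char_count: int,
    image_coverage: float,
    min_chars: int = 20,
    min_coverage: float = 0.5,
) -> str:
    """Choose an extraction mode for one page.

    char_count: characters found in the selectable text layer.
    image_coverage: fraction of the page area covered by raster images (0.0 to 1.0).
    """
    if char_count >= min_chars:
        return "vector"  # a usable text layer is present, so use it directly
    if image_coverage >= min_coverage:
        return "ocr"     # mostly raster content and no text layer
    return "blank"


def plan_document(pages: List[Tuple[int, float]]) -> List[str]:
    """Apply the per-page decision to every (char_count, image_coverage) pair."""
    return [page_mode(chars, coverage) for chars, coverage in pages]


# A cover letter followed by two scanned exhibits and a separator sheet.
print(plan_document([(1800, 0.0), (0, 0.98), (3, 0.95), (0, 0.0)]))
```

Scanned pages that already carry an invisible OCR layer would be classified as "vector" here; whether to trust that layer or re-run OCR is a separate policy question.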
---
## Certificate and License PDFs
Professional licenses, birth certificate printouts, deeds, and government-issued certificates present a different extraction challenge. The semantic content — licensee name, license number, issuance date, expiration date, issuing authority — is encoded as vector text and is fully extractable. The visual complexity of these documents comes from decorative and security elements that are not text.
Embossed seals appear as rasterized images embedded in the PDF. Security paper backgrounds — colored fiber patterns, guilloche designs, watermarks — are either embedded images or vector graphic layers. pdftract should ignore image content that does not contain extractable text and should not attempt to OCR decorative background layers. The structured text fields on these certificates remain in the vector text layer and can be extracted directly without any image processing.
---
## Government Scan and OCR PDFs
A substantial fraction of government documents available through FOIA releases, court dockets, and agency archives are legacy scans. These were typically scanned at 200 to 300 DPI on flatbed or document-feed scanners and subsequently OCR'd, sometimes with the OCR text embedded as an invisible layer over the raster image. Quality varies significantly by agency and era.
pdftract applies its standard OCR correction pipeline to these documents: character confidence scoring, dictionary-based correction for common OCR errors (rn/m substitution, 0/O confusion, l/1 confusion), and layout reconstruction to recover paragraph and column structure from line-coordinate clustering.
Government scan artifacts require specific handling. Hole punches along the left margin of three-ring-binder documents create dark circular regions that OCR engines frequently misinterpret as characters. pdftract detects circular high-contrast regions within the left margin zone (approximately 0.5 inches from the left edge) and masks them before OCR. Stamps — RECEIVED, APPROVED, CLASSIFIED, VOID — are typically rotated text images overlaid on the document. pdftract detects high-contrast rectangular or free-form rotated overlays and processes them separately, flagging detected stamp regions as `overlay_stamp` in per-page metadata with the extracted text if legible.
---
## Barcodes in Government Forms
PDF417, QR, and Code 39/128 barcodes appear as rasterized images within government PDFs. The barcode payload itself — whether it encodes form data, a tracking number, or an application identifier — is not text-extractable by pdftract. Attempting to decode barcode pixel data as text produces garbage output.
The correct behavior is detection and flagging. When pdftract identifies an image region that matches barcode structural characteristics (high-frequency vertical striping for 1D barcodes, square matrix patterns for 2D barcodes), it records a `barcode_detected` annotation in the extraction output with the bounding box of the barcode image in page coordinates. Human-readable text printed above or below the barcode (a form number, a confirmation code, an applicant identifier) is in the vector or OCR text layer and is extracted normally. Callers that need barcode payloads must route those image regions to a dedicated barcode decoder outside pdftract.
---
## Classification and Handling Markings
Government documents subject to information controls carry standardized marking strings in page headers and footers. Common markings include UNCLASSIFIED, CONTROLLED UNCLASSIFIED INFORMATION (CUI), FOR OFFICIAL USE ONLY (FOUO), SENSITIVE BUT UNCLASSIFIED (SBU), and PRIVACY ACT PROTECTED. Classified documents add SECRET, TOP SECRET, and compartment designators, though these are uncommon in documents available through public channels.
These markings appear as text — typically bold, centered, uppercase — in repeating header and footer positions across all pages of the document. pdftract extracts them as normal text but additionally inspects the set of recognized marking strings and records any matches in a `handling_markings` array in the document-level extraction metadata. This allows callers to surface classification status programmatically without parsing free-form header text.
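Populating a `handling_markings` array might be sketched as a case-sensitive scan of header and footer lines. The marking list comes from the strings named above; the function name, the decision to keep `//` category suffixes attached, and the de-duplication are assumptions of this sketch.

```python
import re
from typing import Iterable, List

_MARKINGS = [
    "CONTROLLED UNCLASSIFIED INFORMATION",
    "SENSITIVE BUT UNCLASSIFIED",
    "FOR OFFICIAL USE ONLY",
    "PRIVACY ACT PROTECTED",
    "UNCLASSIFIED",
    "TOP SECRET",
    "SECRET",
    "CUI",
    "FOUO",
    "SBU",
]
# Longest alternatives first so "TOP SECRET" wins over "SECRET".
_ALTERNATION = "|".join(re.escape(m) for m in sorted(_MARKINGS, key=len, reverse=True))
# Uppercase-only match, not embedded in a longer word, with optional //CATEGORY suffixes.
_PATTERN = re.compile(r"(?<![A-Z])(?:" + _ALTERNATION + r")(?://[A-Z]+)*(?![A-Z])")


def find_handling_markings(margin_lines: Iterable[str]) -> List[str]:
    """Return distinct markings found in header/footer lines, in first-seen order."""
    found: List[str] = []
    for line in margin_lines:
        for match in _PATTERN.finditer(line):
            if match.group() not in found:
                found.append(match.group())
    return found


print(find_handling_markings(["CUI//PRVCY", "UNCLASSIFIED", "Department of Examples", "CUI//PRVCY"]))
```

Matching is deliberately case-sensitive so that ordinary prose containing the word "secret" does not register as a marking.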
The marking strings themselves are not redacted or suppressed. A document marked CUI may have substantive content redacted (appearing as black rectangles or blank space), but the CUI marking and any handling instructions (such as CUI//PRVCY or CUI//LAW) are always present in the text layer and must be preserved in extraction output.
---
## Summary
Government form PDFs demand that pdftract correctly navigate AcroForm fields with semantic labels, multi-part section hierarchies, checkbox state preservation, mixed scan-and-vector page composition, barcode detection without payload decoding, geometry-driven table extraction, and classification marking identification. No single extraction strategy covers this space. pdftract's layered approach — AcroForm field extraction, vector text extraction, geometric layout analysis, per-page mode detection, and OCR with artifact correction — provides the coverage necessary to produce complete, accurate, and structurally faithful text from the full range of government-origin PDFs.


@ -0,0 +1,73 @@
# PDF Stream Filters, Image Compression, and Decoding for Text Extraction
## Overview
PDF content streams and image XObjects are almost never stored as raw bytes; they pass through one or more compression filters before being written to the file. pdftract must reverse exactly the sequence of filters applied at write time before content-stream operators or raw pixel data become accessible. A single mishandled filter can leave an entire page blank; a crash inside a decoder can abort extraction for every subsequent page. This document covers each filter pdftract must support, the parameters that govern its behavior, and the error-handling discipline required to survive malformed streams.
## The Filter Pipeline
The `/Filter` entry in a stream dictionary may be either a single name (e.g., `/FlateDecode`) or an array of names (e.g., `[/ASCII85Decode /FlateDecode]`). When an array is present, the filters are listed in the order in which they must be applied to decode the data, which is the reverse of the order in which the encoder applied them. pdftract therefore applies decoders in array order: the first filter in the array is decoded first, its output fed into the second decoder, and so on. In the example above, the data was Flate-compressed and then ASCII85-armored, so decoding undoes the ASCII85 armor first and inflates second. The companion `/DecodeParms` entry mirrors this structure: either a single parameter dictionary or an array of dictionaries (or null entries for filters that take no parameters) aligned positionally with the filter array.
pdftract must treat `/Filter` and `/DecodeParms` as a paired pipeline. If `/DecodeParms` is shorter than `/Filter` or contains null entries, the corresponding decoders apply their defaults. Any count mismatch is malformed-but-recoverable: apply defaults for the unpaired stages and log the discrepancy.
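The pairing and defaulting rules can be sketched as follows. PDF objects are modelled with plain Python values as an assumption of the sketch: names are strings, arrays are lists, dictionaries are dicts, and null is `None`. The function names and the `DECODERS` table are hypothetical, and the hex decoder is a simplified stand-in.

```python
import zlib
from typing import Callable, Dict, List, Optional, Sequence, Tuple, Union

Params = Dict[str, object]
Decoder = Callable[[bytes, Params], bytes]


def pair_filters(
    filters: Union[str, Sequence[str], None],
    parms: Union[Params, Sequence[Optional[Params]], None],
) -> Tuple[List[Tuple[str, Params]], bool]:
    """Return ([(filter_name, params), ...], count_mismatch)."""
    names = [filters] if isinstance(filters, str) else list(filters or [])
    if parms is None:
        parm_list: List[Optional[Params]] = []
    elif isinstance(parms, dict):
        parm_list = [parms]
    else:
        parm_list = list(parms)
    # An absent /DecodeParms is normal; a present one of the wrong length is logged.
    mismatch = bool(parm_list) and len(parm_list) != len(names)
    stages = []
    for i, name in enumerate(names):
        entry = parm_list[i] if i < len(parm_list) else None
        stages.append((name, entry or {}))  # null or missing entries fall back to defaults
    return stages, mismatch


def decode_stream(data: bytes, filters, parms, decoders: Dict[str, Decoder]) -> bytes:
    """Apply decoders in array order, feeding each output into the next stage."""
    stages, _mismatch = pair_filters(filters, parms)
    for name, params in stages:
        data = decoders[name](data, params)
    return data


# Simplified stand-in decoders for demonstration only.
DECODERS: Dict[str, Decoder] = {
    "/ASCIIHexDecode": lambda d, p: bytes.fromhex(d.rstrip(b">").decode("ascii")),
    "/FlateDecode": lambda d, p: zlib.decompress(d),
}

armored = zlib.compress(b"BT /F1 12 Tf ET").hex().encode("ascii") + b">"
print(decode_stream(armored, ["/ASCIIHexDecode", "/FlateDecode"], None, DECODERS))
```

Returning the mismatch flag instead of raising lets the caller log the discrepancy and continue, which matches the malformed-but-recoverable stance described above.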
## FlateDecode
FlateDecode is the dominant filter in modern PDFs, used for content streams, embedded font data, image data, and cross-reference streams (since PDF 1.5). The payload is a standard zlib stream (RFC 1950 wrapping deflate), so the inflate step is straightforward with any conformant zlib implementation.
The complication lies in the `/Predictor` parameter inside `/DecodeParms`. A predictor value of 1 (or absent) means no prediction was applied. A value of 2 indicates TIFF predictor 2 (horizontal differencing): each sample is stored as a delta from the previous sample on the same row. Reconstruction adds each delta to a running accumulator, column by column.
PNG predictors occupy values 10 through 15. Value 10 is None; 11 is Sub (delta from the left pixel); 12 is Up (delta from above); 13 is Average (delta from the floor of (left + above) / 2); 14 is Paeth. Value 15 means the encoder chose the best predictor per row. For every value in this range, each row is prefixed by a single tag byte (0 through 4) naming the predictor actually used for that row, so pdftract must read and strip that byte for all PNG predictor values and apply the inverse transform it names rather than relying on the `/Predictor` value alone.
Correct FlateDecode requires: inflate the zlib stream, then iterate over rows applying the inverse predictor using `/Columns`, `/Colors`, and `/BitsPerComponent` to determine row stride. Skipping the predictor step scrambles pixel data in a way that produces garbage OCR output without triggering any obvious decode error.
## LZWDecode
LZWDecode is the predecessor to FlateDecode, defined since PDF 1.0 and still present in documents from early desktop publishing. It uses LZW with 9- to 12-bit codes. The `/EarlyChange` parameter is critical: a value of 1 (default) means the encoder incremented the code width one entry early, before the table filled. A value of 0 selects late change, matching a stricter pre-PDF-1.2 interpretation. Decoding with the wrong setting produces plausible but incorrect bytes with no detectable error. LZWDecode supports the same `/Predictor` mechanism as FlateDecode, and pdftract must apply the identical post-decompression reconstruction.
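A compact decoder sketch shows where `/EarlyChange` enters the algorithm. It assumes MSB-first bit packing with clear-table code 256 and end-of-data code 257; the function name is hypothetical, and the sample bytes are a short hand-checked vector that encodes the codes 256, 45, 258, 258, 65, 259, 66, 257.

```python
def lzw_decode(data: bytes, early_change: int = 1) -> bytes:
    """Decode PDF-style LZW data (MSB-first, variable-width codes of 9 to 12 bits)."""
    clear_code, eod_code = 256, 257
    # Entries 0..255 are single bytes; 256 and 257 are placeholders for the control codes.
    table = [bytes([i]) for i in range(256)] + [b"", b""]
    width = 9
    prev = None
    out = bytearray()
    bit_buffer = 0
    bit_count = 0
    for byte in data:
        bit_buffer = (bit_buffer << 8) | byte
        bit_count += 8
        while bit_count >= width:
            bit_count -= width
            code = (bit_buffer >> bit_count) & ((1 << width) - 1)
            bit_buffer &= (1 << bit_count) - 1
            if code == eod_code:
                return bytes(out)
            if code == clear_code:
                del table[258:]
                width = 9
                prev = None
                continue
            if prev is None:
                entry = table[code]                 # first code after a clear adds no entry
            elif code < len(table):
                entry = table[code]
                table.append(prev + entry[:1])
            elif code == len(table):
                entry = prev + prev[:1]             # code refers to the entry being defined
                table.append(entry)
            else:
                raise ValueError("LZW code outside the current table")
            out += entry
            prev = entry
            # With early_change=1 the width grows one table entry sooner.
            if width < 12 and len(table) + early_change >= (1 << width):
                width += 1
    return bytes(out)


sample = bytes.fromhex("800B6050220C0C8501")
print(list(lzw_decode(sample)))
```

The sample is too short to reach a width change, so it decodes identically under either `/EarlyChange` setting; distinguishing the two requires a stream with more than about 250 new table entries.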
## ASCII85Decode and ASCIIHexDecode
These filters provide ASCII armor for binary data, historically used to safely transmit PDFs over channels that corrupt eight-bit bytes. Both still appear in PDFs generated by certain print workflows.
ASCII85Decode encodes every four binary bytes as five printable characters in the range `!` through `u` (ASCII 33–117), representing the base-85 digits of a 32-bit big-endian value. An all-zero group is represented by the single character `z` instead of `!!!!!`. The stream terminates with `~>`, and whitespace is ignored throughout. A final group of fewer than four bytes is padded to four, encoded, and only the first (n+1) characters of the five-character result are emitted. pdftract must handle partial final groups, the `z` shortcut, and embedded whitespace.
ASCIIHexDecode is simpler: each byte is two hex digits (upper or lower case), whitespace ignored, terminated by `>`. pdftract reads digit pairs until the terminator.
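Both ASCII filters are small enough to sketch directly. The function names are hypothetical. The ASCII85 decoder pads a partial final group with the highest digit (`u`) before truncating, and the hex decoder treats a trailing unpaired digit as if it were followed by `0`.

```python
_WHITESPACE = b"\x00\t\n\x0c\r "


def ascii85_decode(data: bytes) -> bytes:
    """Decode ASCII85 data terminated by "~>" (no leading "<~" is expected)."""
    out = bytearray()
    group = []
    for byte in data:
        if byte in _WHITESPACE:
            continue
        if byte == ord("~"):
            break                                  # start of the "~>" terminator
        if byte == ord("z"):
            if group:
                raise ValueError("'z' inside a group")
            out += b"\x00\x00\x00\x00"
            continue
        if not ord("!") <= byte <= ord("u"):
            raise ValueError(f"invalid ASCII85 byte {byte}")
        group.append(byte - 33)
        if len(group) == 5:
            value = 0
            for digit in group:
                value = value * 85 + digit
            if value > 0xFFFFFFFF:
                raise ValueError("ASCII85 group overflows 32 bits")
            out += value.to_bytes(4, "big")
            group = []
    if group:
        n = len(group)
        if n == 1:
            raise ValueError("a final group needs at least two characters")
        value = 0
        for digit in group + [84] * (5 - n):       # pad with 'u', the highest digit
            value = value * 85 + digit
        out += (value & 0xFFFFFFFF).to_bytes(4, "big")[: n - 1]
    return bytes(out)


def asciihex_decode(data: bytes) -> bytes:
    """Decode hex digit pairs up to the ">" terminator, ignoring whitespace."""
    digits = []
    for byte in data:
        if byte == ord(">"):
            break
        if byte in _WHITESPACE:
            continue
        ch = chr(byte)
        if ch not in "0123456789abcdefABCDEF":
            raise ValueError(f"invalid hex digit {ch!r}")
        digits.append(ch)
    if len(digits) % 2:
        digits.append("0")                         # odd count: assume a trailing zero
    return bytes.fromhex("".join(digits))


print(ascii85_decode(b"9jqo^~>"), asciihex_decode(b"48 65 6C 6c 6F>"))
```

Python's `base64.a85decode` was not used because its Adobe mode expects `<~ ... ~>` framing, whereas a PDF stream carries only the trailing `~>`.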
## DCTDecode (JPEG)
DCTDecode wraps a standard JPEG bitstream. The data is a complete JPEG file including SOI and EOI markers, so pdftract passes it directly to Tesseract without re-encoding, preserving quality and avoiding unnecessary decode-reencode cycles.
The `/ColorTransform` parameter controls color space interpretation. For three-component images, a value of 1 (the default) means YCbCr, requiring conversion to RGB before use; a value of 0 means the data is already RGB. For four-component CMYK JPEG, the default is 0 (no transform); a value of 1 means Adobe YCCK encoding. CMYK requires conversion to RGB before Tesseract can process it. The standard inversion formula (R = 255 - (C × (255 - K) / 255) - K, and similarly for G with M and B with Y) is adequate for OCR purposes.
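As a worked instance of the CMYK inversion, the Python sketch below computes each channel as 255 minus the ink contribution scaled by (255 minus K), minus K. The helper name `cmyk_to_rgb` is hypothetical, integer division is an assumption made for simplicity, and no ICC profile handling is attempted.

```python
def cmyk_to_rgb(c: int, m: int, y: int, k: int) -> tuple:
    """Naive 8-bit CMYK to RGB conversion, adequate for OCR but not color-accurate."""
    def channel(ink: int) -> int:
        # 255 - (ink * (255 - k) / 255) - k, using integer arithmetic
        return 255 - (ink * (255 - k)) // 255 - k
    return channel(c), channel(m), channel(y)
```

No ink at all maps to white and full black ink maps to black, which is the property OCR binarization depends on.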
JPEG restart markers (RST0–RST7) partition entropy-coded data into independently decodable segments; a conformant JPEG library handles them transparently.
## JPXDecode (JPEG 2000)
JPXDecode wraps a JPEG 2000 bitstream, available since PDF 1.5, and is commonly used for high-resolution scans. A JPEG 2000 stream in PDF is a self-contained JP2 file that may embed an ICC color profile in its JP2 header box structure.
For OCR preprocessing, pdftract decodes the JP2 stream to a raw pixel array, applies any embedded ICC profile conversion to reach standard RGB or grayscale, and passes the result to Tesseract. The OpenJPEG library provides a well-tested open-source decoder. pdftract must treat memory allocation failure as a recoverable per-image error — JPEG 2000 images often expand to tens of megabytes — rather than a fatal condition.
## JBIG2Decode
JBIG2 is a bi-level (one bit per pixel) compression standard that achieves very high compression ratios on scanned text by identifying repeated symbol shapes. It is extremely common in scanned PDFs produced by office copiers and document management systems.
PDF embeds JBIG2 as two parts: an optional global segment stream stored as a separate stream object (referenced by `/JBIG2Globals` in `/DecodeParms`) containing shared symbol dictionaries, and per-page segment data in the filter stream. pdftract must fetch and retain the global stream for the lifetime of the document, prepend it to each page's local segments, and present the assembled bitstream as a single coherent JBIG2 sequence to the decoder.
Failing to prepend the global dictionary causes the decoder to fail on every symbol reference, producing a blank or garbage image with no clear error. The libjbig2dec library handles two-part assembly correctly when segments arrive in order.
## CCITTFaxDecode
CCITTFaxDecode encodes bi-level images using fax standards. The `/K` parameter selects the algorithm: 0 is Group 3 one-dimensional (T.4 1D); a positive integer is Group 3 mixed (a 1D row may be followed by at most K - 1 consecutive 2D rows); a negative value, conventionally -1, is Group 4 two-dimensional (T.6), the most compact and most common in PDFs.
`/EndOfLine` (default false) indicates whether each row ends with an EOL code; `/EncodedByteAlign` (default false) forces rows to start on byte boundaries; `/Columns` gives image width; `/Rows` gives height. pdftract must pass all four values to the CCITT decoder. An incorrect `/Columns` misaligns every row, producing text that appears diagonally shredded — visually obvious but not always self-diagnosing.
## RunLengthDecode
RunLengthDecode uses a simple packet encoding: a length byte in 0–127 means the next (byte + 1) bytes are literal; 129–255 means the next single byte is repeated (257 - byte) times; 128 is the end-of-data marker. This filter is rare in modern PDFs, appearing mainly in older bi-level and indexed-color images. The decoder is straightforward and unlikely to be a source of failure.
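The packet rules translate almost directly into code. The following Python sketch (the name `run_length_decode` is hypothetical) copies literal runs for length bytes 0 to 127, repeats the next byte 257 minus length times for length bytes 129 to 255, and stops at 128.

```python
def run_length_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:              # end-of-data marker
            break
        if length < 128:               # literal run of length + 1 bytes
            count = length + 1
            chunk = data[i:i + count]
            if len(chunk) < count:
                raise ValueError("truncated literal run")
            out += chunk
            i += count
        else:                          # repeat the next byte 257 - length times
            if i >= len(data):
                raise ValueError("truncated repeat run")
            out += bytes([data[i]]) * (257 - length)
            i += 1
    return bytes(out)
```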
## Filter Error Handling
Malformed filter data is a fact of life for any PDF reader operating on documents from diverse sources. Corrupted streams, truncated downloads, and PDF generators that miscount stream lengths all produce inputs no conformant decoder can process. pdftract must apply a consistent recovery discipline: isolate filter decoding for each image XObject in its own error boundary; on any decode failure (zlib checksum error, premature end-of-stream, invalid ASCII85 symbol, missing JBIG2 global dictionary), log the stream object number and the failure reason, mark the image undecodable, and continue processing remaining content on the page.
Partial decode is sometimes better than discarding the image entirely. For FlateDecode and LZWDecode, inflate output up to the point of failure often contains complete rows of pixel data. pdftract should attempt partial decode for these filters, padding the output to the expected dimensions with white pixels before passing to Tesseract, which handles sparse input gracefully. For JBIG2 and DCT, partial data is not usable and should be discarded.
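One possible shape for the partial-decode path is sketched below in Python using the standard `zlib` module. The helper name `inflate_partial`, the small chunk size, and the use of `0xFF` as the white padding value (true for 8-bit DeviceGray and DeviceRGB, not for every color space) are assumptions for illustration; the sketch also assumes no predictor is in play, since padding would otherwise have to happen after predictor reversal.

```python
import zlib


def inflate_partial(data: bytes, expected_len: int, white: int = 0xFF, chunk: int = 64):
    """Inflate as much as possible, then pad to expected_len with white bytes.

    Returns (pixels, recovered) where recovered is the number of bytes that
    actually came from the stream, so the caller can log how much was lost.
    """
    decoder = zlib.decompressobj()
    out = bytearray()
    try:
        # Feeding small chunks limits how much output is lost when one chunk fails.
        for start in range(0, len(data), chunk):
            out += decoder.decompress(data[start:start + chunk])
        out += decoder.flush()
    except zlib.error:
        pass  # keep whatever was inflated before the failure
    recovered = min(len(out), expected_len)
    pixels = bytes(out[:recovered]) + bytes([white]) * (expected_len - recovered)
    return pixels, recovered
```

A truncated stream does not raise in `zlib.decompressobj`; it simply yields less output, so comparing `recovered` with the expected length is what signals the shortfall.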
Invalid `/DecodeParms` values — unknown predictor codes, out-of-range `/K`, negative `/Columns` — must not cause panics. pdftract validates all parameters on parse, substitutes safe defaults for out-of-range values, and logs the substitution. No malformed stream should prevent extraction of text that is correctly encoded elsewhere in the same document.
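A validate-and-substitute pass over `/DecodeParms` might look like the Python sketch below. It covers only the FlateDecode and LZWDecode parameters; the function name `sanitize_parms`, the logger name, and the representation of the dictionary as a plain `dict` are hypothetical. The defaults used are the ones the PDF specification gives for these keys.

```python
import logging

log = logging.getLogger("decodeparms")  # hypothetical logger name

DEFAULTS = {"Predictor": 1, "Colors": 1, "BitsPerComponent": 8, "Columns": 1, "EarlyChange": 1}

VALIDATORS = {
    "Predictor": lambda v: v in (1, 2, 10, 11, 12, 13, 14, 15),
    "Colors": lambda v: v >= 1,
    "BitsPerComponent": lambda v: v in (1, 2, 4, 8, 16),
    "Columns": lambda v: v >= 1,
    "EarlyChange": lambda v: v in (0, 1),
}


def sanitize_parms(raw: dict, object_number: int) -> dict:
    """Return a parameter dict in which every value is valid, logging substitutions."""
    clean = {}
    for key, default in DEFAULTS.items():
        value = raw.get(key, default)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not (is_int and VALIDATORS[key](value)):
            log.warning("object %d: invalid /%s %r, using default %r",
                        object_number, key, value, default)
            value = default
        clean[key] = value
    return clean
```

Substituting a default does not make the decoded pixels correct, but it keeps the failure local to one image instead of aborting the page.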

@ -0,0 +1,102 @@
# Page Labels, Outlines, and Document Navigation Structure
## Overview
PDF documents carry navigation metadata that goes far beyond raw text content. Page labels define how a document's pages are logically numbered — Roman numerals for front matter, alphabetic codes for appendices, decimal for body chapters. The outline tree (commonly called bookmarks) encodes the document's hierarchical structure as a tree of titled entries each pointing to a specific page and position. Named destinations bridge these two systems, providing stable symbolic references that both outline items and in-page hyperlink annotations can target. For pdftract, implementing full extraction of this navigation layer transforms output from a flat stream of text into a structured artifact that reflects the document author's intent.
---
## 1. Page Labels: Logical Numbering via /PageLabels
The PDF specification stores page label definitions in a number tree rooted at `/PageLabels` in the document catalog. A number tree maps integer keys (physical page indices, zero-based) to label range dictionaries. Each dictionary defines how pages are labeled from that index until the next range begins.
Each label range dictionary contains up to three fields:
- `/S` (style): the numbering style applied within this range. Legal values are `/D` (decimal Arabic), `/R` (uppercase Roman), `/r` (lowercase Roman), `/A` (uppercase alphabetic A–Z, AA–ZZ, …), and `/a` (lowercase alphabetic). Omitting `/S` produces pages with no numeric component: the label is the prefix string alone.
- `/P` (prefix): an optional PDF string prepended to every label in the range. A prefix of `"App-"` combined with `/S /D` and a start value of 1 yields labels `App-1`, `App-2`, and so on.
- `/St` (start value): the integer at which counting begins within the range. Defaults to 1 if absent.
A typical scholarly monograph might define three ranges: physical pages 0–7 labeled with `/S /r` (lowercase Roman: i through viii), physical page 8 onward labeled with `/S /D /St 1` (decimal: 1, 2, 3, …), and a final range starting at the first appendix page labeled with `/P "A-" /S /D /St 1` (A-1, A-2, …). Back matter can resume a fresh decimal sequence by introducing another range with the appropriate `/St` offset.
pdftract must parse the `/PageLabels` number tree in full and precompute a mapping from every physical page index (0-based) to its logical label string. This mapping is then available at extraction time so that every output object — text blocks, annotations, outline entries — can carry both `page_index` (the zero-based physical position) and `page_label` (the human-readable string such as `"vi"` or `"A-3"`). Exposing both values is essential: downstream consumers that want to render "Page vi of xii" use the label, while those doing positional math use the index.
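The precomputed index-to-label mapping can be illustrated with a short Python sketch. It assumes the number tree has already been flattened into a dictionary keyed by starting page index, with `S`, `P`, and `St` entries as plain Python values; that representation and the name `build_page_labels` are hypothetical. Pages that precede the first range fall back to plain decimal numbering, which is an assumption of this sketch rather than something the specification defines.

```python
from bisect import bisect_right

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    out = []
    for value, glyph in ROMAN:
        while n >= value:
            out.append(glyph)
            n -= value
    return "".join(out)


def to_alpha(n: int) -> str:
    # 1..26 -> A..Z, 27..52 -> AA..ZZ (the same letter repeated), and so on
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(style, n: int) -> str:
    if style == "D":
        return str(n)
    if style in ("R", "r"):
        roman = to_roman(n)
        return roman if style == "R" else roman.lower()
    if style in ("A", "a"):
        alpha = to_alpha(n)
        return alpha if style == "A" else alpha.lower()
    return ""  # no /S: the label is the prefix alone


def build_page_labels(ranges: dict, page_count: int) -> list:
    """Map every zero-based page index to its label string."""
    starts = sorted(ranges)
    labels = []
    for index in range(page_count):
        pos = bisect_right(starts, index)
        if pos == 0:
            labels.append(str(index + 1))  # assumption: decimal fallback
            continue
        start = starts[pos - 1]
        spec = ranges[start]
        number = spec.get("St", 1) + (index - start)
        labels.append(spec.get("P", "") + format_number(spec.get("S"), number))
    return labels
```

With the three-range monograph layout described above, index 5 maps to `vi`, index 8 to `1`, and the third appendix page to `A-3`.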
---
## 2. The Outline Tree: /Outlines and Its Node Structure
The document catalog's `/Outlines` entry points to an outline dictionary that serves as the root of the bookmark tree. The root itself is not displayed; it acts as a container whose `/First` and `/Last` entries reference the first and last top-level outline items respectively.
Each outline item is a dictionary with the following fields:
- `/Title`: a PDF string (potentially UTF-16BE encoded) that contains the visible label shown to the reader. Extraction requires decoding byte-order-mark-prefixed UTF-16BE correctly, falling back to PDFDocEncoding for byte strings without the BOM.
- `/Parent`: a reference back to the containing node (the root or another outline item).
- `/First` / `/Last`: references to the first and last child items if the entry has children.
- `/Next` / `/Prev`: references to the adjacent siblings within the same parent's child list.
- `/Count`: an integer describing the item's descendants. A positive value means the item is open and gives the number of descendants currently visible; a negative value means the item is collapsed in the viewer, and its absolute value gives the number of descendants that would become visible if it were opened. pdftract should record both the count and whether the node was collapsed.
- `/F` (flags): a bitmask. Bit 1 (value 1) means italic rendering; bit 2 (value 2) means bold. These can be combined.
- `/C` (color): an array of three floats in the DeviceRGB space for the title's display color. Absent means black.
Traversal of the outline tree is a linked-list walk, not an array iteration. Starting at the root's `/First`, pdftract follows `/Next` pointers across siblings and recursively descends into children via `/First` at each node that has them, tracking depth as it goes.
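The linked-list walk can be sketched in Python as below. The object model is hypothetical: `objects` maps an object number to a plain dictionary, and references are bare integers. The sketch adds a visited set so that a malformed `/Next` or `/First` cycle cannot loop forever, and it uses Latin-1 as a stand-in for PDFDocEncoding, which is only an approximation because the two encodings differ in a few code ranges.

```python
def decode_title(raw: bytes) -> str:
    if raw.startswith(b"\xfe\xff"):               # UTF-16BE with byte-order mark
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")                   # approximation of PDFDocEncoding


def walk_outline(objects: dict, first_ref, level: int = 0, seen=None) -> list:
    """Follow /Next across siblings and recurse through /First into children."""
    seen = set() if seen is None else seen
    nodes = []
    ref = first_ref
    while ref is not None and ref not in seen:
        seen.add(ref)
        item = objects[ref]
        count = item.get("Count", 0)
        flags = item.get("F", 0)
        child_ref = item.get("First")
        children = walk_outline(objects, child_ref, level + 1, seen) if child_ref is not None else []
        nodes.append({
            "title": decode_title(item.get("Title", b"")),
            "level": level,
            "count": count,
            "open": count > 0,          # negative /Count means collapsed
            "italic": bool(flags & 1),
            "bold": bool(flags & 2),
            "children": children,
        })
        ref = item.get("Next")
    return nodes
```

Calling `walk_outline(objects, root["First"])` on the outline root yields the top-level items at level 0 with their subtrees nested under `children`.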
---
## 3. Outline Item Destinations
Each outline item points to a location in the document through either a `/Dest` entry or an `/A` (action) entry.
A `/Dest` value is either an array or a name/string that references a named destination. An explicit destination array has the form `[page_ref /XYZ left top zoom]` where `page_ref` is an indirect reference to the target page object, `/XYZ` is the most common destination type (others include `/Fit`, `/FitB`, `/FitH`, `/FitV`), and `left`, `top`, `zoom` are optional coordinate and zoom parameters that may be `null`. pdftract resolves the page reference against the document's page tree to determine the zero-based physical page index, then looks up the page label from the precomputed mapping.
When `/Dest` is a string or name, it is a named destination reference. Named destinations are stored in one of two places: the `/Dests` dictionary directly under the document catalog (older format, maps name to destination array), or the `/Names` dictionary's `/Dests` name tree (modern format, a balanced tree structure mapping string keys to destination arrays or dictionaries). pdftract must resolve named destinations by checking both locations. The resolution produces the same kind of destination array, from which the page reference is extracted identically.
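Checking both locations can be sketched as follows in Python. As before, the data model is hypothetical: the catalog and tree nodes are plain dictionaries, `/Kids` entries are integer object numbers looked up in `objects`, and keys are ordinary strings. The sketch uses `/Limits` to skip subtrees when present and unwraps the dictionary form of a destination through its `/D` entry; it does not guard against cyclic `/Kids`, which a production resolver would need to.

```python
def lookup_name_tree(objects: dict, node: dict, key: str):
    """Search a name tree node (leaf /Names pairs or intermediate /Kids)."""
    names = node.get("Names")
    if names is not None:
        for i in range(0, len(names) - 1, 2):      # flat [key, value, key, value, ...]
            if names[i] == key:
                return names[i + 1]
    for kid_ref in node.get("Kids", []):
        kid = objects[kid_ref]
        limits = kid.get("Limits")
        if limits is None or limits[0] <= key <= limits[1]:
            found = lookup_name_tree(objects, kid, key)
            if found is not None:
                return found
    return None


def resolve_named_dest(objects: dict, catalog: dict, name: str):
    """Return the destination array for a name, or None if it is absent."""
    dest = None
    tree_root = catalog.get("Names", {}).get("Dests")
    if tree_root is not None:
        dest = lookup_name_tree(objects, tree_root, name)
    if dest is None:
        dest = catalog.get("Dests", {}).get(name)  # older catalog-level dictionary
    if isinstance(dest, dict):
        dest = dest.get("D")                        # dictionary form wraps the array
    return dest
```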
When the outline item uses `/A` instead of `/Dest`, the value is an action dictionary. The relevant cases are:
- `/S /GoTo` with a `/D` entry: a within-document GoTo action. The `/D` value is a destination, treated identically to a `/Dest` entry — either an explicit array or a named destination string.
- `/S /GoToR` with a `/F` (file spec) entry: a cross-document GoTo action targeting another PDF file. pdftract should record these as unresolvable with a note that the target is external, rather than attempting file system resolution.
- `/S /URI` with a `/URI` entry: a hyperlink to a web address. In outline items this is unusual but valid; pdftract records the URI string.
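The dispatch over `/Dest` and the three action types can be summarized in a small Python sketch. The function name `classify_target` and the result keys are hypothetical, as are the `"internal"` and `"uri"` type labels; outline items are again modeled as plain dictionaries.

```python
def classify_target(item: dict) -> dict:
    """Decide how an outline item (or link annotation) points at its target."""
    if "Dest" in item:
        return {"destination_type": "internal", "dest": item["Dest"]}
    action = item.get("A") or {}
    kind = action.get("S")
    if kind == "GoTo" and "D" in action:
        # Same handling as /Dest: explicit array or named destination
        return {"destination_type": "internal", "dest": action["D"]}
    if kind == "GoToR":
        # Cross-document target: recorded, never followed on the file system
        return {"destination_type": "external", "file": action.get("F"),
                "note": "target is in another PDF file"}
    if kind == "URI" and "URI" in action:
        return {"destination_type": "uri", "uri": action["URI"]}
    return {"destination_type": "unresolved"}
```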
---
## 4. Structured Outline Output
pdftract should serialize the outline tree as a JSON array of hierarchical node objects. Each node carries:
```json
{
  "title": "Chapter 3: Signal Processing",
  "level": 2,
  "page_index": 47,
  "page_label": "38",
  "open": true,
  "bold": false,
  "italic": false,
  "children": [ ... ]
}
```
`level` is the zero-based depth in the tree (top-level items are level 0). `page_index` and `page_label` are both included. `open` reflects the sign of `/Count`. The `children` array is present and may be empty; it is never omitted, which allows consumers to handle the structure uniformly without null checks. Items whose destinations could not be resolved (named destinations absent from the document, cross-file GoToR actions) include `"page_index": null` and `"page_label": null` with a `"destination_type"` field set to `"external"` or `"unresolved"` as appropriate.
---
## 5. Outline as a Reading-Order and Heading Hint
For structured documents — technical reports, academic books, reference manuals — the outline tree encodes the heading hierarchy that the author intended. Outline items at level 0 typically correspond to chapters or major sections; level 1 items to subsections; level 2 to sub-subsections. The title strings often exactly match the heading text on the target page.
pdftract can exploit this relationship during text extraction. After extracting text blocks from a page, each block's bounding box can be compared against outline entries whose `page_index` matches the current page. When the normalized text of an outline title appears in a text block at or near the top of the region, that block's inferred heading level can be set to the outline item's depth. This cross-reference is a heuristic and should be reported with a confidence field rather than applied silently, since rendered heading text may differ from the outline title through abbreviation, line wrapping, or font substitution. Nevertheless, it substantially improves structure inference for documents that lack tagged PDF or explicit heading role markup.
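A simplified version of this cross-reference is sketched below in Python. It scores each outline title against each text block on the same page with `difflib.SequenceMatcher` after whitespace and case normalization, and it emits a hint with a confidence value rather than mutating the block. The names `match_headings`, `block_id`, and the 0.8 threshold are hypothetical, and the sketch ignores block position on the page, which a fuller version would also weigh.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().casefold()


def match_headings(outline_items: list, blocks: list, page_index: int,
                   min_confidence: float = 0.8) -> list:
    """Suggest heading levels for blocks whose text resembles an outline title."""
    hints = []
    for item in outline_items:
        if item["page_index"] != page_index:
            continue
        title = normalize(item["title"])
        best_score, best_block = 0.0, None
        for block in blocks:
            score = SequenceMatcher(None, title, normalize(block["text"])).ratio()
            if score > best_score:
                best_score, best_block = score, block
        if best_block is not None and best_score >= min_confidence:
            hints.append({
                "block_id": best_block["id"],
                "heading_level": item["level"],
                "confidence": round(best_score, 2),
            })
    return hints
```

Because the result is a separate list of hints, a consumer can apply only those above its own confidence cutoff.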
---
## 6. URI Actions and Hyperlink Annotations
In-page hyperlinks are stored as link annotations (`/Subtype /Link`) in each page's `/Annots` array. Each annotation has a `/Rect` defining its bounding box on the page and either a `/Dest` or `/A` entry for its target.
For external hyperlinks, the action is `/S /URI` with a `/URI` string. pdftract extracts the URL and determines the anchor text by finding the text content within the annotation's `/Rect` on that page — the text spans whose bounding boxes overlap the annotation rectangle constitute the visible link text. This spatial join requires that text extraction has already produced positioned text runs before annotation extraction runs; pdftract's pipeline should process annotations in a second pass after text geometry is established.
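The spatial join between an annotation rectangle and positioned text runs can be sketched in Python as follows. Rectangles are `(x0, y0, x1, y1)` tuples in PDF user space with the y axis pointing up; the run dictionaries with `text` and `bbox` keys and the name `anchor_text` are hypothetical. The `/Rect` corners are normalized first because they are not guaranteed to be stored in lower-left, upper-right order.

```python
def normalize_rect(rect):
    x0, y0, x1, y1 = rect
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def overlaps(a, b) -> bool:
    """True when two normalized rectangles share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def anchor_text(annotation_rect, runs: list) -> str:
    """Join the text of every run whose bounding box overlaps the annotation."""
    rect = normalize_rect(annotation_rect)
    hits = [run for run in runs if overlaps(rect, normalize_rect(run["bbox"]))]
    # Reading order: top of the page first (larger y), then left to right
    hits.sort(key=lambda run: (-normalize_rect(run["bbox"])[3], normalize_rect(run["bbox"])[0]))
    return " ".join(run["text"] for run in hits)
```

A stricter variant could require a minimum overlap ratio so that a run that merely touches the annotation edge is not pulled into the link text.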
For internal links, the action is `/S /GoTo` or the `/Dest` shorthand, resolved to a physical page index and page label using the same machinery as outline destinations. These are serialized as `{"type": "internal", "page_index": 12, "page_label": "5", "anchor_text": "see Figure 3"}` alongside `{"type": "external", "url": "https://example.com", "anchor_text": "specification"}` for URI links.
pdftract should expose per-page annotation arrays in its output, each entry containing `type`, `rect` (normalized to user-space coordinates), `anchor_text`, and the destination or URL. This allows consumers to reconstruct hyperlink graphs, validate internal cross-references, and render interactive overlays without re-parsing the PDF.
---
## Implementation Priorities
Page label extraction is relatively self-contained and should be implemented early since it enriches every other output field. The outline tree walk and destination resolver share infrastructure with the named destination resolver needed for link annotations, so these should be built together. The heading-inference cross-reference between outline titles and text blocks is the most heuristic component and belongs in a post-processing pass that can be toggled independently. Together, this navigation layer gives pdftract output that is immediately useful for document indexing, accessibility tooling, and structured content pipelines.