Add research: xref parsing, object model, font descriptors, PDF/UA-2
Four new extraction research documents covering cross-reference table and xref stream parsing with error recovery, PDF object model and lexer correctness (all 8 types, string escapes, stream /Length recovery), FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT), and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization, new structure types, artifact classification improvements).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 6c6ec6a4ca
commit 16cb1bd61d
4 changed files with 445 additions and 0 deletions
144
docs/research/font-descriptor-and-metrics.md
Normal file
@ -0,0 +1,144 @@
# Font Descriptors, Font Metrics, and Embedded Font Data


## Overview


PDF's font system is layered: a font dictionary, reached through a page's `/Resources` dictionary and selected in the content stream by the `Tf` operator, references a FontDescriptor, which in turn may reference an embedded font stream. For text extraction, the goal is not to render glyphs but to recover the correct Unicode string for each character code and to compute accurate bounding boxes. This document describes what pdftract must read from each layer of that structure, and why.


---


## 1. The FontDescriptor Dictionary


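Every simple font other than the 14 standard Type 1 fonts should have a FontDescriptor dictionary, referenced by the `/FontDescriptor` key in the font dictionary. Two cases differ: Type 3 fonts need a descriptor only in tagged PDF, and for composite (Type 0) fonts the descriptor belongs to the descendant CIDFont rather than to the Type 0 dictionary itself. The descriptor consolidates the typographic properties of a font in one place, independent of the specific glyph outlines.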


The key fields and their relevance to pdftract:


- **`/FontName`** (name): The PostScript name of the font, possibly prefixed with a six-character subset tag (e.g., `ABCDEF+Helvetica`). Used for font identification in output metadata and for triggering fallback metric lookups when the font is not embedded.
- **`/Flags`** (integer): A 32-bit bitfield encoding font classification. Discussed in detail in the next section. Critical for encoding selection logic.
- **`/FontBBox`** (rectangle: `[llx lly urx ury]`): The bounding box of the full glyph set in glyph space (1/1000 of a text space unit for most fonts). Provides a conservative bounding rectangle when per-glyph metrics are unavailable.
- **`/ItalicAngle`** (number): The angle of italic slant in degrees counterclockwise from vertical. Useful for flagging italic text in output annotations, not for extraction itself.
- **`/Ascent`** (number): The maximum height above the baseline reached by glyphs in the font, excluding accented characters, in glyph space units. Together with `/Descent`, defines the typographic line box. pdftract uses these to compute the vertical extent of text runs when constructing bounding boxes for extracted spans.
- **`/Descent`** (number): The maximum depth below the baseline, typically a negative number. Paired with `/Ascent` to compute line height.
- **`/CapHeight`** (number): The height of a flat capital letter (e.g., H) above the baseline. Used as a tighter ascent estimate for all-capital text and for normalizing font size comparisons across font families.
- **`/XHeight`** (number): The height of a lowercase x above the baseline. Useful for distinguishing small-cap or lowercase text in vertical clustering during paragraph assembly.
- **`/StemV`** (number): The dominant vertical stem thickness. Not needed for extraction; present for rendering hints.
- **`/StemH`** (number): The dominant horizontal stem thickness. Not needed for extraction.


For pdftract, the operationally important fields are `/Ascent`, `/Descent`, `/CapHeight`, and `/FontBBox` for bounding box computation, and `/FontName` and `/Flags` for identification and encoding recovery.


---


## 2. Font Flags Bitfield


The `/Flags` integer encodes font classification as a bitfield (bit 1 is the least significant bit). Each bit signals a typographic category:


| Bit | Name | Meaning |
|-----|------|---------|
| 1 | FixedPitch | Monospaced font; all advance widths are equal |
| 2 | Serif | Glyphs have serifs |
| 3 | Symbolic | Font uses its own encoding or glyph set not in the Adobe standard set |
| 4 | Script | Glyphs resemble cursive/handwritten forms |
| 6 | Nonsymbolic | Font uses the standard Latin character set |
| 7 | Italic | Glyphs are slanted |
| 17 | AllCap | Font contains only uppercase glyphs |
| 18 | SmallCap | Font uses small capitals for lowercase letters |
| 19 | ForceBold | Bold strokes should be synthesized if weight is bold |


The most consequential pair for extraction is **Symbolic (bit 3) vs. Nonsymbolic (bit 6)**. The specification requires exactly one of the two to be set, but malformed files sometimes set both or neither, so pdftract needs a defined tie-break for that case. Their value governs which encoding pdftract applies when no explicit `/Encoding` entry or `/ToUnicode` map is present in the font dictionary:


- **Nonsymbolic**: The font uses Standard Latin encoding. Character codes outside an explicit `/Encoding` array fall back to the Adobe Standard Latin set, from which Unicode values can be inferred by glyph name.
- **Symbolic**: The font has its own private encoding. Without a `/ToUnicode` CMap or an explicit `/Encoding` array, character codes must be interpreted through the font's built-in encoding, which may only be recoverable by parsing the embedded font program itself.


pdftract evaluates these bits after exhausting higher-priority sources (ToUnicode, then Encoding), using them to select the correct fallback decoding path.


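The flag tests described in this section can be sketched in Rust as follows. The constant and function names are illustrative and not taken from pdftract. The tie-break for malformed files (Symbolic wins whenever bit 3 is set, regardless of bit 6) is an assumption, not something the specification mandates.

```rust
/// Mask for a 1-based flag bit position, as numbered in the table above.
const fn flag_mask(bit: u32) -> u32 {
    1 << (bit - 1)
}

const FIXED_PITCH: u32 = flag_mask(1);
const SYMBOLIC: u32 = flag_mask(3);
const NONSYMBOLIC: u32 = flag_mask(6);
const ITALIC: u32 = flag_mask(7);

/// True when the symbolic fallback decoding path should be used.
/// Assumption: bit 3 decides on its own, so a file that sets both
/// bits is treated as symbolic and one that sets neither is not.
fn is_symbolic(flags: u32) -> bool {
    flags & SYMBOLIC != 0
}

/// True when the file violates the "exactly one of the two" rule.
fn has_inconsistent_symbolic_bits(flags: u32) -> bool {
    (flags & SYMBOLIC != 0) == (flags & NONSYMBOLIC != 0)
}

fn is_fixed_pitch(flags: u32) -> bool {
    flags & FIXED_PITCH != 0
}

fn is_italic(flags: u32) -> bool {
    flags & ITALIC != 0
}
```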


---


## 3. Embedded Font Streams


The FontDescriptor may reference one of three font stream keys, each covering a different font format:


- **`/FontFile`**: Contains a Type 1 (PostScript) font program. The stream is divided into a clear-text portion (font header and encoding vector), an eexec-encrypted portion (Private dict and charstrings), and a fixed trailer; the stream dictionary gives their byte lengths in `/Length1`, `/Length2`, and `/Length3`.
- **`/FontFile2`**: Contains a TrueType font program in the sfnt binary format.
- **`/FontFile3`**: Contains a bare CFF font, a CFF-based CID font, or a complete OpenType font. The `/Subtype` entry inside the stream dictionary identifies the exact format: `/Type1C` for CFF-wrapped Type 1, `/CIDFontType0C` for CFF-based CID fonts, and `/OpenType` for a full OpenType wrapper.


pdftract detects which key is present to determine the parsing path. The presence of a font stream does not change the ToUnicode priority, but it provides a fallback encoding source when `/ToUnicode` is absent and the `/Encoding` array is incomplete or missing.


---


## 4. Type 1 Font Programs


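A standalone PFB file wraps the font in length-prefixed segments: an ASCII segment containing the font header and `/Encoding` array, a binary segment containing the eexec-encrypted Private dict and charstrings, and a short ASCII trailer. A PFA file is entirely ASCII, with the encrypted portion hex-encoded. An embedded `/FontFile` stream carries the same three portions without the PFB segment headers, delimited by `/Length1`, `/Length2`, and `/Length3`.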


For text extraction, pdftract need not decode Type 1 charstrings (which would require executing a stack-based virtual machine to recover glyph outlines). The encoding vector in the clear-text portion, an array of 256 glyph names indexed by character code, is sufficient. Each glyph name can be mapped to a Unicode code point via the Adobe Glyph List. This encoding vector supplements the PDF-level `/Encoding` entry when the latter is incomplete; it never overrides it.


The `/Encoding` array in the PDF font dictionary takes precedence; the embedded font's own encoding is a secondary source and only consulted when the PDF-level encoding has gaps.


---


## 5. TrueType Font Programs


A TrueType font is an sfnt container: a binary file with a table directory followed by named binary tables. The tables relevant to pdftract are:


- **`cmap`**: Maps character codes to glyph IDs. Multiple subtables may be present; pdftract prefers the Platform 3 / Encoding 1 (Windows Unicode BMP) or Platform 0 (Unicode) subtables. For a TrueType font embedded in a non-CID context, the cmap supplements the PDF `/Encoding` entry, and a Unicode subtable can be inverted (glyph ID to code point) to provide a Unicode fallback path when glyph names are unavailable.
- **`hmtx`**: Horizontal metrics table. Contains `advanceWidth` and `leftSideBearing` for each glyph ID. pdftract uses these to cross-validate the PDF-level `/Widths` array. When the `/Widths` array is absent or malformed, `hmtx` provides the authoritative advance widths in font design units, which pdftract scales to 1/1000 em using `unitsPerEm` from the `head` table.
- **`OS/2`**: Contains `sTypoAscender` and `sTypoDescender` and, from table version 2 onward, `sCapHeight` and `sxHeight`: the typographic equivalents of the FontDescriptor fields. When the PDF FontDescriptor is absent or its ascent/descent values are zero (a known authoring bug in some PDF producers), pdftract falls back to these OS/2 values for bounding box computation. `usWeightClass` provides weight information for output metadata.


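As a worked example of the `hmtx` scaling, the sketch below converts an advance width from font design units to the 1/1000-em scale used by `/Widths` and compares it against the PDF-level value. The function names and the tolerance parameter are hypothetical; the document does not specify how pdftract performs the cross-check.

```rust
/// Convert an `hmtx` advance width (font design units) to PDF
/// glyph-space units (1/1000 em), the scale used by `/Widths`.
/// Returns None when `unitsPerEm` is zero (malformed `head` table).
fn advance_to_glyph_space(advance_width: u16, units_per_em: u16) -> Option<f64> {
    if units_per_em == 0 {
        return None;
    }
    Some(f64::from(advance_width) * 1000.0 / f64::from(units_per_em))
}

/// Hypothetical cross-check: does the `/Widths` entry agree with the
/// scaled `hmtx` value within `tolerance` glyph-space units?
fn widths_agree(pdf_width: f64, advance_width: u16, units_per_em: u16, tolerance: f64) -> bool {
    match advance_to_glyph_space(advance_width, units_per_em) {
        Some(scaled) => (scaled - pdf_width).abs() <= tolerance,
        None => false,
    }
}
```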


---


## 6. CFF (Compact Font Format / Type1C)


CFF encodes one or more Type 1 fonts in a compact binary format. It appears as the glyph data in both `/FontFile3` with `/Subtype /Type1C` and inside OpenType CFF fonts.


The CFF structure relevant to pdftract:


- **Top DICT**: Holds the offsets of the font's encoding, charset, and other data structures. For non-CID CFF fonts, the charset maps glyph indices to SIDs (string IDs), from which glyph names are recovered. Glyph names then map to Unicode via the Adobe Glyph List.
- **FDArray** (for CID fonts): CID-keyed CFF fonts organize glyphs into font dictionaries (FD), each with its own Private DICT. The charset maps glyph indices to CIDs rather than to glyph names, so no name-based Unicode inference is possible. For these fonts, the `/ToUnicode` CMap is the only reliable Unicode source; pdftract does not attempt to derive Unicode from raw CIDs without a ToUnicode map.
- **CharStrings index**: Contains the charstring (glyph program) for each glyph. pdftract does not execute charstrings; the entry count of the index gives the number of glyphs actually present in the (possibly subsetted) font program, which bounds the valid glyph indices.


---


## 7. OpenType Fonts


OpenType extends TrueType and uses the same sfnt container, carrying either TrueType outlines or CFF outlines. The two flavors are:


- **CFF-flavored OpenType** (`.otf`): Contains a `CFF ` table instead of `glyf`/`loca`. The sfnt structure, cmap, hmtx, and OS/2 tables are present as in TrueType. pdftract reads the `CFF ` table for glyph names when cmap lookups are insufficient.
- **TTF-flavored OpenType** (`.ttf`): Identical to TrueType from an extraction standpoint.


The **GSUB** (Glyph Substitution) table is relevant for ligature validation. GSUB lookup type 4 (Ligature Substitution) records which sequences of glyph IDs are contracted into a single ligature glyph. When a ToUnicode map assigns a multi-character string to a single character code, pdftract can cross-reference GSUB to confirm that the sequence is a known ligature, strengthening confidence in the ToUnicode mapping.


---


## 8. Multiple Master and Variable Fonts


Multiple Master fonts define a parameter space (axes such as weight and width) from which instances are interpolated. Variable fonts (OpenType 1.8+) use the `fvar`, `gvar`, and related tables. Both are rare in PDFs; when encountered, pdftract treats the font using the default or normalized instance. No axis interpolation is required — the character-to-Unicode mapping is invariant across instances, and metric values from the FontDescriptor or OS/2 table apply to the default instance without further computation.


---


## 9. Font Substitution Artifacts


When a font is not embedded and the PDF viewer substitutes a visually similar font at render time, the substitution affects only the rendered glyph shapes, not the character codes in the content stream. The ToUnicode CMap and the Encoding array remain tied to the original font's character codes. pdftract's extraction pipeline reads character codes from the content stream and resolves them through ToUnicode or Encoding, entirely bypassing the render-time substitution. Extracted text is therefore unaffected by font substitution. Bounding boxes are computed from the PDF's `/Widths` array, which describes the original unembedded font, so they follow the document's intended layout but may differ slightly from what a viewer draws when the substitute has different metrics.


---


## 10. Font Name Normalization


Many embedded fonts carry a subset prefix: a six-character uppercase tag followed by a `+` separator, as in `ABCDEF+TimesNewRomanPS-BoldMT`. pdftract strips this prefix using a regex match on `/^[A-Z]{6}\+/` before using the base name.


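A minimal sketch of the prefix stripping, written with the standard library only so that it does not depend on a regex crate. It is meant to be equivalent to the pattern `^[A-Z]{6}\+`; the function name is illustrative.

```rust
/// Strip a subset tag of the form `ABCDEF+` (six uppercase ASCII
/// letters followed by `+`) from the start of a font name.
/// Names without such a tag are returned unchanged.
fn strip_subset_tag(font_name: &str) -> &str {
    let bytes = font_name.as_bytes();
    let has_tag = bytes.len() >= 7
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase())
        && bytes[6] == b'+';
    if has_tag {
        // The first seven bytes are ASCII, so index 7 is a char boundary.
        &font_name[7..]
    } else {
        font_name
    }
}
```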


The base name is then normalized for font identification in output:


1. **PostScript name parsing**: Hyphens separate the family name from style modifiers (e.g., `Helvetica-BoldOblique` → family `Helvetica`, style `Bold Oblique`).
2. **Family equivalence mapping**: A lookup table maps common alias pairs to canonical families: `Arial` → `Helvetica`, `TimesNewRoman` → `Times`, `CourierNew` → `Courier`. This mapping is used for fallback metric lookup only; pdftract preserves the original `/FontName` in its output metadata.
3. **Standard font fallback**: If no font stream is present and the normalized base name matches one of the 14 standard Type 1 fonts, pdftract uses the corresponding built-in metric table (ascent, descent, cap height, per-glyph widths) rather than returning zero values.


The normalized font name is emitted in pdftract's per-span output as `font_family`, alongside `font_size`, `is_bold`, `is_italic`, and `is_monospace` (derived from the FixedPitch flag). This gives downstream consumers enough information to reconstruct basic typography without access to the PDF itself.


---


## Summary of pdftract Reading Priorities


| Purpose | Primary Source | Secondary Source | Tertiary Source |
|---------|---------------|-----------------|----------------|
| Unicode mapping | ToUnicode CMap | /Encoding + Glyph List | Embedded font encoding vector |
| Advance widths | /Widths array | hmtx (TrueType/OT) | FontDescriptor /FontBBox width |
| Ascent/Descent | FontDescriptor /Ascent, /Descent | OS/2 sTypoAscender/Descender | /FontBBox [lly, ury] |
| Font identification | /FontName (stripped) | /BaseFont | Embedded font name record |
| Encoding fallback | /Flags Symbolic/Nonsymbolic | Embedded font /Encoding | Standard Latin defaults |


120
docs/research/pdf-object-model-and-data-types.md
Normal file
@ -0,0 +1,120 @@
# PDF Object Model and Data Types


**Reference for pdftract's PDF lexer and object parser**


---


## Overview


The PDF specification defines eight basic object types that together form the basis of every PDF document: Boolean, numeric (Integer and Real), String, Name, Array, Dictionary, Stream, and Null. A correct PDF parser must handle every syntactic variant of each type, including edge cases that the specification acknowledges but that real-world generators produce in unexpected ways. This document provides a precise reference for implementing the pdftract lexer and object parser such that extraction never fails due to parsing ambiguity or malformed object encoding.


---


## 1. Boolean and Null


The Boolean and Null objects are keyword literals. `true` and `false` represent Boolean values; `null` represents the Null object. All three are case-sensitive: `True`, `FALSE`, and `Null` are not valid PDF keywords. The specification treats a dictionary entry whose value is `null` as equivalent to an absent entry, so consumers must apply the same specification-defined default in both cases. The parser may still record an explicit `null` for diagnostics, but dictionary lookups should report such a key as missing.


---


## 2. Integer and Real Numbers


Integer objects have no decimal point and may be preceded by an optional `+` or `-` sign. Real objects contain a decimal point, such as `3.14` or `-0.5`. The PDF specification states that integers fit within 32-bit signed range and reals within 32-bit float precision; however, documents produced by non-conforming generators routinely exceed these bounds. pdftract should store integers as `i64` and reals as `f64` to avoid truncation or overflow silently corrupting extracted data.


Integer/real disambiguation during lexing is straightforward: the presence of a `.` character in the token makes it a real. The digits on either side of the point may be omitted, so `4.` and `-.002` are valid reals. Some generators also emit real values using scientific notation (`1.5e3`), which the specification does not sanction but which appears in files produced by certain software. A robust lexer must recognize this form and parse it correctly.


Negative zero (`-0`) is syntactically valid. As an integer it should parse to zero. As a real it may produce negative zero in IEEE 754, which is semantically equivalent to positive zero for PDF purposes but must not cause a parse error.


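The number rules in this section can be sketched as a small classifier over an already-delimited token. The `Number` enum and `parse_number` name are illustrative. The sketch relies on Rust's standard `i64` and `f64` parsers, which accept a leading sign and forms such as `4.` and `-.002`; it does not handle integers that overflow `i64`, which a production lexer might choose to reparse as reals.

```rust
#[derive(Debug, PartialEq)]
enum Number {
    Integer(i64),
    Real(f64),
}

/// Classify and parse one numeric token. A `.` or an exponent marker
/// makes the token a real; anything else must parse as an integer.
fn parse_number(token: &str) -> Option<Number> {
    let is_real = token.contains(|c: char| matches!(c, '.' | 'e' | 'E'));
    if is_real {
        // Accepts the non-standard exponent form as a leniency.
        token.parse::<f64>().ok().map(Number::Real)
    } else {
        token.parse::<i64>().ok().map(Number::Integer)
    }
}
```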


---


## 3. String Objects


PDF strings come in two syntactic forms: literal strings and hexadecimal strings.


**Literal strings** are enclosed in parentheses. The content may include any byte except unescaped unbalanced parentheses. Parentheses may be nested without escaping, provided they remain balanced — `(foo (bar) baz)` is a single valid string containing the inner parentheses. Backslash escape sequences allow the inclusion of special characters:


- `\n`: LINE FEED (0x0A)
- `\r`: CARRIAGE RETURN (0x0D)
- `\t`: HORIZONTAL TAB (0x09)
- `\b`: BACKSPACE (0x08)
- `\f`: FORM FEED (0x0C)
- `\\`: backslash
- `\(` and `\)`: literal parentheses, bypassing balance counting
- `\ddd`: one to three octal digits representing a byte value (e.g., `\101` = `A`)
- `\` followed by a newline: line continuation; the backslash and newline are both ignored, allowing long strings to span source lines


A backslash followed by any character not in the above list should be treated as that character alone (the backslash is discarded). Parsers must handle the case where a `\ddd` octal sequence has fewer than three digits before the next non-octal character.


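A minimal sketch of literal-string decoding under the rules above, including balanced parentheses, short octal escapes, and line continuation. The function name and return shape are assumptions for illustration: `input` starts just after the opening `(`, and the result carries the decoded bytes plus the number of input bytes consumed through the closing `)`. Normalization of unescaped end-of-line sequences inside the string is not handled here.

```rust
/// Decode a literal string body. Returns None if the string is
/// unterminated.
fn decode_literal_string(input: &[u8]) -> Option<(Vec<u8>, usize)> {
    let mut out = Vec::new();
    let mut depth = 1usize;
    let mut i = 0;
    while i < input.len() {
        let b = input[i];
        i += 1;
        match b {
            b'(' => {
                depth += 1;
                out.push(b);
            }
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return Some((out, i));
                }
                out.push(b);
            }
            b'\\' => {
                let e = *input.get(i)?;
                i += 1;
                match e {
                    b'n' => out.push(b'\n'),
                    b'r' => out.push(b'\r'),
                    b't' => out.push(b'\t'),
                    b'b' => out.push(0x08),
                    b'f' => out.push(0x0C),
                    b'0'..=b'7' => {
                        // One to three octal digits; stop at the first non-octal byte.
                        let mut value = u32::from(e - b'0');
                        let mut digits = 1;
                        while digits < 3 {
                            match input.get(i) {
                                Some(&d) if matches!(d, b'0'..=b'7') => {
                                    value = value * 8 + u32::from(d - b'0');
                                    i += 1;
                                    digits += 1;
                                }
                                _ => break,
                            }
                        }
                        out.push(value as u8); // high-order overflow is discarded
                    }
                    // Line continuation: drop the backslash and the EOL.
                    b'\r' => {
                        if input.get(i) == Some(&b'\n') {
                            i += 1;
                        }
                    }
                    b'\n' => {}
                    // Covers `\\`, `\(`, `\)`, and unknown escapes alike.
                    other => out.push(other),
                }
            }
            _ => out.push(b),
        }
    }
    None
}
```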


**Hexadecimal strings** are enclosed in angle brackets: `<4F6E65>`. Each pair of hex digits represents one byte. Whitespace between digit pairs is permitted and must be ignored. If the total number of hex digits is odd, the final digit is treated as if followed by a zero — `<4F6>` decodes as bytes `0x4F` and `0x60`. Case is not significant for hex digits.


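The hexadecimal rules can be sketched as follows. The function name is illustrative, and `body` is assumed to be the bytes between `<` and `>`. This version is strict and returns `None` on an invalid character; a lenient parser could instead skip the character and record a warning.

```rust
/// Decode the body of a hexadecimal string. Whitespace is ignored and
/// an odd final digit is treated as if followed by `0`.
fn decode_hex_string(body: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(body.len() / 2 + 1);
    let mut pending: Option<u8> = None;
    for &b in body {
        if matches!(b, 0x00 | b'\t' | b'\n' | 0x0C | b'\r' | b' ') {
            continue;
        }
        let nibble = (b as char).to_digit(16)? as u8;
        match pending.take() {
            Some(high) => out.push((high << 4) | nibble),
            None => pending = Some(nibble),
        }
    }
    if let Some(high) = pending {
        out.push(high << 4);
    }
    Some(out)
}
```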


**UTF-16 text strings** are used where text cannot be represented in PDFDocEncoding, for example in document information entries and outline titles. They begin with the big-endian byte-order mark `FE FF`. When the parser encounters this BOM at the start of a text string (literal or hex-decoded), it must decode the remainder as UTF-16BE rather than as PDFDocEncoding. The check applies only to text strings; strings shown by text operators in a content stream are character codes interpreted through the current font. A robust implementation converts text strings to Rust's native UTF-8 during parsing, replacing unpaired surrogates with U+FFFD rather than panicking.


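A sketch of the BOM check and UTF-16BE decoding using only the standard library. The function name is illustrative. The non-BOM branch is a placeholder that maps bytes as Latin-1; real PDFDocEncoding differs from Latin-1 in several ranges and would need its own table.

```rust
/// Decode a PDF text string to a Rust `String`. Unpaired surrogates
/// and a trailing odd byte become U+FFFD.
fn decode_text_string(bytes: &[u8]) -> String {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        let units = bytes[2..].chunks(2).map(|pair| match pair {
            [hi, lo] => u16::from_be_bytes([*hi, *lo]),
            _ => 0xFFFD, // odd trailing byte
        });
        std::char::decode_utf16(units)
            .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
            .collect()
    } else {
        // Placeholder only: Latin-1 mapping, not true PDFDocEncoding.
        bytes.iter().map(|&b| b as char).collect()
    }
}
```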


---


## 4. Name Objects


Name objects begin with a forward slash and are followed by regular characters. The slash itself is not part of the name's value. Names are case-sensitive: `/Type` and `/type` are different names.


Within a name, the sequence `#xx`, where `xx` is exactly two hexadecimal digits, is a hex escape representing the byte with that value. This allows names to contain bytes that would otherwise be delimiter or whitespace characters. Hex escapes are case-insensitive. The specification excludes the null byte from names, but `#00` does occur in malformed files and must not cause the parser to truncate the name at that point or to fail. In Rust, names are best stored as `Vec<u8>` or a newtype wrapper rather than `String`: a decoded name is an arbitrary byte sequence and is not guaranteed to be valid UTF-8. Callers needing a string representation can use a lossy conversion.


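A minimal sketch of name decoding into a byte vector. The function names are illustrative, and `raw` is assumed to be the bytes following the leading `/`. Keeping a malformed `#` (one not followed by two hex digits) as a literal byte is an assumption about lenient behavior, not a rule from the specification.

```rust
/// Value of an optional hex digit byte, or None if absent or invalid.
fn hex_val(b: Option<&u8>) -> Option<u8> {
    (*b? as char).to_digit(16).map(|d| d as u8)
}

/// Decode `#xx` escapes in a name. The result is raw bytes and may
/// contain 0x00 or invalid UTF-8.
fn decode_name(raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(raw.len());
    let mut i = 0;
    while i < raw.len() {
        if raw[i] == b'#' {
            if let (Some(h), Some(l)) = (hex_val(raw.get(i + 1)), hex_val(raw.get(i + 2))) {
                out.push((h << 4) | l);
                i += 3;
                continue;
            }
        }
        out.push(raw[i]);
        i += 1;
    }
    out
}
```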


The PDF specification imposes no hard limit on name length, but the recommended maximum for interoperability is 127 bytes. In practice, generators may produce longer names; the parser should not truncate them.


---


## 5. Array and Dictionary Objects


Arrays are delimited by `[` and `]`. Elements are any object types, separated by whitespace. Arrays may be nested to arbitrary depth. Array elements may include indirect references (described below), allowing deferred resolution.


Dictionaries are delimited by `<<` and `>>`. Each entry consists of a Name key followed by a value of any object type. Keys must be Name objects; values may be any object, including nested dictionaries, arrays, or indirect references. A dictionary containing no entries — whitespace only between `<<` and `>>` — is a valid empty dictionary and must not be treated as a parse error. Dictionary keys should be deduplicated in the parser's output; the PDF specification states that duplicate keys produce undefined behavior, but in practice the last occurrence typically wins.


Both structures may be deeply nested. The parser must not use a fixed-depth recursion limit that would cause it to reject valid (if pathological) documents. Iterative parsing using an explicit stack is preferable to naive recursion for production use.


---


## 6. Stream Objects


A stream object always consists of a dictionary followed by the keyword `stream`, the stream data, and the keyword `endstream`. The line ending after `stream` must be either a single LINE FEED or a CARRIAGE RETURN followed by a LINE FEED — a standalone CR is not permitted, though lenient parsers may accept it. The stream data is binary and extends for exactly as many bytes as specified by the `/Length` entry in the preceding dictionary.


The `/Length` key is authoritative for determining the extent of the stream body. After reading `/Length` bytes, the parser skips optional whitespace and expects the `endstream` keyword. Common `/Length` errors in real-world PDFs include:


- **Off-by-one** errors where the declared length is one byte too long or too short relative to `endstream`
- **CRLF miscounts** where the generator counted a newline as one byte but stored two
- **Zero-length streams** which are valid and must not cause errors
- **Missing /Length** which requires scanning forward for `endstream` as a recovery strategy


A correct recovery implementation scans forward from the `stream` keyword to find `endstream`, uses the actual byte count as the effective length, and emits a recoverable parse warning. This ensures that a /Length error does not cause extraction failure for the remainder of the document.


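The recovery strategy can be sketched as below. The function names and the `(body, recovered)` return shape are assumptions for illustration; `data` is taken to begin at the first byte of the stream body. Trimming one end-of-line before `endstream` in the recovery path is a heuristic that can remove a genuine data byte, and binary data that happens to contain the bytes `endstream` will end the scan early.

```rust
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

fn find_subslice(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Return the stream body and whether recovery was needed.
/// None means no `endstream` keyword was found at all.
fn stream_body(data: &[u8], declared_len: Option<usize>) -> Option<(&[u8], bool)> {
    const END: &[u8] = b"endstream";
    // Fast path: trust /Length if `endstream` follows where expected.
    if let Some(len) = declared_len {
        if let Some(rest) = data.get(len..) {
            let skip = rest.iter().take_while(|b| is_pdf_whitespace(**b)).count();
            if rest[skip..].starts_with(END) {
                return Some((&data[..len], false));
            }
        }
    }
    // Recovery path: scan forward for the keyword.
    let end = find_subslice(data, END)?;
    let mut body = &data[..end];
    if body.ends_with(b"\r\n") {
        body = &body[..body.len() - 2];
    } else if body.ends_with(b"\n") || body.ends_with(b"\r") {
        body = &body[..body.len() - 1];
    }
    Some((body, true))
}
```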


---


## 7. Indirect References


An indirect reference takes the form `N G R`, where `N` is the object number, `G` is the generation number, and `R` is the literal keyword. For example, `12 0 R` refers to object 12 at generation 0. Indirect references may appear wherever any object may appear: as dictionary values, array elements, or standalone objects. The object number and generation number are non-negative integers.


Resolution of an indirect reference requires a lookup in the cross-reference table (xref). The xref maps object number and generation to a byte offset within the file where the corresponding `N G obj` ... `endobj` block begins. Circular references are technically invalid but must not cause the parser to loop indefinitely; a resolution depth limit or a visited-objects set provides a safe guard.


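The visited-objects guard can be sketched as follows. The `Object` enum is a deliberately tiny stand-in, and the `HashMap` stands in for the xref lookup plus object parsing; both are hypothetical. Cycles and references to undefined objects both resolve to the null object here.

```rust
use std::collections::{HashMap, HashSet};

/// (object number, generation number)
type ObjId = (u32, u16);

#[derive(Debug, Clone, PartialEq)]
enum Object {
    Integer(i64),
    Reference(ObjId),
    Null,
}

/// Follow a chain of indirect references until a direct object is
/// reached, stopping if an object id is seen twice.
fn resolve(start: &Object, table: &HashMap<ObjId, Object>) -> Object {
    let mut visited = HashSet::new();
    let mut current = start;
    while let Object::Reference(id) = current {
        if !visited.insert(*id) {
            return Object::Null; // circular reference
        }
        match table.get(id) {
            Some(next) => current = next,
            None => return Object::Null, // undefined object
        }
    }
    current.clone()
}
```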


---


## 8. Comments


A `%` character outside of a string or stream begins a comment that extends to the end of the line. Comments may appear between any tokens and are treated as whitespace by the parser.


The conventional PDF header `%PDF-1.x` is followed on the next line by a comment containing four bytes with values above 127, such as `%âãÏÓ`. This high-bit-byte convention signals to FTP clients that the file is binary and should not be transferred in text mode. The lexer should handle this comment correctly rather than treating the high bytes as malformed input.


---


## 9. Parser Correctness: Edge Cases


Several lexical edge cases deserve specific attention during implementation:


**Whitespace** in PDF includes space (0x20), horizontal tab (0x09), carriage return (0x0D), line feed (0x0A), form feed (0x0C), and null (0x00). The null byte as whitespace is rare but must not terminate tokenization prematurely.


**Delimiter characters** (`(`, `)`, `<`, `>`, `[`, `]`, `{`, `}`, `/`, `%`) terminate a token without consuming the delimiter itself. The `<<` and `>>` tokens must be distinguished from a single `<` or `>`: a single `<` opens a hexadecimal string and a single `>` closes it, whereas the doubled forms delimit a dictionary. The braces `{` and `}` are meaningful only inside PostScript calculator functions.


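The whitespace and delimiter classes translate directly into byte predicates. The sketch below uses illustrative names and also shows one way to tell `<<` from a single `<` by looking one byte ahead.

```rust
fn is_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

fn is_delimiter(b: u8) -> bool {
    matches!(
        b,
        b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// Regular characters are everything that is neither whitespace nor a delimiter.
fn is_regular(b: u8) -> bool {
    !is_whitespace(b) && !is_delimiter(b)
}

/// Length of the run of regular characters at the start of `input`;
/// the terminating delimiter or whitespace is not consumed.
fn regular_run_len(input: &[u8]) -> usize {
    input.iter().take_while(|&&b| is_regular(b)).count()
}

#[derive(Debug, PartialEq)]
enum AngleOpen {
    Dictionary,
    HexString,
}

/// Distinguish `<<` (dictionary) from a single `<` (hex string).
fn classify_open_angle(input: &[u8]) -> Option<AngleOpen> {
    match input {
        [b'<', b'<', ..] => Some(AngleOpen::Dictionary),
        [b'<', ..] => Some(AngleOpen::HexString),
        _ => None,
    }
}
```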


**Very long strings or names** must not overflow fixed buffers. The parser should accumulate token content into a growable structure (e.g., Rust's `Vec<u8>`) without imposing an arbitrary size ceiling.


**Malformed hex strings** with an odd digit count are recoverable by appending a trailing `0` before decoding. A hex string containing non-hex, non-whitespace characters inside the `<>` delimiters is malformed; the parser should emit a warning and skip the invalid character, treating the remaining digits as the string's content.


**Empty dictionaries** (`<< >>`) are valid. Parsers that expect at least one key-value pair before `>>` will incorrectly reject them.


**Negative zero** (`-0`) is syntactically valid for both integer and real tokens. Integer negative zero should produce `0i64`. Real negative zero should produce `-0.0f64` without error.


---


*This document is a reference for pdftract's lexer and object parser implementation. Correctness at the syntax level — particularly for edge cases involving escape sequences, /Length discrepancies, null bytes in names, and deeply nested structures — is foundational to reliable text extraction across the full range of real-world PDF files.*


61
docs/research/pdfua2-and-accessibility-standards.md
Normal file
@ -0,0 +1,61 @@
# PDF/UA-2, WCAG Alignment, and Next-Generation Accessibility Standards


## Overview


PDF/UA-2 (ISO 14289-2) represents a significant architectural departure from its predecessor, anchored to PDF 2.0 (ISO 32000-2) rather than PDF 1.7. For a text extraction library like pdftract, this matters because the structural foundations that govern how content is tagged, ordered, and described have been systematically improved. A pdftract extraction pipeline that already handles PDF/UA-1 correctly is well-positioned to support PDF/UA-2 — the incremental work centers on namespace resolution, MathML extraction, Unicode normalization, and a more precise handling of artifact classification.


---


## PDF/UA-2: Key Changes from PDF/UA-1


The most consequential structural change in PDF/UA-2 is the adoption of the PDF 2.0 namespace mechanism for structure element tag names. In PDF/UA-1, structure types like `P`, `H1`, `Table`, and `Figure` were drawn from a flat global namespace defined by the PDF specification. In PDF 2.0, a structure element binds its tag name to a namespace through its `/NS` entry, which references a namespace dictionary holding the namespace URI; an element without `/NS` belongs to the default standard structure namespace inherited from PDF 1.7. The standard namespace for PDF 2.0 structure types is `http://iso.org/pdf2/ssn`. Processors that assume a flat namespace will misidentify or drop elements in conforming PDF/UA-2 documents, so pdftract must resolve the `/NS` entry on each structure element and apply namespace-aware tag matching rather than bare string comparison.


Artifact classification in PDF/UA-2 is substantially more granular. PDF 1.7 already defined the artifact types `Pagination`, `Layout`, `Page`, and `Background`, but the classification criteria were loosely specified. PDF 2.0 and PDF/UA-2 tighten these definitions; `Background` marks purely decorative content, which conveys no information and should be excluded from any logical reading order. The `/BBox` entry in an artifact's property list gives the artifact's bounding box in default user space, which enables pdftract to spatially exclude artifactual content during extraction without relying solely on the type label. The `/Attached` entry is an array of names (`Top`, `Bottom`, `Left`, `Right`) identifying the page edges to which a pagination artifact such as a header or footer is attached, providing layout semantics that pdftract can use when reconstructing reading order. For extraction purposes, pdftract should filter Background artifacts entirely from text output and should handle Page and Pagination artifacts as configurable: either excluded by default or surfaced in a separate metadata channel.


Unicode normalization requirements are made explicit in PDF/UA-2: all text content must be in NFC (Canonical Decomposition followed by Canonical Composition). This is a hard requirement, not a recommendation. In practice, many legacy PDFs — particularly those produced by Arabic or Hebrew typesetting systems — emit text in NFD or NFKD form, where combining characters appear as separate code points following their base character. pdftract must apply NFC normalization to all extracted text strings as a post-processing step regardless of the document's claimed conformance level, since the consistency guarantee matters for downstream consumers even when the source PDF predates UA-2.


Language tagging requirements are also tightened in PDF/UA-2. The `/Lang` entry must be a valid BCP 47 language tag, and inheritance rules are more strictly defined: a structure element without a `/Lang` entry inherits the language of its nearest ancestor that carries one, ultimately falling back to the document-level `/Lang` in the document catalog. pdftract should validate inherited language tags when processing PDF/UA-2 documents and surface the resolved language for each extracted content run, rather than only the document-level default. Invalid or absent language tags should be flagged in the extraction metadata, since they constitute an accessibility violation that affects how downstream TTS engines and screen readers interpret the content.


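The inheritance rule can be sketched over a flat arena of structure elements with parent indices. The `StructElem` layout and function name are hypothetical, chosen only to make the walk explicit. BCP 47 validation of the resolved tag is out of scope for this sketch.

```rust
/// Minimal stand-in for a structure element: an optional parent index
/// into the same slice and an optional `/Lang` value.
struct StructElem {
    parent: Option<usize>,
    lang: Option<String>,
}

/// Resolve the effective language of element `idx`: its own `/Lang`,
/// else the nearest ancestor's, else the catalog-level `/Lang`.
fn resolved_lang<'a>(
    elems: &'a [StructElem],
    mut idx: usize,
    catalog_lang: Option<&'a str>,
) -> Option<&'a str> {
    // The loop is bounded so that a malformed parent cycle terminates.
    for _ in 0..=elems.len() {
        let elem = match elems.get(idx) {
            Some(e) => e,
            None => return catalog_lang,
        };
        if let Some(lang) = elem.lang.as_deref() {
            return Some(lang);
        }
        match elem.parent {
            Some(p) => idx = p,
            None => return catalog_lang,
        }
    }
    catalog_lang
}
```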


---


## PDF 2.0 Structure Improvements


PDF 2.0 removed a number of structure types that had become ambiguous or were poorly supported in practice. Deprecated types from PDF 1.7, including `BlockQuote`, `Caption` used outside its defined context, and several others, are no longer valid in the standard structure namespace. In their place, PDF 2.0 introduced several new types: `DocumentFragment` for embedded sub-documents, `Aside` for supplementary or tangentially related content, `Title` as a dedicated type distinct from heading levels, `FENote` for footnotes and endnotes, and `Sub` for sub-divisions of a block-level element, such as the individual lines of a paragraph. These additions give pdftract more precise semantic signals during extraction. An `Aside` element, for instance, should be extractable but may warrant a different confidence weight in reading-order heuristics, since asides are by definition non-linear content. `FENote` provides a clean hook for extracting footnote content with its source anchor, rather than having to infer footnote structure from spatial positioning.
|
||||
|
||||
pdftract's handling of PDF 2.0 structure types should begin with a namespace-aware type resolution step. When the `/NS` dictionary on a structure element references a known namespace URI, the tag name is interpreted in that namespace's vocabulary. If the namespace is unrecognized, pdftract should treat the element as an application-defined extension and fall back to extracting its text content without semantic classification. This ensures forward compatibility: future namespaces will not cause extraction failures, only a loss of type-specific enrichment.
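A minimal sketch of that resolution step follows. The enum and function names are hypothetical. The two `iso.org` URIs are the standard structure namespace identifiers as the editor recalls them from ISO 32000-2 and should be checked against the specification. Treating an absent namespace as the default standard namespace is likewise an assumption of this sketch.

```rust
// Namespace URIs assumed by this sketch (verify against ISO 32000-2).
const NS_PDF1: &str = "http://iso.org/pdf/ssn";
const NS_PDF2: &str = "http://iso.org/pdf2/ssn";
const NS_MATHML: &str = "http://www.w3.org/1998/Math/MathML";

/// Outcome of resolving a structure tag against its namespace.
#[derive(Debug, PartialEq)]
enum ResolvedType<'a> {
    /// Tag from one of the standard structure namespaces.
    Standard(&'a str),
    /// Tag belongs to a MathML expression subtree.
    MathMl(&'a str),
    /// Unknown namespace: extract text, skip semantic classification.
    Unclassified(&'a str),
}

fn resolve_type<'a>(namespace_uri: Option<&str>, tag: &'a str) -> ResolvedType<'a> {
    match namespace_uri {
        // No namespace entry: assume the default standard namespace applies.
        None | Some(NS_PDF1) | Some(NS_PDF2) => ResolvedType::Standard(tag),
        Some(NS_MATHML) => ResolvedType::MathMl(tag),
        // Forward compatibility: never fail on an unrecognized namespace.
        Some(_) => ResolvedType::Unclassified(tag),
    }
}
```

Returning a dedicated `Unclassified` variant, rather than an error, keeps the "loss of enrichment, not extraction failure" behavior explicit in the type.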
|
||||
|
||||
---
|
||||
|
||||
## MathML in PDF 2.0
|
||||
|
||||
PDF 2.0 introduces first-class support for mathematical content via the MathML namespace (`http://www.w3.org/1998/Math/MathML`). When a structure element's `/NS` entry references the MathML namespace, the element subtree represents a MathML expression rather than a PDF structure type. The glyph content rendered to the page is still present in the content stream — PDF must remain renderable without MathML support — but the MathML subtree carries the full semantic meaning of the expression: operator precedence, variable binding, and mathematical relationships that are entirely absent from the rendered glyph sequence.
|
||||
|
||||
For pdftract, the extraction strategy for mathematical content should prefer MathML when present. The MathML subtree can be serialized as a self-contained MathML fragment and included in the extraction output as a dedicated content block, with the associated page glyphs available as a fallback representation. Attempting to reconstruct mathematical meaning from glyph sequences alone is fragile: ligatures, spacing glyphs, and operator symbols used in typeset mathematics do not map reliably to semantic mathematical intent. MathML extraction sidesteps this problem entirely by reading the semantic annotation that the PDF author has already encoded. pdftract's extraction pipeline should identify structure elements carrying a MathML namespace, serialize the full MathML subtree, and emit it as a typed content block alongside positional metadata.
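As a rough illustration of the serialization step, the sketch below turns an already-resolved MathML subtree into a self-contained fragment. `MathNode` is a hypothetical in-memory form invented for this example. It ignores attributes and emits an element's text before its children, which a real serializer would need to refine.

```rust
/// Hypothetical in-memory form of a MathML structure subtree.
struct MathNode {
    tag: String,             // element name, e.g. "mi" or "mfrac"
    text: String,            // character data directly inside the element
    children: Vec<MathNode>, // nested elements, in structure order
}

/// Escape the characters that are significant in XML character data.
fn escape(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Append the XML form of `node` to `out`. The root element receives the
/// MathML namespace declaration so the fragment stands on its own.
fn serialize(node: &MathNode, is_root: bool, out: &mut String) {
    out.push('<');
    out.push_str(&node.tag);
    if is_root {
        out.push_str(" xmlns=\"http://www.w3.org/1998/Math/MathML\"");
    }
    out.push('>');
    out.push_str(&escape(&node.text));
    for child in &node.children {
        serialize(child, false, out);
    }
    out.push_str("</");
    out.push_str(&node.tag);
    out.push('>');
}
```

The resulting string would be emitted as the typed content block, with the page glyphs kept alongside it as the fallback representation.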
|
||||
|
||||
---
|
||||
|
||||
## WCAG 2.1 and PDF Techniques
|
||||
|
||||
The PDF-specific techniques in WCAG 2.1, PDF1 through PDF23, map directly onto features that PDF/UA-2 either requires or formalizes. PDF1 (applying text alternatives to images) corresponds to the `/Alt` attribute on `Figure` elements; PDF2 (bookmark navigation) corresponds to the document outline; PDF10 and PDF12 address form field accessibility; PDF9 covers marking headings with heading tags. PDF/UA-2 does not merely align with these techniques: for conforming documents, it mandates the underlying structural features that make those techniques achievable.
|
||||
|
||||
pdftract's confidence scoring system can surface WCAG-relevant signals as part of its extraction output. Structure elements carrying `/Alt` text, correctly ordered heading hierarchies, explicit language tags, and proper artifact classification all contribute to an accessible document. When these signals are present and well-formed, pdftract can report high confidence in the semantic accuracy of extracted content. When they are absent or malformed — a `Figure` without `/Alt`, a heading sequence that skips levels, a document with no language tag — pdftract can report reduced confidence and flag specific accessibility gaps. This is not a full WCAG audit, but it gives downstream consumers actionable metadata about the reliability of the extraction and the accessibility posture of the source document.
|
||||
|
||||
---
|
||||
|
||||
## Associated Files in PDF 2.0
|
||||
|
||||
PDF 2.0 extends the `/AF` (Associated Files) key beyond page and XObject dictionaries to structure elements themselves. An `AF` array on a structure element can reference embedded files that are semantically associated with that element's content — for example, a source spreadsheet linked to a `Table` structure element, or a data file associated with a `Figure`. pdftract should traverse the `/AF` arrays on structure elements during extraction and surface associated file metadata — including the file relationship type specified in the `/AFRelationship` key — as part of the element's extracted output. The actual file content can be optionally extracted and written to a sidecar path or included as base64 in structured output formats. This is particularly valuable for data-rich documents where the associated files contain the machine-readable source underlying rendered content.
|
||||
|
||||
---
|
||||
|
||||
## Phoneme Metadata
|
||||
|
||||
PDF/UA-2 allows `/Phoneme` attributes on structure elements to provide pronunciation hints for text-to-speech engines. These attributes carry phonemic transcriptions in a format specified by the document's `/PhoneticAlphabet` entry. pdftract can surface phoneme attributes as supplementary metadata on extracted content spans without requiring any TTS capability itself. Downstream consumers that feed extracted text into speech synthesis pipelines benefit from having these hints available in the extraction output, since they encode the document author's explicit pronunciation intent for ambiguous terms, abbreviations, and proper nouns.
|
||||
|
||||
---
|
||||
|
||||
## Backwards Compatibility and the pdftract Upgrade Path
|
||||
|
||||
A pdftract pipeline that correctly handles PDF/UA-1 already covers the structural fundamentals: logical structure tree traversal, reading order reconstruction from structure order rather than content stream order, artifact filtering, and `/Alt` text extraction for non-text content. What PDF/UA-2 adds is a defined set of extensions to that foundation.
|
||||
|
||||
The concrete additions required are: (1) namespace-aware structure type resolution using the `/NS` dictionary, replacing bare string tag matching; (2) MathML subtree serialization when the MathML namespace is detected on a structure element; (3) NFC normalization applied to all extracted text, regardless of document conformance level; (4) BCP 47 validation and inheritance resolution for `/Lang` entries; (5) Background artifact filtering using the formally defined subtype; (6) `/BBox` and `/Attached` consumption on artifact property lists for spatial exclusion; (7) associated file extraction via `/AF` arrays on structure elements; and (8) phoneme attribute surfacing as extraction metadata. None of these changes conflicts with the PDF/UA-1 handling path; they are purely additive. A pdftract binary that implements all eight extensions will correctly extract content from PDF/UA-1, PDF/UA-2, and non-conforming PDF 2.0 documents, degrading gracefully where conformance features are absent.
|
||||
120
docs/research/xref-table-parsing-and-object-lookup.md
Normal file
|
|
@ -0,0 +1,120 @@
|
|||
# Cross-Reference Table Parsing and Indirect Object Resolution
|
||||
|
||||
## Overview
|
||||
|
||||
Every PDF file is a collection of numbered, versioned objects — dictionaries, streams, arrays, and scalars — that are connected by indirect references of the form `N G R`, where N is the object number and G is the generation number. Reliably resolving those references to byte positions in the file is the foundation of any PDF parser. The cross-reference table (or its stream equivalent) is the mechanism that provides this map. pdftract must handle every variant of this mechanism, including corrupted files where no valid table exists at all.
|
||||
|
||||
---
|
||||
|
||||
## 1. Traditional Cross-Reference Tables
|
||||
|
||||
A traditional xref table begins with the keyword `xref` on its own line, followed by one or more subsections. Each subsection opens with a pair of integers on a single line — the first object number in the subsection and the count of consecutive entries — then a sequence of exactly that many 20-byte fixed-width entries.
|
||||
|
||||
Each entry has the form:
|
||||
|
||||
```
nnnnnnnnnn ggggg n eol
```
|
||||
|
||||
The ten-digit field is the byte offset of the object from the beginning of the file (for in-use entries) or the object number of the next free object in the free list (for free entries). The five-digit field is the generation number. The single-character flag is either `n` (in-use) or `f` (free). The two-byte end-of-line sequence is one of space + LF, space + CR, or CR + LF. All three keep the entry at exactly 20 bytes, and pdftract must accept any of them.
|
||||
|
||||
Object 0 is always the head of the free list and carries generation 65535. Its offset field holds the object number of the next free object, or 0 if the list is empty.
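A minimal sketch of parsing one such 20-byte entry is shown below. The `XrefEntry` type and function names are illustrative, not pdftract's actual API, and the terminator check is deliberately lenient (any two bytes drawn from space, CR, and LF).

```rust
/// One decoded entry of a traditional xref table (illustrative type).
#[derive(Debug, PartialEq)]
enum XrefEntry {
    InUse { offset: u64, generation: u16 },
    Free { next_free: u64, generation: u16 },
}

/// Parse an all-digit ASCII field; rejects signs, spaces, and empty input.
fn parse_field<T: std::str::FromStr>(bytes: &[u8]) -> Option<T> {
    if bytes.is_empty() || !bytes.iter().all(u8::is_ascii_digit) {
        return None;
    }
    std::str::from_utf8(bytes).ok()?.parse().ok()
}

/// Layout: bytes 0..10 offset, 10 space, 11..16 generation, 16 space,
/// 17 flag, 18..20 end-of-line.
fn parse_xref_entry(entry: &[u8]) -> Option<XrefEntry> {
    if entry.len() != 20 || entry[10] != b' ' || entry[16] != b' ' {
        return None;
    }
    // Lenient terminator check: any mix of space, CR, and LF.
    if !entry[18..20].iter().all(|&b| matches!(b, b' ' | b'\r' | b'\n')) {
        return None;
    }
    let first: u64 = parse_field(&entry[0..10])?;
    let generation: u16 = parse_field(&entry[11..16])?;
    match entry[17] {
        b'n' => Some(XrefEntry::InUse { offset: first, generation }),
        b'f' => Some(XrefEntry::Free { next_free: first, generation }),
        _ => None,
    }
}
```

Returning `None` rather than an error type keeps the sketch short. A production parser would more likely report which field failed, so that the fallback logic can decide whether the table is salvageable.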
|
||||
|
||||
The xref table is followed immediately by a trailer dictionary, introduced by the `trailer` keyword. The trailer dictionary carries several mandatory or important keys:
|
||||
|
||||
- `/Size` — one greater than the highest object number present; defines the minimum size of the cross-reference table in memory.
|
||||
- `/Root` — an indirect reference to the document catalog; mandatory.
|
||||
- `/Info` — an indirect reference to the document information dictionary; optional but common.
|
||||
- `/Encrypt` — an indirect reference to the encryption dictionary; present only in encrypted files.
|
||||
- `/ID` — a two-element array of byte strings used for encryption and document identity; required when `/Encrypt` is present.
|
||||
- `/Prev` — byte offset of the previous xref table or stream, used in incremental updates.
|
||||
|
||||
After the trailer dictionary comes the `startxref` keyword followed by a byte offset, then `%%EOF`.
|
||||
|
||||
---
|
||||
|
||||
## 2. Cross-Reference Streams (PDF 1.5+)
|
||||
|
||||
PDF 1.5 introduced cross-reference streams as an alternative to the traditional table-plus-trailer structure. A cross-reference stream is a regular PDF stream object whose dictionary doubles as the trailer dictionary and whose compressed body encodes the xref entries in binary form.
|
||||
|
||||
The stream dictionary must contain `/Type /XRef`. It also carries the same keys as a trailer dictionary (`/Size`, `/Root`, `/Info`, `/Prev`, etc.). Three additional keys control how the binary body is parsed:
|
||||
|
||||
- `/W` — an array of three integers specifying the byte widths of the three fields in each entry: field 1 (type), field 2 (offset or object stream object number), field 3 (generation number or index within object stream). A width of 0 means the field is absent and its default value applies.
|
||||
- `/Index` — an array of integer pairs `[first_obj count ...]` analogous to the subsection headers in a traditional table. If omitted, defaults to `[0 /Size]`.
|
||||
- `/Filter` — almost always `FlateDecode`; the stream body must be decompressed before parsing.
|
||||
|
||||
Each entry is a concatenation of the three fixed-width binary fields. The three entry types are:
|
||||
|
||||
- **Type 0**: free object. Field 2 is the object number of the next free object; field 3 is the generation number to use if the object is reused.
|
||||
- **Type 1**: uncompressed object at a direct byte offset. Field 2 is the byte offset from the beginning of the file; field 3 is the generation number.
|
||||
- **Type 2**: object compressed inside an object stream. Field 2 is the object number of the containing object stream; field 3 is the zero-based index of this object within that stream. Generation number is implicitly 0.
|
||||
|
||||
When `/W[0]` is 0, type 1 is the default — a common optimization in writers that emit only uncompressed objects.
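The decoding rules above can be sketched as follows, assuming the stream body has already been decompressed and `/W` has been read into a three-element array. `StreamEntry` and `decode_entries` are hypothetical names. The sketch rejects streams with an unknown entry type, whereas a more tolerant parser could map such entries to the null object instead.

```rust
/// One decoded cross-reference stream entry (illustrative type).
#[derive(Debug, PartialEq)]
enum StreamEntry {
    Free { next_free: u64, generation: u64 },
    Offset { offset: u64, generation: u64 },
    Compressed { stream_obj: u64, index: u64 },
}

/// Big-endian unsigned integer; an empty (zero-width) field yields 0.
fn read_be(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0u64, |acc, &b| (acc << 8) | u64::from(b))
}

/// Decode every entry in a decompressed xref stream body using widths `w`.
fn decode_entries(body: &[u8], w: [usize; 3]) -> Option<Vec<StreamEntry>> {
    // Widths above 8 bytes cannot fit in a u64; treat them as malformed.
    if w.iter().any(|&n| n > 8) {
        return None;
    }
    let entry_len = w[0] + w[1] + w[2];
    if entry_len == 0 || body.len() % entry_len != 0 {
        return None;
    }
    let mut out = Vec::new();
    for chunk in body.chunks_exact(entry_len) {
        let (f1, rest) = chunk.split_at(w[0]);
        let (f2, f3) = rest.split_at(w[1]);
        // A zero-width type field means every entry is type 1.
        let kind = if w[0] == 0 { 1 } else { read_be(f1) };
        let (second, third) = (read_be(f2), read_be(f3));
        out.push(match kind {
            0 => StreamEntry::Free { next_free: second, generation: third },
            1 => StreamEntry::Offset { offset: second, generation: third },
            2 => StreamEntry::Compressed { stream_obj: second, index: third },
            _ => return None, // unknown type: this sketch gives up
        });
    }
    Some(out)
}
```

The caller pairs the returned entries with object numbers by walking the `/Index` ranges in order.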
|
||||
|
||||
---
|
||||
|
||||
## 3. Hybrid Reference Files
|
||||
|
||||
Some PDF writers produce documents that contain both a traditional xref table and a cross-reference stream in the same revision. In a hybrid file, the traditional table's trailer dictionary contains an `/XRefStm` key pointing to the byte offset of a cross-reference stream. The cross-reference stream, in turn, covers object entries not listed in the traditional table (typically compressed objects).
|
||||
|
||||
The resolution rule is straightforward: when both structures are present, the traditional xref table takes precedence for any object number it explicitly covers. The cross-reference stream fills in entries for object numbers not present in the table. pdftract should merge the two maps, giving priority to the traditional entries, so that a hybrid file is fully resolved without requiring the parser to choose one structure over the other.
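A minimal sketch of that merge, assuming both structures have already been parsed into maps keyed by object number. The `Location` type and `merge_hybrid` are hypothetical names used only for this example.

```rust
use std::collections::HashMap;

/// Where an object lives (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Location {
    Offset(u64),
    InObjectStream { stream_obj: u32, index: u32 },
}

/// Merge one revision's traditional table with its /XRefStm stream.
/// Entries from the table win; the stream only fills in the gaps.
fn merge_hybrid(
    table: &HashMap<u32, Location>,
    stream: &HashMap<u32, Location>,
) -> HashMap<u32, Location> {
    let mut merged = table.clone();
    for (&obj, &loc) in stream {
        // or_insert leaves an existing table entry untouched.
        merged.entry(obj).or_insert(loc);
    }
    merged
}
```

The sketch does not model free entries. How a free entry in the table should interact with a stream entry for the same object number is a policy decision left out here.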
|
||||
|
||||
---
|
||||
|
||||
## 4. Locating the xref via startxref
|
||||
|
||||
Parsing begins by scanning backward from the end of the file. Strictly, `%%EOF` should be the last line of the file, but Acrobat-compatible readers accept the marker anywhere within the final 1,024 bytes, so trailing garbage or padding is common in practice; pdftract should scan the last 1,024 bytes for the final occurrence of `startxref`. The integer on the next non-whitespace line is the byte offset of the xref structure (table or stream) for the most recent revision.
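The backward scan might look like the sketch below, which operates on the whole file held in memory for simplicity. `find_startxref` is a hypothetical helper name. A real implementation would probably read only the tail of the file and might widen the window when nothing is found.

```rust
/// Return the offset that follows the last `startxref` keyword found in
/// the final 1,024 bytes of `file`, or None if no usable value exists.
fn find_startxref(file: &[u8]) -> Option<u64> {
    const KEYWORD: &[u8] = b"startxref";
    let tail_start = file.len().saturating_sub(1024);
    let tail = &file[tail_start..];
    // rposition gives the last match, i.e. the most recent revision.
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let after = &tail[pos + KEYWORD.len()..];
    // Skip the line break(s), then collect the run of decimal digits.
    let digits: Vec<u8> = after
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

A `None` result is one of the conditions that would trigger the linear scan fallback.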
|
||||
|
||||
In incremental update chains, each revision's xref structure carries a `/Prev` key pointing to the previous revision's xref. pdftract must follow this chain to the end in order to build a complete object map, applying each revision on top of the previous one so that newer definitions shadow older ones.
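One way to sketch the chain walk is to visit revisions newest-first and keep the first definition seen for each object number, which gives the same result as layering newer revisions over older ones. `XrefSection`, `build_object_map`, and the `load` callback are hypothetical. The visited set is an added safeguard against `/Prev` cycles in malformed files.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical parsed xref section: entries plus its optional /Prev.
struct XrefSection {
    entries: HashMap<u32, u64>, // object number -> byte offset
    prev: Option<u64>,          // offset of the previous revision's xref
}

/// Follow /Prev from the newest section at `start`. `load` parses the
/// section at a given offset, returning None if it is unreadable.
fn build_object_map<F>(start: u64, mut load: F) -> HashMap<u32, u64>
where
    F: FnMut(u64) -> Option<XrefSection>,
{
    let mut map = HashMap::new();
    let mut visited = HashSet::new();
    let mut next = Some(start);
    while let Some(offset) = next {
        // insert returns false if the offset was already visited (a cycle).
        if !visited.insert(offset) {
            break;
        }
        let section = match load(offset) {
            Some(s) => s,
            None => break,
        };
        for (obj, off) in section.entries {
            // Newest-first traversal: the first definition seen wins.
            map.entry(obj).or_insert(off);
        }
        next = section.prev;
    }
    map
}
```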
|
||||
|
||||
The PDF version declared in the `%PDF-N.N` header is a lower bound. If the document catalog contains a `/Version` key naming a later version, that key takes precedence, and pdftract should use it to decide which features (such as object streams) are valid in this file.
|
||||
|
||||
---
|
||||
|
||||
## 5. Object Addressing Modes
|
||||
|
||||
Once the xref map is built, every object number resolves to one of two addressing modes.
|
||||
|
||||
**Type 1 — direct byte offset.** The parser seeks to the recorded offset, skips any leading whitespace, then reads the object header: `N G obj`. The content between `obj` and `endobj` is the object's value.
|
||||
|
||||
**Type 2 — inside an object stream.** The parser first resolves the containing object stream (itself a type 1 object), then locates the target object within that stream using the per-stream offset table described in the next section.
|
||||
|
||||
---
|
||||
|
||||
## 6. Object Streams (ObjStm)
|
||||
|
||||
An object stream is a compressed stream whose dictionary carries `/Type /ObjStm`, `/N` (the count of stored objects), and `/First` (byte offset from the start of the decoded stream body to the first stored object's data). The decoded stream body begins with an ASCII header consisting of `/N` pairs of integers: `objnum offset`, where `offset` is relative to the byte indicated by `/First`. After reading these `/N` pairs, pdftract has a local offset table. To retrieve the i-th object, it seeks within the decoded stream to `/First` + `offset[i]` and parses from that position. Object streams cannot contain stream objects, only non-stream objects such as dictionaries, arrays, and scalars.
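A minimal sketch of building that local offset table from an already-decoded stream body follows. `parse_objstm_header` is a hypothetical name, and the sketch splits on ASCII whitespace only, which is narrower than the full PDF whitespace set.

```rust
/// Parse the `/N` integer pairs at the start of a decoded object stream.
/// Returns (object number, absolute offset into `decoded`) for each
/// stored object, or None if the header is malformed.
fn parse_objstm_header(
    decoded: &[u8],
    n: usize,
    first: usize,
) -> Option<Vec<(u32, usize)>> {
    // The header occupies the bytes before /First and must be ASCII.
    let header = std::str::from_utf8(decoded.get(..first)?).ok()?;
    let mut numbers = header.split_ascii_whitespace();
    let mut table = Vec::new();
    for _ in 0..n {
        let obj_num: u32 = numbers.next()?.parse().ok()?;
        let rel: usize = numbers.next()?.parse().ok()?;
        // Stored offsets are relative to /First.
        let abs = first.checked_add(rel)?;
        if abs > decoded.len() {
            return None;
        }
        table.push((obj_num, abs));
    }
    Some(table)
}
```

Retrieving the i-th object then amounts to running the object parser on `&decoded[table[i].1..]`.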
|
||||
|
||||
---
|
||||
|
||||
## 7. Indirect Object Resolution
|
||||
|
||||
When the parser encounters an indirect reference `N G R`, it consults the xref map for object number N and follows the addressing mode to the byte offset or object stream location. For a type 1 entry, it reads the `N G obj` header and verifies that the declared object number and generation number match the requested values. Objects stored in object streams carry no such header, so for a type 2 entry the check is made against the object number recorded in the stream's offset table. A mismatch, where the bytes at the recorded offset contain a different object number, usually indicates a corrupted or patched file. pdftract should log the mismatch and, for generation-number mismatches where the object number matches, return the object anyway, since generation-number cycling is rare in practice and writers sometimes emit inconsistent values.
|
||||
|
||||
---
|
||||
|
||||
## 8. Generation Numbers in Practice
|
||||
|
||||
Generation numbers exist to handle object reuse: when an object is freed and its slot is reused for a new object, the new object receives a generation number one higher than the freed object. In practice, the vast majority of PDF files never free and reuse any object slot, so every in-use object has generation number 0. pdftract should track the generation number recorded in the xref map and validate it during object lookup, but should never refuse to parse an otherwise valid object solely because of a generation mismatch. For extraction purposes, the correct behavior is to return the highest-generation object for any given object number.
|
||||
|
||||
---
|
||||
|
||||
## 9. Null Objects and the Free List
|
||||
|
||||
A reference to a free-list entry must resolve to the null object rather than triggering a parse error. The canonical null reference is `0 0 R`, which by specification always resolves to null. Any indirect reference whose xref entry is marked free (type 0) or whose object number is greater than or equal to `/Size` also resolves to null. The generation number 65535 marks a permanently freed slot that may never be reused. pdftract must handle all of these cases without panicking: the resolution function returns `Option<PdfObject>`, yielding `None` (interpreted as the null object) for any free, out-of-range, or missing entry.
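The null-resolution rules can be sketched as below. `PdfObject` here is a trimmed stand-in for the real type, and `Entry` and `Xref` are hypothetical. Because `/Size` is one greater than the highest object number, the range check uses `>=`.

```rust
use std::collections::HashMap;

/// Trimmed stand-in for the parser's object type.
#[derive(Debug, Clone, PartialEq)]
enum PdfObject {
    Integer(i64),
}

/// Simplified xref entry: either free or holding a parsed object.
enum Entry {
    Free,
    InUse(PdfObject),
}

struct Xref {
    size: u32, // the trailer's /Size value
    entries: HashMap<u32, Entry>,
}

impl Xref {
    /// None stands for the PDF null object; this never panics.
    fn resolve(&self, obj_num: u32) -> Option<PdfObject> {
        // Object 0 heads the free list; numbers at or past /Size are invalid.
        if obj_num == 0 || obj_num >= self.size {
            return None;
        }
        // A missing entry and a free entry both resolve to null.
        match self.entries.get(&obj_num)? {
            Entry::Free => None,
            Entry::InUse(obj) => Some(obj.clone()),
        }
    }
}
```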
|
||||
|
||||
---
|
||||
|
||||
## 10. Error Recovery: Linear Scan Fallback
|
||||
|
||||
When the xref structure is missing, truncated, or internally inconsistent — as can happen with corrupted incremental updates or files produced by non-conforming writers — pdftract falls back to a linear scan of the entire file body.
|
||||
|
||||
The scanner searches for the byte sequence `obj` preceded on the same line by two integers matching the pattern `N G`. For each candidate, it attempts to parse the object body, then records the byte offset and object-number/generation-number pair. When the same object number appears multiple times (a common artifact of corrupted incremental updates where the xref chain is broken), the definition appearing latest in the file takes precedence, since incremental updates append to the end of the file and later definitions supersede earlier ones.
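A simplified sketch of the scan is shown below. It records every `N G obj` header it finds and lets later occurrences overwrite earlier ones. The function names are hypothetical. Unlike the procedure described above, the sketch does not try to parse the object body, so it can be fooled by header-like byte sequences inside stream data.

```rust
use std::collections::HashMap;

/// Read the integer that ends just before `end`, requiring at least one
/// space or tab between it and `end`. Returns (start offset, value).
fn digits_before<T: std::str::FromStr>(data: &[u8], end: usize) -> Option<(usize, T)> {
    let mut digits_end = end;
    while digits_end > 0 && matches!(data[digits_end - 1], b' ' | b'\t') {
        digits_end -= 1;
    }
    if digits_end == end {
        return None; // no separator before `end`
    }
    let mut start = digits_end;
    while start > 0 && data[start - 1].is_ascii_digit() {
        start -= 1;
    }
    // An empty digit run fails to parse and yields None.
    let value = std::str::from_utf8(&data[start..digits_end]).ok()?.parse().ok()?;
    Some((start, value))
}

/// Map object number -> (generation, byte offset of the header).
fn scan_objects(data: &[u8]) -> HashMap<u32, (u16, usize)> {
    let mut found = HashMap::new();
    for (pos, window) in data.windows(3).enumerate() {
        if window != b"obj" {
            continue;
        }
        // Skip longer words such as `object`.
        if data.get(pos + 3).map_or(false, |b| b.is_ascii_alphanumeric()) {
            continue;
        }
        // `endobj` is rejected here: no separator precedes its `obj`.
        let header = digits_before::<u16>(data, pos).and_then(|(gen_start, generation)| {
            digits_before::<u32>(data, gen_start)
                .map(|(offset, number)| (number, generation, offset))
        });
        if let Some((number, generation, offset)) = header {
            // Later definitions overwrite earlier ones.
            found.insert(number, (generation, offset));
        }
    }
    found
}
```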
|
||||
|
||||
After the scan completes, the reconstructed object table is used for all subsequent indirect reference resolution. The scan is significantly slower than xref-guided lookup, so pdftract should attempt xref parsing first and fall back only after confirming that the recorded `startxref` offset points to invalid data or that the decoded xref entries fail internal consistency checks (e.g., a type 1 entry's offset points to bytes that do not begin an `obj` header).
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
Reliable object resolution in pdftract requires a layered strategy: locate `startxref` by scanning backward from `%%EOF`, parse the root xref structure (traditional table, cross-reference stream, or hybrid combination), follow `/Prev` chains to assemble the complete object map across all incremental revisions, and resolve each indirect reference through either a direct byte offset or an object stream lookup. Generation numbers and free-list entries must be handled gracefully rather than treated as hard errors. When the xref mechanism fails entirely, a linear scan of the file provides a workable fallback that makes pdftract robust against the full range of malformed files encountered in production.
|
||||