From 9420964b73948a318f16b3b13db6d2c864002a61 Mon Sep 17 00:00:00 2001 From: jedarden Date: Sat, 16 May 2026 15:16:41 -0400 Subject: [PATCH] Add three research documents on parser correctness fundamentals - graphics-state-tracking: full q/Q stack, text state operators, color space tracking, ExtGState keys, clip path management, CTM concatenation, blend mode/soft mask visibility, Form XObject isolation, GraphicsState Rust struct with is_text_visible implementation - cmap-format-and-cid-encoding: CMap file structure, codespace range scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap inheritance with predefined CJK CMap inventory, mixed-length parsing state machine, ToUnicode defect handling, Rust CMap struct design - content-stream-concatenation: multi-stream concatenation with 0x0A injection, continuous graphics state across boundaries, resource inheritance page-tree walk, Form XObject and Type 3 resource isolation, ResourceStack design, EI disambiguation in binary data, lazy decompression Co-Authored-By: Claude Sonnet 4.6 --- docs/research/cmap-format-and-cid-encoding.md | 360 ++++++++++++++++++ docs/research/content-stream-concatenation.md | 174 +++++++++ docs/research/graphics-state-tracking.md | 280 ++++++++++++++ 3 files changed, 814 insertions(+) create mode 100644 docs/research/cmap-format-and-cid-encoding.md create mode 100644 docs/research/content-stream-concatenation.md create mode 100644 docs/research/graphics-state-tracking.md diff --git a/docs/research/cmap-format-and-cid-encoding.md b/docs/research/cmap-format-and-cid-encoding.md new file mode 100644 index 0000000..a32de0b --- /dev/null +++ b/docs/research/cmap-format-and-cid-encoding.md @@ -0,0 +1,360 @@ +# CMap Format and CID Encoding + +## Purpose + +CMap files are the primary mechanism by which PDF maps character codes to Unicode codepoints. 
Mishandling them produces garbled output, missing characters, or silent data loss — particularly for CJK text, composite fonts, and any font with a `ToUnicode` stream. This document describes the CMap file format and the parsing requirements an implementation must satisfy. + +The authoritative specifications are the Adobe CMap and CIDFont Files Specification (version 1.0, 2012) and ISO 32000-2 (PDF 2.0). + +--- + +## 1. CMap File Structure + +A CMap file is a PostScript-like text file, not a binary format. Its structure divides into a header block, a body, and a footer. + +**Header block:** + +``` +%!PS-Adobe-3.0 Resource-CMap +%%DocumentNeededResources: ProcSet (CIDInit) +%%IncludeResource: ProcSet (CIDInit) +%%BeginResource: CMap (Adobe-GB1-UCS2) +%%Title: (Adobe-GB1-UCS2 Adobe GB1 0) +%%Version: 1.000 +%%EndComments +/CIDInit /ProcSet findresource begin +12 dict begin +begincmap +/CMapName /Adobe-GB1-UCS2 def +/CMapType 1 def +/CIDSystemInfo + << /Registry (Adobe) + /Ordering (GB1) + /Supplement 0 + >> def +``` + +**Key header fields:** + +- `/CMapName` — the name of this CMap as a PostScript name literal. Used to identify it in `usecmap` references. +- `/CMapType` — integer: + - `0` = code-to-glyph (maps character codes to CIDs; used by Type 0 composite fonts) + - `1` = Unicode-to-glyph (maps Unicode values to CIDs; used for ToUnicode reverse lookups) + - `2` = code-to-Unicode (maps character codes to Unicode; the most common type for text extraction) +- `/CIDSystemInfo` — a dictionary with three required keys: `Registry` (e.g., `Adobe`), `Ordering` (e.g., `GB1`, `Japan1`, `CNS1`, `Korea1`), and `Supplement` (integer revision level). This identifies the glyph collection the CMap targets. +- `/WMode` — `0` for horizontal writing (default), `1` for vertical writing. Horizontal CMaps are sufficient for Unicode extraction; vertical CMaps only affect glyph selection. 
+ +**Footer:** + +``` +endcmap +CMapName currentdict /CMap defineresource pop +end +end +%%EndResource +%%EOF +``` + +The `CMapName currentdict /CMap defineresource pop` line in the footer registers the just-defined CMap as a PostScript resource. A PDF parser can treat it as a no-op, but the tokenizer must still accept it. + +--- + +## 2. Codespace Ranges + +`begincodespacerange` / `endcodespacerange` defines which byte sequences are valid character codes in this CMap. Each entry is a pair of equal-length hex strings: + +``` +begincodespacerange +<00> +<8140> +endcodespacerange +``` + +The first hex string is the lower bound, the second is the upper bound. A byte sequence is a valid character code if it falls within any range (byte-by-byte comparison, same length). The length of the hex string (in bytes) determines how many bytes constitute one character code for that range. + +**Encoding-specific examples:** + +- Single-byte: `<00> <FF>` — codes 0x00–0xFF are each one character. +- Shift-JIS (double-byte lead bytes): `<8140> ` alongside `<00> <7E>` for the ASCII portion. +- GB18030 and Big5 use mixed-length codespaces: some codes are 1 byte, others are 2, and GB18030 also has 4-byte codes. + +**Critical point:** the codespace is not an encoding definition in itself — it is a scan grammar. The parser reads raw bytes from the content stream and uses the codespace ranges to segment them into character codes. This is not UTF-8 and cannot reuse a Unicode decoder. + +--- + +## 3. begincidchar / endcidchar + +Maps individual character codes to CID integers (in Type 0 CMaps) or Unicode codepoints (in ToUnicode CMaps): + +``` +begincidchar +<0041> 65 +<4E2D> 20013 +endcidchar +``` + +In a `ToUnicode` CMap, the value is a hex string rather than a decimal integer: + +``` +begincidchar +<0041> <0041> +endcidchar +``` + +Here `<0041>` on the right is a UTF-16BE encoded Unicode value. For characters above U+FFFF, the right-hand side is a surrogate pair encoded in UTF-16BE: `<D840DC00>` for U+20000 (CJK Extension B character).
A parser must detect 4-byte right-hand hex strings and decode them via the UTF-16BE surrogate-pair formula: + +``` +high = (value >> 16) & 0xFFFF; +low = value & 0xFFFF; +codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00); +``` + +--- + +## 4. begincidrange / endcidrange + +Maps a contiguous range of character codes to a contiguous range of CIDs or Unicode values: + +``` +begincidrange +<0041> <005A> <0041> +endcidrange +``` + +For each code `c` in `[start, end]`, the mapped value is `start_value + (c - start)`. This is the most compact form and covers the bulk of CJK character mappings. + +A range entry may also specify an array on the right side: + +``` +begincidrange + [ ] +endcidrange +``` + +The array must have exactly `(end - start + 1)` elements. Each element maps explicitly to the code at that offset. This handles non-contiguous Unicode assignments within a contiguous code range, which occurs in vendor character sets where the standard Unicode mapping is irregular. + +--- + +## 5. beginbfchar / endbfchar and beginbfrange / endbfrange + +The `bf` (base-font) variants appear exclusively in `ToUnicode` CMaps. Their right-hand side is always a Unicode string, not a CID integer. + +**bfchar** maps a single code to a Unicode string: + +``` +beginbfchar +<FB01> <00660069> +endbfchar +``` + +`<00660069>` is the UTF-16BE encoding of the two-character string "fi" (U+0066, U+0069). This means the character code `FB01` is a ligature that expands to two Unicode codepoints. An implementation must produce both codepoints — not just the first — in the output string. + +**bfrange** maps a code range to Unicode strings: + +``` +beginbfrange +<0041> <005A> <0041> +endbfrange +``` + +When the right side is a single hex string, the Unicode value increments by 1 per code step, just as in `cidrange`. When the right side is an array, each element is a full Unicode string for that code offset.
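
The two `bf` value forms can be sketched in Rust as follows. This is a minimal illustration, not pdftract's actual code: the function names `utf16be_to_string` and `bfrange_lookup` are hypothetical, and the single-string `bfrange` case is assumed to increment the last UTF-16 code unit of the target, which matches "increments by 1 per code step" for BMP targets.

```rust
/// Decode big-endian UTF-16 bytes (the right-hand side of a bfchar or
/// bfrange entry) into a String. Returns None for an odd byte count or
/// an unpaired surrogate.
fn utf16be_to_string(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    // char::decode_utf16 combines surrogate pairs and reports unpaired
    // surrogates as errors, which collapse the whole result to None here.
    char::decode_utf16(units).collect::<Result<String, _>>().ok()
}

/// Resolve `code` against a bfrange entry whose target is a single hex
/// string. Assumption (hypothetical helper): the offset is added to the
/// last UTF-16 code unit of the target; overflow is treated as unmapped.
fn bfrange_lookup(start: u32, end: u32, target: &[u8], code: u32) -> Option<String> {
    if code < start || code > end || target.len() % 2 != 0 {
        return None;
    }
    let mut units: Vec<u16> = target
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    let offset = u16::try_from(code - start).ok()?;
    let last = units.last_mut()?;
    *last = last.checked_add(offset)?;
    char::decode_utf16(units).collect::<Result<String, _>>().ok()
}
```

Returning `None` instead of a replacement character lets the caller decide whether to fall back to another mapping source for that code.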
+ +The distinction between `bfchar`/`bfrange` and `cidchar`/`cidrange` is purely semantic: `bf` variants target Unicode text, `cid` variants target glyph indices. A `ToUnicode` CMap will contain only `bf` variants. A code-to-CID CMap for a composite font will contain only `cid` variants. + +--- + +## 6. usecmap — CMap Inheritance + +A CMap may delegate to a base CMap: + +``` +/UniJIS-UTF16-H usecmap +``` + +This means: for any character code not mapped in the current CMap, consult the named CMap. Inheritance chains can be several levels deep. The predefined CMaps (embedded in conforming PDF viewers) must be known to `pdftract` without requiring them to be embedded in the PDF. + +**Required predefined CMap names for CJK support:** + +- `Identity-H`, `Identity-V` — maps each 2-byte code `<XXXX>` directly to CID `XXXX`; codespace `<0000>` to `<FFFF>`. +- Japanese: `90ms-RKSJ-H`, `90ms-RKSJ-V`, `90msp-RKSJ-H`, `UniJIS-UTF16-H`, `UniJIS-UTF16-V`, `UniJIS2004-UTF16-H`, `UniJIS-UCS2-H`, `UniJIS-UCS2-V`, `H`, `V` +- Simplified Chinese: `UniGB-UCS2-H`, `UniGB-UCS2-V`, `UniGB-UTF16-H`, `UniGB-UTF16-V`, `GBK-EUC-H`, `GBK-EUC-V`, `GBKp-EUC-H`, `GBKp-EUC-V`, `GBK2K-H`, `GBK2K-V`, `GB-EUC-H`, `GB-EUC-V` +- Traditional Chinese: `UniCNS-UCS2-H`, `UniCNS-UCS2-V`, `UniCNS-UTF16-H`, `UniCNS-UTF16-V`, `B5pc-H`, `B5pc-V`, `ETen-B5-H`, `ETen-B5-V`, `CNS-EUC-H`, `CNS-EUC-V` +- Korean: `UniKS-UCS2-H`, `UniKS-UCS2-V`, `UniKS-UTF16-H`, `UniKS-UTF16-V`, `KSCms-UHC-H`, `KSCms-UHC-V`, `KSCms-UHC-HW-H`, `KSCms-UHC-HW-V`, `KSCpc-EUC-H` + +`Identity-H` is the most common: when a PDF uses `Encoding = Identity-H`, every 2-byte code in the content stream is its own CID, and the `ToUnicode` CMap (if present) provides the code-to-Unicode translation layered on top. + +--- + +## 7. Parsing Mixed-Length Codespace + +The character code segmentation algorithm must be implemented as an explicit state machine — it cannot be delegated to `str::from_utf8` or any fixed-width integer read.
+ +**Algorithm:** + +``` +fn read_code(bytes: &[u8], pos: &mut usize, codespace: &[CodespaceRange]) -> Option<u32> { + let start = *pos; // remembered so a failed read consumes nothing + let mut accum: u32 = 0; + let mut len: usize = 0; + while len < 4 && *pos < bytes.len() { + accum = (accum << 8) | bytes[*pos] as u32; + *pos += 1; + len += 1; + for range in codespace.iter().filter(|r| r.byte_len == len) { + if accum >= range.low && accum <= range.high { + return Some(accum); + } + } + // stop early unless a strictly longer range could still match + if !codespace.iter().any(|r| r.byte_len > len) { + break; + } + } + *pos = start; // rewind: the caller advances pos by 1 and retries + None // error: no codespace range matched +} +``` + +On a `None` return, the outer loop must advance `pos` by 1 (not by `len`) and retry. For that to hold, `read_code` rewinds `pos` to its starting offset before returning `None`; otherwise the bytes it already consumed would be skipped in addition to the caller's one-byte step. This is a conservative recovery strategy for malformed content streams. + +The codespace ranges must be sorted by `byte_len` and stored separately per length to make the inner loop O(ranges\_at\_this\_length) rather than O(all\_ranges). + +--- + +## 8. ToUnicode CMap in Practice + +The `ToUnicode` CMap is attached to a font dictionary as a stream: + +``` +/ToUnicode +``` + +The stream contains a complete CMap file. The parser must handle the full file syntax, not just extract mapping sections. + +**Authoring defects that must be handled as partial-mapping cases:** + +- **Empty sections:** `beginbfchar\nendbfchar` with zero entries is legal; do not treat it as a parse error. +- **U+0000 / U+FFFD sentinels:** Some tools map unmapped codes to `<0000>` or `<FFFD>`. Discard these — do not emit NUL or replacement characters into the extracted text. +- **Incomplete coverage:** A ToUnicode CMap may only cover a subset of the codes used in the content stream. Fall through to glyph-name-based Unicode recovery for unmapped codes. +- **Wrong code lengths:** The hex string length of a code in a `bfchar` entry may differ from what the codespace declares.
If the mismatch is detectable, prefer the codespace definition for segmentation and use the `bfchar` value for the mapping. + +--- + +## 9. Vertical CMaps + +Vertical CMaps (`WMode 1`) map the same character codes as their horizontal equivalents but to different glyph IDs. The glyphs are rotated or have adjusted metrics for vertical typesetting. The Unicode value of a character does not change when transitioning from horizontal to vertical layout — only the rendered glyph differs. + +For text extraction purposes: if the content stream is processed under a vertical CMap (detected via `WMode 1` in the CMap header or `/WMode 1` in the font dictionary), apply the corresponding horizontal CMap for Unicode mapping. The vertical CMap (`UniJIS-UTF16-V`) inherits from its horizontal counterpart (`UniJIS-UTF16-H`) via `usecmap`; the inherited horizontal mappings cover Unicode lookup. No special vertical-specific Unicode logic is needed. + +--- + +## 10. Implementation: CMap Parser in Rust + +### Tokenizer + +The tokenizer must handle PostScript-like syntax: + +- **Hex strings:** `<4E2D>` — collect bytes between `<` and `>`, ignoring whitespace. Odd-length hex strings should be right-padded with `0`. +- **Decimal integers:** `65`, `20013` — standard integer parsing. +- **Name literals:** `/CMapName` — the `/` is stripped; the remainder is the name. +- **Keywords:** `begincmap`, `endcmap`, `begincodespacerange`, `begincidchar`, `beginbfchar`, `beginbfrange`, `begincidrange`, `usecmap`, `def`, and their `end*` counterparts. +- **Comments:** `%` to end of line — skip. +- **Arrays:** `[` and `]` delimit value arrays in range entries. 
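
The hex-string rule from the tokenizer list can be sketched as below. This is an illustrative helper under stated assumptions: `parse_hex_string` is a hypothetical name, and it receives only the text between `<` and `>`, with the delimiters already consumed by the tokenizer.

```rust
/// Parse the body of a hex string token (the text between `<` and `>`)
/// into raw bytes. Returns None if a non-hex, non-whitespace character
/// is present.
fn parse_hex_string(body: &str) -> Option<Vec<u8>> {
    let mut nibbles: Vec<u8> = Vec::new();
    for ch in body.chars() {
        if ch.is_ascii_whitespace() {
            continue; // whitespace inside a hex string is ignored
        }
        // to_digit(16) accepts 0-9, a-f and A-F; anything else is malformed.
        nibbles.push(ch.to_digit(16)? as u8);
    }
    if nibbles.len() % 2 == 1 {
        nibbles.push(0); // odd digit count: right-pad with a zero nibble
    }
    Some(
        nibbles
            .chunks_exact(2)
            .map(|pair| (pair[0] << 4) | pair[1])
            .collect(),
    )
}
```

Under this sketch, `<4E2D>` yields the bytes `4E 2D` and the odd-length `<ABC>` yields `AB C0`. The byte count of the result is what a caller would compare against the codespace length.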
+ +### Core Structs + +```rust +use std::collections::HashMap; + +pub struct CodespaceRange { + pub byte_len: usize, + pub low: u32, + pub high: u32, +} + +pub struct BfRange { + pub start: u32, + pub end: u32, + pub target: BfRangeTarget, +} + +pub enum BfRangeTarget { + StartCode(u32), // increment from this Unicode value + Array(Vec<String>), // explicit per-code Unicode strings +} + +pub struct CMap { + pub name: String, + pub cmap_type: u8, + pub wmode: u8, + pub codespace: Vec<CodespaceRange>, // sorted by byte_len + pub bf_char: HashMap<u32, String>, // code → Unicode string + pub bf_range: Vec<BfRange>, // sorted by start for binary search + pub usecmap: Option<String>, // inherited CMap name +} +``` + +### decode Method + +```rust +impl CMap { + pub fn decode(&self, bytes: &[u8], base: Option<&CMap>) -> Vec<(u32, String)> { + let mut out = Vec::new(); + let mut pos = 0; + while pos < bytes.len() { + match read_code(bytes, &mut pos, &self.codespace) { + Some(code) => { + let unicode = self.lookup(code) + .or_else(|| base.and_then(|b| b.lookup(code))); + if let Some(s) = unicode { + out.push((code, s)); + } + // if None: unmapped — caller may attempt glyph-name fallback + } + None => { pos += 1; } // error recovery: skip one byte + } + } + out + } + + fn lookup(&self, code: u32) -> Option<String> { + if let Some(s) = self.bf_char.get(&code) { + return Some(s.clone()); + } + // binary search bf_range by start + let idx = self.bf_range.partition_point(|r| r.start <= code); + if idx > 0 { + let r = &self.bf_range[idx - 1]; + if code <= r.end { + return Some(match &r.target { + BfRangeTarget::StartCode(base) => { + char::from_u32(base + (code - r.start)) + .map(|c| c.to_string()) + .unwrap_or_default() + } + BfRangeTarget::Array(arr) => { + arr.get((code - r.start) as usize).cloned().unwrap_or_default() + } + }); + } + } + None + } +} +``` + +### Inheritance and Predefined CMaps + +Resolve `usecmap` by: + +1. 
Checking whether the base CMap is embedded in the PDF: an embedded CMap stream identifies its base through the `/UseCMap` entry of its stream dictionary, whose value is either the name of a predefined CMap or another CMap stream. +2. Falling back to a built-in table of predefined CMap data compiled into the library. `Identity-H` and `Identity-V` must always be available as built-ins since they are extremely common and have trivial definitions. + +Guard against circular references by tracking visited names in a `HashSet<String>` during resolution. A chain longer than 8 levels should be treated as an error. + +### UTF-16BE String Conversion + +When converting a right-hand hex string to a Rust `String`: + +1. Interpret bytes as UTF-16BE code units. +2. Collect surrogate pairs: when a high surrogate (`0xD800`–`0xDBFF`) is followed by a low surrogate (`0xDC00`–`0xDFFF`), decode to a single codepoint. +3. Use `char::from_u32` and reject values that are not valid Unicode scalar values. +4. Concatenate all resulting `char` values into the `String`. + +This handles both BMP characters and supplementary characters (above U+FFFF) correctly without relying on platform-specific wide-character APIs. diff --git a/docs/research/content-stream-concatenation.md b/docs/research/content-stream-concatenation.md new file mode 100644 index 0000000..7daefbd --- /dev/null +++ b/docs/research/content-stream-concatenation.md @@ -0,0 +1,174 @@ +# Content Stream Concatenation and Resource Resolution in PDF + +## Overview + +PDF text extraction depends on two coupled mechanisms: correct reassembly of content streams and correct resolution of resource names within those streams. Both are more complex than a naïve reading of the spec suggests. This document covers each mechanism in the depth required for a correct Rust implementation. + +--- + +## 1. Single vs. Multiple Content Streams + +A page dictionary's `/Contents` entry is either a single indirect reference to a stream object or an array of such references (PDF 1.7 spec §7.8.2).
Both forms are valid; generators choose based on workflow — incremental update tools often append a new content stream rather than rewriting the existing one. + +``` +% Single stream +/Contents 42 0 R + +% Array of streams +/Contents [42 0 R 43 0 R 44 0 R] +``` + +When `/Contents` is an array, the streams are logically concatenated in order to form one content stream. The spec requires that the division between streams fall on a lexical token boundary, so a conforming file never splits a token across two streams, but it does not guarantee whitespace at the seam. A parser should therefore insert a single whitespace byte, here a newline (0x0A), between adjacent streams before parsing. The reason is token-boundary safety: a stream may end with the byte sequence `BT` (no trailing whitespace) and the next stream may begin with `q`. The raw concatenation produces `BTq`, which is a single unrecognized operator token. Inserting `0x0A` yields `BT\nq` — two valid tokens. This newline must be injected regardless of whether the adjacent bytes appear to need it; looking ahead to determine whether the boundary is safe is not reliable in the general case. + +--- + +## 2. Content Stream Concatenation Semantics + +After concatenation (with injected newlines), the resulting byte sequence is parsed as a **single logical content stream**. The PDF graphics model is stateful, and that state is continuous across the boundary between adjacent streams: + +- **Graphics state stack**: A `q` operator in stream N pushes a graphics state entry; the matching `Q` may appear in stream N+1 or any later stream. The stack depth is not reset between streams. +- **Text object**: A `BT` in stream N without a matching `ET` before stream N ends is technically non-conforming, but common. The parser must carry the text object state (current font, text matrix, text line matrix) across the boundary. Treat encountering `BT` without a prior `ET` as an implicit end of the previous text object followed by the start of a new one — but do not reset state that `ET` would not reset (e.g., the graphics state stack depth).
+- **Marked-content nesting**: `BMC`/`BDC` and `EMC` markers may straddle stream boundaries (see §10 below). The nesting counter must persist across boundaries. + +The implementation consequence: the content stream reader must operate on a logical `ContentStreamReader` abstraction that owns an iterator over `(stream_index, byte_offset, byte)` tuples rather than a flat `&[u8]`. The newline injection happens at the seam between streams, not in the stored data. + +--- + +## 3. Resources Dictionary Resolution + +Every named resource referenced in a content stream — fonts, XObjects, color spaces, etc. — is resolved through a `/Resources` dictionary. The lookup path follows the page tree (PDF 1.7 §7.7.3.4): + +1. Check the page dictionary's own `/Resources`. +2. If absent or the name is not found there, walk up through parent `/Pages` nodes in order, checking each `/Resources` entry. +3. The **first** definition found wins — page-level overrides parent-level. + +Some generators omit per-page `/Resources` entirely and rely on the root `/Pages` node's `/Resources`. This is permitted by the spec and must be handled. A correct implementation resolves the inherited resource dictionary chain at page load time, before processing any content stream, and caches the result as a merged view (do not re-walk the tree on every resource lookup). + +The page-level resource dictionary is the outermost scope in the resource name stack (§6 below). + +--- + +## 4. Form XObject Resources + +A Form XObject is an XObject stream with `/Subtype /Form`. It has its own `/Resources` dictionary (PDF 1.7 §8.10.2). When the `Do` operator invokes a Form XObject: + +- **Resource lookup inside the Form's content stream uses the Form's `/Resources` exclusively.** +- Page resources are **not** visible inside the Form. The Form's content stream is a self-contained resource scope. 
+- A font named `/F1` in the page's Resources and a font named `/F1` in the Form's Resources are independent entries that may refer to different font objects. + +The common extraction failure mode: a `/F1 12 Tf` inside a Form XObject resolves against the Form's `/Resources/Font`, not the page's. If the font is not declared in the Form's Resources, the resource lookup fails — even if the page declares the same name. The PDF spec does not permit fallback to the parent scope for Form XObjects. + +--- + +## 5. Type 3 Font Glyph Stream Resources + +A Type 3 font defines each glyph as a content stream stored under `/CharProcs`. Each glyph stream is parsed in its own resource scope: the `/Resources` entry in the **Type 3 font dictionary** (PDF 1.7 §9.6.5). + +- Glyph streams do not use the page resources. +- Glyph streams do not use the Form XObject resources (even if the Type 3 font was invoked from inside a Form). +- Only the Type 3 font's own `/Resources` applies. + +This creates a three-level nesting: page → Form XObject → Type 3 font glyph, each with its own resource scope. The resource name stack (§6) must support all three levels simultaneously. + +--- + +## 6. The Resource Name Stack + +At any point during parsing, there is an ordered stack of active resource dictionaries. Lookup proceeds from innermost (top of stack) to outermost (bottom): + +``` +[page_resources, form_resources?, type3_font_resources?] + ^--- bottom (outermost) ^--- top (innermost) +``` + +In Rust: + +```rust +struct ResourceStack<'a> { + stack: Vec<&'a Resources>, +} + +impl<'a> ResourceStack<'a> { + fn lookup_font(&self, name: &Name) -> Option<&'a FontRef> { + for resources in self.stack.iter().rev() { + if let Some(font) = resources.fonts.get(name) { + return Some(font); + } + } + None + } + + fn push(&mut self, r: &'a Resources) { self.stack.push(r); } + fn pop(&mut self) { self.stack.pop(); } +} +``` + +Push on entry to a Form XObject (`Do`) or Type 3 glyph stream; pop on exit.
For Type 3 glyphs, "exit" is the end of the glyph's content stream. For Form XObjects, "exit" is reaching the end of the Form's content stream after the `Do` operator dispatched into it. + +The page-level resources are always the bottom of the stack (index 0) and are never popped during page processing. + +--- + +## 7. Operators That Reference Resources + +Every operator that references a resource name must resolve that name through the current resource stack: + +| Operator | Resource sub-dictionary | Key | +|----------|------------------------|-----| +| `Tf name size` | `/Font` | font name | +| `Do name` | `/XObject` | XObject name | +| `cs name` | `/ColorSpace` | color space name | +| `CS name` | `/ColorSpace` | color space name | +| `gs name` | `/ExtGState` | graphics state name | +| `sh name` | `/Shading` | shading name | +| `scn`/`SCN` (with pattern name argument) | `/Pattern` | pattern name | +| `BDC /OC /Properties name` | `/Properties` | marked-content property | + +The `Tf` operator is the most critical for text extraction: it sets the current font, which determines how character codes in subsequent `Tj`/`TJ`/`'`/`"` operators map to Unicode. An unresolved font name must be surfaced as a recoverable error — the parser should log the failure and skip character mapping for the affected text spans rather than aborting page extraction. + +--- + +## 8. Inline Images vs. XObjects in Streams + +Inline images are encoded directly in the content stream bytes between `BI`, `ID`, and `EI` operators (PDF 1.7 §8.9.7). The image data between `ID` and `EI` is raw binary and may contain arbitrary bytes, including sequences that look like PDF operators. + +The `EI` token is identified as an `EI` byte pair preceded by whitespace and followed by whitespace or end of stream. A bare `EI` byte pair without that delimiting whitespace is image data, not the operator, and must not end the scan.
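
The `EI` identification rule can be sketched as a byte scan. This is a minimal sketch rather than a complete inline-image parser: `find_ei` and `is_pdf_whitespace` are hypothetical names, and `data` is assumed to start immediately after the `ID` operator.

```rust
/// PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Return the offset of the `E` in the first `EI` pair that is preceded
/// by whitespace and followed by whitespace or the end of `data`.
fn find_ei(data: &[u8]) -> Option<usize> {
    // Start at 1: an `EI` at offset 0 has no preceding whitespace byte.
    let mut i = 1;
    while i + 1 < data.len() {
        if data[i] == b'E' && data[i + 1] == b'I' {
            let before_ok = is_pdf_whitespace(data[i - 1]);
            // A missing byte after `EI` means end of stream, which counts.
            let after_ok = data.get(i + 2).map_or(true, |&b| is_pdf_whitespace(b));
            if before_ok && after_ok {
                return Some(i);
            }
        }
        i += 1;
    }
    None
}
```

Binary image data can still contain a whitespace-delimited `EI` by coincidence, so a scan like this is a heuristic; a caller would typically cap the scan length and validate the bytes that follow the candidate.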
+ +When the image parameters include `/W` (width), `/H` (height), and `/BPC` (bits per component) and no compression filter is applied, the data length is exactly `ceil(W * N * BPC / 8) * H` bytes, where `N` is the number of color components in the image's color space (1 for DeviceGray and image masks, 3 for DeviceRGB, 4 for DeviceCMYK). Parse that many bytes after `ID` and then expect `EI`. With a compression filter (`/F` or `/Filter`), the encoded length is not known in advance, so the data ends at the first valid whitespace-delimited `EI` token. In practice, the safe fallback for malformed streams is to skip the whitespace after `ID`, then read bytes until a valid whitespace-delimited `EI` is found (with a heuristic maximum scan length). + +--- + +## 9. Pages with Many Content Streams + +Some generators produce pages with dozens or hundreds of content streams — one per drawing element, update layer, or annotation. This is legal. The page-level concatenation must handle an arbitrary-length `/Contents` array. + +Memory management is essential: do not decompress all streams into memory before parsing. Use a lazy `ContentStreamSource` that implements `Iterator`, yielding one stream's bytes at a time, and decompresses each stream on demand. The parser reads from one stream at a time, injecting the newline separator at each boundary, without buffering more than one stream's bytes in memory. + +Complex technical drawings (CAD exports, large engineering PDFs) can produce per-page content totaling hundreds of megabytes after decompression. A pull-based streaming parser is the only viable architecture. + +--- + +## 10. Optional Content in Content Streams + +Optional Content Groups (OCGs) control the visibility of content regions. They are activated with `BMC`/`BDC` (begin marked content) and deactivated with `EMC` (end marked content).
An OCG region opened in stream N may be closed in stream N+1: + +``` +% stream 1 +/OC /MC0 BDC + BT /F1 12 Tf (Hello) Tj ET +% stream 1 ends here — no EMC + +% stream 2 + BT /F1 12 Tf (World) Tj ET +EMC +``` + +The OCG nesting depth and the current active OCG stack must persist across stream boundaries, just like the graphics state stack. An `EMC` in stream 2 closes the region opened in stream 1. + +The `/Properties` name referenced in a `BDC` inline dictionary or as an argument is resolved through the current resource stack's `/Properties` sub-dictionary, subject to the same innermost-first lookup as all other resource names. An OCG reference inside a Form XObject resolves through the Form's Resources `/Properties`, not the page's. + +--- + +## Summary + +Correct content stream processing requires treating the page's `/Contents` array as a single logical stream with injected newline separators at each boundary, maintaining continuous parser state (graphics stack, text object, OCG nesting) across those boundaries, and resolving every resource name through a dynamically managed stack of resource dictionaries whose depth reflects the current nesting level (page → Form XObject → Type 3 glyph). Failures in any of these three areas produce incorrect text extraction silently — wrong glyphs, missing text spans, or incorrect reading order — making them the highest-priority correctness requirements for the parser implementation. diff --git a/docs/research/graphics-state-tracking.md b/docs/research/graphics-state-tracking.md new file mode 100644 index 0000000..d81bd7f --- /dev/null +++ b/docs/research/graphics-state-tracking.md @@ -0,0 +1,280 @@ +# Graphics State Tracking for PDF Text Extraction + +Correct text extraction in pdftract requires more than decoding glyph sequences. Whether a glyph is visible, what color it renders at, and where on the page it lands all depend on state that accumulates across operators in the content stream. 
Mishandling this state causes invisible text to contaminate output and visible text to be silently dropped. + +--- + +## 1. The Graphics State Stack + +The PDF content stream is a stateful machine. A graphics state object encapsulates every rendering parameter at a point in the stream. The `q` operator pushes a complete clone of the current state onto a stack; `Q` pops and restores it. The PDF specification (ISO 32000-2, §8.4.2) recommends implementations support at least 28 nesting levels. + +The state that must be cloned on `q` includes: + +- **CTM** — current transformation matrix, 6 floats `[a b c d e f]` +- **Clipping path** — the active clip region +- **Color space and color** — separately for fill and stroke +- **Line parameters** — width, cap style, join style, miter limit, dash pattern +- **Rendering intent** — a name value +- **Stroke adjustment flag** +- **Blend mode** — a name (e.g., `/Normal`, `/Multiply`) +- **Soft mask** — a dictionary or `/None` +- **Alpha constants** — `ca` (fill alpha) and `CA` (stroke alpha), both `f32` in `[0.0, 1.0]` +- **Alpha is shape flag** (`AIS`) +- **Text state** — the entire set of text parameters described in §2 + +A missing or shallow clone on `q` is a latent bug: an inner content stream that changes color or alpha will corrupt the outer stream's state after `Q`. + +--- + +## 2. Text State Within the Graphics State + +Text state is a subgroup of the graphics state and is saved and restored with `q`/`Q`. 
The text state operators and their targets: + +| Operator | Parameter modified | +|----------|--------------------| +| `Tf name size` | Current font (resource name) and font size in text space | +| `Tc value` | Character spacing (added after each glyph, in text space units) | +| `Tw value` | Word spacing (added after each ASCII space, 0x20) | +| `Tz value` | Horizontal scaling, expressed as a percentage (100 = normal) | +| `TL value` | Leading, used by `T*` and `'` operators | +| `Tr value` | Text rendering mode (integer 0–7) | +| `Ts value` | Text rise, vertical offset in text space | + +**Text matrices are separate.** The text matrix `Tm` and the text line matrix `Tlm` are *not* part of the graphics state. They are initialized by `BT` (begin text object) and are undefined outside a `BT`/`ET` pair. `Td`, `TD`, `T*`, and `Tm` modify these matrices during a text object. They are not saved or restored by `q`/`Q`. Implementations that try to persist them across `q`/`Q` will produce incorrect glyph positions. + +--- + +## 3. Color Space Tracking + +The fill and stroke color spaces are tracked independently.
+ +**Explicit color space selection:** +- `cs name` — set fill color space to a named entry from `Resources/ColorSpace` +- `CS name` — set stroke color space to a named entry from `Resources/ColorSpace` +- Device names `/DeviceRGB`, `/DeviceGray`, `/DeviceCMYK` can appear directly + +**Shorthand operators that set both space and color atomically:** + +| Operator | Color space | Arguments | +|----------|-------------|-----------| +| `rg r g b` | DeviceRGB (fill) | three floats in [0,1] | +| `RG r g b` | DeviceRGB (stroke) | three floats in [0,1] | +| `g gray` | DeviceGray (fill) | one float in [0,1] | +| `G gray` | DeviceGray (stroke) | one float in [0,1] | +| `k c m y k` | DeviceCMYK (fill) | four floats in [0,1] | +| `K c m y k` | DeviceCMYK (stroke) | four floats in [0,1] | + +**General color operators** `sc`/`scn` (fill) and `SC`/`SCN` (stroke) set the color within the currently active color space. The argument count depends on the space. + +**Normalized luminance for visibility.** To determine whether text color contrasts with the page background (typically white), convert to a single luminance value: + +- DeviceGray: luminance = `gray` +- DeviceRGB: luminance = `0.2126 * r + 0.7152 * g + 0.0722 * b` (sRGB coefficients per IEC 61966-2-1) +- DeviceCMYK: convert to RGB first: `r = (1-c)*(1-k)`, `g = (1-m)*(1-k)`, `b = (1-y)*(1-k)`, then apply the RGB formula +- CalRGB, ICCBased: use the RGB channel values after applying the color space transformation, then the RGB formula + +Text with luminance near 1.0 on a white background is invisible regardless of alpha. Track fill color luminance as a `f32`; any value above approximately 0.95 against a white background should be flagged as potentially invisible. + +--- + +## 4. ExtGState Dictionary + +The `gs name` operator loads a graphics state parameter dictionary from `Resources/ExtGState`. This is the primary mechanism for setting transparency parameters. 
Keys relevant to text: + +| Key | Type | Effect | +|-----|------|--------| +| `ca` | number | Fill (non-stroking) alpha constant, 0.0–1.0 | +| `CA` | number | Stroke alpha constant, 0.0–1.0 | +| `BM` | name or array | Blend mode | +| `SMask` | dict or `/None` | Soft mask; `/None` clears any active mask | +| `AIS` | boolean | Alpha is shape | +| `SA` | boolean | Stroke adjustment | +| `Font` | array `[ref size]` | Sets current font and size, same effect as `Tf` | + +`apply_gs` must iterate the dictionary and update only the keys present — absent keys leave the corresponding state unchanged. + +**SMask dictionary structure.** When `SMask` is a dictionary rather than `/None`: +- `S` — `/Alpha` or `/Luminosity`: determines how the mask value is extracted from the group result +- `G` — a Form XObject stream that is rendered to produce the mask +- `BC` — backdrop color (array of color components) +- `TR` — transfer function applied to the mask values + +--- + +## 5. Clipping Path Management + +The initial clipping path for a page is the MediaBox (or CropBox if present). Within content streams, clipping is modified by `W` (nonzero winding rule) and `W*` (even-odd rule). These operators are path-painting modifiers: they take effect *after* path construction is complete and *before* or *instead of* a painting operator. The sequence is: + +1. Construct path via `m`, `l`, `c`, `re`, etc. +2. Issue `W` or `W*` — marks intent to clip +3. Issue a painting operator (`S`, `f`, `n`, etc.) or just `n` to apply the clip without painting + +The clip region is **intersected** with the constructed path shape — it can only shrink, never expand. The resulting clip becomes the new current clipping path. + +For text extraction, maintaining an exact polygon clip is expensive. A practical approximation: track the clip as an axis-aligned bounding box (`[x_min, y_min, x_max, y_max]` in user space). When `W`/`W*` fires, intersect the tracked bbox with the bounding box of the current path. 
For most documents this approximation is exact; non-rectangular clips are edge cases flagged for further analysis. + +The clipping path is fully saved and restored by `q`/`Q`. + +--- + +## 6. Current Transformation Matrix + +The CTM is a 3×3 matrix represented by 6 values `[a b c d e f]`, written in rows as `[a b 0]`, `[c d 0]`, `[e f 1]`, so the third column is implicitly `[0 0 1]`. The `cm a b c d e f` operator **pre-multiplies** the current CTM by the new matrix: + +``` +CTM_new = [a b c d e f] × CTM_current +``` + +In row-vector convention (PDF uses row vectors), concatenation means the new transform is applied first. The implementation must preserve exact multiplication order. + +Matrix concatenation: + +```rust +fn concat(m: [f64; 6], ctm: [f64; 6]) -> [f64; 6] { + [ + m[0]*ctm[0] + m[1]*ctm[2], + m[0]*ctm[1] + m[1]*ctm[3], + m[2]*ctm[0] + m[3]*ctm[2], + m[2]*ctm[1] + m[3]*ctm[3], + m[4]*ctm[0] + m[5]*ctm[2] + ctm[4], + m[4]*ctm[1] + m[5]*ctm[3] + ctm[5], + ] +} +``` + +Glyph positions are computed in text space, transformed by the text matrix `Tm`, then by the CTM. The resulting device-space coordinates determine where the glyph appears on the page and whether it falls within the clipping bbox. + +--- + +## 7. Blend Mode Effects on Visibility + +The blend mode controls how a graphics object composites over the content beneath it. For text extraction, the key question is whether the blend mode can render text invisible. + +- **`/Normal` and `/Compatible`** — the source color replaces the destination at the source's alpha. At `ca=1.0`, text is fully opaque in its declared color. +- **`/Multiply`** — multiplies source and destination color channels. Text drawn in black (0,0,0) on any background remains black. Text drawn in white (1,1,1) leaves the destination unchanged, so it is invisible against any background. +- **`/Screen`** — `1 - (1-s)*(1-d)`. Light-colored text lightens rather than covers.
+- **`/Overlay`, `/HardLight`, `/SoftLight`** — result depends on the luminance of the destination, which is unknown without rendering. +- **`/Difference`, `/Exclusion`** — text color is the absolute difference with the background. + +Practical rule: if blend mode is not `/Normal` or `/Compatible`, the actual rendered color cannot be determined without knowing the destination. Flag such text as `blend_mode_dependent` and rely on `ca` as the primary visibility signal. A `ca` of 0.0 guarantees invisibility; any positive value with a non-Normal blend mode is ambiguous. + +--- + +## 8. Soft Mask Interaction + +A soft mask applies a per-pixel transparency derived from a separately rendered Form XObject. The effective alpha at any pixel is `ca * mask_value(x, y)`. Since `mask_value` is in `[0.0, 1.0]`, the constant alpha `ca` is an upper bound on the effective alpha. + +Fully rendering the mask Form XObject is expensive and outside the scope of a text extraction pass. The practical approach: + +1. When `SMask` is set to a dictionary (not `/None`), set a boolean flag `soft_mask_present: true` on the graphics state. +2. Use `ca` as an upper-bound visibility signal: if `ca == 0.0`, text is invisible regardless of the mask. +3. For `ca > 0.0` with an active soft mask, text is marked `soft_mask_present` and conservatively included in output; it may be partially or fully transparent depending on the mask, but exclusion risks losing real content. + +Clearing: `gs` with `SMask /None` clears the active soft mask. + +--- + +## 9. Form XObject Graphics State Isolation + +When `Do name` invokes a Form XObject, the PDF processor must: + +1. Save the current graphics state (equivalent to `q`) +2. Concatenate the Form XObject's `/Matrix` (if present) with the current CTM +3. Apply the Form XObject's `/BBox` as an additional clip +4. Parse the Form XObject's content stream, using its `/Resources` dictionary for name resolution +5.
Restore the graphics state (equivalent to `Q`) when the stream ends + +Graphics state mutations inside the Form XObject (color changes, alpha updates, CTM modifications, clip changes) do not persist after the `Do` operator completes. Resource name resolution switches to the Form XObject's `/Resources` during parsing and reverts after. Failing to isolate Form XObject state is a common source of color and font state corruption. + +--- + +## 10. Implementation: the `GraphicsState` Struct + +```rust +#[derive(Clone)] +pub struct GraphicsState { + // Transformation + pub ctm: [f64; 6], + + // Color (fill) + pub fill_color_space: ColorSpace, + pub fill_color: ColorValue, + pub fill_alpha: f32, + + // Color (stroke) + pub stroke_color_space: ColorSpace, + pub stroke_color: ColorValue, + pub stroke_alpha: f32, + + // Transparency + pub blend_mode: BlendMode, + pub soft_mask_present: bool, + + // Clipping (bbox approximation in user space) + pub clip_bbox: Option<[f64; 4]>, // [x_min, y_min, x_max, y_max] + + // Text state + pub text_rendering_mode: u8, // 0–7 per PDF spec + pub text_rise: f64, + pub font_name: Option<String>, + pub font_size: f64, + pub char_spacing: f64, + pub word_spacing: f64, + pub horiz_scaling: f64, // percentage, default 100.0 + pub leading: f64, +} + +pub struct GraphicsStateStack { + stack: Vec<GraphicsState>, +} + +impl GraphicsStateStack { + pub fn save(&mut self) { + let top = self.stack.last().expect("empty stack").clone(); + self.stack.push(top); + } + + pub fn restore(&mut self) { + if self.stack.len() > 1 { + self.stack.pop(); + } + } + + pub fn current(&mut self) -> &mut GraphicsState { + self.stack.last_mut().expect("empty stack") + } +} +``` + +**`is_text_visible`** must combine all signals: + +```rust +pub fn is_text_visible(&self) -> bool { + // Rendering mode 3 = neither fill nor stroke (invisible) + if self.text_rendering_mode == 3 { return false; } + + // Zero alpha = invisible + if self.fill_alpha == 0.0 { return false; } + + // Clipping: if clip bbox has zero area,
text is outside + if let Some(bbox) = self.clip_bbox { + if bbox[0] >= bbox[2] || bbox[1] >= bbox[3] { return false; } + } + + // High-luminance fill on assumed white background + let lum = self.fill_color.luminance(); + if lum > 0.95 && self.fill_alpha > 0.0 { return false; } + + true +} +``` + +**`apply_gs`** iterates the ExtGState dictionary entries and applies each recognized key. Unknown keys are ignored per the spec's extensibility rules. + +**`apply_cm`** calls the `concat` function above to pre-multiply the new matrix into the current CTM. + +--- + +## Summary + +Full graphics state tracking is not optional for accurate text extraction. The rendering mode, alpha constants, blend mode, soft mask, fill color, clipping path, and CTM each independently contribute to whether a glyph appears on the page and where. The stack mechanics of `q`/`Q` must clone the complete state. Form XObjects must isolate their state changes. Text matrices (`Tm`, `Tlm`) are separate from the graphics state and must not be conflated with it. The `is_text_visible` predicate synthesizes all tracked signals into a single decision that drives inclusion or exclusion of glyphs from the extraction output.
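
The `is_text_visible` predicate calls `fill_color.luminance()`, but `ColorValue` is named in the struct without being defined. A minimal sketch of that method follows, using the conversion formulas from section 3. The `ColorValue` variants and the `rgb_luminance` helper are assumptions made for illustration; CalRGB and ICCBased values are assumed to have been reduced to RGB before reaching this type.

```rust
/// Hypothetical color value type. The variants are an assumption:
/// the document references `ColorValue` but does not define it.
#[derive(Clone, Debug, PartialEq)]
pub enum ColorValue {
    Gray(f32),
    Rgb(f32, f32, f32),
    Cmyk(f32, f32, f32, f32),
}

impl ColorValue {
    /// Normalized luminance in [0, 1], following the section 3 formulas.
    pub fn luminance(&self) -> f32 {
        let lum = match *self {
            // DeviceGray: the gray level is the luminance.
            ColorValue::Gray(g) => g,
            ColorValue::Rgb(r, g, b) => rgb_luminance(r, g, b),
            // DeviceCMYK: naive conversion to RGB, then the RGB formula.
            ColorValue::Cmyk(c, m, y, k) => {
                let r = (1.0 - c) * (1.0 - k);
                let g = (1.0 - m) * (1.0 - k);
                let b = (1.0 - y) * (1.0 - k);
                rgb_luminance(r, g, b)
            }
        };
        // Guard against out-of-range operands in malformed streams.
        lum.clamp(0.0, 1.0)
    }
}

/// Weighted sum of RGB channels (hypothetical helper).
fn rgb_luminance(r: f32, g: f32, b: f32) -> f32 {
    0.2126 * r + 0.7152 * g + 0.0722 * b
}
```

With these weights, pure yellow `Rgb(1.0, 1.0, 0.0)` has luminance 0.9278, just under the 0.95 threshold, so it would not be flagged as invisible on a white background.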