pdftract/docs/research/content-stream-concatenation.md
jedarden 9420964b73 Add three research documents on parser correctness fundamentals
- graphics-state-tracking: full q/Q stack, text state operators, color
  space tracking, ExtGState keys, clip path management, CTM concatenation,
  blend mode/soft mask visibility, Form XObject isolation, GraphicsState
  Rust struct with is_text_visible implementation
- cmap-format-and-cid-encoding: CMap file structure, codespace range
  scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap
  inheritance with predefined CJK CMap inventory, mixed-length parsing
  state machine, ToUnicode defect handling, Rust CMap struct design
- content-stream-concatenation: multi-stream concatenation with 0x0A
  injection, continuous graphics state across boundaries, resource
  inheritance page-tree walk, Form XObject and Type 3 resource isolation,
  ResourceStack design, EI disambiguation in binary data, lazy decompression

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:16:41 -04:00


# Content Stream Concatenation and Resource Resolution in PDF
## Overview
PDF text extraction depends on two coupled mechanisms: correct reassembly of content streams and correct resolution of resource names within those streams. Both are more complex than a naïve reading of the spec suggests. This document covers each mechanism in the depth required for a correct Rust implementation.
---
## 1. Single vs. Multiple Content Streams
A page dictionary's `/Contents` entry is either a single indirect reference to a stream object or an array of such references (PDF 1.7 spec §7.8.2). Both forms are valid; generators choose based on workflow — incremental update tools often append a new content stream rather than rewriting the existing one.
```
% Single stream
/Contents 42 0 R
% Array of streams
/Contents [42 0 R 43 0 R 44 0 R]
```
When `/Contents` is an array, the streams are logically concatenated in order to form one content stream. PDF 1.7 only requires writers to split streams at lexical token boundaries; PDF 2.0 (ISO 32000-2) describes the concatenation as having at least one white-space character between adjacent streams. A reader should therefore insert a separator, conventionally a newline (0x0A), at every seam before parsing. The reason is token-boundary safety: a stream may end with the bytes `BT` (no trailing whitespace) and the next stream may begin with `q`. Raw concatenation produces `BTq`, which is a single unrecognized operator token; inserting `0x0A` yields `BT\nq`, two valid tokens. Inject the separator unconditionally: extra whitespace between tokens is harmless, and inspecting the adjacent bytes to decide whether the boundary is safe is not reliable in the general case.
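The seam problem can be reproduced in a few lines. The sketch below is illustrative only: `join_streams` and `split_tokens` are hypothetical helper names, and the tokenizer splits on white space alone, which is enough to show the effect but is not a full PDF lexer.

```rust
/// Concatenates content streams, always placing one 0x0A byte between neighbours.
fn join_streams(streams: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for (i, s) in streams.iter().enumerate() {
        if i > 0 {
            out.push(b'\n'); // unconditional separator at every seam
        }
        out.extend_from_slice(s);
    }
    out
}

/// Splits on PDF white-space bytes only (a real tokenizer also handles
/// delimiters, strings, and comments).
fn split_tokens(data: &[u8]) -> Vec<&[u8]> {
    data.split(|b| matches!(*b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20))
        .filter(|t| !t.is_empty())
        .collect()
}
```

With the streams `q BT` and `q 0 0 10 10 re`, raw concatenation tokenizes to `q`, `BTq`, `0`, ... while the joined form keeps `BT` and `q` as separate tokens.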
---
## 2. Content Stream Concatenation Semantics
After concatenation (with injected newlines), the resulting byte sequence is parsed as a **single logical content stream**. The PDF graphics model is stateful, and that state is continuous across the boundary between adjacent streams:
- **Graphics state stack**: A `q` operator in stream N pushes a graphics state entry; the matching `Q` may appear in stream N+1 or any later stream. The stack depth is not reset between streams.
- **Text object**: A `BT` in stream N without a matching `ET` before stream N ends is technically non-conforming, but common. The parser must carry the text object state (text matrix, text line matrix) across the boundary; text state parameters such as the current font belong to the graphics state and persist in any case. Treat a `BT` that arrives while a text object is still open as an implicit `ET` followed by a new `BT`, but do not reset anything that `ET` would not reset (e.g., the graphics state stack depth).
- **Marked-content nesting**: `BMC`/`BDC` and `EMC` markers may straddle stream boundaries (see §10 below). The nesting counter must persist across boundaries.
The implementation consequence: the content stream reader must operate on a logical `ContentStreamReader` abstraction that owns an iterator over `(stream_index, byte_offset, byte)` tuples rather than a flat `&[u8]`. The newline injection happens at the seam between streams, not in the stored data.
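A minimal sketch of such an iterator follows, assuming the streams are already decoded and held as slices. The type name `SeamedBytes` and the convention of reporting the injected separator at an offset one past the end of the stream that just finished are assumptions made for illustration.

```rust
/// Yields (stream_index, byte_offset, byte) over several streams, emitting a
/// synthetic 0x0A at each seam. The stored stream data is never modified.
struct SeamedBytes<'a> {
    streams: &'a [&'a [u8]],
    stream: usize,
    offset: usize,
}

impl<'a> SeamedBytes<'a> {
    fn new(streams: &'a [&'a [u8]]) -> Self {
        SeamedBytes { streams, stream: 0, offset: 0 }
    }
}

impl<'a> Iterator for SeamedBytes<'a> {
    type Item = (usize, usize, u8);

    fn next(&mut self) -> Option<Self::Item> {
        let current = self.streams.get(self.stream)?;
        if let Some(&byte) = current.get(self.offset) {
            let item = (self.stream, self.offset, byte);
            self.offset += 1;
            return Some(item);
        }
        // End of the current stream: inject a separator unless it was the last.
        let seam = (self.stream, self.offset, b'\n');
        self.stream += 1;
        self.offset = 0;
        if self.stream < self.streams.len() { Some(seam) } else { None }
    }
}
```

Keeping the stream index and offset in each item lets diagnostics point at the physical stream even though the parser sees one logical byte sequence.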
---
## 3. Resources Dictionary Resolution
Every named resource referenced in a content stream — fonts, XObjects, color spaces, etc. — is resolved through a `/Resources` dictionary. The lookup path follows the page tree (PDF 1.7 §7.7.3.4):
1. Check the page dictionary's own `/Resources`.
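2. If the page has no `/Resources`, walk up through the parent `/Pages` nodes in order and use the nearest `/Resources` found. Strictly, inheritance applies to the attribute as a whole, not to individual names; continuing the walk when a name is missing from the page's own dictionary is a tolerance for malformed files.
3. The **first** definition found wins: page-level overrides parent-level.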
Some generators omit per-page `/Resources` entirely and rely on the root `/Pages` node's `/Resources`. This is permitted by the spec and must be handled. A correct implementation resolves the inherited resource dictionary chain at page load time, before processing any content stream, and caches the result as a merged view (do not re-walk the tree on every resource lookup).
The page-level resource dictionary is the outermost scope in the resource name stack (§6 below).
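The lookup order in the numbered steps above can be sketched as a walk over parent links. Everything here is a simplification made for illustration: `Node` is a hypothetical stand-in for a page-tree node, resources are reduced to a font-name map, and the hop counter guards against `/Parent` cycles in damaged files.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified page-tree node: only the fields needed here.
struct Node {
    parent: Option<usize>,                      // index of the /Parent node
    resources: Option<HashMap<String, String>>, // font name -> object label
}

/// Walks from `page` toward the root and returns the first definition of `name`.
fn resolve_font<'a>(nodes: &'a [Node], page: usize, name: &str) -> Option<&'a str> {
    let mut current = Some(page);
    let mut hops = 0;
    while let Some(i) = current {
        let node = nodes.get(i)?;
        if let Some(found) = node.resources.as_ref().and_then(|r| r.get(name)) {
            return Some(found.as_str());
        }
        hops += 1;
        if hops > nodes.len() {
            return None; // /Parent cycle in a malformed file
        }
        current = node.parent;
    }
    None
}
```

In a real implementation this walk would run once per page and its result would be cached, as described above, rather than being repeated for every name.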
---
## 4. Form XObject Resources
A Form XObject is an XObject stream with `/Subtype /Form`. It has its own `/Resources` dictionary (PDF 1.7 §8.10.2). When the `Do` operator invokes a Form XObject:
- **Resource lookup inside the Form's content stream uses the Form's `/Resources`** whenever the Form has one.
- Page resources are **not** visible inside such a Form. Its content stream is a self-contained resource scope.
- A font named `/F1` in the page's Resources and a font named `/F1` in the Form's Resources are independent entries that may refer to different font objects.
The common extraction failure mode: a `/F1 12 Tf` inside a Form XObject resolves against the Form's `/Resources/Font`, not the page's. If the font is not declared there, the lookup fails, even if the page declares the same name; the spec defines no fallback to the parent scope for a Form that carries its own `/Resources`. The one exception is a Form with no `/Resources` entry at all (the key is optional, a holdover from PDF 1.1): its names are looked up in the resources of the page on which it is used.
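The scoping rule can be illustrated with a strict lookup that consults only the scope of the stream being interpreted. This is a sketch under the assumption that the Form carries its own `/Resources`; the `Resources` struct and `lookup_font_strict` are hypothetical names, and font objects are reduced to string labels.

```rust
use std::collections::HashMap;

/// Hypothetical minimal resource dictionary: font name -> font object label.
struct Resources {
    fonts: HashMap<String, String>,
}

/// Strict lookup: only the scope of the stream currently being interpreted
/// is consulted, with no fallback to the caller's scope.
fn lookup_font_strict<'a>(active: &'a Resources, name: &str) -> Option<&'a str> {
    active.fonts.get(name).map(String::as_str)
}
```

Passing the page scope and the Form scope for the same name `F1` returns two unrelated objects, and a name declared only on the page is simply absent inside the Form.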
---
## 5. Type 3 Font Glyph Stream Resources
A Type 3 font defines each glyph as a content stream stored under `/CharProcs`. Each glyph stream is parsed in its own resource scope: the `/Resources` entry in the **Type 3 font dictionary** (PDF 1.7 §9.6.5).
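- When the font dictionary has `/Resources`, glyph streams do not use the page resources.
- Glyph streams do not use the Form XObject resources (even if the Type 3 font was invoked from inside a Form).
- If the Type 3 font has no `/Resources` entry (it is optional), names in its glyph streams are looked up in the resource dictionary of the page on which the font is used.
This creates nesting such as page → Form XObject → Type 3 font glyph, each level with its own resource scope. Because a Form XObject can invoke other Form XObjects, the nesting can be deeper than three levels, and the resource name stack (§6) must support any depth.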
---
## 6. The Resource Name Stack
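At any point during parsing, there is an ordered stack of active resource dictionaries. Lookup proceeds from innermost (top of stack) to outermost (bottom). Falling through past the top entry is a leniency for malformed files rather than spec behavior (§4, §5); a strict mode consults only the top entry.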
```
[page_resources, form_resources?, type3_font_resources?]
 ^                                ^
 bottom (outermost)               top (innermost)
```
In Rust:
```rust
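use std::collections::HashMap;

// Placeholder types so the sketch compiles; the real definitions live elsewhere.
type Name = String;
struct FontRef;
struct Resources {
    fonts: HashMap<Name, FontRef>,
}

struct ResourceStack<'a> {
    stack: Vec<&'a Resources>,
}

impl<'a> ResourceStack<'a> {
    /// Searches from the innermost scope (top) to the outermost (bottom).
    fn lookup_font(&self, name: &Name) -> Option<&'a FontRef> {
        for resources in self.stack.iter().rev() {
            if let Some(font) = resources.fonts.get(name) {
                return Some(font);
            }
        }
        None
    }
    fn push(&mut self, r: &'a Resources) { self.stack.push(r); }
    fn pop(&mut self) { self.stack.pop(); }
}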
```
Push on entry to a Form XObject (`Do`) or Type 3 glyph stream; pop on exit. For Type 3 glyphs, "exit" is the end of the glyph's content stream. For Form XObjects, "exit" is reaching the end of the Form's content stream after the `Do` operator dispatched into it.
The page-level resources are always the bottom of the stack (index 0) and are never popped during page processing.
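One way to keep every push paired with its pop is to wrap both in a helper that takes the nested work as a closure. The sketch below uses a reduced stack for illustration: `Resources` carries only a label, and `with_scope` and `top` are hypothetical additions, not part of the struct shown above. It does not pop if the closure panics.

```rust
/// Stand-in for a resource dictionary; only a label is needed for this sketch.
struct Resources {
    label: &'static str,
}

struct ResourceStack<'a> {
    stack: Vec<&'a Resources>,
}

impl<'a> ResourceStack<'a> {
    /// The page scope is installed at index 0 and stays there.
    fn new(page: &'a Resources) -> Self {
        ResourceStack { stack: vec![page] }
    }

    fn top(&self) -> &'a Resources {
        self.stack.last().copied().expect("page scope is never popped")
    }

    /// Pushes `scope`, runs `body` (e.g. interpreting a Form or glyph stream),
    /// then pops, so the stack depth is restored on every normal return path.
    fn with_scope<T>(&mut self, scope: &'a Resources, body: impl FnOnce(&mut Self) -> T) -> T {
        self.stack.push(scope);
        let result = body(self);
        self.stack.pop();
        result
    }
}
```

Nested calls model a Type 3 glyph invoked from inside a Form: the inner closure sees the glyph scope on top, the outer closure sees the Form scope again once the inner call returns, and the page scope is back on top at the end.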
---
## 7. Operators That Reference Resources
Every operator that references a resource name must resolve that name through the current resource stack:
| Operator | Resource sub-dictionary | Key |
|----------|------------------------|-----|
| `name size Tf` | `/Font` | font name |
| `name Do` | `/XObject` | XObject name |
| `name cs` | `/ColorSpace` | color space name |
| `name CS` | `/ColorSpace` | color space name |
| `name gs` | `/ExtGState` | graphics state name |
| `name sh` | `/Shading` | shading name |
| `scn`/`SCN` (with a trailing pattern name operand) | `/Pattern` | pattern name |
| `tag name BDC` (e.g. `/OC /MC0 BDC`) | `/Properties` | marked-content property |
The `Tf` operator is the most critical for text extraction: it sets the current font, which determines how character codes in subsequent `Tj`/`TJ`/`'`/`"` operators map to Unicode. An unresolved font name must be surfaced as a recoverable error — the parser should log the failure and skip character mapping for the affected text spans rather than aborting page extraction.
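The table and the recoverable-error policy can be sketched together. The names `resource_category`, `CurrentFont`, and `select_font` are hypothetical, fonts are reduced to string labels, and the warning list stands in for whatever logging the parser actually uses.

```rust
use std::collections::HashMap;

/// Maps an operator to the /Resources sub-dictionary its name operand uses.
/// Returns None for operators that take no resource name.
fn resource_category(op: &str) -> Option<&'static str> {
    match op {
        "Tf" => Some("Font"),
        "Do" => Some("XObject"),
        "cs" | "CS" => Some("ColorSpace"),
        "gs" => Some("ExtGState"),
        "sh" => Some("Shading"),
        "scn" | "SCN" => Some("Pattern"), // only when the last operand is a name
        "BDC" => Some("Properties"),      // only when the operand is a name
        _ => None,
    }
}

#[derive(Debug, PartialEq)]
enum CurrentFont {
    Resolved(String),   // label of the font object
    Unresolved(String), // name that failed to resolve; text stays unmapped
}

/// Resolves a Tf operand; on failure records a warning and keeps going.
fn select_font(
    fonts: &HashMap<String, String>,
    name: &str,
    warnings: &mut Vec<String>,
) -> CurrentFont {
    match fonts.get(name) {
        Some(obj) => CurrentFont::Resolved(obj.clone()),
        None => {
            warnings.push(format!("unresolved font /{}", name));
            CurrentFont::Unresolved(name.to_string())
        }
    }
}
```

Returning an `Unresolved` state instead of an error lets the text-showing operators that follow still advance the text position while skipping character mapping.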
---
## 8. Inline Images vs. XObjects in Streams
Inline images are encoded directly in the content stream bytes between `BI`, `ID`, and `EI` operators (PDF 1.7 §8.9.7). The image data between `ID` and `EI` is raw binary and may contain arbitrary bytes, including sequences that look like PDF operators.
The `EI` token is identified as: an `EI` byte pair preceded by whitespace and followed by whitespace or end of stream. Do not scan for a bare `EI` pattern inside the image data.
When the image parameters include `/W` (width), `/H` (height), and `/BPC` (bits per component) and no filter is applied, the data length is exactly `ceil(W * N * BPC / 8) * H` bytes, where `N` is the number of color components implied by `/CS` (for example 1 for DeviceGray or an image mask, 3 for DeviceRGB, 4 for DeviceCMYK). `ID` is followed by a single whitespace byte; read that many bytes after it and then expect `EI`. With a filter (`/F` or `/Filter`), the encoded length cannot be computed from the parameters, so the end of the data is the first whitespace-delimited `EI` token in the encoded bytes. For malformed streams the safe fallback is the same scan: skip the single whitespace byte after `ID`, then read until a valid whitespace-delimited `EI` is found, subject to a heuristic maximum scan length. Do not skip a run of whitespace after `ID`, because the first byte of binary image data may itself be a whitespace byte.
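Both strategies are small enough to sketch. The function names are hypothetical, and `components` is an assumed extra parameter: the row size depends on the number of color components per pixel (1 for DeviceGray or an image mask, 3 for DeviceRGB, 4 for DeviceCMYK) as well as on `/W` and `/BPC`. The scan is a heuristic, since binary data can contain a whitespace-delimited `EI` by chance.

```rust
/// Byte length of unfiltered inline image data; each row is padded to a whole byte.
fn raw_image_len(width: usize, height: usize, bpc: usize, components: usize) -> usize {
    let bits_per_row = width * components * bpc;
    ((bits_per_row + 7) / 8) * height
}

fn is_pdf_space(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Fallback scan: returns the offset of the `E` in the first `EI` that is
/// preceded by white space and followed by white space or end of data,
/// examining at most `max_scan` bytes from `start`.
fn find_ei(data: &[u8], start: usize, max_scan: usize) -> Option<usize> {
    let end = data.len().min(start.saturating_add(max_scan));
    let mut i = start.max(1);
    while i + 1 < end {
        if data[i] == b'E'
            && data[i + 1] == b'I'
            && is_pdf_space(data[i - 1])
            && data.get(i + 2).map_or(true, |&b| is_pdf_space(b))
        {
            return Some(i);
        }
        i += 1;
    }
    None
}
```

A stricter reader could confirm a candidate by checking that the bytes after it tokenize as valid operators before accepting it.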
---
## 9. Pages with Many Content Streams
Some generators produce pages with dozens or hundreds of content streams — one per drawing element, update layer, or annotation. This is legal. The page-level concatenation must handle an arbitrary-length `/Contents` array.
Memory management is essential: do not decompress all streams into memory before parsing. Use a lazy `ContentStreamSource` that implements `Iterator<Item = Result<DecompressedStream>>` and decompresses each stream on demand. The parser reads from one stream at a time, injecting the newline separator at each boundary, without buffering more than one stream's bytes in memory.
Complex technical drawings (CAD exports, large engineering PDFs) can produce per-page content totaling hundreds of megabytes after decompression. A pull-based streaming parser is the only viable architecture.
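A pull-based source along these lines might look like the following. It is a sketch, not the parser's actual type: `LazyContent` is a hypothetical name, the `load` closure stands in for "fetch and decompress the i-th `/Contents` entry", errors are plain strings, and for brevity each stream is decoded whole rather than in chunks.

```rust
/// Pulls bytes from streams that are decoded one at a time by `load`.
struct LazyContent<F> {
    load: F,
    count: usize,
    next_index: usize,
    current: std::vec::IntoIter<u8>,
}

impl<F: FnMut(usize) -> Result<Vec<u8>, String>> LazyContent<F> {
    fn new(count: usize, load: F) -> Self {
        LazyContent { load, count, next_index: 0, current: Vec::new().into_iter() }
    }
}

impl<F: FnMut(usize) -> Result<Vec<u8>, String>> Iterator for LazyContent<F> {
    type Item = Result<u8, String>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(b) = self.current.next() {
                return Some(Ok(b));
            }
            if self.next_index >= self.count {
                return None;
            }
            let index = self.next_index;
            self.next_index += 1;
            match (self.load)(index) {
                Ok(bytes) => {
                    // Replacing `current` drops the previous stream's buffer.
                    self.current = bytes.into_iter();
                    if index > 0 {
                        return Some(Ok(b'\n')); // separator at the seam
                    }
                }
                Err(e) => return Some(Err(e)), // surfaced; iteration may continue
            }
        }
    }
}
```

Because `load` runs only when the previous buffer is exhausted, at most one decoded stream is held at a time, and a consumer that stops early never pays for the remaining streams.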
---
## 10. Optional Content in Content Streams
Optional Content Groups (OCGs) control the visibility of content regions. In a content stream, an optional-content region is opened with `/OC name BDC` and closed by the matching `EMC`; `BMC` carries no property list and so cannot reference an OCG, but it still contributes to marked-content nesting. A region opened in stream N may be closed in stream N+1:
```
% stream 1
/OC /MC0 BDC
BT /F1 12 Tf (Hello) Tj ET
% stream 1 ends here — no EMC
% stream 2
BT /F1 12 Tf (World) Tj ET
EMC
```
The OCG nesting depth and the current active OCG stack must persist across stream boundaries, just like the graphics state stack. An `EMC` in stream 2 closes the region opened in stream 1.
The name operand of `/OC name BDC` is resolved through the `/Properties` sub-dictionary of the current resource stack, subject to the same innermost-first lookup as all other resource names. (For other tags the property list may instead be an inline dictionary, which needs no lookup.) An OCG reference inside a Form XObject resolves through the Form's Resources `/Properties`, not the page's.
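The persistence requirement amounts to keeping one marked-content stack per page rather than per stream. A minimal sketch, with the hypothetical type `MarkedContent` and with property names kept unresolved as plain strings:

```rust
/// Tracks open marked-content sequences across every stream of a page.
#[derive(Default)]
struct MarkedContent {
    open: Vec<Option<String>>, // Some(name) for /OC sequences, None otherwise
}

impl MarkedContent {
    /// Call for BMC (no property name) and BDC (optional property name).
    fn begin(&mut self, tag: &str, property_name: Option<&str>) {
        let oc = if tag == "OC" { property_name.map(str::to_string) } else { None };
        self.open.push(oc);
    }

    /// Call for EMC. Returns false for an unbalanced EMC, which is ignored.
    fn end(&mut self) -> bool {
        self.open.pop().is_some()
    }

    /// Property names of the optional-content sequences currently open.
    fn active_oc(&self) -> Vec<&str> {
        self.open.iter().flatten().map(String::as_str).collect()
    }
}
```

Nothing in this type is touched when the reader crosses a stream boundary, so the `EMC` in the second stream of the example above pops the entry pushed in the first.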
---
## Summary
Correct content stream processing requires treating the page's `/Contents` array as a single logical stream with injected newline separators at each boundary, maintaining continuous parser state (graphics stack, text object, OCG nesting) across those boundaries, and resolving every resource name through a dynamically managed stack of resource dictionaries whose depth reflects the current nesting level (page → Form XObject → Type 3 glyph). Failures in any of these three areas produce incorrect text extraction silently — wrong glyphs, missing text spans, or incorrect reading order — making them the highest-priority correctness requirements for the parser implementation.