Content Stream Concatenation and Resource Resolution in PDF
Overview
PDF text extraction depends on two coupled mechanisms: correct reassembly of content streams and correct resolution of resource names within those streams. Both are more complex than a naïve reading of the spec suggests. This document covers each mechanism in the depth required for a correct Rust implementation.
1. Single vs. Multiple Content Streams
A page dictionary's /Contents entry is either a single indirect reference to a stream object or an array of such references (PDF 1.7 spec §7.8.2). Both forms are valid; generators choose based on workflow — incremental update tools often append a new content stream rather than rewriting the existing one.
    % Single stream
    /Contents 42 0 R
    % Array of streams
    /Contents [42 0 R 43 0 R 44 0 R]
When /Contents is an array, the streams are logically concatenated in order to form one content stream. The spec requires the division between streams to fall on a lexical token boundary, but it does not say that a stream must end with whitespace, so a robust reader inserts a single whitespace byte (conventionally a newline, 0x0A) between adjacent streams before parsing. The reason is token-boundary safety: a stream may end with the byte sequence BT (no trailing whitespace) and the next stream may begin with q. Raw concatenation produces BTq, which is a single unrecognized operator token; inserting 0x0A yields BT\nq, two valid tokens. Inject the newline unconditionally: looking ahead to decide whether a boundary is already safe is not reliable in the general case.
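As a minimal sketch of the seam rule above, the following eager join inserts one 0x0A between adjacent streams. The function name `join_content_streams` is hypothetical, and the sketch assumes the streams are already decompressed and small enough to buffer; it is meant to illustrate the byte-level rule only.

```rust
/// Joins already-decoded content streams, inserting a single newline (0x0A)
/// at every seam. Hypothetical helper for illustration only.
pub fn join_content_streams(streams: &[&[u8]]) -> Vec<u8> {
    // Total size: all stream bytes plus one separator per seam.
    let total: usize =
        streams.iter().map(|s| s.len()).sum::<usize>() + streams.len().saturating_sub(1);
    let mut joined = Vec::with_capacity(total);
    for (index, stream) in streams.iter().enumerate() {
        if index > 0 {
            // Injected unconditionally, even if the neighbors already
            // have whitespace at the boundary.
            joined.push(b'\n');
        }
        joined.extend_from_slice(stream);
    }
    joined
}
```

With the inputs `BT` and `q`, the joined bytes split into the two tokens `BT` and `q` rather than the single token `BTq`.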
2. Content Stream Concatenation Semantics
After concatenation (with injected newlines), the resulting byte sequence is parsed as a single logical content stream. The PDF graphics model is stateful, and that state is continuous across the boundary between adjacent streams:
- Graphics state stack: A q operator in stream N pushes a graphics state entry; the matching Q may appear in stream N+1 or any later stream. The stack depth is not reset between streams.
- Text object: A BT in stream N without a matching ET before stream N ends is technically non-conforming, but common. The parser must carry the text object state (current font, text matrix, text line matrix) across the boundary. Treat encountering BT without a prior ET as an implicit end of the previous text object followed by the start of a new one, but do not reset state that ET would not reset (e.g., the graphics state stack depth).
- Marked-content nesting: BMC/BDC and EMC markers may straddle stream boundaries (see §10 below). The nesting counter must persist across boundaries.
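The three kinds of carried state listed above can be sketched as one struct that is created once per page and never reset at a stream boundary. The names `CarriedState` and `replay` are hypothetical, and the whitespace tokenizer is a deliberate simplification (it ignores strings, arrays, and inline images) that is only adequate for this illustration.

```rust
/// Parser state that must survive the seam between two content streams.
/// Hypothetical type for illustration.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct CarriedState {
    pub gs_depth: usize,
    pub in_text_object: bool,
    pub marked_content_depth: usize,
}

impl CarriedState {
    /// Updates the counters for one operator token; operands are ignored.
    pub fn apply(&mut self, operator: &str) {
        match operator {
            "q" => self.gs_depth += 1,
            // Saturate so an unbalanced Q in a malformed file cannot underflow.
            "Q" => self.gs_depth = self.gs_depth.saturating_sub(1),
            "BT" => self.in_text_object = true,
            "ET" => self.in_text_object = false,
            "BMC" | "BDC" => self.marked_content_depth += 1,
            "EMC" => self.marked_content_depth = self.marked_content_depth.saturating_sub(1),
            _ => {}
        }
    }
}

/// Replays every stream of a page through ONE state value.
/// Assumption: tokens are separated by plain whitespace.
pub fn replay(streams: &[&str]) -> CarriedState {
    let mut state = CarriedState::default();
    for stream in streams {
        // Note: no reset here; the state simply continues into the next stream.
        for token in stream.split_whitespace() {
            state.apply(token);
        }
    }
    state
}
```

The design point is the absence of any per-stream reset: a `q` pushed in the first stream is still on the stack when the second stream begins.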
The implementation consequence: the content stream reader must operate on a logical ContentStreamReader abstraction that owns an iterator over (stream_index, byte_offset, byte) tuples rather than a flat &[u8]. The newline injection happens at the seam between streams, not in the stored data.
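One possible shape for such a reader is sketched below. It assumes the streams are already decoded and borrowed as slices; the struct layout and the choice to tag the injected newline with the index of the preceding stream and an offset one past its end are illustrative assumptions, not a prescribed design.

```rust
/// Iterates the bytes of several content streams as one logical stream,
/// yielding (stream_index, byte_offset, byte). A sketch under the assumption
/// that all streams are already decoded.
pub struct ContentStreamReader<'a> {
    streams: &'a [&'a [u8]],
    stream_index: usize,
    byte_offset: usize,
}

impl<'a> ContentStreamReader<'a> {
    pub fn new(streams: &'a [&'a [u8]]) -> Self {
        Self { streams, stream_index: 0, byte_offset: 0 }
    }
}

impl<'a> Iterator for ContentStreamReader<'a> {
    type Item = (usize, usize, u8);

    fn next(&mut self) -> Option<Self::Item> {
        let current = *self.streams.get(self.stream_index)?;
        if let Some(&byte) = current.get(self.byte_offset) {
            let item = (self.stream_index, self.byte_offset, byte);
            self.byte_offset += 1;
            return Some(item);
        }
        // End of the current stream: synthesize the seam newline, attributed
        // to the position just past the last byte of the stream that ended.
        let seam = (self.stream_index, self.byte_offset, b'\n');
        self.stream_index += 1;
        self.byte_offset = 0;
        if self.stream_index < self.streams.len() {
            Some(seam)
        } else {
            // No newline after the final stream.
            None
        }
    }
}
```

Because the newline is produced by the iterator, the stored stream data is never modified, and error messages can still report the original stream index and offset.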
3. Resources Dictionary Resolution
Every named resource referenced in a content stream — fonts, XObjects, color spaces, etc. — is resolved through a /Resources dictionary. The lookup path follows the page tree (PDF 1.7 §7.7.3.4):
- Check the page dictionary's own /Resources.
- If absent or the name is not found there, walk up through parent /Pages nodes in order, checking each /Resources entry.
- The first definition found wins: page-level overrides parent-level.
Some generators omit per-page /Resources entirely and rely on the root /Pages node's /Resources. This is permitted by the spec and must be handled. A correct implementation resolves the inherited resource dictionary chain at page load time, before processing any content stream, and caches the result as a merged view (do not re-walk the tree on every resource lookup).
The page-level resource dictionary is the outermost scope in the resource name stack (§6 below).
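The page-tree walk and the cached merged view can be sketched as follows. The `Node` type, the arena-of-nodes representation, and the use of object numbers as font values are all assumptions made for the example. Strictly, the spec's inheritance rule applies to the /Resources entry as a whole; the per-name merge here follows the lenient lookup this section describes.

```rust
use std::collections::HashMap;

/// Hypothetical page-tree node: a page or a /Pages node stored in an arena.
pub struct Node {
    /// Index of the parent /Pages node, or None for the root.
    pub parent: Option<usize>,
    /// /Resources /Font entries (name -> object number), if /Resources exists.
    pub fonts: Option<HashMap<String, u32>>,
}

/// Builds the merged font view for one page by walking up the tree once.
/// Entries found closer to the page win over inherited ones.
pub fn merged_fonts(nodes: &[Node], page: usize) -> HashMap<String, u32> {
    let mut merged = HashMap::new();
    let mut cursor = Some(page);
    let mut hops = 0;
    while let Some(node) = cursor.and_then(|index| nodes.get(index)) {
        // Bound the walk so a cyclic /Parent chain in a damaged file terminates.
        if hops == nodes.len() {
            break;
        }
        hops += 1;
        if let Some(fonts) = &node.fonts {
            for (name, object) in fonts {
                // or_insert keeps the first (innermost) definition.
                merged.entry(name.clone()).or_insert(*object);
            }
        }
        cursor = node.parent;
    }
    merged
}
```

Calling this once at page load time and storing the result avoids re-walking the tree on every lookup.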
4. Form XObject Resources
A Form XObject is an XObject stream with /Subtype /Form. It has its own /Resources dictionary (PDF 1.7 §8.10.2). When the Do operator invokes a Form XObject:
- Resource lookup inside the Form's content stream uses the Form's /Resources exclusively.
- Page resources are not visible inside the Form. The Form's content stream is a self-contained resource scope.
- A font named /F1 in the page's Resources and a font named /F1 in the Form's Resources are independent entries that may refer to different font objects.
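The common extraction failure mode: a /F1 12 Tf inside a Form XObject resolves against the Form's /Resources /Font sub-dictionary, not the page's. If the font is not declared in the Form's Resources, the resource lookup fails, even if the page declares the same name. For a Form that carries its own /Resources, the PDF spec does not provide fallback to the parent scope; only a Form with no /Resources entry at all (a legacy arrangement from PDF 1.1 and earlier) draws its named resources from the page on which it is used.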
5. Type 3 Font Glyph Stream Resources
A Type 3 font defines each glyph as a content stream stored under /CharProcs. Each glyph stream is parsed in its own resource scope: the /Resources entry in the Type 3 font dictionary (PDF 1.7 §9.6.5).
- Glyph streams do not use the page resources.
- Glyph streams do not use the Form XObject resources (even if the Type 3 font was invoked from inside a Form).
- Only the Type 3 font's own /Resources applies.
This creates a three-level nesting: page → Form XObject → Type 3 font glyph, each with its own resource scope. The resource name stack (§6) must support all three levels simultaneously.
6. The Resource Name Stack
At any point during parsing, there is an ordered stack of active resource dictionaries. Lookup proceeds from innermost (top of stack) to outermost (bottom):
    [page_resources, form_resources?, type3_font_resources?]
     ^--- bottom (outermost)          ^--- top (innermost)
In Rust:
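    // Resources, Name, and FontRef are assumed to be defined elsewhere;
    // `fonts` is a map from Name to FontRef (e.g. a HashMap).
    struct ResourceStack<'a> {
        stack: Vec<&'a Resources>,
    }

    impl<'a> ResourceStack<'a> {
        /// Searches from the innermost scope (top) to the outermost (bottom).
        fn lookup_font(&self, name: &Name) -> Option<&'a FontRef> {
            for resources in self.stack.iter().rev() {
                if let Some(font) = resources.fonts.get(name) {
                    return Some(font);
                }
            }
            None
        }

        fn push(&mut self, r: &'a Resources) { self.stack.push(r); }
        fn pop(&mut self) { self.stack.pop(); }
    }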
Push on entry to a Form XObject (Do) or Type 3 glyph stream; pop on exit. For Type 3 glyphs, "exit" is the end of the glyph's content stream. For Form XObjects, "exit" is reaching the end of the Form's content stream after the Do operator dispatched into it.
The page-level resources are always the bottom of the stack (index 0) and are never popped during page processing.
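The push-on-entry, pop-on-exit discipline is easy to get wrong when the nested stream is processed by code with several return paths. The following self-contained variant is a sketch of one way to pair the two operations; the simplified `Resources` type (font name to object number) and the `with_scope` helper are hypothetical and stand in for the real types.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a parsed /Resources dictionary.
#[derive(Default)]
pub struct Resources {
    pub fonts: HashMap<String, u32>,
}

pub struct ResourceStack<'a> {
    stack: Vec<&'a Resources>,
}

impl<'a> ResourceStack<'a> {
    /// The page resources become index 0 and stay there for the whole page.
    pub fn new(page: &'a Resources) -> Self {
        Self { stack: vec![page] }
    }

    /// Innermost-first lookup, as in the struct shown earlier.
    pub fn lookup_font(&self, name: &str) -> Option<u32> {
        self.stack.iter().rev().find_map(|r| r.fonts.get(name).copied())
    }

    /// Pushes `scope`, runs `body`, then pops, so the two always pair up.
    /// Not panic-safe: a panic inside `body` skips the pop.
    pub fn with_scope<T>(
        &mut self,
        scope: &'a Resources,
        body: impl FnOnce(&mut Self) -> T,
    ) -> T {
        self.stack.push(scope);
        let result = body(self);
        self.stack.pop();
        result
    }

    pub fn depth(&self) -> usize {
        self.stack.len()
    }
}
```

A Do handler would call `with_scope` with the Form's resources and interpret the Form's content stream inside the closure; the stack is back at its previous depth when the call returns.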
7. Operators That Reference Resources
Every operator that references a resource name must resolve that name through the current resource stack:
| Operator (operands first, as written in the stream) | Resource sub-dictionary | Key |
|---|---|---|
| name size Tf | /Font | font name |
| name Do | /XObject | XObject name |
| name cs | /ColorSpace | color space name |
| name CS | /ColorSpace | color space name |
| name gs | /ExtGState | graphics state name |
| name sh | /Shading | shading name |
| scn/SCN (with pattern name argument) | /Pattern | pattern name |
| /OC name BDC | /Properties | marked-content property |
The Tf operator is the most critical for text extraction: it sets the current font, which determines how character codes in subsequent Tj/TJ/'/" operators map to Unicode. An unresolved font name must be surfaced as a recoverable error — the parser should log the failure and skip character mapping for the affected text spans rather than aborting page extraction.
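The operator-to-sub-dictionary mapping and the recoverable-error policy can be sketched together. Everything named here (`Category`, `ResolveError`, `ResourceMap`, `resolve`) is hypothetical, and resources are flattened into a single map keyed by category and name purely to keep the example short.

```rust
use std::collections::HashMap;

/// Resource sub-dictionaries that content stream operators refer to.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Category { Font, XObject, ColorSpace, ExtGState, Shading, Pattern, Properties }

/// Maps an operator to the sub-dictionary its name operand is looked up in.
/// For scn/SCN the caller is assumed to have checked that the last operand
/// is a name (i.e. a pattern is being selected).
pub fn category_for(operator: &str) -> Option<Category> {
    match operator {
        "Tf" => Some(Category::Font),
        "Do" => Some(Category::XObject),
        "cs" | "CS" => Some(Category::ColorSpace),
        "gs" => Some(Category::ExtGState),
        "sh" => Some(Category::Shading),
        "scn" | "SCN" => Some(Category::Pattern),
        "BDC" => Some(Category::Properties),
        _ => None,
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    Unresolved { category: Category, name: String },
}

/// Flattened resource view: (category, name) -> object number.
pub type ResourceMap = HashMap<(Category, String), u32>;

/// Resolves a name; on failure records a warning and returns None so the
/// caller can skip the affected span instead of aborting the page.
pub fn resolve(
    resources: &ResourceMap,
    operator: &str,
    name: &str,
    warnings: &mut Vec<ResolveError>,
) -> Option<u32> {
    let category = category_for(operator)?;
    match resources.get(&(category, name.to_string())) {
        Some(&object) => Some(object),
        None => {
            warnings.push(ResolveError::Unresolved { category, name: name.to_string() });
            None
        }
    }
}
```

Collecting failures in a `warnings` list rather than returning early keeps extraction going for the rest of the page, which matches the policy described above for unresolved fonts.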
8. Inline Images vs. XObjects in Streams
Inline images are encoded directly in the content stream bytes between BI, ID, and EI operators (PDF 1.7 §8.9.7). The image data between ID and EI is raw binary and may contain arbitrary bytes, including sequences that look like PDF operators.
The EI token is identified as an EI byte pair preceded by whitespace and followed by whitespace or end of stream. Do not treat a bare EI byte pair inside the image data as the terminator.
When the image parameters include /W (width), /H (height), and /BPC (bits per component) and no filter is applied, the data length is exactly ceil(W * N * BPC / 8) * H bytes, where N is the number of color components in the image's color space (1 for /G, 3 for /RGB, 4 for /CMYK, and 1 for an image mask). Skip the single whitespace byte after ID, read that many bytes, and then expect EI. With a filter (/F or /Filter), the encoded length is not known in advance: the data ends at the first whitespace-delimited EI token in the raw, still-encoded stream bytes, because the delimiter belongs to the content stream and not to the decoded image. For malformed streams the safe fallback is the same scan: skip the single whitespace byte after ID, then read bytes until a valid whitespace-delimited EI is found, bounded by a heuristic maximum scan length.
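Both strategies can be sketched in a few lines. The function names and the `max_scan` parameter are hypothetical; `start` is assumed to be the index of the first image data byte, that is, the byte after the single whitespace that follows ID. The scan is a heuristic and can still be fooled by binary data that happens to contain a whitespace-delimited EI.

```rust
/// PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Exact data length of an unfiltered inline image; rows are padded to a
/// whole number of bytes.
pub fn unfiltered_len(width: usize, height: usize, bpc: usize, components: usize) -> usize {
    let row_bits = width * components * bpc;
    ((row_bits + 7) / 8) * height
}

/// Scans `data[start..]` for a whitespace-delimited EI and returns the index
/// of its 'E'. Gives up after `max_scan` bytes.
pub fn find_ei(data: &[u8], start: usize, max_scan: usize) -> Option<usize> {
    let end = data.len().min(start.saturating_add(max_scan));
    // Start no earlier than index 1 so data[i - 1] is always valid.
    let mut i = start.max(1);
    while i + 1 < end {
        let is_pair = data[i] == b'E' && data[i + 1] == b'I';
        if is_pair
            && is_pdf_whitespace(data[i - 1])
            && (i + 2 == data.len() || is_pdf_whitespace(data[i + 2]))
        {
            return Some(i);
        }
        i += 1;
    }
    None
}
```

For the bytes `ID EIA EI Q` with `start = 3`, the first `EI` is rejected because it is followed by `A`, and the scan returns index 7.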
9. Pages with Many Content Streams
Some generators produce pages with dozens or hundreds of content streams — one per drawing element, update layer, or annotation. This is legal. The page-level concatenation must handle an arbitrary-length /Contents array.
Memory management is essential: do not decompress all streams into memory before parsing. Use a lazy ContentStreamSource that implements Iterator<Item = Result<DecompressedStream>> and decompresses each stream on demand. The parser reads from one stream at a time, injecting the newline separator at each boundary, without buffering more than one stream's bytes in memory.
Complex technical drawings (CAD exports, large engineering PDFs) can produce per-page content totaling hundreds of megabytes after decompression. A pull-based streaming parser is the only viable architecture.
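A lazy source of this kind can be sketched without committing to a particular decompression library by taking the decoder as a closure. The struct layout, the `String` error type, and the closure-based decoder are assumptions for illustration; a real implementation would dispatch on each stream's /Filter entry.

```rust
/// Yields decoded content streams one at a time; nothing is decoded until
/// the consumer asks for the next item. Hypothetical sketch.
pub struct ContentStreamSource<'a, F> {
    raw: std::slice::Iter<'a, Vec<u8>>,
    decode: F,
}

impl<'a, F> ContentStreamSource<'a, F>
where
    F: FnMut(&[u8]) -> Result<Vec<u8>, String>,
{
    /// `raw` holds the still-encoded bytes of each stream in /Contents order.
    pub fn new(raw: &'a [Vec<u8>], decode: F) -> Self {
        Self { raw: raw.iter(), decode }
    }
}

impl<'a, F> Iterator for ContentStreamSource<'a, F>
where
    F: FnMut(&[u8]) -> Result<Vec<u8>, String>,
{
    type Item = Result<Vec<u8>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let encoded = self.raw.next()?;
        // Decoding happens here, on demand, for exactly one stream.
        Some((self.decode)(encoded.as_slice()))
    }
}
```

The parser pulls one decoded stream, consumes it, drops it, and only then requests the next, so peak memory is bounded by the largest single stream rather than by the page total.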
10. Optional Content in Content Streams
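Optional Content Groups (OCGs) control the visibility of content regions. An optional-content region is opened with a BDC operator whose tag is /OC and whose second operand names an entry in /Properties, and it is closed with EMC (end marked content). BMC takes no property list, so it cannot reference an OCG, but it shares the same nesting stack. A region opened in stream N may be closed in stream N+1: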
    % stream 1
    /OC /MC0 BDC
    BT /F1 12 Tf (Hello) Tj ET
    % stream 1 ends here, no EMC
    % stream 2
    BT /F1 12 Tf (World) Tj ET
    EMC
The OCG nesting depth and the current active OCG stack must persist across stream boundaries, just like the graphics state stack. An EMC in stream 2 closes the region opened in stream 1.
The property-list name given as the second BDC operand (/MC0 in the example above) is resolved through the current resource stack's /Properties sub-dictionary, subject to the same innermost-first lookup as all other resource names. An OCG reference inside a Form XObject resolves through the Form's Resources /Properties, not the page's.
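A marked-content stack that tracks OCG visibility across stream boundaries might look like the sketch below. The type names are hypothetical, the /Properties lookup is reduced to a map from name to a precomputed visibility flag, and treating an unresolved name as visible is an assumption chosen so that a lookup failure does not silently drop text.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Entry {
    /// Opened by BMC, or by BDC with a tag other than /OC.
    Plain,
    /// Opened by `/OC name BDC`.
    Optional { visible: bool },
}

/// Lives for the whole page; never cleared at a stream boundary.
#[derive(Default)]
pub struct MarkedContentStack {
    entries: Vec<Entry>,
}

impl MarkedContentStack {
    pub fn begin_plain(&mut self) {
        self.entries.push(Entry::Plain);
    }

    /// `properties` maps a /Properties name to the OCG's visibility.
    pub fn begin_optional(&mut self, name: &str, properties: &HashMap<String, bool>) {
        // Assumption: an unresolved name is treated as visible.
        let visible = properties.get(name).copied().unwrap_or(true);
        self.entries.push(Entry::Optional { visible });
    }

    /// Handles EMC; returns false if there was nothing to close.
    pub fn end(&mut self) -> bool {
        self.entries.pop().is_some()
    }

    /// Content is visible only if no enclosing optional region is hidden.
    pub fn is_visible(&self) -> bool {
        self.entries
            .iter()
            .all(|e| !matches!(e, Entry::Optional { visible: false }))
    }

    pub fn depth(&self) -> usize {
        self.entries.len()
    }
}
```

In the two-stream example above, if /MC0 maps to a hidden OCG, both (Hello) and (World) are extracted with `is_visible()` returning false, because the region is still open when stream 2 begins.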
Summary
Correct content stream processing requires treating the page's /Contents array as a single logical stream with injected newline separators at each boundary, maintaining continuous parser state (graphics stack, text object, OCG nesting) across those boundaries, and resolving every resource name through a dynamically managed stack of resource dictionaries whose depth reflects the current nesting level (page → Form XObject → Type 3 glyph). Failures in any of these three areas produce incorrect text extraction silently — wrong glyphs, missing text spans, or incorrect reading order — making them the highest-priority correctness requirements for the parser implementation.