Graphics State Tracking for PDF Text Extraction
Correct text extraction in pdftract requires more than decoding glyph sequences. Whether a glyph is visible, what color it renders at, and where on the page it lands all depend on state that accumulates across operators in the content stream. Mishandling this state causes invisible text to contaminate output and visible text to be silently dropped.
1. The Graphics State Stack
The PDF content stream is a stateful machine. A graphics state object encapsulates every rendering parameter at a point in the stream. The `q` operator pushes a complete clone of the current state onto a stack; `Q` pops and restores it. The stack is described in ISO 32000-2, §8.4.2; the implementation limits published with PDF 1.7 (ISO 32000-1, Annex C) list a `q`/`Q` nesting depth of 28, so implementations should support at least that many levels.
The state that must be cloned on q includes:
- CTM: current transformation matrix, 6 floats `[a b c d e f]`
- Clipping path: the active clip region
- Color space and color: separately for fill and stroke
- Line parameters: width, cap style, join style, miter limit, dash pattern
- Rendering intent: a name value
- Stroke adjustment flag
- Blend mode: a name (e.g., `/Normal`, `/Multiply`)
- Soft mask: a dictionary or `/None`
- Alpha constants: `ca` (fill alpha) and `CA` (stroke alpha), both `f32` in `[0.0, 1.0]`
- Alpha is shape flag (`AIS`)
- Text state: the entire set of text parameters described in §2
A missing or shallow clone on q is a latent bug: an inner content stream that changes color or alpha will corrupt the outer stream's state after Q.
2. Text State Within the Graphics State
Text state is a subgroup of the graphics state and is saved and restored with q/Q. The text state operators and their targets:
| Operator | Parameter modified |
|---|---|
| `Tf name size` | Current font (resource name) and font size in text space |
| `Tc value` | Character spacing (added after each glyph, in text space units) |
| `Tw value` | Word spacing (added after each ASCII space, 0x20) |
| `Tz value` | Horizontal scaling, expressed as a percentage (100 = normal) |
| `TL value` | Leading, used by `T*` and `'` operators |
| `Tr value` | Text rendering mode (integer 0–7) |
| `Ts value` | Text rise, vertical offset in text space |
Text matrices are separate. The text matrix Tm and the text line matrix Tlm are not part of the graphics state. They are initialized by BT (begin text object) and are undefined outside a BT/ET pair. Td, TD, T*, and Tm modify these matrices during a text object. They are not saved or restored by q/Q. Implementations that try to persist them across q/Q will produce incorrect glyph positions.
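One way to keep the text matrices out of the `q`/`Q` path is to hold them in their own value that is created at `BT` and dropped at `ET`. The sketch below assumes a hypothetical `TextObject` type and a local `mul` helper; neither name comes from pdftract, and only `Td`, `Tm`, and `T*` are covered.

```rust
/// PDF matrix `[a, b, c, d, e, f]`, row-vector convention.
type Matrix = [f64; 6];

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Returns `m × n`, so `m` is applied to a point first.
fn mul(m: Matrix, n: Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Hypothetical holder for Tm and Tlm. It lives beside the graphics
/// state stack, not inside it, so q/Q never save or restore it.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TextObject {
    pub tm: Matrix,
    pub tlm: Matrix,
}

impl TextObject {
    /// BT: both matrices start as the identity.
    pub fn begin() -> Self {
        TextObject { tm: IDENTITY, tlm: IDENTITY }
    }

    /// Td tx ty: translate the line matrix, then set Tm = Tlm.
    pub fn td(&mut self, tx: f64, ty: f64) {
        self.tlm = mul([1.0, 0.0, 0.0, 1.0, tx, ty], self.tlm);
        self.tm = self.tlm;
    }

    /// Tm a b c d e f: replace both matrices outright.
    pub fn set_tm(&mut self, m: Matrix) {
        self.tm = m;
        self.tlm = m;
    }

    /// T*: move to the next line using the leading (TL) from the text state.
    pub fn next_line(&mut self, leading: f64) {
        self.td(0.0, -leading);
    }
}
```

Because the leading is passed in as an argument, the value set by `TL` still comes from the saved and restored text state, while the matrices themselves stay outside the stack.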
3. Color Space Tracking
The current color spaces for fill and stroke are tracked independently.
Explicit color space selection:
- `cs name`: set fill color space to a named entry from `Resources/ColorSpace`
- `CS name`: set stroke color space to a named entry from `Resources/ColorSpace`
- Device names `/DeviceRGB`, `/DeviceGray`, `/DeviceCMYK` can appear directly
Shorthand operators that set both space and color atomically:
| Operator | Color space | Arguments |
|---|---|---|
| `rg r g b` | DeviceRGB (fill) | three floats in [0,1] |
| `RG r g b` | DeviceRGB (stroke) | three floats in [0,1] |
| `g gray` | DeviceGray (fill) | one float in [0,1] |
| `G gray` | DeviceGray (stroke) | one float in [0,1] |
| `k c m y k` | DeviceCMYK (fill) | four floats in [0,1] |
| `K c m y k` | DeviceCMYK (stroke) | four floats in [0,1] |
General color operators sc/scn (fill) and SC/SCN (stroke) set the color within the currently active color space. The argument count depends on the space.
Normalized luminance for visibility. To determine whether text color contrasts with the page background (typically white), convert to a single luminance value:
- DeviceGray: luminance = `gray`
- DeviceRGB: luminance = `0.2126 * r + 0.7152 * g + 0.0722 * b` (sRGB coefficients per IEC 61966-2-1)
- DeviceCMYK: convert to RGB first: `r = (1-c)*(1-k)`, `g = (1-m)*(1-k)`, `b = (1-y)*(1-k)`, then apply the RGB formula
- CalRGB, ICCBased: use the RGB channel values after applying the color space transformation, then the RGB formula
Text with luminance near 1.0 on a white background is invisible regardless of alpha. Track fill color luminance as an `f32`; any value above approximately 0.95 against a white background should be flagged as potentially invisible.
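The conversion rules above can be sketched as a small enum with a `luminance` method. The variant layout of `ColorValue` and the `is_near_white` helper are assumptions made for illustration; only the device color spaces are covered, and the 0.95 threshold is the approximate value quoted above.

```rust
/// Hypothetical color value; the variants are illustrative and cover
/// only the three device color spaces.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColorValue {
    Gray(f32),
    Rgb(f32, f32, f32),
    Cmyk(f32, f32, f32, f32),
}

impl ColorValue {
    /// Normalized luminance in [0, 1] for components in [0, 1].
    pub fn luminance(&self) -> f32 {
        match *self {
            ColorValue::Gray(g) => g,
            ColorValue::Rgb(r, g, b) => 0.2126 * r + 0.7152 * g + 0.0722 * b,
            ColorValue::Cmyk(c, m, y, k) => {
                // Naive CMYK -> RGB conversion, then the RGB formula.
                let r = (1.0 - c) * (1.0 - k);
                let g = (1.0 - m) * (1.0 - k);
                let b = (1.0 - y) * (1.0 - k);
                ColorValue::Rgb(r, g, b).luminance()
            }
        }
    }

    /// True when the color is likely invisible on a white page.
    pub fn is_near_white(&self) -> bool {
        self.luminance() > 0.95
    }
}
```

For example, `ColorValue::Cmyk(0.0, 0.0, 0.0, 0.0)` (no ink) is treated as near white, while `ColorValue::Rgb(1.0, 0.0, 0.0)` has a luminance of about 0.21 and is not.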
4. ExtGState Dictionary
The gs name operator loads a graphics state parameter dictionary from Resources/ExtGState. This is the primary mechanism for setting transparency parameters. Keys relevant to text:
| Key | Type | Effect |
|---|---|---|
| `ca` | number | Fill (non-stroking) alpha constant, 0.0–1.0 |
| `CA` | number | Stroke alpha constant, 0.0–1.0 |
| `BM` | name or array | Blend mode |
| `SMask` | dict or `/None` | Soft mask; `/None` clears any active mask |
| `AIS` | boolean | Alpha is shape |
| `SA` | boolean | Stroke adjustment |
| `Font` | array `[ref size]` | Sets current font and size, same effect as `Tf` |
apply_gs must iterate the dictionary and update only the keys present — absent keys leave the corresponding state unchanged.
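A minimal sketch of the "update only the keys present" rule, assuming the ExtGState dictionary has already been parsed into a hypothetical `ExtGState` struct whose fields are `None` when the key is absent. The `Transparency` struct stands in for the relevant slice of the graphics state; both types and their field names are illustrative.

```rust
/// Hypothetical, already-parsed view of an ExtGState dictionary.
/// `None` means the key was absent from the dictionary.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ExtGState {
    pub ca: Option<f32>,            // `ca` key (fill alpha)
    pub ca_stroke: Option<f32>,     // `CA` key (stroke alpha)
    pub blend_mode: Option<String>, // `BM` key
    /// `Some(true)`: SMask is a dictionary; `Some(false)`: SMask is /None.
    pub smask_is_dict: Option<bool>,
}

/// Stand-in for the transparency-related part of the graphics state.
#[derive(Debug, Clone, PartialEq)]
pub struct Transparency {
    pub fill_alpha: f32,
    pub stroke_alpha: f32,
    pub blend_mode: String,
    pub soft_mask_present: bool,
}

/// Applies only the keys that are present; absent keys leave state untouched.
pub fn apply_gs(state: &mut Transparency, gs: &ExtGState) {
    if let Some(a) = gs.ca {
        state.fill_alpha = a.clamp(0.0, 1.0);
    }
    if let Some(a) = gs.ca_stroke {
        state.stroke_alpha = a.clamp(0.0, 1.0);
    }
    if let Some(bm) = &gs.blend_mode {
        state.blend_mode = bm.clone();
    }
    if let Some(is_dict) = gs.smask_is_dict {
        state.soft_mask_present = is_dict;
    }
}
```

Modelling absence as `None` keeps the "key not present" case distinct from an explicit `SMask /None`, which must actively clear the mask.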
SMask dictionary structure. When SMask is a dictionary rather than /None:
- `S`: `/Alpha` or `/Luminosity`; determines how the mask value is extracted from the group result
- `G`: a Form XObject stream that is rendered to produce the mask
- `BC`: backdrop color (array of color components)
- `TR`: transfer function applied to the mask values
5. Clipping Path Management
The initial clipping path for a page is the MediaBox (or CropBox if present). Within content streams, clipping is modified by `W` (nonzero winding rule) and `W*` (even-odd rule). These operators are clipping-path modifiers: they follow path construction and must be followed by a path-painting operator, and the new clip takes effect once that operator has run. The sequence is:
- Construct the path via `m`, `l`, `c`, `re`, etc.
- Issue `W` or `W*` to mark the path for clipping
- Issue a painting operator (`S`, `f`, `n`, etc.); use `n` to apply the clip without painting
The clip region is intersected with the constructed path shape — it can only shrink, never expand. The resulting clip becomes the new current clipping path.
For text extraction, maintaining an exact polygon clip is expensive. A practical approximation: track the clip as an axis-aligned bounding box (`[x_min, y_min, x_max, y_max]` in default user space, so that boxes recorded under different CTMs stay comparable). When `W`/`W*` fires, intersect the tracked bbox with the bounding box of the current path. The approximation is exact when the clip path is an axis-aligned rectangle, which covers the common `re W n` idiom; non-rectangular clips are edge cases flagged for further analysis.
The clipping path is fully saved and restored by q/Q.
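The bounding-box approximation reduces to three small helpers. This is a sketch under the assumption that path points have already been mapped into the same coordinate system as the tracked clip; `BBox`, `path_bbox`, `intersect`, and `is_empty` are illustrative names.

```rust
/// Axis-aligned box `[x_min, y_min, x_max, y_max]`.
pub type BBox = [f64; 4];

/// Bounding box of the points that make up the current path.
/// Returns `None` for an empty path.
pub fn path_bbox(points: &[(f64, f64)]) -> Option<BBox> {
    let (&(x0, y0), rest) = points.split_first()?;
    let mut b = [x0, y0, x0, y0];
    for &(x, y) in rest {
        b[0] = b[0].min(x);
        b[1] = b[1].min(y);
        b[2] = b[2].max(x);
        b[3] = b[3].max(y);
    }
    Some(b)
}

/// Intersection of two boxes. The result can only be as large as the
/// smaller input, which mirrors the "clip only shrinks" rule.
pub fn intersect(a: BBox, b: BBox) -> BBox {
    [a[0].max(b[0]), a[1].max(b[1]), a[2].min(b[2]), a[3].min(b[3])]
}

/// A box with no area clips everything away.
pub fn is_empty(b: BBox) -> bool {
    b[0] >= b[2] || b[1] >= b[3]
}
```

On `W` or `W*`, the interpreter would replace the tracked clip with `intersect(clip, path_bbox(points))`; a disjoint pair yields a box for which `is_empty` is true.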
6. Current Transformation Matrix
The CTM is a 3×3 affine matrix stored as 6 values `[a b c d e f]`, which fill the first two columns row by row; the third column is implicitly `[0 0 1]`. The `cm a b c d e f` operator pre-multiplies the current CTM by the new matrix:
CTM_new = [a b c d e f] × CTM_current
In row-vector convention (PDF uses row vectors), concatenation means the new transform is applied first. The implementation must preserve exact multiplication order.
Matrix concatenation:
```rust
/// Returns `m × ctm` for PDF matrices stored as `[a, b, c, d, e, f]`.
fn concat(m: [f64; 6], ctm: [f64; 6]) -> [f64; 6] {
    [
        m[0] * ctm[0] + m[1] * ctm[2],
        m[0] * ctm[1] + m[1] * ctm[3],
        m[2] * ctm[0] + m[3] * ctm[2],
        m[2] * ctm[1] + m[3] * ctm[3],
        m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
        m[4] * ctm[1] + m[5] * ctm[3] + ctm[5],
    ]
}
```
Glyph positions are computed in text space, transformed by the text matrix Tm, then by the CTM. The resulting device-space coordinates determine where the glyph appears on the page and whether it falls within the clipping bbox.
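The chain from text space to the page can be sketched as below. ISO 32000 defines the text rendering matrix as `[Tfs·Th 0 0 Tfs 0 Trise] × Tm × CTM`; the function names `text_rendering_matrix` and `transform_point` are illustrative, and `concat` is repeated here only so the block stands alone.

```rust
type Matrix = [f64; 6];

/// Returns `m × ctm` (same operation as the concat function above).
fn concat(m: Matrix, ctm: Matrix) -> Matrix {
    [
        m[0] * ctm[0] + m[1] * ctm[2],
        m[0] * ctm[1] + m[1] * ctm[3],
        m[2] * ctm[0] + m[3] * ctm[2],
        m[2] * ctm[1] + m[3] * ctm[3],
        m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
        m[4] * ctm[1] + m[5] * ctm[3] + ctm[5],
    ]
}

/// Applies `m` to the row vector `[x y 1]`.
pub fn transform_point(m: Matrix, x: f64, y: f64) -> (f64, f64) {
    (x * m[0] + y * m[2] + m[4], x * m[1] + y * m[3] + m[5])
}

/// Text rendering matrix: text state parameters, then Tm, then the CTM.
/// `horiz_scaling_pct` is the Tz value (100.0 = normal).
pub fn text_rendering_matrix(
    font_size: f64,
    horiz_scaling_pct: f64,
    rise: f64,
    tm: Matrix,
    ctm: Matrix,
) -> Matrix {
    let params = [font_size * horiz_scaling_pct / 100.0, 0.0, 0.0, font_size, 0.0, rise];
    concat(concat(params, tm), ctm)
}

/// Position of the glyph origin after all three transforms.
pub fn glyph_origin(font_size: f64, horiz_scaling_pct: f64, rise: f64, tm: Matrix, ctm: Matrix) -> (f64, f64) {
    transform_point(text_rendering_matrix(font_size, horiz_scaling_pct, rise, tm, ctm), 0.0, 0.0)
}
```

With `tm = [1 0 0 1 100 700]` and a CTM that scales by 2, the glyph origin lands at (200, 1400). Swapping the arguments to `concat` gives a different result whenever translation and scaling are mixed, which is why the multiplication order has to be preserved.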
7. Blend Mode Effects on Visibility
The blend mode controls how a graphics object composites over the content beneath it. For text extraction, the key question is whether the blend mode can render text invisible.
- `/Normal` and `/Compatible`: the source color replaces the destination at the source's alpha. At `ca=1.0`, text is fully opaque in its declared color.
- `/Multiply`: multiplies source and destination color channels. Text drawn in black (0,0,0) on any background remains black. Text drawn in white (1,1,1) becomes invisible against a white background.
- `/Screen`: `1 - (1-s)*(1-d)`. Light-colored text lightens rather than covers.
- `/Overlay`, `/HardLight`, `/SoftLight`: result depends on the luminance of the destination, which is unknown without rendering.
- `/Difference`, `/Exclusion`: text color is the absolute difference with the background.
Practical rule: if blend mode is not /Normal or /Compatible, the actual rendered color cannot be determined without knowing the destination. Flag such text as blend_mode_dependent and rely on ca as the primary visibility signal. A ca of 0.0 guarantees invisibility; any positive value with a non-Normal blend mode is ambiguous.
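The practical rule can be written as a three-way classification. The `BlendMode` variants and the `BlendVerdict` enum below are assumptions for illustration; a real `BlendMode` type would list every mode instead of collapsing the rest into `Other`.

```rust
/// Hypothetical blend mode type; `Other` stands for every mode that is
/// not listed explicitly.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlendMode {
    Normal,
    Compatible,
    Multiply,
    Screen,
    Other,
}

/// What the blend mode and fill alpha alone say about a text run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlendVerdict {
    /// `ca` is zero, so nothing is painted whatever the blend mode.
    Invisible,
    /// Normal or Compatible: the declared fill color is what is painted.
    ColorKnown,
    /// Any other mode: the result depends on the unknown destination.
    BlendModeDependent,
}

pub fn blend_verdict(mode: BlendMode, fill_alpha: f32) -> BlendVerdict {
    if fill_alpha <= 0.0 {
        return BlendVerdict::Invisible;
    }
    match mode {
        BlendMode::Normal | BlendMode::Compatible => BlendVerdict::ColorKnown,
        _ => BlendVerdict::BlendModeDependent,
    }
}
```

Only the `ColorKnown` case justifies a luminance check against the page background; for `BlendModeDependent` runs the declared color says little about what ends up on the page.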
8. Soft Mask Interaction
A soft mask applies a per-pixel transparency derived from a separately rendered Form XObject. The effective alpha at any pixel is ca * mask_value(x, y). Since mask_value is in [0.0, 1.0], the constant alpha ca is an upper bound on the effective alpha.
Fully rendering the mask Form XObject is expensive and outside the scope of a text extraction pass. The practical approach:
- When `SMask` is set to a dictionary (not `/None`), set a boolean flag `soft_mask_present: true` on the graphics state.
- Use `ca` as an upper bound on the effective alpha: if `ca == 0.0`, text is invisible regardless of the mask.
- For `ca > 0.0` with an active soft mask, text is marked `soft_mask_present` and conservatively included in output: it may be partially or fully transparent depending on the mask, but exclusion risks losing real content.
Clearing: a `gs` dictionary whose `SMask` is `/None` removes the active soft mask, so `soft_mask_present` goes back to `false`.
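A small sketch of the conservative policy above. `MaskVerdict`, `effective_alpha`, and `soft_mask_verdict` are hypothetical names; `effective_alpha` is included only to show why `ca` bounds the result, since an extraction pass would not normally have a rendered `mask_value` to feed it.

```rust
/// Hypothetical outcome of the soft-mask check for one text run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MaskVerdict {
    /// `ca` is zero: invisible whatever the mask contains.
    Invisible,
    /// No soft mask is active: include normally.
    Include,
    /// Soft mask active: include, but carry the flag downstream.
    IncludeFlagged,
}

/// Effective alpha at one pixel. Because the mask sample is clamped to
/// [0, 1], the result never exceeds `ca`.
pub fn effective_alpha(ca: f32, mask_value: f32) -> f32 {
    ca * mask_value.clamp(0.0, 1.0)
}

pub fn soft_mask_verdict(ca: f32, soft_mask_present: bool) -> MaskVerdict {
    if ca <= 0.0 {
        MaskVerdict::Invisible
    } else if soft_mask_present {
        MaskVerdict::IncludeFlagged
    } else {
        MaskVerdict::Include
    }
}
```

Keeping `IncludeFlagged` distinct from `Include` lets a later stage decide whether flagged runs deserve a rendering-based check.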
9. Form XObject Graphics State Isolation
When Do name invokes a Form XObject, the PDF processor must:
- Save the current graphics state (equivalent to `q`)
- Concatenate the Form XObject's `/Matrix` (if present) with the current CTM
- Apply the Form XObject's `/BBox` as an additional clip
- Parse the Form XObject's content stream, using its `/Resources` dictionary for name resolution
- Restore the graphics state (equivalent to `Q`) when the stream ends
Graphics state mutations inside the Form XObject — color changes, alpha updates, CTM modifications, clip changes — do not persist after the Do operator completes. Resource name resolution switches to the Form XObject's /Resources during parsing and reverts after. Failing to isolate Form XObject state is a common source of color and font state corruption.
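The five steps can be sketched with a reduced state and a closure standing in for the content-stream parser. Everything here is an assumption for illustration: `State`, `FormXObject`, `Interpreter`, and `do_form` are hypothetical names, the clip is the bounding-box approximation from §5, and resource switching is left out.

```rust
type Matrix = [f64; 6];
type BBox = [f64; 4];

/// Returns `m × ctm`.
fn concat(m: Matrix, ctm: Matrix) -> Matrix {
    [
        m[0] * ctm[0] + m[1] * ctm[2],
        m[0] * ctm[1] + m[1] * ctm[3],
        m[2] * ctm[0] + m[3] * ctm[2],
        m[2] * ctm[1] + m[3] * ctm[3],
        m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
        m[4] * ctm[1] + m[5] * ctm[3] + ctm[5],
    ]
}

fn intersect(a: BBox, b: BBox) -> BBox {
    [a[0].max(b[0]), a[1].max(b[1]), a[2].min(b[2]), a[3].min(b[3])]
}

/// Maps the four corners of `b` through `m` and returns their bounds.
fn transform_bbox(m: Matrix, b: BBox) -> BBox {
    let corners = [(b[0], b[1]), (b[2], b[1]), (b[0], b[3]), (b[2], b[3])];
    let mut out = [f64::INFINITY, f64::INFINITY, f64::NEG_INFINITY, f64::NEG_INFINITY];
    for &(x, y) in corners.iter() {
        let (px, py) = (x * m[0] + y * m[2] + m[4], x * m[1] + y * m[3] + m[5]);
        out = [out[0].min(px), out[1].min(py), out[2].max(px), out[3].max(py)];
    }
    out
}

/// Reduced graphics state, enough to show the isolation.
#[derive(Debug, Clone, PartialEq)]
pub struct State {
    pub ctm: Matrix,
    pub clip_bbox: BBox,
    pub fill_alpha: f32,
}

pub struct FormXObject {
    pub matrix: Option<Matrix>,
    pub bbox: BBox,
}

pub struct Interpreter {
    stack: Vec<State>,
}

impl Interpreter {
    pub fn new(initial: State) -> Self {
        Interpreter { stack: vec![initial] }
    }

    pub fn current(&self) -> &State {
        self.stack.last().expect("stack is never empty")
    }

    /// Handles `Do` for a Form XObject; `run_content` stands in for
    /// parsing the form's content stream against the pushed state.
    pub fn do_form<F: FnOnce(&mut State)>(&mut self, form: &FormXObject, run_content: F) {
        let mut inner = self.current().clone(); // step 1: implicit q
        if let Some(m) = form.matrix {
            inner.ctm = concat(m, inner.ctm); // step 2: /Matrix × CTM
        }
        // Step 3: clip to /BBox, mapped through the updated CTM.
        inner.clip_bbox = intersect(inner.clip_bbox, transform_bbox(inner.ctm, form.bbox));
        self.stack.push(inner);
        run_content(self.stack.last_mut().expect("just pushed")); // step 4
        self.stack.pop(); // step 5: implicit Q
    }
}
```

Whatever `run_content` does to the pushed state, `current()` returns the caller's state unchanged once `do_form` returns.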
10. Implementation: the GraphicsState Struct
```rust
// ColorSpace, ColorValue, and BlendMode are assumed to be defined
// elsewhere and to implement Clone.
#[derive(Clone)]
pub struct GraphicsState {
    // Transformation
    pub ctm: [f64; 6],
    // Color (fill)
    pub fill_color_space: ColorSpace,
    pub fill_color: ColorValue,
    pub fill_alpha: f32,
    // Color (stroke)
    pub stroke_color_space: ColorSpace,
    pub stroke_color: ColorValue,
    pub stroke_alpha: f32,
    // Transparency
    pub blend_mode: BlendMode,
    pub soft_mask_present: bool,
    // Clipping (bbox approximation in default user space)
    pub clip_bbox: Option<[f64; 4]>, // [x_min, y_min, x_max, y_max]
    // Text state
    pub text_rendering_mode: u8, // 0–7 per PDF spec
    pub text_rise: f64,
    pub font_name: Option<String>,
    pub font_size: f64,
    pub char_spacing: f64,
    pub word_spacing: f64,
    pub horiz_scaling: f64, // percentage, default 100.0
    pub leading: f64,
}

pub struct GraphicsStateStack {
    stack: Vec<GraphicsState>,
}

impl GraphicsStateStack {
    /// Starts with one entry so `current` always has a state to return.
    pub fn new(initial: GraphicsState) -> Self {
        Self { stack: vec![initial] }
    }

    /// q: push a full clone of the current state.
    pub fn save(&mut self) {
        let top = self.stack.last().expect("empty stack").clone();
        self.stack.push(top);
    }

    /// Q: pop, but never remove the base state (tolerates unbalanced Q).
    pub fn restore(&mut self) {
        if self.stack.len() > 1 {
            self.stack.pop();
        }
    }

    pub fn current(&mut self) -> &mut GraphicsState {
        self.stack.last_mut().expect("empty stack")
    }
}
```
is_text_visible must combine all signals:
```rust
impl GraphicsState {
    /// Checks fill parameters only; stroke-only text (rendering mode 1)
    /// would need the stroke alpha and color instead.
    pub fn is_text_visible(&self) -> bool {
        // Rendering mode 3 = neither fill nor stroke (invisible)
        if self.text_rendering_mode == 3 { return false; }
        // Zero alpha = invisible
        if self.fill_alpha <= 0.0 { return false; }
        // Clipping: a clip bbox with zero area leaves nothing to paint
        if let Some(bbox) = self.clip_bbox {
            if bbox[0] >= bbox[2] || bbox[1] >= bbox[3] { return false; }
        }
        // High-luminance fill on assumed white background
        if self.fill_color.luminance() > 0.95 { return false; }
        true
    }
}
```
apply_gs iterates the ExtGState dictionary entries and applies each recognized key. Unknown keys are ignored per the spec's extensibility rules.
apply_cm calls the concat function above to pre-multiply the new matrix into the current CTM.
Summary
Full graphics state tracking is not optional for accurate text extraction. The rendering mode, alpha constants, blend mode, soft mask, fill color, clipping path, and CTM each independently contribute to whether a glyph appears on the page and where. The stack mechanics of q/Q must clone the complete state. Form XObjects must isolate their state changes. Text matrices (Tm, Tlm) are separate from the graphics state and must not be conflated with it. The is_text_visible predicate synthesizes all tracked signals into a single decision that drives inclusion or exclusion of glyphs from the extraction output.