jedarden a7673c906f Add 12 research documents covering full PDF extraction surface

Infrastructure and parsing:
- raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration,
  assisted OCR, HOCR alignment, multi-language, performance
- image-and-figure-extraction: XObjects, inline images, filter decoding,
  color spaces, geometry, form XObjects, transparency, figure detection
- form-fields-and-annotations: AcroForm types, XFA, widget appearance
  streams, rich text, annotation text, output schema
- pdf-encryption-and-security: R2-R6 key derivation, object-level
  decryption, permission flags, RustCrypto implementation approach
- page-geometry-and-document-structure: page tree, all five page boxes,
  rotation, coordinate inversion, page labels, outlines, named destinations
- optional-content-groups: OCG/OCMD visibility, usage dictionary, default
  state resolution, content stream marking, multilingual layer patterns
- invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern,
  white-on-white, zero-opacity, clipped text, color tracking
- malformed-pdf-repair-and-recovery: xref recovery, stream length repair,
  syntax tolerance, partial extraction, structured warnings

Quality and metadata:
- xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML
  parsing, conflict resolution, encrypted metadata, thumbnails
- embedded-files-and-portfolios: EmbeddedFile streams, Filespec,
  AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security
- performance-and-streaming-architecture: mmap, lazy loading, NDJSON
  streaming, rayon parallelism, font caching, axum HTTP server
- benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus
  categories, reading order scoring, regression CI, public datasets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:05:42 -04:00


Image and Figure Extraction in PDF

1. Image XObjects

PDF images are most commonly embedded as Image XObjects. An XObject is an indirect object whose dictionary contains /Type /XObject and /Subtype /Image. It is invoked from a content stream using the Do operator:

/ImageName Do

where ImageName is a name key in the current resource dictionary's /XObject subdictionary that maps to the image's indirect reference.

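The XObject dictionary normally contains the following keys (Width and Height are always required; Filter appears only when the stream data is encoded):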

Key	Type	Description
`Width`	integer	Pixel width of the raster
`Height`	integer	Pixel height of the raster
`ColorSpace`	name or array	Color space of the image samples
`BitsPerComponent`	integer	Bits per color component (1, 2, 4, 8, or 16)
`Filter`	name or array	Compression filter(s) applied to the stream

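The stream body following the dictionary contains the filtered (encoded) image data. BitsPerComponent may be omitted for image masks (/ImageMask true, where it is implicitly 1) and for JPXDecode images, where the bit depth comes from the JPEG 2000 data. JBIG2Decode and CCITTFaxDecode always deliver 1-bit samples, so BitsPerComponent is 1 for those images.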

Positioning via the CTM

The Do operator renders the image into a 1×1 unit square anchored at the origin. The current transformation matrix (CTM) at the point of Do invocation maps that unit square into page space. A canonical image placement looks like:

q
72 0 0 96 144 432 cm
/Im1 Do
Q

The cm operator concatenates [72 0 0 96 144 432] onto the CTM: this scales the unit square to 72×96 points and translates its origin to (144, 432) in page coordinates. The rendered size in page units is derived from the CTM rows: the x-extent is the length of the first row vector (a, b) and the y-extent is the length of the second row vector (c, d). When the matrix contains rotation or shear, the bounding box must be computed as the axis-aligned bounding box of the four transformed corners: (0,0), (1,0), (0,1), (1,1).

2. Inline Images

Inline images embed pixel data directly into the content stream using a three-operator sequence:

BI
  /W 320 /H 240 /CS /RGB /BPC 8 /F /DCT
ID
<binary image data>
EI

The BI (Begin Image) operator introduces the inline dictionary. Key abbreviations are standardized:

Abbreviation	Full key
`/W`	`Width`
`/H`	`Height`
`/CS`	`ColorSpace`
`/BPC`	`BitsPerComponent`
`/F`	`Filter`
`/DP`	`DecodeParms`

ID (Image Data) marks the transition from dictionary to binary payload. The parser must switch to raw byte mode immediately after the single whitespace byte following ID. The payload ends at the EI operator, but the data has no escaping mechanism, so the bytes E and I can also occur inside the payload. Reliably detecting EI requires either tracking the filter's expected byte count or scanning for EI delimited by whitespace, ideally confirming that the content stream parses cleanly from that point.
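The whitespace-delimited scan can be sketched as follows. This is a minimal illustration, not a complete parser: the function name `find_inline_image_end` is hypothetical, the input is assumed to start right after `ID` and its single whitespace byte, and the whitespace byte before `EI` is assumed not to belong to the payload. A false positive is still possible when binary data happens to contain whitespace, `EI`, whitespace.

```rust
/// Returns true for the six PDF whitespace bytes (NUL, HT, LF, FF, CR, SP).
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Scans `data` for the first `EI` that is preceded by whitespace and
/// followed by whitespace or end of input. Returns the payload length,
/// excluding the whitespace byte before `EI` (an assumption of this sketch).
fn find_inline_image_end(data: &[u8]) -> Option<usize> {
    let mut i = 1;
    while i + 1 < data.len() {
        if data[i] == b'E' && data[i + 1] == b'I' && is_pdf_whitespace(data[i - 1]) {
            let followed_ok = match data.get(i + 2) {
                None => true,
                Some(&b) => is_pdf_whitespace(b),
            };
            if followed_ok {
                return Some(i - 1);
            }
        }
        i += 1;
    }
    None
}
```

A production parser would treat the returned offset as a candidate and fall back to the next match if the operators after it fail to parse.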

Inline images are limited to simpler use cases: they cannot be referenced by name, cannot be reused across content streams, and cannot use the JBIG2Decode or JPXDecode filters. The PDF specification recommends them only for small images, on the order of 4 KB. They carry no indirect object overhead but complicate stream parsing significantly.

3. Filter Decoding

Both XObject streams and inline images may be compressed with one or more filters listed in the /Filter key (a name for a single filter, or an array for a chain). The array lists filters in the order they are applied during decoding; the encoder applied them in the reverse order.

Common filters:

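DCTDecode — JPEG (ISO 10918). The stream is a complete JPEG file (JFIF or Adobe variant). DecodeParms may specify ColorTransform (0 = no transform, 1 = YCbCr→RGB for three components or YCCK→CMYK for four). When the key is absent, the default is 1 for three-component images and 0 otherwise, and an Adobe APP14 marker in the JPEG data takes precedence over the dictionary value. Standard JPEG decoders handle the DCT coefficients, quantization, and Huffman decoding.
JPXDecode — JPEG 2000 (ISO 15444). The stream is a JP2/JPX file or a raw J2C codestream. Color space information may be embedded in the JPEG 2000 data, but it is used only when the image dictionary has no /ColorSpace key; when /ColorSpace is present, the embedded color space specifications are ignored.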
JBIG2Decode — Bi-level (1 bpp) compression. DecodeParms may contain a JBIG2Globals key whose value is a stream containing the global segment data that must be prepended before decoding. Requires a full JBIG2 decoder (e.g., the jbig2dec library via FFI).
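CCITTFaxDecode — Group 3 or Group 4 fax encoding. DecodeParms specifies K (0 = Group 3 1D, negative = Group 4, positive = Group 3 2D in which a 1D-encoded line may be followed by at most K−1 2D-encoded lines), Columns, Rows, BlackIs1, and EncodedByteAlign.
FlateDecode — zlib/deflate (RFC 1950 wrapper). DecodeParms may specify a predictor via Predictor: 2 selects TIFF Predictor 2, and 10–15 select the PNG predictors (10–14 for None/Sub/Up/Average/Paeth, 15 for a per-row optimum). With any PNG predictor, each inflated row begins with a filter-type byte that names the filter actually used for that row. After inflation, the predictor must be undone row by row.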
LZWDecode — LZW compression. DecodeParms supports EarlyChange (1 by default, meaning the code size increases one code early). The LZW variant matches the TIFF LZW convention, not GIF.
RunLengthDecode — PackBits run-length encoding. Each control byte n signals either (257-n) copies of the next byte (if n > 128) or (n+1) literal bytes (if n < 128). Byte 128 signals end-of-data.
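As an illustration of the simplest filter in the list, the RunLengthDecode rule can be sketched like this. The function name `run_length_decode` is hypothetical, and the choice to stop quietly on truncated input rather than report an error is an assumption of the sketch.

```rust
/// Decodes a RunLengthDecode stream: a control byte n < 128 copies the next
/// n + 1 bytes literally, n > 128 repeats the next byte 257 - n times, and
/// n == 128 marks end-of-data. Truncated input is tolerated (sketch choice).
fn run_length_decode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let n = input[i] as usize;
        i += 1;
        if n == 128 {
            break; // end-of-data marker
        }
        if n < 128 {
            // Literal run of n + 1 bytes, clipped to the available input.
            let end = (i + n + 1).min(input.len());
            out.extend_from_slice(&input[i..end]);
            i = end;
        } else {
            // Repeated run: 257 - n copies of the following byte.
            if let Some(&b) = input.get(i) {
                out.extend(std::iter::repeat(b).take(257 - n));
            }
            i += 1;
        }
    }
    out
}
```

For example, the bytes `[2, 'a', 'b', 'c', 254, 'x', 128]` decode to `abcxxx`.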

For chained filters — e.g., [/ASCII85Decode /FlateDecode] — the decoder applies ASCII85 first to produce binary, then inflates the result.

4. Color Spaces

The /ColorSpace value determines how to interpret decoded sample bytes.

Device spaces are the simplest: DeviceGray (1 component, 0=black, 1=white), DeviceRGB (3 components), DeviceCMYK (4 components, subtractive).

Calibrated spaces embed a viewing condition: CalGray and CalRGB specify a WhitePoint, an optional BlackPoint, and Gamma (CalRGB also takes a Matrix). Lab uses the CIE L*a*b* model with WhitePoint and Range bounds on a* and b*.

ICCBased references an embedded ICC profile stream. The N entry in the stream dictionary gives the component count (1, 3, or 4). ICC profiles provide a precise device-independent color path; for sRGB output, apply the ICC forward transform to the profile connection space (XYZ or Lab) and then convert to sRGB.

Indexed defines a palette: [/Indexed base hival lookup]. The base space specifies the color model of palette entries; hival is the maximum index (palette has hival+1 entries); lookup is either a string or stream of (hival+1) * N bytes where N is the component count of the base space. Each sample is a palette index whose width is BitsPerComponent bits (at most 8, so hival cannot exceed 255).
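A minimal sketch of Indexed expansion, assuming the row has already been filter-decoded. The names `unpack_samples` and `expand_indexed` are hypothetical; the sketch assumes `bpc` is 1, 2, 4, or 8 and that `lookup` holds at least (hival+1) * n bytes. Out-of-range indices are clamped to hival.

```rust
/// Unpacks `count` samples of `bpc` bits each (1, 2, 4, or 8) from one
/// image row, most significant bits first.
fn unpack_samples(row: &[u8], bpc: u8, count: usize) -> Vec<u8> {
    let per_byte = (8 / bpc) as usize;
    let mask = ((1u16 << bpc) - 1) as u8;
    (0..count)
        .map(|i| {
            let byte = row[i / per_byte];
            let shift = 8 - bpc as usize * (i % per_byte + 1);
            (byte >> shift) & mask
        })
        .collect()
}

/// Replaces each palette index with its `n`-component entry from `lookup`.
/// Indices above `hival` are clamped to `hival`.
fn expand_indexed(indices: &[u8], lookup: &[u8], n: usize, hival: u8) -> Vec<u8> {
    let mut out = Vec::with_capacity(indices.len() * n);
    for &idx in indices {
        let idx = idx.min(hival) as usize;
        out.extend_from_slice(&lookup[idx * n..idx * n + n]);
    }
    out
}
```

With a 2-bit image and a four-entry RGB palette, the single byte `0b0001_1011` unpacks to the indices 0, 1, 2, 3.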

Separation addresses a single named colorant: [/Separation name alternateSpace tintTransform]. The tint transform (a PDF function) maps a tint value [0,1] to the alternate space. When the target device does not support the colorant, apply the tint transform as a fallback to the alternate space (which is typically DeviceCMYK or DeviceRGB).

DeviceN generalizes Separation to multiple colorants: [/DeviceN names alternateSpace tintTransform attributes]. Each channel maps to a named colorant; the tint transform maps the N-component input to the alternate space.

For pipeline output, convert all color spaces to sRGB: DeviceGray and DeviceRGB map directly; DeviceCMYK can use the simple conversion R=1-min(1,C+K), G=1-min(1,M+K), B=1-min(1,Y+K) when no profile is available; calibrated/ICC spaces go through an XYZ intermediate.
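The CMYK formula above is small enough to show directly. This sketch uses hypothetical names (`cmyk_to_rgb`, `to_u8`) and takes components normalized to [0, 1]; it is the naive conversion, not a color-managed one.

```rust
/// Naive DeviceCMYK to RGB conversion on components in [0, 1]:
/// each channel is 1 - min(1, ink + K).
fn cmyk_to_rgb(c: f64, m: f64, y: f64, k: f64) -> [f64; 3] {
    [
        1.0 - (c + k).min(1.0),
        1.0 - (m + k).min(1.0),
        1.0 - (y + k).min(1.0),
    ]
}

/// Quantizes a normalized component to an 8-bit sample.
fn to_u8(v: f64) -> u8 {
    (v.clamp(0.0, 1.0) * 255.0).round() as u8
}
```

Pure black (K = 1) maps to (0, 0, 0), and zero ink on all four channels maps to white.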

5. Image Geometry

The CTM at Do invocation is a 3×3 affine matrix stored as six values [a b c d e f], representing:

| a  b  0 |
| c  d  0 |
| e  f  1 |

The rendered width in page units is sqrt(a² + b²) and the rendered height is sqrt(c² + d²). Rotation angle is atan2(b, a). Shear is present when the dot product (a·c + b·d) is nonzero.

DPI is computed as:

dpi_x = Width_px / (rendered_width_pts / 72.0)
dpi_y = Height_px / (rendered_height_pts / 72.0)

For rotated images, the bounding box in axis-aligned page coordinates is the AABB of the four corners produced by transforming (0,0), (1,0), (0,1), (1,1) through the CTM.
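The geometry formulas in this section can be collected into one small type. The `Matrix` struct, its methods, and `effective_dpi` are hypothetical names for this sketch; points are treated as row vectors, matching the [a b c d e f] layout above.

```rust
/// A PDF affine matrix [a b c d e f]; points are row vectors, so
/// (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
#[derive(Debug, Clone, Copy)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)
    }

    /// Rendered (width, height) of the unit square in page units.
    fn rendered_size(&self) -> (f64, f64) {
        (self.a.hypot(self.b), self.c.hypot(self.d))
    }

    /// Rotation of the image's x-axis, in degrees.
    fn rotation_degrees(&self) -> f64 {
        self.b.atan2(self.a).to_degrees()
    }

    /// Axis-aligned bounding box [x_min, y_min, x_max, y_max] of the
    /// transformed unit square.
    fn aabb(&self) -> [f64; 4] {
        let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)];
        let mut bb = [f64::INFINITY, f64::INFINITY, f64::NEG_INFINITY, f64::NEG_INFINITY];
        for &(x, y) in corners.iter() {
            let (px, py) = self.apply(x, y);
            bb[0] = bb[0].min(px);
            bb[1] = bb[1].min(py);
            bb[2] = bb[2].max(px);
            bb[3] = bb[3].max(py);
        }
        bb
    }
}

/// Effective DPI along one axis: pixels divided by rendered inches.
fn effective_dpi(pixels: u32, rendered_pts: f64) -> f64 {
    pixels as f64 / (rendered_pts / 72.0)
}
```

For the placement `72 0 0 96 144 432 cm`, `rendered_size` gives 72 × 96 points and `aabb` gives [144, 432, 216, 528]; a 300-pixel-wide image on that placement has an effective horizontal DPI of 300.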

6. Form XObjects

A /Subtype /Form XObject is a self-contained reusable content stream — not an AcroForm widget. It may embed text, images, paths, and other XObjects (including nested Form XObjects). The parser must recurse into each Form XObject encountered via Do.

The Form XObject dictionary includes:

/BBox — the bounding box of the form in its own local coordinate system.
/Matrix (optional) — a transformation applied before the form's content stream executes, in addition to the invoking CTM.
/Resources — its own resource dictionary, independent of the page's.

When a Do operator names a Form XObject, the renderer pushes the current graphics state, concatenates the form's /Matrix onto the CTM, clips to /BBox, then executes the content stream. The combined CTM (form /Matrix × invoking CTM, in PDF's row-vector convention) must be tracked at every Do invocation inside the form to correctly compute image geometry.
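A sketch of the concatenation step, assuming matrices are stored as six-element arrays [a, b, c, d, e, f] and points are row vectors. The function name `concat` is hypothetical; it computes `m × ctm`, which is what both the cm operator and form /Matrix application do.

```rust
/// Returns `m × ctm` for PDF matrices stored as [a, b, c, d, e, f].
/// With row-vector points, the result applies `m` first and `ctm` second.
fn concat(m: [f64; 6], ctm: [f64; 6]) -> [f64; 6] {
    let [a1, b1, c1, d1, e1, f1] = m;
    let [a2, b2, c2, d2, e2, f2] = ctm;
    [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]
}
```

For a page CTM that translates by (100, 200), a form /Matrix that scales by 2, and an image placed inside the form with `50 0 0 50 10 10 cm`, the nested calls `concat(image_cm, concat(form_matrix, page_ctm))` yield [100, 0, 0, 100, 120, 220]: the image's local origin (10, 10) is doubled by the form matrix and then translated by the page CTM.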

7. Soft Masks and Transparency

Images participate in the PDF transparency model in three ways:

ImageMask (boolean) — when true, the image is a 1-bpp stencil. With the default /Decode array [0 1], samples with value 0 paint the current color and samples with value 1 are transparent; /Decode [1 0] inverts this. ColorSpace is not used, and BitsPerComponent, if present, is 1.
/Mask — either a color key mask (a flat array [min1 max1 … minN maxN] giving a transparent range per component) or a reference to a 1-bpp image stream serving as a hard mask.
/SMask — a reference to a grayscale image stream interpreted as an alpha channel (0 = fully transparent, maximum sample value = fully opaque; 255 at 8 bits per component). The SMask stream is itself an Image XObject with its own filter, width, height, and ColorSpace of DeviceGray.

For figure detection purposes, an image that is pure stencil (ImageMask=true) or has very low average alpha may be a decorative overlay rather than a content figure.
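Two of these checks are easy to sketch: testing a pixel against a color key mask, and computing the mean alpha of a decoded SMask. The names `is_color_key_masked` and `mean_alpha` are hypothetical, the SMask is assumed to be 8 bits per component, and what counts as "very low" alpha is left to the caller.

```rust
/// Returns true when every component of `pixel` lies inside its
/// [min, max] range from a color key /Mask array laid out as
/// [min1, max1, ..., minN, maxN].
fn is_color_key_masked(pixel: &[u16], ranges: &[u16]) -> bool {
    pixel.len() * 2 == ranges.len()
        && pixel
            .iter()
            .zip(ranges.chunks_exact(2))
            .all(|(&s, r)| r[0] <= s && s <= r[1])
}

/// Mean alpha of an 8-bit SMask raster, normalized to [0, 1].
/// An empty mask is treated as fully opaque (sketch choice).
fn mean_alpha(smask: &[u8]) -> f64 {
    if smask.is_empty() {
        return 1.0;
    }
    smask.iter().map(|&a| a as f64).sum::<f64>() / (smask.len() as f64 * 255.0)
}
```

A figure detector might, for instance, skip images whose `mean_alpha` falls below some small cutoff; the cutoff itself is a tuning decision not fixed by this document.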

8. Detecting Figure Regions

Building the figure inventory for a page:

Walk the content stream, tracking the CTM stack. At each Do invocation for an Image XObject, compute the AABB bounding box in page coordinates and record the image metadata.
Apply size thresholds: images smaller than a minimum area threshold (e.g., 1% of page area) are likely icons or decorative glyphs; images covering more than 90% of the page are likely scanned-page backgrounds.
Apply position heuristics: watermarks are typically centered and semi-transparent; logos appear near page margins and are small; content figures appear in the body region with substantial rendered area.
Caption association: scan for text runs within a vertical proximity band (e.g., ±2× line-height) below or above the image bounding box. Text beginning with "Figure", "Fig.", or a numeric pattern is a strong caption signal. Associate the nearest qualifying text run as the figure's caption.
Detect full-page images: rendered size within 5% of the page's MediaBox and positioned at or near the origin — flag as scanned_page.
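The size thresholds and the caption prefix test from the steps above can be sketched as follows. `ImageRole`, `classify_by_area`, and `looks_like_caption` are hypothetical names; the 1% and 90% cutoffs are the example values from the text, and the position heuristics and the numeric caption pattern are left out.

```rust
#[derive(Debug, PartialEq)]
enum ImageRole {
    Decorative,
    Figure,
    ScannedPage,
}

/// Classifies an image by the ratio of its AABB area to the page area.
/// Boxes are [x_min, y_min, x_max, y_max] in page units.
fn classify_by_area(bbox: [f64; 4], page: [f64; 4]) -> ImageRole {
    let area = |r: [f64; 4]| (r[2] - r[0]).max(0.0) * (r[3] - r[1]).max(0.0);
    let page_area = area(page);
    if page_area <= 0.0 {
        return ImageRole::Figure; // degenerate page box: no basis for a ratio
    }
    let ratio = area(bbox) / page_area;
    if ratio < 0.01 {
        ImageRole::Decorative
    } else if ratio > 0.90 {
        ImageRole::ScannedPage
    } else {
        ImageRole::Figure
    }
}

/// Prefix test for caption candidates ("Figure" or "Fig.").
fn looks_like_caption(text: &str) -> bool {
    let t = text.trim_start();
    t.starts_with("Figure") || t.starts_with("Fig.")
}
```

On a US Letter page (612 × 792 points), a 468 × 300 point image covers about 29% of the page and is classified as a figure, while a 30 × 30 point image falls under the 1% cutoff.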

9. Output Representation

Each detected figure is emitted as a JSON object in the extraction output:

{
  "kind": "figure",
  "page": 3,
  "bbox": [72.0, 300.0, 540.0, 600.0],
  "width_px": 1200,
  "height_px": 900,
  "dpi": 184.62,
  "color_space": "DeviceRGB",
  "filter": ["DCTDecode"],
  "caption": "Figure 4. Loss curves over 100 training epochs.",
  "image_b64": "<base64-encoded PNG or raw raster, present only if extract_images=true>"
}

bbox is [x_min, y_min, x_max, y_max] in PDF page units (origin at bottom-left per PDF convention, or converted to top-left depending on the caller's coordinate system preference). dpi is the effective horizontal DPI rounded to two decimal places. filter is the original filter chain as decoded from the XObject dictionary. caption is null when no caption is detected.
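Converting a bbox between the two origin conventions is a single flip of the y-coordinates. This sketch uses the hypothetical name `to_top_left` and assumes the page box has its lower-left corner at (0, 0); a page whose MediaBox starts elsewhere would need that offset subtracted first.

```rust
/// Converts [x_min, y_min, x_max, y_max] from a bottom-left origin to a
/// top-left origin for a page of height `page_height` (in page units).
fn to_top_left(bbox: [f64; 4], page_height: f64) -> [f64; 4] {
    [
        bbox[0],
        page_height - bbox[3],
        bbox[2],
        page_height - bbox[1],
    ]
}
```

On a 792-point-tall page, [72, 300, 540, 600] becomes [72, 192, 540, 492]. Applying the function twice returns the original box.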

10. Extraction Use Cases and Caller Options

Decoding image bytes is expensive: a decompressed 8-bit RGB raster for a 300 DPI full-page image at A4 size (2480 × 3508 px) requires about 26 MB. The extraction pipeline exposes two caller-controlled options:

extract_images: bool — when false, only metadata (bbox, width_px, height_px, dpi, color_space, filter) is emitted. The image stream is not decompressed. This is the default for text-extraction workflows where image content is not needed.
max_image_dpi: u32 — when extract_images is true, images whose effective DPI exceeds this threshold are downsampled before encoding. The downsampled dimensions are round(rendered_width_pts / 72.0 * max_image_dpi) × round(rendered_height_pts / 72.0 * max_image_dpi). A common default is 150 DPI for document previews or 300 DPI for archival quality.
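The downsampling rule can be sketched as a pure function. The name `downsampled_dims` is hypothetical. Two details are assumptions of this sketch rather than statements from the text: the threshold is checked against the larger of the horizontal and vertical DPI, and the result is clamped so that neither dimension is ever enlarged or reduced to zero.

```rust
/// Returns the pixel dimensions to encode for an image of
/// `width_px` x `height_px` rendered at `w_pts` x `h_pts` points,
/// given a `max_dpi` cap. Images at or below the cap are unchanged.
fn downsampled_dims(
    width_px: u32,
    height_px: u32,
    w_pts: f64,
    h_pts: f64,
    max_dpi: u32,
) -> (u32, u32) {
    let cap = max_dpi as f64;
    let dpi_x = width_px as f64 / (w_pts / 72.0);
    let dpi_y = height_px as f64 / (h_pts / 72.0);
    if dpi_x.max(dpi_y) <= cap {
        return (width_px, height_px);
    }
    // round(rendered inches * max_dpi), never below 1 px or above the source.
    let w = ((w_pts / 72.0 * cap).round().max(1.0) as u32).min(width_px);
    let h = ((h_pts / 72.0 * cap).round().max(1.0) as u32).min(height_px);
    (w, h)
}
```

A 1200 × 900 px image rendered at 288 × 216 points (4 × 3 inches, 300 DPI) becomes 600 × 450 px under a 150 DPI cap and is left untouched under a 300 DPI cap.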

For very large images (e.g., 10,000×10,000 px TIFF-equivalent embedded as FlateDecode), the decoder should process row-by-row rather than inflating the entire stream into a contiguous buffer. FlateDecode with PNG predictors naturally supports row-granularity streaming: inflate one PNG-filtered row, unfilter it (applying Sub/Up/Average/Paeth as indicated by the row's filter-type byte), emit to the output buffer, then continue. This keeps peak memory bounded to roughly 2 × stride_bytes (the current and previous rows) plus the fixed inflate window, regardless of image height.
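The per-row unfilter step can be sketched as below; inflation itself is not shown. The names `paeth` and `unfilter_row` are hypothetical. `cur` is assumed to hold one row's data with the leading filter-type byte already stripped and passed as `filter`, `prev` is the previous unfiltered row (all zeros for the first row) of the same length, and `bpp` is the number of bytes per complete pixel, rounded up to at least 1.

```rust
/// Paeth predictor from the PNG specification: picks whichever of
/// left (a), up (b), or upper-left (c) is closest to a + b - c.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let (ai, bi, ci) = (a as i16, b as i16, c as i16);
    let p = ai + bi - ci;
    let (pa, pb, pc) = ((p - ai).abs(), (p - bi).abs(), (p - ci).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Reverses one PNG row filter in place. `filter` is the row's filter-type
/// byte (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth).
fn unfilter_row(filter: u8, cur: &mut [u8], prev: &[u8], bpp: usize) -> Result<(), String> {
    for i in 0..cur.len() {
        let a = if i >= bpp { cur[i - bpp] } else { 0 }; // left, already unfiltered
        let b = prev[i]; // up
        let c = if i >= bpp { prev[i - bpp] } else { 0 }; // upper-left
        let pred = match filter {
            0 => 0,
            1 => a,
            2 => b,
            3 => ((a as u16 + b as u16) / 2) as u8,
            4 => paeth(a, b, c),
            other => return Err(format!("unknown PNG filter type {}", other)),
        };
        cur[i] = cur[i].wrapping_add(pred);
    }
    Ok(())
}
```

Only `cur` and `prev` need to be resident at once: after a row is emitted, the two buffers can be swapped and the next row inflated into the freed one.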

JBIG2 and JPEG 2000 streams require external codec libraries; callers without FFI dependencies available should fall back to emitting raw stream bytes under a raw_stream_b64 key rather than failing. The filter field in the output indicates which codec is needed for the caller to decode the bytes independently.
