jedarden 516ca154aa Add research: page labels, government forms, book publishing, filter decoding

Four new extraction research documents covering page label/PageLabels
number tree and outline/bookmark tree extraction, government form PDF
patterns (IRS, USCIS, court filings, classification markings), book and
publishing PDF structure (running heads, footnotes, index extraction),
and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global
segments, CCITTFax, JPX, error boundaries).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:55:08 -04:00


PDF Stream Filters, Image Compression, and Decoding for Text Extraction

Overview

PDF content streams and image XObjects are almost never stored as raw bytes: they pass through one or more compression filters before being written to the file. pdftract must reverse exactly the sequence of filters applied at write time before the content operators or raw pixel data become accessible. A single mishandled filter leaves an entire page blank; a crash inside a decoder can abort extraction for every subsequent page. This document covers each filter pdftract must support, the parameters that govern its behavior, and the error-handling discipline required to survive malformed streams.

The Filter Pipeline

The /Filter entry in a stream dictionary may be either a single name (e.g., /FlateDecode) or an array of names (e.g., [/ASCII85Decode /FlateDecode]). When an array is present, the filters are listed in the order in which they must be applied when decoding, which is the reverse of the order used during encoding: in the example, the data was deflated first and ASCII85-armored second. pdftract therefore applies decoders in array order: the first filter in the array is decoded first, its output is fed into the second decoder, and so on. The companion /DecodeParms entry mirrors this structure: either a single parameter dictionary or an array of dictionaries (or null entries for filters that take no parameters) aligned positionally with the filter array.

pdftract must treat /Filter and /DecodeParms as a paired pipeline. If /DecodeParms is shorter than /Filter or contains null entries, the corresponding decoders apply their defaults. Any count mismatch is malformed-but-recoverable: apply defaults for the unpaired stages and log the discrepancy.
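The pairing rule can be sketched as follows. This is a minimal illustration, assuming the stream dictionary has already been parsed into Python values (names as strings without the leading slash, dictionaries as dict, null as None). The names pair_filters, decode_stream, and DECODERS are hypothetical, not pdftract's actual API, and only two toy decoders are registered.

```python
import zlib

def pair_filters(filter_entry, parms_entry):
    """Return ([(filter_name, parms_dict), ...], mismatch_flag)."""
    if filter_entry is None:
        filters = []
    elif isinstance(filter_entry, str):
        filters = [filter_entry]
    else:
        filters = list(filter_entry)

    if parms_entry is None:
        parms = []
    elif isinstance(parms_entry, dict):
        parms = [parms_entry]
    else:
        parms = list(parms_entry)

    # An absent /DecodeParms is legal; any other length difference is a
    # recoverable mismatch that the caller should log.
    mismatch = bool(parms) and len(parms) != len(filters)

    pairs = []
    for i, name in enumerate(filters):
        p = parms[i] if i < len(parms) else None
        pairs.append((name, p or {}))  # null or missing entry -> defaults
    return pairs, mismatch

# Toy registry (assumption): real decoders would honor their parameters.
DECODERS = {
    "FlateDecode": lambda data, parms: zlib.decompress(data),
    "ASCIIHexDecode": lambda data, parms: bytes.fromhex(
        data.split(b">")[0].decode("ascii")
    ),
}

def decode_stream(data, filter_entry, parms_entry=None):
    """Apply each decoder in array order, feeding output to the next stage."""
    pairs, _mismatch = pair_filters(filter_entry, parms_entry)
    for name, parms in pairs:
        data = DECODERS[name](data, parms)
    return data
```

For example, a stream that was deflated and then hex-armored carries the filter array [/ASCIIHexDecode /FlateDecode], and decode_stream(raw, ["ASCIIHexDecode", "FlateDecode"]) undoes the armor before inflating.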

FlateDecode

FlateDecode is the dominant filter in modern PDFs, used for content streams, embedded font data, image data, and cross-reference streams (since PDF 1.5). The payload is a standard zlib stream (RFC 1950 wrapping deflate), so the inflate step is straightforward with any conformant zlib implementation.

The complication lies in the /Predictor parameter inside /DecodeParms. A predictor value of 1 (or absent) means no prediction was applied. A value of 2 indicates TIFF predictor 2 (horizontal differencing): each sample is stored as a delta from the previous sample on the same row. Reconstruction adds each delta to a running accumulator, column by column.
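As an illustration of the TIFF case, here is a minimal sketch that assumes 8 bits per component; other bit depths need sample-level unpacking that is omitted here. Each delta is taken against the same color component of the pixel to the left, which is why the code steps back by the number of color components. The function name undo_tiff_predictor is hypothetical.

```python
def undo_tiff_predictor(data, columns, colors=1):
    """Invert TIFF predictor 2 for 8-bit samples.

    columns is the image width in pixels, colors the components per pixel.
    Trailing bytes that do not fill a whole row are left untouched.
    """
    stride = columns * colors
    out = bytearray(data)
    for row_start in range(0, len(out) - stride + 1, stride):
        # The first pixel of each row is stored verbatim; every later
        # sample is a delta against the same component one pixel back.
        for i in range(row_start + colors, row_start + stride):
            out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)
```

A one-row grayscale example: the stored bytes 10, 1, 1, 1 reconstruct to 10, 11, 12, 13.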

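PNG predictors occupy values 10 through 15. Value 10 is None; 11 is Sub (delta from the corresponding byte of the pixel to the left); 12 is Up (delta from the byte directly above); 13 is Average (delta from the floor of (left + above) / 2); 14 is Paeth (delta from whichever of left, above, or upper-left is closest to left + above − upper-left). Value 15 means the encoder chose the predictor per row. With any PNG predictor, each row is prefixed by a single byte (0 through 4) naming the filter actually used for that row, so pdftract must read and strip that byte from every row and apply the inverse transform it names rather than relying on the /Predictor value alone.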

Correct FlateDecode requires: inflate the zlib stream, then iterate over rows applying the inverse predictor using /Columns, /Colors, and /BitsPerComponent to determine row stride. Skipping the predictor step scrambles pixel data in a way that produces garbage OCR output without triggering any obvious decode error.

LZWDecode

LZWDecode is the predecessor to FlateDecode, defined since PDF 1.0 and still present in documents from early desktop publishing. It uses LZW with 9–12 bit codes. The /EarlyChange parameter is critical: a value of 1 (default) means the encoder incremented the code width one entry early, before the table filled. A value of 0 selects late change, matching a stricter pre-PDF-1.2 interpretation. Decoding with the wrong setting produces plausible but incorrect bytes with no detectable error. LZWDecode supports the same /Predictor mechanism as FlateDecode, and pdftract must apply the identical post-decompression reconstruction.
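To make the /EarlyChange effect concrete, here is a minimal decoder sketch. It assumes the PDF conventions of MSB-first code packing, clear-table code 256, and end-of-data code 257; the only place early_change matters is the comparison that decides when the code width grows. The function name lzw_decode is hypothetical, and predictor reversal is left out.

```python
def lzw_decode(data, early_change=1):
    """Decode PDF-style LZW (9 to 12 bit codes, MSB first)."""
    CLEAR, EOD = 256, 257

    def fresh_table():
        # 256 literals plus placeholders for the two control codes.
        return [bytes([i]) for i in range(256)] + [b"", b""]

    table = fresh_table()
    width = 9
    prev = None
    out = bytearray()
    bitbuf = bitcount = 0

    for byte in data:
        bitbuf = (bitbuf << 8) | byte
        bitcount += 8
        while bitcount >= width:
            bitcount -= width
            code = (bitbuf >> bitcount) & ((1 << width) - 1)
            bitbuf &= (1 << bitcount) - 1
            if code == CLEAR:
                table, width, prev = fresh_table(), 9, None
                continue
            if code == EOD:
                return bytes(out)
            if prev is None:
                if code > 255:
                    raise ValueError("first code after clear must be a literal")
                entry = table[code]
            elif code < len(table):
                entry = table[code]
                table.append(prev + entry[:1])
            elif code == len(table):
                entry = prev + prev[:1]      # the KwKwK special case
                table.append(entry)
            else:
                raise ValueError(f"LZW code {code} outside table")
            out += entry
            prev = entry
            # With early_change=1 the width grows one entry sooner.
            if width < 12 and len(table) + early_change >= (1 << width):
                width += 1
    return bytes(out)   # no EOD marker seen: return what was decoded
```

The worked example in the PDF specification's LZW section encodes the bytes 45 45 45 45 45 65 45 45 45 66 as 80 0B 60 50 22 0C 0C 85 01 (hex), and lzw_decode(bytes.fromhex("800B6050220C0C8501")) returns those ten bytes. That sample is too short to exercise a width change, so it decodes identically under either setting.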

ASCII85Decode and ASCIIHexDecode

These filters provide ASCII armor for binary data, historically used to safely transmit PDFs over channels that corrupt eight-bit bytes. Both still appear in PDFs generated by certain print workflows.

ASCII85Decode encodes every four binary bytes as five printable characters in the range ! through u (ASCII 33–117), representing the base-85 digits of a 32-bit big-endian value. An all-zero group is represented by the single character z instead of !!!!!. The stream terminates with ~>, and whitespace is ignored throughout. A final group of fewer than four bytes is padded to four, encoded, and only the first (n+1) characters of the five-character result are emitted. pdftract must handle partial final groups, the z shortcut, and embedded whitespace.
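The three cases named above (partial final group, the z shortcut, embedded whitespace) can be sketched in a few lines. This is an illustrative decoder with hypothetical names; it treats the first ~ as the start of the ~> terminator and rejects characters outside the ! through u range.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "

def _group_to_bytes(digits):
    """Convert five base-85 digits to four big-endian bytes."""
    value = 0
    for d in digits:
        value = value * 85 + d
    if value > 0xFFFFFFFF:
        raise ValueError("ASCII85 group overflows 32 bits")
    return value.to_bytes(4, "big")

def ascii85_decode(data):
    out = bytearray()
    group = []
    for b in data:
        if b in PDF_WHITESPACE:
            continue
        if b == ord("~"):                 # start of the ~> terminator
            break
        if b == ord("z"):
            if group:
                raise ValueError("z shortcut inside a group")
            out += b"\x00\x00\x00\x00"
            continue
        if not 33 <= b <= 117:
            raise ValueError(f"invalid ASCII85 byte {b}")
        group.append(b - 33)
        if len(group) == 5:
            out += _group_to_bytes(group)
            group = []
    if group:
        if len(group) == 1:
            raise ValueError("final ASCII85 group too short")
        keep = len(group) - 1             # n+1 characters encode n bytes
        padded = group + [84] * (5 - len(group))   # pad with 'u' digits
        out += _group_to_bytes(padded)[:keep]
    return bytes(out)
```

For instance, ascii85_decode(b"87cUR~>") yields b"Hell", and the two-character final group in b"!!~>" yields the single byte 0x00.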

ASCIIHexDecode is simpler: each byte is two hex digits (upper or lower case), whitespace ignored, terminated by >. pdftract reads digit pairs until the terminator.
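A minimal sketch of that loop follows. One detail not stated above comes from the PDF specification: if the terminator arrives after an odd number of digits, the decoder behaves as if a trailing 0 followed the last digit. The function name ascii_hex_decode is hypothetical.

```python
import string

HEX_DIGITS = string.hexdigits.encode("ascii")
PDF_WHITESPACE = b"\x00\t\n\x0c\r "

def ascii_hex_decode(data):
    digits = bytearray()
    for b in data:
        if b == ord(">"):          # end-of-data marker
            break
        if b in PDF_WHITESPACE:
            continue
        if b not in HEX_DIGITS:
            raise ValueError(f"invalid hex digit {b}")
        digits.append(b)
    if len(digits) % 2:
        digits += b"0"             # odd count: assume a trailing zero digit
    return bytes.fromhex(digits.decode("ascii"))
```

So ascii_hex_decode(b"48 65 6C6c 6F>") returns b"Hello", and the lone digit in b"7>" decodes to the byte 0x70.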

DCTDecode (JPEG)

DCTDecode wraps a standard JPEG bitstream. The data is a complete JPEG file including SOI and EOI markers, so pdftract passes it directly to Tesseract without re-encoding, preserving quality and avoiding unnecessary decode-reencode cycles.

The /ColorTransform parameter controls color space interpretation. For three-component images, a value of 1 (the default) means YCbCr, requiring conversion to RGB before use; a value of 0 means the data is already RGB. For four-component CMYK JPEG, the default is 0 (no transform); a value of 1 means Adobe YCCK encoding. CMYK requires conversion to RGB before Tesseract can process it — the standard inversion formula (R = 255 − (C × (255 − K) / 255) − K, and similarly for G and B) is adequate for OCR purposes.
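The inversion formula translates directly into integer arithmetic. The sketch below assumes 8-bit components where 0 means no ink; some Adobe-produced CMYK JPEGs store inverted values, which this function does not detect. The name cmyk_to_rgb is hypothetical.

```python
def cmyk_to_rgb(c, m, y, k):
    """Naive CMYK to RGB conversion for 8-bit components (0 = no ink)."""
    r = 255 - (c * (255 - k) // 255) - k
    g = 255 - (m * (255 - k) // 255) - k
    b = 255 - (y * (255 - k) // 255) - k
    return r, g, b
```

No ink gives white, cmyk_to_rgb(0, 0, 0, 0) == (255, 255, 255); full black gives (0, 0, 0); and full cyan alone gives (0, 255, 255). Algebraically the expression equals (255 − C)(255 − K) / 255, so the result always stays within 0 to 255.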

JPEG restart markers (RST0–RST7) partition entropy-coded data into independently decodable segments; a conformant JPEG library handles them transparently.

JPXDecode (JPEG 2000)

JPXDecode wraps a JPEG 2000 bitstream, available since PDF 1.5, and is commonly used for high-resolution scans. A JPEG 2000 stream in PDF is a self-contained JP2 file that may embed an ICC color profile in its JP2 header box structure.

For OCR preprocessing, pdftract decodes the JP2 stream to a raw pixel array, applies any embedded ICC profile conversion to reach standard RGB or grayscale, and passes the result to Tesseract. The OpenJPEG library provides a well-tested open-source decoder. pdftract must treat memory allocation failure as a recoverable per-image error — JPEG 2000 images often expand to tens of megabytes — rather than a fatal condition.

JBIG2Decode

JBIG2 is a bi-level (one bit per pixel) compression standard that achieves very high compression ratios on scanned text by identifying repeated symbol shapes. It is extremely common in scanned PDFs produced by office copiers and document management systems.

PDF embeds JBIG2 as two parts: an optional global segment stream held in a separate stream object (referenced by /JBIG2Globals in /DecodeParms) containing shared symbol dictionaries, and per-page segment data in the filter stream. pdftract must fetch and retain the global stream for the lifetime of the document, prepend it to each page's local segments, and present the assembled bitstream as a single coherent JBIG2 segment sequence to the decoder.

Failing to prepend the global dictionary causes the decoder to fail on every symbol reference, producing a blank or garbage image with no clear error. The jbig2dec library handles two-part assembly correctly when segments arrive in order.
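The fetch-once, prepend-always rule might look like the sketch below. Everything here is an assumption for illustration: the class name, the load_stream callback (taken to return the fully decoded bytes of the globals stream for a given object reference), and the choice to concatenate. Some decoder libraries accept the globals as a separate argument instead, in which case only the caching part applies.

```python
class Jbig2GlobalsCache:
    """Cache /JBIG2Globals streams per document and prepend them on demand."""

    def __init__(self, load_stream):
        self._load_stream = load_stream   # hypothetical: ref -> decoded bytes
        self._cache = {}

    def assemble(self, decode_parms, image_data):
        ref = (decode_parms or {}).get("JBIG2Globals")
        if ref is None:
            return image_data             # image is self-contained
        if ref not in self._cache:
            self._cache[ref] = self._load_stream(ref)
        # Global segments first, then the image's own segments.
        return self._cache[ref] + image_data
```

Keying the cache by object reference means a document whose pages all share one globals stream resolves it once, however many images reference it.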

CCITTFaxDecode

CCITTFaxDecode encodes bi-level images using fax standards. The /K parameter selects the algorithm: 0 (the default) is Group 3 one-dimensional (T.4 1D); a positive integer is Group 3 mixed one- and two-dimensional coding, in which each 1D row may be followed by at most K − 1 rows coded in 2D; any negative value (conventionally −1) is Group 4 two-dimensional (T.6), the most compact and most common in PDFs.

/EndOfLine (default false) indicates whether each row ends with an EOL code; /EncodedByteAlign (default false) forces rows to start on byte boundaries; /Columns gives image width; /Rows gives height. pdftract must pass all four values to the CCITT decoder. An incorrect /Columns misaligns every row, producing text that appears diagonally shredded — visually obvious but not always self-diagnosing.
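One way to gather these values before calling a decoder is a small parameter record, sketched below. The class and method names are hypothetical. The defaults for /Columns (1728) and /Rows (0, meaning unspecified) come from the PDF specification, as does /BlackIs1 (default false), a further parameter that controls pixel polarity and is included here for completeness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CcittParams:
    k: int = 0
    end_of_line: bool = False
    encoded_byte_align: bool = False
    columns: int = 1728
    rows: int = 0                 # 0 means the height is not specified
    black_is_1: bool = False

    @classmethod
    def from_decode_parms(cls, parms):
        """Build from a parsed /DecodeParms dict (keys without the slash)."""
        parms = parms or {}
        return cls(
            k=parms.get("K", 0),
            end_of_line=parms.get("EndOfLine", False),
            encoded_byte_align=parms.get("EncodedByteAlign", False),
            columns=parms.get("Columns", 1728),
            rows=parms.get("Rows", 0),
            black_is_1=parms.get("BlackIs1", False),
        )

    @property
    def scheme(self):
        if self.k < 0:
            return "Group 4 (T.6)"
        if self.k == 0:
            return "Group 3 1D (T.4)"
        return "Group 3 mixed 1D/2D (T.4)"
```

A typical scanned page would produce CcittParams.from_decode_parms({"K": -1, "Columns": 2550}), whose scheme property reports Group 4.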

RunLengthDecode

RunLengthDecode uses a simple packet encoding: a byte in 0–127 means the next (byte+1) bytes are literal; 129–255 means the next single byte is repeated (257−byte) times; 128 is the end-of-data marker. This filter is rare in modern PDFs, appearing mainly in older bi-level and indexed-color images. The decoder is straightforward and unlikely to be a source of failure.
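The packet rules map onto a short loop. This sketch, with the hypothetical name run_length_decode, returns whatever was decoded if the input ends without the end-of-data byte.

```python
def run_length_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:                   # end-of-data marker
            break
        if length < 128:                    # literal run of length + 1 bytes
            out += data[i:i + length + 1]
            i += length + 1
        else:                               # repeat next byte 257 - length times
            if i >= len(data):
                break                       # truncated input
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)
```

The input bytes 02 41 42 43 FE 44 80 (hex) decode to b"ABCDDD": a three-byte literal run followed by one byte repeated three times.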

Filter Error Handling

Malformed filter data is a fact of life for any PDF reader operating on documents from diverse sources. Corrupted streams, truncated downloads, and PDF generators that miscount stream lengths all produce inputs no conformant decoder can process. pdftract must apply a consistent recovery discipline: isolate filter decoding for each image XObject in its own error boundary; on any decode failure (zlib checksum error, premature end-of-stream, invalid ASCII85 symbol, missing JBIG2 global dictionary), log the stream object number and the failure reason, mark the image undecodable, and continue processing remaining content on the page.
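A per-image error boundary can be sketched as below. The logger name, function name, and the (object number, raw bytes) input shape are assumptions for illustration. A Python-level boundary like this catches exceptions only; a hard crash inside a native decoder would need process isolation, which this sketch does not attempt.

```python
import logging
import zlib

log = logging.getLogger("pdftract.filters")   # hypothetical logger name

def decode_page_images(images, decode):
    """Decode each (object_number, raw_bytes) pair independently.

    Returns (decoded, failed): decoded maps object numbers to bytes,
    failed maps object numbers to the failure reason.
    """
    decoded, failed = {}, {}
    for objnum, raw in images:
        try:
            decoded[objnum] = decode(raw)
        except Exception as exc:    # deliberate catch-all at the boundary
            log.warning("stream object %d undecodable: %s", objnum, exc)
            failed[objnum] = str(exc)
    return decoded, failed
```

With zlib.decompress as the decoder, a corrupt stream between two good ones ends up in failed while both neighbors still decode.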

Partial decode is sometimes better than discarding the image entirely. For FlateDecode and LZWDecode, the output decompressed up to the point of failure often contains complete rows of pixel data. pdftract should attempt partial decode for these filters, keeping only the complete rows and padding the output to the expected dimensions with white pixels before passing it to Tesseract, which handles sparse input gracefully. For JBIG2 and DCT, partial data is not usable and should be discarded.
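For the FlateDecode case, salvage can be sketched with the standard zlib module by feeding the stream in small chunks and keeping whatever was produced before an error. The function names are hypothetical, and the 0xFF fill byte assumes 8-bit grayscale or RGB samples where 0xFF is white. The padding step applies to pixel rows after any predictor has been reversed.

```python
import zlib

def inflate_partial(data, chunk_size=256):
    """Return (bytes_recovered, complete_flag) for a possibly damaged stream."""
    inflater = zlib.decompressobj()
    out = bytearray()
    try:
        for i in range(0, len(data), chunk_size):
            out += inflater.decompress(data[i:i + chunk_size])
    except zlib.error:
        return bytes(out), False      # keep output from earlier chunks
    return bytes(out), inflater.eof   # eof is False for a truncated stream

def pad_to_image(pixels, stride, rows, fill=0xFF):
    """Keep only complete rows, then pad with fill bytes to rows * stride."""
    whole = (len(pixels) // stride) * stride
    kept = pixels[:min(whole, stride * rows)]
    return kept + bytes([fill]) * (stride * rows - len(kept))
```

Smaller chunks recover more data ahead of a corruption point, at the cost of more calls into the inflater.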

Invalid /DecodeParms values — unknown predictor codes, out-of-range /K, negative /Columns — must not cause panics. pdftract validates all parameters on parse, substitutes safe defaults for out-of-range values, and logs the substitution. No malformed stream should prevent extraction of text that is correctly encoded elsewhere in the same document.
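Parameter validation for the predictor entries might be sketched as follows. The function name is hypothetical. The defaults (Predictor 1, Colors 1, BitsPerComponent 8, Columns 1) are those given by the PDF specification for FlateDecode and LZWDecode, and the returned notes are what a caller would log.

```python
def sanitize_predictor_parms(parms):
    """Return (clean_parms, notes), substituting defaults for bad values."""
    clean = {"Predictor": 1, "Colors": 1, "BitsPerComponent": 8, "Columns": 1}
    checks = {
        "Predictor": lambda v: v in (1, 2, 10, 11, 12, 13, 14, 15),
        "Colors": lambda v: v >= 1,
        "BitsPerComponent": lambda v: v in (1, 2, 4, 8, 16),
        "Columns": lambda v: v >= 1,
    }
    notes = []
    for key, is_valid in checks.items():
        if key not in (parms or {}):
            continue
        value = parms[key]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if is_int and is_valid(value):
            clean[key] = value
        else:
            notes.append(f"/{key} {value!r} replaced with default {clean[key]}")
    return clean, notes
```

Given {"Predictor": 99, "Columns": -4, "Colors": 3}, the function keeps /Colors, resets the other two to their defaults, and returns two notes describing the substitutions.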
