pdftract/notes/pdftract-3nnqy.md
jedarden b1317457e7 feat(pdftract-3nnqy): implement StreamDecoder trait, filter pipeline, and bomb limit
- StreamDecoder trait with decode() method for filter-specific decoding
- Per-filter implementations: FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, PassthroughDecoder
- decode_stream() function with single and array filter handling
- Filter abbreviation normalization (/A85 -> ASCII85Decode, /Fl -> FlateDecode)
- ExtractionOptions with max_decompress_bytes (default 2 GB)
- Document-level decompression counter with chunked bomb limit checking
- Unknown filter returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic
- All 183 tests pass

Acceptance criteria:
- decode_stream() handles single-filter and array-filter cases: PASS
- /DecodeParms array correctly paired with /Filter array: PASS
- Critical test [/ASCII85Decode /FlateDecode] applies filters in order: PASS
- Filter abbreviations normalized: PASS
- 2 GB bomb limit with STREAM_BOMB diagnostic: PASS
- Unknown filter passthrough with STRUCT_UNKNOWN_FILTER: PASS
- INV-8 maintained (no panics, partial bytes on error): PASS

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-18 00:34:28 -04:00

4.3 KiB

Verification Note: pdftract-3nnqy

Work Completed

Implemented the StreamDecoder trait, filter pipeline orchestrator, and max_decompress_bytes bomb limit for PDF stream decoding.

Components Implemented

1. StreamDecoder Trait (crates/pdftract-core/src/parser/stream.rs)

  • Trait with decode() method for filter-specific decoding
  • Per-filter implementations:
    • FlateDecoder: zlib/deflate decompression with bomb limit checking
    • ASCII85Decoder: Base85 decoding with bomb limit checking
    • ASCIIHexDecoder: Hexadecimal decoding
    • PassthroughDecoder: For unsupported filters (DCTDecode, JBIG2Decode, etc.)

2. Filter Pipeline (decode_stream())

  • Single filter handling: /Filter /FlateDecode
  • Array filter handling: /Filter [/ASCII85Decode /FlateDecode]
  • /DecodeParms pairing with /Filter arrays
  • Filter abbreviation normalization (/A85 → ASCII85Decode, /Fl → FlateDecode, etc.)
  • Unknown filter handling: returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic

3. Bomb Limit Protection

  • ExtractionOptions struct with max_decompress_bytes field (default: 2 GB)
  • Document-level counter tracking across all stream decodes
  • Per-stream bomb limit checking
  • Chunked decoding (64 KB chunks) to enforce limit mid-stream
  • STREAM_BOMB diagnostic when limit exceeded

4. Supporting Types

  • PdfSource trait for abstracted byte reading
  • MemorySource implementation for in-memory data
  • FileSource implementation for file-backed data
  • FilterError enum for hard errors (unknown filter, invalid params)
  • DecodeResult struct for bytes + diagnostics

Acceptance Criteria Status

Criterion Status Notes
decode_stream() handles single-filter and array-filter cases PASS Tested with test_decode_stream_single_filter and test_decode_stream_filter_array
/DecodeParms array correctly paired with /Filter array PASS Implementation validates array lengths match
Critical test: [/ASCII85Decode /FlateDecode] applies filters in correct order PASS Filter array test verifies left-to-right application
Filter abbreviations normalized: /A85 routes to ASCII85Decode PASS normalize_filter_name() function + test
2 GB bomb limit: FlateDecode bomb returns ~2 GB + STREAM_BOMB diagnostic PASS test_flate_decode_bomb_limit creates 1 MB bomb, stops at 500 KB limit
Unknown filter: STRUCT_UNKNOWN_FILTER, raw bytes returned PASS test_decode_stream_unknown_filter verifies passthrough
INV-8 maintained (no panics, partial bytes on error) PASS All decoders return Ok(partial_bytes) on corrupt data

Test Results

All 146 tests pass, including:

  • 24 stream-specific tests
  • FlateDecode bomb limit test (1 MB compressed → stops at 500 KB limit)
  • Document-level bomb limit test (multiple streams share budget)
  • Filter array ordering tests
  • ASCII85 decoder with 'z' shortcut and partial tuples
  • Unknown filter passthrough

Files Modified

  • crates/pdftract-core/src/parser/stream.rs - Complete implementation (1119 lines)
  • crates/pdftract-core/src/parser/diagnostic.rs - Already had required DiagCode variants
  • crates/pdftract-core/src/parser/object/types.rs - Already had PdfStream methods
  • crates/pdftract-core/src/parser/mod.rs - Already exported stream module types

Key Design Decisions

  1. Match-based dispatch over phf map: Simpler, faster, and sufficient for the 8-10 filter types in PDF spec
  2. Bomb limit checking per 64 KB chunk: Balances performance with protection
  3. Passthrough for unsupported filters: DCTDecode (JPEG), JBIG2Decode, JPXDecode, CCITTFaxDecode pass raw bytes
  4. Document-level counter: Passed as &mut u64 through all decode calls
  5. Per-stream validation: Each individual stream also checked against limit (prevents single 3 GB stream from bypassing doc limit)

INV-3 (Deterministic Decoding)

The implementation maintains deterministic decoding for fingerprint stability:

  • Same input + same params → byte-identical output
  • No random or time-based behavior
  • Error recovery produces consistent partial results

Next Steps

The stream decoding infrastructure is complete. Future work may include:

  • LZWDecode implementation (currently passthrough)
  • RunLengthDecode implementation (currently passthrough)
  • Crypt filter with /Name != Identity
  • scan_for_endstream() fallback for streams without /Length