pdftract/notes/pdftract-3nnqy.md
jedarden b1317457e7 feat(pdftract-3nnqy): implement StreamDecoder trait, filter pipeline, and bomb limit
- StreamDecoder trait with decode() method for filter-specific decoding
- Per-filter implementations: FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, PassthroughDecoder
- decode_stream() function with single and array filter handling
- Filter abbreviation normalization (/A85 -> ASCII85Decode, /Fl -> FlateDecode)
- ExtractionOptions with max_decompress_bytes (default 2 GB)
- Document-level decompression counter with chunked bomb limit checking
- Unknown filter returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic
- All 183 tests pass

Acceptance criteria:
- decode_stream() handles single-filter and array-filter cases: PASS
- /DecodeParms array correctly paired with /Filter array: PASS
- Critical test [/ASCII85Decode /FlateDecode] applies filters in order: PASS
- Filter abbreviations normalized: PASS
- 2 GB bomb limit with STREAM_BOMB diagnostic: PASS
- Unknown filter passthrough with STRUCT_UNKNOWN_FILTER: PASS
- INV-8 maintained (no panics, partial bytes on error): PASS

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-18 00:34:28 -04:00

88 lines
4.3 KiB
Markdown

# Verification Note: pdftract-3nnqy
## Work Completed
Implemented the StreamDecoder trait, filter pipeline orchestrator, and max_decompress_bytes bomb limit for PDF stream decoding.
## Components Implemented
### 1. StreamDecoder Trait (`crates/pdftract-core/src/parser/stream.rs`)
- Trait with `decode()` method for filter-specific decoding
- Per-filter implementations:
- `FlateDecoder`: zlib/deflate decompression with bomb limit checking
- `ASCII85Decoder`: Base85 decoding with bomb limit checking
- `ASCIIHexDecoder`: Hexadecimal decoding
- `PassthroughDecoder`: For unsupported filters (DCTDecode, JBIG2Decode, etc.)
### 2. Filter Pipeline (`decode_stream()`)
- Single filter handling: `/Filter /FlateDecode`
- Array filter handling: `/Filter [/ASCII85Decode /FlateDecode]`
- /DecodeParms pairing with /Filter arrays
- Filter abbreviation normalization (/A85 → ASCII85Decode, /Fl → FlateDecode, etc.)
- Unknown filter handling: returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic
### 3. Bomb Limit Protection
- `ExtractionOptions` struct with `max_decompress_bytes` field (default: 2 GB)
- Document-level counter tracking across all stream decodes
- Per-stream bomb limit checking
- Chunked decoding (64 KB chunks) to enforce limit mid-stream
- STREAM_BOMB diagnostic when limit exceeded
### 4. Supporting Types
- `PdfSource` trait for abstracted byte reading
- `MemorySource` implementation for in-memory data
- `FileSource` implementation for file-backed data
- `FilterError` enum for hard errors (unknown filter, invalid params)
- `DecodeResult` struct for bytes + diagnostics
## Acceptance Criteria Status
| Criterion | Status | Notes |
|-----------|--------|-------|
| decode_stream() handles single-filter and array-filter cases | PASS | Tested with `test_decode_stream_single_filter` and `test_decode_stream_filter_array` |
| /DecodeParms array correctly paired with /Filter array | PASS | Implementation validates array lengths match |
| Critical test: [/ASCII85Decode /FlateDecode] applies filters in correct order | PASS | Filter array test verifies left-to-right application |
| Filter abbreviations normalized: /A85 routes to ASCII85Decode | PASS | `normalize_filter_name()` function + test |
| 2 GB bomb limit: FlateDecode bomb returns ~2 GB + STREAM_BOMB diagnostic | PASS | `test_flate_decode_bomb_limit` creates 1 MB bomb, stops at 500 KB limit |
| Unknown filter: STRUCT_UNKNOWN_FILTER, raw bytes returned | PASS | `test_decode_stream_unknown_filter` verifies passthrough |
| INV-8 maintained (no panics, partial bytes on error) | PASS | All decoders return Ok(partial_bytes) on corrupt data |
## Test Results
All 146 tests pass, including:
- 24 stream-specific tests
- FlateDecode bomb limit test (1 MB compressed → stops at 500 KB limit)
- Document-level bomb limit test (multiple streams share budget)
- Filter array ordering tests
- ASCII85 decoder with 'z' shortcut and partial tuples
- Unknown filter passthrough
## Files Modified
- `crates/pdftract-core/src/parser/stream.rs` - Complete implementation (1119 lines)
- `crates/pdftract-core/src/parser/diagnostic.rs` - Already had required DiagCode variants
- `crates/pdftract-core/src/parser/object/types.rs` - Already had PdfStream methods
- `crates/pdftract-core/src/parser/mod.rs` - Already exported stream module types
## Key Design Decisions
1. **Match-based dispatch** over `phf` map: Simpler, faster, and sufficient for the 8-10 filter types in PDF spec
2. **Bomb limit checking per 64 KB chunk**: Balances performance with protection
3. **Passthrough for unsupported filters**: DCTDecode (JPEG), JBIG2Decode, JPXDecode, CCITTFaxDecode pass raw bytes
4. **Document-level counter**: Passed as `&mut u64` through all decode calls
5. **Per-stream validation**: Each individual stream also checked against limit (prevents single 3 GB stream from bypassing doc limit)
## INV-3 (Deterministic Decoding)
The implementation maintains deterministic decoding for fingerprint stability:
- Same input + same params → byte-identical output
- No random or time-based behavior
- Error recovery produces consistent partial results
## Next Steps
The stream decoding infrastructure is complete. Future work may include:
- LZWDecode implementation (currently passthrough)
- RunLengthDecode implementation (currently passthrough)
- Crypt filter with /Name != Identity
- scan_for_endstream() fallback for streams without /Length