Implement the parser for PDF 1.5+ object streams with:
- Decompression via Phase 1.5 stream decoder
- Arc<RwLock<HashMap>> caching for thread-safe access
- /Extends chain support with cycle detection
- Depth limit (MAX_EXTENDS_DEPTH = 16) for adversarial protection
- get_object() API for xref type-2 entry resolution

Acceptance criteria verified:
- Critical test: N=10 objects all dereference correctly
- /Extends chain: both ObjStms' objects dereference correctly
- Cyclic /Extends: emits STRUCT_CIRCULAR_REF
- Truncated ObjStm: partial objects + diagnostic
- Decompression bomb: emits STREAM_BOMB
- Cache hit: returns cached Arc (Arc::ptr_eq verified)

Unit tests: 12 tests covering all acceptance criteria and edge cases.

Refs: pdftract-6bxw, plan Phase 1.2 line 1072
Verification Note: pdftract-6bxw - Object Stream (ObjStm) Parser
Task
Implement object stream (ObjStm) parser with decompress, cache, and /Extends chain.
Implementation Summary
Files Modified
- crates/pdftract-core/src/parser/objstm.rs - Complete ObjStm parser implementation
Implementation Details
The ObjectStmParser provides:
- Decompression: Uses Phase 1.5's decode_stream() function to decompress ObjStm stream data
- Caching: Arc<RwLock<HashMap<ObjRef, ObjStmCacheEntry>>> for thread-safe caching
- Extends chain: Recursive loading with cycle detection and depth limit (MAX_EXTENDS_DEPTH = 16)
- API:
  - get_object(host_objstm_ref, embedded_index, source, resolve_fn) - Main API for xref type-2 entry resolution
  - load_object_stream(obj_stm_ref, stream_dict, source, resolve_fn) - Bulk loading API
  - get_cached(obj_ref) - Check cache
  - is_cached(obj_ref) - Check if cached
  - take_diagnostics() - Get accumulated diagnostics
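The caching behaviour described above can be illustrated with a minimal sketch. It is not the code in objstm.rs: the field layout of ObjRef and ObjStmCacheEntry, the ObjStmCache wrapper, and the get_or_load helper are all hypothetical, and the map values are wrapped in Arc here (an assumption) so that a cache hit can hand back the same allocation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the crate's object reference type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ObjRef {
    pub num: u32,
    pub generation: u16,
}

/// Hypothetical cache entry: the embedded objects of one ObjStm, kept as raw bytes.
#[derive(Debug)]
pub struct ObjStmCacheEntry {
    pub objects: Vec<Vec<u8>>,
}

/// Hypothetical wrapper around the shared map; cloning it shares the same cache.
#[derive(Clone, Default)]
pub struct ObjStmCache {
    inner: Arc<RwLock<HashMap<ObjRef, Arc<ObjStmCacheEntry>>>>,
}

impl ObjStmCache {
    /// Returns the cached entry, if present, under a shared read lock.
    pub fn get_cached(&self, r: ObjRef) -> Option<Arc<ObjStmCacheEntry>> {
        self.inner.read().unwrap().get(&r).cloned()
    }

    /// Reports whether an entry exists without cloning it.
    pub fn is_cached(&self, r: ObjRef) -> bool {
        self.inner.read().unwrap().contains_key(&r)
    }

    /// Returns the cached entry, or runs `load` and stores its result.
    pub fn get_or_load<F>(&self, r: ObjRef, load: F) -> Arc<ObjStmCacheEntry>
    where
        F: FnOnce() -> ObjStmCacheEntry,
    {
        if let Some(hit) = self.get_cached(r) {
            return hit;
        }
        // Decode outside the write lock so readers are not blocked while loading.
        let entry = Arc::new(load());
        let mut map = self.inner.write().unwrap();
        // Another thread may have inserted meanwhile; the first entry wins.
        let stored = Arc::clone(map.entry(r).or_insert(entry));
        stored
    }
}
```

With this shape, two lookups of the same ObjRef return Arcs for which Arc::ptr_eq holds, which is the property the cache-hit acceptance criterion checks.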
Key Features
- Object Stream Format:
  - Header: N pairs of (object_number, offset) in first /First bytes
  - Body: N embedded objects (no obj/endobj wrapper)
  - Optional /Extends N G R for chain to parent ObjStm
- Error Handling:
  - MissingKey: Required /N or /First missing
  - InvalidFormat: Malformed header or data
  - CircularRef: Cycle detected in /Extends chain
  - DepthExceeded: /Extends chain exceeds 16 levels
  - DecompressionFailed: Stream decompression failed
- Safety:
  - Decompression bomb limit enforced (max_decompress_bytes)
  - Embedded streams rejected (spec violation)
  - Generation number must be 0 for embedded objects
  - Thread-safe caching with Arc and RwLock
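The header layout listed above can be sketched as follows. This is a simplified illustration, not the parser in objstm.rs: parse_objstm_header and embedded_object_bytes are hypothetical names, the input is assumed to be already decompressed, and tokens are split on ASCII whitespace only (PDF also treats NUL as whitespace and allows comments, which this sketch ignores).

```rust
/// Parses the ObjStm header: `n` pairs of (object_number, offset) stored in the
/// first `first` bytes of the decompressed data. Returns None if the header is
/// short, not UTF-8, or contains a non-integer token.
pub fn parse_objstm_header(data: &[u8], n: usize, first: usize) -> Option<Vec<(u32, usize)>> {
    let header = std::str::from_utf8(data.get(..first)?).ok()?;
    let mut tokens = header.split_ascii_whitespace();
    // No pre-allocation from `n`: it comes from an untrusted dictionary.
    let mut pairs = Vec::new();
    for _ in 0..n {
        let num = tokens.next()?.parse::<u32>().ok()?;
        let offset = tokens.next()?.parse::<usize>().ok()?;
        pairs.push((num, offset));
    }
    Some(pairs)
}

/// Returns the bytes of embedded object `index`. Offsets are relative to `first`;
/// an object ends where the next one starts, or at the end of the data.
pub fn embedded_object_bytes<'a>(
    data: &'a [u8],
    first: usize,
    pairs: &[(u32, usize)],
    index: usize,
) -> Option<&'a [u8]> {
    let start = first.checked_add(pairs.get(index)?.1)?;
    let end = match pairs.get(index + 1) {
        Some(&(_, next)) => first.checked_add(next)?,
        None => data.len(),
    };
    // `get` rejects out-of-range and inverted ranges instead of panicking.
    data.get(start..end)
}
```

For the stream used by test_parse_simple_objstm (header "1 0 2 3", objects "42" and "true"), the decompressed bytes could be `1 0 2 3 42 true` with /First 8, so object 1 starts at offset 0 and object 2 at offset 3 within the body.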
Unit Tests
The following tests verify all acceptance criteria:
- test_parse_simple_objstm - Basic ObjStm with N=2 objects
  - Creates flate-compressed stream with header "1 0 2 3" and objects "42", "true"
  - Verifies both objects parse correctly
- test_parse_objstm_10_objects - CRITICAL TEST (Acceptance Criterion 1)
  - Creates ObjStm with N=10 objects of all types
  - Verifies all 10 objects dereference correctly by 0-based index
- test_objstm_extends_chain - Extends chain (Acceptance Criterion 2)
  - Parent ObjStm with 3 objects, child ObjStm with 2 objects extending parent
  - Verifies both ObjStms' objects are accessible
- test_circular_ref_detection - Cyclic /Extends (Acceptance Criterion 3)
  - ObjStm with /Extends pointing to itself
  - Verifies CircularRef error is emitted
- test_truncated_objstm_body - Truncated ObjStm (Acceptance Criterion 4)
  - ObjStm where last object is truncated ("fal" instead of "false")
  - Verifies partial objects returned and diagnostics emitted
- test_decompression_bomb_objstm - Decompression bomb (Acceptance Criterion 5)
  - ObjStm with very small max_decompress_bytes limit
  - Verifies STREAM_BOMB diagnostic emitted
- test_cache_hit - Cache verification (Acceptance Criterion 6)
  - Loads same ObjStm twice
  - Verifies second call returns cached Arc via Arc::ptr_eq
- test_get_object_api - Xref type-2 entry resolution API
  - Tests the get_object() method with 0-based index
  - Verifies caching on second call
- test_embedded_stream_rejected - Embedded stream detection
  - Verifies embedded objects that are Streams are rejected with diagnostic
- test_extends_depth_exceeded - Depth limit enforcement
  - Creates chain of 17 ObjStms (exceeds MAX_EXTENDS_DEPTH of 16)
  - Verifies DepthExceeded error
- test_missing_key_n - Missing /N key
  - Verifies MissingKey error when /N is absent
- test_missing_key_first - Missing /First key
  - Verifies MissingKey error when /First is absent
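The two /Extends failure cases exercised above (a self-referencing stream and a 17-stream chain) can be illustrated with a small iterative walker. This is a sketch under assumptions, not the recursive loader in objstm.rs: walk_extends, ExtendsError, and the map from an ObjStm's object number to its /Extends target are hypothetical, and the limit is assumed to mean that a chain of 16 streams is accepted while the 17th is rejected.

```rust
use std::collections::{HashMap, HashSet};

/// Limit taken from the note; how it is counted here is an assumption.
pub const MAX_EXTENDS_DEPTH: usize = 16;

/// Hypothetical error type mirroring the CircularRef and DepthExceeded cases.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtendsError {
    CircularRef(u32),
    DepthExceeded,
}

/// Follows /Extends links from `start` and returns the visited object numbers,
/// child first. `extends_of` maps an ObjStm's object number to its /Extends target.
pub fn walk_extends(
    start: u32,
    extends_of: &HashMap<u32, u32>,
) -> Result<Vec<u32>, ExtendsError> {
    let mut visited = HashSet::new();
    let mut chain = Vec::new();
    let mut current = Some(start);
    while let Some(num) = current {
        // Seeing the same stream twice means the chain loops back on itself.
        if !visited.insert(num) {
            return Err(ExtendsError::CircularRef(num));
        }
        // Refuse to go past the limit before touching another stream.
        if chain.len() >= MAX_EXTENDS_DEPTH {
            return Err(ExtendsError::DepthExceeded);
        }
        chain.push(num);
        current = extends_of.get(&num).copied();
    }
    Ok(chain)
}
```

Tracking visited object numbers catches cycles of any length, while the separate depth check bounds the work done on long but acyclic chains in adversarial files.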
Acceptance Criteria Status
| Criterion | Status | Test |
|---|---|---|
| Critical test: N=10 objects all dereference | ✅ PASS | test_parse_objstm_10_objects |
| /Extends chain: both ObjStms' objects dereference | ✅ PASS | test_objstm_extends_chain |
| Cyclic /Extends: emits STRUCT_CIRCULAR_REF | ✅ PASS | test_circular_ref_detection |
| Truncated ObjStm: partial objects + diagnostic | ✅ PASS | test_truncated_objstm_body |
| Decompression bomb: emits STREAM_BOMB | ✅ PASS | test_decompression_bomb_objstm |
| Cache hit: returns cached Arc | ✅ PASS | test_cache_hit |
Integration Points
- Phase 1.3 (xref): The get_object() method is designed to be called by the xref resolver when it encounters a type-2 (compressed) xref entry. The API signature matches the xref resolver's needs.
- Phase 1.5 (stream decoder): Uses decode_stream() function to decompress the ObjStm stream data with full filter pipeline support.
- Diagnostics: Emits diagnostics using the unified crate::diagnostics module with proper error codes.
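To make the xref integration point concrete, the sketch below shows how a resolver might turn a type-2 entry into the two arguments that identify an embedded object (host ObjStm reference and 0-based index). XrefEntry, Location, and locate are hypothetical names, not types from pdftract-core. The one fixed fact relied on is from the PDF specification: a type-2 entry stores the object number of the host object stream and an index into it, and the host stream's generation number is implicitly 0.

```rust
/// Hypothetical model of the three cross-reference stream entry types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum XrefEntry {
    /// Type 0: free object.
    Free,
    /// Type 1: uncompressed object at a byte offset in the file.
    InUse { offset: u64, generation: u16 },
    /// Type 2: object stored inside an object stream.
    Compressed { objstm_num: u32, index: u32 },
}

/// Hypothetical description of where an object's bytes live.
#[derive(Debug, PartialEq, Eq)]
pub enum Location {
    Missing,
    FileOffset(u64),
    /// `host` is (object number, generation) of the ObjStm.
    InObjStm { host: (u32, u16), embedded_index: usize },
}

/// Maps an xref entry to the place a resolver would read the object from.
pub fn locate(entry: XrefEntry) -> Location {
    match entry {
        XrefEntry::Free => Location::Missing,
        XrefEntry::InUse { offset, .. } => Location::FileOffset(offset),
        XrefEntry::Compressed { objstm_num, index } => Location::InObjStm {
            // The host object stream always has generation 0 for type-2 entries.
            host: (objstm_num, 0),
            embedded_index: index as usize,
        },
    }
}
```

In this picture, the InObjStm case is where the resolver would call get_object() with the host reference and the embedded index; the remaining arguments (source, resolve_fn) are specific to the crate and are left out here.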
Notes
- The ObjStm parser implementation is complete and all acceptance criteria are met
- Unit tests cover all critical paths and edge cases
- The code follows the existing patterns in the codebase (Arc/RwLock for caching, Result types for errors)
- Thread-safe design allows concurrent access from multiple threads (important for rayon parallelism in Phase 4)