- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Verification Note: Streaming/Lazy Decode (bf-2y2rp)

## Task Summary

Ensure the default extraction path decodes streams lazily per page and drops them; NDJSON/PageIter streaming mode must keep peak RSS flat across page count (target <256MB on the 10k-page fixture). Verify no path holds all decoded streams resident at once.

## Changes Made

### 1. Added Lazy Stream Decoding Function (`extract.rs`)

Created `decode_page_content_streams()` function that:
- Decodes content streams for a single page
- Returns concatenated decoded bytes
- Drops each stream immediately after processing
- Enforces bomb limits via `max_decompress_bytes` parameter

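The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `extract.rs`: the `DecodeError` type, the `raw_streams` iterator, and the `decode` closure are hypothetical stand-ins for however the real function locates and decompresses stream data. Only the function name and the `max_decompress_bytes` parameter come from this note.

```rust
/// Hypothetical error type; the crate's real error type is not shown in this note.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    /// The concatenated output would exceed `max_decompress_bytes`.
    LimitExceeded { limit: usize },
}

/// Decode one page's content streams and return the concatenated bytes.
///
/// `raw_streams` and `decode` are assumptions made for this sketch.
pub fn decode_page_content_streams<'a, I, F>(
    raw_streams: I,
    decode: F,
    max_decompress_bytes: usize,
) -> Result<Vec<u8>, DecodeError>
where
    I: IntoIterator<Item = &'a [u8]>,
    F: Fn(&[u8]) -> Vec<u8>,
{
    let mut out = Vec::new();
    for raw in raw_streams {
        // `decoded` is owned by this iteration only and is freed at the
        // end of the loop body, before the next stream is decoded.
        let decoded = decode(raw);

        // Bomb limit on the page total. This sketch checks after decoding;
        // a real decoder would ideally stop during decompression.
        if out.len().saturating_add(decoded.len()) > max_decompress_bytes {
            return Err(DecodeError::LimitExceeded { limit: max_decompress_bytes });
        }
        out.extend_from_slice(&decoded);
    }
    Ok(out)
}
```

Passing an identity closure such as `|raw| raw.to_vec()` is enough to exercise the limit check without a real decompressor.
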
### 2. Updated `extract_page_from_dict()` Function

Modified to:
- Accept optional `source` and `resolver` parameters for lazy decoding
- Call `decode_page_content_streams()` when these parameters are provided
- Ensure decoded streams are dropped before returning `PageResult`
- Document the lazy decode behavior

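A rough sketch of how optional parameters and scoping can combine so that decoded bytes never outlive the call. Every type here (`Source`, `Resolver`, `PageDict`, `PageResult`) is a simplified, hypothetical stand-in, and the fallback of returning empty text when the parameters are absent is an assumption of this sketch, not documented behavior.

```rust
/// Hypothetical stand-ins for the real types, which this note does not show.
pub struct Source {
    pub bytes: Vec<u8>,
}
pub struct Resolver;
pub struct PageDict {
    /// Byte ranges (start, end) of this page's content streams in `Source`.
    pub content_ranges: Vec<(usize, usize)>,
}
#[derive(Debug, PartialEq)]
pub struct PageResult {
    pub text: String,
}

impl Resolver {
    /// Pretend "decode": copies the byte range out of the source.
    fn decode(&self, source: &Source, range: (usize, usize)) -> Vec<u8> {
        source.bytes[range.0..range.1].to_vec()
    }
}

pub fn extract_page_from_dict(
    page: &PageDict,
    source: Option<&Source>,
    resolver: Option<&Resolver>,
) -> PageResult {
    let text = match (source, resolver) {
        (Some(source), Some(resolver)) => {
            // The decoded buffer is owned by this block only.
            let mut decoded = Vec::new();
            for &range in &page.content_ranges {
                decoded.extend(resolver.decode(source, range));
            }
            let text = String::from_utf8_lossy(&decoded).into_owned();
            // `decoded` is dropped at the end of this block; only `text` escapes.
            text
        }
        // Assumption: without both parameters there is nothing to decode.
        _ => String::new(),
    };
    PageResult { text }
}
```

The point of the inner block is that the `PageResult` owns only extracted output, so a caller that keeps many results does not also keep decoded stream bytes.
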
### 3. Updated Call Sites in Extraction Functions

Modified both `extract_pdf()` and `extract_pdf_ndjson()` to:
- Pass `source` and `resolver` to `extract_page_from_dict()`
- Enable lazy stream decoding for each page
- Ensure streams are dropped after processing each page

### 4. Fixed Borrow Checker Issue in `pages.rs`

Fixed pre-existing issue in `LazyPageIter::next()`:
- Changed `self.stack.push((node, ...))` to `self.stack.push((node.clone(), ...))`
- This fixes the borrow checker error where `node` was moved onto the stack while a borrow of it was still in use

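The shape of this fix can be illustrated with a small stack-based depth-first iterator. This is a sketch under assumptions: the `Node` enum and the use of `Rc` are hypothetical, and the real `LazyPageIter` works on PDF page-tree dictionaries rather than this toy tree. Only the `push((node.clone(), ...))` pattern is taken from the change described above.

```rust
use std::rc::Rc;

/// Hypothetical page-tree node: an interior `Pages` node or a leaf `Page`.
#[derive(Debug)]
pub enum Node {
    Pages(Vec<Rc<Node>>),
    Page(u32),
}

pub struct LazyPageIter {
    /// Each entry is (node, index of the next child to visit).
    /// Only the current root-to-leaf path is ever on the stack.
    stack: Vec<(Rc<Node>, usize)>,
}

impl LazyPageIter {
    pub fn new(root: Rc<Node>) -> Self {
        Self { stack: vec![(root, 0)] }
    }
}

impl Iterator for LazyPageIter {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        while let Some((node, idx)) = self.stack.pop() {
            match &*node {
                Node::Page(n) => return Some(*n),
                Node::Pages(kids) => {
                    if let Some(kid) = kids.get(idx) {
                        // `kid` borrows from `node`, so moving `node` onto the
                        // stack here would be rejected. Pushing a clone keeps
                        // the borrow valid; with `Rc` the clone is cheap.
                        self.stack.push((node.clone(), idx + 1));
                        self.stack.push((kid.clone(), 0));
                    }
                }
            }
        }
        None
    }
}
```

Because an interior node is re-pushed only while it still has unvisited children, the stack depth is bounded by the tree depth rather than the page count.
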
## Memory Behavior Verification

### Lazy Page Iteration (Already Implemented)
- `LazyPageIter` walks the page tree depth-first
- Only the current path from root to leaf is held in memory (max ~16 nodes)
- Each `PageDict` is standalone and can be dropped after use
- The iterator's memory footprint stays O(depth), not O(pages)

### Lazy Stream Decoding (Now Implemented)
- Content streams are decoded only when processing a page
- Decoded bytes are scoped to the page extraction function
- Streams are dropped immediately after processing
- No decoded data is held across page boundaries

### Extraction Paths

1. **`extract_pdf()`**: Accumulates all `PageResult` objects, but each page's decoded streams are dropped immediately, so memory grows with the accumulated results rather than with decoded stream data. Suitable for documents where you need all results in memory.

2. **`extract_pdf_ndjson()`**: True streaming: writes each page immediately after extraction and drops it. Peak RSS stays flat regardless of page count.

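The difference between the two paths can be sketched as below. The function names `extract_all` and `extract_ndjson`, the `PageResult` fields, and the hand-rolled `json_escape` helper are all hypothetical and chosen only to keep the example dependency-free; the real functions take a PDF source and presumably use a proper JSON serializer.

```rust
use std::io::{self, Write};

/// Hypothetical page result; the real `PageResult` fields are not shown in this note.
pub struct PageResult {
    pub index: usize,
    pub text: String,
}

/// Minimal JSON string escaping, enough for this sketch.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Accumulating path: every result stays resident until the caller drops the Vec.
pub fn extract_all<I: Iterator<Item = PageResult>>(pages: I) -> Vec<PageResult> {
    pages.collect()
}

/// Streaming path: one JSON object per line, at most one page resident at a time.
pub fn extract_ndjson<I, W>(pages: I, mut out: W) -> io::Result<usize>
where
    I: Iterator<Item = PageResult>,
    W: Write,
{
    let mut written = 0;
    for page in pages {
        writeln!(
            out,
            "{{\"page\":{},\"text\":\"{}\"}}",
            page.index,
            json_escape(&page.text)
        )?;
        written += 1;
        // `page` is dropped here, before the next page is pulled from the iterator.
    }
    Ok(written)
}
```

Driving both functions from a lazy iterator is what keeps the streaming path flat: nothing is materialized until the loop asks for the next page.
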
## Acceptance Criteria Status

- [PASS] Default extraction path uses lazy page iteration via `LazyPageIter`
- [PASS] Content streams are decoded lazily per page (only when processing)
- [PASS] Decoded streams are dropped immediately after processing
- [PASS] No path holds all decoded streams resident at once
- [PASS] NDJSON/PageIter streaming mode keeps peak RSS flat by construction (true streaming implementation); not confirmed by measurement
- [WARN] 10k-page fixture RSS test not run (fixture not available in current environment)

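Once the fixture is available, one way the outstanding RSS check could be done on Linux is to read the process's high-water mark from `/proc/self/status` after extraction. This is only a sketch: the helper names are hypothetical, it is Linux-specific, and it treats the 256MB target as 256 MiB, which is an assumption.

```rust
use std::fs;

/// Parse the `VmHWM` (peak resident set size) line of a
/// `/proc/<pid>/status` dump and return the value in KiB.
pub fn parse_vm_hwm_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmHWM:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// Peak RSS of the current process in KiB (Linux only; `None` elsewhere).
pub fn peak_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_hwm_kib(&status)
}

/// True if the peak is strictly below the budget, given in MiB.
pub fn within_budget(peak_kib: u64, budget_mib: u64) -> bool {
    peak_kib < budget_mib * 1024
}
```

A test could run the NDJSON path over the fixture, call `peak_rss_kib()`, and assert `within_budget(peak, 256)`. Comparing the peak for a small and a large fixture would also show whether it stays flat across page count.
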
## Files Modified

1. `crates/pdftract-core/src/extract.rs` - Added lazy stream decoding
2. `crates/pdftract-core/src/parser/pages.rs` - Fixed borrow checker issue in `LazyPageIter`

## Testing

- Code compiles successfully with `cargo build --package pdftract-core`
- Tests pass with `cargo test --package pdftract-core`
- No new warnings introduced by these changes

## Notes

The implementation ensures that:
- Each page's content streams are decoded independently
- Decoded bytes are scoped to the page extraction function
- Decoded streams do not accumulate across pages
- Peak RSS for decoded stream data scales with page-tree depth plus a single page's streams, not with the total page count

For large documents (10,000+ pages), the NDJSON extraction path is expected to keep peak RSS under 256MB because it never accumulates pages or decoded streams. This has not yet been measured on the 10k-page fixture.