jedarden 9b5fbc9b5e feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction

- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-23 12:30:26 -04:00


Verification Note: Streaming/Lazy Decode (bf-2y2rp)

Task Summary

Ensure the default extraction path decodes streams lazily per page and drops them; NDJSON/PageIter streaming mode must keep peak RSS flat across page count (target <256MB on the 10k-page fixture). Verify no path holds all decoded streams resident at once.

Changes Made

1. Added Lazy Stream Decoding Function (`extract.rs`)

Created decode_page_content_streams() function that:

Decodes content streams for a single page
Returns concatenated decoded bytes
Drops each stream immediately after processing
Enforces bomb limits via max_decompress_bytes parameter
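The four properties above can be sketched as follows. This is a minimal illustration, not the code in `extract.rs`: the name `decode_page_streams_sketch`, the `DecodeError` type, and the `decode` closure (standing in for the real filter pipeline) are assumptions. Only the `max_decompress_bytes` parameter name comes from the note.

```rust
/// Hypothetical error type: the decoded output would exceed the limit.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    BombLimitExceeded { limit: usize },
}

/// Decodes the content streams of a single page and concatenates the result.
/// `decode` is a stand-in for the real stream filter pipeline (assumption).
pub fn decode_page_streams_sketch<I, F>(
    raw_streams: I,
    max_decompress_bytes: usize,
    mut decode: F,
) -> Result<Vec<u8>, DecodeError>
where
    I: IntoIterator<Item = Vec<u8>>,
    F: FnMut(&[u8]) -> Vec<u8>,
{
    let mut out = Vec::new();
    for raw in raw_streams {
        let decoded = decode(&raw);
        // The raw stream is not needed once it has been decoded.
        drop(raw);
        // This sketch checks the limit after decoding each stream; a real
        // decoder would more likely enforce it during decompression.
        if out.len().saturating_add(decoded.len()) > max_decompress_bytes {
            return Err(DecodeError::BombLimitExceeded {
                limit: max_decompress_bytes,
            });
        }
        out.extend_from_slice(&decoded);
        // `decoded` goes out of scope here, so only `out` stays alive.
    }
    Ok(out)
}
```

Taking the streams by value through `IntoIterator` means each raw buffer is owned by the loop body and freed before the next one is touched, which is the property the note relies on.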

2. Updated `extract_page_from_dict()` Function

Modified to:

Accept optional source and resolver parameters for lazy decoding
Call decode_page_content_streams() when these parameters are provided
Ensure decoded streams are dropped before returning PageResult
Document the lazy decode behavior
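One way to picture the optional-parameter design is the sketch below. The signature is an assumption made for illustration: `extract_page_sketch`, `PageSummary`, and the closure-based resolver are hypothetical and do not reflect the actual types used by `extract_page_from_dict()`.

```rust
/// Hypothetical stand-in for PageResult; it keeps only a derived value.
pub struct PageSummary {
    pub decoded_len: usize,
}

/// Decodes lazily only when both `source` and `resolver` are supplied.
/// The resolver closure stands in for the real object resolver (assumption).
pub fn extract_page_sketch(
    page_id: u32,
    source: Option<&[u8]>,
    resolver: Option<&dyn Fn(u32, &[u8]) -> Vec<u8>>,
) -> PageSummary {
    let decoded_len = match (source, resolver) {
        (Some(src), Some(resolve)) => {
            // The decoded buffer lives only inside this block.
            let decoded = resolve(page_id, src);
            decoded.len()
            // `decoded` is dropped here, before the result is built.
        }
        // Without both parameters, no stream is decoded at all.
        _ => 0,
    };
    PageSummary { decoded_len }
}
```

Because the decoded buffer is confined to the inner block, the returned value cannot borrow from it, so the compiler itself enforces that decoded bytes do not outlive the page.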

3. Updated Call Sites in Extraction Functions

Modified both extract_pdf() and extract_pdf_ndjson() to:

Pass source and resolver to extract_page_from_dict()
Enable lazy stream decoding for each page
Ensure streams are dropped after processing each page

4. Fixed Borrow Checker Issue in `pages.rs`

Fixed pre-existing issue in LazyPageIter::next():

Changed self.stack.push((node, ...)) to self.stack.push((node.clone(), ...))
This fixes the borrow checker error in which node was moved into the stack while it was still borrowed
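The pattern behind this fix can be reproduced in a small depth-first iterator. The sketch assumes nodes are held behind `Rc`, so the clone is a cheap reference-count bump; the real node type in `pages.rs` may differ, and `Node` and `PageIterSketch` are hypothetical names.

```rust
use std::rc::Rc;

/// Hypothetical page-tree node: an interior node with kids, or a leaf page.
pub enum Node {
    Pages(Vec<Rc<Node>>),
    Page(u32),
}

pub struct PageIterSketch {
    /// Each entry is (node, index of the next child to visit).
    stack: Vec<(Rc<Node>, usize)>,
}

impl PageIterSketch {
    pub fn new(root: Rc<Node>) -> Self {
        Self { stack: vec![(root, 0)] }
    }
}

impl Iterator for PageIterSketch {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        while let Some((node, idx)) = self.stack.pop() {
            match &*node {
                Node::Page(n) => return Some(*n),
                Node::Pages(kids) => {
                    if let Some(child) = kids.get(idx) {
                        // `kids` and `child` borrow from `node`, so moving
                        // `node` into the stack here is rejected (E0505).
                        // Pushing a clone of the handle avoids the move.
                        self.stack.push((node.clone(), idx + 1));
                        self.stack.push((child.clone(), 0));
                    }
                }
            }
        }
        None
    }
}
```

The stack never holds more than two entries per tree level, so its size is bounded by the depth of the page tree rather than by the number of pages.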

Memory Behavior Verification

Lazy Page Iteration (Already Implemented)

LazyPageIter walks the page tree depth-first
Only the current path from root to leaf is held in memory (max ~16 nodes)
Each PageDict is standalone and can be dropped after use
Iterator state stays O(depth), not O(pages)

Lazy Stream Decoding (Now Implemented)

Content streams are decoded only when processing a page
Decoded bytes are scoped to the page extraction function
Streams are dropped immediately after processing
No decoded data is held across page boundaries

Extraction Paths

extract_pdf(): Accumulates all PageResult objects, but each page's decoded streams are dropped immediately, so memory grows with the accumulated results rather than with decoded stream data. Suitable for documents where you need all results in memory.
extract_pdf_ndjson(): True streaming - writes each page immediately after extraction and drops it. Peak RSS stays flat regardless of page count.
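The difference between the two paths can be sketched with a generic writer. The names `PageRecord`, `collect_all_sketch`, and `write_ndjson_sketch` are hypothetical, and the `{:?}` formatting of the text field is only a stand-in for a real JSON encoder (it matches JSON for plain ASCII text but not in general).

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for PageResult.
pub struct PageRecord {
    pub index: usize,
    pub text: String,
}

/// Accumulating path: every result stays resident until the caller drops it.
pub fn collect_all_sketch<I>(pages: I) -> Vec<PageRecord>
where
    I: Iterator<Item = PageRecord>,
{
    pages.collect()
}

/// Streaming path: each page is written as one line and then dropped.
pub fn write_ndjson_sketch<I, W>(pages: I, mut out: W) -> io::Result<usize>
where
    I: Iterator<Item = PageRecord>,
    W: Write,
{
    let mut written = 0;
    for page in pages {
        // Debug formatting stands in for proper JSON string escaping.
        writeln!(out, "{{\"page\":{},\"text\":{:?}}}", page.index, page.text)?;
        written += 1;
        // `page` is dropped at the end of each iteration.
    }
    Ok(written)
}
```

In the streaming variant at most one page record is alive at a time, provided the input iterator is itself lazy.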

Acceptance Criteria Status

[PASS] Default extraction path uses lazy page iteration via LazyPageIter
[PASS] Content streams are decoded lazily per page (only when processing)
[PASS] Decoded streams are dropped immediately after processing
[PASS] No path holds all decoded streams resident at once
[PASS] NDJSON/PageIter streaming mode keeps peak RSS flat (by construction: true streaming implementation; not yet measured)
[WARN] 10k-page fixture RSS test not run (fixture not available in current environment)
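Once the fixture is available, the outstanding RSS check could be scripted along these lines. This is a Linux-only sketch that reads the `VmHWM` (peak resident set size) field from `/proc/self/status`; the function names and the way the 256 MB budget is applied are assumptions, not part of the existing test suite.

```rust
use std::fs;

/// Parses the `VmHWM` value (in kB) from the text of a /proc/<pid>/status file.
pub fn parse_vm_hwm_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmHWM:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// Reads the current process's peak RSS; returns None off Linux or on error.
pub fn peak_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_hwm_kib(&status)
}

/// True when the measured peak is strictly below the budget.
pub fn within_budget(peak_kib: u64, budget_mib: u64) -> bool {
    peak_kib < budget_mib * 1024
}
```

A test would run the NDJSON extraction over the 10k-page fixture, call `peak_rss_kib()` afterwards, and assert `within_budget(peak, 256)`.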

Files Modified

crates/pdftract-core/src/extract.rs - Added lazy stream decoding
crates/pdftract-core/src/parser/pages.rs - Fixed borrow checker issue in LazyPageIter

Testing

Code compiles successfully with cargo build --package pdftract-core
Tests pass with cargo test --package pdftract-core
No new warnings introduced by these changes

Notes

The implementation ensures that:

Each page's content streams are decoded independently
Decoded bytes are scoped to the page extraction function
No accumulation of decoded streams across pages
Decoded-stream memory stays O(per-page) on top of O(depth) iterator state, not O(pages × per-page)

For large documents (10,000+ pages), the NDJSON extraction path should maintain peak RSS under 256MB, as it never accumulates pages or decoded streams. This remains unmeasured until the 10k-page fixture is available.
