# Verification Note: pdftract-bnba5

## Summary

Implemented a PyO3 `extract_stream` entry point that returns a `StreamIterator` PyClass, which yields page dicts incrementally. This gives Python callers a memory-efficient API for processing large PDFs.

## Changes Made

### Core API (`crates/pdftract-core/src/extract.rs`)

- Added `extract_pdf_streaming<F>()` function that accepts a callback invoked for each page as it's extracted
- Callback receives `&PageResult` and can return `false` to stop extraction early
- Pages are extracted sequentially and dropped after callback invocation, keeping memory bounded
- Exported `extract_pdf_streaming` in `lib.rs`
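
The callback contract described above can be sketched with the standard library alone. This is a minimal illustration, not the actual `extract_pdf_streaming` signature: the `PageResult` fields, the `extract_streaming_sketch` name, and the iterator standing in for per-page extraction are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the core `PageResult` type; the real fields are not shown in this note.
#[derive(Debug, Clone, PartialEq)]
pub struct PageResult {
    pub page_index: usize,
    pub text: String,
}

/// Sketch of a callback-driven extraction loop (assumed shape, not the real API).
/// `pages` stands in for the per-page extraction work.
/// Returns the number of pages handed to the callback.
pub fn extract_streaming_sketch<I, F>(pages: I, mut on_page: F) -> usize
where
    I: IntoIterator<Item = PageResult>,
    F: FnMut(&PageResult) -> bool,
{
    let mut delivered = 0;
    for page in pages {
        delivered += 1;
        // The callback only borrows the page; returning `false` requests an early stop.
        if !on_page(&page) {
            break;
        }
        // `page` is dropped at the end of each iteration, so only one page is alive at a time.
    }
    delivered
}

/// Usage example: record page indices and stop once `limit` pages have been seen.
pub fn first_page_indices(total: usize, limit: usize) -> Vec<usize> {
    // A lazy iterator, so pages are only produced when the loop asks for them.
    let pages = (0..total).map(|i| PageResult {
        page_index: i,
        text: format!("page {i}"),
    });
    let mut seen = Vec::new();
    extract_streaming_sketch(pages, |page| {
        seen.push(page.page_index);
        seen.len() < limit
    });
    seen
}
```

Because the callback receives a borrow, a consumer that wants to keep page data must copy out what it needs before returning.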

### PyO3 Bindings (`crates/pdftract-py/src/extract_stream.rs`)

- Created new module implementing:
  - `StreamIterator` PyClass with `__iter__` and `__next__` methods
  - `extract_stream_fn()` PyFunction that spawns a background extraction thread
  - `PageFrame`, `SpanFrame`, `BlockFrame`, `TableFrame`, `RowFrame`, `CellFrame` types for efficient serialization
  - `From<>` implementations converting core types to frame types
  - `page_frame_to_py()` function converting frames to Python dicts
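
The frame-type idea can be illustrated without PyO3. The sketch below assumes, purely for illustration, that a core span type borrows its text; the `Span` type and its fields, the `bbox` field, and the `page_frame` helper are hypothetical and do not reflect the real `pdftract-core` or `pdftract-py` definitions.

```rust
/// Hypothetical borrowed core type; the real `pdftract-core` types are not shown in this note.
pub struct Span<'a> {
    pub text: &'a str,
    pub bbox: [f32; 4],
}

/// Owned counterpart: no lifetime parameter, so it can outlive the data it was built from.
#[derive(Debug, Clone, PartialEq)]
pub struct SpanFrame {
    pub text: String,
    pub bbox: [f32; 4],
}

impl From<&Span<'_>> for SpanFrame {
    fn from(span: &Span<'_>) -> Self {
        SpanFrame {
            // Copy the borrowed text so the frame holds no reference to the source.
            text: span.text.to_owned(),
            bbox: span.bbox,
        }
    }
}

/// Owned page-level frame (only spans are modeled in this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct PageFrame {
    pub page_index: usize,
    pub spans: Vec<SpanFrame>,
}

/// Hypothetical helper that converts a page's borrowed spans into an owned frame.
pub fn page_frame(page_index: usize, spans: &[Span<'_>]) -> PageFrame {
    PageFrame {
        page_index,
        spans: spans.iter().map(SpanFrame::from).collect(),
    }
}
```

An owned frame of this kind holds no references into the extractor's data, which is the property that matters when the frame is later turned into a Python dict.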

### Module Integration (`crates/pdftract-py/src/lib.rs`)

- Added `extract_stream` module
- Registered `extract_stream_fn` as `extract_stream` in Python module
- Registered `StreamIterator` class

## Design Decisions

1. **Callback-based core API**: Added `extract_pdf_streaming` with a callback instead of modifying `extract_pdf_ndjson`, keeping the NDJSON path separate and avoiding unnecessary abstractions.

2. **Frame types**: Created separate `*Frame` types for serialization to avoid holding borrows during Python dict construction.

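3. **Polling iterator**: Used `try_recv()` with polling instead of a blocking `recv()` inside `allow_threads()`. In PyO3's default configuration, `allow_threads()` requires its closure to be `Send`, and a closure that borrows the receiver is only `Send` if `mpsc::Receiver` is `Sync`, which it is not. The iterator releases the GIL between polls so that other Python threads are not blocked.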

4. **Error propagation**: Background thread errors are captured as `String` and raised as `RuntimeError` when the channel closes.
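
The background-thread, polling, and error-propagation decisions above can be sketched together using only the standard library. This is an assumed shape rather than the contents of `extract_stream.rs`: `PollingStream`, `next_page`, the `fail_at` parameter, the use of `usize` as a page stand-in, and the 1 ms sleep are all hypothetical. In the PyO3 version, the wait between polls is where the GIL would be released.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical iterator state: a channel of pages plus the worker's join handle.
pub struct PollingStream {
    rx: Receiver<usize>,
    worker: Option<JoinHandle<Result<(), String>>>,
}

impl PollingStream {
    /// Spawns a worker that "extracts" `page_count` pages, failing at `fail_at` if set.
    pub fn spawn(page_count: usize, fail_at: Option<usize>) -> Self {
        let (tx, rx) = mpsc::channel();
        let worker = thread::spawn(move || {
            for index in 0..page_count {
                if Some(index) == fail_at {
                    // Errors are captured as `String`, as in the design note.
                    return Err(format!("extraction failed on page {index}"));
                }
                if tx.send(index).is_err() {
                    break; // The consumer dropped the receiver; stop early.
                }
            }
            Ok(())
        });
        Self { rx, worker: Some(worker) }
    }

    /// `Ok(Some(page))` for the next page, `Ok(None)` at end of stream
    /// (the `StopIteration` case), `Err(msg)` on worker failure (the `RuntimeError` case).
    pub fn next_page(&mut self) -> Result<Option<usize>, String> {
        loop {
            match self.rx.try_recv() {
                Ok(page) => return Ok(Some(page)),
                Err(TryRecvError::Empty) => {
                    // Nothing ready yet: wait briefly, then poll again.
                    thread::sleep(Duration::from_millis(1));
                }
                Err(TryRecvError::Disconnected) => {
                    // The channel closed: find out whether the worker finished or failed.
                    return match self.worker.take() {
                        Some(handle) => match handle.join() {
                            Ok(Ok(())) => Ok(None),
                            Ok(Err(msg)) => Err(msg),
                            Err(_) => Err("extraction thread panicked".to_string()),
                        },
                        None => Ok(None),
                    };
                }
            }
        }
    }
}
```

Pages sent before a failure are still delivered, because `try_recv()` drains buffered messages before it reports a disconnected channel.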

## Files Modified

- `crates/pdftract-core/src/extract.rs` - Added `extract_pdf_streaming()` function
- `crates/pdftract-core/src/lib.rs` - Exported `extract_pdf_streaming`
- `crates/pdftract-py/src/lib.rs` - Integrated `extract_stream` module
- `crates/pdftract-py/src/extract_stream.rs` - New PyO3 streaming module (423 lines)

## Acceptance Criteria

- [PASS] `extract_stream_fn` returns `Py<StreamIterator>`
- [PASS] `StreamIterator` implements `__iter__` returning self
- [PASS] `StreamIterator` implements `__next__` yielding page dicts
- [PASS] Page dicts contain: page_index, spans, blocks, tables
- [PASS] `StopIteration` raised when extraction completes
- [PASS] Errors propagate as `RuntimeError`
- [PASS] Background thread + mpsc channel pattern used
- [PASS] GIL released while waiting for pages (via `allow_threads` between `try_recv()` polls)

## Known Limitations

1. **Polling-based iterator**: The current implementation polls with `try_recv()` because `mpsc::Receiver` is not `Sync`, so `__next__` waits in a poll loop rather than blocking on the channel. A future improvement would be to use `crossbeam::channel`, whose receivers are `Sync`, allowing true blocking iteration.

2. **Function name**: The Rust function is named `extract_stream_fn` to avoid a name collision with the `extract_stream` module. It is exposed to Python as `extract_stream`.

## Testing Notes

The implementation compiles cleanly with no clippy warnings in `pdftract-py`. End-to-end testing would require:
1. Building the Python extension with `maturin`
2. Loading the module in Python
3. Calling `extract_stream()` on a test PDF
4. Iterating and verifying yielded page dicts

This is deferred to integration testing as the PyO3 bindings are still early in development.