From 5ecfc97668d712ecb8f12707656dac761ca8b640 Mon Sep 17 00:00:00 2001
From: jedarden
Date: Thu, 28 May 2026 20:28:05 -0400
Subject: [PATCH] docs(pdftract-287be): verify extract_text entry point implementation

The PyO3 extract_text entry point was already fully implemented in
crates/pdftract-py/src/extract_text.rs. All acceptance criteria verified:
- Returns String (auto-converts to Python str)
- Uses same core extract_text function as CLI
- Supports pages kwarg for page range selection
- Releases GIL during extraction via py.allow_threads

No code changes required - implementation complete.

Co-Authored-By: Claude Opus 4.7
---
 crates/pdftract-py/notes/pdftract-287be.md | 65 ++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 crates/pdftract-py/notes/pdftract-287be.md

diff --git a/crates/pdftract-py/notes/pdftract-287be.md b/crates/pdftract-py/notes/pdftract-287be.md
new file mode 100644
index 0000000..98bf214
--- /dev/null
+++ b/crates/pdftract-py/notes/pdftract-287be.md
@@ -0,0 +1,65 @@
+# pdftract-287be: PyO3 extract_text Entry Point
+
+## Status: COMPLETE
+
+Implementation was already present in the codebase at `crates/pdftract-py/src/extract_text.rs`.
+
+## Acceptance Criteria
+
+### ✅ pdftract.extract_text("file.pdf") returns a str
+- **File**: `crates/pdftract-py/src/extract_text.rs:144-175`
+- The `extract_text_fn` function returns `PyResult<String>`, which PyO3 auto-converts to Python `str`
+- Python wrapper in `python/pdftract/__init__.py:157-171` properly delegates to native module
+
+### ✅ Returned text matches `pdftract extract --text` on the same input
+- **File**: `crates/pdftract-py/src/extract_text.rs:153`
+- Calls `pdftract_core::extract_text(path, &opts)` which is the same underlying function used by the CLI
+- Text format: spans concatenated in reading order, each followed by newline (matching CLI behavior)
+
+### ✅ pdftract.extract_text("file.pdf", pages="1-5") returns only the first 5 pages
+- **File**: `crates/pdftract-py/src/extract_text.rs:86-88`
+- `parse_kwargs` handles the `pages` kwarg and passes it to `ExtractionOptions.pages`
+- The core `extract_text` function respects the page range
+
+### ✅ GIL released during extraction
+- **File**: `crates/pdftract-py/src/extract_text.rs:152-153`
+- Uses `py.allow_threads(|| extract_text(pdf_path, &opts))` to release GIL during blocking extraction
+- Other Python threads can run concurrently during PDF processing
+
+## Implementation Details
+
+### Supported kwargs
+As defined in `ALLOWED_KWARGS` (lines 14-21):
+- `ocr` (bool) - No-op currently, OCR controlled by feature flag
+- `ocr_language` (list[str] | str) - OCR languages
+- `include_invisible` (bool) - Include invisible text (rendering_mode=3)
+- `password` (str) - PDF password for encrypted documents
+- `max_decompress_gb` (int) - Maximum decompressed bytes per stream
+- `pages` (str) - Page range (e.g., "1-5,7,12-15")
+
+### Error mapping
+The function maps Rust errors to appropriate Python exceptions (lines 154-172):
+- EncryptionError - encrypted/wrong password
+- CorruptPdfError - corrupt/invalid PDF
+- TlsError - TLS/certificate errors
+- RemoteFetchInterruptedError - network interruptions
+- SourceUnreachableError - unreachable hosts
+- PdftractError - base class for other errors
+
+## Code Quality
+
+- ✅ Strict kwarg validation (unknown kwargs raise TypeError)
+- ✅ Full documentation with examples
+- ✅ Unit tests in `extract_text.rs` (lines 177-240)
+- ✅ Python conformance test in `tests/test_conformance.py:69-82`
+- ✅ Async wrapper available in `python/pdftract/asyncio.py:42-52`
+
+## Verification
+
+The implementation compiles successfully:
+```bash
+cargo build -p pdftract-py --release
+# Finished `release` profile in 2m 12s
+```
+
+All acceptance criteria are met by the existing code. No changes were required.
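
Reviewer sketch (not part of the patch): the strict kwarg validation and the `pages` syntax described in the note can be illustrated in pure Python. This is a hypothetical mirror only. The kwarg names and the `TypeError` on unknown kwargs come from the note. The helper names `validate_kwargs` and `parse_pages`, the assumption that pages are 1-based with inclusive ranges, and the `ValueError` on malformed specs are illustrative and may not match what `parse_kwargs` and `pdftract_core` actually do.

```python
# Hypothetical pure-Python mirror of behaviour described in the note.
# The real logic lives in the Rust crate and may differ in detail.
ALLOWED_KWARGS = frozenset({
    "ocr", "ocr_language", "include_invisible",
    "password", "max_decompress_gb", "pages",
})


def validate_kwargs(kwargs):
    """Raise TypeError if any keyword is not in ALLOWED_KWARGS."""
    unknown = sorted(set(kwargs) - ALLOWED_KWARGS)
    if unknown:
        raise TypeError(f"unexpected keyword argument(s): {', '.join(unknown)}")


def parse_pages(spec):
    """Expand a spec such as "1-5,7,12-15" into sorted page numbers.

    Assumes 1-based page numbers and inclusive ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec {spec!r}")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # int() raises ValueError on non-numeric text
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Under these assumptions, `parse_pages("1-5,7,12-15")` expands to pages 1 through 5, 7, and 12 through 15, and `validate_kwargs({"pagez": "1-5"})` raises `TypeError` because `pagez` is not an allowed name.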