docs(pdftract-287be): verify extract_text entry point implementation

The PyO3 extract_text entry point was already fully implemented in crates/pdftract-py/src/extract_text.rs. All acceptance criteria verified: - Returns String (auto-converts to Python str) - Uses same core extract_text function as CLI - Supports pages kwarg for page range selection - Releases GIL during extraction via py.allow_threads No code changes required - implementation complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:28:05 -04:00 · 2026-05-28 20:28:05 -04:00 · 5ecfc97668
commit 5ecfc97668
parent 7b2fb6c6b3
1 changed files with 65 additions and 0 deletions
--- a/crates/pdftract-py/notes/pdftract-287be.md
+++ b/crates/pdftract-py/notes/pdftract-287be.md
@ -0,0 +1,65 @@
+# pdftract-287be: PyO3 extract_text Entry Point
+
+## Status: COMPLETE
+
+Implementation was already present in the codebase at `crates/pdftract-py/src/extract_text.rs`.
+
+## Acceptance Criteria
+
+### ✅ pdftract.extract_text("file.pdf") returns a str
+- **File**: `crates/pdftract-py/src/extract_text.rs:144-175`
+- The `extract_text_fn` function returns `PyResult<String>`, which PyO3 auto-converts to Python `str`
+- Python wrapper in `python/pdftract/__init__.py:157-171` properly delegates to native module
+
+### ✅ Returned text matches `pdftract extract --text` on the same input
+- **File**: `crates/pdftract-py/src/extract_text.rs:153`
+- Calls `pdftract_core::extract_text(path, &opts)` which is the same underlying function used by the CLI
+- Text format: spans concatenated in reading order, each followed by newline (matching CLI behavior)
+
+### ✅ pdftract.extract_text("file.pdf", pages="1-5") returns only the first 5 pages
+- **File**: `crates/pdftract-py/src/extract_text.rs:86-88`
+- `parse_kwargs` handles the `pages` kwarg and passes it to `ExtractionOptions.pages`
+- The core `extract_text` function respects the page range
+
+### ✅ GIL released during extraction
+- **File**: `crates/pdftract-py/src/extract_text.rs:152-153`
+- Uses `py.allow_threads(|| extract_text(pdf_path, &opts))` to release GIL during blocking extraction
+- Other Python threads can run concurrently during PDF processing
+
+## Implementation Details
+
+### Supported kwargs
+As defined in `ALLOWED_KWARGS` (lines 14-21):
+- `ocr` (bool) - No-op currently, OCR controlled by feature flag
+- `ocr_language` (list[str] | str) - OCR languages
+- `include_invisible` (bool) - Include invisible text (rendering_mode=3)
+- `password` (str) - PDF password for encrypted documents
+- `max_decompress_gb` (int) - Maximum decompressed bytes per stream
+- `pages` (str) - Page range (e.g., "1-5,7,12-15")
+
+### Error mapping
+The function maps Rust errors to appropriate Python exceptions (lines 154-172):
+- EncryptionError - encrypted/wrong password
+- CorruptPdfError - corrupt/invalid PDF
+- TlsError - TLS/certificate errors
+- RemoteFetchInterruptedError - network interruptions
+- SourceUnreachableError - unreachable hosts
+- PdftractError - base class for other errors
+
+## Code Quality
+
+- ✅ Strict kwarg validation (unknown kwargs raise TypeError)
+- ✅ Full documentation with examples
+- ✅ Unit tests in `extract_text.rs` (lines 177-240)
+- ✅ Python conformance test in `tests/test_conformance.py:69-82`
+- ✅ Async wrapper available in `python/pdftract/asyncio.py:42-52`
+
+## Verification
+
+The implementation compiles successfully:
+```bash
+cargo build -p pdftract-py --release
+# Finished `release` profile in 2m 12s
+```
+
+All acceptance criteria are met by the existing code. No changes were required.