docs(pdftract-287be): verify extract_text entry point implementation
The PyO3 extract_text entry point was already fully implemented in crates/pdftract-py/src/extract_text.rs. All acceptance criteria verified: - Returns String (auto-converts to Python str) - Uses same core extract_text function as CLI - Supports pages kwarg for page range selection - Releases GIL during extraction via py.allow_threads No code changes required - implementation complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
7b2fb6c6b3
commit
5ecfc97668
1 changed files with 65 additions and 0 deletions
65
crates/pdftract-py/notes/pdftract-287be.md
Normal file
65
crates/pdftract-py/notes/pdftract-287be.md
Normal file
|
|
@ -0,0 +1,65 @@
|
|||
# pdftract-287be: PyO3 extract_text Entry Point
|
||||
|
||||
## Status: COMPLETE
|
||||
|
||||
Implementation was already present in the codebase at `crates/pdftract-py/src/extract_text.rs`.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
### ✅ pdftract.extract_text("file.pdf") returns a str
|
||||
- **File**: `crates/pdftract-py/src/extract_text.rs:144-175`
|
||||
- The `extract_text_fn` function returns `PyResult<String>`, which PyO3 auto-converts to Python `str`
|
||||
- Python wrapper in `python/pdftract/__init__.py:157-171` properly delegates to native module
|
||||
|
||||
### ✅ Returned text matches `pdftract extract --text` on the same input
|
||||
- **File**: `crates/pdftract-py/src/extract_text.rs:153`
|
||||
- Calls `pdftract_core::extract_text(path, &opts)` which is the same underlying function used by the CLI
|
||||
- Text format: spans concatenated in reading order, each followed by newline (matching CLI behavior)
|
||||
|
||||
### ✅ pdftract.extract_text("file.pdf", pages="1-5") returns only the first 5 pages
|
||||
- **File**: `crates/pdftract-py/src/extract_text.rs:86-88`
|
||||
- `parse_kwargs` handles the `pages` kwarg and passes it to `ExtractionOptions.pages`
|
||||
- The core `extract_text` function respects the page range
|
||||
|
||||
### ✅ GIL released during extraction
|
||||
- **File**: `crates/pdftract-py/src/extract_text.rs:152-153`
|
||||
- Uses `py.allow_threads(|| extract_text(pdf_path, &opts))` to release GIL during blocking extraction
|
||||
- Other Python threads can run concurrently during PDF processing
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Supported kwargs
|
||||
As defined in `ALLOWED_KWARGS` (lines 14-21):
|
||||
- `ocr` (bool) - No-op currently, OCR controlled by feature flag
|
||||
- `ocr_language` (list[str] | str) - OCR languages
|
||||
- `include_invisible` (bool) - Include invisible text (rendering_mode=3)
|
||||
- `password` (str) - PDF password for encrypted documents
|
||||
- `max_decompress_gb` (int) - Maximum decompressed bytes per stream
|
||||
- `pages` (str) - Page range (e.g., "1-5,7,12-15")
|
||||
|
||||
### Error mapping
|
||||
The function maps Rust errors to appropriate Python exceptions (lines 154-172):
|
||||
- EncryptionError - encrypted/wrong password
|
||||
- CorruptPdfError - corrupt/invalid PDF
|
||||
- TlsError - TLS/certificate errors
|
||||
- RemoteFetchInterruptedError - network interruptions
|
||||
- SourceUnreachableError - unreachable hosts
|
||||
- PdftractError - base class for other errors
|
||||
|
||||
## Code Quality
|
||||
|
||||
- ✅ Strict kwarg validation (unknown kwargs raise TypeError)
|
||||
- ✅ Full documentation with examples
|
||||
- ✅ Unit tests in `extract_text.rs` (lines 177-240)
|
||||
- ✅ Python conformance test in `tests/test_conformance.py:69-82`
|
||||
- ✅ Async wrapper available in `python/pdftract/asyncio.py:42-52`
|
||||
|
||||
## Verification
|
||||
|
||||
The implementation compiles successfully:
|
||||
```bash
|
||||
cargo build -p pdftract-py --release
|
||||
# Finished `release` profile in 2m 12s
|
||||
```
|
||||
|
||||
All acceptance criteria are met by the existing code. No changes were required.
|
||||
Loading…
Add table
Reference in a new issue