pdftract
A PDF text extraction library that gets the hard parts right.
Platform Support
| Platform | Status |
|---|---|
| Linux x86_64 | Fully CI-tested (gating CI on every PR) |
| Linux aarch64 | Fully CI-tested |
| macOS x86_64 | Build-tested; manually smoke-tested per release |
| macOS aarch64 | Build-tested; manually smoke-tested per release |
| Windows x86_64 | Build-tested; manually smoke-tested per release |
Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.
Installation
cargo
cargo install pdftract
pip
pip install pdftract
Docker
docker pull ronaldraygun/pdftract:latest
Homebrew
brew install pdftract
Quickstart
Rust
use pdftract_core::{extract_pdf, ExtractionOptions};
let opts = ExtractionOptions::default();
let doc = extract_pdf("file.pdf", &opts)?;
println!("Extracted {} pages", doc.metadata.page_count);
Python
import pdftract
doc = pdftract.extract("file.pdf")
print(f"Extracted {doc.metadata.page_count} pages")
CLI
pdftract extract file.pdf --json result.json # JSON output
pdftract extract file.pdf --text - # Plain text to stdout
pdftract serve --port 8080 # HTTP microservice
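For callers who prefer to drive the CLI from a script, the `extract ... --json` invocation above can be wrapped in a few lines. This is a minimal sketch, not part of pdftract: the helper names `build_extract_command` and `extract_to_dict` are hypothetical, and it assumes the `pdftract` binary is on `PATH` and writes valid JSON to the given path. Only the `extract` subcommand and `--json` flag shown above are used.

```python
import json
import subprocess
from pathlib import Path


def build_extract_command(pdf_path, json_path):
    """Build the argv for `pdftract extract <pdf> --json <out>` (hypothetical helper)."""
    return ["pdftract", "extract", str(pdf_path), "--json", str(json_path)]


def extract_to_dict(pdf_path, json_path, run=subprocess.run):
    """Run the CLI, then load the JSON file it wrote.

    `run` is injectable so the function can be exercised without the binary.
    The shape of the returned dict is whatever pdftract emits; nothing about
    its fields is assumed here.
    """
    cmd = build_extract_command(pdf_path, json_path)
    run(cmd, check=True)  # raises CalledProcessError on a non-zero exit code
    return json.loads(Path(json_path).read_text(encoding="utf-8"))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.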
What it does
- Correct reading order: layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
- Font encoding recovery: when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline of glyph name lookup, font fingerprinting, and glyph outline shape matching
- Structure tree extraction: PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
- Per-page hybrid routing: each page is independently classified and routed to the appropriate pipeline, which is vector text extraction, full OCR, or assisted OCR
- Structured output with provenance: the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
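To make the idea of a layered font encoding recovery concrete, here is a small illustrative sketch. It is not pdftract's implementation: the font dictionary layout, the function names, and the tiny glyph table are all assumptions made for the example. It covers only the first two layers (a ToUnicode map, then glyph name lookup including the `uniXXXX` naming convention) and leaves font fingerprinting and outline shape matching out.

```python
# Tiny excerpt of glyph-name mappings; a real table such as the
# Adobe Glyph List is far larger. Chosen for illustration only.
GLYPH_NAMES = {"A": "A", "space": " ", "fi": "\ufb01", "endash": "\u2013"}


def from_tounicode(code, font):
    """Layer 1: use the font's ToUnicode map when it has an entry."""
    return font.get("to_unicode", {}).get(code)


def from_glyph_name(code, font):
    """Layer 2: map the glyph's name to text via a table or the uniXXXX form."""
    name = font.get("glyph_names", {}).get(code)
    if name is None:
        return None
    if name in GLYPH_NAMES:
        return GLYPH_NAMES[name]
    if name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))  # e.g. "uni00E9" -> U+00E9
        except ValueError:
            return None
    return None


# Layers are tried in order; later layers run only if earlier ones fail.
LAYERS = [("to_unicode", from_tounicode), ("glyph_name", from_glyph_name)]


def recover(code, font):
    """Return (text, layer_label), or (None, "unresolved") if every layer fails."""
    for label, layer in LAYERS:
        text = layer(code, font)
        if text:
            return text, label
    return None, "unresolved"
```

Returning the label of the layer that succeeded is one way a caller could attach a lower confidence to text recovered by a weaker layer.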
Documentation
- User docs: docs/user-docs (mdBook)
- API reference: docs.rs/pdftract
- Contributing guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
- License: LICENSE-MIT or LICENSE-APACHE
License
Licensed under either of:
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
at your option.