pdftract
A PDF text extraction library that gets the hard parts right.
What it does
- Correct reading order: layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
- Font encoding recovery: when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline of glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
- Structure tree extraction: PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
- Per-page hybrid routing: each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
- Structured output with provenance: the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
Output
{
"pages": [
{
"page": 1,
"blocks": [
{ "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
{ "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
],
"spans": [
{ "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
]
}
],
"metadata": { "title": "...", "author": "...", "page_count": 10 }
}
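As a rough illustration of consuming this output, the sketch below loads the sample document from this section and pulls out headings and low-confidence spans. It relies only on the fields shown in the sample; `headings` and `low_confidence_spans` are hypothetical helper names, not part of pdftract, and real output may carry additional fields.

```python
import json

# Sample document copied from the Output section; a real run would read
# the JSON written by `pdftract extract` instead.
SAMPLE = """
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}
"""


def headings(doc):
    """Return (page number, text) for every block whose kind is "heading"."""
    return [
        (page["page"], block["text"])
        for page in doc["pages"]
        for block in page["blocks"]
        if block["kind"] == "heading"
    ]


def low_confidence_spans(doc, threshold):
    """Return every span whose confidence is strictly below `threshold`."""
    return [
        span
        for page in doc["pages"]
        for span in page["spans"]
        if span["confidence"] < threshold
    ]


doc = json.loads(SAMPLE)
print(headings(doc))
print(len(low_confidence_spans(doc, 0.9)))
```

Filtering on the per-span confidence is one way to flag text that may deserve manual review; the threshold of 0.9 is an arbitrary choice for the example.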
Usage
pdftract extract invoice.pdf # structured JSON to stdout
pdftract extract invoice.pdf --text # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080 # HTTP service: POST /extract
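The CLI can also be driven from a script. The following is a minimal sketch in Python that uses only the `extract` flags shown above. It assumes `pdftract` is on the `PATH` and exits with a non-zero status on failure; `build_command` and `extract_json` are hypothetical helper names, and whether `--text` and `--output` may be combined is not confirmed here.

```python
import json
import subprocess


def build_command(pdf_path, text=False, output=None):
    """Build the argv list for `pdftract extract` from the flags shown above."""
    cmd = ["pdftract", "extract", str(pdf_path)]
    if text:
        cmd.append("--text")
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd


def extract_json(pdf_path):
    """Run pdftract and parse the structured JSON it writes to stdout.

    Assumption: a failed extraction yields a non-zero exit code, which
    `check=True` turns into subprocess.CalledProcessError.
    """
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Calling `extract_json("invoice.pdf")` would then return the parsed document as a Python dictionary.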
Installation
cargo binstall (recommended, fastest)
If you have the Rust toolchain installed, the quickest way to get a prebuilt binary is cargo binstall:
cargo install cargo-binstall
cargo binstall pdftract
This downloads the appropriate binary for your platform from GitHub Releases, which typically takes 2-3 seconds, instead of compiling from source.
Pre-built binaries
Download directly from GitHub Releases:
- Linux (x86_64): pdftract-v*-x86_64-unknown-linux-musl.tar.gz
- macOS (Apple Silicon): pdftract-v*-aarch64-apple-darwin.tar.gz
- macOS (Intel): pdftract-v*-x86_64-apple-darwin.tar.gz
- Windows: pdftract-v*-x86_64-pc-windows-gnu.zip
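If you script the download, the archive for the current machine can be picked from the list above. The sketch below is one possible mapping. It assumes the `*` in each file name stands for the release version, and that Python's `platform` module reports the system and machine strings used as keys; `archive_name` is a hypothetical helper.

```python
import platform

# Target triples and extensions are taken from the list above. The
# (system, machine) keys are assumptions about what platform.system()
# and platform.machine() report on each operating system.
ARCHIVES = {
    ("Linux", "x86_64"): ("x86_64-unknown-linux-musl", "tar.gz"),
    ("Darwin", "arm64"): ("aarch64-apple-darwin", "tar.gz"),
    ("Darwin", "x86_64"): ("x86_64-apple-darwin", "tar.gz"),
    ("Windows", "AMD64"): ("x86_64-pc-windows-gnu", "zip"),
}


def archive_name(version, system=None, machine=None):
    """Return the release archive name for a platform (default: this one)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        triple, ext = ARCHIVES[(system, machine)]
    except KeyError:
        raise ValueError(
            f"no pre-built binary listed for {system}/{machine}"
        ) from None
    return f"pdftract-v{version}-{triple}.{ext}"


print(archive_name("X.Y.Z", "Linux", "x86_64"))
```

Platforms that are not in the list raise `ValueError`, in which case building from source is the fallback.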
Build from source
cargo install pdftract --features full-render,ocr
See docs/notes/ for language-specific SDK installation examples (Python, Node.js, Go, Ruby, Java, Rust, Bash).
Architecture
Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.
See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.
Verifying Releases
All releases are signed using Sigstore keyless signing with OIDC from the iad-ci cluster. This provides cryptographic proof that artifacts were produced by the official CI/CD pipeline and haven't been tampered with.
Verify Binary Archives
To verify downloaded binary archives:
# Download release artifacts
gh release download vX.Y.Z --dir /tmp/pdftract-release
cd /tmp/pdftract-release
# Verify the SHA256SUMS signature
cosign verify-blob \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
--signature SHA256SUMS.sig \
--certificate SHA256SUMS.pem \
SHA256SUMS
# Verify individual artifacts against checksums
sha256sum -c SHA256SUMS
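On systems without `sha256sum`, the checksum step can be approximated in Python. This is a minimal sketch that assumes SHA256SUMS uses the usual `sha256sum` line format (hex digest, whitespace, optional `*`, file name). `parse_sums` and `verify` are hypothetical helpers, and the code checks digests only; it does not replace the `cosign verify-blob` signature check.

```python
import hashlib
from pathlib import Path


def parse_sums(text):
    """Parse sha256sum-style lines into a {filename: hex digest} mapping."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        if name.startswith("*"):
            name = name[1:]
        entries[name] = digest.lower()
    return entries


def verify(directory, sums_file="SHA256SUMS"):
    """Return {filename: bool}; True where the file's digest matches.

    Files listed in the sums file but missing on disk map to False.
    """
    directory = Path(directory)
    expected = parse_sums((directory / sums_file).read_text())
    results = {}
    for name, digest in expected.items():
        path = directory / name
        if path.is_file():
            actual = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            actual = None
        results[name] = actual == digest
    return results
```

For example, `verify("/tmp/pdftract-release")` returns one entry per listed artifact, and any `False` value means that artifact should not be trusted.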
Verify Docker Images
To verify Docker images before running them:
# Verify the main image
cosign verify \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
ghcr.io/jedarden/pdftract:X.Y.Z
# Verify the OCR variant
cosign verify \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
ghcr.io/jedarden/pdftract:ocr-X.Y.Z
# Verify the full variant
cosign verify \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
ghcr.io/jedarden/pdftract:full-X.Y.Z
View SLSA Provenance
Each Docker image includes SLSA provenance attestation:
cosign verify-attestation \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
--type slsaprovenance \
ghcr.io/jedarden/pdftract:X.Y.Z
The provenance includes the build configuration, source commit, and builder identity.
Security
For responsible disclosure of security vulnerabilities, please email security@jedarden.com. See SECURITY.md for our disclosure policy, supported versions, and PGP key for encrypted reports.
PGP Key: The public key for security@jedarden.com is available at docs/security/pgp-public-key.asc.
NOTE: The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for security@jedarden.com. See docs/security/pgp-public-key.asc for generation instructions.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for:
- Development setup and build instructions
- Local validation checklist before opening a PR
- Commit message style (Conventional Commits)
- CI on forks (maintainer-triggered Argo workflow)
- DCO sign-off requirement
By participating in this project, you agree to abide by our Code of Conduct.
Status
Early development. See docs/plan/ for the implementation roadmap.