pdftract
A PDF text extraction library that gets the hard parts right.
What it does
- Correct reading order: layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
- Font encoding recovery: when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
- Structure tree extraction: PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
- Per-page hybrid routing: each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
- Structured output with provenance: the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
Output
{
"pages": [
{
"page": 1,
"blocks": [
{ "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
{ "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
],
"spans": [
{ "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
]
}
],
"metadata": { "title": "...", "author": "...", "page_count": 10 }
}
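As a rough illustration of consuming this output, the sketch below parses the sample document above and walks its pages, blocks, and spans. It assumes only the fields shown in the sample; the helper names `low_confidence_spans` and `page_text` and the 0.9 threshold are hypothetical and not part of pdftract.

```python
import json

# The sample output from above, embedded so the sketch is self-contained.
SAMPLE = """
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700],
          "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}
"""


def low_confidence_spans(document, threshold=0.9):
    """Return (page_number, span) pairs whose confidence is below threshold.

    The threshold is an arbitrary illustrative value, not a pdftract default.
    """
    flagged = []
    for page in document.get("pages", []):
        for span in page.get("spans", []):
            if span.get("confidence", 0.0) < threshold:
                flagged.append((page["page"], span))
    return flagged


def page_text(page):
    """Join the text of a page's blocks in the order they appear."""
    return "\n".join(block["text"] for block in page.get("blocks", []))


document = json.loads(SAMPLE)
```

Because each span carries its own bounding box and confidence, a caller can route doubtful spans to manual review while still using the block-level text directly.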
Usage
pdftract extract invoice.pdf # structured JSON to stdout
pdftract extract invoice.pdf --text # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080 # HTTP service: POST /extract
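For scripting, the CLI can be driven from another program. The following is a minimal sketch that builds the `extract` invocations listed above and parses the JSON written to stdout. It assumes `pdftract` is on `PATH`; the function names are hypothetical, and whether `--text` and `--output` can be combined is not stated here, so the sketch does not rely on it.

```python
import json
import subprocess


def build_extract_command(pdf_path, text=False, output=None):
    """Build the argv list for `pdftract extract`, using only the flags shown above."""
    cmd = ["pdftract", "extract", str(pdf_path)]
    if text:
        cmd.append("--text")
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd


def extract_json(pdf_path):
    """Run the CLI and parse the structured JSON it writes to stdout.

    Assumes the pdftract binary is installed; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(
        build_extract_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.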
Installation
cargo binstall (recommended, fastest)
If you have a Rust toolchain installed, the quickest way to get a prebuilt binary is via cargo binstall:
cargo install cargo-binstall
cargo binstall pdftract
This downloads the appropriate prebuilt binary for your platform from GitHub Releases (2-3 seconds) instead of compiling from source.
Pre-built binaries
Download directly from GitHub Releases:
- Linux (x86_64): pdftract-v*-x86_64-unknown-linux-musl.tar.gz
- macOS (Apple Silicon): pdftract-v*-aarch64-apple-darwin.tar.gz
- macOS (Intel): pdftract-v*-x86_64-apple-darwin.tar.gz
- Windows: pdftract-v*-x86_64-pc-windows-gnu.zip
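An install script might pick the right archive automatically. The sketch below maps the values reported by Python's `platform` module to the archive patterns listed above and matches them against a list of asset names. The mapping keys reflect what `platform.system()` and `platform.machine()` typically report on each platform; the helper names are hypothetical.

```python
import fnmatch
import platform

# (platform.system(), platform.machine()) -> archive pattern from the list above.
ASSET_PATTERNS = {
    ("Linux", "x86_64"): "pdftract-v*-x86_64-unknown-linux-musl.tar.gz",
    ("Darwin", "arm64"): "pdftract-v*-aarch64-apple-darwin.tar.gz",
    ("Darwin", "x86_64"): "pdftract-v*-x86_64-apple-darwin.tar.gz",
    ("Windows", "AMD64"): "pdftract-v*-x86_64-pc-windows-gnu.zip",
}


def asset_pattern(system=None, machine=None):
    """Return the archive pattern for a platform, defaulting to the current one."""
    key = (system or platform.system(), machine or platform.machine())
    try:
        return ASSET_PATTERNS[key]
    except KeyError:
        raise LookupError(f"no pre-built binary listed for {key}") from None


def pick_asset(asset_names, pattern):
    """Return the first asset name matching the pattern, or None."""
    for name in asset_names:
        if fnmatch.fnmatchcase(name, pattern):
            return name
    return None
```

Platforms outside the list raise `LookupError`, which a caller could treat as a signal to fall back to building from source.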
Build from source
cargo install pdftract --features full-render,ocr
See docs/notes/ for language-specific SDK installation examples (Python, Node.js, Go, Ruby, Java, Rust, Bash).
Architecture
Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.
See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.
Verifying Releases
All releases are signed using Sigstore keyless signing with OIDC from the iad-ci cluster. This provides cryptographic proof that artifacts were produced by the official CI/CD pipeline and haven't been tampered with.
Verify Binary Archives
To verify downloaded binary archives:
# Download release artifacts
gh release download vX.Y.Z --dir /tmp/pdftract-release
# Verify the SHA256SUMS signature
cosign verify-blob \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --signature SHA256SUMS.sig \
  --certificate SHA256SUMS.pem \
  SHA256SUMS
# Verify individual artifacts against checksums
sha256sum -c SHA256SUMS
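On systems without `sha256sum`, the checksum step can be approximated in a few lines. This sketch assumes SHA256SUMS uses the usual `sha256sum` layout (hex digest, whitespace, optional `*`, file name); the function names are hypothetical. It covers only the checksum comparison and does not replace the `cosign verify-blob` signature check.

```python
import hashlib
from pathlib import Path


def parse_sha256sums(text):
    """Parse sha256sum-style lines into a {filename: digest} mapping."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # A leading '*' marks binary mode in sha256sum output; drop it.
        if name.startswith("*"):
            name = name[1:]
        entries[name] = digest.lower()
    return entries


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_checksums(directory, sums_text):
    """Return {filename: bool}; files that are missing count as failures."""
    results = {}
    for name, expected in parse_sha256sums(sums_text).items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of_file(path) == expected
    return results
```

A typical call would be `verify_checksums("/tmp/pdftract-release", Path("/tmp/pdftract-release/SHA256SUMS").read_text())`, treating any `False` value as a failed download.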
Verify Docker Images
To verify Docker images before running them:
# Verify the main image
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:X.Y.Z
# Verify the OCR variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:ocr-X.Y.Z
# Verify the full variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:full-X.Y.Z
View SLSA Provenance
Each Docker image includes SLSA provenance attestation:
cosign verify-attestation \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --type slsaprovenance \
  ghcr.io/jedarden/pdftract:X.Y.Z
The provenance includes the build configuration, source commit, and builder identity.
Security
For responsible disclosure of security vulnerabilities, please email security@jedarden.com. See SECURITY.md for our disclosure policy, supported versions, and PGP key for encrypted reports.
PGP Key: The public key for security@jedarden.com is available at docs/security/pgp-public-key.asc.
NOTE: The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for security@jedarden.com. See docs/security/pgp-public-key.asc for generation instructions.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for:
- Development setup and build instructions
- Local validation checklist before opening a PR
- Commit message style (Conventional Commits)
- CI on forks (maintainer-triggered Argo workflow)
- DCO sign-off requirement
By participating in this project, you agree to abide by our Code of Conduct.
Status
Early development. See docs/plan/ for the implementation roadmap.