Implement all 14 environment checks for the `pdftract doctor` subcommand. Each check returns a CheckResult with a status (OK/WARN/FAIL/NotApplicable) and a human-readable detail message.
Checks implemented:
- pdftract binary (version, git SHA, compiled features)
- tesseract install (version check: >=5 OK, ==4 WARN, <=3 FAIL)
- tesseract languages (eng + requested langs present)
- leptonica install (>=1.79 OK, older WARN, not found FAIL)
- libtiff (pkg-config check with ldconfig fallback)
- libopenjp2 (pkg-config check with ldconfig fallback)
- pdfium native lib (version >=6555 OK, older WARN, not found FAIL)
- network reachability (HEAD example.com with 5s timeout)
- cache directory (writable, free space >=1 GiB, layout version)
- profile search path (YAML parse, PROFILE_SECRETS_FORBIDDEN detection)
- ulimit -n (>=1024 OK, 512-1024 WARN, <512 FAIL)
- available RAM (>=256 MiB OK, 128-256 WARN, <128 FAIL)
- system locale (UTF-8 OK, non-UTF-8 WARN, unset FAIL)
- temp dir writable (writable + free space >=100 MiB)
Core module with Check trait, CheckResult, CheckStatus, DoctorCtx, DoctorFeatures, and panic-safe run_check_safe wrapper. Build script injects GIT_SHA and COMPILED_FEATURES at compile time. All checks are feature-gated appropriately (ocr, full-render, remote, profiles).
Co-Authored-By: Claude Code <noreply@anthropic.com>
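The threshold rules in the `pdftract doctor` commit message above can be sketched as small classification functions plus a panic-catching wrapper. This is a minimal illustration, not the project's actual code: the names `CheckStatus`, `CheckResult`, and `run_check_safe` come from the commit message, but their fields and signatures, the helpers `classify_tesseract_major` and `classify_open_files`, and the choice to report a panic as FAIL are all assumptions.

```rust
use std::panic::{self, AssertUnwindSafe};

/// The four statuses named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckStatus {
    Ok,
    Warn,
    Fail,
    NotApplicable,
}

/// Assumed shape: a status plus a human-readable detail message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckResult {
    pub status: CheckStatus,
    pub detail: String,
}

/// Tesseract rule: major version >=5 is OK, ==4 is WARN, <=3 is FAIL.
pub fn classify_tesseract_major(major: u32) -> CheckStatus {
    match major {
        0..=3 => CheckStatus::Fail,
        4 => CheckStatus::Warn,
        _ => CheckStatus::Ok,
    }
}

/// `ulimit -n` rule: >=1024 is OK, <512 is FAIL, anything between is WARN.
/// The boundary value 1024 is treated as OK here (an assumption).
pub fn classify_open_files(limit: u64) -> CheckStatus {
    if limit >= 1024 {
        CheckStatus::Ok
    } else if limit >= 512 {
        CheckStatus::Warn
    } else {
        CheckStatus::Fail
    }
}

/// Hypothetical panic-safe wrapper: runs the check and turns a panic
/// into a FAIL result so one broken check cannot abort the whole report.
pub fn run_check_safe<F>(name: &str, check: F) -> CheckResult
where
    F: FnOnce() -> CheckResult,
{
    match panic::catch_unwind(AssertUnwindSafe(check)) {
        Ok(result) => result,
        Err(_) => CheckResult {
            status: CheckStatus::Fail,
            detail: format!("check `{name}` panicked"),
        },
    }
}
```

Keeping the threshold logic in pure functions like these makes each rule testable without the underlying tool installed; only the code that discovers the version or limit needs the real environment.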
pdftract
A PDF text extraction library that gets the hard parts right.
What it does
- Correct reading order: layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
- Font encoding recovery: when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
- Structure tree extraction: PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
- Per-page hybrid routing: each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
- Structured output with provenance: the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
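The layered font-encoding recovery described in the list above can be pictured as an ordered chain of strategies, where the first one that produces a character wins. The sketch below is illustrative only and covers just the glyph-name layer: `recover`, `lookup_glyph_list`, `parse_uni_name`, and the three-entry table are hypothetical stand-ins, not pdftract's API. The `uniXXXX` naming convention (four uppercase hex digits) is taken from the Adobe Glyph List specification; font fingerprinting and outline matching would slot in as further stages and are omitted here.

```rust
/// One recovery strategy: maps a glyph name to a Unicode scalar, if it can.
type Strategy = fn(&str) -> Option<char>;

/// Stage 1 (illustrative): a tiny stand-in for the Adobe Glyph List table.
fn lookup_glyph_list(name: &str) -> Option<char> {
    match name {
        "A" => Some('A'),
        "eacute" => Some('\u{00E9}'),
        "space" => Some(' '),
        _ => None,
    }
}

/// Stage 2: `uniXXXX` names, where XXXX is exactly four uppercase hex digits.
fn parse_uni_name(name: &str) -> Option<char> {
    let hex = name.strip_prefix("uni")?;
    let is_upper_hex = |b: u8| b.is_ascii_digit() || (b'A'..=b'F').contains(&b);
    if hex.len() != 4 || !hex.bytes().all(is_upper_hex) {
        return None;
    }
    // `char::from_u32` rejects surrogate code points such as D800.
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}

/// Tries each stage in order and reports which one succeeded.
pub fn recover(name: &str) -> Option<(&'static str, char)> {
    const STAGES: [(&str, Strategy); 2] = [
        ("glyph-list", lookup_glyph_list as Strategy),
        ("uni-name", parse_uni_name as Strategy),
    ];
    STAGES
        .iter()
        .find_map(|(label, stage)| stage(name).map(|c| (*label, c)))
}
```

Returning the stage label alongside the character is one way to keep provenance for later confidence scoring: a table hit could be trusted more than a shape match, for instance.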
Output
{
"pages": [
{
"page": 1,
"blocks": [
{ "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
{ "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
],
"spans": [
{ "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
]
}
],
"metadata": { "title": "...", "author": "...", "page_count": 10 }
}
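For consumers written in Rust, the JSON above could be modelled with plain structs along the following lines. This is a sketch under assumptions, not the crate's published types: the field names mirror the sample output, but the numeric types, the reading of `bbox` as four coordinates in page units, and the helpers `plain_text`, `low_confidence_spans`, and `sample_page` are all hypothetical.

```rust
/// A run of text with provenance, mirroring one entry of `"spans"`.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub bbox: [f64; 4], // assumed to be four coordinates in page units
    pub font: String,
    pub size: f64,
    pub confidence: f64,
}

/// A logical block, mirroring one entry of `"blocks"`.
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub kind: String,
    pub text: String,
    pub bbox: [f64; 4],
}

/// One entry of `"pages"`.
#[derive(Debug, Clone, PartialEq)]
pub struct Page {
    pub page: u32,
    pub blocks: Vec<Block>,
    pub spans: Vec<Span>,
}

impl Page {
    /// Joins block texts with newlines (a hypothetical flat-text view).
    pub fn plain_text(&self) -> String {
        self.blocks
            .iter()
            .map(|b| b.text.as_str())
            .collect::<Vec<_>>()
            .join("\n")
    }

    /// Spans whose confidence is strictly below `threshold`.
    pub fn low_confidence_spans(&self, threshold: f64) -> Vec<&Span> {
        self.spans
            .iter()
            .filter(|s| s.confidence < threshold)
            .collect()
    }
}

/// Builds the page from the sample output by hand.
pub fn sample_page() -> Page {
    Page {
        page: 1,
        blocks: vec![
            Block {
                kind: "heading".into(),
                text: "Introduction".into(),
                bbox: [72.0, 680.0, 400.0, 700.0],
            },
            Block {
                kind: "paragraph".into(),
                text: "...".into(),
                bbox: [72.0, 640.0, 540.0, 670.0],
            },
        ],
        spans: vec![Span {
            text: "Introduction".into(),
            bbox: [72.0, 680.0, 400.0, 700.0],
            font: "Times-Bold".into(),
            size: 14.0,
            confidence: 0.99,
        }],
    }
}
```

A filter like `low_confidence_spans` is the kind of check that per-span confidence makes possible, for example to flag regions of a page for manual review.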
Usage
pdftract extract invoice.pdf # structured JSON to stdout
pdftract extract invoice.pdf --text # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080 # HTTP service: POST /extract
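The same commands can be driven from a Rust program through `std::process::Command`. The sketch below uses only the `extract` subcommand and `--text` flag shown above; the helper names `extract_command` and `extract_to_string` are hypothetical, and it assumes that `pdftract` is on `PATH`, signals failure with a non-zero exit status, and writes UTF-8 to stdout.

```rust
use std::io;
use std::process::Command;

/// Builds `pdftract extract <pdf>` and adds `--text` when plain text is wanted.
pub fn extract_command(pdf: &str, plain_text: bool) -> Command {
    let mut cmd = Command::new("pdftract");
    cmd.arg("extract").arg(pdf);
    if plain_text {
        cmd.arg("--text");
    }
    cmd
}

/// Runs the command and returns its stdout as a string.
/// Assumes a non-zero exit status means extraction failed.
pub fn extract_to_string(pdf: &str, plain_text: bool) -> io::Result<String> {
    let output = extract_command(pdf, plain_text).output()?;
    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, stderr));
    }
    String::from_utf8(output.stdout)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Separating command construction from execution lets the argument list be inspected or logged without the binary being installed.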
Architecture
Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.
See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.
Verifying Releases
All releases are signed using Sigstore keyless signing with OIDC from the iad-ci cluster. This provides cryptographic proof that artifacts were produced by the official CI/CD pipeline and haven't been tampered with.
Verify Binary Archives
To verify downloaded binary archives:
# Download release artifacts
gh release download vX.Y.Z --dir /tmp/pdftract-release
# Run the remaining commands from the download directory
cd /tmp/pdftract-release
# Verify the SHA256SUMS signature
cosign verify-blob \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
--signature SHA256SUMS.sig \
--certificate SHA256SUMS.pem \
SHA256SUMS
# Verify individual artifacts against checksums
sha256sum -c SHA256SUMS
Verify Docker Images
To verify Docker images before running them:
# Verify the main image
cosign verify \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
ghcr.io/jedarden/pdftract:X.Y.Z
# Verify the OCR variant
cosign verify \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
ghcr.io/jedarden/pdftract:ocr-X.Y.Z
# Verify the full variant
cosign verify \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
ghcr.io/jedarden/pdftract:full-X.Y.Z
View SLSA Provenance
Each Docker image includes SLSA provenance attestation:
cosign verify-attestation \
--certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
--certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
--type slsaprovenance \
ghcr.io/jedarden/pdftract:X.Y.Z
The provenance includes the build configuration, source commit, and builder identity.
Security
For responsible disclosure of security vulnerabilities, please email security@jedarden.com. See SECURITY.md for our disclosure policy, supported versions, and PGP key for encrypted reports.
PGP Key: The public key for security@jedarden.com is available at docs/security/pgp-public-key.asc.
NOTE: The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for security@jedarden.com. See docs/security/pgp-public-key.asc for generation instructions.
Status
Early development. See docs/plan/ for the implementation roadmap.