A PDF text extraction library that gets the hard parts right.
Find a file
jedarden 4f39a9b46c feat(pdftract-2ix9u): implement PageClass enum
Add the four canonical page classification variants (Vector, Scanned,
Hybrid, BrokenVector) with full serde support and Hash derive for use
in cache keying and routing tables.

Per INV-9 (stable taxonomy), these four variants are the complete set;
adding new variants requires a schema_version bump and an ADR.

Acceptance criteria:
- PASS: pdftract-core compiles with the new module
- PASS: Unit test serialize/deserialize roundtrip for each variant
- PASS: Unit test verifies PageClass is hashable and usable in HashMap
- PASS: Module docstring cites INV-9

Closes: pdftract-2ix9u
2026-05-25 01:07:08 -04:00
.cargo feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata 2026-05-24 17:31:16 -04:00
.ci/argo-workflows feat(pdftract-315s): implement WER CI gate and OCR CLI flags 2026-05-24 02:07:27 -04:00
.config ci(pdftract-5rvp9): add nextest configuration for CI 2026-05-23 11:42:44 -04:00
.git-hooks fix(pdftract-5z5d8): add pre-commit hook for provenance validation 2026-05-17 23:50:28 -04:00
.github docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates 2026-05-24 13:06:57 -04:00
.marathon feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
benches fix(pdftract-60h): fix bugs in benchmark runner script 2026-05-18 01:29:41 -04:00
build feat(glyph-shape): implement font corpus fetch script and shape DB generation 2026-05-24 09:48:29 -04:00
ci feat(pdftract-48ea): implement BrokenVector fixtures + WER delta CI gate 2026-05-24 10:52:41 -04:00
crates feat(pdftract-2ix9u): implement PageClass enum 2026-05-25 01:07:08 -04:00
distribution feat(pdftract-1eaxm): implement libpdftract C FFI library 2026-05-23 08:55:12 -04:00
docs docs(pdftract-5nare): add comprehensive FAQ with 24 questions 2026-05-25 00:22:48 -04:00
examples feat(pdftract-3zhf): add unified TableDetector::detect entry point 2026-05-24 00:51:59 -04:00
fuzz docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
notes docs(pdftract-2wif9): add verification note for Java publish workflow 2026-05-25 00:58:18 -04:00
pdftract-dotnet feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
pdftract-go fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup 2026-05-20 19:08:14 -04:00
pdftract-java feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
pdftract-node feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
profiles/builtin feat(profiles): implement built-in classification profiles (5.6.4) 2026-05-24 15:04:43 -04:00
proptest-regressions feat(pdftract-33v): implement property tests and nightly fuzz job 2026-05-22 23:13:13 -04:00
scripts feat(glyph-shape): implement font corpus fetch script and shape DB generation 2026-05-24 09:48:29 -04:00
src feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
templates/sdk-skeleton docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
tests test(pdftract-43jxa): implement TH-07 ps leak security test 2026-05-25 00:45:57 -04:00
tools feat(bf-2ervu): implement mmap-backed PdfSource via memmap2 2026-05-24 08:40:11 -04:00
xtask feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata 2026-05-24 17:31:16 -04:00
.gitignore feat(pdftract-juc): implement Standard 14 font metrics registry 2026-05-23 14:04:02 -04:00
.needle-predispatch-sha feat(pdftract-core): add run_tesseract integration and WER calculation 2026-05-24 01:12:33 -04:00
.nextest.toml ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix 2026-05-23 11:37:19 -04:00
.renovaterc.json docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule 2026-05-20 18:22:03 -04:00
audit.toml ci(pdftract-5gs4p): add cargo-audit configuration with allow-list 2026-05-23 11:11:25 -04:00
Cargo-dist.toml docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
Cargo.lock feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs 2026-05-24 17:01:53 -04:00
Cargo.toml docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
CHANGELOG.md feat(pdftract-2w02): implement MSRV gate with CI check 2026-05-20 19:03:53 -04:00
CLAUDE.md chore: update push remote to forgejo 2026-05-19 19:59:18 -04:00
clippy.toml feat(pdftract-xzfkt): implement caption block classifier 2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates 2026-05-24 13:06:57 -04:00
conformance_test feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
CONTRIBUTING.md docs(contributing): add Argo-CI caveat, DCO sign-off, and contributor templates 2026-05-24 06:00:48 -04:00
Cross.toml ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix 2026-05-23 11:37:19 -04:00
deny.toml ci(pdftract-1rljr): add cargo-deny quality gate configuration 2026-05-23 11:20:36 -04:00
Dockerfile feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support 2026-05-20 19:17:49 -04:00
LICENSE-APACHE docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
LICENSE-MIT docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
mod feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
pdftract-test-merged.cdx.json feat(pdftract-67tm8): implement MCP stdio transport with integration tests 2026-05-23 00:16:42 -04:00
README.md docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates 2026-05-24 13:06:57 -04:00
SECURITY.md docs(pdftract-58kz): add security policy documentation 2026-05-20 19:39:24 -04:00
test_api_null.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_empty feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_empty.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_flate.rs docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
test_pdf feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback 2026-05-23 20:53:25 -04:00
test_trailer_parsing.rs feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00

pdftract

MSRV

A PDF text extraction library that gets the hard parts right.

What it does

  • Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
  • Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
  • Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
  • Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
  • Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump

Output

{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}

Usage

pdftract extract invoice.pdf            # structured JSON to stdout
pdftract extract invoice.pdf --text     # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080              # HTTP service: POST /extract

Installation

If you have Rust toolchain installed, the quickest way to get a prebuilt binary is via cargo binstall:

cargo install cargo-binstall
cargo binstall pdftract

This downloads the appropriate binary for your platform from the GitHub Releases (2-3 seconds) instead of compiling from source.

Pre-built binaries

Download directly from GitHub Releases:

  • Linux (x86_64): pdftract-v*-x86_64-unknown-linux-musl.tar.gz
  • macOS (Apple Silicon): pdftract-v*-aarch64-apple-darwin.tar.gz
  • macOS (Intel): pdftract-v*-x86_64-apple-darwin.tar.gz
  • Windows: pdftract-v*-x86_64-pc-windows-gnu.zip

Build from source

cargo install pdftract --features full-render,ocr

See docs/notes/ for language-specific SDK installation examples (Python, Node.js, Go, Ruby, Java, Rust, Bash).

Architecture

Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.

See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.

Verifying Releases

All releases are signed using Sigstore keyless signing with OIDC from the iad-ci cluster. This provides cryptographic proof that artifacts were produced by the official CI/CD pipeline and haven't been tampered with.

Verify Binary Archives

To verify downloaded binary archives:

# Download release artifacts
gh release download vX.Y.Z --dir /tmp/pdftract-release

# Verify the SHA256SUMS signature
cosign verify-blob \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --signature SHA256SUMS.sig \
  --certificate SHA256SUMS.pem \
  SHA256SUMS

# Verify individual artifacts against checksums
sha256sum -c SHA256SUMS

Verify Docker Images

To verify Docker images before running them:

# Verify the main image
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:X.Y.Z

# Verify the OCR variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:ocr-X.Y.Z

# Verify the full variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:full-X.Y.Z

View SLSA Provenance

Each Docker image includes SLSA provenance attestation:

cosign verify-attestation \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --type slsaprovenance \
  ghcr.io/jedarden/pdftract:X.Y.Z

The provenance includes the build configuration, source commit, and builder identity.

Security

For responsible disclosure of security vulnerabilities, please email security@jedarden.com. See SECURITY.md for our disclosure policy, supported versions, and PGP key for encrypted reports.

PGP Key: The public key for security@jedarden.com is available at docs/security/pgp-public-key.asc.

NOTE: The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for security@jedarden.com. See docs/security/pgp-public-key.asc for generation instructions.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for:

  • Development setup and build instructions
  • Local validation checklist before opening a PR
  • Commit message style (Conventional Commits)
  • CI on forks (maintainer-triggered Argo workflow)
  • DCO sign-off requirement

By participating in this project, you agree to abide by our Code of Conduct.

Status

Early development. See docs/plan/ for the implementation roadmap.