A PDF text extraction library that gets the hard parts right.

jedarden b72d8312ce test(pdftract-57o4): add ParentTree integration tests for annotation and sparse arrays Add two comprehensive integration tests to validate the ParentTree resolver: 1. test_parent_tree_annotation_with_struct_parent: - Creates a body paragraph StructElem - Creates ParentTree with page array (MCID 0 -> body, MCID 1 -> orphan/null) - Creates ParentTree with annotation entry (key 100 -> body) - Verifies MCID resolution returns correct map and orphans - Verifies annotation /StructParent resolution returns the body ref - Verifies the referenced StructElem is in the tree 2. test_parent_tree_off_by_one_missing_entries: - Creates ParentTree with sparse array (only 3 entries for potentially more MCIDs) - Verifies non-null entries are correctly mapped - Verifies null entries are recorded as orphans - Documents that MCIDs beyond array length would be detected in Phase 7.1.4 Also export ParentTreeResolver and ParentTreeEntry from parser module for use by the block builder in Phase 7.1.4. All 67 struct_tree tests pass (18 ParentTree-specific tests). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>		2026-05-23 18:36:09 -04:00
.cargo	feat(pdftract-33v): implement property tests and nightly fuzz job	2026-05-22 23:13:13 -04:00
.ci/argo-workflows	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement	2026-05-23 13:22:55 -04:00
.config	ci(pdftract-5rvp9): add nextest configuration for CI	2026-05-23 11:42:44 -04:00
.git-hooks	fix(pdftract-5z5d8): add pre-commit hook for provenance validation	2026-05-17 23:50:28 -04:00
.github/ISSUE_TEMPLATE	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
benches	fix(pdftract-60h): fix bugs in benchmark runner script	2026-05-18 01:29:41 -04:00
crates	test(pdftract-57o4): add ParentTree integration tests for annotation and sparse arrays	2026-05-23 18:36:09 -04:00
distribution	feat(pdftract-1eaxm): implement libpdftract C FFI library	2026-05-23 08:55:12 -04:00
docs	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
examples	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
fuzz	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
notes	test(pdftract-57o4): add ParentTree integration tests for annotation and sparse arrays	2026-05-23 18:36:09 -04:00
pdftract-dotnet	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
pdftract-go	fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup	2026-05-20 19:08:14 -04:00
pdftract-java	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-node	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
profiles/builtin	docs(pdftract-4iier): complete per-profile README documentation	2026-05-18 00:35:35 -04:00
proptest-regressions	feat(pdftract-33v): implement property tests and nightly fuzz job	2026-05-22 23:13:13 -04:00
scripts	test(bf-5dnh1): add memory ceiling enforcement for proptests	2026-05-23 13:39:04 -04:00
src	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
templates/sdk-skeleton	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
tests	feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate	2026-05-23 15:04:05 -04:00
tools	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement	2026-05-23 13:22:55 -04:00
xtask	feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate	2026-05-23 15:04:05 -04:00
.gitignore	feat(pdftract-juc): implement Standard 14 font metrics registry	2026-05-23 14:04:02 -04:00
.needle-predispatch-sha	feat(pdftract-5nbp): implement /Differences overlay handler for font encodings	2026-05-23 18:09:46 -04:00
.nextest.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
.renovaterc.json	docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule	2026-05-20 18:22:03 -04:00
audit.toml	ci(pdftract-5gs4p): add cargo-audit configuration with allow-list	2026-05-23 11:11:25 -04:00
Cargo-dist.toml	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
Cargo.lock	feat(pdftract-4my): implement pdfium-render path behind full-render feature	2026-05-23 16:28:08 -04:00
Cargo.toml	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
CHANGELOG.md	feat(pdftract-2w02): implement MSRV gate with CI check	2026-05-20 19:03:53 -04:00
CLAUDE.md	chore: update push remote to forgejo	2026-05-19 19:59:18 -04:00
clippy.toml	feat(pdftract-2w02): pin MSRV to 1.78 with CI gate	2026-05-20 19:03:53 -04:00
conformance_test	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
CONTRIBUTING.md	docs(pdftract-16wv): add Apache NOTICE licensing documentation to CONTRIBUTING.md	2026-05-23 10:59:19 -04:00
Cross.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
deny.toml	ci(pdftract-1rljr): add cargo-deny quality gate configuration	2026-05-23 11:20:36 -04:00
Dockerfile	feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support	2026-05-20 19:17:49 -04:00
LICENSE-APACHE	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
LICENSE-MIT	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
mod	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
pdftract-test-merged.cdx.json	feat(pdftract-67tm8): implement MCP stdio transport with integration tests	2026-05-23 00:16:42 -04:00
README.md	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
SECURITY.md	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
test_api_null.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_empty	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_empty.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_flate.rs	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
test_trailer_parsing.rs	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00

README.md

pdftract

A PDF text extraction library that gets the hard parts right.

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
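
The glyph name lookup step mentioned above can be illustrated with the name-to-Unicode rules from the Adobe Glyph List specification. This is a minimal sketch, not pdftract's implementation: `AGL_EXCERPT` is a tiny hypothetical stand-in for the real glyph list, and the function names are made up for illustration.

```python
import re

# Hypothetical excerpt; the real Adobe Glyph List has thousands of entries.
AGL_EXCERPT = {"A": "A", "f": "f", "i": "i", "space": " ", "eacute": "\u00e9"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uniXXXX, uniXXXXYYYY, ...
_U = re.compile(r"u([0-9A-F]{4,6})")         # uXXXX up to uXXXXXX


def _valid_scalar(cp):
    # Surrogate code points are excluded by the specification.
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def component_to_unicode(component, table=AGL_EXCERPT):
    """Map one underscore-separated component to a string ('' if unknown)."""
    if component in table:
        return table[component]
    m = _UNI.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_valid_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return ""
    m = _U.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _valid_scalar(cp) else ""
    return ""


def glyph_name_to_unicode(name, table=AGL_EXCERPT):
    """Drop any '.suffix', split ligature components on '_', map each one."""
    base = name.split(".", 1)[0]
    return "".join(component_to_unicode(c, table) for c in base.split("_"))
```

An empty result means the name carried no usable information, which is the point at which a layered pipeline would fall through to its next recovery stage.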

Output

{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}
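
A consumer of this JSON might walk the pages to collect headings or to flag spans below a confidence cut-off. The sketch below uses only the field names shown in the sample above. The helper names and the threshold values are illustrative assumptions, not part of pdftract.

```python
import json

# The sample document from above, reproduced so the sketch is self-contained.
SAMPLE_OUTPUT = """
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        {"kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700]},
        {"kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670]}
      ],
      "spans": [
        {"text": "Introduction", "bbox": [72, 680, 400, 700],
         "font": "Times-Bold", "size": 14.0, "confidence": 0.99}
      ]
    }
  ],
  "metadata": {"title": "...", "author": "...", "page_count": 10}
}
"""


def headings(doc):
    """Return (page number, heading text) pairs in document order."""
    return [
        (page["page"], block["text"])
        for page in doc["pages"]
        for block in page["blocks"]
        if block["kind"] == "heading"
    ]


def low_confidence_spans(doc, threshold):
    """Return (page number, span) pairs whose confidence is below threshold."""
    return [
        (page["page"], span)
        for page in doc["pages"]
        for span in page["spans"]
        if span["confidence"] < threshold
    ]


doc = json.loads(SAMPLE_OUTPUT)
print(headings(doc))
print(low_confidence_spans(doc, threshold=0.9))
```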

Usage

pdftract extract invoice.pdf            # structured JSON to stdout
pdftract extract invoice.pdf --text     # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080              # HTTP service: POST /extract
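
From a script, the same commands can be driven through a subprocess. The following is a minimal sketch that assumes `pdftract` is on `PATH` and uses only the flags listed above. The helper names are hypothetical, and whether `--text` and `--output` can be combined is not stated here, so the sketch does not rely on it.

```python
import json
import subprocess


def build_extract_command(pdf_path, text=False, output=None):
    """Build the argument list for `pdftract extract` using the flags above."""
    cmd = ["pdftract", "extract", pdf_path]
    if text:
        cmd.append("--text")
    if output is not None:
        cmd += ["--output", output]
    return cmd


def extract_json(pdf_path):
    """Run the CLI and parse the structured JSON it writes to stdout."""
    result = subprocess.run(
        build_extract_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.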

Architecture

Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.

See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.

Verifying Releases

All releases are signed using Sigstore keyless signing with OIDC from the iad-ci cluster. This provides cryptographic proof that artifacts were produced by the official CI/CD pipeline and haven't been tampered with.

Verify Binary Archives

To verify downloaded binary archives:

# Download release artifacts, then run the remaining commands from that directory
gh release download vX.Y.Z --dir /tmp/pdftract-release
cd /tmp/pdftract-release

# Verify the SHA256SUMS signature
cosign verify-blob \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --signature SHA256SUMS.sig \
  --certificate SHA256SUMS.pem \
  SHA256SUMS

# Verify individual artifacts against checksums
sha256sum -c SHA256SUMS
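
On systems without `sha256sum`, the checksum step can be approximated in a few lines. This sketch assumes the usual `sha256sum` line format (hex digest, whitespace, file name, with an optional `*` binary-mode marker) and uses hypothetical helper names. It only checks digests, so it is meaningful only after the signature on SHA256SUMS has been verified.

```python
import hashlib
from pathlib import Path


def parse_sums(text):
    """Map file name -> expected hex digest from sha256sum-style lines."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        if name.startswith("*"):  # binary-mode marker used by sha256sum
            name = name[1:]
        expected[name] = digest.lower()
    return expected


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_directory(directory, sums_name="SHA256SUMS"):
    """Return {file name: True/False}; a missing file counts as a failure."""
    directory = Path(directory)
    expected = parse_sums((directory / sums_name).read_text())
    results = {}
    for name, digest in expected.items():
        target = directory / name
        results[name] = target.is_file() and sha256_of(target) == digest
    return results
```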

Verify Docker Images

To verify Docker images before running them:

# Verify the main image
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:X.Y.Z

# Verify the OCR variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:ocr-X.Y.Z

# Verify the full variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:full-X.Y.Z

View SLSA Provenance

Each Docker image is published with a SLSA provenance attestation, which can be verified and printed with:

cosign verify-attestation \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --type slsaprovenance \
  ghcr.io/jedarden/pdftract:X.Y.Z

The provenance includes the build configuration, source commit, and builder identity.
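
To read those fields, the command's output has to be decoded. The sketch below assumes each line that `cosign verify-attestation` prints is a JSON envelope whose `payload` field is a base64-encoded in-toto statement. The helper names are hypothetical, and the layout inside `predicate` differs between SLSA versions, so the sketch stops at the statement level.

```python
import base64
import json


def decode_attestation(line):
    """Decode one envelope line into the in-toto statement it carries."""
    envelope = json.loads(line)
    payload = base64.b64decode(envelope["payload"])
    return json.loads(payload)


def summarize(statement):
    """Pull out the fields that do not depend on the SLSA predicate version."""
    return {
        "predicate_type": statement.get("predicateType"),
        "subjects": [s.get("name") for s in statement.get("subject", [])],
    }
```

A typical use is to save the command's standard output to a file and call `summarize(decode_attestation(line))` on each non-empty line.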

Security

To report a security vulnerability responsibly, please email security@jedarden.com. See SECURITY.md for our disclosure policy, the supported versions, and the PGP key for encrypted reports.

PGP Key: The public key for security@jedarden.com is available at docs/security/pgp-public-key.asc.

NOTE: The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for security@jedarden.com. See docs/security/pgp-public-key.asc for generation instructions.

Status

Early development. See docs/plan/ for the implementation roadmap.