pdftract

A PDF text extraction library that gets the hard parts right.

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
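
The layered font encoding recovery described above can be pictured as a chain of fallbacks. The Python sketch below is illustrative only and is not pdftract's implementation: the function names, the tiny glyph-name table, and the return shape are assumptions made for this example. It covers only the first two layers (a ToUnicode lookup, then a glyph name lookup that also handles the `uniXXXX` naming convention) and stands in for font fingerprinting and outline shape matching with a final "unresolved" result.

```python
from typing import Optional

# Small illustrative excerpt of glyph name -> Unicode mappings in the style of
# the Adobe Glyph List. A real table has thousands of entries.
GLYPH_NAMES = {"A": "A", "space": " ", "eacute": "\u00e9", "fi": "\ufb01"}


def from_tounicode(code: int, glyph_name: str, cmap: dict) -> Optional[str]:
    """Layer 1: use the font's ToUnicode map when it has an entry for the code."""
    return cmap.get(code)


def from_glyph_name(code: int, glyph_name: str, cmap: dict) -> Optional[str]:
    """Layer 2: resolve by glyph name, including names of the form 'uniXXXX'."""
    if glyph_name in GLYPH_NAMES:
        return GLYPH_NAMES[glyph_name]
    if glyph_name.startswith("uni") and len(glyph_name) == 7:
        try:
            return chr(int(glyph_name[3:], 16))
        except ValueError:
            return None
    return None


# Hypothetical layer order; later layers (fingerprinting, shape matching)
# would be appended here in a fuller sketch.
LAYERS = [("tounicode", from_tounicode), ("glyph_name", from_glyph_name)]


def recover(code: int, glyph_name: str, cmap: dict) -> tuple:
    """Return (text, layer_name) from the first layer that produces a result."""
    for name, layer in LAYERS:
        text = layer(code, glyph_name, cmap)
        if text:
            return text, name
    # Nothing matched: emit the Unicode replacement character.
    return "\ufffd", "unresolved"
```

Recording which layer produced each character is one way a per-span confidence score could be derived, though the README does not say how pdftract computes confidence. Note that this sketch only handles absent ToUnicode entries; detecting wrong entries needs additional checks.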

Output

{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}
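
As a minimal sketch of consuming this output, the Python snippet below loads the sample document shown above and walks its blocks and spans. The field names come from the sample; the helper names, the 0.9 threshold, and the choice to treat a missing confidence as 0.0 are assumptions for illustration, not part of pdftract.

```python
import json

# The sample output from this README, embedded so the example is self-contained.
SAMPLE = """
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        {"kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700]},
        {"kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670]}
      ],
      "spans": [
        {"text": "Introduction", "bbox": [72, 680, 400, 700],
         "font": "Times-Bold", "size": 14.0, "confidence": 0.99}
      ]
    }
  ],
  "metadata": {"title": "...", "author": "...", "page_count": 10}
}
"""


def headings(doc):
    """Collect (page number, text) for every block whose kind is 'heading'."""
    return [
        (page["page"], block["text"])
        for page in doc["pages"]
        for block in page["blocks"]
        if block["kind"] == "heading"
    ]


def low_confidence_spans(doc, threshold=0.9):
    """Collect (page number, text) for spans below the threshold.

    A span without a confidence field is treated as 0.0 (an assumption),
    so it is always flagged for review.
    """
    flagged = []
    for page in doc["pages"]:
        for span in page["spans"]:
            if span.get("confidence", 0.0) < threshold:
                flagged.append((page["page"], span["text"]))
    return flagged


doc = json.loads(SAMPLE)
print(headings(doc))
print(low_confidence_spans(doc))
```

Filtering on the per-span confidence is one way to route uncertain text to manual review while accepting the rest automatically.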

Usage

pdftract extract invoice.pdf            # structured JSON to stdout
pdftract extract invoice.pdf --text     # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080              # HTTP service: POST /extract
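
The extract commands above can also be driven from a script. The following Python sketch wraps them with `subprocess`; it uses only the subcommand and flags shown above. The helper names are hypothetical, and the sketch assumes that `pdftract` is on `PATH` and exits with a non-zero status on failure, which this README does not state.

```python
import json
import subprocess
from typing import List, Optional


def build_extract_command(
    pdf_path: str, text: bool = False, output: Optional[str] = None
) -> List[str]:
    """Build the argument list for `pdftract extract` (hypothetical helper)."""
    cmd = ["pdftract", "extract", pdf_path]
    if text:
        cmd.append("--text")  # plain text instead of structured JSON
    if output is not None:
        cmd += ["--output", output]  # write to a file instead of stdout
    return cmd


def extract_json(pdf_path: str) -> dict:
    """Run the extractor and parse the structured JSON it prints to stdout.

    Assumes a non-zero exit status signals failure; check=True then raises
    subprocess.CalledProcessError.
    """
    result = subprocess.run(
        build_extract_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.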

Installation

cargo binstall (recommended, fastest)

If you have the Rust toolchain installed, the quickest way to get a prebuilt binary is via cargo binstall:

cargo install cargo-binstall
cargo binstall pdftract

This downloads the appropriate prebuilt binary for your platform from GitHub Releases, which takes 2-3 seconds, instead of compiling from source.

Pre-built binaries

Download directly from GitHub Releases:

Linux (x86_64): pdftract-v*-x86_64-unknown-linux-musl.tar.gz
macOS (Apple Silicon): pdftract-v*-aarch64-apple-darwin.tar.gz
macOS (Intel): pdftract-v*-x86_64-apple-darwin.tar.gz
Windows: pdftract-v*-x86_64-pc-windows-gnu.zip
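
For scripted installs, the archive name can be derived from the host platform. The Python sketch below maps the values reported by the standard `platform` module to the four archive patterns listed above. It assumes the `*` in each pattern is the release version number; the function name and the example version are hypothetical.

```python
import platform

# (platform.system(), platform.machine()) -> (target triple, archive extension),
# taken from the archive list above.
TARGETS = {
    ("Linux", "x86_64"): ("x86_64-unknown-linux-musl", "tar.gz"),
    ("Darwin", "arm64"): ("aarch64-apple-darwin", "tar.gz"),
    ("Darwin", "x86_64"): ("x86_64-apple-darwin", "tar.gz"),
    ("Windows", "AMD64"): ("x86_64-pc-windows-gnu", "zip"),
}


def archive_name(version, system=None, machine=None):
    """Return the release archive name for a platform.

    Defaults to the current host. Raises LookupError for platforms that
    have no prebuilt archive in the list above.
    """
    key = (system or platform.system(), machine or platform.machine())
    if key not in TARGETS:
        raise LookupError(f"no prebuilt archive for {key[0]} on {key[1]}")
    triple, extension = TARGETS[key]
    return f"pdftract-v{version}-{triple}.{extension}"


# Hypothetical version number, for illustration only.
print(archive_name("1.2.3", "Linux", "x86_64"))
```

Platforms outside this table, such as Linux on ARM, would fall back to building from source.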

Build from source

cargo install pdftract --features full-render,ocr

See docs/notes/ for language-specific SDK installation examples (Python, Node.js, Go, Ruby, Java, Rust, Bash).

Architecture

The core is written in Rust, with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice; the container deployment is just pdftract serve.

See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.

Verifying Releases

All releases are signed using Sigstore keyless signing with OIDC from the iad-ci cluster. This provides cryptographic proof that artifacts were produced by the official CI/CD pipeline and haven't been tampered with.

Verify Binary Archives

To verify downloaded binary archives:

# Download release artifacts, then work from that directory
gh release download vX.Y.Z --dir /tmp/pdftract-release
cd /tmp/pdftract-release

# Verify the SHA256SUMS signature
cosign verify-blob \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --signature SHA256SUMS.sig \
  --certificate SHA256SUMS.pem \
  SHA256SUMS

# Verify individual artifacts against checksums
sha256sum -c SHA256SUMS
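
On systems without `sha256sum`, the checksum step can be reproduced with a short script. The Python sketch below assumes SHA256SUMS uses the conventional `sha256sum` line format (hex digest, whitespace, optional `*`, file name); the function names and status strings are hypothetical. It covers only the checksum comparison and does not replace the cosign signature check on SHA256SUMS itself.

```python
import hashlib
from pathlib import Path


def parse_sums(text):
    """Parse sha256sum-style lines into {file name: lowercase hex digest}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        if name.startswith("*"):  # binary-mode marker used by sha256sum
            name = name[1:]
        entries[name] = digest.lower()
    return entries


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_path, directory):
    """Return {file name: 'OK' | 'FAILED' | 'MISSING'} for every listed file."""
    results = {}
    listed = parse_sums(Path(sums_path).read_text())
    for name, expected in listed.items():
        target = Path(directory) / name
        if not target.is_file():
            results[name] = "MISSING"
        elif sha256_of(target) == expected:
            results[name] = "OK"
        else:
            results[name] = "FAILED"
    return results
```

Calling `verify_checksums("SHA256SUMS", ".")` from the download directory reports one status per listed file; anything other than `OK` should be treated as a failed verification.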

Verify Docker Images

To verify Docker images before running them:

# Verify the main image
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:X.Y.Z

# Verify the OCR variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:ocr-X.Y.Z

# Verify the full variant
cosign verify \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  ghcr.io/jedarden/pdftract:full-X.Y.Z

View SLSA Provenance

Each Docker image includes a SLSA provenance attestation:

cosign verify-attestation \
  --certificate-identity-regexp 'https://iad-ci-oidc.ardenone.com.*' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com' \
  --type slsaprovenance \
  ghcr.io/jedarden/pdftract:X.Y.Z

The provenance includes the build configuration, source commit, and builder identity.
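
To inspect the provenance programmatically, the verified attestation can be decoded. The Python sketch below assumes that cosign prints the attestation as a JSON envelope whose `payload` field holds a base64-encoded in-toto statement; the function names are hypothetical, and the layout of the `predicate` object depends on the SLSA version in use, so only top-level fields are read here.

```python
import base64
import json


def decode_statement(envelope_json):
    """Decode the in-toto statement carried in an attestation envelope.

    Assumes the envelope is a JSON object with a base64-encoded 'payload'.
    """
    envelope = json.loads(envelope_json)
    return json.loads(base64.b64decode(envelope["payload"]))


def summarize(statement):
    """Pull out the predicate type and the (name, sha256) of each subject."""
    return {
        "predicate_type": statement.get("predicateType"),
        "subjects": [
            (subject.get("name"), subject.get("digest", {}).get("sha256"))
            for subject in statement.get("subject", [])
        ],
    }
```

Comparing the subject digest against the digest of the image you pulled confirms that the attestation describes that exact image.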

Security

For responsible disclosure of security vulnerabilities, please email security@jedarden.com. See SECURITY.md for our disclosure policy, supported versions, and the PGP key for encrypted reports.

PGP Key: The public key for security@jedarden.com is available at docs/security/pgp-public-key.asc.

NOTE: The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for security@jedarden.com. See docs/security/pgp-public-key.asc for generation instructions.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for:

Development setup and build instructions
Local validation checklist before opening a PR
Commit message style (Conventional Commits)
CI on forks (maintainer-triggered Argo workflow)
DCO sign-off requirement

By participating in this project, you agree to abide by our Code of Conduct.

Status

Early development. See docs/plan/ for the implementation roadmap.