A PDF text extraction library that gets the hard parts right.
jedarden f731ffee4a docs: improve README for clarity and discoverability
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-24 15:50:54 -04:00

pdftract


pdftract is a pure-Rust PDF text extraction library built for the cases where other tools give up: scanned documents, unusual font encodings, multi-column layouts, footnotes, mixed-mode pages, and encrypted files. Where most extractors treat PDF text extraction as a coordinate sort, pdftract runs a full reading-order pipeline — segmenting layout regions, recovering broken font encodings, routing each page to the right extraction mode (vector, OCR, or hybrid), and emitting structured JSON with per-span provenance. If your PDFs are academic papers, legal filings, financial reports, or anything else that wasn't typeset in a word processor, pdftract is what you want.

How it compares

Capability pdftract pdfplumber pypdf pdfminer
Multi-column reading order Full layout segmentation ⚠ Heuristic ⚠ Partial
Footnotes & sidebars
Font encoding recovery Glyph name → fingerprint → shape ⚠ ToUnicode only ⚠ ToUnicode only ⚠ ToUnicode only
Scanned / mixed PDF (OCR) Per-page hybrid routing
PDF/UA structure tree ⚠ Partial
PDF decryption (RC4/AES) (decrypt feature) ⚠ Partial ⚠ Partial ⚠ Partial
Per-span bounding boxes + confidence ⚠ Partial
Streaming extraction (large files)
CJK scripts (cjk feature)
HTTP microservice mode (serve)
Language Rust + Python + C ABI Python Python Python

Platform Support

| Platform | Status |
| --- | --- |
| Linux x86_64 | Fully CI-tested on every PR |
| Linux aarch64 | Fully CI-tested on every PR |
| macOS x86_64 | Build-tested; manually smoke-tested per release |
| macOS aarch64 | Build-tested; manually smoke-tested per release |
| Windows x86_64 | Build-tested; manually smoke-tested per release |

See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

Minimum Supported Rust Version (MSRV): 1.78

Cargo

cargo add pdftract-core

Or install the CLI:

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

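use pdftract_core::{extract_pdf, ExtractionOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Start from the default extraction options.
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("report.pdf", &opts)?;

    // Each page exposes its number and the text spans found on it.
    for page in &doc.pages {
        println!("Page {}: {} spans", page.number, page.spans.len());
    }
    Ok(())
}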

Streaming extraction for large files:

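use pdftract_core::{extract_pdf_streaming, ExtractionOptions};

let opts = ExtractionOptions::default();

// Pages are yielded one at a time; `process` stands in for your own handler.
for page in extract_pdf_streaming("large.pdf", &opts)? {
    let page = page?;
    process(page);
}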

NDJSON output (one JSON object per page on stdout):

pdftract_core::extract_pdf_ndjson("report.pdf", &opts, std::io::stdout())?;
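
On the consuming side, NDJSON can be read one line at a time. The following is a minimal Python sketch, not part of pdftract. It assumes each line is a page object with the same `spans` / `text` shape used in the Python quickstart. The NDJSON schema is not spelled out here, so the field names and the `iter_pages` helper are illustrative only.

```python
import io
import json

def iter_pages(stream):
    """Yield one decoded page object per non-empty NDJSON line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for real extractor output; the field names are assumed, not confirmed.
sample = io.StringIO(
    '{"number": 1, "spans": [{"text": "Hello"}]}\n'
    '{"number": 2, "spans": [{"text": "World"}, {"text": "!"}]}\n'
)

pages = list(iter_pages(sample))
span_counts = [len(page["spans"]) for page in pages]
print(span_counts)  # [1, 2]
```

In practice the stream would be a pipe or file handle rather than an in-memory buffer, and pages could be handled as they arrive instead of being collected into a list.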

Python

import pdftract

doc = pdftract.extract("report.pdf")
print(f"{doc['metadata']['page_count']} pages")

for page in doc["pages"]:
    for span in page["spans"]:
        print(span["text"], span["bbox"], span["confidence"])

CLI

# Extract to JSON
pdftract extract report.pdf --json output.json

# Plain text to stdout
pdftract extract report.pdf --text -

# Markdown output
pdftract extract report.pdf --markdown -

# Run as an HTTP microservice (POST /extract, GET /health)
pdftract serve --port 8080

# Compare two PDFs structurally
pdftract compare original.pdf revised.pdf

# Interactive page inspector
pdftract inspect report.pdf --page 3

# Diagnose extraction problems on a file
pdftract doctor report.pdf

# Validate PDF/UA or PDF/A conformance
pdftract validate report.pdf

# Stable content hash (for dedup / cache keys)
pdftract hash report.pdf

# Search for a pattern across pages
pdftract grep "invoice number" report.pdf

# Print page count and dimensions
pdftract pages report.pdf

# Classify each page (vector / scanned / mixed)
pdftract classify report.pdf

# Manage the local extraction cache
pdftract cache --list
pdftract cache --clear

# Migrate the local cache schema
pdftract migrate

# Verify a previously issued extraction receipt
pdftract verify-receipt receipt.json

# Generate client bindings from the C ABI headers
pdftract codegen --lang python

# Start the MCP (Model Context Protocol) server
pdftract mcp

Features

Core extraction works out of the box. Optional features add capabilities that pull in heavier dependencies:

| Feature | What it adds | Enable with |
| --- | --- | --- |
| ocr | Tesseract/Leptonica OCR for scanned and mixed pages | `cargo add pdftract-core --features ocr` |
| decrypt | RC4, AES-128, AES-256 PDF decryption | `cargo add pdftract-core --features decrypt` |
| cjk | CJK script support (Chinese, Japanese, Korean) | `cargo add pdftract-core --features cjk` |
| full-render | Full-page rasterization for assisted OCR and inspect UI | `cargo add pdftract-core --features full-render` |

In the Python wheel and Docker image, ocr, decrypt, and cjk are pre-enabled.

What it does

Correct reading order. Most PDF extractors sort glyphs by Y then X coordinate. That breaks on multi-column articles, legal documents with sidebars, academic papers with footnotes, and anything typeset in a non-linear flow. pdftract segments each page into layout regions first, orders the regions, then emits text within each region — so the output reads the way a human would.
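
The difference between the two orderings shows up on a toy two-column page. This Python sketch is not pdftract's algorithm: the `Span` type and the fixed `gutter_x` split are hypothetical stand-ins for real layout segmentation, used only to show why sorting by Y then X interleaves columns.

```python
from dataclasses import dataclass

@dataclass
class Span:
    x: float
    y: float  # distance from the top of the page
    text: str

def naive_order(spans):
    """Sort by Y, then X: interleaves lines from side-by-side columns."""
    return [s.text for s in sorted(spans, key=lambda s: (s.y, s.x))]

def region_first_order(spans, gutter_x):
    """Assign spans to a column first, then sort within each column."""
    left = [s for s in spans if s.x < gutter_x]
    right = [s for s in spans if s.x >= gutter_x]
    key = lambda s: (s.y, s.x)
    return [s.text for s in sorted(left, key=key) + sorted(right, key=key)]

# Two columns (A on the left, B on the right), two lines each.
spans = [
    Span(300, 100, "B1"), Span(50, 100, "A1"),
    Span(50, 120, "A2"), Span(300, 120, "B2"),
]
naive = naive_order(spans)
segmented = region_first_order(spans, gutter_x=200)
print(naive)      # ['A1', 'B1', 'A2', 'B2']
print(segmented)  # ['A1', 'A2', 'B1', 'B2']
```

A real segmenter has to discover the regions itself (columns, sidebars, footnotes) rather than take a gutter position as input, but the ordering step afterwards follows the same idea.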

Font encoding recovery. PDFs can legally omit ToUnicode CMaps and describe only glyph IDs. When that happens, other extractors emit garbage or question marks. pdftract works through a layered recovery pipeline: glyph name lookup (standard and Adobe glyph lists), font fingerprinting against a known-font database, and finally glyph outline shape matching. Most documents that trip up other tools extract cleanly.
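
The layered pipeline can be pictured as a chain of fallbacks where each stage runs only if the previous one found nothing. The Python sketch below is illustrative: the lookup tables, dictionary keys (`name`, `font_hash`, `gid`, `shape`) and function names are all hypothetical and do not reflect pdftract's internals.

```python
# Toy lookup tables. A real pipeline would consult the Adobe Glyph List,
# a known-font database, and outline comparison instead.
GLYPH_NAMES = {"A": "A", "fi": "\ufb01"}
FONT_FINGERPRINTS = {("font-abc123", 17): "e"}
SHAPE_MATCHES = {"round-closed": "o"}

def by_glyph_name(glyph):
    return GLYPH_NAMES.get(glyph.get("name"))

def by_fingerprint(glyph):
    return FONT_FINGERPRINTS.get((glyph.get("font_hash"), glyph.get("gid")))

def by_shape(glyph):
    return SHAPE_MATCHES.get(glyph.get("shape"))

def recover(glyph):
    """Try each stage in order; return (char, stage_name) or (None, None)."""
    for stage in (by_glyph_name, by_fingerprint, by_shape):
        char = stage(glyph)
        if char is not None:
            return char, stage.__name__
    return None, None

named = recover({"name": "A"})
fingerprinted = recover({"font_hash": "font-abc123", "gid": 17})
shaped = recover({"shape": "round-closed"})
unknown = recover({})
print(named, fingerprinted, shaped, unknown)
```

Returning the stage name alongside the character keeps a record of how each glyph was recovered, which is the kind of information a per-span confidence score could draw on.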

Per-page hybrid routing. Each page is independently classified as vector text, fully scanned (image-only), or mixed. Vector pages go through the fast extraction path. Scanned pages go to full OCR. Mixed pages use assisted OCR — vector spans anchor the OCR so it doesn't drift. This means one call handles an entire document regardless of how it was produced.
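
As a rough illustration of the routing decision, the Python sketch below classifies a page from two toy signals and maps the class to an extraction path. The signals, the 0.5 threshold, and the function names are assumptions made for the example, and the actual classifier may use different inputs.

```python
def classify_page(text_span_count, image_coverage):
    """Classify a page from two toy signals (the threshold is illustrative)."""
    has_text = text_span_count > 0
    mostly_image = image_coverage >= 0.5
    if has_text and mostly_image:
        return "mixed"
    if has_text:
        return "vector"
    # No vector text at all: treat the page as image-only.
    return "scanned"

# Extraction path per page class, as described in the paragraph above.
ROUTES = {
    "vector": "fast vector extraction",
    "scanned": "full OCR",
    "mixed": "assisted OCR",
}

def route(text_span_count, image_coverage):
    return ROUTES[classify_page(text_span_count, image_coverage)]

print(route(120, 0.0))   # fast vector extraction
print(route(0, 0.95))    # full OCR
print(route(12, 0.8))    # assisted OCR
```

Because the decision is made per page, a single document can pass through all three paths in one call.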

Structure tree extraction. PDF/UA and PDF/A files carry a logical structure tree (headings, paragraphs, tables, lists) separate from the visual rendering. pdftract reads this directly when present, so accessible PDFs yield structured output without heuristics.
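
To show what reading the logical structure directly looks like, here is a small Python sketch that walks a toy structure tree and builds a heading outline. The nested-dict representation and the `outline` helper are hypothetical. Only the element type names (`Document`, `H1`, `H2`, `P`) are borrowed from the PDF standard structure types.

```python
def outline(node):
    """Walk a toy structure tree depth-first and collect indented headings."""
    lines = []
    kind = node["type"]
    # Numbered headings (H1, H2, ...) are indented by their level.
    if kind.startswith("H") and kind[1:].isdigit():
        level = int(kind[1:])
        lines.append("  " * (level - 1) + node["text"])
    for child in node.get("children", []):
        lines.extend(outline(child))
    return lines

tree = {
    "type": "Document",
    "children": [
        {"type": "H1", "text": "Introduction"},
        {"type": "P", "text": "Body text."},
        {"type": "H2", "text": "Scope"},
        {"type": "P", "text": "More body text."},
    ],
}

headings = outline(tree)
print(headings)  # ['Introduction', '  Scope']
```

No font-size or position heuristics are involved: the heading levels come straight from the tagged structure.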

Structured output with provenance. The primary output format is JSON. Every text span carries its bounding box, font name, point size, and a confidence score. This makes pdftract suitable as a preprocessing step for LLM pipelines, document indexing, and data extraction workflows that need to trace output back to the source page.
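
One way a downstream step might use this provenance is to flag low-confidence spans for review. The Python sketch below runs on a hand-written sample document. The `text`, `bbox` and `confidence` keys follow the Python quickstart, while the 0 to 1 confidence scale, the bounding-box layout, the 0.8 threshold, and the `low_confidence_spans` helper are assumptions.

```python
# Hand-written sample in the shape used by the Python quickstart.
doc = {
    "pages": [
        {"spans": [
            {"text": "Total due:", "bbox": [72, 700, 140, 712], "confidence": 0.99},
            {"text": "$1,2O4.00", "bbox": [150, 700, 210, 712], "confidence": 0.61},
        ]},
    ]
}

def low_confidence_spans(doc, threshold=0.8):
    """Return (page_index, span) pairs whose confidence is below threshold."""
    return [
        (page_index, span)
        for page_index, page in enumerate(doc["pages"])
        for span in page["spans"]
        if span["confidence"] < threshold
    ]

flagged = low_confidence_spans(doc)
for page_index, span in flagged:
    print(page_index, span["text"], span["bbox"])
```

Keeping the bounding box with each flagged span lets a reviewer jump to the exact location on the source page.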

Streaming extraction. For large files, extract_pdf_streaming yields one page at a time so memory usage stays bounded regardless of document length.

Available SDKs

pdftract ships multiple integration surfaces from a single Rust core:

| SDK | Package | Notes |
| --- | --- | --- |
| Rust library | pdftract-core on crates.io | Primary API |
| CLI binary | pdftract on crates.io | Wraps the library |
| Python bindings | pdftract on PyPI | PyO3-based, wheels for Linux/macOS/Windows |
| C shared library | libpdftract | Stable C ABI; use pdftract codegen to generate FFI headers for your language |
| Docker image | ronaldraygun/pdftract | Includes serve mode HTTP microservice |
| HTTP microservice | pdftract serve | REST API for language-agnostic integration |

Additional language SDK packages (Go, Node.js, Ruby) are in progress, built on top of the C ABI.

Documentation

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.