A PDF text extraction library that gets the hard parts right.


jedarden f731ffee4a docs: improve README for clarity and discoverability Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-06-24 15:50:54 -04:00
.cargo	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata	2026-05-24 17:31:16 -04:00
.ci	feat(pdftract-5lvpu): add Swift SDK publish Argo workflow	2026-06-01 10:47:20 -04:00
.claude/worktrees	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
.config	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
.git-hooks	fix(pdftract-5z5d8): add pre-commit hook for provenance validation	2026-05-17 23:50:28 -04:00
.github	ci: remove GitHub Actions workflow (Argo Workflows on iad-ci only)	2026-05-28 08:48:06 -04:00
.marathon	fix(marathon): forbid ad-hoc bare cargo test, mandate nextest filters	2026-05-25 19:45:42 -04:00
benches	fix(pdftract-60h): fix bugs in benchmark runner script	2026-05-18 01:29:41 -04:00
build	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
ci	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
crates	feat(bf-1vv5n): add Roboto font fingerprint entries to font-fingerprints.json	2026-06-08 20:31:30 -04:00
distribution	feat(pdftract-1eaxm): implement libpdftract C FFI library	2026-05-23 08:55:12 -04:00
docs	docs(pdftract-1j0f8): update CLI reference generation command	2026-06-08 17:08:24 -04:00
examples	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
fuzz	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
notes	docs(bf-3ourh): verify CJK fixtures exist and document in PROVENANCE	2026-06-24 12:35:47 -04:00
pdftract-dotnet	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
pdftract-go	fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup	2026-05-20 19:08:14 -04:00
pdftract-java	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-node	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-php	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
pdftract-ruby	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
pdftract-swift	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
profiles/builtin	feat(profiles): add profile infrastructure and initial fixtures	2026-05-31 15:10:51 -04:00
proptest-regressions	feat(pdftract-33v): implement property tests and nightly fuzz job	2026-05-22 23:13:13 -04:00
scratch	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
scripts	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
sdk/php	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
src	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
swift-sdk	docs(pdftract-5lvpu): verify Swift SDK implementation for v1.1+ release	2026-06-01 13:40:03 -04:00
templates/sdk-skeleton	fix(pdftract-5lvpu): add lc_first filter to Swift method names for proper naming	2026-06-01 11:44:14 -04:00
tests	docs: improve README for clarity and discoverability	2026-06-24 15:50:54 -04:00
tools	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
xtask	fix(pdftract-1j0f8): prevent newline accumulation in CLI reference generator	2026-06-08 16:00:28 -04:00
--1.ppm	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
.gitignore	feat(pdftract-juc): implement Standard 14 font metrics registry	2026-05-23 14:04:02 -04:00
.needle-predispatch-sha	docs(pdftract-340): add SDK Architecture epic verification note	2026-06-08 15:33:18 -04:00
.nextest.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
.renovaterc.json	docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule	2026-05-20 18:22:03 -04:00
0	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
assess_doc_coverage.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
audit.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
audit_docs.py	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
cargo-deny.toml	feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06	2026-05-31 16:53:31 -04:00
Cargo-dist.toml	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
Cargo.lock	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
Cargo.toml	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
CHANGELOG.md	feat(pdftract-2w02): implement MSRV gate with CI check	2026-05-20 19:03:53 -04:00
check_content.py	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
check_doc_coverage.sh	fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs	2026-06-01 04:14:05 -04:00
check_docs.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
check_examples.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
CLAUDE.md	docs(bf-9d8a5): update CLAUDE.md - bf close --reason now works	2026-06-01 08:12:26 -04:00
clippy.toml	feat(pdftract-xzfkt): implement caption block classifier	2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md	docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates	2026-05-24 13:06:57 -04:00
conformance_test	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
CONTRIBUTING.md	docs(pdftract-1e5ud): add SDK conformance test documentation	2026-05-31 23:54:14 -04:00
Cross.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
debug_content_streams.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_content_diff.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_detailed.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_example.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_hash.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_test.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
debug_fixtures.py	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
debug_fixtures.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
debug_parse_simple	fix(pdftract-5t92): fix choice value extraction test failures	2026-05-31 14:00:59 -04:00
debug_trailer.rs	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
deny.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
Dockerfile	feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support	2026-05-20 19:17:49 -04:00
fix_fixtures.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
gen_fixtures	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
generate_expected_json.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
libstdin.rlib	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
LICENSE-APACHE	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
LICENSE-MIT	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
measure_doc_coverage.sh	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
mod	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
out.pdf	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
pdftract-test-merged.cdx.json	feat(pdftract-67tm8): implement MCP stdio transport with integration tests	2026-05-23 00:16:42 -04:00
README.md	docs(bf-3ourh): verify CJK fixtures exist and document in PROVENANCE	2026-06-24 12:35:47 -04:00
SECURITY.md	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
test_api_null.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_audit_debug.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_audit_integration.rs	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
test_bomb_debug.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
test_classifier_corpus	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_debug_pdf.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_debug_serialization.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_empty	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_empty.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_extract.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_fingerprint_debug.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
test_fingerprint_debug_content.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_fixture_debug.py	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
test_flate.rs	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
test_page_class	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_parse_simple	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_pdf	feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback	2026-05-23 20:53:25 -04:00
test_stream_decode.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer_debug	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_debug.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_debug2	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_debug2.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_key.rs	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
test_trailer_parse	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parse.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parse2	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parse2.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parsing.rs	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
tmp_fixtures.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00

README.md

pdftract

pdftract is a pure-Rust PDF text extraction library built for the cases where other tools give up: scanned documents, unusual font encodings, multi-column layouts, footnotes, mixed-mode pages, and encrypted files. Where most extractors treat PDF text extraction as a coordinate sort, pdftract runs a full reading-order pipeline — segmenting layout regions, recovering broken font encodings, routing each page to the right extraction mode (vector, OCR, or hybrid), and emitting structured JSON with per-span provenance. If your PDFs are academic papers, legal filings, financial reports, or anything else that wasn't typeset in a word processor, pdftract is what you want.

How it compares

Capability	pdftract	pdfplumber	pypdf	pdfminer
Multi-column reading order	✅ Full layout segmentation	⚠ Heuristic	❌	⚠ Partial
Footnotes & sidebars	✅	❌	❌	❌
Font encoding recovery	✅ Glyph name → fingerprint → shape	⚠ ToUnicode only	⚠ ToUnicode only	⚠ ToUnicode only
Scanned / mixed PDF (OCR)	✅ Per-page hybrid routing	❌	❌	❌
PDF/UA structure tree	✅	❌	⚠ Partial	❌
PDF decryption (RC4/AES)	✅ (`decrypt` feature)	⚠ Partial	⚠ Partial	⚠ Partial
Per-span bounding boxes + confidence	✅	✅	❌	⚠ Partial
Streaming extraction (large files)	✅	❌	❌	❌
CJK scripts	✅ (`cjk` feature)	⚠	⚠	⚠
HTTP microservice mode	✅ (`serve`)	❌	❌	❌
Language	Rust + Python + C ABI	Python	Python	Python

Platform Support

Platform	Status
Linux x86_64	Fully CI-tested on every PR
Linux aarch64	Fully CI-tested on every PR
macOS x86_64	Build-tested; manually smoke-tested per release
macOS aarch64	Build-tested; manually smoke-tested per release
Windows x86_64	Build-tested; manually smoke-tested per release

See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

Minimum Supported Rust Version (MSRV): 1.78

Cargo

cargo add pdftract-core

Or install the CLI:

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

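use pdftract_core::{extract_pdf, ExtractionOptions};

// `?` needs an enclosing function that returns a Result.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("report.pdf", &opts)?;

    // Report how many text spans were extracted from each page.
    for page in &doc.pages {
        println!("Page {}: {} spans", page.number, page.spans.len());
    }
    Ok(())
}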

Streaming extraction for large files:

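use pdftract_core::{extract_pdf_streaming, ExtractionOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    // Each item is a Result, so a failure on one page surfaces here.
    for page in extract_pdf_streaming("large.pdf", &opts)? {
        let page = page?;
        process(page); // `process` stands in for your own per-page handler
    }
    Ok(())
}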

NDJSON output (one JSON object per page on stdout), with `opts` built as above:

pdftract_core::extract_pdf_ndjson("report.pdf", &opts, std::io::stdout())?;

Python

import pdftract

doc = pdftract.extract("report.pdf")
print(f"{doc['metadata']['page_count']} pages")

for page in doc["pages"]:
    for span in page["spans"]:
        print(span["text"], span["bbox"], span["confidence"])

CLI

# Extract to JSON
pdftract extract report.pdf --json output.json

# Plain text to stdout
pdftract extract report.pdf --text -

# Markdown output
pdftract extract report.pdf --markdown -

# Run as an HTTP microservice (POST /extract, GET /health)
pdftract serve --port 8080

# Compare two PDFs structurally
pdftract compare original.pdf revised.pdf

# Interactive page inspector
pdftract inspect report.pdf --page 3

# Diagnose extraction problems on a file
pdftract doctor report.pdf

# Validate PDF/UA or PDF/A conformance
pdftract validate report.pdf

# Stable content hash (for dedup / cache keys)
pdftract hash report.pdf

# Search for a pattern across pages
pdftract grep "invoice number" report.pdf

# Print page count and dimensions
pdftract pages report.pdf

# Classify each page (vector / scanned / mixed)
pdftract classify report.pdf

# Manage the local extraction cache
pdftract cache --list
pdftract cache --clear

# Migrate the local cache schema
pdftract migrate

# Verify a previously issued extraction receipt
pdftract verify-receipt receipt.json

# Generate client bindings from the C ABI headers
pdftract codegen --lang python

# Start the MCP (Model Context Protocol) server
pdftract mcp

Features

Core vector-text extraction works out of the box. Optional Cargo features enable capabilities that pull in heavier dependencies:

Feature	What it adds	Enable with
`ocr`	Tesseract/Leptonica OCR for scanned and mixed pages	`cargo add pdftract-core --features ocr`
`decrypt`	RC4, AES-128, AES-256 PDF decryption	`cargo add pdftract-core --features decrypt`
`cjk`	CJK script support (Chinese, Japanese, Korean)	`cargo add pdftract-core --features cjk`
`full-render`	Full-page rasterization for assisted OCR and inspect UI	`cargo add pdftract-core --features full-render`

In the Python wheel and Docker image, `ocr`, `decrypt`, and `cjk` are pre-enabled.

What it does

Correct reading order. Most PDF extractors sort glyphs by Y then X coordinate. That breaks on multi-column articles, legal documents with sidebars, academic papers with footnotes, and anything typeset in a non-linear flow. pdftract segments each page into layout regions first, orders the regions, then emits text within each region — so the output reads the way a human would.
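
To make the difference concrete, here is a toy Python sketch, not pdftract's actual segmentation algorithm. It assumes a two-column page with a known split position; the `Span` class, `naive_order`, `region_order`, and the `column_split` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x: float  # left edge of the span
    y: float  # distance from the top of the page (smaller is higher)


def naive_order(spans):
    """Sort by Y then X: interleaves lines from side-by-side columns."""
    return [s.text for s in sorted(spans, key=lambda s: (s.y, s.x))]


def region_order(spans, column_split):
    """Toy region-first ordering: bucket spans into columns, then read each column top to bottom."""
    left = [s for s in spans if s.x < column_split]
    right = [s for s in spans if s.x >= column_split]
    ordered = sorted(left, key=lambda s: s.y) + sorted(right, key=lambda s: s.y)
    return [s.text for s in ordered]


# Two lines in each of two columns, at matching heights.
spans = [
    Span("L1", 50, 100), Span("R1", 320, 100),
    Span("L2", 50, 120), Span("R2", 320, 120),
]
print(naive_order(spans))                      # ['L1', 'R1', 'L2', 'R2']
print(region_order(spans, column_split=300))   # ['L1', 'L2', 'R1', 'R2']
```

The coordinate sort reads across both columns line by line, while the region-first version finishes the left column before starting the right one. Real layout segmentation has to discover the regions rather than being told where the split is.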

Font encoding recovery. PDFs can legally omit ToUnicode CMaps and describe only glyph IDs. When that happens, other extractors emit garbage or question marks. pdftract works through a layered recovery pipeline: glyph name lookup (standard and Adobe glyph lists), font fingerprinting against a known-font database, and finally glyph outline shape matching. Most documents that trip up other tools extract cleanly.
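
The layered fallback described above might be sketched as follows. This is an illustration of the ordering only, under stated assumptions: the lookup tables, the `match_by_shape` helper, and the `recover_char` signature are hypothetical stand-ins, not pdftract's internals.

```python
# Hypothetical two-entry tables; a real pipeline would consult the Adobe
# Glyph List and a font database rather than these stand-ins.
GLYPH_NAMES = {"A": "A", "eacute": "é"}
FINGERPRINT_DB = {("font-abc", 17): "ß"}


def match_by_shape(outline):
    """Stand-in for outline shape matching; returns None when nothing matches."""
    return {"circle": "o"}.get(outline)


def recover_char(glyph_name, font_fingerprint, glyph_id, outline):
    """Try each recovery layer in order and report which one succeeded."""
    if glyph_name in GLYPH_NAMES:
        return GLYPH_NAMES[glyph_name], "glyph-name"
    if (font_fingerprint, glyph_id) in FINGERPRINT_DB:
        return FINGERPRINT_DB[(font_fingerprint, glyph_id)], "fingerprint"
    shape = match_by_shape(outline)
    if shape is not None:
        return shape, "shape"
    # Nothing matched: emit the Unicode replacement character.
    return "\ufffd", "unresolved"


print(recover_char("eacute", None, 0, None))        # ('é', 'glyph-name')
print(recover_char("g42", "font-abc", 17, None))    # ('ß', 'fingerprint')
print(recover_char("g42", "unknown", 3, "circle"))  # ('o', 'shape')
```

Returning the name of the layer that succeeded alongside the character is one way such a pipeline could feed a per-span confidence score, since later layers are less certain than earlier ones.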

Per-page hybrid routing. Each page is independently classified as vector text, fully scanned (image-only), or mixed. Vector pages go through the fast extraction path. Scanned pages go to full OCR. Mixed pages use assisted OCR — vector spans anchor the OCR so it doesn't drift. This means one call handles an entire document regardless of how it was produced.
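
As a rough sketch of per-page routing, assume two signals are available for each page: the number of vector glyphs and the fraction of the page covered by raster images. Both signals, the 0.5 threshold, and the function names below are assumptions made for illustration; the real classifier may use different inputs.

```python
def classify_page(vector_glyph_count, image_coverage):
    """Classify a page from two hypothetical signals.

    image_coverage is the fraction of page area covered by raster
    images, from 0.0 to 1.0.
    """
    has_text = vector_glyph_count > 0
    mostly_image = image_coverage >= 0.5  # illustrative threshold
    if has_text and mostly_image:
        return "mixed"
    if mostly_image:
        return "scanned"
    # Vector text, or a blank page with nothing to OCR.
    return "vector"


# Extraction path for each page class, as described in the text above.
ROUTES = {
    "vector": "vector extraction",
    "scanned": "full OCR",
    "mixed": "assisted OCR",
}


def route_document(pages):
    """Map each (glyph_count, image_coverage) pair to an extraction path."""
    return [ROUTES[classify_page(glyphs, coverage)] for glyphs, coverage in pages]


pages = [(1200, 0.0), (0, 0.95), (300, 0.8)]
print(route_document(pages))
# ['vector extraction', 'full OCR', 'assisted OCR']
```

Because the decision is made per page, a document that mixes born-digital pages with scanned inserts only pays the OCR cost on the pages that need it.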

Structure tree extraction. PDF/UA files, and PDF/A files at conformance level A (for example PDF/A-1a or PDF/A-2a), carry a logical structure tree (headings, paragraphs, tables, lists) separate from the visual rendering. pdftract reads this directly when present, so tagged, accessible PDFs yield structured output without heuristics.
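
To show why a structure tree removes the need for heuristics, the sketch below walks a small hand-written tree and collects a heading outline. The node shape (`type`, `text`, `children`) is a hypothetical simplification, not pdftract's output schema; the type names `H1` to `H6`, `P`, and `Sect` follow the standard PDF structure types.

```python
HEADING_TYPES = {"H1", "H2", "H3", "H4", "H5", "H6"}


def outline(node):
    """Collect heading text from a nested structure tree, depth-first."""
    lines = []
    if node["type"] in HEADING_TYPES:
        level = int(node["type"][1])
        # Indent two spaces per heading level below H1.
        lines.append("  " * (level - 1) + node["text"])
    for child in node.get("children", []):
        lines.extend(outline(child))
    return lines


# Hand-written sample tree; the content is made up.
tree = {
    "type": "Document",
    "children": [
        {"type": "H1", "text": "Report"},
        {"type": "P", "text": "Intro paragraph."},
        {"type": "Sect", "children": [
            {"type": "H2", "text": "Methods"},
            {"type": "P", "text": "Details."},
        ]},
    ],
}
print(outline(tree))  # ['Report', '  Methods']
```

The heading levels come straight from the tags, so nothing has to be inferred from font size or position.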

Structured output with provenance. The primary output format is JSON. Every text span carries its bounding box, font name, point size, and a confidence score. This makes pdftract suitable as a preprocessing step for LLM pipelines, document indexing, and data extraction workflows that need to trace output back to the source page.
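
One way to use per-span confidence downstream is to flag uncertain spans for review. The sketch below reuses the dictionary shape from the Python quickstart (`doc["pages"]`, `page["spans"]`, and the `text`, `bbox`, `confidence` keys). The sample data is hand-written, and the four-number `bbox` layout, the 0 to 1 confidence range, and the `low_confidence_spans` helper are assumptions; see the extraction output schema for the real definitions.

```python
def low_confidence_spans(doc, threshold=0.8):
    """Yield (page_index, span) for spans whose confidence is below threshold."""
    for page_index, page in enumerate(doc["pages"]):
        for span in page["spans"]:
            if span["confidence"] < threshold:
                yield page_index, span


# Hand-written sample shaped like the quickstart output; values are made up.
doc = {
    "metadata": {"page_count": 1},
    "pages": [
        {"spans": [
            {"text": "Total due", "bbox": [72, 700, 140, 712], "confidence": 0.99},
            {"text": "1O0.00", "bbox": [400, 700, 440, 712], "confidence": 0.41},
        ]},
    ],
}

flagged = list(low_confidence_spans(doc))
for page_index, span in flagged:
    print(page_index, span["text"], span["bbox"])
```

Because each flagged span keeps its bounding box, a reviewer or a second-pass model can be pointed at the exact region of the source page.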

Streaming extraction. For large files, `extract_pdf_streaming` yields one page at a time, so memory usage stays bounded regardless of document length.
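
The same page-at-a-time idea applies on the consuming side when reading NDJSON output. A minimal sketch, assuming each line is one JSON object per page: the `number` and `spans` keys mirror the field names in the Rust quickstart but are not confirmed for the NDJSON format, and `iter_pages` is a hypothetical helper.

```python
import io
import json


def iter_pages(stream):
    """Yield one parsed page per NDJSON line without loading the whole stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# In-memory stand-in for a pipe or file of NDJSON output.
sample = io.StringIO(
    '{"number": 1, "spans": [{"text": "Hello"}]}\n'
    '{"number": 2, "spans": []}\n'
)

total = 0
for page in iter_pages(sample):
    total += len(page["spans"])
print(total)  # 1
```

Since the generator parses one line at a time, only the current page is held in memory, which matches the bounded-memory behaviour described above.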

Available SDKs

pdftract ships multiple integration surfaces from a single Rust core:

SDK	Package	Notes
Rust library	`pdftract-core` on crates.io	Primary API
CLI binary	`pdftract` on crates.io	Wraps the library
Python bindings	`pdftract` on PyPI	PyO3-based, wheels for Linux/macOS/Windows
C shared library	`libpdftract`	Stable C ABI; use `pdftract codegen` to generate FFI headers for your language
Docker image	`ronaldraygun/pdftract`	Includes `serve` mode HTTP microservice
HTTP microservice	`pdftract serve`	REST API for language-agnostic integration

Additional language SDK packages (Go, Node.js, Ruby) are in progress, built on top of the C ABI.

Documentation

User guide: pdftract.com
API reference: docs.rs/pdftract-core
Extraction output schema: docs/research/extraction-output-schema.md
SDK architecture: docs/notes/sdk-architecture.md
Changelog: CHANGELOG.md
Contributing: CONTRIBUTING.md
Security policy: SECURITY.md
Releases: GitHub Releases

License

Licensed under either of:

MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)

at your option.