A PDF text extraction library that gets the hard parts right.

jedarden 4f651ca9b8 feat(bf-1vv5n): add Roboto font fingerprint entries to font-fingerprints.json	2026-06-08 20:31:30 -04:00
- Add SHA-256 hash of Roboto-Regular.ttf (56a45233d29f11b4dfb86d248e921939d115778f87325e7ae8cc108383d6664d)
- Map glyph IDs 1-95 to ASCII codepoints 32-126 (space through tilde)
- Enables Level 3 Unicode recovery via font fingerprint matching
- Verify: cargo build -p pdftract-core passes, checksum verified
Closes bf-1vv5n.
.cargo	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata	2026-05-24 17:31:16 -04:00
.ci	feat(pdftract-5lvpu): add Swift SDK publish Argo workflow	2026-06-01 10:47:20 -04:00
.claude/worktrees	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
.config	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
.git-hooks	fix(pdftract-5z5d8): add pre-commit hook for provenance validation	2026-05-17 23:50:28 -04:00
.github	ci: remove GitHub Actions workflow (Argo Workflows on iad-ci only)	2026-05-28 08:48:06 -04:00
.marathon	fix(marathon): forbid ad-hoc bare cargo test, mandate nextest filters	2026-05-25 19:45:42 -04:00
benches	fix(pdftract-60h): fix bugs in benchmark runner script	2026-05-18 01:29:41 -04:00
build	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
ci	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
crates	feat(bf-1vv5n): add Roboto font fingerprint entries to font-fingerprints.json	2026-06-08 20:31:30 -04:00
distribution	feat(pdftract-1eaxm): implement libpdftract C FFI library	2026-05-23 08:55:12 -04:00
docs	docs(pdftract-1j0f8): update CLI reference generation command	2026-06-08 17:08:24 -04:00
examples	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
fuzz	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
notes	docs(pdftract-26v): add epic verification note	2026-06-08 20:19:57 -04:00
pdftract-dotnet	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
pdftract-go	fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup	2026-05-20 19:08:14 -04:00
pdftract-java	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-node	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-php	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
pdftract-ruby	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
pdftract-swift	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
profiles/builtin	feat(profiles): add profile infrastructure and initial fixtures	2026-05-31 15:10:51 -04:00
proptest-regressions	feat(pdftract-33v): implement property tests and nightly fuzz job	2026-05-22 23:13:13 -04:00
scratch	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
scripts	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
sdk/php	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
src	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
swift-sdk	docs(pdftract-5lvpu): verify Swift SDK implementation for v1.1+ release	2026-06-01 13:40:03 -04:00
templates/sdk-skeleton	fix(pdftract-5lvpu): add lc_first filter to Swift method names for proper naming	2026-06-01 11:44:14 -04:00
tests	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
tools	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
xtask	fix(pdftract-1j0f8): prevent newline accumulation in CLI reference generator	2026-06-08 16:00:28 -04:00
--1.ppm	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
.gitignore	feat(pdftract-juc): implement Standard 14 font metrics registry	2026-05-23 14:04:02 -04:00
.needle-predispatch-sha	docs(pdftract-340): add SDK Architecture epic verification note	2026-06-08 15:33:18 -04:00
.nextest.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
.renovaterc.json	docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule	2026-05-20 18:22:03 -04:00
0	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
assess_doc_coverage.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
audit.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
audit_docs.py	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
cargo-deny.toml	feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06	2026-05-31 16:53:31 -04:00
Cargo-dist.toml	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
Cargo.lock	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
Cargo.toml	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing	2026-06-01 10:27:03 -04:00
CHANGELOG.md	feat(pdftract-2w02): implement MSRV gate with CI check	2026-05-20 19:03:53 -04:00
check_content.py	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
check_doc_coverage.sh	fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs	2026-06-01 04:14:05 -04:00
check_docs.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
check_examples.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
CLAUDE.md	docs(bf-9d8a5): update CLAUDE.md - bf close --reason now works	2026-06-01 08:12:26 -04:00
clippy.toml	feat(pdftract-xzfkt): implement caption block classifier	2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md	docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates	2026-05-24 13:06:57 -04:00
conformance_test	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
CONTRIBUTING.md	docs(pdftract-1e5ud): add SDK conformance test documentation	2026-05-31 23:54:14 -04:00
Cross.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
debug_content_streams.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_content_diff.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_detailed.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_example.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_hash.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
debug_fingerprint_test.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
debug_fixtures.py	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
debug_fixtures.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
debug_parse_simple	fix(pdftract-5t92): fix choice value extraction test failures	2026-05-31 14:00:59 -04:00
debug_trailer.rs	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
deny.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
Dockerfile	feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support	2026-05-20 19:17:49 -04:00
fix_fixtures.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
gen_fixtures	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
generate_expected_json.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
libstdin.rlib	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
LICENSE-APACHE	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
LICENSE-MIT	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
measure_doc_coverage.sh	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
mod
out.pdf	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
pdftract-test-merged.cdx.json	feat(pdftract-67tm8): implement MCP stdio transport with integration tests	2026-05-23 00:16:42 -04:00
README.md	docs(pdftract-5gld): update README with MSRV and enhanced documentation links	2026-06-08 20:00:38 -04:00
SECURITY.md	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
test_api_null.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_audit_debug.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_audit_integration.rs	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
test_bomb_debug.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
test_classifier_corpus	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_debug_pdf.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_debug_serialization.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_empty	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_empty.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_extract.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_fingerprint_debug.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
test_fingerprint_debug_content.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_fixture_debug.py	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
test_flate.rs	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
test_page_class	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_parse_simple	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_pdf	feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback	2026-05-23 20:53:25 -04:00
test_stream_decode.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer_debug	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_debug.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_debug2	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_debug2.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_key.rs	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
test_trailer_parse	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parse.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parse2	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parse2.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
test_trailer_parsing.rs	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
tmp_fixtures.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00

README.md

pdftract

A PDF text extraction library that gets the hard parts right.

Platform Support

Platform	Status
Linux x86_64	Fully CI-tested (gating CI on every PR)
Linux aarch64	Fully CI-tested
macOS x86_64	Build-tested; manually smoke-tested per release
macOS aarch64	Build-tested; manually smoke-tested per release
Windows x86_64	Build-tested; manually smoke-tested per release

Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

Minimum Supported Rust Version (MSRV): 1.78

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

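use pdftract_core::{extract_pdf, ExtractionOptions};

// Assumes pdftract's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}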

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
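
To make the layered font encoding recovery above more concrete, here is a minimal sketch of a fallback chain in Rust. It is not pdftract's implementation: the `Font`, `Layer`, and `recover` names are hypothetical, the glyph-name table is a three-entry stand-in, and the fingerprint layer simply assumes that a font with a known hash maps glyph IDs 1-95 to ASCII 32-126, as the Roboto fingerprint commit in the listing describes. The sketch only handles absent or incomplete ToUnicode entries (it does not detect wrong ones), and glyph outline shape matching is left out.

```rust
use std::collections::HashMap;

/// Which recovery layer produced a character (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    ToUnicode,
    GlyphName,
    Fingerprint,
}

/// Minimal stand-in for a PDF font; pdftract's real types will differ.
pub struct Font {
    /// ToUnicode CMap entries, possibly empty or incomplete.
    pub to_unicode: HashMap<u16, char>,
    /// Glyph names from the font's encoding, e.g. "A" or "space".
    pub glyph_names: HashMap<u16, &'static str>,
    /// Hash of the embedded font program, if one is embedded.
    pub program_hash: Option<&'static str>,
}

/// Glyph-name layer: a tiny illustrative table, not a full glyph list.
fn from_glyph_name(name: &str) -> Option<char> {
    match name {
        "space" => Some(' '),
        "A" => Some('A'),
        "a" => Some('a'),
        _ => None,
    }
}

/// Fingerprint layer: if the font hash is known, assume glyph IDs
/// 1..=95 correspond to ASCII 32..=126 (space through tilde).
fn from_fingerprint(known_hashes: &[&str], font: &Font, gid: u16) -> Option<char> {
    let hash = font.program_hash?;
    if known_hashes.contains(&hash) && (1..=95).contains(&gid) {
        char::from_u32(u32::from(gid) + 31)
    } else {
        None
    }
}

/// Try each layer in order and report which one succeeded.
pub fn recover(known_hashes: &[&str], font: &Font, gid: u16) -> Option<(char, Layer)> {
    font.to_unicode
        .get(&gid)
        .map(|&c| (c, Layer::ToUnicode))
        .or_else(|| {
            font.glyph_names
                .get(&gid)
                .and_then(|name| from_glyph_name(name))
                .map(|c| (c, Layer::GlyphName))
        })
        .or_else(|| from_fingerprint(known_hashes, font, gid).map(|c| (c, Layer::Fingerprint)))
}
```

Returning the layer alongside the character is one way a caller could attach a per-span confidence score, since a later layer is a weaker source of evidence than an explicit ToUnicode entry.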

Documentation

User docs: docs/user-docs (mdBook) — Comprehensive user guide at pdftract.com
API reference: docs.rs/pdftract — Rust API documentation
Extraction output schema: docs/research/extraction-output-schema.md
SDK architecture: docs/notes/sdk-architecture.md
Platform smoke procedure: docs/operations/manual-platform-smoke.md
Releases: GitHub Releases
crates.io: pdftract
Contributing guide: CONTRIBUTING.md
Security policy: SECURITY.md
Changelog: CHANGELOG.md
License: LICENSE-MIT or LICENSE-APACHE

License

Licensed under either of:

MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)

at your option.