A PDF text extraction library that gets the hard parts right.

jedarden 1c6f26ecaa fix(bf-4mkhv): clean up unused imports in hash.rs		2026-06-01 09:43:48 -04:00

The bead description mentioned compile errors in hash.rs from API drift, but those errors were either already fixed or misattributed. The API usage was already correct:

- compute_fingerprint already takes 3 arguments with source
- len() already propagates Result with ?
- read_at method already used correctly
- Catalog fields accessed via trailer correctly

Only cleanup: removed unused std::fs::File and std::io imports.

Verification: notes/bf-4mkhv.md

.cargo	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata	2026-05-24 17:31:16 -04:00
.ci	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
.claude/worktrees	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
.config	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
.git-hooks	fix(pdftract-5z5d8): add pre-commit hook for provenance validation	2026-05-17 23:50:28 -04:00
.github	ci: remove GitHub Actions workflow (Argo Workflows on iad-ci only)	2026-05-28 08:48:06 -04:00
.marathon	fix(marathon): forbid ad-hoc bare cargo test, mandate nextest filters	2026-05-25 19:45:42 -04:00
benches	fix(pdftract-60h): fix bugs in benchmark runner script	2026-05-18 01:29:41 -04:00
build	feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06	2026-05-31 16:53:31 -04:00
ci	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
crates	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
distribution	feat(pdftract-1eaxm): implement libpdftract C FFI library	2026-05-23 08:55:12 -04:00
docs	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
examples	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
fuzz	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
notes	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
pdftract-dotnet	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
pdftract-go	fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup	2026-05-20 19:08:14 -04:00
pdftract-java	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-node	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
profiles/builtin	feat(profiles): add profile infrastructure and initial fixtures	2026-05-31 15:10:51 -04:00
proptest-regressions	feat(pdftract-33v): implement property tests and nightly fuzz job	2026-05-22 23:13:13 -04:00
scripts	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
src	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
templates/sdk-skeleton	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
tests	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
tools	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
xtask	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
--1.ppm	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
.gitignore	feat(pdftract-juc): implement Standard 14 font metrics registry	2026-05-23 14:04:02 -04:00
.needle-predispatch-sha	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
.nextest.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
.renovaterc.json	docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule	2026-05-20 18:22:03 -04:00
0	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
assess_doc_coverage.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
audit.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
audit_docs.py	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
cargo-deny.toml	feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06	2026-05-31 16:53:31 -04:00
Cargo-dist.toml	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
Cargo.lock	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
Cargo.toml	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
CHANGELOG.md	feat(pdftract-2w02): implement MSRV gate with CI check	2026-05-20 19:03:53 -04:00
check_content.py	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
check_doc_coverage.sh	fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs	2026-06-01 04:14:05 -04:00
check_docs.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
check_examples.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
CLAUDE.md	docs(bf-9d8a5): update CLAUDE.md - bf close --reason now works	2026-06-01 08:12:26 -04:00
clippy.toml	feat(pdftract-xzfkt): implement caption block classifier	2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md	docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates	2026-05-24 13:06:57 -04:00
conformance_test	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
CONTRIBUTING.md	docs(pdftract-1e5ud): add SDK conformance test documentation	2026-05-31 23:54:14 -04:00
Cross.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
debug_fingerprint_test.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
debug_fixtures.py	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
debug_fixtures.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
debug_parse_simple	fix(pdftract-5t92): fix choice value extraction test failures	2026-05-31 14:00:59 -04:00
debug_trailer.rs	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
deny.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
Dockerfile	feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support	2026-05-20 19:17:49 -04:00
fix_fixtures.py	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
generate_expected_json.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
libstdin.rlib	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
LICENSE-APACHE	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
LICENSE-MIT	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
measure_doc_coverage.sh	fix(bf-4mkhv): clean up unused imports in hash.rs	2026-06-01 09:43:48 -04:00
mod	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
out.pdf	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
pdftract-test-merged.cdx.json	feat(pdftract-67tm8): implement MCP stdio transport with integration tests	2026-05-23 00:16:42 -04:00
README.md	docs(pdftract-4bpph): add README.md with KU-12 caveat, status badges, and quickstart	2026-05-28 08:11:08 -04:00
SECURITY.md	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
test_api_null.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_audit_debug.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_audit_integration.rs	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
test_bomb_debug.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
test_classifier_corpus	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_debug_pdf.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_empty	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_empty.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_extract.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_fingerprint_debug.rs	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates	2026-05-30 09:48:14 -04:00
test_fixture_debug.py	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
test_flate.rs	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
test_page_class	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_pdf	feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback	2026-05-23 20:53:25 -04:00
test_stream_decode.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer_key.rs	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
test_trailer_parsing.rs	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00

README.md

pdftract

A PDF text extraction library that gets the hard parts right.

Platform Support

Platform	Status
Linux x86_64	Fully CI-tested (gating CI on every PR)
Linux aarch64	Fully CI-tested
macOS x86_64	Build-tested; manually smoke-tested per release
macOS aarch64	Build-tested; manually smoke-tested per release
Windows x86_64	Build-tested; manually smoke-tested per release

Note: see docs/operations/manual-platform-smoke.md for the per-release smoke procedure used on macOS and Windows.

Installation

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

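use pdftract_core::{extract_pdf, ExtractionOptions};

// `?` needs a function that returns Result; Box<dyn Error> assumes the
// library's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}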

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice
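
For scripted use, the JSON command above can be driven from Python. The following is a minimal sketch, assuming the pdftract binary is on PATH, that it exits non-zero on failure, and that the file written by --json is UTF-8 encoded JSON. The helper names build_extract_command and extract_to_dict are hypothetical and not part of pdftract.

```python
import json
import subprocess
from pathlib import Path


def build_extract_command(pdf_path, json_path):
    """Build the argv list for `pdftract extract <pdf> --json <out>`."""
    return ["pdftract", "extract", str(pdf_path), "--json", str(json_path)]


def extract_to_dict(pdf_path, json_path="result.json"):
    """Run the CLI and load the JSON file it wrote.

    Assumption: a non-zero exit code signals failure, so check=True
    turns it into a CalledProcessError.
    """
    subprocess.run(build_extract_command(pdf_path, json_path), check=True)
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Only prints the command; call extract_to_dict("file.pdf") to run it.
    print(" ".join(build_extract_command("file.pdf", "result.json")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.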

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
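
The layered recovery described in the font encoding item can be pictured as a chain of fallbacks tried in order, where the first layer that produces a result wins. This is a toy sketch of that idea, not pdftract's implementation: the lookup tables, the glyph record fields (name, font_hash, code, outline), and every function name are invented for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical lookup tables standing in for real recovery data.
GLYPH_NAMES = {"A": "A", "space": " ", "fi": "fi"}
KNOWN_FONTS = {"abc123": {65: "A", 66: "B"}}
SHAPES = {"M0 0L10 0": "-"}


def by_glyph_name(glyph: dict) -> Optional[str]:
    """Layer 1: map a glyph name such as 'fi' to text."""
    return GLYPH_NAMES.get(glyph.get("name", ""))


def by_font_fingerprint(glyph: dict) -> Optional[str]:
    """Layer 2: if the font is recognized, use its known code table."""
    table = KNOWN_FONTS.get(glyph.get("font_hash"))
    return table.get(glyph.get("code")) if table else None


def by_outline_shape(glyph: dict) -> Optional[str]:
    """Layer 3: match the glyph outline against known shapes."""
    return SHAPES.get(glyph.get("outline"))


LAYERS: List[Tuple[str, Callable[[dict], Optional[str]]]] = [
    ("glyph_name", by_glyph_name),
    ("font_fingerprint", by_font_fingerprint),
    ("outline_shape", by_outline_shape),
]


def recover(glyph: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, layer_name) from the first layer that succeeds."""
    for name, layer in LAYERS:
        text = layer(glyph)
        if text is not None:
            return text, name
    return None, None
```

Returning the name of the layer alongside the text is one way a caller could attach a lower confidence to results that came from a later, less reliable layer.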

Documentation

User docs: docs/user-docs (mdBook)
API reference: docs.rs/pdftract
Contributing guide: CONTRIBUTING.md
Security policy: SECURITY.md
Changelog: CHANGELOG.md
License: LICENSE-MIT or LICENSE-APACHE

License

Licensed under either of:

MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)

at your option.