A PDF text extraction library that gets the hard parts right.
Find a file
jedarden 144ab783aa docs(pdftract-145s8): update SDK docs with correct API
- Update SDK README.md from draft placeholder to proper content
- Fix rust.md examples to use correct SDK contract functions:
  - extract_pdf -> extract (SDK contract)
  - extract_pdf_streaming -> extract_stream (SDK contract)
  - Remove OutputOptions parameter (not in SDK API)
- Add proper type hints and Path::new for URLs
- Add sample.pdf fixture with provenance entry
- Verify mdBook renders correctly
- Verify cross-references work (MCP, JSON schema, CLI, OCR)
2026-05-31 23:43:05 -04:00
.cargo feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata 2026-05-24 17:31:16 -04:00
.ci fix(ci): fix bench-matrix DAG dep, image registry prefix, workspace members 2026-05-30 09:51:40 -04:00
.config fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
.git-hooks fix(pdftract-5z5d8): add pre-commit hook for provenance validation 2026-05-17 23:50:28 -04:00
.github ci: remove GitHub Actions workflow (Argo Workflows on iad-ci only) 2026-05-28 08:48:06 -04:00
.marathon fix(marathon): forbid ad-hoc bare cargo test, mandate nextest filters 2026-05-25 19:45:42 -04:00
benches fix(pdftract-60h): fix bugs in benchmark runner script 2026-05-18 01:29:41 -04:00
build feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06 2026-05-31 16:53:31 -04:00
ci feat(pdftract-48ea): implement BrokenVector fixtures + WER delta CI gate 2026-05-24 10:52:41 -04:00
crates feat(pdftract-2b7ff): implement image_coverage_fraction signal evaluator 2026-05-31 23:42:38 -04:00
distribution feat(pdftract-1eaxm): implement libpdftract C FFI library 2026-05-23 08:55:12 -04:00
docs feat(pdftract-2b7ff): implement image_coverage_fraction signal evaluator 2026-05-31 23:42:38 -04:00
examples wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
fuzz docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
notes feat(pdftract-2b7ff): implement image_coverage_fraction signal evaluator 2026-05-31 23:42:38 -04:00
pdftract-dotnet feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
pdftract-go fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup 2026-05-20 19:08:14 -04:00
pdftract-java feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
pdftract-node feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
profiles/builtin feat(profiles): add profile infrastructure and initial fixtures 2026-05-31 15:10:51 -04:00
proptest-regressions feat(pdftract-33v): implement property tests and nightly fuzz job 2026-05-22 23:13:13 -04:00
scripts wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
src feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
templates/sdk-skeleton docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
tests docs(pdftract-145s8): update SDK docs with correct API 2026-05-31 23:43:05 -04:00
tools wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
xtask wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
.gitignore feat(pdftract-juc): implement Standard 14 font metrics registry 2026-05-23 14:04:02 -04:00
.needle-predispatch-sha feat(profiles): add profile infrastructure and initial fixtures 2026-05-31 15:10:51 -04:00
.nextest.toml ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix 2026-05-23 11:37:19 -04:00
.renovaterc.json docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule 2026-05-20 18:22:03 -04:00
0 wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
assess_doc_coverage.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
audit.toml feat(pdftract-1xf4d): implement TH-06 supply-chain gate 2026-05-26 17:31:13 -04:00
audit_docs.py fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
cargo-deny.toml feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06 2026-05-31 16:53:31 -04:00
Cargo-dist.toml docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
Cargo.lock chore: update Cargo.lock 2026-05-31 15:51:34 -04:00
Cargo.toml fix(ci): fix bench-matrix DAG dep, image registry prefix, workspace members 2026-05-30 09:51:40 -04:00
CHANGELOG.md feat(pdftract-2w02): implement MSRV gate with CI check 2026-05-20 19:03:53 -04:00
check_docs.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
check_examples.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
CLAUDE.md docs: add active GitHub Actions deletion instruction to CI section 2026-05-28 06:57:14 -04:00
clippy.toml feat(pdftract-xzfkt): implement caption block classifier 2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates 2026-05-24 13:06:57 -04:00
conformance_test feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
CONTRIBUTING.md fix(pdftract-4pnmd): build.rs doc comment format string parsing 2026-05-28 14:36:45 -04:00
Cross.toml ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix 2026-05-23 11:37:19 -04:00
debug_fingerprint_test.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
debug_fixtures.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
debug_fixtures.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
debug_parse_simple fix(pdftract-5t92): fix choice value extraction test failures 2026-05-31 14:00:59 -04:00
debug_trailer.rs wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
deny.toml feat(pdftract-1xf4d): implement TH-06 supply-chain gate 2026-05-26 17:31:13 -04:00
Dockerfile feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support 2026-05-20 19:17:49 -04:00
fix_fixtures.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
generate_expected_json.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
libstdin.rlib fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
LICENSE-APACHE docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
LICENSE-MIT docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
mod feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
out.pdf feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
pdftract-test-merged.cdx.json feat(pdftract-67tm8): implement MCP stdio transport with integration tests 2026-05-23 00:16:42 -04:00
README.md docs(pdftract-4bpph): add README.md with KU-12 caveat, status badges, and quickstart 2026-05-28 08:11:08 -04:00
SECURITY.md docs(pdftract-58kz): add security policy documentation 2026-05-20 19:39:24 -04:00
test_api_null.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_audit_debug.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_audit_integration.rs fix(pdftract-2uk9z): wrap native module results in typed Python objects 2026-05-28 21:18:38 -04:00
test_bomb_debug.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
test_classifier_corpus fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
test_debug_pdf.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_empty feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_empty.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_extract.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_fingerprint_debug.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
test_fixture_debug.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
test_flate.rs docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
test_page_class fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
test_pdf feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback 2026-05-23 20:53:25 -04:00
test_stream_decode.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_trailer.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_trailer_key.rs wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
test_trailer_parsing.rs feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00

pdftract

crates.io docs.rs CI Status License

A PDF text extraction library that gets the hard parts right.

Platform Support

Platform Status
Linux x86_64 Fully CI-tested (gating CI on every PR)
Linux aarch64 Fully CI-tested
macOS x86_64 Build-tested; manually smoke-tested per release
macOS aarch64 Build-tested; manually smoke-tested per release
Windows x86_64 Build-tested; manually smoke-tested per release

Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

use pdftract_core::{extract_pdf, ExtractionOptions};

let opts = ExtractionOptions::default();
let doc = extract_pdf("file.pdf", &opts)?;
println!("Extracted {} pages", doc.metadata.page_count);

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice

What it does

  • Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
  • Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
  • Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
  • Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
  • Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score

Documentation

License

Licensed under either of:

at your option.