A PDF text extraction library that gets the hard parts right.
jedarden 1c6f26ecaa fix(bf-4mkhv): clean up unused imports in hash.rs
The bead description mentioned compile errors in hash.rs from API drift,
but those errors were either already fixed or misattributed. The API usage
was already correct:
- compute_fingerprint already takes 3 arguments with source
- len() already propagates Result with ?
- read_at method already used correctly
- Catalog fields accessed via trailer correctly

Only cleanup: removed unused std::fs::File and std::io imports.

Verification: notes/bf-4mkhv.md
2026-06-01 09:43:48 -04:00
.cargo feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata 2026-05-24 17:31:16 -04:00
.ci fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
.claude/worktrees fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
.config fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
.git-hooks fix(pdftract-5z5d8): add pre-commit hook for provenance validation 2026-05-17 23:50:28 -04:00
.github ci: remove GitHub Actions workflow (Argo Workflows on iad-ci only) 2026-05-28 08:48:06 -04:00
.marathon fix(marathon): forbid ad-hoc bare cargo test, mandate nextest filters 2026-05-25 19:45:42 -04:00
benches fix(pdftract-60h): fix bugs in benchmark runner script 2026-05-18 01:29:41 -04:00
build feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06 2026-05-31 16:53:31 -04:00
ci fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
crates fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
distribution feat(pdftract-1eaxm): implement libpdftract C FFI library 2026-05-23 08:55:12 -04:00
docs fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
examples wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
fuzz docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
notes fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
pdftract-dotnet feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
pdftract-go fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup 2026-05-20 19:08:14 -04:00
pdftract-java feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
pdftract-node feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
profiles/builtin feat(profiles): add profile infrastructure and initial fixtures 2026-05-31 15:10:51 -04:00
proptest-regressions feat(pdftract-33v): implement property tests and nightly fuzz job 2026-05-22 23:13:13 -04:00
scripts fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
src feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
templates/sdk-skeleton docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
tests fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
tools wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
xtask fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
--1.ppm fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
.gitignore feat(pdftract-juc): implement Standard 14 font metrics registry 2026-05-23 14:04:02 -04:00
.needle-predispatch-sha fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
.nextest.toml ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix 2026-05-23 11:37:19 -04:00
.renovaterc.json docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule 2026-05-20 18:22:03 -04:00
0 wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
assess_doc_coverage.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
audit.toml feat(pdftract-1xf4d): implement TH-06 supply-chain gate 2026-05-26 17:31:13 -04:00
audit_docs.py fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
cargo-deny.toml feat(pdftract-e9lz): add cargo-deny.toml and build/CHECKSUMS.sha256 for TH-06 2026-05-31 16:53:31 -04:00
Cargo-dist.toml docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
Cargo.lock fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
Cargo.toml fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
CHANGELOG.md feat(pdftract-2w02): implement MSRV gate with CI check 2026-05-20 19:03:53 -04:00
check_content.py fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
check_doc_coverage.sh fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
check_docs.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
check_examples.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
CLAUDE.md docs(bf-9d8a5): update CLAUDE.md - bf close --reason now works 2026-06-01 08:12:26 -04:00
clippy.toml feat(pdftract-xzfkt): implement caption block classifier 2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates 2026-05-24 13:06:57 -04:00
conformance_test feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
CONTRIBUTING.md docs(pdftract-1e5ud): add SDK conformance test documentation 2026-05-31 23:54:14 -04:00
Cross.toml ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix 2026-05-23 11:37:19 -04:00
debug_fingerprint_test.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
debug_fixtures.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
debug_fixtures.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
debug_parse_simple fix(pdftract-5t92): fix choice value extraction test failures 2026-05-31 14:00:59 -04:00
debug_trailer.rs wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
deny.toml feat(pdftract-1xf4d): implement TH-06 supply-chain gate 2026-05-26 17:31:13 -04:00
Dockerfile feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support 2026-05-20 19:17:49 -04:00
fix_fixtures.py wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
generate_expected_json.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
libstdin.rlib fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
LICENSE-APACHE docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
LICENSE-MIT docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
measure_doc_coverage.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
mod feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
out.pdf feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
pdftract-test-merged.cdx.json feat(pdftract-67tm8): implement MCP stdio transport with integration tests 2026-05-23 00:16:42 -04:00
README.md docs(pdftract-4bpph): add README.md with KU-12 caveat, status badges, and quickstart 2026-05-28 08:11:08 -04:00
SECURITY.md docs(pdftract-58kz): add security policy documentation 2026-05-20 19:39:24 -04:00
test_api_null.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_audit_debug.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_audit_integration.rs fix(pdftract-2uk9z): wrap native module results in typed Python objects 2026-05-28 21:18:38 -04:00
test_bomb_debug.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
test_classifier_corpus fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
test_debug_pdf.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_empty feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_empty.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_extract.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_fingerprint_debug.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
test_fixture_debug.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
test_flate.rs docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
test_page_class fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
test_pdf feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback 2026-05-23 20:53:25 -04:00
test_stream_decode.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_trailer.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_trailer_key.rs wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
test_trailer_parsing.rs feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00

pdftract


A PDF text extraction library that gets the hard parts right.

Platform Support

Platform          Status
Linux x86_64      Fully CI-tested (gating CI on every PR)
Linux aarch64     Fully CI-tested
macOS x86_64      Build-tested; manually smoke-tested per release
macOS aarch64     Build-tested; manually smoke-tested per release
Windows x86_64    Build-tested; manually smoke-tested per release

Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

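use pdftract_core::{extract_pdf, ExtractionOptions};

// `?` needs a function that returns Result. Box<dyn Error> assumes the
// library's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}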

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice
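
The JSON file written by the first command can be post-processed with the Python standard library alone. The following is a minimal sketch, assuming the CLI's JSON carries the same `metadata.page_count` field that the Python quickstart reads; the helper name `page_count` is hypothetical and not part of pdftract.

```python
import json
from pathlib import Path


def page_count(result_path):
    """Return metadata.page_count from a pdftract JSON result file.

    Assumption: the file has the same top-level "metadata" object as the
    value returned by pdftract.extract() in the Python quickstart.
    """
    with Path(result_path).open(encoding="utf-8") as fh:
        doc = json.load(fh)
    return doc["metadata"]["page_count"]
```

After running `pdftract extract file.pdf --json result.json`, calling `page_count("result.json")` should give the same number the quickstart prints, provided the assumption about the schema holds.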

What it does

  • Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
  • Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
  • Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
  • Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
  • Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
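
The layered font encoding recovery described above can be pictured as an ordered chain of fallbacks, where each stage is tried only if the previous one found nothing. The sketch below illustrates that control flow only; it is not pdftract's implementation. The glyph dict keys (`name`, `font_id`, `gid`, `outline_hash`), the lookup tables, and all function names are hypothetical stand-ins.

```python
from typing import Callable, Optional

# A glyph is modelled as a plain dict; every key used below is hypothetical.
Glyph = dict
Stage = Callable[[Glyph], Optional[str]]

# Illustrative subset of glyph-name mappings (these three match the Adobe Glyph List).
GLYPH_NAMES = {"A": "A", "eacute": "\u00e9", "fi": "\ufb01"}

# Hypothetical tables standing in for real font and outline-shape databases.
KNOWN_FONTS = {("font-abc", 36): "A"}
KNOWN_SHAPES = {"shape-7f": "B"}


def by_glyph_name(glyph: Glyph) -> Optional[str]:
    """Stage 1: map a PostScript glyph name to Unicode."""
    return GLYPH_NAMES.get(glyph.get("name"))


def by_font_fingerprint(glyph: Glyph) -> Optional[str]:
    """Stage 2: look up (font fingerprint, glyph id) in a table of known fonts."""
    return KNOWN_FONTS.get((glyph.get("font_id"), glyph.get("gid")))


def by_outline_shape(glyph: Glyph) -> Optional[str]:
    """Stage 3: match a hash of the glyph outline against known shapes."""
    return KNOWN_SHAPES.get(glyph.get("outline_hash"))


# Cheapest and most reliable stage first.
STAGES: list[tuple[str, Stage]] = [
    ("glyph_name", by_glyph_name),
    ("font_fingerprint", by_font_fingerprint),
    ("outline_shape", by_outline_shape),
]


def recover(glyph: Glyph) -> tuple[str, str]:
    """Return (text, stage_name) from the first stage that yields a result."""
    for name, stage in STAGES:
        text = stage(glyph)
        if text is not None:
            return text, name
    # Nothing matched: emit the Unicode replacement character.
    return "\ufffd", "unresolved"
```

Returning the stage name alongside the text is one way to feed the per-span provenance and confidence fields mentioned in the last bullet: a result from a later stage would typically deserve a lower confidence than a direct glyph-name hit.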

Documentation

License

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.