A PDF text extraction library that gets the hard parts right.

jedarden bb7146cffe fix(pdftract-2uk9z): wrap native module results in typed Python objects		2026-05-28 21:18:38 -04:00

The native PyO3 module returns raw dicts via pythonize, but the Python SDK API expects typed dataclass objects (Document, Page, Metadata, etc.) to be consistent with the subprocess fallback and test expectations.

Updated wrapper functions in __init__.py to convert native results:
- extract(): wraps dict in Document.from_dict()
- extract_stream(): wraps yielded page dicts in Page.from_dict()
- get_metadata(): wraps dict in Metadata()
- hash(): wraps string in Fingerprint.from_string()
- classify(): wraps dict in Classification()
- search(): wraps yielded match dicts in Match

The native PyO3 entry points (extract, extract_text, extract_stream) were already implemented with:
- extract: uses extract_pdf + pythonize for PyDict conversion
- extract_text: uses extract_text for plain String return
- extract_stream: uses extract_pdf_streaming with custom StreamIterator

All kwargs parsing with strict validation (unknown kwargs raise TypeError) was already in place.

Acceptance criteria:
- pdftract.extract() returns Document object with pages/metadata
- pdftract.extract_text() returns plain text string
- pdftract.extract_stream() yields Page objects
- Unknown kwarg raises TypeError
.cargo	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata	2026-05-24 17:31:16 -04:00
.ci	feat(pdftract-3990k): log-policy enforcement - NEVER-log secrets	2026-05-28 13:31:04 -04:00
.config	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
.git-hooks	fix(pdftract-5z5d8): add pre-commit hook for provenance validation	2026-05-17 23:50:28 -04:00
.github	ci: remove GitHub Actions workflow (Argo Workflows on iad-ci only)	2026-05-28 08:48:06 -04:00
.marathon	fix(marathon): forbid ad-hoc bare cargo test, mandate nextest filters	2026-05-25 19:45:42 -04:00
benches	fix(pdftract-60h): fix bugs in benchmark runner script	2026-05-18 01:29:41 -04:00
build	feat(glyph-shape): implement font corpus fetch script and shape DB generation	2026-05-24 09:48:29 -04:00
ci	feat(pdftract-48ea): implement BrokenVector fixtures + WER delta CI gate	2026-05-24 10:52:41 -04:00
crates	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
distribution	feat(pdftract-1eaxm): implement libpdftract C FFI library	2026-05-23 08:55:12 -04:00
docs	docs(pdftract-19oy): add verification note for codespace parser + tokenizer	2026-05-28 12:26:25 -04:00
examples	chore(pdftract-36glh): remove unused JpxDecoder import and add verification note	2026-05-28 05:23:13 -04:00
fuzz	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
notes	docs(pdftract-4em4l): verify audit logging implementation complete	2026-05-28 21:18:38 -04:00
pdftract-dotnet	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
pdftract-go	fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup	2026-05-20 19:08:14 -04:00
pdftract-java	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
pdftract-node	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
profiles/builtin	feat(pdftract-1t5sj): implement book_chapter profile with fixtures and tests	2026-05-27 22:30:09 -04:00
proptest-regressions	feat(pdftract-33v): implement property tests and nightly fuzz job	2026-05-22 23:13:13 -04:00
scripts	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
src	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
templates/sdk-skeleton	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
tests	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
tools	feat(bf-2ervu): implement mmap-backed PdfSource via memmap2	2026-05-24 08:40:11 -04:00
xtask	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
.gitignore	feat(pdftract-juc): implement Standard 14 font metrics registry	2026-05-23 14:04:02 -04:00
.needle-predispatch-sha	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
.nextest.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
.renovaterc.json	docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule	2026-05-20 18:22:03 -04:00
audit.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
audit_docs.py	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
Cargo-dist.toml	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
Cargo.lock	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
Cargo.toml	fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding	2026-05-28 01:30:33 -04:00
CHANGELOG.md	feat(pdftract-2w02): implement MSRV gate with CI check	2026-05-20 19:03:53 -04:00
CLAUDE.md	docs: add active GitHub Actions deletion instruction to CI section	2026-05-28 06:57:14 -04:00
clippy.toml	feat(pdftract-xzfkt): implement caption block classifier	2026-05-24 01:56:34 -04:00
CODE_OF_CONDUCT.md	docs(pdftract-4618): adopt Contributor Covenant v2.1 and link from templates	2026-05-24 13:06:57 -04:00
conformance_test	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
CONTRIBUTING.md	fix(pdftract-4pnmd): build.rs doc comment format string parsing	2026-05-28 14:36:45 -04:00
Cross.toml	ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix	2026-05-23 11:37:19 -04:00
debug_fixtures.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
deny.toml	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
Dockerfile	feat(pdftract-68pe): add Dockerfile with FEATURES build-arg support	2026-05-20 19:17:49 -04:00
generate_expected_json.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
libstdin.rlib	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
LICENSE-APACHE	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
LICENSE-MIT	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
mod	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree	2026-05-17 23:45:45 -04:00
out.pdf	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
pdftract-test-merged.cdx.json	feat(pdftract-67tm8): implement MCP stdio transport with integration tests	2026-05-23 00:16:42 -04:00
README.md	docs(pdftract-4bpph): add README.md with KU-12 caveat, status badges, and quickstart	2026-05-28 08:11:08 -04:00
SECURITY.md	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
test_api_null.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_audit_debug.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_audit_integration.rs	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
test_classifier_corpus	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_debug_pdf.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_empty	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_empty.c	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
test_extract.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_flate.rs	docs(pdftract-49f8): establish Cargo.lock policy and documentation	2026-05-20 18:13:14 -04:00
test_page_class	fix: resolve compilation errors across codebase	2026-05-25 08:38:04 -04:00
test_pdf	feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback	2026-05-23 20:53:25 -04:00
test_stream_decode.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer.rs	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
test_trailer_parsing.rs	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00

README.md

pdftract

A PDF text extraction library that gets the hard parts right.

Platform Support

Platform	Status
Linux x86_64	Fully CI-tested (gating CI on every PR)
Linux aarch64	Fully CI-tested
macOS x86_64	Build-tested; manually smoke-tested per release
macOS aarch64	Build-tested; manually smoke-tested per release
Windows x86_64	Build-tested; manually smoke-tested per release

Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

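use pdftract_core::{extract_pdf, ExtractionOptions};

// `?` needs an enclosing function that returns Result. Boxing the error
// assumes extract_pdf's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}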

Python

import pdftract

doc = pdftract.extract("file.pdf")
# extract() returns a typed Document object, so use attribute access
print(f"Extracted {doc.metadata.page_count} pages")
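
The latest commit message describes the Python wrapper converting the raw dicts returned by the native module into typed objects via from_dict(), and rejecting unknown keyword arguments with TypeError. The sketch below illustrates that pattern in a self-contained form. The class fields, the `_native_extract` stand-in, and the `password` option are hypothetical and are not pdftract's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Metadata:
    # Hypothetical fields; the real Metadata class may differ.
    page_count: int = 0
    title: Optional[str] = None


@dataclass
class Page:
    index: int
    text: str

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "Page":
        return cls(index=raw["index"], text=raw.get("text", ""))


@dataclass
class Document:
    metadata: Metadata
    pages: List[Page] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "Document":
        # Convert nested raw dicts into typed objects in one place.
        return cls(
            metadata=Metadata(**raw.get("metadata", {})),
            pages=[Page.from_dict(p) for p in raw.get("pages", [])],
        )


_ALLOWED_KWARGS = {"password"}  # hypothetical option name


def _native_extract(path: str, **kwargs: Any) -> Dict[str, Any]:
    # Stand-in for the native call: returns a plain dict, as pythonize would.
    return {
        "metadata": {"page_count": 1, "title": None},
        "pages": [{"index": 0, "text": "hello"}],
    }


def extract(path: str, **kwargs: Any) -> Document:
    # Strict validation: any unknown keyword argument is an error.
    unknown = set(kwargs) - _ALLOWED_KWARGS
    if unknown:
        raise TypeError(f"unexpected keyword argument(s): {sorted(unknown)}")
    return Document.from_dict(_native_extract(path, **kwargs))


doc = extract("file.pdf")
print(doc.metadata.page_count, doc.pages[0].text)
```

Keeping the conversion in from_dict() classmethods means the native path and a subprocess fallback can share the same typed result, which is the consistency the commit message asks for.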

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice
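
The JSON written by the first command can be post-processed with any JSON tooling. The sketch below flags low-confidence spans for review. The JSON shape used here (`pages`, `spans`, `text`, `bbox`, `font`, `size`, `confidence`) is an assumption made for illustration, so consult the published schema for the real field names.

```python
import json

# Hypothetical sample standing in for result.json; the field names are assumed.
SAMPLE = """
{
  "metadata": {"page_count": 1},
  "pages": [
    {"index": 0, "spans": [
      {"text": "Hello", "bbox": [72.0, 700.0, 110.0, 712.0],
       "font": "Helvetica", "size": 12.0, "confidence": 0.99},
      {"text": "w0rld", "bbox": [114.0, 700.0, 150.0, 712.0],
       "font": "Helvetica", "size": 12.0, "confidence": 0.41}
    ]}
  ]
}
"""


def low_confidence_spans(doc, threshold=0.8):
    """Return (page index, text, confidence) for spans below the threshold."""
    hits = []
    for page in doc.get("pages", []):
        for span in page.get("spans", []):
            conf = span.get("confidence")
            if conf is not None and conf < threshold:
                hits.append((page.get("index"), span.get("text", ""), conf))
    return hits


doc = json.loads(SAMPLE)
flagged = low_confidence_spans(doc)
for page_index, text, conf in flagged:
    print(f"page {page_index}: {text!r} (confidence {conf:.2f})")
```

For a real run, replace `json.loads(SAMPLE)` with `json.load(open("result.json"))` and adjust the keys to match the actual output.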

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
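
The layered font encoding recovery described above can be pictured as a chain of fallbacks in which the first stage to produce text wins. This is a minimal sketch covering only a ToUnicode lookup and a glyph name lookup. The function names, the stage order, and the tiny glyph table are illustrative assumptions, not pdftract's implementation, and the fingerprinting and shape-matching stages are omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Tiny subset of the Adobe Glyph List; a real table has thousands of entries.
AGL_SUBSET: Dict[str, str] = {
    "A": "A",
    "space": " ",
    "eacute": "\u00e9",
    "fi": "\ufb01",
}

Stage = Callable[[int, Optional[str], Dict[int, str]], Optional[str]]


def from_tounicode(code: int, glyph_name: Optional[str],
                   cmap: Dict[int, str]) -> Optional[str]:
    # Stage 1: use the font's ToUnicode mapping when it has an entry.
    return cmap.get(code)


def from_glyph_name(code: int, glyph_name: Optional[str],
                    cmap: Dict[int, str]) -> Optional[str]:
    # Stage 2: fall back to the glyph's name.
    if not glyph_name:
        return None
    if glyph_name in AGL_SUBSET:
        return AGL_SUBSET[glyph_name]
    # "uniXXXX" names encode a code point as four hex digits.
    if len(glyph_name) == 7 and glyph_name.startswith("uni"):
        try:
            return chr(int(glyph_name[3:], 16))
        except ValueError:
            return None
    return None


STAGES: List[Tuple[str, Stage]] = [
    ("tounicode", from_tounicode),
    ("glyph_name", from_glyph_name),
]


def recover(code: int, glyph_name: Optional[str],
            cmap: Dict[int, str]) -> Tuple[str, str]:
    """Return (text, stage name); U+FFFD marks an unresolved glyph."""
    for name, stage in STAGES:
        text = stage(code, glyph_name, cmap)
        if text:
            return text, name
    return "\ufffd", "unresolved"


print(recover(65, None, {65: "A"}))
print(recover(1, "uni00E9", {}))
print(recover(3, "g123", {}))
```

Returning the stage name alongside the text is one way to carry provenance forward, so downstream consumers can tell how each character was recovered.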

Documentation

User docs: docs/user-docs (mdBook)
API reference: docs.rs/pdftract
Contributing guide: CONTRIBUTING.md
Security policy: SECURITY.md
Changelog: CHANGELOG.md
License: LICENSE-MIT or LICENSE-APACHE

License

Licensed under either of:

MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)

at your option.