pdftract
A PDF text extraction library that gets the hard parts right.
Platform Support
| Platform | Status |
|---|---|
| Linux x86_64 | Fully CI-tested (gating CI on every PR) |
| Linux aarch64 | Fully CI-tested |
| macOS x86_64 | Build-tested; manually smoke-tested per release |
| macOS aarch64 | Build-tested; manually smoke-tested per release |
| Windows x86_64 | Build-tested; manually smoke-tested per release |
Note: macOS and Windows are build-tested and manually smoke-tested per release rather than fully CI-tested. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.
Installation
cargo
cargo install pdftract
pip
pip install pdftract
Docker
docker pull ronaldraygun/pdftract:latest
Homebrew
brew install pdftract
Quickstart
Rust
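use pdftract_core::{extract_pdf, ExtractionOptions};
// `?` needs a Result-returning function; assumes the error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}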
Python
import pdftract
doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")
CLI
pdftract extract file.pdf --json result.json # JSON output
pdftract extract file.pdf --text - # Plain text to stdout
pdftract serve --port 8080 # HTTP microservice
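For scripting, the JSON form of the CLI can be driven from Python. The following is a minimal sketch, not part of pdftract itself: it assumes the `pdftract` binary is installed and on `PATH`, and the helper names `build_extract_command` and `extract_to_json` are hypothetical. Only the `extract` subcommand and the `--json` flag are taken from the CLI examples above.

```python
import json
import subprocess
from pathlib import Path


def build_extract_command(pdf_path, json_path):
    """Build the argv for `pdftract extract <pdf> --json <out>`."""
    return ["pdftract", "extract", str(pdf_path), "--json", str(json_path)]


def extract_to_json(pdf_path, json_path):
    """Run the CLI and return the parsed JSON document.

    Assumes the `pdftract` binary is installed and on PATH.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    subprocess.run(build_extract_command(pdf_path, json_path), check=True)
    return json.loads(Path(json_path).read_text(encoding="utf-8"))
```

Calling `extract_to_json("file.pdf", "result.json")` would return the parsed output as a dictionary. Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.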
What it does
- Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
- Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
- Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
- Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
- Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
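One use of per-span provenance is flagging text that may need manual review. The sketch below is illustrative only: this README documents `metadata.page_count` but not the rest of the output schema, so the keys `pages`, `spans`, `text`, `bbox`, `font`, `size`, and `confidence` are assumed names, and the sample data is made up. Check the API reference for the real field names before relying on it.

```python
import json

# Hypothetical sample: only `metadata.page_count` appears in this README.
# Every other key below is an assumed shape for illustration.
SAMPLE = json.loads("""
{
  "metadata": {"page_count": 1},
  "pages": [
    {"spans": [
      {"text": "Invoice", "bbox": [72.0, 700.0, 130.0, 714.0],
       "font": "Helvetica-Bold", "size": 14.0, "confidence": 0.99},
      {"text": "T0tal", "bbox": [72.0, 100.0, 110.0, 112.0],
       "font": "Unknown", "size": 10.0, "confidence": 0.42}
    ]}
  ]
}
""")


def low_confidence_spans(doc, threshold=0.8):
    """Yield (page_index, span) for spans scoring below `threshold`."""
    for page_index, page in enumerate(doc.get("pages", [])):
        for span in page.get("spans", []):
            # A span without a score is treated as trusted (assumption).
            if span.get("confidence", 1.0) < threshold:
                yield page_index, span


flagged = list(low_confidence_spans(SAMPLE))
for page_index, span in flagged:
    print(page_index, span["text"], span["bbox"], span["confidence"])
```

Because each flagged span carries its bounding box, a reviewer could be pointed at the exact region of the page rather than at the whole document. The default threshold of 0.8 is arbitrary and would need tuning against real output.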
Documentation
- User docs: docs/user-docs (mdBook)
- API reference: docs.rs/pdftract
- Contributing guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
- License: LICENSE-MIT or LICENSE-APACHE
License
Licensed under either of:
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
at your option.