- Add MSRV 1.78 to installation section - Enhance documentation section with descriptive link text - Ensure all required links are present (user-docs, extraction-output-schema, sdk-architecture, manual-platform-smoke) Closes pdftract-5gld
4 KiB
4 KiB
pdftract
A PDF text extraction library that gets the hard parts right.
Platform Support
| Platform | Status |
|---|---|
| Linux x86_64 | Fully CI-tested (gating CI on every PR) |
| Linux aarch64 | Fully CI-tested |
| macOS x86_64 | Build-tested; manually smoke-tested per release |
| macOS aarch64 | Build-tested; manually smoke-tested per release |
| Windows x86_64 | Build-tested; manually smoke-tested per release |
Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.
Installation
Minimum Supported Rust Version (MSRV): 1.78
cargo
cargo install pdftract
pip
pip install pdftract
Docker
docker pull ronaldraygun/pdftract:latest
Homebrew
brew install pdftract
Quickstart
Rust
use pdftract_core::{extract_pdf, ExtractionOptions};
let opts = ExtractionOptions::default();
let doc = extract_pdf("file.pdf", &opts)?;
println!("Extracted {} pages", doc.metadata.page_count);
Python
import pdftract
doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")
CLI
pdftract extract file.pdf --json result.json # JSON output
pdftract extract file.pdf --text - # Plain text to stdout
pdftract serve --port 8080 # HTTP microservice
What it does
- Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
- Font encoding recovery — when
ToUnicodeCMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching - Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
- Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
- Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
Documentation
- User docs: docs/user-docs (mdBook) — Comprehensive user guide at pdftract.com
- API reference: docs.rs/pdftract — Rust API documentation
- Extraction output schema: docs/research/extraction-output-schema.md
- SDK architecture: docs/notes/sdk-architecture.md
- Platform smoke procedure: docs/operations/manual-platform-smoke.md
- Releases: GitHub Releases
- crates.io: pdftract
- Contributing guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
- License: LICENSE-MIT or LICENSE-APACHE
License
Licensed under either of:
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
at your option.