- Add MSRV 1.78 to installation section - Enhance documentation section with descriptive link text - Ensure all required links are present (user-docs, extraction-output-schema, sdk-architecture, manual-platform-smoke) Closes pdftract-5gld
108 lines
4 KiB
Markdown
108 lines
4 KiB
Markdown
# pdftract
|
|
|
|
[](https://crates.io/crates/pdftract)
|
|
[](https://docs.rs/pdftract)
|
|
[](https://github.com/jedarden/pdftract/blob/main/.ci/argo-workflows/pdftract-ci.yaml)
|
|
[](LICENSE-MIT)
|
|
|
|
A PDF text extraction library that gets the hard parts right.
|
|
|
|
## Platform Support
|
|
|
|
| Platform | Status |
|
|
|----------|--------|
|
|
| Linux x86_64 | Fully CI-tested (gating CI on every PR) |
|
|
| Linux aarch64 | Fully CI-tested |
|
|
| macOS x86_64 | Build-tested; manually smoke-tested per release |
|
|
| macOS aarch64 | Build-tested; manually smoke-tested per release |
|
|
| Windows x86_64 | Build-tested; manually smoke-tested per release |
|
|
|
|
> **Note:** Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See [docs/operations/manual-platform-smoke.md](docs/operations/manual-platform-smoke.md) for the per-release smoke procedure.
|
|
|
|
## Installation
|
|
|
|
**Minimum Supported Rust Version (MSRV):** 1.78
|
|
|
|
### cargo
|
|
|
|
```bash
|
|
cargo install pdftract
|
|
```
|
|
|
|
### pip
|
|
|
|
```bash
|
|
pip install pdftract
|
|
```
|
|
|
|
### Docker
|
|
|
|
```bash
|
|
docker pull ronaldraygun/pdftract:latest
|
|
```
|
|
|
|
### Homebrew
|
|
|
|
```bash
|
|
brew install pdftract
|
|
```
|
|
|
|
## Quickstart
|
|
|
|
### Rust
|
|
|
|
```rust
|
|
use pdftract_core::{extract_pdf, ExtractionOptions};
|
|
|
|
let opts = ExtractionOptions::default();
|
|
let doc = extract_pdf("file.pdf", &opts)?;
|
|
println!("Extracted {} pages", doc.metadata.page_count);
|
|
```
|
|
|
|
### Python
|
|
|
|
```python
|
|
import pdftract
|
|
|
|
doc = pdftract.extract("file.pdf")
|
|
print(f"Extracted {doc['metadata']['page_count']} pages")
|
|
```
|
|
|
|
### CLI
|
|
|
|
```bash
|
|
pdftract extract file.pdf --json result.json # JSON output
|
|
pdftract extract file.pdf --text - # Plain text to stdout
|
|
pdftract serve --port 8080 # HTTP microservice
|
|
```
|
|
|
|
## What it does
|
|
|
|
- **Correct reading order** — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
|
|
- **Font encoding recovery** — when `ToUnicode` CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
|
|
- **Structure tree extraction** — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
|
|
- **Per-page hybrid routing** — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
|
|
- **Structured output with provenance** — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
|
|
|
|
## Documentation
|
|
|
|
- **User docs:** [docs/user-docs](docs/user-docs/) (mdBook) — Comprehensive user guide at [pdftract.com](https://pdftract.com)
|
|
- **API reference:** [docs.rs/pdftract](https://docs.rs/pdftract) — Rust API documentation
|
|
- **Extraction output schema:** [docs/research/extraction-output-schema.md](docs/research/extraction-output-schema.md)
|
|
- **SDK architecture:** [docs/notes/sdk-architecture.md](docs/notes/sdk-architecture.md)
|
|
- **Platform smoke procedure:** [docs/operations/manual-platform-smoke.md](docs/operations/manual-platform-smoke.md)
|
|
- **Releases:** [GitHub Releases](https://github.com/jedarden/pdftract/releases)
|
|
- **crates.io:** [pdftract](https://crates.io/crates/pdftract)
|
|
- **Contributing guide:** [CONTRIBUTING.md](CONTRIBUTING.md)
|
|
- **Security policy:** [SECURITY.md](SECURITY.md)
|
|
- **Changelog:** [CHANGELOG.md](CHANGELOG.md)
|
|
- **License:** [LICENSE-MIT](LICENSE-MIT) or [LICENSE-APACHE](LICENSE-APACHE)
|
|
|
|
## License
|
|
|
|
Licensed under either of:
|
|
|
|
- MIT License ([LICENSE-MIT](LICENSE-MIT) or https://opensource.org/licenses/MIT)
|
|
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or https://www.apache.org/licenses/LICENSE-2.0)
|
|
|
|
at your option.
|