pdftract/README.md
jedarden 56d7c1b3f7 docs(pdftract-5gld): update README with MSRV and enhanced documentation links
- Add MSRV 1.78 to installation section
- Enhance documentation section with descriptive link text
- Ensure all required links are present (user-docs, extraction-output-schema, sdk-architecture, manual-platform-smoke)

Closes pdftract-5gld
2026-06-08 20:00:38 -04:00

pdftract

A PDF text extraction library that gets the hard parts right.

Platform Support

| Platform | Status |
| --- | --- |
| Linux x86_64 | Fully CI-tested (gating CI on every PR) |
| Linux aarch64 | Fully CI-tested |
| macOS x86_64 | Build-tested; manually smoke-tested per release |
| macOS aarch64 | Build-tested; manually smoke-tested per release |
| Windows x86_64 | Build-tested; manually smoke-tested per release |

Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

Minimum Supported Rust Version (MSRV): 1.78

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

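use pdftract_core::{extract_pdf, ExtractionOptions};

// Assumes the error returned by extract_pdf implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}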

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice
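
The CLI can also be driven from a script. The following is a minimal sketch, assuming `pdftract` is on `PATH` and that the file written by `--json` is UTF-8 JSON. The helper names `build_extract_command` and `extract_to_dict` are hypothetical and not part of pdftract; only the `extract` subcommand and the `--json` flag shown above are used.

```python
import json
import subprocess
from pathlib import Path


def build_extract_command(pdf_path, json_path):
    """Build the argument list for `pdftract extract <pdf> --json <out>`."""
    return ["pdftract", "extract", str(pdf_path), "--json", str(json_path)]


def extract_to_dict(pdf_path, json_path="result.json"):
    """Run the CLI and load the JSON file it wrote (hypothetical helper).

    Raises subprocess.CalledProcessError if the CLI exits with a non-zero status.
    """
    subprocess.run(build_extract_command(pdf_path, json_path), check=True)
    return json.loads(Path(json_path).read_text(encoding="utf-8"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.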

What it does

  • Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
  • Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
  • Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
  • Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
  • Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score

Documentation

License

Licensed under either of:

at your option.