A PDF text extraction library that gets the hard parts right.

jedarden 3af009440e fix(pdftract-5z5d8): fix provenance validation script	2026-05-17 23:43:37 -04:00

Fixed scripts/check-provenance.sh to properly validate PROVENANCE.md against actual fixture files. The script was failing silently due to subshell EXIT trap removing temp files before parent could read them, and arithmetic expansion returning exit code 1 on zero value.

Changes:
- Replaced subshell pipes with process substitution
- Moved temp file cleanup to after reading
- Added validated variable initialization
- Added || true to prevent exit on zero arithmetic

All 200 classifier corpus fixtures have valid provenance entries with matching SHA256 hashes. PROVENANCE.md already existed with complete documentation.

Refs: pdftract-5z5d8
Co-Authored-By: Claude Code <noreply@anthropic.com>

crates/pdftract-core/src/parser/lexer	feat(pdftract-4hn1): use Cow<'static, str> for diagnostic messages	2026-05-17 23:23:38 -04:00
docs	docs(pdftract-147a): author SDK contract specification	2026-05-17 23:13:55 -04:00
notes	fix(pdftract-5z5d8): fix provenance validation script	2026-05-17 23:43:37 -04:00
profiles/builtin	fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling	2026-05-17 23:22:39 -04:00
scripts	fix(pdftract-5z5d8): fix provenance validation script	2026-05-17 23:43:37 -04:00
src	test(classifier): add 200-document labeled corpus for Phase 5.6	2026-05-17 07:16:02 -04:00
tests	fix(pdftract-5z5d8): fix provenance validation script	2026-05-17 23:43:37 -04:00
xtask	fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling	2026-05-17 23:22:39 -04:00
Cargo.lock	test(classifier): add 200-document labeled corpus for Phase 5.6	2026-05-17 07:16:02 -04:00
Cargo.toml	test(classifier): add 200-document labeled corpus for Phase 5.6	2026-05-17 07:16:02 -04:00
README.md	Rewrite README to lead with capabilities, drop competitor references	2026-05-16 14:46:33 -04:00

pdftract

A PDF text extraction library that gets the hard parts right.

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
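
The layered font encoding recovery described above can be pictured as a fallback chain: try the cheapest strategy first and move to the next layer only when it yields nothing. The following is a minimal Python sketch of that idea, not pdftract's actual implementation (the core is Rust). All function names are hypothetical, the glyph table is a four-entry subset of the Adobe Glyph List, and the fingerprinting and outline-matching layers are placeholders that always miss.

```python
from typing import Callable, List, Optional, Tuple

# Four real Adobe Glyph List entries (glyph name -> Unicode text).
# The full list has thousands of entries; this subset is for illustration.
AGL_SUBSET = {
    "A": "A",
    "eacute": "\u00e9",
    "fi": "\ufb01",
    "space": " ",
}


def by_glyph_name(glyph_name: str) -> Optional[str]:
    """Layer 1 (hypothetical name): look the glyph name up in the AGL subset."""
    return AGL_SUBSET.get(glyph_name)


def by_font_fingerprint(glyph_name: str) -> Optional[str]:
    """Layer 2 placeholder: a real layer would match font metrics or checksums."""
    return None


def by_outline_shape(glyph_name: str) -> Optional[str]:
    """Layer 3 placeholder: a real layer would compare glyph outlines."""
    return None


# Layers are tried in order; the label records which one produced the answer.
LAYERS: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("glyph_name", by_glyph_name),
    ("font_fingerprint", by_font_fingerprint),
    ("outline_shape", by_outline_shape),
]


def recover(glyph_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, layer_label) from the first layer that succeeds."""
    for label, layer in LAYERS:
        text = layer(glyph_name)
        if text is not None:
            return text, label
    return None, None
```

Returning the layer label alongside the text is one way a pipeline like this could feed a per-span confidence score, since a glyph-name hit is usually more trustworthy than a shape match.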

Output

{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}
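
Because each span carries its own confidence score, a consumer can flag uncertain text without re-reading the PDF. The sketch below parses a document shaped like the sample above and pulls out the headings and any spans under a confidence threshold. It assumes only the field names shown in the sample; the helper names (`headings`, `low_confidence_spans`) are illustrative and not part of pdftract.

```python
import json
from typing import Any, Dict, List

# The sample output from above, with the elided strings left as "...".
SAMPLE = """
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700],
          "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}
"""


def headings(doc: Dict[str, Any]) -> List[str]:
    """Collect the text of every block whose kind is "heading", in page order."""
    return [
        block["text"]
        for page in doc["pages"]
        for block in page["blocks"]
        if block["kind"] == "heading"
    ]


def low_confidence_spans(doc: Dict[str, Any], threshold: float) -> List[Dict[str, Any]]:
    """Return spans whose confidence is below threshold, tagged with their page."""
    return [
        {"page": page["page"], **span}
        for page in doc["pages"]
        for span in page["spans"]
        if span["confidence"] < threshold
    ]


doc = json.loads(SAMPLE)
```

With the sample loaded, `headings(doc)` yields the single heading, and `low_confidence_spans(doc, 0.9)` is empty because the only span scores 0.99.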

Usage

pdftract extract invoice.pdf            # structured JSON to stdout
pdftract extract invoice.pdf --text     # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080              # HTTP service: POST /extract
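
The CLI can also be driven from a script. Here is a minimal Python sketch that wraps the `extract` invocations shown above with `subprocess`. It assumes a `pdftract` binary on `PATH` and uses only the flags listed here; the function names are hypothetical, and combining `--text` with `--output` is an untested assumption.

```python
import json
import subprocess
from typing import Any, Dict, List, Optional


def build_command(pdf_path: str, text: bool = False,
                  output: Optional[str] = None) -> List[str]:
    """Build the argument list for `pdftract extract`."""
    cmd = ["pdftract", "extract", pdf_path]
    if text:
        cmd.append("--text")            # plain text instead of JSON
    if output is not None:
        cmd += ["--output", output]     # write to a file instead of stdout
    return cmd


def extract_json(pdf_path: str) -> Dict[str, Any]:
    """Run the extractor and parse the structured JSON it prints to stdout."""
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,                     # raise CalledProcessError on failure
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.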

Architecture

Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.

See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.

Status

Early development. See docs/plan/ for the implementation roadmap.