A PDF text extraction library that gets the hard parts right.
Implements the conformance test runner pattern for all 10 SDKs as specified in the plan (line 3547). Each SDK now has a dedicated conformance test runner. Created: - tests/sdk-conformance/report-schema.json: JSON schema for conformance reports - docs/notes/sdk-conformance-runner.md: Pattern documentation and reference - crates/pdftract-cli/tests/conformance.rs: Rust cargo test target - tests/conformance/test_conformance.py: Python pytest harness - tests/conformance/conformance.test.ts: Node.js vitest runner - tests/conformance/conformance_test.go: Go go test runner - tests/conformance/ConformanceTest.java: Java JUnit 5 runner - tests/conformance/ConformanceTests.cs: .NET xUnit runner - tests/conformance/conformance.c: C standalone binary - tests/conformance/conformance_test.rb: Ruby minitest runner - tests/conformance/ConformanceTest.php: PHP PHPUnit runner - tests/conformance/ConformanceTests.swift: Swift XCTest runner All runners implement: - Loading of tests/sdk-conformance/cases.json - Execution of test cases with language-native method invocations - Comparison of results against expected values with numeric tolerances - Emission of machine-readable conformance-report.json - Non-zero exit on failures/errors for CI gating Acceptance criteria: - PASS: All 10 SDKs have language-specific runners - PASS: Runners consume shared cases.json - PASS: Runners emit JSON reports matching schema - PASS: Runners exit non-zero on failure - WARN: README integration pending SDK repo creation - WARN: Stub implementations return placeholder results References: - Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner" - Plan line 3589: "Conformance suite results published as Argo artifact" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-5omc |
||
|---|---|---|
| .ci/argo-workflows | ||
| .git-hooks | ||
| benches | ||
| crates | ||
| docs | ||
| notes | ||
| profiles/builtin | ||
| scripts | ||
| src | ||
| tests | ||
| xtask | ||
| .needle-predispatch-sha | ||
| Cargo.lock | ||
| Cargo.toml | ||
| CLAUDE.md | ||
| mod | ||
| README.md | ||
pdftract
A PDF text extraction library that gets the hard parts right.
What it does
- Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
- Font encoding recovery — when
ToUnicodeCMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching - Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a
StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost - Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
- Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump
Output
{
"pages": [
{
"page": 1,
"blocks": [
{ "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
{ "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
],
"spans": [
{ "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
]
}
],
"metadata": { "title": "...", "author": "...", "page_count": 10 }
}
Usage
pdftract extract invoice.pdf # structured JSON to stdout
pdftract extract invoice.pdf --text # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080 # HTTP service: POST /extract
Architecture
Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.
See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.
Status
Early development. See docs/plan/ for the implementation roadmap.