A PDF text extraction library that gets the hard parts right.
Find a file
jedarden b535638104 feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.

Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields

Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences

All 27 tests pass including proptests for fuzzing robustness (INV-8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:45:45 -04:00
crates/pdftract-core feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
docs docs(pdftract-147a): author SDK contract specification 2026-05-17 23:13:55 -04:00
notes feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
profiles/builtin fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling 2026-05-17 23:22:39 -04:00
scripts fix(pdftract-5z5d8): fix provenance validation script 2026-05-17 23:43:37 -04:00
src feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
tests fix(pdftract-5z5d8): fix provenance validation script 2026-05-17 23:43:37 -04:00
xtask feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
.needle-predispatch-sha feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
Cargo.lock feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
Cargo.toml feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
CLAUDE.md feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
mod feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree 2026-05-17 23:45:45 -04:00
README.md Rewrite README to lead with capabilities, drop competitor references 2026-05-16 14:46:33 -04:00

pdftract

A PDF text extraction library that gets the hard parts right.

What it does

  • Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order
  • Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching
  • Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure (headings, paragraphs, lists, tables, reading order) in a StructTree; pdftract reads this directly when present, producing accurate semantic output at no extra cost
  • Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR where vector hints improve raster accuracy
  • Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump

Output

{
  "pages": [
    {
      "page": 1,
      "blocks": [
        { "kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700] },
        { "kind": "paragraph", "text": "...", "bbox": [72, 640, 540, 670] }
      ],
      "spans": [
        { "text": "Introduction", "bbox": [72, 680, 400, 700], "font": "Times-Bold", "size": 14.0, "confidence": 0.99 }
      ]
    }
  ],
  "metadata": { "title": "...", "author": "...", "page_count": 10 }
}

Usage

pdftract extract invoice.pdf            # structured JSON to stdout
pdftract extract invoice.pdf --text     # plain text to stdout
pdftract extract invoice.pdf --output out.json
pdftract serve --port 8080              # HTTP service: POST /extract

Architecture

Rust core with PyO3 Python bindings and a CLI binary. The same binary runs as a command-line tool or as an HTTP microservice — the container deployment is just pdftract serve.

See docs/research/ for technical deep-dives into the PDF specification, font encoding, glyph Unicode recovery, and tagged PDF structure. See docs/notes/ for SDK invocation examples in Python, Node.js, Go, Ruby, Java, Rust, and Bash.

Status

Early development. See docs/plan/ for the implementation roadmap.