Introduction

What pdftract Does

pdftract is a PDF text extraction library focused on the hard parts of extraction: reading order, broken font encodings, and documents that mix vector and scanned pages. Naive PDF parsers dump text in the order it appears in the PDF file, which is rarely the correct reading order. pdftract instead analyzes document layout and recovers the logical structure that humans perceive when reading a page.

Core Features

Correct reading order — Layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order. pdftract groups text into semantic blocks (headings, paragraphs, lists, tables) and outputs them in the order a human would read.
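
To make the idea of sequencing regions concrete, here is a minimal sketch of column-aware ordering. It is not pdftract's algorithm. The `Block` type, the `reading_order` function, and the `gap` threshold are hypothetical names chosen for illustration, and the coordinates assume a top-left origin with y growing downward.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout block: a bounding box plus its text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, gap=20.0):
    """Group blocks into columns by horizontal gaps, then read each column top to bottom."""
    columns = []  # each column is a list of blocks
    for block in sorted(blocks, key=lambda b: b.x0):
        # Stay in the current column unless this block begins well to the
        # right of everything already placed in it.
        if columns and block.x0 <= max(b.x1 for b in columns[-1]) + gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# A two-column page whose blocks are stored row by row, as a naive dump would emit them.
page = [
    Block(50, 100, 280, 200, "left-1"),
    Block(320, 100, 550, 200, "right-1"),
    Block(50, 220, 280, 320, "left-2"),
    Block(320, 220, 550, 320, "right-2"),
]
order = [b.text for b in reading_order(page)]
```

With this input the left column is read in full before the right one. A real segmenter also has to handle sidebars, footnotes, and blocks that span columns, which this sketch ignores.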

Font encoding recovery — When ToUnicode CMaps are absent, wrong, or incomplete (a common problem in PDFs generated by legacy tools), pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching. This means you get readable Unicode text even from broken PDFs.
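
The layered fallback described above can be sketched as a chain of stages tried in order. This is an illustration under stated assumptions, not pdftract's implementation: the glyph dictionaries, the stage functions, and `recover` are hypothetical, the fingerprinting and outline stages are stubs, and the sketch only covers missing ToUnicode entries, not wrong ones. The glyph names and the `uniXXXX` convention come from the Adobe Glyph List.

```python
# Tiny excerpt of the Adobe Glyph List (glyph name -> Unicode string).
AGL_EXCERPT = {
    "A": "A",
    "space": " ",
    "eacute": "\u00e9",
    "fi": "\ufb01",
    "emdash": "\u2014",
}


def from_to_unicode(glyph, to_unicode):
    """Stage 1: use the ToUnicode mapping when it has an entry for this code."""
    return to_unicode.get(glyph["code"])


def from_glyph_name(glyph, to_unicode):
    """Stage 2: look the glyph name up in the AGL, or decode a uniXXXX name."""
    name = glyph.get("name")
    if name is None:
        return None
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    if name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))
        except ValueError:
            return None
    return None


def from_fingerprint(glyph, to_unicode):
    """Stage 3 (stub): font fingerprinting would go here."""
    return None


def from_outline(glyph, to_unicode):
    """Stage 4 (stub): outline shape matching would go here."""
    return None


STAGES = [from_to_unicode, from_glyph_name, from_fingerprint, from_outline]


def recover(glyph, to_unicode):
    """Return (text, stage name) from the first stage that produces text."""
    for stage in STAGES:
        text = stage(glyph, to_unicode)
        if text:
            return text, stage.__name__
    return "\ufffd", "unresolved"  # U+FFFD marks a glyph no stage could resolve


glyphs = [
    {"code": 1, "name": "A"},
    {"code": 2, "name": "fi"},
    {"code": 3, "name": "uni20AC"},
    {"code": 4, "name": "g123"},
]
to_unicode = {1: "A"}  # incomplete map: only code 1 is covered
results = [recover(g, to_unicode) for g in glyphs]
```

Recording which stage produced each character is one way to derive a per-span confidence value later: a direct ToUnicode hit is more trustworthy than a shape match.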

Structure tree extraction — Tagged PDFs, including PDF/UA documents and PDF/A files at conformance level A, encode their logical structure (headings, paragraphs, lists, tables, reading order) in a structure tree (StructTreeRoot). pdftract reads this tree directly when present, which yields accurate semantic output without extra analysis cost. Well-tagged PDFs give near-perfect extraction.
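
As a rough illustration of why a structure tree makes extraction straightforward, the sketch below walks a tree in document order. The nested-dictionary representation and the `walk` function are hypothetical and do not reflect pdftract's data model. The tag names (`Document`, `H1`, `P`, `L`, `LI`) are standard PDF structure types.

```python
def walk(node, depth=0):
    """Yield (depth, tag, text) for every node in document order (pre-order traversal)."""
    yield depth, node["tag"], node.get("text", "")
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


# Hypothetical in-memory form of a small tagged document.
tree = {
    "tag": "Document",
    "children": [
        {"tag": "H1", "text": "Quarterly Report"},
        {"tag": "P", "text": "Revenue grew."},
        {"tag": "L", "children": [
            {"tag": "LI", "text": "North"},
            {"tag": "LI", "text": "South"},
        ]},
    ],
}

tags = [tag for _, tag, _ in walk(tree)]
texts = [text for _, _, text in walk(tree) if text]
```

Because the tree already records both the element types and their order, no layout analysis is needed to recover headings, paragraphs, and list items.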

Per-page hybrid routing — Each page is independently classified and routed to the appropriate pipeline: vector text extraction (for pages with embedded fonts), full OCR (for scanned pages), or assisted OCR where vector hints improve raster accuracy. This hybrid approach optimizes for both accuracy and speed.
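
The routing decision can be pictured as a small classifier over per-page statistics. The sketch below is an assumption-laden illustration, not pdftract's classifier: the two input statistics, the thresholds, and the names `Route` and `route_page` are all hypothetical.

```python
from enum import Enum


class Route(Enum):
    VECTOR = "vector"
    OCR = "ocr"
    ASSISTED_OCR = "assisted_ocr"


def route_page(glyph_count, image_coverage, min_glyphs=50, scan_coverage=0.8):
    """Pick a pipeline from two hypothetical per-page statistics.

    glyph_count:    number of text glyphs drawn by the page's content stream
    image_coverage: fraction of the page area covered by raster images (0 to 1)
    """
    mostly_image = image_coverage >= scan_coverage
    has_text = glyph_count >= min_glyphs
    if mostly_image and has_text:
        # A scan that also carries vector text: use the text as hints for OCR.
        return Route.ASSISTED_OCR
    if mostly_image:
        # A scan with no usable text layer.
        return Route.OCR
    # Ordinary page with embedded fonts.
    return Route.VECTOR


born_digital = route_page(glyph_count=1200, image_coverage=0.05)
scanned = route_page(glyph_count=0, image_coverage=0.98)
scanned_with_text = route_page(glyph_count=900, image_coverage=0.95)
```

Deciding per page rather than per document means a report with one scanned appendix pays the OCR cost only for that appendix.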

Structured output with provenance — The primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump. You get rich metadata that enables downstream processing: layout analysis, font-aware styling, highlight extraction, and confidence-based filtering.
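
Confidence-based filtering, one of the downstream uses mentioned above, might look like the following. The JSON layout and field names (`pages`, `spans`, `text`, `bbox`, `font`, `size`, `confidence`) are assumptions made for this example. Consult the output schema documentation for the actual format.

```python
import json

# Hypothetical output for one page; the real schema may differ.
sample = """
{
  "pages": [
    {
      "spans": [
        {"text": "Invoice", "bbox": [72, 700, 140, 718],
         "font": "Helvetica-Bold", "size": 14.0, "confidence": 0.99},
        {"text": "T0tal", "bbox": [72, 660, 110, 672],
         "font": "Unknown", "size": 10.0, "confidence": 0.41},
        {"text": "2024", "bbox": [150, 700, 185, 718],
         "font": "Helvetica", "size": 14.0, "confidence": 0.93}
      ]
    }
  ]
}
"""


def confident_text(document, threshold=0.8):
    """Join the text of every span whose confidence meets the threshold."""
    kept = []
    for page in document["pages"]:
        for span in page["spans"]:
            if span["confidence"] >= threshold:
                kept.append(span["text"])
    return " ".join(kept)


document = json.loads(sample)
text = confident_text(document)
```

The same loop could just as easily collect bounding boxes of low-confidence spans for manual review instead of discarding them.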

What You Can Extract

  • Text — Plain text or structured JSON with per-character provenance
  • Layout — Bounding boxes for blocks, lines, and spans
  • Metadata — Title, author, creation date, page count, PDF version
  • Structure — Headings, paragraphs, lists, tables (when present in the PDF)
  • Annotations — Comments, highlights, form fields (Phase 7)

What pdftract Does Not Do

pdftract is deliberately scoped. The following features are not in scope for v1.0.0:

| Non-goal | Alternative |
|---|---|
| PDF authoring or writing | lopdf, pdfium-render, printpdf |
| Full PDF rendering / printing | PDFium, MuPDF, Poppler |
| Cryptographic signature validation | openssl smime, dedicated PKI libraries |
| Translation of extracted text | LibreTranslate, DeepL, Argos |
| Summarization | LLM tools via the MCP server integration |
| OCR engine training | Tesseract’s tesstrain tooling |
| Filling out PDF forms | Form-filling tools with authoring support |
| Watermark removal | Detected and excluded from output, not removed from PDF |
| Password cracking | pdfcrack, john |

For the full rationale and scope-lock doctrine, see the Non-Goals section in the project plan.

Supported PDF Features

pdftract supports PDF 1.4 through PDF 2.0, with varying levels of feature coverage:

  • Text extraction — Full support for Type 1, TrueType, OpenType, and CID-keyed fonts
  • Compression — All standard filters (FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, CCITTFaxDecode, DCTDecode)
  • Encryption — RC4 40-bit, RC4 128-bit, AES-128, AES-256 (password required)
  • Structure trees — PDF/UA logical structure reading
  • Forms — AcroForm and XFA field extraction (read-only)
  • Signatures — Signature metadata extraction (validation not performed)
  • Attachments — File attachment extraction
  • Articles — Thread extraction for logical reading flows

See the Advanced Topics section for deep dives into specific features.