Introduction
What pdftract Does
pdftract is a PDF text extraction library that gets the hard parts right. Unlike naive PDF parsers that dump text in the order it appears in the PDF file (which is rarely the correct reading order), pdftract understands document layout and recovers the logical structure that humans perceive when reading a page.
Core Features
Correct reading order — Layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order. pdftract groups text into semantic blocks (headings, paragraphs, lists, tables) and outputs them in the order a human would read.
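To make the difference from file order concrete, here is a minimal sketch of column-aware ordering on a two-column page. It is not pdftract's algorithm: the `Block` type, the midpoint column split, and the coordinates are all assumptions made up for illustration, with a top-left origin and y growing downward.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block (y grows downward in this sketch)


def reading_order(blocks: list[Block], page_width: float) -> list[str]:
    """Order blocks column by column, top to bottom within each column.

    Hypothetical simplification: exactly two columns split at the page midpoint.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= mid), key=lambda b: b.y)
    return [b.text for b in left + right]


# Blocks listed in "file order", which interleaves the two columns.
file_order = [
    Block("L1", x=50, y=100),
    Block("R1", x=350, y=100),
    Block("L2", x=50, y=200),
    Block("R2", x=350, y=200),
]

print(reading_order(file_order, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```

A naive parser would emit L1, R1, L2, R2 here. Real pages also contain sidebars, footnotes, and full-width headings, which a fixed midpoint split cannot handle.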
Font encoding recovery — When ToUnicode CMaps are absent, wrong, or incomplete (a common problem in PDFs generated by legacy tools), pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching. This means you get readable Unicode text even from broken PDFs.
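One way to picture a layered recovery pipeline is as a chain of fallbacks tried in order, stopping at the first stage that produces an answer. The sketch below assumes that shape. The `Glyph` type, the stage functions, and the three-entry glyph table are hypothetical stand-ins, not pdftract internals, and only the first stage has a toy implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Glyph:
    code: int                   # character code from the content stream
    name: Optional[str] = None  # glyph name from the font's encoding, if any


# Toy excerpt standing in for the Adobe Glyph List (the real list is far larger).
GLYPH_NAMES = {"A": "A", "eacute": "\u00e9", "fi": "\ufb01"}


def by_glyph_name(g: Glyph) -> Optional[str]:
    return GLYPH_NAMES.get(g.name) if g.name else None


def by_font_fingerprint(g: Glyph) -> Optional[str]:
    return None  # placeholder: would match the font against known metrics


def by_outline_shape(g: Glyph) -> Optional[str]:
    return None  # placeholder: would compare glyph outlines to reference shapes


# Stages in the order the text lists them (hypothetical function names).
STAGES: list[Callable[[Glyph], Optional[str]]] = [
    by_glyph_name,
    by_font_fingerprint,
    by_outline_shape,
]


def recover(g: Glyph) -> str:
    """Return the first stage's answer, or U+FFFD if every stage fails."""
    for stage in STAGES:
        result = stage(g)
        if result is not None:
            return result
    return "\ufffd"


print(recover(Glyph(code=3, name="eacute")))  # resolved by the glyph-name stage
print(recover(Glyph(code=7)) == "\ufffd")     # no stage could resolve it
```

Ordering the stages from cheapest to most expensive means most glyphs never reach the costly shape-matching step.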
Structure tree extraction — Tagged PDFs, including PDF/UA files and level-A PDF/A files, encode their logical structure (headings, paragraphs, lists, tables, reading order) in a structure tree rooted at StructTreeRoot. pdftract reads this tree directly when present, producing accurate semantic output at no extra cost. Well-tagged PDFs yield near-perfect extraction.
Per-page hybrid routing — Each page is independently classified and routed to the appropriate pipeline: vector text extraction (for pages with embedded fonts), full OCR (for scanned pages), or assisted OCR where vector hints improve raster accuracy. Because routing happens per page, a mostly digital document with a few scanned pages pays the OCR cost only on those pages.
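A per-page router of this kind could look roughly like the sketch below. The `PageStats` fields, the 0.8 image-coverage threshold, and the decision rules are illustrative guesses, not pdftract's actual classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    VECTOR = "vector"
    OCR = "ocr"
    ASSISTED_OCR = "assisted_ocr"


@dataclass
class PageStats:
    text_chars: int        # characters recoverable from embedded fonts
    image_coverage: float  # fraction of the page area covered by raster images


def route(page: PageStats) -> Pipeline:
    """Pick a pipeline for one page. Thresholds are illustrative guesses."""
    has_text = page.text_chars > 0
    mostly_image = page.image_coverage > 0.8
    if has_text and not mostly_image:
        return Pipeline.VECTOR
    if has_text and mostly_image:
        return Pipeline.ASSISTED_OCR  # a scan that also carries some vector text
    return Pipeline.OCR


# A digital page, a pure scan, and a scan with a partial text layer.
pages = [PageStats(1800, 0.05), PageStats(0, 0.98), PageStats(400, 0.95)]
print([route(p).value for p in pages])  # ['vector', 'ocr', 'assisted_ocr']
```

The point of the sketch is that the decision is made independently for each page, so one document can exercise all three pipelines.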
Structured output with provenance — The primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump. You get rich metadata that enables downstream processing: layout analysis, font-aware styling, highlight extraction, and confidence-based filtering.
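As an illustration of confidence-based filtering, the sketch below parses a made-up output fragment and keeps only high-confidence spans. The JSON shape and field names (`spans`, `text`, `bbox`, `font`, `size`, `confidence`) are assumptions for this example only; the schema reference is the authority on the real format.

```python
import json

# Hypothetical output fragment; field names are assumptions, not pdftract's schema.
raw = """
{
  "spans": [
    {"text": "Quarterly Report", "bbox": [72, 60, 310, 84],
     "font": "Helvetica-Bold", "size": 18.0, "confidence": 0.99},
    {"text": "rn0dem", "bbox": [72, 100, 120, 112],
     "font": "Unknown", "size": 10.0, "confidence": 0.41},
    {"text": "Revenue grew 12%.", "bbox": [72, 120, 260, 132],
     "font": "Helvetica", "size": 10.0, "confidence": 0.97}
  ]
}
"""


def confident_text(doc: dict, threshold: float = 0.9) -> list[str]:
    """Keep only the text of spans whose confidence meets the threshold."""
    return [s["text"] for s in doc["spans"] if s["confidence"] >= threshold]


doc = json.loads(raw)
print(confident_text(doc))  # ['Quarterly Report', 'Revenue grew 12%.']
```

The same per-span fields would support other filters, such as selecting likely headings by font size or cropping highlights by bounding box.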
What You Can Extract
- Text — Plain text or structured JSON with per-span provenance
- Layout — Bounding boxes for blocks, lines, and spans
- Metadata — Title, author, creation date, page count, PDF version
- Structure — Headings, paragraphs, lists, tables (when present in the PDF)
- Annotations — Comments, highlights, form fields (Phase 7)
What pdftract Does Not Do
pdftract is deliberately scoped. The following features are not in scope for v1.0.0:
| Non-goal | Alternative |
|---|---|
| PDF authoring or writing | lopdf, pdfium-render, printpdf |
| Full PDF rendering / printing | PDFium, MuPDF, Poppler |
| Cryptographic signature validation | openssl smime, dedicated PKI libraries |
| Translation of extracted text | LibreTranslate, DeepL, Argos |
| Summarization | LLM tools via the MCP server integration |
| OCR engine training | Tesseract's tesstrain tooling |
| Filling out PDF forms | Form-filling tools with authoring support |
| Watermark removal | None; watermarks are detected and excluded from the extracted output, but are not removed from the PDF |
| Password cracking | pdfcrack, john |
For the full rationale and scope-lock doctrine, see the Non-Goals section in the project plan.
Supported PDF Features
pdftract supports PDF 1.4 through PDF 2.0, with varying levels of feature coverage:
- Text extraction — Full support for Type 1, TrueType, OpenType, and CID-keyed fonts
- Compression — All standard filters (FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, CCITTFaxDecode, DCTDecode)
- Encryption — RC4 40-bit, RC4 128-bit, AES-128, AES-256 (password required)
- Structure trees — PDF/UA logical structure reading
- Forms — AcroForm and XFA field extraction (read-only)
- Signatures — Signature metadata extraction (validation not performed)
- Attachments — File attachment extraction
- Articles — Thread extraction for logical reading flows
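To give a feel for what decoding a stream filter involves: FlateDecode is the zlib/deflate method, so a round trip through Python's standard `zlib` module shows the idea. The content stream below is made up, predictor parameters that real streams may carry are not handled, and pdftract's own decoder is not shown.

```python
import zlib

# A made-up PDF content stream: show the string "Hello" in font F1 at 12 pt.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# What a PDF writer stores when the stream's /Filter is /FlateDecode.
encoded = zlib.compress(content)

# What a reader does to get the operators back.
decoded = zlib.decompress(encoded)

print(decoded == content)       # True
print(decoded.decode("ascii"))  # BT /F1 12 Tf 72 720 Td (Hello) Tj ET
```

Text extraction only starts after this step: the decoded operators still have to be interpreted against the page's fonts and encodings.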
See the Advanced Topics section for deep dives into specific features.