jedarden a34f9c18d0 docs(pdftract-1g87): create mdBook scaffolding for user documentation

- book.toml with title, authors, build directory, edit-url-template
- src/SUMMARY.md with complete TOC for all planned sections
- src/introduction.md: what pdftract does and doesn't do (Non-Goals)
- src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim
- src/quickstart.md: five-minute walkthrough with executable commands
- 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ

mdbook build completes cleanly with zero warnings (linkcheck optional).

See notes/pdftract-1g87.md for verification details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 00:38:51 -04:00

4 KiB


Introduction

What pdftract Does

pdftract is a PDF text extraction library that gets the hard parts right. Unlike naive PDF parsers that dump text in the order it appears in the PDF file (which is rarely the correct reading order), pdftract understands document layout and recovers the logical structure that humans perceive when reading a page.

Core Features

Correct reading order — Layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents without relying on PDF operator order. pdftract groups text into semantic blocks (headings, paragraphs, lists, tables) and outputs them in the order a human would read.
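To illustrate why operator order is not enough, here is a minimal sketch that orders the blocks of a two-column page. The `Block` type, the midpoint column split, and the top-left coordinate origin are assumptions made for this example; it is a simplified stand-in, not pdftract's actual segmentation algorithm.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block; origin at top-left, y grows downward."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks, page_width):
    """Order blocks column by column, top to bottom within each column.

    Simplifying assumption: two columns split at the page midpoint.
    A block that straddles the midpoint (such as a title) is treated
    as full-width and closes the column band above it.
    """
    mid = page_width / 2
    ordered = []
    band = []  # column blocks seen since the last full-width block

    def flush():
        left = sorted((b for b in band if b.x1 <= mid), key=lambda b: b.y0)
        right = sorted((b for b in band if b.x1 > mid), key=lambda b: b.y0)
        ordered.extend(left + right)
        band.clear()

    for b in sorted(blocks, key=lambda b: b.y0):
        if b.x0 < mid < b.x1:  # spans both columns
            flush()
            ordered.append(b)
        else:
            band.append(b)
    flush()
    return ordered


# Blocks listed in an interleaved order, as a PDF content stream might emit them.
blocks = [
    Block("Title", 50, 20, 550, 50),
    Block("L1", 50, 60, 290, 200),
    Block("R1", 310, 60, 550, 150),
    Block("L2", 50, 210, 290, 400),
    Block("R2", 310, 160, 550, 400),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```

Sorting by vertical position alone would interleave the two columns; grouping by column first keeps each column's text contiguous.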

Font encoding recovery — When ToUnicode CMaps are absent, wrong, or incomplete (a common problem in PDFs generated by legacy tools), pdftract works through a layered recovery pipeline: glyph name lookup via the Adobe Glyph List, font fingerprinting against known metrics and embedded checksums, and glyph outline shape matching. This means you get readable Unicode text even from broken PDFs.
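The layered fallback idea can be sketched as follows. This is an illustration under stated assumptions, not pdftract's implementation: `decode_glyph`, its arguments, and the returned source labels are hypothetical, the glyph table is a five-entry excerpt of the Adobe Glyph List, and the later layers (font fingerprinting and outline matching) are left out. It also trusts any ToUnicode entry it finds, so it does not show how wrong entries would be detected.

```python
import re

# Tiny excerpt of the Adobe Glyph List (glyph name -> Unicode text).
AGL_EXCERPT = {
    "space": " ",
    "A": "A",
    "Aacute": "\u00c1",
    "fi": "\ufb01",
    "quotedblleft": "\u201c",
}

# Glyph names of the form "uniXXXX" encode the code point directly.
_UNI_NAME = re.compile(r"^uni([0-9A-F]{4})$")


def decode_glyph(code, to_unicode, glyph_names):
    """Return (text, source) for one character code.

    to_unicode:  dict of code -> text taken from a ToUnicode CMap (may be empty)
    glyph_names: dict of code -> glyph name from the font's encoding
    """
    # Layer 1: use the ToUnicode CMap when it has an entry.
    if code in to_unicode:
        return to_unicode[code], "tounicode"
    # Layer 2: map the glyph name through the glyph list.
    name = glyph_names.get(code)
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name], "glyph-name"
    match = _UNI_NAME.match(name or "")
    if match:
        return chr(int(match.group(1), 16)), "glyph-name"
    # Further layers omitted in this sketch; emit the replacement character.
    return "\ufffd", "unresolved"
```

Returning the source alongside the text lets a caller see which layer produced each character.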

Structure tree extraction — PDF/UA documents, and PDF/A documents at conformance level A (for example PDF/A-1a), encode their logical structure (headings, paragraphs, lists, tables, reading order) in a structure tree rooted at StructTreeRoot. pdftract reads this tree directly when it is present, so well-tagged PDFs yield near-perfect extraction at no extra cost.
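As a rough picture of what reading a structure tree means, the sketch below walks a small tagged tree depth-first and yields its content in logical order. The `StructElem` class is hypothetical and simplified: real structure elements reference marked content on a page instead of holding text directly. The role names `H1`, `P`, `L`, and `LI` are standard structure types from the PDF specification.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    """Hypothetical, simplified structure element."""
    role: str               # standard structure type, e.g. "H1", "P", "L", "LI"
    text: str = ""          # simplification: text stored on the element itself
    kids: list = field(default_factory=list)


def flatten(elem):
    """Yield (role, text) pairs for content-bearing elements in document order."""
    if elem.text:
        yield elem.role, elem.text
    for kid in elem.kids:
        yield from flatten(kid)


tree = StructElem("Document", kids=[
    StructElem("H1", "Intro"),
    StructElem("P", "Hello."),
    StructElem("L", kids=[StructElem("LI", "one"), StructElem("LI", "two")]),
])
items = list(flatten(tree))
```

Because the tree already carries roles and order, no layout inference is needed to tell a heading from a paragraph.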

Per-page hybrid routing — Each page is independently classified and routed to the appropriate pipeline: vector text extraction (for pages with embedded fonts), full OCR (for scanned pages), or assisted OCR where vector hints improve raster accuracy. This hybrid approach optimizes for both accuracy and speed.
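A per-page routing decision of this kind could look like the sketch below. The `PageStats` fields, the two thresholds, and the pipeline labels are assumptions chosen for illustration; the classifier pdftract actually uses is not described here.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements."""
    text_chars: int        # characters drawn by text operators
    image_coverage: float  # fraction of the page covered by raster images, 0.0 to 1.0


def route_page(stats, min_chars=50, scan_coverage=0.8):
    """Pick a pipeline for one page; thresholds are illustrative defaults."""
    has_text = stats.text_chars >= min_chars
    mostly_image = stats.image_coverage >= scan_coverage
    if has_text and not mostly_image:
        return "vector"        # usable text layer, little raster content
    if has_text and mostly_image:
        return "assisted_ocr"  # scan with a text layer that can guide OCR
    return "full_ocr"          # no usable text layer


routes = [
    route_page(PageStats(text_chars=1200, image_coverage=0.05)),
    route_page(PageStats(text_chars=900, image_coverage=0.95)),
    route_page(PageStats(text_chars=0, image_coverage=0.98)),
]
```

Deciding per page instead of per document means a single scanned appendix does not force OCR on the pages that already have a text layer.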

Structured output with provenance — The primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score alongside the extracted text, not a flat string dump. You get rich metadata that enables downstream processing: layout analysis, font-aware styling, highlight extraction, and confidence-based filtering.
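The following sketch shows how per-span metadata enables confidence-based filtering. The JSON layout and every field name in it (`pages`, `spans`, `text`, `bbox`, `font`, `size`, `confidence`) are assumptions made up for this example; consult the schema reference for the real output format.

```python
import json

# Hypothetical output shape; field names are illustrative, not the real schema.
SAMPLE = """
{"pages": [{"number": 1, "spans": [
  {"text": "Invoice", "bbox": [72.0, 700.0, 130.5, 714.0],
   "font": "Helvetica-Bold", "size": 14.0, "confidence": 0.99},
  {"text": "T0tal", "bbox": [72.0, 650.0, 101.2, 660.0],
   "font": "Unknown", "size": 10.0, "confidence": 0.41}
]}]}
"""


def confident_text(document, threshold=0.9):
    """Return the text of every span whose confidence meets the threshold."""
    return [
        span["text"]
        for page in document["pages"]
        for span in page["spans"]
        if span["confidence"] >= threshold
    ]


doc = json.loads(SAMPLE)
kept = confident_text(doc)
```

With a flat string dump, the low-confidence span could not be separated from the rest of the text.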

What You Can Extract

Text — Plain text or structured JSON with per-span provenance
Layout — Bounding boxes for blocks, lines, and spans
Metadata — Title, author, creation date, page count, PDF version
Structure — Headings, paragraphs, lists, tables (when present in the PDF)
Annotations — Comments, highlights, form fields (Phase 7)

What pdftract Does Not Do

pdftract is deliberately scoped. The following features are not in scope for v1.0.0:

| Non-goal | Alternative |
| --- | --- |
| PDF authoring or writing | `lopdf`, `pdfium-render`, `printpdf` |
| Full PDF rendering / printing | PDFium, MuPDF, Poppler |
| Cryptographic signature validation | `openssl smime`, dedicated PKI libraries |
| Translation of extracted text | LibreTranslate, DeepL, Argos |
| Summarization | LLM tools via the MCP server integration |
| OCR engine training | Tesseract's `tesstrain` tooling |
| Filling out PDF forms | Form-filling tools with authoring support |
| Watermark removal | Detected and excluded from output, not removed from PDF |
| Password cracking | `pdfcrack`, `john` |

For the full rationale and scope-lock doctrine, see the Non-Goals section in the project plan.

Supported PDF Features

pdftract supports PDF 1.4 through PDF 2.0, with varying levels of feature coverage:

Text extraction — Full support for Type 1, TrueType, OpenType, and CID-keyed fonts
Compression — Standard stream filters: FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, CCITTFaxDecode, DCTDecode
Encryption — RC4 40-bit, RC4 128-bit, AES-128, AES-256 (password required)
Structure trees — PDF/UA logical structure reading
Forms — AcroForm and XFA field extraction (read-only)
Signatures — Signature metadata extraction (validation not performed)
Attachments — File attachment extraction
Articles — Thread extraction for logical reading flows
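To make the compression entry concrete, here is a minimal sketch of how a chain of stream filters is undone, using only the Python standard library. It covers three of the listed filters under their full PDF specification names; `decode_stream` is a hypothetical helper for illustration and says nothing about how pdftract implements decoding.

```python
import base64
import zlib


def decode_stream(data, filters):
    """Apply PDF stream filters in the order they are listed."""
    for name in filters:
        if name == "FlateDecode":
            data = zlib.decompress(data)
        elif name == "ASCIIHexDecode":
            # ">" marks end of data; whitespace between digits is ignored.
            digits = b"".join(data.split(b">")[0].split())
            if len(digits) % 2:
                digits += b"0"  # an odd final digit is padded with 0
            data = bytes.fromhex(digits.decode("ascii"))
        elif name == "ASCII85Decode":
            body = data.strip()
            if body.endswith(b"~>"):  # end-of-data marker
                body = body[:-2]
            data = base64.a85decode(body)
        else:
            raise ValueError(f"filter not covered by this sketch: {name}")
    return data


payload = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = base64.a85encode(zlib.compress(payload)) + b"~>"
decoded = decode_stream(encoded, ["ASCII85Decode", "FlateDecode"])
```

Filters are listed in decoding order, so a stream that was compressed and then ASCII85-encoded is declared as ASCII85Decode followed by FlateDecode.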

See the Advanced Topics section for deep dives into specific features.
