pdftract/docs/user-docs/src/quickstart.md
jedarden d9d21df157 docs(pdftract-653ah): add runbook integration for pdftract doctor
2026-05-24 13:26:31 -04:00

Quickstart

This five-minute walkthrough covers the core pdftract workflow: extract text from a PDF, inspect the structured JSON output, and try profile-based extraction.

Five-Minute Walkthrough

Prerequisites

  • pdftract installed (see Installation)
  • A PDF file to extract (any PDF will do)

If you don't have a PDF handy, you can use the sample fixtures from the pdftract repository:

git clone https://github.com/jedarden/pdftract.git
cd pdftract

Verify Your Environment

Before extracting, verify your environment is properly configured:

pdftract doctor

Expected output:

Check                         Status  Detail
─────────────────────────────────────────────
pdftract binary               OK      0.1.0 (git: abc1234)
tesseract install             OK      v5.3.0
...

If any check shows FAIL, see the Operations Runbook for resolution steps.
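If you run the doctor check from a script, the table above can be scanned for failures. The following is a minimal sketch, assuming the columns are separated by two or more spaces as in the sample output. The helper names and the FAIL row in the sample text are hypothetical, and the real output may differ.

```python
import re

# Sample modeled on the expected output above. The FAIL row and its detail
# text are hypothetical, added only to exercise the parser.
SAMPLE = """\
Check                         Status  Detail
─────────────────────────────────────────────
pdftract binary               OK      0.1.0 (git: abc1234)
tesseract install             FAIL    not found on PATH
"""


def parse_doctor_table(text):
    """Return (check, status, detail) tuples from doctor-style table text."""
    rows = []
    for line in text.splitlines():
        # Assumption: columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=2)
        if len(parts) < 2 or parts[0] == "Check":
            continue  # skip the header, the rule line, and blank or "..." lines
        detail = parts[2] if len(parts) > 2 else ""
        rows.append((parts[0], parts[1], detail))
    return rows


def failed_checks(text):
    """Return the names of all checks whose status is FAIL."""
    return [check for check, status, _ in parse_doctor_table(text) if status == "FAIL"]
```

Calling `failed_checks(SAMPLE)` returns `['tesseract install']`. To use it on real output, capture the stdout of `pdftract doctor` and pass it in as a string.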

Extract Your First PDF

The simplest invocation writes the extraction result to stdout (JSON by default; pass --text for plain text):

pdftract extract path/to/document.pdf

To save the structured JSON output (the default format) to a file:

pdftract extract path/to/document.pdf --output result.json

Or view JSON directly in your terminal (pipe to jq for pretty-printing):

pdftract extract path/to/document.pdf | jq .

Inspect the Output

The JSON output contains:

  • pages — Array of page objects, each with blocks and spans
  • blocks — Semantic elements (headings, paragraphs, lists) with reading order
  • spans — Text fragments with bounding boxes, font metadata, and confidence scores
  • metadata — Document title, author, page count, PDF version

Example:

{
  "pages": [
    {
      "page": 1,
      "width": 612,
      "height": 792,
      "blocks": [
        {
          "kind": "heading",
          "text": "Introduction",
          "bbox": [72, 680, 400, 700],
          "level": 1
        },
        {
          "kind": "paragraph",
          "text": "This is the first paragraph...",
          "bbox": [72, 640, 540, 670]
        }
      ],
      "spans": [
        {
          "text": "Introduction",
          "bbox": [72, 680, 400, 700],
          "font": "Times-Bold",
          "size": 14.0,
          "confidence": 0.99
        }
      ]
    }
  ],
  "metadata": {
    "title": "Sample Document",
    "author": "John Doe",
    "page_count": 1,
    "pdf_version": "1.4"
  }
}
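To show how this structure might be consumed, here is a minimal sketch that builds a heading outline and lists low-confidence spans. It assumes the field names shown in the example above. The function names and the embedded, trimmed sample are illustrative and not part of pdftract.

```python
import json

# Trimmed copy of the example output above, embedded so the sketch is self-contained.
SAMPLE_JSON = """
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        {"kind": "heading", "text": "Introduction", "bbox": [72, 680, 400, 700], "level": 1},
        {"kind": "paragraph", "text": "This is the first paragraph...", "bbox": [72, 640, 540, 670]}
      ],
      "spans": [
        {"text": "Introduction", "bbox": [72, 680, 400, 700], "confidence": 0.99}
      ]
    }
  ],
  "metadata": {"title": "Sample Document", "page_count": 1}
}
"""


def outline(doc):
    """Return (page, level, text) for every heading block, in document order."""
    return [
        (page["page"], block.get("level"), block["text"])
        for page in doc["pages"]
        for block in page.get("blocks", [])
        if block["kind"] == "heading"
    ]


def low_confidence_spans(doc, threshold=0.9):
    """Return (page, text, confidence) for spans scoring below the threshold."""
    return [
        (page["page"], span["text"], span["confidence"])
        for page in doc["pages"]
        for span in page.get("spans", [])
        if span.get("confidence", 1.0) < threshold
    ]


doc = json.loads(SAMPLE_JSON)
```

With the sample loaded, `outline(doc)` returns `[(1, 1, 'Introduction')]`. For real output, replace `SAMPLE_JSON` with the contents of the file written by `--output`.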

Try Auto-Profile Mode

pdftract includes built-in profiles for common document types (invoices, receipts, contracts, etc.). Use --auto to automatically detect the profile:

pdftract extract invoice.pdf --auto

The auto-detected profile is logged to stderr:

[INFO] Detected profile: invoice

Profiles optimize extraction for specific document layouts:

  • invoice — Extract line items, totals, vendor info
  • receipt — Extract merchant, date, line items, tax, total
  • contract — Extract parties, effective date, clauses
  • bank_statement — Extract account info, statement period, transactions

See Profiles for the full list.
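If a script needs to know which profile was chosen, the stderr log line can be matched directly. This is a minimal sketch, assuming the log line has exactly the format shown above. The function name is hypothetical, and the log format may change between versions.

```python
import re

# Assumption: the line looks exactly like "[INFO] Detected profile: invoice".
PROFILE_LINE = re.compile(r"^\[INFO\] Detected profile: (\S+)\s*$")


def detected_profile(stderr_text):
    """Return the profile name from captured stderr, or None if no line matches."""
    for line in stderr_text.splitlines():
        match = PROFILE_LINE.match(line)
        if match:
            return match.group(1)
    return None
```

For example, `detected_profile("[INFO] Detected profile: invoice\n")` returns `'invoice'`.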

Batch Processing

To extract multiple PDFs in a folder:

pdftract extract *.pdf --output-dir results/

Each PDF produces a corresponding JSON file in results/:

results/
  invoice1.pdf.json
  invoice2.pdf.json
  receipt.pdf.json
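After a batch run you may want a quick summary of what was produced. The following is a minimal sketch, assuming the `<name>.pdf.json` naming shown above and a top-level `pages` array in each file. The function name is hypothetical.

```python
import json
from pathlib import Path


def summarize_results(results_dir):
    """Map each '*.pdf.json' file name in results_dir to its number of pages."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.pdf.json")):
        with path.open(encoding="utf-8") as fh:
            doc = json.load(fh)
        # Count entries in the top-level "pages" array; missing key counts as 0.
        summary[path.name] = len(doc.get("pages", []))
    return summary
```

Calling `summarize_results("results/")` returns a dictionary keyed by output file name, and other files in the directory are ignored.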

To search for text across every PDF in a folder, including subfolders, use the grep command:

pdftract grep "search term" /path/to/folder

This outputs matching filenames and page numbers:

invoice.pdf:3: "search term" found on page 3
receipt.pdf:1: "search term" found on page 1
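The match lines can be grouped by file for further processing. This is a minimal sketch, assuming each line starts with `filename:page:` as in the output above. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: every match line begins with "<filename>:<page>: ".
MATCH_LINE = re.compile(r"^(?P<file>.+?):(?P<page>\d+): ")


def pages_by_file(grep_output):
    """Group matching page numbers by file name, preserving output order."""
    hits = defaultdict(list)
    for line in grep_output.splitlines():
        match = MATCH_LINE.match(line)
        if match:
            hits[match.group("file")].append(int(match.group("page")))
    return dict(hits)
```

Given the two sample lines above, the result is `{'invoice.pdf': [3], 'receipt.pdf': [1]}`.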

Common Options

| Option | Description |
|--------|-------------|
| --output FILE | Write output to file instead of stdout |
| --text | Output plain text instead of JSON |
| --output-dir DIR | Directory for batch output (with * glob) |
| --auto | Auto-detect and apply document profile |
| --profile NAME | Use specific profile (skip auto-detection) |
| --password PASS | Password for encrypted PDFs |
| --pages N-M | Extract specific page range |
| --ocr | Force OCR mode for all pages |

See CLI Reference for complete command documentation.
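When calling pdftract from a script, these options can be assembled into an argument list. The following is a minimal sketch that uses only the flags listed above. The function and its parameters are hypothetical, and how pdftract treats combinations of flags is not checked here.

```python
def build_extract_command(pdf_path, *, output=None, text=False, auto=False,
                          profile=None, password=None, pages=None, ocr=False):
    """Build an argv list for `pdftract extract` from the documented options."""
    cmd = ["pdftract", "extract", str(pdf_path)]
    if output is not None:
        cmd += ["--output", str(output)]
    if text:
        cmd.append("--text")
    if auto:
        cmd.append("--auto")
    if profile is not None:
        cmd += ["--profile", profile]
    if password is not None:
        cmd += ["--password", password]
    if pages is not None:
        first, last = pages  # rendered in the N-M form shown in the table
        cmd += ["--pages", f"{first}-{last}"]
    if ocr:
        cmd.append("--ocr")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single shell string avoids quoting problems when a password or path contains spaces or special characters.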

What's Next?

Troubleshooting

Extraction fails with "unsupported encryption"

The PDF is encrypted with a password. Use --password:

pdftract extract encrypted.pdf --password yourpassword

Output has wrong reading order

Some PDFs have malformed internal structure. Try --auto to enable profile-based layout recovery, or use --ocr to force OCR-based extraction.

Poor accuracy on scanned documents

Make sure OCR support is enabled. The Docker :ocr and :full images include Tesseract. If you are building from source, enable the ocr feature:

cargo install pdftract --features ocr

For more help, see Troubleshooting.