Quickstart
This five-minute walkthrough covers the core pdftract workflow: extract text from a PDF, inspect the structured JSON output, and try profile-based extraction.
Five-Minute Walkthrough
Prerequisites
- pdftract installed (see Installation)
- A PDF file to extract (any PDF will do)
If you don't have a PDF handy, you can use the sample fixtures from the pdftract repository:
git clone https://github.com/jedarden/pdftract.git
cd pdftract
Verify Your Environment
Before extracting, verify your environment is properly configured:
pdftract doctor
Expected output:
Check Status Detail
─────────────────────────────────────────────
pdftract binary OK 0.1.0 (git: abc1234)
tesseract install OK v5.3.0
...
If any check shows FAIL, see the Operations Runbook for resolution steps.
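To gate a script on the doctor result, one option is to scan its output for FAIL rows. The sketch below assumes the row layout shown above (check name, then a status word, then detail) and that the status words are OK and FAIL. The function name `failed_checks` is hypothetical, and the real output may use different spacing or additional statuses.

```python
import re

# Assumption: each row reads "<check name> <STATUS> <detail>", where STATUS is OK or FAIL.
ROW = re.compile(r"^(?P<check>.+?)\s+(?P<status>OK|FAIL)\b\s*(?P<detail>.*)$")


def failed_checks(doctor_output):
    """Return the names of checks whose status column reads FAIL."""
    failed = []
    for line in doctor_output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("status") == "FAIL":
            failed.append(match.group("check"))
    return failed


# Made-up sample for illustration; the FAIL detail text is invented.
sample = """Check Status Detail
pdftract binary OK 0.1.0 (git: abc1234)
tesseract install FAIL not found on PATH
"""
print(failed_checks(sample))  # ['tesseract install']
```

The header and separator rows are skipped because they contain neither status word.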
Extract Your First PDF
The simplest extraction writes structured JSON (the default format) to stdout:
pdftract extract path/to/document.pdf
To write the JSON to a file instead of stdout:
pdftract extract path/to/document.pdf --output result.json
Or view JSON directly in your terminal (pipe to jq for pretty-printing):
pdftract extract path/to/document.pdf | jq .
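The same step can be driven from a script. This sketch shells out to the `pdftract extract` command shown above and parses stdout with the standard `json` module. It assumes JSON is written to stdout by default, as this page describes. The helper name `extract_json` and its `runner` parameter are hypothetical, and this is not the Python SDK.

```python
import json
import subprocess


def extract_json(pdf_path, runner=subprocess.run):
    """Run `pdftract extract <pdf_path>` and return the parsed JSON document."""
    completed = runner(
        ["pdftract", "extract", pdf_path],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Calling `extract_json("path/to/document.pdf")` returns a dictionary. The `runner` parameter exists only so the function can be exercised without the binary installed.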
Inspect the Output
The JSON output contains:
- pages: Array of page objects, each with blocks and spans
- blocks: Semantic elements (headings, paragraphs, lists) with reading order
- spans: Text fragments with bounding boxes, font metadata, and confidence scores
- metadata: Document title, author, page count, PDF version
Example:
{
  "pages": [
    {
      "page": 1,
      "width": 612,
      "height": 792,
      "blocks": [
        {
          "kind": "heading",
          "text": "Introduction",
          "bbox": [72, 680, 400, 700],
          "level": 1
        },
        {
          "kind": "paragraph",
          "text": "This is the first paragraph...",
          "bbox": [72, 640, 540, 670]
        }
      ],
      "spans": [
        {
          "text": "Introduction",
          "bbox": [72, 680, 400, 700],
          "font": "Times-Bold",
          "size": 14.0,
          "confidence": 0.99
        }
      ]
    }
  ],
  "metadata": {
    "title": "Sample Document",
    "author": "John Doe",
    "page_count": 1,
    "pdf_version": "1.4"
  }
}
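Once the JSON is loaded, the structure above can be walked with ordinary dictionary access. The sketch below builds a heading outline from the blocks array. It uses only field names that appear in the example (`kind`, `text`, `level`, `page`), and the function name `outline` is hypothetical.

```python
import json


def outline(document):
    """Collect (page, level, text) for every heading block, in document order."""
    headings = []
    for page in document["pages"]:
        for block in page["blocks"]:
            if block["kind"] == "heading":
                # "level" appears only on the heading block in the example;
                # default to 1 if it is absent.
                headings.append((page["page"], block.get("level", 1), block["text"]))
    return headings


# A trimmed copy of the example output above.
sample = json.loads("""
{
  "pages": [
    {
      "page": 1,
      "blocks": [
        {"kind": "heading", "text": "Introduction", "level": 1},
        {"kind": "paragraph", "text": "This is the first paragraph..."}
      ]
    }
  ]
}
""")
print(outline(sample))  # [(1, 1, 'Introduction')]
```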
Try Auto-Profile Mode
pdftract includes built-in profiles for common document types (invoices, receipts, contracts, etc.). Use --auto to automatically detect the profile:
pdftract extract invoice.pdf --auto
The auto-detected profile is logged to stderr:
[INFO] Detected profile: invoice
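A script that captures stderr can recover the detected profile from that log line. This is a minimal sketch that assumes the line has exactly the form shown above; the function name `detected_profile` is hypothetical, and the log format may change between versions.

```python
import re

# Assumption: the log line looks exactly like "[INFO] Detected profile: <name>".
PROFILE_LINE = re.compile(r"^\[INFO\] Detected profile: (?P<name>\S+)\s*$")


def detected_profile(stderr_text):
    """Return the first detected profile name in the captured stderr, or None."""
    for line in stderr_text.splitlines():
        match = PROFILE_LINE.match(line)
        if match:
            return match.group("name")
    return None


print(detected_profile("[INFO] Detected profile: invoice\n"))  # invoice
```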
Profiles optimize extraction for specific document layouts:
- invoice — Extract line items, totals, vendor info
- receipt — Extract merchant, date, line items, tax, total
- contract — Extract parties, effective date, clauses
- bank_statement — Extract account info, statement period, transactions
See Profiles for the full list.
Batch Processing
To extract multiple PDFs in a folder:
pdftract extract *.pdf --output-dir results/
Each PDF produces a corresponding JSON file in results/:
results/
  invoice1.pdf.json
  invoice2.pdf.json
  receipt.pdf.json
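After a batch run it can be useful to confirm that every input produced an output. The sketch below assumes the naming shown in the listing above (the input file name with `.json` appended); the helper names `expected_output` and `missing_outputs` are hypothetical.

```python
from pathlib import Path


def expected_output(pdf, output_dir):
    """Map invoice1.pdf to <output_dir>/invoice1.pdf.json, following the listing above."""
    return Path(output_dir) / (Path(pdf).name + ".json")


def missing_outputs(pdfs, output_dir):
    """Return the input PDFs that have no corresponding JSON file in output_dir."""
    return [pdf for pdf in pdfs if not expected_output(pdf, output_dir).exists()]


print(expected_output("invoice1.pdf", "results").as_posix())  # results/invoice1.pdf.json
```

For example, `missing_outputs(sorted(Path(".").glob("*.pdf")), "results")` lists the inputs that still need attention.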
To search for text across every PDF in a folder, including its subfolders, use the grep command:
pdftract grep "search term" /path/to/folder
This outputs matching filenames and page numbers:
invoice.pdf:3: "search term" found on page 3
receipt.pdf:1: "search term" found on page 1
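These lines can be grouped by file for further processing. The sketch below assumes each match line starts with `<file>:<page>: ` as in the sample output above; the function name `pages_by_file` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: each match line begins with "<file>:<page>: ".
MATCH_LINE = re.compile(r"^(?P<file>.+?):(?P<page>\d+): ")


def pages_by_file(grep_output):
    """Group matching page numbers by file name, preserving output order."""
    hits = defaultdict(list)
    for line in grep_output.splitlines():
        match = MATCH_LINE.match(line)
        if match:
            hits[match.group("file")].append(int(match.group("page")))
    return dict(hits)


sample = (
    'invoice.pdf:3: "search term" found on page 3\n'
    'receipt.pdf:1: "search term" found on page 1\n'
)
print(pages_by_file(sample))  # {'invoice.pdf': [3], 'receipt.pdf': [1]}
```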
Common Options
| Option | Description |
|---|---|
| --output FILE | Write output to file instead of stdout |
| --text | Output plain text instead of JSON |
| --output-dir DIR | Directory for batch output (with * glob) |
| --auto | Auto-detect and apply document profile |
| --profile NAME | Use specific profile (skip auto-detection) |
| --password PASS | Password for encrypted PDFs |
| --pages N-M | Extract specific page range |
| --ocr | Force OCR mode for all pages |
See CLI Reference for complete command documentation.
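When calling pdftract from a script, the options in the table can be assembled into an argument list. This sketch uses only the flags listed above; the function name `build_extract_args` is hypothetical, and how the flags combine (for example, that an explicit profile makes --auto unnecessary) is an assumption drawn from the table descriptions.

```python
def build_extract_args(pdf_path, *, output=None, text=False, auto=False,
                       profile=None, password=None, pages=None, ocr=False):
    """Build an argument list for `pdftract extract` from keyword options."""
    args = ["pdftract", "extract", pdf_path]
    if output:
        args += ["--output", output]
    if text:
        args.append("--text")
    if profile:
        # An explicit profile skips auto-detection, so --auto is not added.
        args += ["--profile", profile]
    elif auto:
        args.append("--auto")
    if password:
        args += ["--password", password]
    if pages:
        first, last = pages
        args += ["--pages", f"{first}-{last}"]  # rendered in the N-M form from the table
    if ocr:
        args.append("--ocr")
    return args


print(build_extract_args("scan.pdf", pages=(2, 5), ocr=True))
# ['pdftract', 'extract', 'scan.pdf', '--pages', '2-5', '--ocr']
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues with passwords and file names.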
What's Next?
- Explore the CLI Reference for advanced options
- Read JSON Schema Reference for output format details
- Check Profiles for document-type-specific extraction
- Try the Python SDK for programmatic access
Troubleshooting
Extraction fails with "unsupported encryption"
The PDF is encrypted with a password. Use --password:
pdftract extract encrypted.pdf --password yourpassword
Output has wrong reading order
Some PDFs have malformed internal structure. Try --auto to enable profile-based layout recovery, or use --ocr to force OCR-based extraction.
Poor accuracy on scanned documents
Ensure the OCR features are enabled. The Docker :ocr and :full images include Tesseract. If building from source, enable the ocr feature:
cargo install pdftract --features ocr
For more help, see Troubleshooting.