# Quickstart

This five-minute walkthrough covers the core pdftract workflow: extract text from a PDF, inspect the structured JSON output, and try profile-based extraction.

## Five-Minute Walkthrough

### Prerequisites

- pdftract installed (see [Installation](./installation.md))
- A PDF file to extract (any PDF will do)

If you don't have a PDF handy, you can use the sample fixtures from the pdftract repository:

```bash
git clone https://github.com/jedarden/pdftract.git
cd pdftract
```

### Verify Your Environment

Before extracting, verify that your environment is properly configured:

```bash
pdftract doctor
```

Expected output:

```
Check               Status  Detail
─────────────────────────────────────────────
pdftract binary     OK      0.1.0 (git: abc1234)
tesseract install   OK      v5.3.0
...
```

If any check shows FAIL, see the [Operations Runbook](../../operations/manual-platform-smoke.md#troubleshooting) for resolution steps.

### Extract Your First PDF

By default, `extract` prints structured JSON to stdout:

```bash
pdftract extract path/to/document.pdf
```

To write the JSON to a file instead, use `--output`:

```bash
pdftract extract path/to/document.pdf --output result.json
```

To pretty-print the JSON in your terminal, pipe it to `jq`:

```bash
pdftract extract path/to/document.pdf | jq .
```

For plain text instead of JSON, add `--text` (see [Common Options](#common-options)).

### Inspect the Output

The JSON output contains:

- **`pages`**: Array of page objects, each with `blocks` and `spans`
- **`blocks`**: Semantic elements (headings, paragraphs, lists) in reading order
- **`spans`**: Text fragments with bounding boxes, font metadata, and confidence scores
- **`metadata`**: Document title, author, page count, PDF version

Example:

```json
{
  "pages": [
    {
      "page": 1,
      "width": 612,
      "height": 792,
      "blocks": [
        {
          "kind": "heading",
          "text": "Introduction",
          "bbox": [72, 680, 400, 700],
          "level": 1
        },
        {
          "kind": "paragraph",
          "text": "This is the first paragraph...",
          "bbox": [72, 640, 540, 670]
        }
      ],
      "spans": [
        {
          "text": "Introduction",
          "bbox": [72, 680, 400, 700],
          "font": "Times-Bold",
          "size": 14.0,
          "confidence": 0.99
        }
      ]
    }
  ],
  "metadata": {
    "title": "Sample Document",
    "author": "John Doe",
    "page_count": 1,
    "pdf_version": "1.4"
  }
}
```

### Try Auto-Profile Mode

pdftract includes built-in profiles for common document types (invoices, receipts, contracts, etc.). Use `--auto` to detect the profile automatically:

```bash
pdftract extract invoice.pdf --auto
```

The detected profile is logged to stderr:

```
[INFO] Detected profile: invoice
```

Profiles optimize extraction for specific document layouts:

- **invoice**: Extract line items, totals, vendor info
- **receipt**: Extract merchant, date, line items, tax, total
- **contract**: Extract parties, effective date, clauses
- **bank_statement**: Extract account info, statement period, transactions

See [Profiles](./profiles/available.md) for the full list.

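In a script, the detected profile can be recovered from that stderr log line. The following is a minimal Python sketch, assuming the line keeps the `[INFO] Detected profile: <name>` format shown above; the helper names `parse_detected_profile` and `extract_with_auto` are hypothetical and not part of pdftract.

```python
import re
import subprocess
from typing import Optional, Tuple

# Assumption: the log line format matches the stderr example in this guide.
_PROFILE_RE = re.compile(r"^\[INFO\] Detected profile: (\S+)\s*$", re.MULTILINE)


def parse_detected_profile(stderr_text: str) -> Optional[str]:
    """Return the profile name found in the stderr log, or None if absent."""
    match = _PROFILE_RE.search(stderr_text)
    return match.group(1) if match else None


def extract_with_auto(pdf_path: str) -> Tuple[str, Optional[str]]:
    """Run `pdftract extract PDF --auto`; return (stdout, detected profile)."""
    result = subprocess.run(
        ["pdftract", "extract", pdf_path, "--auto"],
        capture_output=True,  # collect stdout and stderr separately
        text=True,            # decode both streams to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout, parse_detected_profile(result.stderr)
```

Keeping the parsing separate from the subprocess call means the log handling can be tested without pdftract installed.
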
### Batch Processing

To extract multiple PDFs in a folder:

```bash
pdftract extract *.pdf --output-dir results/
```

Each PDF produces a corresponding JSON file in `results/`:

```
results/
  invoice1.pdf.json
  invoice2.pdf.json
  receipt.pdf.json
```

To search recursively across all PDFs in a folder, use the `grep` command:

```bash
pdftract grep "search term" /path/to/folder
```

This outputs matching filenames and page numbers:

```
invoice.pdf:3: "search term" found on page 3
receipt.pdf:1: "search term" found on page 1
```

## Common Options

| Option | Description |
|---|---|
| `--output FILE` | Write output to file instead of stdout |
| `--text` | Output plain text instead of JSON |
| `--output-dir DIR` | Directory for batch output (used with multiple inputs, e.g. a `*.pdf` glob) |
| `--auto` | Auto-detect and apply document profile |
| `--profile NAME` | Use specific profile (skip auto-detection) |
| `--password PASS` | Password for encrypted PDFs |
| `--pages N-M` | Extract specific page range |
| `--ocr` | Force OCR mode for all pages |

See [CLI Reference](./cli/) for complete command documentation.

## What's Next?

- Explore the [CLI Reference](./cli/) for advanced options
- Read [JSON Schema Reference](./schema/) for output format details
- Check [Profiles](./profiles/) for document-type-specific extraction
- Try the [Python SDK](./sdk/python.md) for programmatic access

## Troubleshooting

**Extraction fails with "unsupported encryption"**

The PDF is password-protected. Supply the password with `--password`:

```bash
pdftract extract encrypted.pdf --password yourpassword
```

**Output has wrong reading order**

Some PDFs have malformed internal structure. Try `--auto` to enable profile-based layout recovery, or use `--ocr` to force OCR-based extraction.

**Poor accuracy on scanned documents**

Ensure OCR support is enabled. The Docker `:ocr` and `:full` images include Tesseract. If building from source, enable the `ocr` feature:

```bash
cargo install pdftract --features ocr
```

For more help, see [Troubleshooting](./troubleshooting/).

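When tracking down accuracy problems, the per-span `confidence` scores in the JSON output can indicate which pages to look at first. Below is a minimal Python sketch, assuming the output shape shown under Inspect the Output; the function names and the `0.8` threshold are illustrative assumptions, not pdftract defaults.

```python
import json
from collections import defaultdict


def load_result(path: str) -> dict:
    """Load a JSON file written by `pdftract extract ... --output FILE`."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def low_confidence_pages(doc: dict, threshold: float = 0.8) -> dict:
    """Map page number -> texts of spans whose confidence is below threshold.

    Assumes the layout from the example output: doc["pages"][i]["spans"][j]
    with "text" and "confidence" keys. The 0.8 default is an arbitrary choice.
    """
    flagged = defaultdict(list)
    for page in doc.get("pages", []):
        for span in page.get("spans", []):
            # Spans without a confidence value are treated as fully confident.
            if span.get("confidence", 1.0) < threshold:
                flagged[page.get("page")].append(span.get("text", ""))
    return dict(flagged)
```

Calling `low_confidence_pages(load_result("result.json"))` returns only the pages that contain flagged spans; those pages may be reasonable candidates for a second pass with `--ocr` and `--pages`.
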