> This page is auto-generated from the clap command tree. > Run `cargo run --manifest-path=xtask/Cargo.toml --bin gen_cli_reference` to regenerate. # CLI Reference This page provides comprehensive documentation for all pdftract CLI commands and flags. ## Usage ```bash pdftract [OPTIONS] ``` ## Global Options These options are available across all subcommands: - `-h, --help` - Print help information - `-V, --version` - Print version information ## Commands ### `pdftract` pdftract CLI - PDF extraction and conformance testing pdftract is a command-line tool for extracting text and structure from PDF files. It supports JSON, Markdown, plain text, and NDJSON output formats, with advanced features like OCR, document classification, and conformance testing. **Usage:** ```bash pdftract pdftract ``` **Options:** - `-h, --help` - Print help information - `-V, --version` - Print version information #### `extract` Extract text and structure from a PDF file Extract content from PDF files in multiple formats. Supports local files, remote URLs, and stdin input. **Usage:** ```bash pdftract extract ``` **Arguments:** - `` - Path to the PDF file (use '-' for stdin) (required) **Options:** - `--password-stdin` - Read password from stdin (one line, terminated by newline) - `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) - `--header` - Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE) - `--pages` - Page range to extract (1-based, comma-separated: 1-5,7,12-) - `--json` - Output JSON to PATH (use '-' for stdout) - `--md` - Output Markdown to PATH (use '-' for stdout) - `--text` - Output plain text to PATH (use '-' for stdout) - `--ndjson` - Output NDJSON to stdout (mutually exclusive with other formats) - `--format` - Output formats (comma-separated: json,markdown,text,ndjson) - `-o, --output` - Base path for auto-named outputs (used with --format) - `--receipts` - Receipt mode: off (default), lite, or svg (default: `off`) - `--ocr` - Enable OCR for scanned pages (requires 'ocr' feature) - `--ocr-language` - OCR language codes (comma-separated, e.g., 'eng,fra,deu') - `--cache-dir` - Enable cache at this directory (creates if absent) - `--cache-size` - Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) (default: `1 GiB`) - `--no-cache` - Disable cache for this extraction (even if --cache-dir is set) - `--md-anchors` - Emit HTML comment anchors before each block in Markdown output - `--auto` - Auto-detect document type and apply appropriate profile - `--profile` - Force-apply a specific profile (by name or YAML file path) - `--include-headers` - Include header blocks in output - `--include-footers` - Include footer blocks in output - `--include-headers-footers` - Include both header and footer blocks in output - `--include-invisible-text` - Include invisible text spans in output (rendering_mode == 3) - `--include-hidden-layers` - Include hidden-layer text spans in output (OCG-controlled) - `--include-watermarks` - Include watermark blocks in output (no-op until Phase 7) #### `classify` Classify document type Runs metadata + signal extraction to classify document type. Not full text extraction - suitable for quick categorization. **Usage:** ```bash pdftract classify ``` **Arguments:** - `` - Path to the PDF file (required) **Options:** - `--password-stdin` - Read password from stdin (one line, terminated by newline) - `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) - `--profiles` - Directory containing custom profile YAML files - `--pretty` - Pretty-print JSON output - `--top-k` - Number of top reasons to include (default: all) (default: `0`) - `--exit-on-unknown` - Exit with code 1 if document type is unknown #### `grep` Search for text patterns in PDF files Search for text patterns with bounding-box results. Requires the 'grep' feature flag. **Usage:** ```bash pdftract grep ``` **Arguments:** - `` - Regular expression pattern to search for (required) - `` - PDF files or directories to search (required) **Options:** - `-C, --context` - Number of context lines to show (default: `0`) - `-i, --ignore-case` - Case-insensitive search - `--json` - Output results as JSON #### `inspect` Inspect a PDF file in a local web browser Launch a local web server with debugging overlays for PDF inspection. Provides visual feedback on extraction accuracy and layout analysis. Requires the 'inspect' feature flag. **Usage:** ```bash pdftract inspect ``` **Arguments:** - `` - Path to the PDF file (required) **Options:** - `-b, --bind` - Bind address for the inspector server (use 0.0.0.0:0 for accessibility from other devices) (default: `127.0.0.1:0`) - `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) - `--ocr` - Enable OCR for scanned pages (requires 'ocr' feature) - `--no-browser` - Don't automatically open browser #### `serve` Start the HTTP server for extraction Start an HTTP server for PDF extraction via REST API. **Security Model:** pdftract serve has no built-in authentication. Deploy behind a reverse proxy (nginx, Traefik, Caddy) for production use. **Endpoints:** - POST /extract - Extract PDF and return JSON with metadata - POST /extract/text - Extract PDF and return plain text - POST /extract/stream - Extract PDF and return streaming NDJSON - GET /health - Health check Requires the 'serve' feature flag. **Usage:** ```bash pdftract serve ``` **Options:** - `-b, --bind` - Bind address (e.g., "127.0.0.1:8080", "[::1]:9000", "0.0.0.0:3000") (default: `127.0.0.1:8080`) - `--cache-dir` - Enable cache at this directory - `--cache-size` - Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) (default: `1 GiB`) - `--no-cache` - Disable cache - `--max-upload-mb` - Maximum request body size in MB (default: 256, max: 4096) (default: `256`) - `--max-decompress-gb` - Maximum decompression size in GB (default: 1) (default: `1`) - `--audit-log` - Write per-request audit log to FILE (NDJSON; use "-" for stdout) - `--trust-forwarded-for` - Trust X-Forwarded-For header for client IP detection (DANGER: enables IP spoofing if not behind a trusted proxy) - `--profile-dir` - Directory containing custom profile YAML files (repeatable) - `--profile-hot-reload` - Enable hot-reload for profiles (re-read directory on every request) #### `mcp` Start the MCP (Model Context Protocol) server Start an MCP server for AI assistant integration. Per ADR-006: stdio and HTTP transports are mutually exclusive. Exactly one transport must be selected per invocation. Requires the 'mcp' feature flag. **Usage:** ```bash pdftract mcp ``` **Options:** - `--stdio` - Use stdio transport (for Claude Desktop, Claude Code, Continue, Cursor) - `-b, --bind` - Bind address for the MCP server (enables HTTP+SSE transport) - `--auth-token-file` - Path to a file containing the bearer token (RECOMMENDED) - `--auth-token` - Bearer token for authentication (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_TOKEN=1) - `--max-upload-mb` - Maximum request body size in MB (default: 256) (default: `256`) - `--root` - Root directory for local filesystem access (enforces path-traversal protection) - `--audit-log` - Write per-request audit log to FILE (NDJSON; use "-" for stdout) #### `cache` Manage the extraction cache Manage the content-addressed extraction cache. Cache entries are stored by PDF hash and version constraint. Requires the 'cache' feature flag. **Usage:** ```bash pdftract cache ``` #### `stats` Show cache statistics **Usage:** ```bash pdftract stats ``` **Arguments:** - `` - Path to the cache directory (required) **Options:** - `--json` - Output in JSON format #### `clear` Clear all cache entries Clear all cache entries (preserves index.json and sentinel) **Usage:** ```bash pdftract clear ``` **Arguments:** - `` - Path to the cache directory (required) **Options:** - `-y, --yes` - Skip confirmation prompt #### `purge` Purge old cache entries **Usage:** ```bash pdftract purge ``` **Arguments:** - `` - Path to the cache directory (required) **Options:** - `--older-than` - Delete entries older than this duration (e.g., "30d", "7d", "1h") - `--version` - Delete entries matching this version constraint (e.g., "<1.0.0") #### `profiles` Manage document type profiles Manage document type profiles for classification and extraction tuning. Requires the 'profiles' feature flag. **Usage:** ```bash pdftract profiles ``` #### `list` List all available profiles **Usage:** ```bash pdftract list ``` #### `show` Show a profile's YAML content **Usage:** ```bash pdftract show ``` **Arguments:** - `` - Profile name or path to YAML file (required) #### `export` Export a built-in profile to stdout **Usage:** ```bash pdftract export ``` **Arguments:** - `` - Name of the built-in profile to export (required) #### `install` Install a profile to the user config directory **Usage:** ```bash pdftract install ``` **Arguments:** - `` - Path to the profile YAML file to install (required) #### `validate` Validate a profile file **Usage:** ```bash pdftract validate ``` **Arguments:** - `` - Path to the profile YAML file to validate (required) #### `doctor` Check environment health and dependencies Run environment health checks for pdftract dependencies and configuration. Exit code policy: - Exits 0 if no checks FAIL (WARN does not affect exit code) - Exits 1 if any check FAILs - Exits 2 on argument parse errors **Usage:** ```bash pdftract doctor ``` **Options:** - `--features` - Print compiled features and exit - `--json` - Output results as JSON - `--no-color` - Disable colored output - `--exit-on-fail` - Explicit form of the default policy (exit 1 if any check FAILs) - `--profile-dir` - Verify the profile search path includes DIR - `--cache-dir` - Verify DIR is writable and has sufficient space - `--lang` - Requested OCR languages (default: eng) #### `hash` Compute the PDF structural fingerprint Compute a structural hash/fingerprint of a PDF file. This hash is based on the PDF's structure (xref, trailers, object locations) rather than content, making it useful for identifying identical documents with different metadata. **Usage:** ```bash pdftract hash ``` **Arguments:** - `` - Path to the PDF file or URL (required) **Options:** - `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) - `--header` - Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE) #### `verify-receipt` Verify a receipt against a PDF file Verify a visual citation receipt against the original PDF. Checks that quoted text appears at the expected locations. Requires the 'receipts' feature flag. **Usage:** ```bash pdftract verify-receipt ``` **Arguments:** - `` - Path to the receipt JSON file (required) **Options:** - `--pdf` - Path to the original PDF file - `--tolerance` - Tolerance for bounding box matching in pixels (default: `10`) - `--json` - Output results as JSON #### `conformance` Run SDK conformance test suite **Usage:** ```bash pdftract conformance ``` **Options:** - `-s, --suite` - Path to the conformance suite JSON (default: `tests/sdk-conformance/cases.json`) - `-k, --sdk` - SDK name (default: `pdftract`) - `-v, --version` - SDK version (default: `0.1.0`) - `-o, --output` - Output report path (default: `conformance-report.json`) #### `compare` Compare actual results against expected values Compare actual extraction results against expected values with tolerances. Used for conformance testing and validation. **Usage:** ```bash pdftract compare ``` **Arguments:** - `` - Path to the actual results JSON (required) - `` - Path to the expected results JSON (required) **Options:** - `-t, --tolerances` - Path to the tolerances JSON (optional) - `-f, --format` - Output format (text, json) (default: `text`) #### `sdk` SDK code generation commands **Usage:** ```bash pdftract sdk ``` #### `codegen` Generate SDK skeleton from templates **Usage:** ```bash pdftract codegen ``` **Options:** - `-l, --lang` - Target language - `-o, --out` - Output directory - `-v, --version` - Version string (defaults to current pdftract version) (default: `0.1.0`) #### `validate` Validate existing SDK against current generator output **Usage:** ```bash pdftract validate ``` **Options:** - `-l, --lang` - Target language - `-d, --sdk-dir` - Path to existing SDK directory #### `list-diagnostics` List all diagnostic codes with their metadata List all diagnostic codes emitted during PDF parsing and extraction. Each diagnostic includes severity, recoverable flag, phase origin, and suggested action. **Usage:** ```bash pdftract list-diagnostics ``` #### `explain-diagnostic` Explain a specific diagnostic code in detail **Usage:** ```bash pdftract explain-diagnostic ``` **Arguments:** - `` - Diagnostic code to explain (e.g., STRUCT_MISSING_KEY, STREAM_BOMB) (required) ## Hand-Curated Content > **Note:** Any content added after this marker will be preserved > when the CLI reference is regenerated. This section is for > additional context that doesn't fit in the auto-generated sections. ### Common Patterns #### Basic Extraction ```bash pdftract extract document.pdf ``` #### JSON Output ```bash pdftract extract --json output.json document.pdf ``` #### Markdown with Anchors ```bash pdftract extract --md-anchors --md output.md document.pdf ``` ### Exit Codes - `0`: Success - `1`: General error (extraction failed, file not found, etc.) - `2`: Usage error (invalid arguments, conflicting flags) - `3`: Decryption error (wrong or missing password)