This page is auto-generated from the clap command tree. Run
cargo run --manifest-path=xtask/Cargo.toml --bin gen_cli_reference to regenerate.
CLI Reference
This page provides comprehensive documentation for all pdftract CLI commands and flags.
Usage
pdftract [OPTIONS] <COMMAND>
Global Options
These options are available across all subcommands:
- -h, --help - Print help information
- -V, --version - Print version information
Commands
pdftract
pdftract CLI - PDF extraction and conformance testing
pdftract is a command-line tool for extracting text and structure from PDF files. It supports JSON, Markdown, plain text, and NDJSON output formats, with advanced features like OCR, document classification, and conformance testing.
Usage:
pdftract [OPTIONS] <COMMAND>
Options:
- -h, --help - Print help information
- -V, --version - Print version information
extract
Extract text and structure from a PDF file
Extract content from PDF files in multiple formats. Supports local files, remote URLs, and stdin input.
Usage:
pdftract extract
Arguments:
- <input> - Path to the PDF file (use '-' for stdin) (required)
Options:
- --password-stdin - Read password from stdin (one line, terminated by newline)
- --password - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1)
- --header HEADER:VALUE - Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE)
- --pages - Page range to extract (1-based, comma-separated: 1-5,7,12-)
- --json - Output JSON to PATH (use '-' for stdout)
- --md - Output Markdown to PATH (use '-' for stdout)
- --text - Output plain text to PATH (use '-' for stdout)
- --ndjson - Output NDJSON to stdout (mutually exclusive with other formats)
- --format - Output formats (comma-separated: json,markdown,text,ndjson)
- -o, --output - Base path for auto-named outputs (used with --format)
- --receipts - Receipt mode: off (default), lite, or svg (default: off)
- --ocr - Enable OCR for scanned pages (requires 'ocr' feature)
- --ocr-language - OCR language codes (comma-separated, e.g., 'eng,fra,deu')
- --cache-dir - Enable cache at this directory (creates if absent)
- --cache-size - Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) (default: 1 GiB)
- --no-cache - Disable cache for this extraction (even if --cache-dir is set)
- --md-anchors - Emit HTML comment anchors before each block in Markdown output
- --auto - Auto-detect document type and apply appropriate profile
- --profile <NAME|PATH> - Force-apply a specific profile (by name or YAML file path)
- --include-headers - Include header blocks in output
- --include-footers - Include footer blocks in output
- --include-headers-footers - Include both header and footer blocks in output
- --include-invisible-text - Include invisible text spans in output (rendering_mode == 3)
- --include-hidden-layers - Include hidden-layer text spans in output (OCG-controlled)
- --include-watermarks - Include watermark blocks in output (no-op until Phase 7)
classify
Classify document type
Runs metadata + signal extraction to classify document type. Not full text extraction - suitable for quick categorization.
Usage:
pdftract classify
Arguments:
- <input> - Path to the PDF file (required)
Options:
- --password-stdin - Read password from stdin (one line, terminated by newline)
- --password - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1)
- --profiles - Directory containing custom profile YAML files
- --pretty - Pretty-print JSON output
- --top-k - Number of top reasons to include (default: all) (default: 0)
- --exit-on-unknown - Exit with code 1 if document type is unknown
grep
Search for text patterns in PDF files
Search for text patterns with bounding-box results. Requires the 'grep' feature flag.
Usage:
pdftract grep
Arguments:
- <pattern> - Regular expression pattern to search for (required)
- <paths> - PDF files or directories to search (required)
Options:
- -C, --context - Number of context lines to show (default: 0)
- -i, --ignore-case - Case-insensitive search
- --json - Output results as JSON
inspect
Inspect a PDF file in a local web browser
Launch a local web server with debugging overlays for PDF inspection. Provides visual feedback on extraction accuracy and layout analysis. Requires the 'inspect' feature flag.
Usage:
pdftract inspect
Arguments:
- <input> - Path to the PDF file (required)
Options:
- -b, --bind - Bind address for the inspector server (use 0.0.0.0:0 for accessibility from other devices) (default: 127.0.0.1:0)
- --password - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1)
- --ocr - Enable OCR for scanned pages (requires 'ocr' feature)
- --no-browser - Don't automatically open browser
serve
Start the HTTP server for extraction
Start an HTTP server for PDF extraction via REST API.
Security Model: pdftract serve has no built-in authentication. Deploy behind a reverse proxy (nginx, Traefik, Caddy) for production use.
Endpoints:
- POST /extract - Extract PDF and return JSON with metadata
- POST /extract/text - Extract PDF and return plain text
- POST /extract/stream - Extract PDF and return streaming NDJSON
- GET /health - Health check
Requires the 'serve' feature flag.
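As a rough sketch of how a deployment script might wait for the server to come up, the helper below polls GET /health until it answers. It assumes curl is installed and that the server is reachable at the default bind address 127.0.0.1:8080; the function name wait_for_pdftract and the retry budget are illustrative and not part of pdftract.

```shell
# Poll the health endpoint until it responds or the attempts run out.
# wait_for_pdftract is a hypothetical helper name, not a pdftract command.
wait_for_pdftract() {
  base_url="${1:-http://127.0.0.1:8080}"   # assumed: server on the default bind address
  attempts="${2:-10}"                      # illustrative retry budget
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses
    if curl -fsS "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A script could start pdftract serve in the background and then call wait_for_pdftract before sending any extraction requests.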
Usage:
pdftract serve
Options:
- -b, --bind - Bind address (e.g., "127.0.0.1:8080", "[::1]:9000", "0.0.0.0:3000") (default: 127.0.0.1:8080)
- --cache-dir - Enable cache at this directory
- --cache-size - Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) (default: 1 GiB)
- --no-cache - Disable cache
- --max-upload-mb - Maximum request body size in MB (default: 256, max: 4096) (default: 256)
- --max-decompress-gb - Maximum decompression size in GB (default: 1) (default: 1)
- --audit-log - Write per-request audit log to FILE (NDJSON; use "-" for stdout)
- --trust-forwarded-for - Trust X-Forwarded-For header for client IP detection (DANGER: enables IP spoofing if not behind a trusted proxy)
- --profile-dir - Directory containing custom profile YAML files (repeatable)
- --profile-hot-reload - Enable hot-reload for profiles (re-read directory on every request)
mcp
Start the MCP (Model Context Protocol) server
Start an MCP server for AI assistant integration.
Per ADR-006: stdio and HTTP transports are mutually exclusive. Exactly one transport must be selected per invocation.
Requires the 'mcp' feature flag.
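Because exactly one transport must be selected, a launcher script can make that choice explicit. The wrapper below is a minimal sketch: start_mcp is a hypothetical function name, and always passing --auth-token-file for the HTTP transport is an assumption of this example, not a documented requirement.

```shell
# Launch the MCP server with exactly one transport.
# start_mcp is a hypothetical wrapper, not part of pdftract.
start_mcp() {
  transport="${1:-}"
  case "$transport" in
    stdio)
      pdftract mcp --stdio
      ;;
    http)
      # Assumption of this sketch: the HTTP transport is always started with a token file.
      if [ -z "${2:-}" ] || [ -z "${3:-}" ]; then
        echo "usage: start_mcp http ADDR TOKEN_FILE" >&2
        return 2
      fi
      pdftract mcp --bind "$2" --auth-token-file "$3"
      ;;
    *)
      echo "usage: start_mcp stdio | start_mcp http ADDR TOKEN_FILE" >&2
      return 2
      ;;
  esac
}
```

For example, start_mcp stdio would suit a desktop assistant, while start_mcp http with an address and a token file path would start the HTTP+SSE transport.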
Usage:
pdftract mcp
Options:
- --stdio - Use stdio transport (for Claude Desktop, Claude Code, Continue, Cursor)
- -b, --bind - Bind address for the MCP server (enables HTTP+SSE transport)
- --auth-token-file - Path to a file containing the bearer token (RECOMMENDED)
- --auth-token - Bearer token for authentication (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_TOKEN=1)
- --max-upload-mb - Maximum request body size in MB (default: 256) (default: 256)
- --root - Root directory for local filesystem access (enforces path-traversal protection)
- --audit-log - Write per-request audit log to FILE (NDJSON; use "-" for stdout)
cache
Manage the extraction cache
Manage the content-addressed extraction cache. Cache entries are stored by PDF hash and version constraint. Requires the 'cache' feature flag.
Usage:
pdftract cache
stats
Show cache statistics
Usage:
pdftract cache stats
Arguments:
- <dir> - Path to the cache directory (required)
Options:
- --json - Output in JSON format
clear
Clear all cache entries
Clear all cache entries (preserves index.json and sentinel)
Usage:
pdftract cache clear
Arguments:
- <dir> - Path to the cache directory (required)
Options:
- -y, --yes - Skip confirmation prompt
purge
Purge old cache entries
Usage:
pdftract cache purge
Arguments:
- <dir> - Path to the cache directory (required)
Options:
- --older-than - Delete entries older than this duration (e.g., "30d", "7d", "1h")
- --version - Delete entries matching this version constraint (e.g., "<1.0.0")
profiles
Manage document type profiles
Manage document type profiles for classification and extraction tuning. Requires the 'profiles' feature flag.
Usage:
pdftract profiles
list
List all available profiles
Usage:
pdftract profiles list
show
Show a profile's YAML content
Usage:
pdftract profiles show
Arguments:
- <name_or_path> - Profile name or path to YAML file (required)
export
Export a built-in profile to stdout
Usage:
pdftract profiles export
Arguments:
- <name> - Name of the built-in profile to export (required)
install
Install a profile to the user config directory
Usage:
pdftract profiles install
Arguments:
- <path> - Path to the profile YAML file to install (required)
validate
Validate a profile file
Usage:
pdftract profiles validate
Arguments:
- <path> - Path to the profile YAML file to validate (required)
doctor
Check environment health and dependencies
Run environment health checks for pdftract dependencies and configuration.
Exit code policy:
- Exits 0 if no checks FAIL (WARN does not affect exit code)
- Exits 1 if any check FAILs
- Exits 2 on argument parse errors
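The policy above lends itself to a preflight gate in batch scripts. The following is a minimal sketch under that policy; preflight is a hypothetical helper name, and discarding the report output is a choice made only for this example.

```shell
# Gate a batch job on `pdftract doctor`, following the exit code policy above.
# preflight is a hypothetical helper name, not a pdftract command.
preflight() {
  status=0
  # Extra arguments (for example --cache-dir DIR) are passed through unchanged.
  pdftract doctor --no-color "$@" >/dev/null || status=$?
  case "$status" in
    0) return 0 ;;   # no check FAILed; WARN results do not change the exit code
    1) echo "preflight: a health check failed" >&2; return 1 ;;
    2) echo "preflight: bad arguments passed to doctor" >&2; return 2 ;;
    *) echo "preflight: unexpected exit code $status" >&2; return "$status" ;;
  esac
}
```

A job could then run preflight first and continue only when it returns 0.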
Usage:
pdftract doctor
Options:
- --features - Print compiled features and exit
- --json - Output results as JSON
- --no-color - Disable colored output
- --exit-on-fail - Explicit form of the default policy (exit 1 if any check FAILs)
- --profile-dir - Verify the profile search path includes DIR
- --cache-dir - Verify DIR is writable and has sufficient space
- --lang - Requested OCR languages (default: eng)
hash
Compute the PDF structural fingerprint
Compute a structural hash/fingerprint of a PDF file. This hash is based on the PDF's structure (xref, trailers, object locations) rather than content, making it useful for identifying identical documents with different metadata.
Usage:
pdftract hash
Arguments:
- <input> - Path to the PDF file or URL (required)
Options:
- --password - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1)
- --header HEADER:VALUE - Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE)
verify-receipt
Verify a receipt against a PDF file
Verify a visual citation receipt against the original PDF. Checks that quoted text appears at the expected locations. Requires the 'receipts' feature flag.
Usage:
pdftract verify-receipt
Arguments:
- <receipt> - Path to the receipt JSON file (required)
Options:
- --pdf - Path to the original PDF file
- --tolerance - Tolerance for bounding box matching in pixels (default: 10)
- --json - Output results as JSON
conformance
Run SDK conformance test suite
Usage:
pdftract conformance
Options:
- -s, --suite - Path to the conformance suite JSON (default: tests/sdk-conformance/cases.json)
- -k, --sdk - SDK name (default: pdftract)
- -v, --version - SDK version (default: 0.1.0)
- -o, --output - Output report path (default: conformance-report.json)
compare
Compare actual results against expected values
Compare actual extraction results against expected values with tolerances. Used for conformance testing and validation.
Usage:
pdftract compare
Arguments:
- <actual> - Path to the actual results JSON (required)
- <expected> - Path to the expected results JSON (required)
Options:
- -t, --tolerances - Path to the tolerances JSON (optional)
- -f, --format - Output format (text, json) (default: text)
sdk
SDK code generation commands
Usage:
pdftract sdk
codegen
Generate SDK skeleton from templates
Usage:
pdftract sdk codegen
Options:
- -l, --lang - Target language
- -o, --out - Output directory
- -v, --version - Version string (defaults to current pdftract version) (default: 0.1.0)
validate
Validate existing SDK against current generator output
Usage:
pdftract sdk validate
Options:
- -l, --lang - Target language
- -d, --sdk-dir - Path to existing SDK directory
list-diagnostics
List all diagnostic codes with their metadata
List all diagnostic codes emitted during PDF parsing and extraction. Each diagnostic includes severity, recoverable flag, phase origin, and suggested action.
Usage:
pdftract list-diagnostics
explain-diagnostic
Explain a specific diagnostic code in detail
Usage:
pdftract explain-diagnostic
Arguments:
- <code> - Diagnostic code to explain (e.g., STRUCT_MISSING_KEY, STREAM_BOMB) (required)
Hand-Curated Content
Note: Any content added after this marker will be preserved when the CLI reference is regenerated. This section is for additional context that doesn't fit in the auto-generated sections.
Common Patterns
Basic Extraction
pdftract extract document.pdf
JSON Output
pdftract extract --json output.json document.pdf
Markdown with Anchors
pdftract extract --md-anchors --md output.md document.pdf
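Encrypted PDFs and page ranges use the same command with a few more flags. The sketch below wraps that pattern in a function so the password is piped through --password-stdin and stays out of the argument list. The name extract_encrypted and the PDF_PASSWORD variable are assumptions made for illustration.

```shell
# Extract selected pages of an encrypted PDF as JSON on stdout.
# extract_encrypted is a hypothetical wrapper; PDF_PASSWORD is an assumed variable name.
extract_encrypted() {
  pdf="$1"
  pages="$2"   # 1-based, comma-separated, e.g. 1-5,7,12-
  # --password-stdin reads one newline-terminated line from stdin,
  # so the password is not passed as a command-line argument.
  printf '%s\n' "$PDF_PASSWORD" |
    pdftract extract --password-stdin --pages "$pages" --json - "$pdf"
}
```

With PDF_PASSWORD set in the environment, calling extract_encrypted document.pdf 1-5,7,12- would request those pages as JSON on stdout.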
Exit Codes
- 0: Success
- 1: General error (extraction failed, file not found, etc.)
- 2: Usage error (invalid arguments, conflicting flags)
- 3: Decryption error (wrong or missing password)
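Scripts that call pdftract can branch on these codes. The following is a minimal sketch of one way to do that; explain_exit and extract_or_explain are hypothetical helper names, and the message wording is illustrative.

```shell
# Map a pdftract exit status to the categories listed above.
# explain_exit is a hypothetical helper name.
explain_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error" ;;
    2) echo "usage error" ;;
    3) echo "decryption error" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Run an extraction and print the failure category to stderr when it fails.
# The original exit status is returned so callers can still branch on it.
extract_or_explain() {
  status=0
  pdftract extract "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "pdftract: $(explain_exit "$status")" >&2
  fi
  return "$status"
}
```

A caller might, for instance, prompt for a password again only when the returned status is 3.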