diff --git a/docs/user-docs/src/cli-reference.md b/docs/user-docs/src/cli-reference.md index b0e572b..d537a32 100644 --- a/docs/user-docs/src/cli-reference.md +++ b/docs/user-docs/src/cli-reference.md @@ -1,593 +1,619 @@ -> This page is auto-generated from the clap command tree. -> Run `cargo run --manifest-path=xtask/Cargo.toml --bin gen_cli_reference` to regenerate. - # CLI Reference -This page provides comprehensive documentation for all pdftract CLI commands and flags. +> This page is auto-generated from the clap command tree. +> Run `cargo run --bin gen-cli-reference` to regenerate. -## Usage +# Command-Line Help for `pdftract` -```bash -pdftract [OPTIONS] -``` +This document contains the help content for the `pdftract` command-line program. -## Global Options +**Command Overview:** -These options are available across all subcommands: +* [`pdftract`↴](#pdftract) +* [`pdftract list-diagnostics`↴](#pdftract-list-diagnostics) +* [`pdftract explain-diagnostic`↴](#pdftract-explain-diagnostic) +* [`pdftract compare`↴](#pdftract-compare) +* [`pdftract conformance`↴](#pdftract-conformance) +* [`pdftract sdk`↴](#pdftract-sdk) +* [`pdftract sdk codegen`↴](#pdftract-sdk-codegen) +* [`pdftract sdk validate`↴](#pdftract-sdk-validate) +* [`pdftract extract`↴](#pdftract-extract) +* [`pdftract classify`↴](#pdftract-classify) +* [`pdftract inspect`↴](#pdftract-inspect) +* [`pdftract verify-receipt`↴](#pdftract-verify-receipt) +* [`pdftract hash`↴](#pdftract-hash) +* [`pdftract cache`↴](#pdftract-cache) +* [`pdftract cache stats`↴](#pdftract-cache-stats) +* [`pdftract cache clear`↴](#pdftract-cache-clear) +* [`pdftract cache purge`↴](#pdftract-cache-purge) +* [`pdftract profiles`↴](#pdftract-profiles) +* [`pdftract profiles list`↴](#pdftract-profiles-list) +* [`pdftract profiles show`↴](#pdftract-profiles-show) +* [`pdftract profiles export`↴](#pdftract-profiles-export) +* [`pdftract profiles install`↴](#pdftract-profiles-install) +* [`pdftract profiles 
validate`↴](#pdftract-profiles-validate) +* [`pdftract serve`↴](#pdftract-serve) +* [`pdftract mcp`↴](#pdftract-mcp) +* [`pdftract validate`↴](#pdftract-validate) +* [`pdftract migrate-schema`↴](#pdftract-migrate-schema) +* [`pdftract doctor`↴](#pdftract-doctor) -- `-h, --help` - Print help information -- `-V, --version` - Print version information - -## Commands - -### `pdftract` +## `pdftract` pdftract CLI - PDF extraction and conformance testing -pdftract is a command-line tool for extracting text and structure from PDF files. -It supports JSON, Markdown, plain text, and NDJSON output formats, with -advanced features like OCR, document classification, and conformance testing. - -**Usage:** - -```bash -pdftract pdftract -``` - -**Options:** - -- `-h, --help` - Print help information -- `-V, --version` - Print version information - - #### `extract` - -Extract text and structure from a PDF file - -Extract content from PDF files in multiple formats. -Supports local files, remote URLs, and stdin input. 
- -**Usage:** - -```bash -pdftract extract -``` - -**Arguments:** - -- `` - Path to the PDF file (use '-' for stdin) (required) - -**Options:** - -- `--password-stdin` - Read password from stdin (one line, terminated by newline) -- `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) -- `--header` - Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE) -- `--pages` - Page range to extract (1-based, comma-separated: 1-5,7,12-) -- `--json` - Output JSON to PATH (use '-' for stdout) -- `--md` - Output Markdown to PATH (use '-' for stdout) -- `--text` - Output plain text to PATH (use '-' for stdout) -- `--ndjson` - Output NDJSON to stdout (mutually exclusive with other formats) -- `--format` - Output formats (comma-separated: json,markdown,text,ndjson) -- `-o, --output` - Base path for auto-named outputs (used with --format) -- `--receipts` - Receipt mode: off (default), lite, or svg (default: `off`) -- `--ocr` - Enable OCR for scanned pages (requires 'ocr' feature) -- `--ocr-language` - OCR language codes (comma-separated, e.g., 'eng,fra,deu') -- `--cache-dir` - Enable cache at this directory (creates if absent) -- `--cache-size` - Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) (default: `1 GiB`) -- `--no-cache` - Disable cache for this extraction (even if --cache-dir is set) -- `--md-anchors` - Emit HTML comment anchors before each block in Markdown output -- `--md-no-page-breaks` - Suppress page-break horizontal rules between pages -- `--auto` - Auto-detect document type and apply appropriate profile -- `--profile` - Force-apply a specific profile (by name or YAML file path) -- `--include-headers` - Include header blocks in output -- `--include-footers` - Include footer blocks in output -- `--include-headers-footers` - Include both header and footer blocks in output -- `--include-invisible-text` - Include invisible text spans in output (rendering_mode == 3) -- `--include-hidden-layers` - 
Include hidden-layer text spans in output (OCG-controlled) -- `--include-watermarks` - Include watermark blocks in output (no-op until Phase 7) - - #### `classify` - -Classify document type - -Runs metadata + signal extraction to classify document type. -Not full text extraction - suitable for quick categorization. - -**Usage:** - -```bash -pdftract classify -``` - -**Arguments:** - -- `` - Path to the PDF file (required) - -**Options:** - -- `--password-stdin` - Read password from stdin (one line, terminated by newline) -- `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) -- `--profiles` - Directory containing custom profile YAML files -- `--pretty` - Pretty-print JSON output -- `--top-k` - Number of top reasons to include (default: all) (default: `0`) -- `--exit-on-unknown` - Exit with code 1 if document type is unknown - - #### `grep` - -Search for text patterns in PDF files - -Search for text patterns with bounding-box results. -Requires the 'grep' feature flag. - -**Usage:** - -```bash -pdftract grep -``` - -**Arguments:** - -- `` - Regular expression pattern to search for (required) -- `` - PDF files or directories to search (required) - -**Options:** - -- `-C, --context` - Number of context lines to show (default: `0`) -- `-i, --ignore-case` - Case-insensitive search -- `--json` - Output results as JSON - - #### `inspect` - -Inspect a PDF file in a local web browser - -Launch a local web server with debugging overlays for PDF inspection. -Provides visual feedback on extraction accuracy and layout analysis. -Requires the 'inspect' feature flag. 
- -**Usage:** - -```bash -pdftract inspect -``` - -**Arguments:** - -- `` - Path to the PDF file to inspect (required) - -**Options:** - -- `-p, --port` - Port to bind the inspector server (default: 7676) (default: `7676`) -- `-b, --bind` - Bind address for the inspector server (default: 127.0.0.1) (default: `127.0.0.1`) -- `--auth-token` - Authentication token for non-loopback binds -- `--no-open` - Suppress automatic browser launch -- `--compare` - Optional second PDF file for comparative debugging -- `--audit-log` - Write per-request audit log to FILE (NDJSON; use "-" for stdout) - - #### `serve` - -Start the HTTP server for extraction - -Start an HTTP server for PDF extraction via REST API. - -**Security Model:** pdftract serve has no built-in authentication. Deploy behind a reverse proxy (nginx, Traefik, Caddy) for production use. - -**Endpoints:** -- POST /extract - Extract PDF and return JSON with metadata -- POST /extract/text - Extract PDF and return plain text -- POST /extract/stream - Extract PDF and return streaming NDJSON -- GET /health - Health check - -Requires the 'serve' feature flag. 
- -**Usage:** - -```bash -pdftract serve -``` - -**Options:** - -- `-b, --bind` - Bind address (e.g., "127.0.0.1:8080", "[::1]:9000", "0.0.0.0:3000") (default: `127.0.0.1:8080`) -- `--cache-dir` - Enable cache at this directory -- `--cache-size` - Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) (default: `1 GiB`) -- `--no-cache` - Disable cache -- `--max-upload-mb` - Maximum request body size in MB (default: 256, max: 4096) (default: `256`) -- `--max-decompress-gb` - Maximum decompression size in GB (default: 1) (default: `1`) -- `--audit-log` - Write per-request audit log to FILE (NDJSON; use "-" for stdout) -- `--trust-forwarded-for` - Trust X-Forwarded-For header for client IP detection (DANGER: enables IP spoofing if not behind a trusted proxy) -- `--profile-dir` - Directory containing custom profile YAML files (repeatable) -- `--profile-hot-reload` - Enable hot-reload for profiles (re-read directory on every request) - - #### `mcp` - -Start the MCP (Model Context Protocol) server - -Start an MCP server for AI assistant integration. - -Per ADR-006: stdio and HTTP transports are mutually exclusive. -Exactly one transport must be selected per invocation. - -Requires the 'mcp' feature flag. 
- -**Usage:** - -```bash -pdftract mcp -``` - -**Options:** - -- `--stdio` - Use stdio transport (for Claude Desktop, Claude Code, Continue, Cursor) -- `-b, --bind` - Bind address for the MCP server (enables HTTP+SSE transport) -- `--auth-token-file` - Path to a file containing the bearer token (RECOMMENDED) -- `--auth-token` - Bearer token for authentication (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_TOKEN=1) -- `--max-upload-mb` - Maximum request body size in MB (default: 256) (default: `256`) -- `--root` - Root directory for local filesystem access (enforces path-traversal protection) -- `--audit-log` - Write per-request audit log to FILE (NDJSON; use "-" for stdout) - - #### `cache` - -Manage the extraction cache - -Manage the content-addressed extraction cache. -Cache entries are stored by PDF hash and version constraint. -Requires the 'cache' feature flag. - -**Usage:** - -```bash -pdftract cache -``` - - #### `stats` - -Show cache statistics - -**Usage:** - -```bash -pdftract stats -``` - -**Arguments:** - -- `` - Path to the cache directory (required) - -**Options:** - -- `--json` - Output in JSON format - - #### `clear` - -Clear all cache entries - -Clear all cache entries (preserves index.json and sentinel) - -**Usage:** - -```bash -pdftract clear -``` - -**Arguments:** - -- `` - Path to the cache directory (required) - -**Options:** - -- `-y, --yes` - Skip confirmation prompt - - #### `purge` - -Purge old cache entries - -**Usage:** - -```bash -pdftract purge -``` - -**Arguments:** - -- `` - Path to the cache directory (required) - -**Options:** - -- `--older-than` - Delete entries older than this duration (e.g., "30d", "7d", "1h") -- `--version` - Delete entries matching this version constraint (e.g., "<1.0.0") - - #### `profiles` - -Manage document type profiles - -Manage document type profiles for classification and extraction tuning. -Requires the 'profiles' feature flag. 
- -**Usage:** - -```bash -pdftract profiles -``` - - #### `list` - -List all available profiles - -**Usage:** - -```bash -pdftract list -``` - - #### `show` - -Show a profile's YAML content - -**Usage:** - -```bash -pdftract show -``` - -**Arguments:** - -- `` - Profile name or path to YAML file (required) - - #### `export` - -Export a built-in profile to stdout - -**Usage:** - -```bash -pdftract export -``` - -**Arguments:** - -- `` - Name of the built-in profile to export (required) - - #### `install` - -Install a profile to the user config directory - -**Usage:** - -```bash -pdftract install -``` - -**Arguments:** - -- `` - Path to the profile YAML file to install (required) - - #### `validate` - -Validate a profile file - -**Usage:** - -```bash -pdftract validate -``` - -**Arguments:** - -- `` - Path to the profile YAML file to validate (required) - - #### `doctor` - -Check environment health and dependencies - -Run environment health checks for pdftract dependencies and configuration. - -Exit code policy: -- Exits 0 if no checks FAIL (WARN does not affect exit code) -- Exits 1 if any check FAILs -- Exits 2 on argument parse errors - -**Usage:** - -```bash -pdftract doctor -``` - -**Options:** - -- `--features` - Print compiled features and exit -- `--json` - Output results as JSON -- `--no-color` - Disable colored output -- `--exit-on-fail` - Explicit form of the default policy (exit 1 if any check FAILs) -- `--profile-dir` - Verify the profile search path includes DIR -- `--cache-dir` - Verify DIR is writable and has sufficient space -- `--lang` - Requested OCR languages (default: eng) - - #### `hash` - -Compute the PDF structural fingerprint - -Compute a structural hash/fingerprint of a PDF file. -This hash is based on the PDF's structure (xref, trailers, object -locations) rather than content, making it useful for identifying -identical documents with different metadata. 
- -**Usage:** - -```bash -pdftract hash -``` - -**Arguments:** - -- `` - Path to the PDF file or URL (required) - -**Options:** - -- `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) -- `--header` - Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE) - - #### `verify-receipt` - -Verify a receipt against a PDF file - -Verify a visual citation receipt against the original PDF. -Checks fingerprint, bbox IoU, and content hash. -Requires the 'receipts' feature flag. - -**Usage:** - -```bash -pdftract verify-receipt -``` - -**Arguments:** - -- `` - Path to the PDF file to verify against (required) -- `` - Path to the receipt JSON file, or "-" for stdin (required) - -**Options:** - -- `--stdin` - Read receipt from stdin (alternative to "-") -- `--inline` - Receipt JSON as inline string (alternative to file path) -- `--json` - Output machine-readable JSON result -- `--quiet` - Suppress human-readable output (exit code only) -- `--password` - PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) -- `--password-stdin` - Read password from stdin (one line, terminated by newline) - - #### `conformance` - -Run SDK conformance test suite - -**Usage:** - -```bash -pdftract conformance -``` - -**Options:** - -- `-s, --suite` - Path to the conformance suite JSON (default: `tests/sdk-conformance/cases.json`) -- `-k, --sdk` - SDK name (default: `pdftract`) -- `-v, --version` - SDK version (default: `0.1.0`) -- `-o, --output` - Output report path (default: `conformance-report.json`) - - #### `compare` - -Compare actual results against expected values - -Compare actual extraction results against expected values with tolerances. -Used for conformance testing and validation. 
- -**Usage:** - -```bash -pdftract compare -``` - -**Arguments:** - -- `` - Path to the actual results JSON (required) -- `` - Path to the expected results JSON (required) - -**Options:** - -- `-t, --tolerances` - Path to the tolerances JSON (optional) -- `-f, --format` - Output format (text, json) (default: `text`) - - #### `sdk` - -SDK code generation commands - -**Usage:** - -```bash -pdftract sdk -``` - - #### `codegen` - -Generate SDK skeleton from templates - -**Usage:** - -```bash -pdftract codegen -``` - -**Options:** - -- `-l, --lang` - Target language -- `-o, --out` - Output directory -- `-v, --version` - Version string (defaults to current pdftract version) (default: `0.1.0`) - - #### `validate` - -Validate existing SDK against current generator output - -**Usage:** - -```bash -pdftract validate -``` - -**Options:** - -- `-l, --lang` - Target language -- `-d, --sdk-dir` - Path to existing SDK directory - - #### `migrate-schema` - -Migrate JSON output between schema versions - -Migrate JSON output between schema versions. -Converts JSON from one schema version to another. 
- -**Usage:** - -```bash -pdftract migrate-schema -``` - -**Arguments:** - -- `` - Input JSON file (use '-' for stdin) - -**Options:** - -- `--from` - Source schema version (e.g., "1.0", "1.1") -- `--to` - Target schema version (e.g., "1.0", "1.1") -- `-o, --output` - Output JSON file (use '-' for stdout) (default: `-`) -- `-p, --pretty` - Pretty-print output JSON - - #### `list-diagnostics` +**Usage:** `pdftract ` + +###### **Subcommands:** + +* `list-diagnostics` — List all diagnostic codes with their metadata +* `explain-diagnostic` — Explain a specific diagnostic code in detail +* `compare` — Compare actual results against expected values with tolerances (for conformance testing) +* `conformance` — Run SDK conformance test suite +* `sdk` — SDK code generation commands +* `extract` — Extract text and structure from a PDF file +* `classify` — Classify document type (runs metadata + signal extraction, not full text extraction) +* `inspect` — Inspect a PDF file in a local web browser with debugging overlays +* `verify-receipt` — Verify a receipt against a PDF file +* `hash` — Compute the PDF structural fingerprint (hash) +* `cache` — Manage the extraction cache +* `profiles` — Manage document type profiles +* `serve` — Start the HTTP server for extraction +* `mcp` — Start the MCP (Model Context Protocol) server +* `validate` — Validate a JSON file against the pdftract schema +* `migrate-schema` — Migrate JSON output between schema versions +* `doctor` — Check environment health and dependencies + + + +## `pdftract list-diagnostics` List all diagnostic codes with their metadata -List all diagnostic codes emitted during PDF parsing and extraction. -Each diagnostic includes severity, recoverable flag, phase origin, -and suggested action. 
+**Usage:** `pdftract list-diagnostics` -**Usage:** -```bash -pdftract list-diagnostics -``` - #### `explain-diagnostic` +## `pdftract explain-diagnostic` Explain a specific diagnostic code in detail -**Usage:** +**Usage:** `pdftract explain-diagnostic ` -```bash -pdftract explain-diagnostic -``` +###### **Arguments:** -**Arguments:** +* `` — Diagnostic code to explain (e.g., STRUCT_MISSING_KEY, STREAM_BOMB) + + + +## `pdftract compare` + +Compare actual results against expected values with tolerances (for conformance testing) + +**Usage:** `pdftract compare [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the actual results JSON +* `` — Path to the expected results JSON + +###### **Options:** + +* `-t`, `--tolerances ` — Path to the tolerances JSON (optional) +* `-f`, `--format ` — Output format (text, json) + + Default value: `text` + + + +## `pdftract conformance` + +Run SDK conformance test suite + +**Usage:** `pdftract conformance [OPTIONS]` + +###### **Options:** + +* `-s`, `--suite ` — Path to the conformance suite JSON + + Default value: `tests/sdk-conformance/cases.json` +* `-k`, `--sdk ` — SDK name + + Default value: `pdftract` +* `-v`, `--version ` — SDK version + + Default value: `0.1.0` +* `-o`, `--output ` — Output report path + + Default value: `conformance-report.json` + + + +## `pdftract sdk` + +SDK code generation commands + +**Usage:** `pdftract sdk ` + +###### **Subcommands:** + +* `codegen` — Generate SDK skeleton from templates +* `validate` — Validate existing SDK against current generator output + + + +## `pdftract sdk codegen` + +Generate SDK skeleton from templates + +**Usage:** `pdftract sdk codegen --lang --out ` + +###### **Options:** + +* `-l`, `--lang ` — Target language + + Possible values: `python`, `rust`, `node`, `go`, `java`, `dotnet`, `ruby`, `php`, `swift` + +* `-o`, `--out ` — Output directory +* `-v`, `--version ` — Version string (defaults to current pdftract version) + + Default value: `0.1.0` + + + +## `pdftract sdk 
validate` + +Validate existing SDK against current generator output + +**Usage:** `pdftract sdk validate --lang --sdk-dir ` + +###### **Options:** + +* `-l`, `--lang ` — Target language + + Possible values: `python`, `rust`, `node`, `go`, `java`, `dotnet`, `ruby`, `php`, `swift` + +* `-s`, `--sdk-dir ` — Path to existing SDK directory + + + +## `pdftract extract` + +Extract text and structure from a PDF file + +**Usage:** `pdftract extract [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the PDF file (use '-' for stdin) + +###### **Options:** + +* `--password-stdin` — Read password from stdin (one line, terminated by newline) +* `--password ` — PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) +* `--header ` — Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE) +* `--pages ` — Page range to extract (1-based, comma-separated: 1-5,7,12-) +* `--json ` — Output JSON to PATH (use '-' for stdout) +* `--md ` — Output Markdown to PATH (use '-' for stdout) +* `--text ` — Output plain text to PATH (use '-' for stdout) +* `--ndjson` — Output NDJSON to stdout (mutually exclusive with other formats) +* `--format ` — Output formats (comma-separated: json,markdown,text,ndjson) +* `-o`, `--output ` — Base path for auto-named outputs (used with --format) +* `--receipts ` — Receipt mode: off (default), lite, or svg + + Default value: `off` + + Possible values: `off`, `lite`, `svg` + +* `--ocr` — Enable OCR for scanned pages (requires 'ocr' feature) +* `--ocr-language ` — OCR language codes (comma-separated, e.g., 'eng,fra,deu') +* `--cache-dir ` — Enable cache at this directory (creates if absent) +* `--cache-size ` — Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) + + Default value: `1 GiB` +* `--no-cache` — Disable cache for this extraction (even if --cache-dir is set) +* `--md-anchors` — Emit HTML comment anchors before each block in Markdown output +* `--md-no-page-breaks` — Suppress page-break horizontal 
rules between pages +* `--auto` — Auto-detect document type and apply appropriate profile +* `--profile ` — Force-apply a specific profile (by name or YAML file path) +* `--include-headers` — Include header blocks in output +* `--include-footers` — Include footer blocks in output +* `--include-headers-footers` — Include both header and footer blocks in output +* `--include-invisible-text` — Include invisible text spans in output (rendering_mode == 3) +* `--include-hidden-layers` — Include hidden-layer text spans in output (OCG-controlled) +* `--include-watermarks` — Include watermark blocks in output (no-op until Phase 7) + + + +## `pdftract classify` + +Classify document type (runs metadata + signal extraction, not full text extraction) + +**Usage:** `pdftract classify [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the PDF file + +###### **Options:** + +* `--password-stdin` — Read password from stdin (one line, terminated by newline) +* `--password ` — PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) +* `--profiles ` — Directory containing custom profile YAML files +* `--pretty` — Pretty-print JSON output +* `--top-k ` — Number of top reasons to include (default: all) + + Default value: `0` +* `--exit-on-unknown` — Exit with code 1 if document type is unknown + + + +## `pdftract inspect` + +Inspect a PDF file in a local web browser with debugging overlays + +**Usage:** `pdftract inspect [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the PDF file to inspect + +###### **Options:** + +* `-p`, `--port ` — Port to bind the inspector server (default: 7676) + + Default value: `7676` +* `-b`, `--bind ` — Bind address for the inspector server (default: 127.0.0.1) + + Binding to a non-loopback address requires --auth-token for security. + + Default value: `127.0.0.1` +* `--auth-token ` — Authentication token for non-loopback binds + + Required when --bind is not a loopback address (127.0.0.1 or ::1). 
+* `--no-open` — Suppress automatic browser launch + + Useful for CI environments or when you want to manually open the browser. +* `--compare ` — Optional second PDF file for comparative debugging + + When provided, the inspector shows side-by-side comparison. +* `--audit-log ` — Write per-request audit log to FILE (NDJSON; use "-" for stdout, "/dev/stderr" for stderr) + + Rotation: pdftract does NOT rotate logs; configure logrotate on the audit-log file. When FILE is "-", rotation is the responsibility of the supervisor (e.g., journald). + + + +## `pdftract verify-receipt` + +Verify a receipt against a PDF file + +**Usage:** `pdftract verify-receipt [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the PDF file to verify against +* `` — Path to the receipt JSON file, or "-" for stdin + +###### **Options:** + +* `--stdin` — Read receipt from stdin (alternative to "-") +* `--inline ` — Receipt JSON as inline string (alternative to file path) +* `--json` — Output machine-readable JSON result +* `--quiet` — Suppress human-readable output (exit code only) +* `--password ` — PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) +* `--password-stdin` — Read password from stdin (one line, terminated by newline) + + + +## `pdftract hash` + +Compute the PDF structural fingerprint (hash) + +**Usage:** `pdftract hash [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the PDF file or URL + +###### **Options:** + +* `--password ` — PDF password (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1) +* `--header ` — Custom HTTP headers for remote sources (repeatable; format: HEADER:VALUE) + + + +## `pdftract cache` + +Manage the extraction cache + +**Usage:** `pdftract cache ` + +###### **Subcommands:** + +* `stats` — Show cache statistics +* `clear` — Clear all cache entries (preserves index.json and sentinel) +* `purge` — Purge old cache entries + + + +## `pdftract cache stats` + +Show cache statistics + +**Usage:** `pdftract cache stats 
[OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the cache directory + +###### **Options:** + +* `--json` — Output in JSON format + + + +## `pdftract cache clear` + +Clear all cache entries (preserves index.json and sentinel) + +**Usage:** `pdftract cache clear [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the cache directory + +###### **Options:** + +* `-y`, `--yes` — Skip confirmation prompt + + + +## `pdftract cache purge` + +Purge old cache entries + +**Usage:** `pdftract cache purge [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the cache directory + +###### **Options:** + +* `--older-than ` — Delete entries older than this duration (e.g., "30d", "7d", "1h") +* `--version ` — Delete entries matching this version constraint (e.g., "<1.0.0") + + + +## `pdftract profiles` + +Manage document type profiles + +**Usage:** `pdftract profiles ` + +###### **Subcommands:** + +* `list` — List all available profiles +* `show` — Show a profile's YAML content +* `export` — Export a built-in profile to stdout +* `install` — Install a profile to the user config directory +* `validate` — Validate a profile file + + + +## `pdftract profiles list` + +List all available profiles + +**Usage:** `pdftract profiles list` + + + +## `pdftract profiles show` + +Show a profile's YAML content + +**Usage:** `pdftract profiles show ` + +###### **Arguments:** + +* `` — Profile name or path to YAML file + + + +## `pdftract profiles export` + +Export a built-in profile to stdout + +**Usage:** `pdftract profiles export ` + +###### **Arguments:** + +* `` — Name of the built-in profile to export + + + +## `pdftract profiles install` + +Install a profile to the user config directory + +**Usage:** `pdftract profiles install ` + +###### **Arguments:** + +* `` — Path to the profile YAML file to install + + + +## `pdftract profiles validate` + +Validate a profile file + +**Usage:** `pdftract profiles validate ` + +###### **Arguments:** + +* `` — Path to the profile YAML file to 
validate + + + +## `pdftract serve` + +Start the HTTP server for extraction + +## Security Model + +**pdftract serve has no built-in authentication.** Deploy behind a reverse proxy (nginx, Traefik, Caddy) for production use. The server accepts PDFs via multipart upload only; no endpoint accepts file paths from server filesystem. + +## Concurrency + +The server uses a two-level concurrency architecture: + +- **tokio**: Per-request concurrency via the async executor. Each HTTP request is handled asynchronously on tokio's multi-threaded runtime. - **rayon**: Per-document parallelism within each extraction. PDF pages are processed in parallel using rayon's work-stealing thread pool. + +The bridge between async (tokio) and sync (rayon) is `tokio::task::spawn_blocking`. Each POST handler wraps the synchronous extraction call in `spawn_blocking`, which runs the work on tokio's blocking thread pool (separate from the async reactor). + +This design ensures: - The async reactor is never blocked by extraction work - Multiple PDFs can be extracted concurrently (one per request) - Within each PDF, pages are processed in parallel (rayon) - Thread pools are sized appropriately (tokio: 512 blocking threads; rayon: num_cpus) + +## Endpoints + +- `POST /extract` - Extract PDF and return JSON with metadata - `POST /extract/text` - Extract PDF and return plain text - `POST /extract/stream` - Extract PDF and return streaming NDJSON - `GET /health` - Health check (responds within 100ms even during concurrent extractions) + +## Cache + +Cache is optional. When enabled, extracted results are stored on disk and reused for identical PDFs. Cache status is reported via the `X-Pdftract-Cache` response header. 
+ +**Usage:** `pdftract serve [OPTIONS]` + +###### **Options:** + +* `-b`, `--bind ` — Bind address (e.g., "127.0.0.1:8080", "[::1]:9000", "0.0.0.0:3000") + + Default value: `127.0.0.1:8080` +* `--cache-dir ` — Enable cache at this directory +* `--cache-size ` — Set cache size limit (default 1 GiB; accepts KiB, MiB, GiB suffixes) + + Default value: `1 GiB` +* `--no-cache` — Disable cache +* `--max-upload-mb ` — Maximum request body size in MB (default: 256, max: 4096) + + Default value: `256` +* `--max-decompress-gb ` — Maximum decompression size in GB (default: 1, overrides per-request max_decompress_gb) + + Default value: `1` +* `--audit-log ` — Write per-request audit log to FILE (NDJSON; use "-" for stdout, "/dev/stderr" for stderr) + + Rotation: pdftract does NOT rotate logs; configure logrotate on the audit-log file. When FILE is "-", rotation is the responsibility of the supervisor (e.g., journald). +* `--trust-forwarded-for` — Trust X-Forwarded-For header for client IP detection (DANGER: enables IP spoofing if not behind a trusted proxy) +* `--profile-dir ` — Directory containing custom profile YAML files (repeatable) +* `--profile-hot-reload` — Enable hot-reload for profiles (re-read directory on every request) + + + +## `pdftract mcp` + +Start the MCP (Model Context Protocol) server + +Per ADR-006: stdio and HTTP transports are mutually exclusive because they have opposite stdout discipline (stdio: JSON-RPC sink; HTTP: log channel). Exactly one transport must be selected per invocation. + +**Usage:** `pdftract mcp [OPTIONS]` + +###### **Options:** + +* `--stdio` — Use stdio transport (for Claude Desktop, Claude Code, Continue, Cursor) + + This is the default transport mode if neither --stdio nor --bind is specified. +* `-b`, `--bind ` — Bind address for the MCP server (e.g., "127.0.0.1:8080", "[::1]:9000", "0.0.0.0:3000") + + Enables HTTP+SSE transport mode. Mutually exclusive with --stdio. 
+* `--auth-token-file ` — Path to a file containing the bearer token (RECOMMENDED) +* `--auth-token ` — Bearer token for authentication (INSECURE: rejected unless PDFTRACT_INSECURE_CLI_TOKEN=1) +* `--max-upload-mb ` — Maximum request body size in MB (default: 256) + + Default value: `256` +* `--root ` — Root directory for local filesystem access (enforces path-traversal protection) + + When set, all local-path tool arguments are resolved relative to DIR and any path that escapes DIR is rejected with JSON-RPC error code -32602. HTTPS URLs are not affected by this flag. Without --root, the server runs in trust-the-caller mode (no path-check applied). +* `--audit-log ` — Write per-request audit log to FILE (NDJSON; use "-" for stdout, "/dev/stderr" for stderr) + + Rotation: pdftract does NOT rotate logs; configure logrotate on the audit-log file. When FILE is "-", rotation is the responsibility of the supervisor (e.g., journald). + + + +## `pdftract validate` + +Validate a JSON file against the pdftract schema + +**Usage:** `pdftract validate [OPTIONS] ` + +###### **Arguments:** + +* `` — Path to the JSON file to validate (use '-' for stdin) + +###### **Options:** + +* `-s`, `--schema ` — Path to a custom schema file (default: bundled v1.0 schema) +* `-q`, `--quiet` — Quiet mode - suppress error output (only exit code matters) + + + +## `pdftract migrate-schema` + +Migrate JSON output between schema versions + +**Usage:** `pdftract migrate-schema [OPTIONS] --from --to [INPUT]` + +###### **Arguments:** + +* `` — Input JSON file (use '-' for stdin) + + Default value: `-` + +###### **Options:** + +* `--from ` — Source schema version (e.g., "1.0", "1.1") +* `--to ` — Target schema version (e.g., "1.0", "1.1") +* `-o`, `--output ` — Output JSON file (use '-' for stdout) + + Default value: `-` +* `-p`, `--pretty` — Pretty-print output JSON + + + +## `pdftract doctor` + +Check environment health and dependencies + +Exit code policy: exits 0 if no checks FAIL (WARN does not 
affect exit code); exits 1 if any check FAILs; exits 2 on argument parse errors. + +**Usage:** `pdftract doctor [OPTIONS]` + +###### **Options:** + +* `--features` — Print compiled features and exit +* `--json` — Output results as JSON +* `--no-color` — Disable colored output +* `--exit-on-fail` — Explicit form of the default policy (exit 1 if any check FAILs). + + This flag is the default behavior and is provided for CI script readability. WARN does not affect exit code regardless of this flag. +* `--profile-dir ` — Verify the profile search path includes DIR +* `--cache-dir ` — Verify DIR is writable and has sufficient space +* `--lang ` — Requested OCR languages (default: eng) + + + +
+ + + This document was generated automatically by + clap-markdown. + -- `` - Diagnostic code to explain (e.g., STRUCT_MISSING_KEY, STREAM_BOMB) (required) + + ## Hand-Curated Content > **Note:** Any content added after this marker will be preserved
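
### Example: reading the `--pages` range syntax

The `extract` reference describes `--pages` as a 1-based, comma-separated range list such as `1-5,7,12-`. The sketch below shows one way that syntax could be expanded into concrete page numbers. It is an illustration only, not pdftract's own parser: the function name `expand_page_spec` is hypothetical, and two behaviors are assumptions rather than documented facts, namely that an open-ended range like `12-` runs to the last page and that pages past the end of the document are silently dropped.

```python
def expand_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-5,7,12-" into sorted 1-based page numbers.

    Hypothetical helper for illustration; not part of pdftract.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty segment in page spec {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)  # raises ValueError for "-5" (no start)
            # Assumption: an open-ended range like "12-" runs to the last page.
            end = int(end_text) if end_text else max(start, page_count)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid segment {part!r} in page spec {spec!r}")
        # Assumption: pages beyond the end of the document are dropped.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)
```

For a 14-page document, `expand_page_spec("1-5,7,12-", 14)` yields pages 1 through 5, page 7, and pages 12 through 14. Whether pdftract rejects or clips out-of-range pages is not stated in the reference above, so check the actual behavior before relying on either.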