# pdftract SDK Invocation Guide

How to invoke the `pdftract` binary from various languages, both via subprocess and via the HTTP server mode.

## Binary Modes Reference

```
pdftract extract <file.pdf>                    # JSON to stdout
pdftract extract <file.pdf> --text             # plain text to stdout
pdftract extract <file.pdf> --output out.json  # JSON to file
pdftract serve --port 8080                     # HTTP server: POST /extract → JSON
pdftract mcp --bind 127.0.0.1:0 --auth-token-file token.txt  # MCP server over HTTP or stdio
```

---

## Subprocess Contract

Every SDK invoking pdftract via subprocess MUST follow this contract. The contract defines the wire protocol between the SDK and the binary: argument layout, stream discipline, exit codes, and environment variable handling.

### argv Layout

The canonical form an SDK SHOULD construct:

```
pdftract <SUBCOMMAND> [GLOBAL_OPTIONS] <POSITIONAL_ARGS> [SUBCOMMAND_OPTIONS]
```

- **SUBCOMMAND**: `extract`, `serve`, `mcp`, `verify-receipt`, `inspect`
- **GLOBAL_OPTIONS**: Flags that apply to all subcommands (`--help`, `--version`, `--config PATH`)
- **POSITIONAL_ARGS**: Subcommand-specific arguments (e.g., PDF file path for `extract`)
- **SUBCOMMAND_OPTIONS**: Flags specific to the subcommand (e.g., `--text`, `--json`, `--output PATH`)

**Rules:**

1. Multi-value flags (e.g., `--profile NAME`) may be repeated; order is preserved.
2. Flag arguments MUST use `--flag=value` or `--flag value` syntax (both are accepted).
3. The PDF path is the first positional argument to `extract`. Use `-` to read PDF bytes from stdin (for remote sources or in-memory PDFs).
4. `--json` is implicit for `extract` when neither `--text` nor `--output PATH` is specified.
5. `--output PATH` writes JSON to a file; stdout contains only the path to that file on success.
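The layout and rules above can be sketched as a small argv builder. This is only an illustration: `build_extract_argv` and its parameters are hypothetical names, not part of pdftract, and the sketch uses only flags mentioned in this guide (`--text`, `--output`, `--profile`).

```python
from typing import Optional, Sequence


def build_extract_argv(
    pdf_path: str,
    *,
    text: bool = False,
    output: Optional[str] = None,
    profiles: Sequence[str] = (),
    binary: str = "pdftract",
) -> list[str]:
    """Build an argv list for `pdftract extract` (hypothetical helper)."""
    # Subcommand first, then the PDF path as the first positional argument.
    # Passing "-" asks pdftract to read the PDF bytes from stdin (rule 3).
    argv = [binary, "extract", pdf_path]
    # Multi-value flags are repeated; the caller's order is preserved (rule 1).
    for name in profiles:
        argv += ["--profile", name]
    if text:
        argv.append("--text")
    if output is not None:
        # `--flag=value` is accepted as well as `--flag value` (rule 2).
        argv.append(f"--output={output}")
    # With neither --text nor --output, JSON on stdout is implicit (rule 4).
    return argv


print(build_extract_argv("document.pdf", profiles=["scientific_paper"]))
# ['pdftract', 'extract', 'document.pdf', '--profile', 'scientific_paper']
```

Passing the result as a list to a process-spawning API (rather than joining it into a shell string) avoids quoting problems with paths that contain spaces.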
**Examples:**

```bash
# Basic extraction (JSON to stdout)
pdftract extract document.pdf

# Plain text output
pdftract extract document.pdf --text

# JSON to file (stdout contains only the file path on success)
pdftract extract document.pdf --output /tmp/result.json

# With profile and cache
pdftract extract document.pdf --profile scientific_paper --cache-dir /var/cache/pdftract

# Remote source (PDF bytes fetched via HTTP, piped to stdin)
curl -s https://example.com/doc.pdf | pdftract extract -

# Multi-format output (JSON + Markdown + plain text)
pdftract extract document.pdf --json --md --text --output-dir /tmp/outputs
```

### stdin Discipline

stdin is used for two purposes: password ingress and PDF bytes.

**Password ingress (`--password-stdin`):**

- When `--password-stdin` is present, pdftract reads **exactly one line** from stdin and uses it as the PDF password.
- The line is stripped of the trailing newline but NOT whitespace-trimmed.
- After reading the password, stdin is NOT consumed further; the PDF must be provided via a positional argument (not stdin).
- The password value is NEVER logged, appears in no diagnostic output, and is redacted from `--capture-diagnostics` archives.
- **TH-07**: `--password VALUE` on the command line is REJECTED unless `PDFTRACT_INSECURE_CLI_PASSWORD=1` is set. SDKs MUST use `--password-stdin` or `PDFTRACT_PASSWORD` instead.

**PDF bytes from stdin:**

- When the PDF path is `-`, pdftract reads the entire PDF byte stream from stdin.
- This is the canonical way to handle remote sources (HTTP-fetched PDFs) or in-memory PDFs without writing to disk.
- stdin is read to EOF; the binary does NOT prompt or interact.
- When `-` is used as the path, `--password-stdin` cannot be used simultaneously (both would consume stdin). Use `PDFTRACT_PASSWORD` instead.
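The stdin rules above imply a small decision: which channel carries the password depends on whether stdin is already carrying the PDF. A minimal Python sketch, assuming a hypothetical helper `plan_stdin` that returns the extra flags, extra environment variables, and stdin payload for one call (encoding the password as UTF-8 is an assumption, not something this guide specifies):

```python
from typing import Optional


def plan_stdin(
    pdf_bytes: Optional[bytes], password: Optional[str]
) -> tuple[list[str], dict[str, str], Optional[bytes]]:
    """Return (extra_flags, extra_env, stdin_payload) for one extract call.

    Hypothetical helper: it only encodes the rules from this section.
    """
    if pdf_bytes is not None:
        # stdin carries the PDF (path "-"), so the password cannot also use
        # stdin; fall back to the PDFTRACT_PASSWORD environment variable.
        env = {"PDFTRACT_PASSWORD": password} if password is not None else {}
        return [], env, pdf_bytes
    if password is not None:
        # stdin is free: send exactly one line. Only the trailing newline is
        # stripped by pdftract, so other whitespace is passed through as-is.
        if "\n" in password:
            raise ValueError("password must fit on a single line")
        return ["--password-stdin"], {}, (password + "\n").encode("utf-8")
    return [], {}, None


flags, env, payload = plan_stdin(None, "secret123")
print(flags, env, payload)  # ['--password-stdin'] {} b'secret123\n'
```

The caller would append `flags` to the argv, merge `env` into the child environment, and write `payload` to the child's stdin before closing it.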
**Example:**

```bash
# Password via stdin
echo "secret123" | pdftract extract --password-stdin encrypted.pdf

# Remote PDF fetched via curl, piped to pdftract
curl -s https://example.com/doc.pdf | pdftract extract -

# DO NOT DO THIS (TH-07 violation -- rejected unless opt-in):
pdftract extract encrypted.pdf --password secret123
```

### stdout Discipline

stdout carries ONLY the extraction output in structured form. NOTHING else may be written to stdout.

**`extract` subcommand:**

- In `--json` mode (default): a single JSON object conforming to `docs/schema/v1.0/pdftract.schema.json`. No trailing newlines beyond the JSON structure.
- In `--text` mode: plain text, UTF-8 encoded. Lines are separated by `\n`. No trailing metadata.
- In `--output PATH` mode: the absolute path to the output file is written to stdout on success. On error, stderr contains the diagnostic and stdout is empty.
- **Critical**: Any log line mixed into stdout breaks JSON parsing in the SDK, so the binary MUST keep stdout clean.

**`serve` / `mcp --bind` modes:**

- stdout is NOT used for request responses. HTTP responses go to the socket; MCP JSON-RPC frames go to the transport (stdio for MCP stdio mode, HTTP for MCP `--bind` mode).
- Log lines are routed to stderr via the `log` crate (see stderr discipline).

**INV-9 (MCP stdio mode):** In MCP stdio mode, stdout MUST contain ONLY JSON-RPC frames. Any non-JSON-RPC byte breaks the protocol.

### stderr Discipline

stderr carries human-readable logs, progress events, and diagnostics. The format is NOT machine-parseable (except for `--progress-json` mode, see below).

**Log levels (controlled by `RUST_LOG`):**

- `error`: Fatal errors that prevent extraction (e.g., "cannot open input file").
- `warn`: Non-fatal issues (e.g., "cache miss, extracting from PDF").
- `info` (default): High-level progress (e.g., "extracting page 5 of 10", "profile matched: scientific_paper").
- `debug`: Per-phase timing, resolved options (passwords redacted), per-page glyph/span counts.
- `trace`: Detailed phase internals (cache key derivation steps, etc.).

**Progress events (when `--progress-json` is set):**

- Each event is emitted as a single-line JSON object on stderr, newline-delimited (ndjson format).
- See `--progress-json` schema below.

**NEVER logged at any level:**

- Password values (PDF, MCP, inspector): always redacted
- Bearer-token values: always redacted
- PDF byte contents: only the SHA-256 fingerprint is logged
- Full extracted text: only span/page counts are logged
- `Cookie`, `Authorization`, or `Proxy-Authorization` HTTP headers

### Exit Code Taxonomy

pdftract follows the sysexits(3) convention. Every exit code below 64 is reserved; codes 64–78 are application-specific.

| Exit Code | Name | Meaning | TH Reference |
|-----------|------|---------|--------------|
| 0 | SUCCESS | Extraction completed successfully. | - |
| 64 | USAGE_ERROR | Invalid command-line arguments, unknown flags, conflicting options. | - |
| 65 | DATA_ERROR | Malformed PDF (cannot parse xref, trailer, or page tree). | - |
| 66 | PASSWORD_MISSING | PDF is encrypted but no password was provided. | TH-07 |
| 67 | CANNOT_OPEN_INPUT | File not found or permission denied. | - |
| 70 | INTERNAL_ERROR | Unexpected panic or bug (should never happen in production). | INV-8 |
| 73 | CANNOT_CREATE_OUTPUT | Cannot write to `--output PATH` (permission denied, disk full, etc.). | - |
| 74 | IO_ERROR | Generic I/O error (read failure, network timeout for remote source). | - |
| 75 | TEMP_FAILURE | Temporary failure; retry may succeed (e.g., remote source returned 503). | - |
| 77 | PERMISSION_DENIED | Insufficient permissions (e.g., `--root DIR` traversal blocked). | TH-02 |
| 78 | CONFIG_ERROR | Configuration error (invalid profile YAML, missing required `--auth-token` on public MCP bind). | TH-03 (line 874) |

**TH-03 (exit 78):** `pdftract mcp --bind 0.0.0.0:PORT` without `--auth-token` or `PDFTRACT_MCP_TOKEN` aborts with exit code 78 and a stderr message explaining the risk.
Loopback binds (`127.0.0.1`, `::1`) are exempt.

**TH-07 (password handling):** Using `--password VALUE` without `PDFTRACT_INSECURE_CLI_PASSWORD=1` exits with code 64 (USAGE_ERROR) and a stderr hint to use `--password-stdin` or `PDFTRACT_PASSWORD` instead.

### Environment Variable Pass-Through

The following environment variables are recognized by pdftract. SDKs SHOULD set them explicitly when the corresponding behavior is desired.

| Variable | Purpose | Secret? |
|----------|---------|---------|
| `PDFTRACT_PASSWORD` | PDF decryption password. | YES (never logged) |
| `PDFTRACT_MCP_TOKEN` | MCP server bearer token (for `--auth-token`). | YES (never logged) |
| `PDFTRACT_INSECURE_CLI_PASSWORD` | Set to `1` to allow `--password VALUE` (TH-07 opt-out). | NO |
| `PDFTRACT_INSECURE_CLI_TOKEN` | Set to `1` to allow `--auth-token VALUE`. | NO |
| `RUST_LOG` | Log level filter (e.g., `pdftract=debug`). | NO |
| `NO_COLOR` | Disable ANSI colors in stderr output. | NO |
| `XDG_CONFIG_HOME` | Base directory for profile search (overrides `~/.config`). | NO |
| `PDFTRACT_CONFIG_DIR` | Explicit profile directory path (overrides XDG default). | NO |

**Secret handling:**

- Secret-bearing variables (`PDFTRACT_PASSWORD`, `PDFTRACT_MCP_TOKEN`) are NEVER emitted in logs, diagnostics, or `--capture-diagnostics` archives.
- They are held in `secrecy::SecretString` to prevent accidental `Debug` prints.

### `--progress-json` Event Schema

When `--progress-json` is passed, pdftract emits newline-delimited JSON objects to stderr, one per event. This allows SDKs to parse progress without scraping human-readable logs.
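A consumer of this stream might look like the following Python sketch. Because human-readable log lines can share stderr with the events, it treats any line that does not parse as a JSON object with an `event` field as a log line. `split_progress` is a hypothetical name, and the log line in the sample is illustrative only.

```python
import json
from typing import Iterable


def split_progress(stderr_lines: Iterable[str]) -> tuple[list[dict], list[str]]:
    """Separate progress events from human-readable log lines."""
    events: list[dict] = []
    logs: list[str] = []
    for raw in stderr_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            logs.append(line)  # not JSON at all: a human-readable log line
            continue
        if isinstance(obj, dict) and "event" in obj:
            events.append(obj)
        else:
            logs.append(line)  # valid JSON, but not a progress event
    return events, logs


sample = [
    '{"event":"page_started","page":5,"total":10}\n',
    "extracting page 5 of 10\n",
    '{"event":"completed","duration_ms":5678,"page_count":10}\n',
]
events, logs = split_progress(sample)
print([e["event"] for e in events])  # ['page_started', 'completed']
print(logs)                          # ['extracting page 5 of 10']
```

In a real SDK the lines would come from the child's stderr pipe, read incrementally so that progress can be reported while extraction is still running.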
**Event types:**

```jsonc
// Extraction started
{"event":"open","fingerprint":"pdftract-v1:abcd...","path":"document.pdf","version":"1.0.0"}

// Page processing started
{"event":"page_started","page":5,"total":10}

// Page processing completed
{"event":"page_completed","page":5,"span_count":123,"block_count":12}

// OCR started (Phase 5.4)
{"event":"ocr_started","page":3,"engine":"tesseract","lang":"eng"}

// OCR completed
{"event":"ocr_completed","page":3,"duration_ms":1234}

// Profile matched (Phase 7.10)
{"event":"profile_matched","profile":"scientific_paper","priority":100}

// Password received (TH-07: NEVER includes the password value)
{"event":"password_received","source":"stdin"}  // or "env", "mcp_body", "form_field"

// Extraction completed successfully
{"event":"completed","duration_ms":5678,"page_count":10}

// Fatal error (extraction aborted)
{"event":"error","code":"PASSWORD_WRONG","message":"Incorrect password","exit_code":66}
```

**Parsing:**

- Each progress event occupies exactly one line of valid JSON. SDKs read stderr line by line and try to parse each line as JSON (e.g., with `JSON.parse()`).
- The `event` field discriminates the type; additional fields are event-specific.
- Human-readable log lines are still emitted to stderr intermixed with JSON lines. SDKs should filter by attempting JSON parse first; lines that fail to parse are human logs.

### `--capture-diagnostics` Archive Layout

When `--capture-diagnostics PATH` is passed, pdftract creates a diagnostic archive on error or when explicitly requested. The archive is intended to be attached to bug reports for reproduction.

**Archive formats:**

- `.zip` (default): used when the `zip` command is available.
- `.tar.gz`: fallback when `zip` is not available.
**Contained files:**

```
diagnostics-20260516-123456.zip
├── manifest.json             # Archive metadata (version, timestamp, exit code)
├── runtime_config.json       # Extraction options with secrets REDACTED
├── stderr.log                # Captured stderr (passwords REDACTED)
├── pdf_fingerprint.txt       # SHA-256 fingerprint of the input PDF
├── pdf_source_sanitized.pdf  # PDF with all text content replaced by placeholders
└── version.txt               # `pdftract --version` output
```

**`manifest.json` schema:**

```json
{
  "captured_at": "2026-05-16T12:34:56Z",
  "pdftract_version": "1.0.0",
  "exit_code": 65,
  "exit_reason": "DATA_ERROR",
  "diagnostic_codes": ["XREF_REPAIRED", "STREAM_BOMB"],
  "pdf_fingerprint": "pdftract-v1:abcd...",
  "options_redacted": true
}
```

**`runtime_config.json` schema:**

```json
{
  "subcommand": "extract",
  "args": ["document.pdf", "--profile", "scientific_paper"],
  "env": {
    "RUST_LOG": "pdftract=info",
    "PDFTRACT_PASSWORD": "",
    "PDFTRACT_MCP_TOKEN": ""
  }
}
```

**Secret scrubbing (TH-08):**

- `PDFTRACT_PASSWORD` value → `""`
- `PDFTRACT_MCP_TOKEN` value → `""`
- Full extracted text → NOT included (only span counts in stderr.log)
- PDF source → `pdf_source_sanitized.pdf` replaces all text content with placeholder glyphs (`[` / `]`) but preserves structure

**Rotation:** Archives are NOT auto-rotated. Operators MUST manage disk space manually.

---

## JSON Output Schema

```json
{
  "pages": [
    {
      "page": 1,
      "spans": [
        {
          "text": "Hello world",
          "bbox": [x0, y0, x1, y1],
          "font": "Helvetica",
          "size": 12.0,
          "confidence": 0.98
        }
      ],
      "blocks": [
        {
          "kind": "paragraph",
          "text": "Hello world",
          "bbox": [x0, y0, x1, y1]
        }
      ]
    }
  ],
  "metadata": {
    "title": "...",
    "author": "...",
    "page_count": 10
  }
}
```

---

## 1. Python

> **When to prefer subprocess:** one-off scripts, CLI pipelines, or when starting the server is not worth the overhead.
>
> **When to prefer HTTP:** long-running services, parallel extraction across many files, or when sharing a single pdftract instance across multiple workers.
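Before the invocation examples, it may help to see how the exit-code taxonomy from the Subprocess Contract could be surfaced in Python. The sketch below is one possible approach, not pdftract's own API: `PdftractError`, `EXIT_CODE_NAMES`, and `raise_for_exit` are hypothetical names, and only the codes listed in the taxonomy table are mapped.

```python
# Names copied from the exit-code table in this guide.
EXIT_CODE_NAMES = {
    0: "SUCCESS",
    64: "USAGE_ERROR",
    65: "DATA_ERROR",
    66: "PASSWORD_MISSING",
    67: "CANNOT_OPEN_INPUT",
    70: "INTERNAL_ERROR",
    73: "CANNOT_CREATE_OUTPUT",
    74: "IO_ERROR",
    75: "TEMP_FAILURE",
    77: "PERMISSION_DENIED",
    78: "CONFIG_ERROR",
}


class PdftractError(RuntimeError):
    """Hypothetical SDK exception carrying the pdftract exit code."""

    def __init__(self, exit_code: int, stderr: str = "") -> None:
        self.exit_code = exit_code
        self.name = EXIT_CODE_NAMES.get(exit_code, "UNKNOWN")
        # Only TEMP_FAILURE (75) is documented as "retry may succeed".
        self.retryable = exit_code == 75
        super().__init__(f"pdftract {self.name} (exit {exit_code}): {stderr.strip()}")


def raise_for_exit(exit_code: int, stderr: str = "") -> None:
    """Raise PdftractError for any non-zero exit code; do nothing on success."""
    if exit_code != 0:
        raise PdftractError(exit_code, stderr)


try:
    raise_for_exit(75, "remote source returned 503\n")
except PdftractError as err:
    print(err.name, err.retryable)  # TEMP_FAILURE True
```

A helper like this could replace the plain `RuntimeError` used in the examples that follow, letting callers retry on `retryable` errors and surface the rest.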
### Subprocess

```python
import subprocess
import json
import os


def extract_pdf_subprocess(pdf_path: str, password: str | None = None) -> dict:
    """Extract text from a PDF via subprocess and return the parsed JSON result.

    Args:
        pdf_path: Path to the PDF file.
        password: Optional PDF password. Passed via env var (TH-07 compliant).

    Returns:
        Parsed JSON output from pdftract.

    Raises:
        RuntimeError: If pdftract exits with a non-zero code.
    """
    env = os.environ.copy()
    if password:
        # TH-07: Pass password via env var, NOT via --password flag.
        # Using --password VALUE is rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1.
        env["PDFTRACT_PASSWORD"] = password

    result = subprocess.run(
        ["pdftract", "extract", pdf_path],
        capture_output=True,
        text=True,
        env=env,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"pdftract failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def extract_pdf_password_stdin(pdf_path: str, password: str) -> dict:
    """Extract with password via --password-stdin (TH-07 compliant).

    This is the recommended method when you cannot use env vars (e.g., in
    restricted environments where env injection is not possible).
    """
    result = subprocess.run(
        ["pdftract", "extract", "--password-stdin", pdf_path],
        input=password + "\n",  # stdin: one line containing the password
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"pdftract failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def extract_pdf_from_bytes(pdf_bytes: bytes, password: str | None = None) -> dict:
    """Extract from in-memory PDF bytes (avoids writing to disk).

    The PDF is piped to pdftract via stdin using the special '-' path.
    When using stdin for the PDF, --password-stdin cannot be used simultaneously;
    use PDFTRACT_PASSWORD env var instead.
""" env = os.environ.copy() if password: env["PDFTRACT_PASSWORD"] = password result = subprocess.run( ["pdftract", "extract", "-"], # '-' means read PDF from stdin input=pdf_bytes, capture_output=True, env=env, ) if result.returncode != 0: raise RuntimeError( f"pdftract failed (exit {result.returncode}): {result.stderr.strip()}" ) return json.loads(result.stdout) def full_text(data: dict) -> str: """Concatenate all block text across every page.""" parts = [] for page in data["pages"]: for block in page["blocks"]: parts.append(block["text"]) return "\n".join(parts) def page_text(data: dict, page_number: int) -> str: """Return concatenated block text for a single page (1-indexed).""" for page in data["pages"]: if page["page"] == page_number: return "\n".join(block["text"] for block in page["blocks"]) raise ValueError(f"Page {page_number} not found") if __name__ == "__main__": import sys pdf = sys.argv[1] # Example: extract with password # data = extract_pdf_subprocess(pdf, password="secret123") data = extract_pdf_subprocess(pdf) print(f"Title : {data['metadata'].get('title', '(none)')}") print(f"Pages : {data['metadata']['page_count']}") print() print("--- Full text ---") print(full_text(data)) print() print("--- Page 1 text ---") print(page_text(data, 1)) ``` ### HTTP (requests / httpx) ```python # pip install requests # pip install httpx # async alternative shown below import requests import json PDFTRACT_URL = "http://localhost:8080" def extract_pdf_http(pdf_path: str, password: str | None = None) -> dict: """POST a PDF file to pdftract serve and return the parsed JSON result. Args: pdf_path: Path to the PDF file. password: Optional PDF password (sent as multipart form field). Raises: requests.HTTPError: If the HTTP request fails. """ with open(pdf_path, "rb") as f: files = {"file": (pdf_path, f, "application/pdf")} data: dict[str, str] = {} if password: # TH-07: Password via form field is allowed (not exposed in ps/process list). 
            data["password"] = password
        # The request is sent while the file handle is still open.
        response = requests.post(
            f"{PDFTRACT_URL}/extract",
            files=files,
            data=data,
            timeout=60,
        )
    response.raise_for_status()
    return response.json()


def full_text(data: dict) -> str:
    parts = []
    for page in data["pages"]:
        for block in page["blocks"]:
            parts.append(block["text"])
    return "\n".join(parts)


def page_text(data: dict, page_number: int) -> str:
    for page in data["pages"]:
        if page["page"] == page_number:
            return "\n".join(block["text"] for block in page["blocks"])
    raise ValueError(f"Page {page_number} not found")


# --- Async variant with httpx ---
import asyncio
import httpx


async def extract_pdf_async(pdf_path: str) -> dict:
    async with httpx.AsyncClient(timeout=60) as client:
        with open(pdf_path, "rb") as f:
            response = await client.post(
                f"{PDFTRACT_URL}/extract",
                files={"file": (pdf_path, f, "application/pdf")},
            )
        response.raise_for_status()
        return response.json()


if __name__ == "__main__":
    import sys

    pdf = sys.argv[1]

    # Synchronous
    data = extract_pdf_http(pdf)
    print(full_text(data))

    # Asynchronous
    data = asyncio.run(extract_pdf_async(pdf))
    print(full_text(data))
```

---

## 2. Node.js / JavaScript

> **When to prefer subprocess:** build scripts, one-off tooling, or serverless functions where spinning up a child process is acceptable.
>
> **When to prefer HTTP:** Express/Fastify services, or when pdftract is deployed as a sidecar or shared microservice.

### Subprocess (child_process)

```js
// Node.js 18+ (ESM)
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

/**
 * Extract text from a PDF via subprocess.
 * @param {string} pdfPath
 * @param {string} [password] Optional PDF password (TH-07: passed via env)
 * @returns {Promise<object>} Parsed pdftract JSON
 */
async function extractPdfSubprocess(pdfPath, password) {
  const env = { ...process.env };
  if (password) {
    // TH-07: Pass password via env var, NOT via --password flag.
    // Using --password VALUE is rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1.
    env.PDFTRACT_PASSWORD = password;
  }
  const { stdout } = await execFileAsync("pdftract", ["extract", pdfPath], {
    env,
  }).catch((err) => {
    throw new Error(`pdftract failed (exit ${err.code}): ${err.stderr}`);
  });
  return JSON.parse(stdout);
}

/**
 * Extract with password via --password-stdin (TH-07 compliant).
 * @param {string} pdfPath
 * @param {string} password
 * @returns {Promise<object>}
 */
async function extractPdfPasswordStdin(pdfPath, password) {
  // Uses the execFile imported at the top of this module; require() is not
  // available in ESM files.
  return new Promise((resolve, reject) => {
    const proc = execFile("pdftract", ["extract", "--password-stdin", pdfPath]);
    let stdout = "";
    let stderr = "";
    proc.stdout.on("data", (data) => {
      stdout += data;
    });
    proc.stderr.on("data", (data) => {
      stderr += data;
    });
    proc.on("close", (code) => {
      if (code !== 0) {
        reject(new Error(`pdftract failed (exit ${code}): ${stderr}`));
      } else {
        resolve(JSON.parse(stdout));
      }
    });
    // Write password to stdin, followed by newline
    proc.stdin.write(password + "\n");
    proc.stdin.end();
  });
}

/** Concatenate all block text across every page. */
function fullText(data) {
  return data.pages
    .flatMap((page) => page.blocks.map((b) => b.text))
    .join("\n");
}

/** Return concatenated block text for a single page (1-indexed). */
function pageText(data, pageNumber) {
  const page = data.pages.find((p) => p.page === pageNumber);
  if (!page) throw new Error(`Page ${pageNumber} not found`);
  return page.blocks.map((b) => b.text).join("\n");
}

// Usage
const data = await extractPdfSubprocess(process.argv[2]);
console.log("Title :", data.metadata.title ?? "(none)");
console.log("Pages :", data.metadata.page_count);
console.log("\n--- Full text ---");
console.log(fullText(data));
console.log("\n--- Page 1 ---");
console.log(pageText(data, 1));
```

### HTTP (native fetch)

```js
// Node.js 18+: fetch is available globally; no extra dependencies required.
import { readFile } from "node:fs/promises";

const PDFTRACT_URL = "http://localhost:8080";

/**
 * POST a PDF to pdftract serve.
 * @param {string} pdfPath
 * @param {string} [password] Optional PDF password (sent as form field)
 * @returns {Promise<object>} Parsed pdftract JSON
 */
async function extractPdfHttp(pdfPath, password) {
  const bytes = await readFile(pdfPath);
  const blob = new Blob([bytes], { type: "application/pdf" });
  const form = new FormData();
  form.append("file", blob, pdfPath);
  if (password) {
    // TH-07: Password via form field is allowed.
    form.append("password", password);
  }
  const res = await fetch(`${PDFTRACT_URL}/extract`, {
    method: "POST",
    body: form,
  });
  if (!res.ok) {
    const body = await res.text();
    throw new Error(`pdftract HTTP ${res.status}: ${body}`);
  }
  return res.json();
}

function fullText(data) {
  return data.pages
    .flatMap((page) => page.blocks.map((b) => b.text))
    .join("\n");
}

function pageText(data, pageNumber) {
  const page = data.pages.find((p) => p.page === pageNumber);
  if (!page) throw new Error(`Page ${pageNumber} not found`);
  return page.blocks.map((b) => b.text).join("\n");
}

// Usage
const data = await extractPdfHttp(process.argv[2]);
console.log(fullText(data));
```

---

## 3. Go

> **When to prefer subprocess:** CLI utilities or single-binary deployments where you want zero network overhead.
>
> **When to prefer HTTP:** Go services handling concurrent requests. Spin up pdftract serve once and hit it from multiple goroutines.

### Subprocess (os/exec)

```go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
    "os/exec"
    "strings"
)

// extractSubprocess runs `pdftract extract <pdfPath>` and returns the parsed result.
// If password is non-empty, it is passed via PDFTRACT_PASSWORD env var (TH-07 compliant).
func extractSubprocess(pdfPath string, password string) (*PDFTractResult, error) {
    cmd := exec.Command("pdftract", "extract", pdfPath)
    if password != "" {
        // TH-07: Pass password via env var, NOT via --password flag.
        cmd.Env = append(os.Environ(), "PDFTRACT_PASSWORD="+password)
    }
    out, err := cmd.Output()
    if err != nil {
        if exitErr, ok := err.(*exec.ExitError); ok {
            return nil, fmt.Errorf("pdftract failed: %s", string(exitErr.Stderr))
        }
        return nil, fmt.Errorf("exec error: %w", err)
    }
    var result PDFTractResult
    if err := json.Unmarshal(out, &result); err != nil {
        return nil, fmt.Errorf("json parse error: %w", err)
    }
    return &result, nil
}

type Span struct {
    Text       string     `json:"text"`
    BBox       [4]float64 `json:"bbox"`
    Font       string     `json:"font"`
    Size       float64    `json:"size"`
    Confidence float64    `json:"confidence"`
}

type Block struct {
    Kind string     `json:"kind"`
    Text string     `json:"text"`
    BBox [4]float64 `json:"bbox"`
}

type Page struct {
    Page   int     `json:"page"`
    Spans  []Span  `json:"spans"`
    Blocks []Block `json:"blocks"`
}

type Metadata struct {
    Title     string `json:"title"`
    Author    string `json:"author"`
    PageCount int    `json:"page_count"`
}

type PDFTractResult struct {
    Pages    []Page   `json:"pages"`
    Metadata Metadata `json:"metadata"`
}

// FullText concatenates all block text across every page.
func (r *PDFTractResult) FullText() string {
    var sb strings.Builder
    for _, page := range r.Pages {
        for _, block := range page.Blocks {
            sb.WriteString(block.Text)
            sb.WriteByte('\n')
        }
    }
    return sb.String()
}

// PageText returns concatenated block text for a single page (1-indexed).
func (r *PDFTractResult) PageText(pageNumber int) (string, error) {
    for _, page := range r.Pages {
        if page.Page == pageNumber {
            var sb strings.Builder
            for _, block := range page.Blocks {
                sb.WriteString(block.Text)
                sb.WriteByte('\n')
            }
            return sb.String(), nil
        }
    }
    return "", fmt.Errorf("page %d not found", pageNumber)
}

func main() {
    if len(os.Args) < 2 {
        log.Fatal("usage: program <file.pdf>")
    }
    // An empty password means the PDF is not encrypted.
    result, err := extractSubprocess(os.Args[1], "")
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }
    fmt.Printf("Title : %s\n", result.Metadata.Title)
    fmt.Printf("Pages : %d\n", result.Metadata.PageCount)
    fmt.Println("\n--- Full text ---")
    fmt.Println(result.FullText())
    p1, err := result.PageText(1)
    if err != nil {
        log.Printf("page 1: %v", err)
    } else {
        fmt.Println("--- Page 1 ---")
        fmt.Println(p1)
    }
}
```

### HTTP (net/http)

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "mime/multipart"
    "net/http"
    "os"
    "path/filepath"
)

const pdftractURL = "http://localhost:8080"

// PDFTractResult and its FullText method are the types defined in the
// subprocess example above; include them in this file to compile it.

// extractHTTP POSTs a PDF file to pdftract serve.
// If password is non-empty, it is sent as a multipart form field (TH-07 compliant).
func extractHTTP(pdfPath string, password string) (*PDFTractResult, error) {
    f, err := os.Open(pdfPath)
    if err != nil {
        return nil, fmt.Errorf("open file: %w", err)
    }
    defer f.Close()

    var buf bytes.Buffer
    mw := multipart.NewWriter(&buf)
    part, err := mw.CreateFormFile("file", filepath.Base(pdfPath))
    if err != nil {
        return nil, fmt.Errorf("create form file: %w", err)
    }
    if _, err := io.Copy(part, f); err != nil {
        return nil, fmt.Errorf("copy file: %w", err)
    }
    if password != "" {
        // TH-07: Password via form field is allowed.
        err = mw.WriteField("password", password)
        if err != nil {
            return nil, fmt.Errorf("write password field: %w", err)
        }
    }
    // Close writes the final multipart boundary; check its error too.
    if err := mw.Close(); err != nil {
        return nil, fmt.Errorf("close multipart writer: %w", err)
    }

    resp, err := http.Post(
        pdftractURL+"/extract",
        mw.FormDataContentType(),
        &buf,
    )
    if err != nil {
        return nil, fmt.Errorf("http post: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        body, _ := io.ReadAll(resp.Body)
        return nil, fmt.Errorf("pdftract HTTP %d: %s", resp.StatusCode, body)
    }
    var result PDFTractResult
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil, fmt.Errorf("json decode: %w", err)
    }
    return &result, nil
}

func main() {
    if len(os.Args) < 2 {
        log.Fatal("usage: program <file.pdf>")
    }
    // An empty password means the PDF is not encrypted.
    result, err := extractHTTP(os.Args[1], "")
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }
    fmt.Println(result.FullText())
}
```

---

## 4. Ruby

> **When to prefer subprocess:** Rake tasks, standalone scripts, or Rails background jobs without a persistent pdftract process.
>
> **When to prefer HTTP:** Sidekiq workers or Rails requests. Keep pdftract serve running as a separate process and hit it over loopback.

### Subprocess (Open3)

```ruby
require "open3"
require "json"

# Extract text from a PDF via subprocess.
# Returns a Hash parsed from pdftract's JSON output.
# If password is provided, it is passed via env var (TH-07 compliant).
def extract_pdf_subprocess(pdf_path, password: nil)
  env = {}
  env["PDFTRACT_PASSWORD"] = password if password
  stdout, stderr, status = Open3.capture3(
    env, "pdftract", "extract", pdf_path
  )
  unless status.success?
    raise "pdftract failed (exit #{status.exitstatus}): #{stderr.strip}"
  end
  JSON.parse(stdout)
end

# Extract with password via --password-stdin (TH-07 compliant).
def extract_pdf_password_stdin(pdf_path, password)
  # Pass password via stdin; Open3 with :stdin_data is the cleanest way.
  stdout, stderr, status = Open3.capture3(
    "pdftract", "extract", "--password-stdin", pdf_path,
    stdin_data: password + "\n"
  )
  unless status.success?
    raise "pdftract failed (exit #{status.exitstatus}): #{stderr.strip}"
  end
  JSON.parse(stdout)
end

# Concatenate all block text across every page.
def full_text(data)
  data["pages"]
    .flat_map { |page| page["blocks"].map { |b| b["text"] } }
    .join("\n")
end

# Return concatenated block text for a single page (1-indexed).
def page_text(data, page_number)
  page = data["pages"].find { |p| p["page"] == page_number }
  raise "Page #{page_number} not found" unless page
  page["blocks"].map { |b| b["text"] }.join("\n")
end

# Usage
pdf_path = ARGV[0] || raise("Usage: ruby extract.rb <file.pdf>")
data = extract_pdf_subprocess(pdf_path)
puts "Title : #{data.dig("metadata", "title") || "(none)"}"
puts "Pages : #{data.dig("metadata", "page_count")}"
puts
puts "--- Full text ---"
puts full_text(data)
puts
puts "--- Page 1 ---"
puts page_text(data, 1)
```

### HTTP (net/http)

```ruby
require "net/http"
require "json"

PDFTRACT_URL = URI("http://localhost:8080/extract")

# POST a PDF file to pdftract serve.
# If password is provided, it is sent as a multipart form field (TH-07 compliant).
def extract_pdf_http(pdf_path, password: nil)
  boundary = "----pdftract#{rand(0xFFFFFF).to_s(16)}"
  # Explicit `password: password` keeps this compatible with Ruby older than 3.1.
  body = build_multipart(pdf_path, boundary, password: password)

  http = Net::HTTP.new(PDFTRACT_URL.host, PDFTRACT_URL.port)
  http.read_timeout = 60
  request = Net::HTTP::Post.new(PDFTRACT_URL.path)
  request["Content-Type"] = "multipart/form-data; boundary=#{boundary}"
  request.body = body

  response = http.request(request)
  raise "pdftract HTTP #{response.code}: #{response.body}" unless response.is_a?(Net::HTTPSuccess)
  JSON.parse(response.body)
end

def build_multipart(pdf_path, boundary, password: nil)
  crlf = "\r\n"
  pdf_data = File.binread(pdf_path)
  filename = File.basename(pdf_path)
  parts = [
    "--#{boundary}#{crlf}",
    "Content-Disposition: form-data; name=\"file\"; filename=\"#{filename}\"#{crlf}",
    "Content-Type: application/pdf#{crlf}",
    crlf,
    pdf_data,
  ]
  if password
    # TH-07: Password via form field is allowed.
    parts.concat([
      "#{crlf}--#{boundary}#{crlf}",
      "Content-Disposition: form-data; name=\"password\"#{crlf}",
      crlf,
      password,
    ])
  end
  parts.concat([
    "#{crlf}--#{boundary}--#{crlf}",
  ])
  # Join as binary strings so non-ASCII passwords or filenames do not clash
  # with the raw PDF bytes (Encoding::CompatibilityError).
  parts.map(&:b).join
end

def full_text(data)
  data["pages"]
    .flat_map { |page| page["blocks"].map { |b| b["text"] } }
    .join("\n")
end

def page_text(data, page_number)
  page = data["pages"].find { |p| p["page"] == page_number }
  raise "Page #{page_number} not found" unless page
  page["blocks"].map { |b| b["text"] }.join("\n")
end

# Usage
pdf_path = ARGV[0] || raise("Usage: ruby extract_http.rb <file.pdf>")
data = extract_pdf_http(pdf_path)
puts full_text(data)
```

---

## 5. Java

> **When to prefer subprocess:** batch jobs or standalone utilities. ProcessBuilder is simple and avoids a network stack.
>
> **When to prefer HTTP:** Spring Boot services or multi-threaded apps. pdftract serve handles concurrent requests, while subprocess creates a new process per call.

Requires Java 11+. Process handling and HTTP use only the standard library; the examples parse JSON with Jackson (`jackson-databind`), which is the one external dependency.

### Subprocess (ProcessBuilder)

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Invokes pdftract via subprocess and parses the JSON result.
 *
 * Dependency (Maven):
 *   <dependency>
 *     <groupId>com.fasterxml.jackson.core</groupId>
 *     <artifactId>jackson-databind</artifactId>
 *     <version>2.17.0</version>
 *   </dependency>
 *
 * If you prefer no dependencies, replace ObjectMapper with org.json or
 * a manual string parse; the structure is straightforward.
 */
public class PdftractSubprocess {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Extract text from a PDF.
     * @param pdfPath Path to the PDF file.
     * @param password Optional PDF password (TH-07: passed via env var).
     */
    public static JsonNode extract(String pdfPath, String password) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("pdftract", "extract", pdfPath);
        pb.redirectErrorStream(false); // keep stderr separate
        if (password != null && !password.isEmpty()) {
            // TH-07: Pass password via env var, NOT via --password flag.
            Map<String, String> env = pb.environment();
            env.put("PDFTRACT_PASSWORD", password);
        }
        Process process = pb.start();
        // Note: stdout is read to EOF before stderr. If pdftract writes a lot to
        // stderr (verbose log levels), drain stderr on a separate thread instead.
        byte[] stdout = process.getInputStream().readAllBytes();
        byte[] stderr = process.getErrorStream().readAllBytes();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException(
                "pdftract failed (exit " + exit + "): " + new String(stderr).strip()
            );
        }
        return MAPPER.readTree(stdout);
    }

    /** Concatenate all block text across every page. */
    public static String fullText(JsonNode data) {
        List<String> parts = new ArrayList<>();
        for (JsonNode page : data.get("pages")) {
            for (JsonNode block : page.get("blocks")) {
                parts.add(block.get("text").asText());
            }
        }
        return String.join("\n", parts);
    }

    /** Return concatenated block text for a single page (1-indexed).
     */
    public static String pageText(JsonNode data, int pageNumber) {
        for (JsonNode page : data.get("pages")) {
            if (page.get("page").asInt() == pageNumber) {
                List<String> parts = new ArrayList<>();
                for (JsonNode block : page.get("blocks")) {
                    parts.add(block.get("text").asText());
                }
                return String.join("\n", parts);
            }
        }
        throw new IllegalArgumentException("Page " + pageNumber + " not found");
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: PdftractSubprocess <file.pdf>");
            System.exit(1);
        }

        // Pass null when the PDF is not password-protected.
        JsonNode data = extract(args[0], null);
        JsonNode meta = data.get("metadata");

        System.out.println("Title : " + meta.path("title").asText("(none)"));
        System.out.println("Pages : " + meta.get("page_count").asInt());
        System.out.println("\n--- Full text ---");
        System.out.println(fullText(data));
        System.out.println("\n--- Page 1 ---");
        System.out.println(pageText(data, 1));
    }
}
```

### HTTP (java.net.http.HttpClient, Java 11+)

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class PdftractHttp {

    private static final String PDFTRACT_URL = "http://localhost:8080";
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final HttpClient CLIENT = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .build();

    /**
     * Extract text from a PDF via HTTP.
     * @param pdfPath Path to the PDF file.
     * @param password Optional PDF password (TH-07: sent as form field).
     */
    public static JsonNode extract(String pdfPath, String password)
            throws IOException, InterruptedException {
        Path path = Path.of(pdfPath);
        byte[] pdfBytes = Files.readAllBytes(path);
        String filename = path.getFileName().toString();
        String boundary = UUID.randomUUID().toString().replace("-", "");

        // Build multipart/form-data body manually (no external library needed)
        String crlf = "\r\n";
        StringBuilder bodyBuilder = new StringBuilder();

        // File part
        bodyBuilder.append("--").append(boundary).append(crlf);
        bodyBuilder.append("Content-Disposition: form-data; name=\"file\"; filename=\"")
            .append(filename).append("\"").append(crlf);
        bodyBuilder.append("Content-Type: application/pdf").append(crlf);
        bodyBuilder.append(crlf);

        byte[] headerBytes = bodyBuilder.toString().getBytes(StandardCharsets.UTF_8);
        byte[] footerBytes = (crlf + "--" + boundary + "--" + crlf).getBytes(StandardCharsets.UTF_8);

        byte[] passwordBytes = new byte[0];
        if (password != null && !password.isEmpty()) {
            // TH-07: Password via form field is allowed.
            String passwordPart = crlf + "--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"password\"" + crlf
                + crlf
                + password;
            passwordBytes = passwordPart.getBytes(StandardCharsets.UTF_8);
        }

        // Assemble: file-part header, PDF bytes, optional password part, closing boundary.
        byte[] body = new byte[headerBytes.length + pdfBytes.length
            + passwordBytes.length + footerBytes.length];
        int pos = 0;
        System.arraycopy(headerBytes, 0, body, pos, headerBytes.length);
        pos += headerBytes.length;
        System.arraycopy(pdfBytes, 0, body, pos, pdfBytes.length);
        pos += pdfBytes.length;
        System.arraycopy(passwordBytes, 0, body, pos, passwordBytes.length);
        pos += passwordBytes.length;
        System.arraycopy(footerBytes, 0, body, pos, footerBytes.length);

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(PDFTRACT_URL + "/extract"))
            .timeout(Duration.ofSeconds(60))
            .header("Content-Type", "multipart/form-data; boundary=" + boundary)
            .POST(HttpRequest.BodyPublishers.ofByteArray(body))
            .build();

        HttpResponse<String> response = CLIENT.send(
            request, HttpResponse.BodyHandlers.ofString()
        );

        if (response.statusCode() != 200) {
            throw new IOException(
                "pdftract HTTP " + response.statusCode() + ": " + response.body()
            );
        }

        return MAPPER.readTree(response.body());
    }

    public static String fullText(JsonNode data) {
        List<String> parts = new ArrayList<>();
        for (JsonNode page : data.get("pages")) {
            for (JsonNode block : page.get("blocks")) {
                parts.add(block.get("text").asText());
            }
        }
        return String.join("\n", parts);
    }

    public static String pageText(JsonNode data, int pageNumber) {
        for (JsonNode page : data.get("pages")) {
            if (page.get("page").asInt() == pageNumber) {
                List<String> parts = new ArrayList<>();
                for (JsonNode block : page.get("blocks")) {
                    parts.add(block.get("text").asText());
                }
                return String.join("\n", parts);
            }
        }
        throw new IllegalArgumentException("Page " + pageNumber + " not found");
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: PdftractHttp <file.pdf>");
            System.exit(1);
        }
        // Pass null when the PDF is not password-protected.
        JsonNode data = extract(args[0], null);
        System.out.println(fullText(data));
    }
}
```

---

## 6. Rust

> **When to prefer subprocess:** CLI tools or single-threaded batch processors. No dependencies beyond `serde` and `serde_json`.
>
> **When to prefer HTTP:** Async Tokio services. `reqwest` is non-blocking and naturally fits async Rust workloads.

### Subprocess (std::process::Command)

Add to `Cargo.toml`:

```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
serde_json = "1"
```

```rust
use serde::Deserialize;
use std::process::Command;

#[derive(Debug, Deserialize)]
struct Span {
    pub text: String,
    pub bbox: [f64; 4],
    pub font: String,
    pub size: f64,
    pub confidence: f64,
}

#[derive(Debug, Deserialize)]
struct Block {
    pub kind: String,
    pub text: String,
    pub bbox: [f64; 4],
}

#[derive(Debug, Deserialize)]
struct Page {
    pub page: u32,
    pub spans: Vec<Span>,
    pub blocks: Vec<Block>,
}

#[derive(Debug, Deserialize)]
struct Metadata {
    pub title: Option<String>,
    pub author: Option<String>,
    pub page_count: u32,
}

#[derive(Debug, Deserialize)]
struct PdftractResult {
    pub pages: Vec<Page>,
    pub metadata: Metadata,
}

impl PdftractResult {
    /// Concatenate all block text across every page.
    pub fn full_text(&self) -> String {
        self.pages
            .iter()
            .flat_map(|p| p.blocks.iter().map(|b| b.text.as_str()))
            .collect::<Vec<_>>()
            .join("\n")
    }

    /// Return concatenated block text for a single page (1-indexed).
    pub fn page_text(&self, page_number: u32) -> Option<String> {
        self.pages
            .iter()
            .find(|p| p.page == page_number)
            .map(|p| {
                p.blocks
                    .iter()
                    .map(|b| b.text.as_str())
                    .collect::<Vec<_>>()
                    .join("\n")
            })
    }
}

/// Extract text from a PDF via subprocess.
/// If password is provided, it is passed via env var (TH-07 compliant).
fn extract_subprocess(
    pdf_path: &str,
    password: Option<&str>,
) -> Result<PdftractResult, Box<dyn std::error::Error>> {
    let mut cmd = Command::new("pdftract");
    cmd.args(["extract", pdf_path]);

    if let Some(pwd) = password {
        // TH-07: Pass password via env var, NOT via --password flag.
        cmd.env("PDFTRACT_PASSWORD", pwd);
    }

    let output = cmd.output()?;

    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(format!(
            "pdftract failed (exit {:?}): {}",
            output.status.code(),
            stderr.trim()
        )
        .into());
    }

    let result: PdftractResult = serde_json::from_slice(&output.stdout)?;
    Ok(result)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pdf_path = std::env::args()
        .nth(1)
        .ok_or("usage: program <file.pdf>")?;

    // Pass None when the PDF is not password-protected.
    let result = extract_subprocess(&pdf_path, None)?;

    println!("Title : {}", result.metadata.title.as_deref().unwrap_or("(none)"));
    println!("Pages : {}", result.metadata.page_count);
    println!("\n--- Full text ---");
    println!("{}", result.full_text());

    if let Some(text) = result.page_text(1) {
        println!("\n--- Page 1 ---");
        println!("{text}");
    }

    Ok(())
}
```

### HTTP (reqwest)

Add to `Cargo.toml` (the `json` feature is needed for `response.json()`):

```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
serde_json = "1"
reqwest = { version = "0.12", features = ["multipart", "json"] }
tokio = { version = "1", features = ["full"] }
```

```rust
use reqwest::multipart;
use serde::Deserialize;
use std::path::Path;

// Re-use the same structs from the subprocess example above.
// (PdftractResult, Page, Block, Span, Metadata; copy them in)

const PDFTRACT_URL: &str = "http://localhost:8080";

/// Extract text from a PDF via HTTP.
/// If password is provided, it is sent as a multipart form field (TH-07 compliant).
async fn extract_http(
    pdf_path: &str,
    password: Option<&str>,
) -> Result<PdftractResult, Box<dyn std::error::Error>> {
    let bytes = tokio::fs::read(pdf_path).await?;
    let filename = Path::new(pdf_path)
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("document.pdf")
        .to_owned();

    let mut form = multipart::Form::new();
    let file_part = multipart::Part::bytes(bytes)
        .file_name(filename)
        .mime_str("application/pdf")?;
    form = form.part("file", file_part);

    if let Some(pwd) = password {
        // TH-07: Password via form field is allowed.
        form = form.text("password", pwd.to_string());
    }

    let client = reqwest::Client::new();
    let response = client
        .post(format!("{PDFTRACT_URL}/extract"))
        .multipart(form)
        .timeout(std::time::Duration::from_secs(60))
        .send()
        .await?;

    if !response.status().is_success() {
        let status = response.status();
        let body = response.text().await.unwrap_or_default();
        return Err(format!("pdftract HTTP {status}: {body}").into());
    }

    let result: PdftractResult = response.json().await?;
    Ok(result)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pdf_path = std::env::args()
        .nth(1)
        .ok_or("usage: program <file.pdf>")?;

    // Pass None when the PDF is not password-protected.
    let result = extract_http(&pdf_path, None).await?;

    println!("{}", result.full_text());

    if let Some(text) = result.page_text(1) {
        println!("\n--- Page 1 ---");
        println!("{text}");
    }

    Ok(())
}
```

---

## Parsing `--progress-json` Events

When `--progress-json` is passed, pdftract emits newline-delimited JSON objects to stderr. SDKs can parse these events to show progress bars, detect errors early, or log structured diagnostics. In each example below, stdout is drained while stderr is being read, so a large JSON result cannot fill the stdout pipe and stall the child process.

### Python

```python
import json
import subprocess
import threading
from typing import Any

ProgressEvent = dict[str, Any]


def extract_with_progress(pdf_path: str) -> dict:
    """Extract while parsing progress events from stderr."""
    cmd = ["pdftract", "extract", "--progress-json", pdf_path]

    # stderr is line-buffered; each line is either JSON or a human log.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )

    # Drain stdout on a background thread so the child never blocks on a
    # full stdout pipe while we are still iterating over stderr.
    stdout_chunks: list[str] = []
    stdout_reader = threading.Thread(
        target=lambda: stdout_chunks.append(process.stdout.read())
    )
    stdout_reader.start()

    for line in process.stderr:
        line = line.rstrip("\n")
        if not line:
            continue

        # Try to parse as JSON; if it fails, it's a human log line.
        try:
            event: ProgressEvent = json.loads(line)
            event_type = event.get("event")

            if event_type == "open":
                print(f"Opening {event['path']} (fingerprint: {event['fingerprint'][:16]}...)")
            elif event_type == "page_started":
                print(f"Page {event['page']}/{event['total']}...")
            elif event_type == "page_completed":
                print(f"  → {event['span_count']} spans, {event['block_count']} blocks")
            elif event_type == "ocr_started":
                print(f"  OCR (page {event['page']}, lang={event['lang']})...")
            elif event_type == "ocr_completed":
                print(f"  OCR done in {event['duration_ms']}ms")
            elif event_type == "profile_matched":
                print(f"Profile: {event['profile']} (priority {event['priority']})")
            elif event_type == "password_received":
                # TH-07: The password value is NEVER in the event.
                print(f"Password received via {event['source']}")
            elif event_type == "completed":
                print(f"Done in {event['duration_ms']}ms, {event['page_count']} pages")
            elif event_type == "error":
                print(f"Error: {event['code']} - {event['message']}")
        except json.JSONDecodeError:
            # Human-readable log line (optional: ignore or log to file)
            print(f"[log] {line}")

    stdout_reader.join()
    returncode = process.wait()

    if returncode != 0:
        raise RuntimeError(f"pdftract failed with exit {returncode}")

    return json.loads("".join(stdout_chunks))
```

### Node.js

```js
import { spawn } from "node:child_process";
import { createInterface } from "node:readline";

async function extractWithProgress(pdfPath) {
  // spawn streams both pipes and has no maxBuffer cap, unlike execFile.
  const proc = spawn("pdftract", ["extract", "--progress-json", pdfPath]);

  let stdout = "";
  proc.stdout.setEncoding("utf8");
  proc.stdout.on("data", (d) => { stdout += d; });

  // readline yields whole lines even when an event is split across chunks.
  const lines = createInterface({ input: proc.stderr });
  lines.on("line", (line) => {
    if (!line.trim()) return;

    try {
      const event = JSON.parse(line);

      switch (event.event) {
        case "open":
          console.log(`Opening ${event.path}`);
          break;
        case "page_started":
          console.log(`Page ${event.page}/${event.total}...`);
          break;
        case "page_completed":
          console.log(`  → ${event.span_count} spans, ${event.block_count} blocks`);
          break;
        case "ocr_started":
          console.log(`  OCR (page ${event.page}, lang=${event.lang})...`);
          break;
        case "ocr_completed":
          console.log(`  OCR done in ${event.duration_ms}ms`);
          break;
        case "profile_matched":
          console.log(`Profile: ${event.profile} (priority ${event.priority})`);
          break;
        case "password_received":
          console.log(`Password received via ${event.source}`);
          break;
        case "completed":
          console.log(`Done in ${event.duration_ms}ms, ${event.page_count} pages`);
          break;
        case "error":
          console.error(`Error: ${event.code} - ${event.message}`);
          break;
      }
    } catch (e) {
      // Not JSON: human log line
      console.log(`[log] ${line}`);
    }
  });

  return new Promise((resolve, reject) => {
    proc.on("error", reject); // e.g. binary not found
    proc.on("close", (code) => {
      if (code !== 0) {
        reject(new Error(`pdftract failed with exit ${code}`));
      } else {
        resolve(JSON.parse(stdout));
      }
    });
  });
}
```

### Rust

```rust
use serde_json::Value;
use std::io::{BufRead, BufReader, Read};
use std::process::{Command, Stdio};
use std::thread;

// PdftractResult is the struct defined in the Rust subprocess example.
fn extract_with_progress(
    pdf_path: &str,
) -> Result<PdftractResult, Box<dyn std::error::Error>> {
    let mut child = Command::new("pdftract")
        .args(["extract", "--progress-json", pdf_path])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Drain stdout on a separate thread so the child never blocks on a full
    // stdout pipe while we are still reading stderr.
    let mut stdout = child.stdout.take().expect("stdout is piped");
    let stdout_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        stdout.read_to_end(&mut buf).map(|_| buf)
    });

    let stderr = child.stderr.take().expect("stderr is piped");
    let reader = BufReader::new(stderr);

    for line in reader.lines() {
        let line = line?;
        if line.is_empty() {
            continue;
        }

        // Try to parse as JSON
        if let Ok(event) = serde_json::from_str::<Value>(&line) {
            let event_type = event.get("event").and_then(|v| v.as_str());

            match event_type {
                Some("open") => {
                    let path = event.get("path").and_then(|v| v.as_str()).unwrap_or("?");
                    println!("Opening {}", path);
                }
                Some("page_started") => {
                    let page = event.get("page").and_then(|v| v.as_u64()).unwrap_or(0);
                    let total = event.get("total").and_then(|v| v.as_u64()).unwrap_or(0);
                    println!("Page {}/{}...", page, total);
                }
                Some("page_completed") => {
                    let spans = event.get("span_count").and_then(|v| v.as_u64()).unwrap_or(0);
                    let blocks = event.get("block_count").and_then(|v| v.as_u64()).unwrap_or(0);
                    println!("  → {} spans, {} blocks", spans, blocks);
                }
                Some("ocr_started") => {
                    let page = event.get("page").and_then(|v| v.as_u64()).unwrap_or(0);
                    let lang = event.get("lang").and_then(|v| v.as_str()).unwrap_or("?");
                    println!("  OCR (page {}, lang={})...", page, lang);
                }
                Some("ocr_completed") => {
                    let ms = event.get("duration_ms").and_then(|v| v.as_u64()).unwrap_or(0);
                    println!("  OCR done in {}ms", ms);
                }
                Some("profile_matched") => {
                    let profile = event.get("profile").and_then(|v| v.as_str()).unwrap_or("?");
                    let priority = event.get("priority").and_then(|v| v.as_u64()).unwrap_or(0);
                    println!("Profile: {} (priority {})", profile, priority);
                }
                Some("password_received") => {
                    let source = event.get("source").and_then(|v| v.as_str()).unwrap_or("?");
                    println!("Password received via {}", source);
                }
                Some("completed") => {
                    let ms = event.get("duration_ms").and_then(|v| v.as_u64()).unwrap_or(0);
                    let pages = event.get("page_count").and_then(|v| v.as_u64()).unwrap_or(0);
                    println!("Done in {}ms, {} pages", ms, pages);
                }
                Some("error") => {
                    let code = event.get("code").and_then(|v| v.as_str()).unwrap_or("?");
                    let msg = event.get("message").and_then(|v| v.as_str()).unwrap_or("?");
                    eprintln!("Error: {} - {}", code, msg);
                }
                _ => {
                    // Valid JSON but an unknown or missing event type
                    println!("[log] {}", line);
                }
            }
        } else {
            // Not JSON: human log line
            println!("[log] {}", line);
        }
    }

    let status = child.wait()?;
    let stdout_bytes = stdout_reader
        .join()
        .map_err(|_| "stdout reader thread panicked")??;

    if !status.success() {
        // stderr was already consumed line by line above, so report the status.
        return Err(format!("pdftract failed ({status})").into());
    }

    let result: PdftractResult = serde_json::from_slice(&stdout_bytes)?;
    Ok(result)
}
```

---

## 7. Shell / Bash

> **When to prefer direct invocation:** shell scripts, cron jobs, CI pipelines, or any context where you have direct access to the binary.
>
> **When to prefer curl:** when pdftract is running as a shared service on another host, inside a container, or when you want to avoid installing the binary locally.
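The choice between the two modes can also be made at runtime. The following is a minimal sketch, not part of pdftract: `pdftract_extract` is a hypothetical helper name, and the `PDFTRACT_URL` default is assumed to match a locally running `pdftract serve --port 8080`. It calls the binary when one is on `PATH` and otherwise posts the file to `/extract`.

```bash
#!/usr/bin/env bash
# Sketch: pick direct invocation or HTTP depending on what is available.
# pdftract_extract is a hypothetical helper, not something pdftract ships.
pdftract_extract() {
  local pdf="$1"
  if command -v pdftract >/dev/null 2>&1; then
    # Binary is on PATH: invoke it directly (JSON to stdout).
    pdftract extract "$pdf"
  else
    # No local binary: POST the file to a running `pdftract serve` instance.
    curl --silent --show-error --fail --max-time 60 \
      -F "file=@${pdf};type=application/pdf" \
      "${PDFTRACT_URL:-http://localhost:8080}/extract"
  fi
}
```

Both branches write JSON to stdout, so a caller can pipe the result into `jq` without knowing which path was taken, for example `pdftract_extract report.pdf | jq -r '.metadata.page_count'`.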
### Direct Invocation

```bash
#!/usr/bin/env bash
set -euo pipefail

PDF="${1:?Usage: $0 <file.pdf>}"

# --- JSON output ---
json=$(pdftract extract "$PDF")

# Full text via jq: collect all block text across all pages
full_text=$(echo "$json" | jq -r '[.pages[].blocks[].text] | join("\n")')

# Per-page text (page 1)
page1_text=$(echo "$json" | jq -r '.pages[] | select(.page == 1) | [.blocks[].text] | join("\n")')

# Metadata
title=$(echo "$json" | jq -r '.metadata.title // "(none)"')
pages=$(echo "$json" | jq -r '.metadata.page_count')

echo "Title : $title"
echo "Pages : $pages"
echo
echo "--- Full text ---"
echo "$full_text"
echo
echo "--- Page 1 ---"
echo "$page1_text"

# --- Plain text output (no jq needed) ---
plain=$(pdftract extract "$PDF" --text)
echo
echo "--- Plain text (--text flag) ---"
echo "$plain"

# --- Write JSON to file ---
pdftract extract "$PDF" --output "/tmp/$(basename "$PDF" .pdf).json"
echo "JSON written to /tmp/$(basename "$PDF" .pdf).json"
```

### curl (HTTP)

```bash
#!/usr/bin/env bash
set -euo pipefail

PDF="${1:?Usage: $0 <file.pdf>}"
PDFTRACT_URL="${PDFTRACT_URL:-http://localhost:8080}"

# POST the PDF and capture the response; fail fast on HTTP errors.
json=$(curl --silent --show-error --fail \
  --max-time 60 \
  -F "file=@${PDF};type=application/pdf" \
  "${PDFTRACT_URL}/extract")

# Full text via jq
full_text=$(echo "$json" | jq -r '[.pages[].blocks[].text] | join("\n")')

# Per-page text (page 1)
page1_text=$(echo "$json" | jq -r '.pages[] | select(.page == 1) | [.blocks[].text] | join("\n")')

# Metadata
title=$(echo "$json" | jq -r '.metadata.title // "(none)"')
pages=$(echo "$json" | jq -r '.metadata.page_count')

echo "Title : $title"
echo "Pages : $pages"
echo
echo "--- Full text ---"
echo "$full_text"
echo
echo "--- Page 1 ---"
echo "$page1_text"

# --- Save raw JSON ---
output_file="/tmp/$(basename "$PDF" .pdf).json"
echo "$json" > "$output_file"
echo "JSON saved to $output_file"

# --- Optional health check (move above the POST to run it before submitting) ---
# curl -sf "${PDFTRACT_URL}/health" > /dev/null \
#   || { echo "pdftract serve is not running at ${PDFTRACT_URL}"; exit 1; }
```

### Batch processing with xargs / parallel

```bash
#!/usr/bin/env bash
# Process every PDF in a directory, writing one JSON file per PDF.
# Runs up to four extractions at a time with xargs -P.

PDF_DIR="${1:?Usage: $0 <pdf-dir> [out-dir]}"
OUT_DIR="${2:-/tmp/pdftract-out}"
mkdir -p "$OUT_DIR"

extract_one() {
  local pdf="$1"
  local out="$OUT_DIR/$(basename "$pdf" .pdf).json"
  pdftract extract "$pdf" --output "$out" && echo "OK  $pdf" || echo "ERR $pdf"
}
export -f extract_one
export OUT_DIR

find "$PDF_DIR" -name "*.pdf" -print0 \
  | xargs -0 -P 4 -I{} bash -c 'extract_one "$@"' _ {}
```
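Because `extract_one` ends in `&& echo ... || echo ...`, it always returns 0, so the batch script exits successfully even when some PDFs fail. Where a CI job needs a failing exit status, a sequential variant along the following lines may be easier to reason about. This is a sketch under assumptions: `run_batch` is a hypothetical function name, and the tool is passed as the first argument (normally `pdftract`) only so the loop can be exercised with a stand-in command.

```bash
#!/usr/bin/env bash
# Sketch: sequential batch run that counts failures and returns non-zero
# if any extraction failed. run_batch is a hypothetical helper name.
run_batch() {
  local tool="$1" out_dir="$2"
  shift 2                      # remaining arguments are the PDF paths
  local failed=0 pdf out
  mkdir -p "$out_dir"
  for pdf in "$@"; do
    out="$out_dir/$(basename "$pdf" .pdf).json"
    # With --output, stdout carries only the output path, so discard it.
    if "$tool" extract "$pdf" --output "$out" >/dev/null; then
      echo "OK  $pdf"
    else
      echo "ERR $pdf" >&2
      failed=$((failed + 1))
    fi
  done
  echo "$failed failed of $# total"
  [ "$failed" -eq 0 ]          # exit status reflects whether all succeeded
}
```

A typical call would be `run_batch pdftract /tmp/pdftract-out ./docs/*.pdf`; the trade-off against the `xargs -P` version is that files are processed one at a time.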