pdftract/docs/notes/sdk-invocation.md
jedarden 57df42f478 docs(pdftract-3b1x): finalize sdk-invocation.md with subprocess contract and TH-07 compliance
Add comprehensive "Subprocess Contract" section documenting:
- argv layout with canonical form
- stdin discipline (password ingress, PDF bytes from stdin)
- stdout/stderr discipline (what goes where, what never gets logged)
- Exit code taxonomy (0, 64-78) with TH-03 (exit 78) and TH-07 (exit 64) refs
- Environment variable pass-through (PDFTRACT_PASSWORD, PDFTRACT_MCP_TOKEN, etc.)
- --progress-json event schema (ndjson format, all event types)
- --capture-diagnostics archive layout (zip/tar, contained files, scrubbing rules)

Update all language examples (Python, Node.js, Go, Ruby, Java, Rust) with
TH-07-compliant password handling:
- Pass password via PDFTRACT_PASSWORD env var (subprocess)
- Pass password via multipart form field (HTTP)
- Never use --password VALUE flag (rejected unless opt-in)

Add progress JSON parsing examples for Python, Node.js, and Rust showing
real-world event-driven progress tracking.

File grows from 1100 to 1837 lines (+737 lines, ~67%).

Closes: pdftract-3b1x
2026-05-24 07:48:09 -04:00


# pdftract SDK Invocation Guide
How to invoke the `pdftract` binary from various languages, both via subprocess and via the HTTP server mode.
## Binary Modes Reference
```
pdftract extract <file.pdf> # JSON to stdout
pdftract extract <file.pdf> --text # plain text to stdout
pdftract extract <file.pdf> --output out.json # JSON to file
pdftract serve --port 8080 # HTTP server: POST /extract → JSON
pdftract mcp --bind 127.0.0.1:0 --auth-token-file token.txt # MCP server over HTTP or stdio
```
---
## Subprocess Contract
Every SDK invoking pdftract via subprocess MUST follow this contract. The contract defines the wire protocol between the SDK and the binary: argument layout, stream discipline, exit codes, and environment variable handling.
### argv Layout
The canonical form an SDK SHOULD construct:
```
pdftract <SUBCOMMAND> [GLOBAL_OPTIONS] <POSITIONAL_ARGS> [SUBCOMMAND_OPTIONS]
```
- **SUBCOMMAND**: `extract`, `serve`, `mcp`, `verify-receipt`, `inspect`
- **GLOBAL_OPTIONS**: Flags that apply to all subcommands (`--help`, `--version`, `--config PATH`)
- **POSITIONAL_ARGS**: Subcommand-specific arguments (e.g., PDF file path for `extract`)
- **SUBCOMMAND_OPTIONS**: Flags specific to the subcommand (e.g., `--text`, `--json`, `--output PATH`)
**Rules:**
1. Multi-value flags (e.g., `--profile NAME`) may be repeated; order is preserved.
2. Flag arguments MUST use `--flag=value` or `--flag value` syntax (both are accepted).
3. The PDF path is the first positional argument to `extract`. Use `-` to read PDF bytes from stdin (for remote sources or in-memory PDFs).
4. `--json` is implicit for `extract` when neither `--text` nor `--output PATH` is specified.
5. `--output PATH` writes JSON to a file; stdout contains only the path to that file on success.
**Examples:**
```bash
# Basic extraction (JSON to stdout)
pdftract extract document.pdf
# Plain text output
pdftract extract document.pdf --text
# JSON to file (stdout contains only the file path on success)
pdftract extract document.pdf --output /tmp/result.json
# With profile and cache
pdftract extract document.pdf --profile scientific_paper --cache-dir /var/cache/pdftract
# Remote source (PDF bytes fetched via HTTP, piped to stdin)
curl -s https://example.com/doc.pdf | pdftract extract -
# Multi-format output (JSON + Markdown + plain text)
pdftract extract document.pdf --json --md --text --output-dir /tmp/outputs
```
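The argv rules above can be sketched as a small builder. This is a minimal sketch, not part of pdftract: `build_extract_argv` and its parameter names are hypothetical, and it covers only flags named in this section (`--text`, `--output PATH`, repeated `--profile NAME`).

```python
from typing import Optional, Sequence


def build_extract_argv(
    pdf_path: str,
    *,
    text: bool = False,
    output: Optional[str] = None,
    profiles: Sequence[str] = (),
) -> list[str]:
    """Build an argv list for `pdftract extract` (hypothetical helper).

    Pass "-" as pdf_path to have the binary read PDF bytes from stdin.
    """
    # Subcommand first, then the PDF path as the first positional argument.
    argv = ["pdftract", "extract", pdf_path]
    if text:
        argv.append("--text")
    if output is not None:
        # `--flag value` form; `--flag=value` is documented as equally valid.
        argv += ["--output", output]
    for name in profiles:
        # Multi-value flags are repeated; list order is preserved.
        argv += ["--profile", name]
    return argv
```

Returning a list (rather than a shell string) lets the caller pass it straight to `subprocess.run` without shell quoting.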
### stdin Discipline
stdin is used for two purposes: password ingress and PDF bytes.
**Password ingress (`--password-stdin`):**
- When `--password-stdin` is present, pdftract reads **exactly one line** from stdin and uses it as the PDF password.
- The line is stripped of the trailing newline but NOT whitespace-trimmed.
- After reading the password, stdin is NOT consumed further; the PDF must be provided via a positional argument (not stdin).
- The password value is NEVER logged, appears in no diagnostic output, and is redacted from `--capture-diagnostics` archives.
- **TH-07**: `--password VALUE` on the command line is REJECTED unless `PDFTRACT_INSECURE_CLI_PASSWORD=1` is set. SDKs MUST use `--password-stdin` or `PDFTRACT_PASSWORD` instead.
**PDF bytes from stdin:**
- When the PDF path is `-`, pdftract reads the entire PDF byte stream from stdin.
- This is the canonical way to handle remote sources (HTTP-fetched PDFs) or in-memory PDFs without writing to disk.
- stdin is read to EOF; the binary does NOT prompt or interact.
- When `-` is used as the path, `--password-stdin` cannot be used simultaneously (both would consume stdin). Use `PDFTRACT_PASSWORD` instead.
**Example:**
```bash
# Password via stdin
echo "secret123" | pdftract extract --password-stdin encrypted.pdf
# Remote PDF fetched via curl, piped to pdftract
curl -s https://example.com/doc.pdf | pdftract extract -
# DO NOT DO THIS (TH-07 violation -- rejected unless opt-in):
pdftract extract encrypted.pdf --password secret123
```
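Because stdin can carry either the password or the PDF bytes but not both, an SDK has to pick a password channel per call. The following is a minimal sketch of that decision, assuming a hypothetical helper named `plan_password_channel`; the fallback to the env var for passwords containing a newline is an assumption that follows from the one-line read described above.

```python
from typing import Optional


def plan_password_channel(
    pdf_from_stdin: bool,
    password: Optional[str],
    prefer_stdin: bool = True,
) -> tuple[list[str], Optional[str], dict[str, str]]:
    """Return (extra_argv, stdin_text, extra_env) for the password, if any."""
    if password is None:
        return [], None, {}
    # stdin is already taken by the PDF bytes, or the password cannot be
    # sent as exactly one line: fall back to the environment variable.
    if pdf_from_stdin or not prefer_stdin or "\n" in password:
        return [], None, {"PDFTRACT_PASSWORD": password}
    # One line on stdin; the binary strips only the trailing newline.
    return ["--password-stdin"], password + "\n", {}
```

Note that neither branch ever places the password in argv, which is what TH-07 requires.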
### stdout Discipline
stdout carries ONLY the extraction output in structured form. NOTHING else may be written to stdout.
**`extract` subcommand:**
- In `--json` mode (default): a single JSON object conforming to `docs/schema/v1.0/pdftract.schema.json`. No trailing newlines beyond the JSON structure.
- In `--text` mode: plain text, UTF-8 encoded. Lines are separated by `\n`. No trailing metadata.
- In `--output PATH` mode: the absolute path to the output file is written to stdout on success. On error, stderr contains the diagnostic and stdout is empty.
- **Critical**: Any log line mixed into stdout breaks JSON parsing in the SDK. The binary MUST keep stdout clean, and SDKs MUST NOT merge stderr into stdout when capturing output.
**`serve` / `mcp --bind` modes:**
- stdout is NOT used for request responses. HTTP responses go to the socket; MCP JSON-RPC frames go to the transport (stdio for MCP stdio mode, HTTP for MCP `--bind` mode).
- Log lines are routed to stderr via the `log` crate (see stderr discipline).
**INV-9 (MCP stdio mode):** In MCP stdio mode, stdout MUST contain ONLY JSON-RPC frames. Any non-JSON-RPC byte breaks the protocol.
### stderr Discipline
stderr carries human-readable logs, progress events, and diagnostics. The format is NOT machine-parseable (except for `--progress-json` mode, see below).
**Log levels (controlled by `RUST_LOG`):**
- `error`: Fatal errors that prevent extraction (e.g., "cannot open input file").
- `warn`: Non-fatal issues (e.g., "cache miss, extracting from PDF").
- `info` (default): High-level progress (e.g., "extracting page 5 of 10", "profile matched: scientific_paper").
- `debug`: Per-phase timing, resolved options (passwords redacted), per-page glyph/span counts.
- `trace`: Detailed phase internals (cache key derivation steps, etc.).
**Progress events (when `--progress-json` is set):**
- Each event is emitted as a single-line JSON object on stderr, newline-delimited (ndjson format).
- See `--progress-json` schema below.
**NEVER logged at any level:**
- Password values (PDF, MCP, inspector) — redacted as `<redacted>`
- Bearer-token values — redacted as `<redacted>`
- PDF byte contents — only the SHA-256 fingerprint is logged
- Full extracted text — only span/page counts
- `Cookie`, `Authorization`, or `Proxy-Authorization` HTTP headers
### Exit Code Taxonomy
pdftract follows the sysexits(3) convention. Nonzero exit codes below 64 are reserved; codes 64-78 are application-specific.
| Exit Code | Name | Meaning | TH Reference |
|-----------|------|---------|--------------|
| 0 | SUCCESS | Extraction completed successfully. | — |
| 64 | USAGE_ERROR | Invalid command-line arguments, unknown flags, conflicting options. | — |
| 65 | DATA_ERROR | Malformed PDF (cannot parse xref, trailer, or page tree). | — |
| 66 | PASSWORD_MISSING | PDF is encrypted but no password was provided. | TH-07 |
| 67 | CANNOT_OPEN_INPUT | File not found or permission denied. | — |
| 70 | INTERNAL_ERROR | Unexpected panic or bug (should never happen in production). | INV-8 |
| 73 | CANNOT_CREATE_OUTPUT | Cannot write to `--output PATH` (permission denied, disk full, etc.). | — |
| 74 | IO_ERROR | Generic I/O error (read failure, network timeout for remote source). | — |
| 75 | TEMP_FAILURE | Temporary failure; retry may succeed (e.g., remote source returned 503). | — |
| 77 | PERMISSION_DENIED | Insufficient permissions (e.g., `--root DIR` traversal blocked). | TH-02 |
| 78 | CONFIG_ERROR | Configuration error (invalid profile YAML, missing required `--auth-token` on public MCP bind). | TH-03 (line 874) |
**TH-03 (exit 78):** `pdftract mcp --bind 0.0.0.0:PORT` without `--auth-token` or `PDFTRACT_MCP_TOKEN` aborts with exit code 78 and a stderr message explaining the risk. Loopback binds (`127.0.0.1`, `::1`) are exempt.
**TH-07 (password handling):** Using `--password VALUE` without `PDFTRACT_INSECURE_CLI_PASSWORD=1` exits with code 64 (USAGE_ERROR) and a stderr hint to use `--password-stdin` or `PDFTRACT_PASSWORD` instead.
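An SDK will typically translate these exit codes into named errors. The sketch below mirrors the table above as a Python `IntEnum`; the `describe_exit` helper and the choice to treat only `TEMP_FAILURE` as retryable are assumptions based on the table's wording, not documented SDK behavior.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes copied from the taxonomy table above."""
    SUCCESS = 0
    USAGE_ERROR = 64
    DATA_ERROR = 65
    PASSWORD_MISSING = 66
    CANNOT_OPEN_INPUT = 67
    INTERNAL_ERROR = 70
    CANNOT_CREATE_OUTPUT = 73
    IO_ERROR = 74
    TEMP_FAILURE = 75
    PERMISSION_DENIED = 77
    CONFIG_ERROR = 78


# Assumption: only code 75 is described as "retry may succeed".
RETRYABLE = frozenset({ExitCode.TEMP_FAILURE})


def describe_exit(returncode: int) -> tuple[str, bool]:
    """Map a process return code to (name, retryable)."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        # Reserved codes, or a negative value when killed by a signal (POSIX).
        return f"UNKNOWN({returncode})", False
    return code.name, code in RETRYABLE
```

A caller can raise a typed exception keyed on the returned name and apply backoff only when the second element is true.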
### Environment Variable Pass-Through
The following environment variables are recognized by pdftract. SDKs SHOULD set them explicitly when the corresponding behavior is desired.
| Variable | Purpose | Secret? |
|----------|---------|---------|
| `PDFTRACT_PASSWORD` | PDF decryption password. | YES — never logged |
| `PDFTRACT_MCP_TOKEN` | MCP server bearer token (for `--auth-token`). | YES — never logged |
| `PDFTRACT_INSECURE_CLI_PASSWORD` | Set to `1` to allow `--password VALUE` (TH-07 opt-out). | NO |
| `PDFTRACT_INSECURE_CLI_TOKEN` | Set to `1` to allow `--auth-token VALUE`. | NO |
| `RUST_LOG` | Log level filter (e.g., `pdftract=debug`). | NO |
| `NO_COLOR` | Disable ANSI colors in stderr output. | NO |
| `XDG_CONFIG_HOME` | Base directory for profile search (overrides `~/.config`). | NO |
| `PDFTRACT_CONFIG_DIR` | Explicit profile directory path (overrides XDG default). | NO |
**Secret handling:**
- Secret-bearing variables (`PDFTRACT_PASSWORD`, `PDFTRACT_MCP_TOKEN`) are NEVER emitted in logs, diagnostics, or `--capture-diagnostics` archives.
- They are held in `secrecy::SecretString` to prevent accidental `Debug` prints.
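On the SDK side, the same discipline applies when building the child environment and when logging it. This is a minimal sketch with hypothetical helper names (`build_child_env`, `redact_env`); it reuses the `<redacted>` marker shown elsewhere in this document.

```python
from typing import Mapping, Optional

# Variables marked "Secret? YES" in the table above.
SECRET_VARS = ("PDFTRACT_PASSWORD", "PDFTRACT_MCP_TOKEN")


def build_child_env(
    base: Mapping[str, str],
    *,
    password: Optional[str] = None,
    rust_log: Optional[str] = None,
    no_color: bool = False,
) -> dict[str, str]:
    """Copy the parent environment and set only what the caller asked for."""
    env = dict(base)
    if password is not None:
        env["PDFTRACT_PASSWORD"] = password
    if rust_log is not None:
        env["RUST_LOG"] = rust_log
    if no_color:
        env["NO_COLOR"] = "1"
    return env


def redact_env(env: Mapping[str, str]) -> dict[str, str]:
    """Return a copy that is safe to log: secret values are replaced."""
    return {k: ("<redacted>" if k in SECRET_VARS else v) for k, v in env.items()}
```

Logging `redact_env(env)` instead of `env` keeps SDK debug output consistent with the binary's own redaction rules.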
### `--progress-json` Event Schema
When `--progress-json` is passed, pdftract emits newline-delimited JSON objects to stderr, one per event. This allows SDKs to parse progress without scraping human-readable logs.
**Event types:**
```jsonc
// Extraction started
{"event":"open","fingerprint":"pdftract-v1:abcd...","path":"document.pdf","version":"1.0.0"}
// Page processing started
{"event":"page_started","page":5,"total":10}
// Page processing completed
{"event":"page_completed","page":5,"span_count":123,"block_count":12}
// OCR started (Phase 5.4)
{"event":"ocr_started","page":3,"engine":"tesseract","lang":"eng"}
// OCR completed
{"event":"ocr_completed","page":3,"duration_ms":1234}
// Profile matched (Phase 7.10)
{"event":"profile_matched","profile":"scientific_paper","priority":100}
// Password received (TH-07 — NEVER includes the password value)
{"event":"password_received","source":"stdin"} // or "env", "mcp_body", "form_field"
// Extraction completed successfully
{"event":"completed","duration_ms":5678,"page_count":10}
// Fatal error (extraction aborted)
{"event":"error","code":"PASSWORD_WRONG","message":"Incorrect password","exit_code":66}
```
**Parsing:**
- Each progress event is a single line of valid JSON. SDKs read stderr line-by-line and attempt to parse each line as JSON (e.g., `JSON.parse()` in Node.js).
- The `event` field discriminates the type; additional fields are event-specific.
- Human-readable log lines are still emitted to stderr intermixed with JSON lines. SDKs should filter by attempting JSON parse first; lines that fail to parse are human logs.
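The "try JSON first, otherwise treat as a log line" rule can be sketched as follows. `split_stderr` is a hypothetical helper; requiring a JSON object with an `event` key is an assumption drawn from the event examples above.

```python
import json
from typing import Iterable


def split_stderr(lines: Iterable[str]) -> tuple[list[dict], list[str]]:
    """Separate --progress-json events from human-readable log lines."""
    events: list[dict] = []
    logs: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            logs.append(line)  # not JSON: a human-readable log line
            continue
        # A bare number or string also parses as JSON; only objects that
        # carry an "event" discriminator count as progress events.
        if isinstance(obj, dict) and "event" in obj:
            events.append(obj)
        else:
            logs.append(line)
    return events, logs
```

In a live SDK the same loop would run over the child's stderr pipe as lines arrive, dispatching on `obj["event"]` instead of collecting into lists.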
### `--capture-diagnostics` Archive Layout
When `--capture-diagnostics PATH` is passed, pdftract creates a diagnostic archive on error or when explicitly requested. The archive is attached to bug reports for reproduction.
**Archive formats:**
- `.zip` (default) — Use when `zip` command is available.
- `.tar.gz` — Fallback when `zip` is not available.
**Contained files:**
```
diagnostics-20260516-123456.zip
├── manifest.json # Archive metadata (version, timestamp, exit code)
├── runtime_config.json # Extraction options with secrets REDACTED
├── stderr.log # Captured stderr (passwords REDACTED)
├── pdf_fingerprint.txt # SHA-256 fingerprint of the input PDF
├── pdf_source_sanitized.pdf # PDF with all text content replaced by placeholders
└── version.txt # `pdftract --version` output
```
**`manifest.json` schema:**
```json
{
"captured_at": "2026-05-16T12:34:56Z",
"pdftract_version": "1.0.0",
"exit_code": 65,
"exit_reason": "DATA_ERROR",
"diagnostic_codes": ["XREF_REPAIRED", "STREAM_BOMB"],
"pdf_fingerprint": "pdftract-v1:abcd...",
"options_redacted": true
}
```
**`runtime_config.json` schema:**
```json
{
"subcommand": "extract",
"args": ["document.pdf", "--profile", "scientific_paper"],
"env": {
"RUST_LOG": "pdftract=info",
"PDFTRACT_PASSWORD": "<redacted>",
"PDFTRACT_MCP_TOKEN": "<redacted>"
}
}
```
**Secret scrubbing (TH-08):**
- `PDFTRACT_PASSWORD` value → `"<redacted>"`
- `PDFTRACT_MCP_TOKEN` value → `"<redacted>"`
- Full extracted text → NOT included (only span counts in stderr.log)
- PDF source → `pdf_source_sanitized.pdf` replaces all text content with placeholder glyphs (`[` / `]`) but preserves structure
**Rotation:** Archives are NOT auto-rotated. Operators MUST manage disk space manually.
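A tool that triages bug reports might start by reading `manifest.json` out of the archive. The sketch below handles only the `.zip` format and assumes a hypothetical helper name, `read_manifest`; it matches the member by base name because this section does not state whether files sit at the archive root or inside a folder.

```python
import json
import zipfile


def read_manifest(archive) -> dict:
    """Load manifest.json from a diagnostics .zip (path or binary file object)."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Accept both "manifest.json" and "<dir>/manifest.json".
            if name.rsplit("/", 1)[-1] == "manifest.json":
                return json.loads(zf.read(name))
    raise FileNotFoundError("manifest.json not found in archive")
```

Fields such as `exit_code` and `diagnostic_codes` can then be used to route the report without opening the other files.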
---
## JSON Output Schema
```json
{
"pages": [
{
"page": 1,
"spans": [
{
"text": "Hello world",
"bbox": [x0, y0, x1, y1],
"font": "Helvetica",
"size": 12.0,
"confidence": 0.98
}
],
"blocks": [
{
"kind": "paragraph",
"text": "Hello world",
"bbox": [x0, y0, x1, y1]
}
]
}
],
"metadata": {
"title": "...",
"author": "...",
"page_count": 10
}
}
```
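As an illustration of walking this structure, the sketch below collects spans whose `confidence` falls under a threshold, for example to flag pages for manual review. The helper name `low_confidence_spans`, the default threshold, and treating a missing `confidence` as 1.0 are all assumptions.

```python
def low_confidence_spans(data: dict, threshold: float = 0.9) -> list[tuple[int, str, float]]:
    """Return (page_number, text, confidence) for spans below the threshold."""
    hits = []
    for page in data["pages"]:
        for span in page["spans"]:
            # Assumption: a span without a confidence value is fully trusted.
            conf = span.get("confidence", 1.0)
            if conf < threshold:
                hits.append((page["page"], span["text"], conf))
    return hits
```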
---
## 1. Python
> **When to prefer subprocess:** one-off scripts, CLI pipelines, or when starting the server is not worth the overhead.
> **When to prefer HTTP:** long-running services, parallel extraction across many files, or when sharing a single pdftract instance across multiple workers.
### Subprocess
```python
import subprocess
import json
import os


def extract_pdf_subprocess(pdf_path: str, password: str | None = None) -> dict:
    """Extract text from a PDF via subprocess and return the parsed JSON result.

    Args:
        pdf_path: Path to the PDF file.
        password: Optional PDF password. Passed via env var (TH-07 compliant).

    Returns:
        Parsed JSON output from pdftract.

    Raises:
        RuntimeError: If pdftract exits with a non-zero code.
    """
    env = os.environ.copy()
    if password:
        # TH-07: Pass password via env var, NOT via --password flag.
        # Using --password VALUE is rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1.
        env["PDFTRACT_PASSWORD"] = password
    result = subprocess.run(
        ["pdftract", "extract", pdf_path],
        capture_output=True,
        text=True,
        env=env,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"pdftract failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def extract_pdf_password_stdin(pdf_path: str, password: str) -> dict:
    """Extract with password via --password-stdin (TH-07 compliant).

    This is the recommended method when you cannot use env vars (e.g., in
    restricted environments where env injection is not possible).
    """
    result = subprocess.run(
        ["pdftract", "extract", "--password-stdin", pdf_path],
        input=password + "\n",  # stdin: one line containing the password
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"pdftract failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def extract_pdf_from_bytes(pdf_bytes: bytes, password: str | None = None) -> dict:
    """Extract from in-memory PDF bytes (avoids writing to disk).

    The PDF is piped to pdftract via stdin using the special '-' path.
    When using stdin for the PDF, --password-stdin cannot be used simultaneously;
    use PDFTRACT_PASSWORD env var instead.
    """
    env = os.environ.copy()
    if password:
        env["PDFTRACT_PASSWORD"] = password
    result = subprocess.run(
        ["pdftract", "extract", "-"],  # '-' means read PDF from stdin
        input=pdf_bytes,
        capture_output=True,  # binary mode: stdout and stderr are bytes here
        env=env,
    )
    if result.returncode != 0:
        # stderr is bytes in binary mode; decode it before formatting.
        stderr_text = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(
            f"pdftract failed (exit {result.returncode}): {stderr_text}"
        )
    return json.loads(result.stdout)


def full_text(data: dict) -> str:
    """Concatenate all block text across every page."""
    parts = []
    for page in data["pages"]:
        for block in page["blocks"]:
            parts.append(block["text"])
    return "\n".join(parts)


def page_text(data: dict, page_number: int) -> str:
    """Return concatenated block text for a single page (1-indexed)."""
    for page in data["pages"]:
        if page["page"] == page_number:
            return "\n".join(block["text"] for block in page["blocks"])
    raise ValueError(f"Page {page_number} not found")


if __name__ == "__main__":
    import sys

    pdf = sys.argv[1]
    # Example: extract with password
    # data = extract_pdf_subprocess(pdf, password="secret123")
    data = extract_pdf_subprocess(pdf)
    print(f"Title : {data['metadata'].get('title', '(none)')}")
    print(f"Pages : {data['metadata']['page_count']}")
    print()
    print("--- Full text ---")
    print(full_text(data))
    print()
    print("--- Page 1 text ---")
    print(page_text(data, 1))
```
### HTTP (requests / httpx)
```python
# pip install requests
# pip install httpx  # async alternative shown below
import requests
import json

PDFTRACT_URL = "http://localhost:8080"


def extract_pdf_http(pdf_path: str, password: str | None = None) -> dict:
    """POST a PDF file to pdftract serve and return the parsed JSON result.

    Args:
        pdf_path: Path to the PDF file.
        password: Optional PDF password (sent as multipart form field).

    Raises:
        requests.HTTPError: If the HTTP request fails.
    """
    with open(pdf_path, "rb") as f:
        files = {"file": (pdf_path, f, "application/pdf")}
        data: dict[str, str] = {}
        if password:
            # TH-07: Password via form field is allowed (not exposed in ps/process list).
            data["password"] = password
        # The request is sent while the file is still open.
        response = requests.post(
            f"{PDFTRACT_URL}/extract",
            files=files,
            data=data,
            timeout=60,
        )
    response.raise_for_status()
    return response.json()


def full_text(data: dict) -> str:
    parts = []
    for page in data["pages"]:
        for block in page["blocks"]:
            parts.append(block["text"])
    return "\n".join(parts)


def page_text(data: dict, page_number: int) -> str:
    for page in data["pages"]:
        if page["page"] == page_number:
            return "\n".join(block["text"] for block in page["blocks"])
    raise ValueError(f"Page {page_number} not found")


# --- Async variant with httpx ---
import asyncio
import httpx


async def extract_pdf_async(pdf_path: str) -> dict:
    async with httpx.AsyncClient(timeout=60) as client:
        with open(pdf_path, "rb") as f:
            response = await client.post(
                f"{PDFTRACT_URL}/extract",
                files={"file": (pdf_path, f, "application/pdf")},
            )
        response.raise_for_status()
        return response.json()


if __name__ == "__main__":
    import sys

    pdf = sys.argv[1]
    # Synchronous
    data = extract_pdf_http(pdf)
    print(full_text(data))
    # Asynchronous
    data = asyncio.run(extract_pdf_async(pdf))
    print(full_text(data))
```
---
## 2. Node.js / JavaScript
> **When to prefer subprocess:** build scripts, one-off tooling, or serverless functions where spinning up a child process is acceptable.
> **When to prefer HTTP:** Express/Fastify services, or when pdftract is deployed as a sidecar or shared microservice.
### Subprocess (child_process)
```js
// Node.js 18+ (ESM)
import { execFile } from "node:child_process";
import { promisify } from "node:util";
const execFileAsync = promisify(execFile);
/**
* Extract text from a PDF via subprocess.
* @param {string} pdfPath
* @param {string} [password] Optional PDF password (TH-07: passed via env)
* @returns {Promise<object>} Parsed pdftract JSON
*/
async function extractPdfSubprocess(pdfPath, password) {
const env = { ...process.env };
if (password) {
// TH-07: Pass password via env var, NOT via --password flag.
// Using --password VALUE is rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1.
env.PDFTRACT_PASSWORD = password;
}
const { stdout, stderr } = await execFileAsync("pdftract", ["extract", pdfPath], {
env,
}).catch((err) => {
throw new Error(`pdftract failed (exit ${err.code}): ${err.stderr}`);
});
return JSON.parse(stdout);
}
/**
* Extract with password via --password-stdin (TH-07 compliant).
* @param {string} pdfPath
* @param {string} password
* @returns {Promise<object>}
*/
async function extractPdfPasswordStdin(pdfPath, password) {
// execFile is already imported at the top of this ES module; require() is not defined in ESM.
return new Promise((resolve, reject) => {
const proc = execFile("pdftract", ["extract", "--password-stdin", pdfPath]);
let stdout = "";
let stderr = "";
proc.stdout.on("data", (data) => { stdout += data; });
proc.stderr.on("data", (data) => { stderr += data; });
proc.on("close", (code) => {
if (code !== 0) {
reject(new Error(`pdftract failed (exit ${code}): ${stderr}`));
} else {
resolve(JSON.parse(stdout));
}
});
// Write password to stdin, followed by newline
proc.stdin.write(password + "\n");
proc.stdin.end();
});
}
/** Concatenate all block text across every page. */
function fullText(data) {
return data.pages
.flatMap((page) => page.blocks.map((b) => b.text))
.join("\n");
}
/** Return concatenated block text for a single page (1-indexed). */
function pageText(data, pageNumber) {
const page = data.pages.find((p) => p.page === pageNumber);
if (!page) throw new Error(`Page ${pageNumber} not found`);
return page.blocks.map((b) => b.text).join("\n");
}
// Usage
const data = await extractPdfSubprocess(process.argv[2]);
console.log("Title :", data.metadata.title ?? "(none)");
console.log("Pages :", data.metadata.page_count);
console.log("\n--- Full text ---");
console.log(fullText(data));
console.log("\n--- Page 1 ---");
console.log(pageText(data, 1));
```
### HTTP (native fetch)
```js
// Node.js 18+ — fetch is available globally; no extra dependencies required.
import { readFile } from "node:fs/promises";
const PDFTRACT_URL = "http://localhost:8080";
/**
* POST a PDF to pdftract serve.
* @param {string} pdfPath
* @param {string} [password] Optional PDF password (sent as form field)
* @returns {Promise<object>} Parsed pdftract JSON
*/
async function extractPdfHttp(pdfPath, password) {
const bytes = await readFile(pdfPath);
const blob = new Blob([bytes], { type: "application/pdf" });
const form = new FormData();
form.append("file", blob, pdfPath);
if (password) {
// TH-07: Password via form field is allowed.
form.append("password", password);
}
const res = await fetch(`${PDFTRACT_URL}/extract`, {
method: "POST",
body: form,
});
if (!res.ok) {
const body = await res.text();
throw new Error(`pdftract HTTP ${res.status}: ${body}`);
}
return res.json();
}
function fullText(data) {
return data.pages
.flatMap((page) => page.blocks.map((b) => b.text))
.join("\n");
}
function pageText(data, pageNumber) {
const page = data.pages.find((p) => p.page === pageNumber);
if (!page) throw new Error(`Page ${pageNumber} not found`);
return page.blocks.map((b) => b.text).join("\n");
}
// Usage
const data = await extractPdfHttp(process.argv[2]);
console.log(fullText(data));
```
---
## 3. Go
> **When to prefer subprocess:** CLI utilities or single-binary deployments where you want zero network overhead.
> **When to prefer HTTP:** Go services handling concurrent requests — spin up pdftract serve once and hit it from multiple goroutines.
### Subprocess (os/exec)
```go
package main
import (
"encoding/json"
"fmt"
"log"
"os"
"os/exec"
"strings"
)
// extractSubprocess runs `pdftract extract <path>` and returns the parsed result.
// If password is non-empty, it is passed via PDFTRACT_PASSWORD env var (TH-07 compliant).
func extractSubprocess(pdfPath string, password string) (*PDFTractResult, error) {
cmd := exec.Command("pdftract", "extract", pdfPath)
if password != "" {
// TH-07: Pass password via env var, NOT via --password flag.
cmd.Env = append(os.Environ(), "PDFTRACT_PASSWORD="+password)
}
out, err := cmd.Output()
if err != nil {
if exitErr, ok := err.(*exec.ExitError); ok {
return nil, fmt.Errorf("pdftract failed: %s", string(exitErr.Stderr))
}
return nil, fmt.Errorf("exec error: %w", err)
}
var result PDFTractResult
if err := json.Unmarshal(out, &result); err != nil {
return nil, fmt.Errorf("json parse error: %w", err)
}
return &result, nil
}
type Span struct {
Text string `json:"text"`
BBox [4]float64 `json:"bbox"`
Font string `json:"font"`
Size float64 `json:"size"`
Confidence float64 `json:"confidence"`
}
type Block struct {
Kind string `json:"kind"`
Text string `json:"text"`
BBox [4]float64 `json:"bbox"`
}
type Page struct {
Page int `json:"page"`
Spans []Span `json:"spans"`
Blocks []Block `json:"blocks"`
}
type Metadata struct {
Title string `json:"title"`
Author string `json:"author"`
PageCount int `json:"page_count"`
}
type PDFTractResult struct {
Pages []Page `json:"pages"`
Metadata Metadata `json:"metadata"`
}
// FullText concatenates all block text across every page.
func (r *PDFTractResult) FullText() string {
var sb strings.Builder
for _, page := range r.Pages {
for _, block := range page.Blocks {
sb.WriteString(block.Text)
sb.WriteByte('\n')
}
}
return sb.String()
}
// PageText returns concatenated block text for a single page (1-indexed).
func (r *PDFTractResult) PageText(pageNumber int) (string, error) {
for _, page := range r.Pages {
if page.Page == pageNumber {
var sb strings.Builder
for _, block := range page.Blocks {
sb.WriteString(block.Text)
sb.WriteByte('\n')
}
return sb.String(), nil
}
}
return "", fmt.Errorf("page %d not found", pageNumber)
}
func main() {
if len(os.Args) < 2 {
log.Fatal("usage: program <file.pdf>")
}
// An empty password means the PDF is not encrypted; no env var is set.
result, err := extractSubprocess(os.Args[1], "")
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
fmt.Printf("Title : %s\n", result.Metadata.Title)
fmt.Printf("Pages : %d\n", result.Metadata.PageCount)
fmt.Println("\n--- Full text ---")
fmt.Println(result.FullText())
p1, err := result.PageText(1)
if err != nil {
log.Printf("page 1: %v", err)
} else {
fmt.Println("--- Page 1 ---")
fmt.Println(p1)
}
}
```
### HTTP (net/http)
```go
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"mime/multipart"
"net/http"
"os"
"path/filepath"
)
const pdftractURL = "http://localhost:8080"
// extractHTTP POSTs a PDF file to pdftract serve.
// If password is non-empty, it is sent as a multipart form field (TH-07 compliant).
func extractHTTP(pdfPath string, password string) (*PDFTractResult, error) {
f, err := os.Open(pdfPath)
if err != nil {
return nil, fmt.Errorf("open file: %w", err)
}
defer f.Close()
var buf bytes.Buffer
mw := multipart.NewWriter(&buf)
part, err := mw.CreateFormFile("file", filepath.Base(pdfPath))
if err != nil {
return nil, fmt.Errorf("create form file: %w", err)
}
if _, err := io.Copy(part, f); err != nil {
return nil, fmt.Errorf("copy file: %w", err)
}
if password != "" {
// TH-07: Password via form field is allowed.
err = mw.WriteField("password", password)
if err != nil {
return nil, fmt.Errorf("write password field: %w", err)
}
}
mw.Close()
resp, err := http.Post(
pdftractURL+"/extract",
mw.FormDataContentType(),
&buf,
)
if err != nil {
return nil, fmt.Errorf("http post: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("pdftract HTTP %d: %s", resp.StatusCode, body)
}
var result PDFTractResult
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
return nil, fmt.Errorf("json decode: %w", err)
}
return &result, nil
}
func main() {
if len(os.Args) < 2 {
log.Fatal("usage: program <file.pdf>")
}
// An empty password means no password form field is sent.
result, err := extractHTTP(os.Args[1], "")
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
fmt.Println(result.FullText())
}
```
---
## 4. Ruby
> **When to prefer subprocess:** Rake tasks, standalone scripts, or Rails background jobs without a persistent pdftract process.
> **When to prefer HTTP:** Sidekiq workers or Rails requests — keep pdftract serve running as a separate process and hit it over loopback.
### Subprocess (Open3)
```ruby
require "open3"
require "json"
# Extract text from a PDF via subprocess.
# Returns a Hash parsed from pdftract's JSON output.
# If password is provided, it is passed via env var (TH-07 compliant).
def extract_pdf_subprocess(pdf_path, password: nil)
env = {}
env["PDFTRACT_PASSWORD"] = password if password
stdout, stderr, status = Open3.capture3(
env,
"pdftract", "extract", pdf_path
)
unless status.success?
raise "pdftract failed (exit #{status.exitstatus}): #{stderr.strip}"
end
JSON.parse(stdout)
end
# Extract with password via --password-stdin (TH-07 compliant).
def extract_pdf_password_stdin(pdf_path, password)
require "open3"
require "json"
# Pass password via stdin; Open3 with :stdin_data is the cleanest way.
stdout, stderr, status = Open3.capture3(
"pdftract", "extract", "--password-stdin", pdf_path,
stdin_data: password + "\n"
)
unless status.success?
raise "pdftract failed (exit #{status.exitstatus}): #{stderr.strip}"
end
JSON.parse(stdout)
end
# Concatenate all block text across every page.
def full_text(data)
data["pages"]
.flat_map { |page| page["blocks"].map { |b| b["text"] } }
.join("\n")
end
# Return concatenated block text for a single page (1-indexed).
def page_text(data, page_number)
page = data["pages"].find { |p| p["page"] == page_number }
raise "Page #{page_number} not found" unless page
page["blocks"].map { |b| b["text"] }.join("\n")
end
# Usage
pdf_path = ARGV[0] || raise("Usage: ruby extract.rb <file.pdf>")
data = extract_pdf_subprocess(pdf_path)
puts "Title : #{data.dig("metadata", "title") || "(none)"}"
puts "Pages : #{data.dig("metadata", "page_count")}"
puts
puts "--- Full text ---"
puts full_text(data)
puts
puts "--- Page 1 ---"
puts page_text(data, 1)
```
### HTTP (net/http)
```ruby
require "net/http"
require "json"
PDFTRACT_URL = URI("http://localhost:8080/extract")
# POST a PDF file to pdftract serve.
# If password is provided, it is sent as a multipart form field (TH-07 compliant).
def extract_pdf_http(pdf_path, password: nil)
boundary = "----pdftract#{rand(0xFFFFFF).to_s(16)}"
body = build_multipart(pdf_path, boundary, password:)
http = Net::HTTP.new(PDFTRACT_URL.host, PDFTRACT_URL.port)
http.read_timeout = 60
request = Net::HTTP::Post.new(PDFTRACT_URL.path)
request["Content-Type"] = "multipart/form-data; boundary=#{boundary}"
request.body = body
response = http.request(request)
raise "pdftract HTTP #{response.code}: #{response.body}" unless response.is_a?(Net::HTTPSuccess)
JSON.parse(response.body)
end
def build_multipart(pdf_path, boundary, password: nil)
crlf = "\r\n"
pdf_data = File.binread(pdf_path)
filename = File.basename(pdf_path)
parts = [
"--#{boundary}#{crlf}",
"Content-Disposition: form-data; name=\"file\"; filename=\"#{filename}\"#{crlf}",
"Content-Type: application/pdf#{crlf}",
crlf,
pdf_data,
]
if password
# TH-07: Password via form field is allowed.
parts.concat([
"#{crlf}--#{boundary}#{crlf}",
"Content-Disposition: form-data; name=\"password\"#{crlf}",
crlf,
password,
])
end
parts.concat([
"#{crlf}--#{boundary}--#{crlf}",
])
parts.join
end
def full_text(data)
data["pages"]
.flat_map { |page| page["blocks"].map { |b| b["text"] } }
.join("\n")
end
def page_text(data, page_number)
page = data["pages"].find { |p| p["page"] == page_number }
raise "Page #{page_number} not found" unless page
page["blocks"].map { |b| b["text"] }.join("\n")
end
# Usage
pdf_path = ARGV[0] || raise("Usage: ruby extract_http.rb <file.pdf>")
data = extract_pdf_http(pdf_path)
puts full_text(data)
```
---
## 5. Java
> **When to prefer subprocess:** batch jobs or standalone utilities. ProcessBuilder is simple and avoids a network stack.
> **When to prefer HTTP:** Spring Boot services or multi-threaded apps — pdftract serve handles concurrent requests, while subprocess creates a new process per call.
Requires Java 11+. The examples use Jackson (`jackson-databind`) for JSON parsing; everything else comes from the standard library.
### Subprocess (ProcessBuilder)
```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Invokes pdftract via subprocess and parses the JSON result.
 *
 * Dependency (Maven):
 *   <dependency>
 *     <groupId>com.fasterxml.jackson.core</groupId>
 *     <artifactId>jackson-databind</artifactId>
 *     <version>2.17.0</version>
 *   </dependency>
 *
 * Any JSON library works in place of Jackson (for example org.json);
 * the structure is straightforward.
 */
public class PdftractSubprocess {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Extract text from a PDF.
     * @param pdfPath  Path to the PDF file.
     * @param password Optional PDF password, or null (TH-07: passed via env var).
     */
    public static JsonNode extract(String pdfPath, String password) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("pdftract", "extract", pdfPath);
        pb.redirectErrorStream(false); // keep stderr separate
        if (password != null && !password.isEmpty()) {
            // TH-07: Pass password via env var, NOT via --password flag.
            pb.environment().put("PDFTRACT_PASSWORD", password);
        }
        Process process = pb.start();
        // Drain stderr on a background thread. Reading stdout to EOF first can
        // deadlock if the child fills the stderr pipe buffer in the meantime.
        CompletableFuture<byte[]> stderrFuture = CompletableFuture.supplyAsync(() -> {
            try {
                return process.getErrorStream().readAllBytes();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        byte[] stdout = process.getInputStream().readAllBytes();
        byte[] stderr = stderrFuture.join();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException(
                "pdftract failed (exit " + exit + "): "
                    + new String(stderr, StandardCharsets.UTF_8).strip()
            );
        }
        return MAPPER.readTree(stdout);
    }

    /** Concatenate all block text across every page. */
    public static String fullText(JsonNode data) {
        List<String> parts = new ArrayList<>();
        for (JsonNode page : data.get("pages")) {
            for (JsonNode block : page.get("blocks")) {
                parts.add(block.get("text").asText());
            }
        }
        return String.join("\n", parts);
    }

    /** Return concatenated block text for a single page (1-indexed). */
    public static String pageText(JsonNode data, int pageNumber) {
        for (JsonNode page : data.get("pages")) {
            if (page.get("page").asInt() == pageNumber) {
                List<String> parts = new ArrayList<>();
                for (JsonNode block : page.get("blocks")) {
                    parts.add(block.get("text").asText());
                }
                return String.join("\n", parts);
            }
        }
        throw new IllegalArgumentException("Page " + pageNumber + " not found");
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: PdftractSubprocess <file.pdf>");
            System.exit(1);
        }
        JsonNode data = extract(args[0], null); // null: no password
        JsonNode meta = data.get("metadata");
        System.out.println("Title : " + meta.path("title").asText("(none)"));
        System.out.println("Pages : " + meta.get("page_count").asInt());
        System.out.println("\n--- Full text ---");
        System.out.println(fullText(data));
        System.out.println("\n--- Page 1 ---");
        System.out.println(pageText(data, 1));
    }
}
```
### HTTP (java.net.http.HttpClient, Java 11+)
```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
public class PdftractHttp {
private static final String PDFTRACT_URL = "http://localhost:8080";
private static final ObjectMapper MAPPER = new ObjectMapper();
private static final HttpClient CLIENT = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.build();
/**
* Extract text from a PDF via HTTP.
* @param pdfPath Path to the PDF file.
* @param password Optional PDF password (TH-07: sent as form field).
*/
public static JsonNode extract(String pdfPath, String password) throws IOException, InterruptedException {
Path path = Path.of(pdfPath);
byte[] pdfBytes = Files.readAllBytes(path);
String filename = path.getFileName().toString();
String boundary = UUID.randomUUID().toString().replace("-", "");
// Build multipart/form-data body manually (no external library needed)
String crlf = "\r\n";
StringBuilder bodyBuilder = new StringBuilder();
// File part
bodyBuilder.append("--").append(boundary).append(crlf);
bodyBuilder.append("Content-Disposition: form-data; name=\"file\"; filename=\"")
.append(filename).append("\"").append(crlf);
bodyBuilder.append("Content-Type: application/pdf").append(crlf);
bodyBuilder.append(crlf);
byte[] headerBytes = bodyBuilder.toString().getBytes(StandardCharsets.UTF_8);
byte[] footerBytes = (crlf + "--" + boundary + "--" + crlf).getBytes(StandardCharsets.UTF_8);
byte[] passwordBytes = new byte[0];
if (password != null && !password.isEmpty()) {
// TH-07: Password via form field is allowed.
String passwordPart = crlf + "--" + boundary + crlf
+ "Content-Disposition: form-data; name=\"password\"" + crlf
+ crlf
+ password;
passwordBytes = passwordPart.getBytes(StandardCharsets.UTF_8);
}
byte[] body = new byte[headerBytes.length + pdfBytes.length + passwordBytes.length + footerBytes.length];
int pos = 0;
System.arraycopy(headerBytes, 0, body, pos, headerBytes.length);
pos += headerBytes.length;
System.arraycopy(pdfBytes, 0, body, pos, pdfBytes.length);
pos += pdfBytes.length;
System.arraycopy(passwordBytes, 0, body, pos, passwordBytes.length);
pos += passwordBytes.length;
System.arraycopy(footerBytes, 0, body, pos, footerBytes.length);
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(PDFTRACT_URL + "/extract"))
.timeout(Duration.ofSeconds(60))
.header("Content-Type", "multipart/form-data; boundary=" + boundary)
.POST(HttpRequest.BodyPublishers.ofByteArray(body))
.build();
HttpResponse<String> response = CLIENT.send(
request, HttpResponse.BodyHandlers.ofString()
);
if (response.statusCode() != 200) {
throw new IOException(
"pdftract HTTP " + response.statusCode() + ": " + response.body()
);
}
return MAPPER.readTree(response.body());
}
public static String fullText(JsonNode data) {
List<String> parts = new ArrayList<>();
for (JsonNode page : data.get("pages")) {
for (JsonNode block : page.get("blocks")) {
parts.add(block.get("text").asText());
}
}
return String.join("\n", parts);
}
public static String pageText(JsonNode data, int pageNumber) {
for (JsonNode page : data.get("pages")) {
if (page.get("page").asInt() == pageNumber) {
List<String> parts = new ArrayList<>();
for (JsonNode block : page.get("blocks")) {
parts.add(block.get("text").asText());
}
return String.join("\n", parts);
}
}
throw new IllegalArgumentException("Page " + pageNumber + " not found");
}
public static void main(String[] args) throws Exception {
if (args.length < 1) {
System.err.println("Usage: PdftractHttp <file.pdf>");
System.exit(1);
}
JsonNode data = extract(args[0], null); // null: no password
System.out.println(fullText(data));
}
}
```
---
## 6. Rust
> **When to prefer subprocess:** CLI tools or single-threaded batch processors — zero extra dependencies beyond `serde_json`.
> **When to prefer HTTP:** Async Tokio services — `reqwest` is non-blocking and naturally fits async Rust workloads.
### Subprocess (std::process::Command)
Add to `Cargo.toml`:
```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
serde_json = "1"
```
```rust
use serde::Deserialize;
use std::process::Command;
#[derive(Debug, Deserialize)]
struct Span {
pub text: String,
pub bbox: [f64; 4],
pub font: String,
pub size: f64,
pub confidence: f64,
}
#[derive(Debug, Deserialize)]
struct Block {
pub kind: String,
pub text: String,
pub bbox: [f64; 4],
}
#[derive(Debug, Deserialize)]
struct Page {
pub page: u32,
pub spans: Vec<Span>,
pub blocks: Vec<Block>,
}
#[derive(Debug, Deserialize)]
struct Metadata {
pub title: Option<String>,
pub author: Option<String>,
pub page_count: u32,
}
#[derive(Debug, Deserialize)]
struct PdftractResult {
pub pages: Vec<Page>,
pub metadata: Metadata,
}
impl PdftractResult {
/// Concatenate all block text across every page.
pub fn full_text(&self) -> String {
self.pages
.iter()
.flat_map(|p| p.blocks.iter().map(|b| b.text.as_str()))
.collect::<Vec<_>>()
.join("\n")
}
/// Return concatenated block text for a single page (1-indexed).
pub fn page_text(&self, page_number: u32) -> Option<String> {
self.pages
.iter()
.find(|p| p.page == page_number)
.map(|p| {
p.blocks
.iter()
.map(|b| b.text.as_str())
.collect::<Vec<_>>()
.join("\n")
})
}
}
/// Extract text from a PDF via subprocess.
/// If password is provided, it is passed via env var (TH-07 compliant).
fn extract_subprocess(pdf_path: &str, password: Option<&str>) -> Result<PdftractResult, Box<dyn std::error::Error>> {
let mut cmd = Command::new("pdftract");
cmd.args(["extract", pdf_path]);
if let Some(pwd) = password {
// TH-07: Pass password via env var, NOT via --password flag.
cmd.env("PDFTRACT_PASSWORD", pwd);
}
let output = cmd.output()?;
if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
return Err(format!(
"pdftract failed (exit {:?}): {}",
output.status.code(),
stderr.trim()
)
.into());
}
let result: PdftractResult = serde_json::from_slice(&output.stdout)?;
Ok(result)
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let pdf_path = std::env::args()
.nth(1)
.ok_or("usage: program <file.pdf>")?;
let result = extract_subprocess(&pdf_path, None)?; // None: no password
println!("Title : {}", result.metadata.title.as_deref().unwrap_or("(none)"));
println!("Pages : {}", result.metadata.page_count);
println!("\n--- Full text ---");
println!("{}", result.full_text());
if let Some(text) = result.page_text(1) {
println!("\n--- Page 1 ---");
println!("{text}");
}
Ok(())
}
```
### HTTP (reqwest)
Add to `Cargo.toml`:
```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
serde_json = "1"
reqwest = { version = "0.12", features = ["multipart", "json"] }
tokio = { version = "1", features = ["full"] }
```
```rust
use reqwest::multipart;
use serde::Deserialize;
use std::path::Path;
// Re-use the same structs from the subprocess example above.
// (PdftractResult, Page, Block, Span, Metadata — copy them in)
const PDFTRACT_URL: &str = "http://localhost:8080";
/// Extract text from a PDF via HTTP.
/// If password is provided, it is sent as a multipart form field (TH-07 compliant).
async fn extract_http(pdf_path: &str, password: Option<&str>) -> Result<PdftractResult, Box<dyn std::error::Error>> {
let bytes = tokio::fs::read(pdf_path).await?;
let filename = Path::new(pdf_path)
.file_name()
.and_then(|n| n.to_str())
.unwrap_or("document.pdf")
.to_owned();
let mut form = multipart::Form::new();
let file_part = multipart::Part::bytes(bytes)
.file_name(filename)
.mime_str("application/pdf")?;
form = form.part("file", file_part);
if let Some(pwd) = password {
// TH-07: Password via form field is allowed.
form = form.text("password", pwd.to_string());
}
let client = reqwest::Client::new();
let response = client
.post(format!("{PDFTRACT_URL}/extract"))
.multipart(form)
.timeout(std::time::Duration::from_secs(60))
.send()
.await?;
if !response.status().is_success() {
let status = response.status();
let body = response.text().await.unwrap_or_default();
return Err(format!("pdftract HTTP {status}: {body}").into());
}
let result: PdftractResult = response.json().await?;
Ok(result)
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let pdf_path = std::env::args()
.nth(1)
.ok_or("usage: program <file.pdf>")?;
let result = extract_http(&pdf_path, None).await?; // None: no password
println!("{}", result.full_text());
if let Some(text) = result.page_text(1) {
println!("\n--- Page 1 ---");
println!("{text}");
}
Ok(())
}
```
---
## Parsing `--progress-json` Events
When `--progress-json` is passed, pdftract emits newline-delimited JSON objects to stderr. SDKs can parse these events to show progress bars, detect errors early, or log structured diagnostics.
### Python
```python
import json
import subprocess
import threading
from typing import Any

ProgressEvent = dict[str, Any]


def extract_with_progress(pdf_path: str) -> dict:
    """Extract while parsing progress events from stderr."""
    cmd = ["pdftract", "extract", "--progress-json", pdf_path]
    # stderr is line-buffered; each line is either JSON or a human log.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        encoding="utf-8",
    )

    # Drain stdout on a background thread. If nothing reads stdout while we
    # iterate stderr, a large JSON result fills the pipe and both sides block.
    stdout_chunks: list[str] = []
    reader = threading.Thread(
        target=lambda: stdout_chunks.append(process.stdout.read()),
        daemon=True,
    )
    reader.start()

    for line in process.stderr:
        line = line.rstrip("\n")
        if not line:
            continue
        # Try to parse as JSON; if it fails, it's a human log line.
        try:
            event: ProgressEvent = json.loads(line)
        except json.JSONDecodeError:
            print(f"[log] {line}")
            continue
        if not isinstance(event, dict):
            # Valid JSON but not an object (e.g. a bare number): treat as log.
            print(f"[log] {line}")
            continue

        event_type = event.get("event")
        if event_type == "open":
            print(f"Opening {event['path']} (fingerprint: {event['fingerprint'][:16]}...)")
        elif event_type == "page_started":
            print(f"Page {event['page']}/{event['total']}...")
        elif event_type == "page_completed":
            print(f"  → {event['span_count']} spans, {event['block_count']} blocks")
        elif event_type == "ocr_started":
            print(f"  OCR (page {event['page']}, lang={event['lang']})...")
        elif event_type == "ocr_completed":
            print(f"  OCR done in {event['duration_ms']}ms")
        elif event_type == "profile_matched":
            print(f"Profile: {event['profile']} (priority {event['priority']})")
        elif event_type == "password_received":
            # TH-07: The password value is NEVER in the event.
            print(f"Password received via {event['source']}")
        elif event_type == "completed":
            print(f"Done in {event['duration_ms']}ms, {event['page_count']} pages")
        elif event_type == "error":
            print(f"Error: {event['code']} - {event['message']}")

    reader.join()
    returncode = process.wait()
    if returncode != 0:
        raise RuntimeError(f"pdftract failed with exit {returncode}")
    return json.loads("".join(stdout_chunks))
```
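
The JSON-or-log decision made inside the loop above can also be factored into a pure function, which makes it testable without spawning pdftract. The following is a minimal sketch, assuming every progress event is a JSON object carrying a string `event` field (as the event names handled above suggest). The helper name `classify_stderr_line` and the sample lines are hypothetical, not part of pdftract.

```python
import json
from typing import Any, Optional


def classify_stderr_line(line: str) -> tuple[str, Optional[dict[str, Any]]]:
    """Classify one stderr line as ("event", payload), ("log", None) or ("blank", None).

    Assumption: a progress event is a JSON object with a string "event" field.
    Plain text, JSON scalars or arrays, and objects without "event" are all
    treated as human-readable log lines.
    """
    text = line.strip()
    if not text:
        return ("blank", None)
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return ("log", None)
    if isinstance(payload, dict) and isinstance(payload.get("event"), str):
        return ("event", payload)
    return ("log", None)


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    samples = [
        '{"event": "page_started", "page": 1, "total": 3}',
        "warning: font fallback",
        "42",
    ]
    for sample in samples:
        kind, payload = classify_stderr_line(sample)
        print(kind, payload)
```

Checking for a dict before reading `event` matters because a log line such as `42` is valid JSON; calling `.get` on the parsed integer would otherwise raise `AttributeError`.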
### Node.js
```js
import { spawn } from "node:child_process";
import { createInterface } from "node:readline";

function extractWithProgress(pdfPath) {
  return new Promise((resolve, reject) => {
    // spawn streams output; execFile buffers it and enforces maxBuffer,
    // which a large JSON result can exceed.
    const proc = spawn("pdftract", ["extract", "--progress-json", pdfPath], {
      stdio: ["ignore", "pipe", "pipe"],
    });
    const chunks = [];
    proc.stdout.on("data", (chunk) => chunks.push(chunk));
    proc.on("error", reject); // e.g. binary not found on PATH

    // readline reassembles full lines; raw "data" chunks can split a line.
    const lines = createInterface({ input: proc.stderr });
    lines.on("line", (line) => {
      if (!line.trim()) return;
      let event;
      try {
        event = JSON.parse(line);
      } catch {
        // Not JSON: human log line
        console.log(`[log] ${line}`);
        return;
      }
      switch (event?.event) {
        case "open":
          console.log(`Opening ${event.path}`);
          break;
        case "page_started":
          console.log(`Page ${event.page}/${event.total}...`);
          break;
        case "page_completed":
          console.log(`  → ${event.span_count} spans, ${event.block_count} blocks`);
          break;
        case "ocr_started":
          console.log(`  OCR (page ${event.page}, lang=${event.lang})...`);
          break;
        case "ocr_completed":
          console.log(`  OCR done in ${event.duration_ms}ms`);
          break;
        case "profile_matched":
          console.log(`Profile: ${event.profile} (priority ${event.priority})`);
          break;
        case "password_received":
          // TH-07: The password value is NEVER in the event.
          console.log(`Password received via ${event.source}`);
          break;
        case "completed":
          console.log(`Done in ${event.duration_ms}ms, ${event.page_count} pages`);
          break;
        case "error":
          console.error(`Error: ${event.code} - ${event.message}`);
          break;
      }
    });

    proc.on("close", (code) => {
      if (code !== 0) {
        reject(new Error(`pdftract failed with exit ${code}`));
        return;
      }
      try {
        resolve(JSON.parse(Buffer.concat(chunks).toString("utf8")));
      } catch (err) {
        reject(err);
      }
    });
  });
}
```
### Rust
```rust
use serde_json::Value;
use std::io::{BufRead, BufReader, Read};
use std::process::{Command, Stdio};
use std::thread;

// PdftractResult is the struct defined in the subprocess example above.

/// Read a string field from an event, or "?" if it is missing.
fn str_field<'a>(event: &'a Value, key: &str) -> &'a str {
    event.get(key).and_then(Value::as_str).unwrap_or("?")
}

/// Read an unsigned integer field from an event, or 0 if it is missing.
fn num_field(event: &Value, key: &str) -> u64 {
    event.get(key).and_then(Value::as_u64).unwrap_or(0)
}

fn extract_with_progress(pdf_path: &str) -> Result<PdftractResult, Box<dyn std::error::Error>> {
    let mut child = Command::new("pdftract")
        .args(["extract", "--progress-json", pdf_path])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Drain stdout on a background thread. If nothing reads stdout while we
    // iterate stderr, a large JSON result fills the pipe and both sides block.
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let stdout_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        stdout.read_to_end(&mut buf).map(|_| buf)
    });

    let stderr = child.stderr.take().expect("stderr was piped");
    for line in BufReader::new(stderr).lines() {
        let line = line?;
        if line.is_empty() {
            continue;
        }
        // Not JSON: human log line
        let Ok(event) = serde_json::from_str::<Value>(&line) else {
            println!("[log] {line}");
            continue;
        };
        match event.get("event").and_then(Value::as_str) {
            Some("open") => println!("Opening {}", str_field(&event, "path")),
            Some("page_started") => println!(
                "Page {}/{}...",
                num_field(&event, "page"),
                num_field(&event, "total")
            ),
            Some("page_completed") => println!(
                "  → {} spans, {} blocks",
                num_field(&event, "span_count"),
                num_field(&event, "block_count")
            ),
            Some("ocr_started") => println!(
                "  OCR (page {}, lang={})...",
                num_field(&event, "page"),
                str_field(&event, "lang")
            ),
            Some("ocr_completed") => {
                println!("  OCR done in {}ms", num_field(&event, "duration_ms"))
            }
            Some("profile_matched") => println!(
                "Profile: {} (priority {})",
                str_field(&event, "profile"),
                num_field(&event, "priority")
            ),
            // TH-07: The password value is NEVER in the event.
            Some("password_received") => {
                println!("Password received via {}", str_field(&event, "source"))
            }
            Some("completed") => println!(
                "Done in {}ms, {} pages",
                num_field(&event, "duration_ms"),
                num_field(&event, "page_count")
            ),
            Some("error") => eprintln!(
                "Error: {} - {}",
                str_field(&event, "code"),
                str_field(&event, "message")
            ),
            // Valid JSON without a known "event" value: treat as a log line.
            _ => println!("[log] {line}"),
        }
    }

    let status = child.wait()?;
    let stdout = stdout_reader
        .join()
        .map_err(|_| "stdout reader thread panicked")??;
    if !status.success() {
        // stderr was already consumed (and printed) by the loop above.
        return Err(format!("pdftract failed (exit {:?})", status.code()).into());
    }
    Ok(serde_json::from_slice(&stdout)?)
}
```
---
## 7. Shell / Bash
> **When to prefer direct invocation:** shell scripts, cron jobs, CI pipelines, or any context where you have direct access to the binary.
> **When to prefer curl:** when pdftract is running as a shared service on another host, inside a container, or when you want to avoid installing the binary locally.
### Direct Invocation
```bash
#!/usr/bin/env bash
set -euo pipefail
PDF="${1:?Usage: $0 <file.pdf>}"
# --- JSON output ---
json=$(pdftract extract "$PDF")
# Full text via jq: collect all block text across all pages
full_text=$(echo "$json" | jq -r '[.pages[].blocks[].text] | join("\n")')
# Per-page text (page 1)
page1_text=$(echo "$json" | jq -r '.pages[] | select(.page == 1) | [.blocks[].text] | join("\n")')
# Metadata
title=$(echo "$json" | jq -r '.metadata.title // "(none)"')
pages=$(echo "$json" | jq -r '.metadata.page_count')
echo "Title : $title"
echo "Pages : $pages"
echo
echo "--- Full text ---"
echo "$full_text"
echo
echo "--- Page 1 ---"
echo "$page1_text"
# --- Plain text output (no jq needed) ---
plain=$(pdftract extract "$PDF" --text)
echo
echo "--- Plain text (--text flag) ---"
echo "$plain"
# --- Write JSON to file ---
pdftract extract "$PDF" --output "/tmp/$(basename "$PDF" .pdf).json"
echo "JSON written to /tmp/$(basename "$PDF" .pdf).json"
```
### curl (HTTP)
```bash
#!/usr/bin/env bash
set -euo pipefail
PDF="${1:?Usage: $0 <file.pdf>}"
PDFTRACT_URL="${PDFTRACT_URL:-http://localhost:8080}"
# POST the PDF and capture the response; fail fast on HTTP errors.
json=$(curl --silent --show-error --fail \
--max-time 60 \
-F "file=@${PDF};type=application/pdf" \
"${PDFTRACT_URL}/extract")
# Full text via jq
full_text=$(echo "$json" | jq -r '[.pages[].blocks[].text] | join("\n")')
# Per-page text (page 1)
page1_text=$(echo "$json" | jq -r '.pages[] | select(.page == 1) | [.blocks[].text] | join("\n")')
# Metadata
title=$(echo "$json" | jq -r '.metadata.title // "(none)"')
pages=$(echo "$json" | jq -r '.metadata.page_count')
echo "Title : $title"
echo "Pages : $pages"
echo
echo "--- Full text ---"
echo "$full_text"
echo
echo "--- Page 1 ---"
echo "$page1_text"
# --- Save raw JSON ---
output_file="/tmp/$(basename "$PDF" .pdf).json"
echo "$json" > "$output_file"
echo "JSON saved to $output_file"
# --- Health check before submitting ---
# curl -sf "${PDFTRACT_URL}/health" > /dev/null \
# || { echo "pdftract serve is not running at ${PDFTRACT_URL}"; exit 1; }
```
### Batch processing with xargs / parallel
```bash
#!/usr/bin/env bash
# Process every PDF in a directory, writing one JSON file per PDF.
# Runs up to 4 extractions at a time with xargs -P; GNU parallel can be
# substituted for xargs if it is installed.
PDF_DIR="${1:?Usage: $0 <dir>}"
OUT_DIR="${2:-/tmp/pdftract-out}"
mkdir -p "$OUT_DIR"
extract_one() {
local pdf="$1"
local out="$OUT_DIR/$(basename "$pdf" .pdf).json"
pdftract extract "$pdf" --output "$out" && echo "OK $pdf" || echo "ERR $pdf"
}
export -f extract_one
export OUT_DIR
find "$PDF_DIR" -name "*.pdf" -print0 \
| xargs -0 -P 4 -I{} bash -c 'extract_one "$@"' _ {}
```