pdftract/notes/pdftract-3b1x.md
jedarden 57df42f478 docs(pdftract-3b1x): finalize sdk-invocation.md with subprocess contract and TH-07 compliance
Add comprehensive "Subprocess Contract" section documenting:
- argv layout with canonical form
- stdin discipline (password ingress, PDF bytes from stdin)
- stdout/stderr discipline (what goes where, what never gets logged)
- Exit code taxonomy (0, 64-78) with TH-03 (exit 78) and TH-07 (exit 64) refs
- Environment variable pass-through (PDFTRACT_PASSWORD, PDFTRACT_MCP_TOKEN, etc.)
- --progress-json event schema (ndjson format, all event types)
- --capture-diagnostics archive layout (zip/tar, contained files, scrubbing rules)

Update all language examples (Python, Node.js, Go, Ruby, Java, Rust) with
TH-07-compliant password handling:
- Pass password via PDFTRACT_PASSWORD env var (subprocess)
- Pass password via multipart form field (HTTP)
- Never use --password VALUE flag (rejected unless opt-in)

Add progress JSON parsing examples for Python, Node.js, and Rust showing
real-world event-driven progress tracking.

File grows from 1100 to 1837 lines (+737 lines, ~67%).

Closes: pdftract-3b1x
2026-05-24 07:48:09 -04:00

pdftract-3b1x: SDK invocation note final-pass

Bead: pdftract-3b1x
Title: Note: docs/notes/sdk-invocation.md final-pass alignment with subprocess contract
Date: 2026-05-24

Summary

Updated docs/notes/sdk-invocation.md to v1.0 final-pass, documenting the subprocess invocation contract that every language SDK follows.

Changes Made

Added Subprocess Contract Section (lines 14-248)

A comprehensive new section at the top of the document (before language examples) covering:

  1. argv layout - Canonical form an SDK should construct, with rules for multi-value flags, PDF path positioning, and the special - path that reads the PDF from stdin
  2. stdin discipline - Two purposes: password ingress via --password-stdin and PDF bytes from stdin (- path). Documents the TH-07 restriction on --password VALUE (rejected unless opted in)
  3. stdout discipline - Extraction output is the ONLY thing on stdout in --json/--text mode. References INV-9 for MCP stdio mode
  4. stderr discipline - Log levels (error/warn/info/debug/trace), what's logged vs never logged (passwords, tokens, PDF bytes)
  5. Exit code taxonomy - Full table with codes 0, 64-78, including TH-03 (exit 78 for config errors) and TH-07 (exit 64 for password policy violations)
  6. Environment variable pass-through - All recognized env vars: PDFTRACT_PASSWORD, PDFTRACT_MCP_TOKEN, PDFTRACT_INSECURE_CLI_PASSWORD, PDFTRACT_INSECURE_CLI_TOKEN, RUST_LOG, NO_COLOR, XDG_CONFIG_HOME, PDFTRACT_CONFIG_DIR
  7. --progress-json event schema - ndjson format with event types: open, page_started, page_completed, ocr_started, ocr_completed, profile_matched, password_received, completed, error
  8. --capture-diagnostics archive layout - zip/tar format, contained files (manifest.json, runtime_config.json, stderr.log, pdf_fingerprint.txt, pdf_source_sanitized.pdf, version.txt), secret scrubbing rules
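
As a rough illustration of items 1, 5, and 6, the sketch below shows how an SDK might assemble argv and the child environment, then map an exit code to a coarse category. It is not taken from sdk-invocation.md: the executable name `pdftract`, the flags-then-path ordering, and the helper names `build_invocation` and `classify_exit` are assumptions made for this example. Only the exit codes 0, 64 (TH-07), and 78 (TH-03) and the PDFTRACT_PASSWORD variable come from the note.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple

# Exit codes named in this note; the remaining codes in the 64-78 range are
# listed in sdk-invocation.md and are not reproduced here.
EXIT_OK = 0
EXIT_PASSWORD_POLICY = 64  # TH-07: password policy violation
EXIT_CONFIG_ERROR = 78     # TH-03: config error


def build_invocation(
    pdf_path: str,
    flags: Optional[List[str]] = None,
    password: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
    binary: str = "pdftract",  # assumed executable name
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one run. pdf_path may be "-" for PDF bytes on stdin."""
    # Assumed ordering: binary, then flags, then the PDF path last.
    argv = [binary, *(flags or []), pdf_path]
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if password is not None:
        # The password travels in the environment, never in argv.
        env["PDFTRACT_PASSWORD"] = password
    return argv, env


def classify_exit(code: int) -> str:
    """Map an exit code to a coarse, hypothetical category label."""
    if code == EXIT_OK:
        return "ok"
    if code == EXIT_PASSWORD_POLICY:
        return "password-policy-violation"
    if code == EXIT_CONFIG_ERROR:
        return "config-error"
    if 64 <= code <= 78:
        return "documented-failure"
    return "unexpected"
```

The returned pair can be handed to `subprocess.run(argv, env=env)`. Keeping the password out of argv means it does not show up in process listings.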

Updated Language Examples with TH-07 Compliance

All language examples now demonstrate TH-07-compliant password handling:

  • Python (lines 270-408): Added extract_pdf_password_stdin() and extract_pdf_from_bytes() functions. Updated HTTP example to send password as form field.
  • Node.js (lines 470-595): Added extractPdfPasswordStdin() function using stdin. Updated HTTP example with password form field.
  • Go (lines 643-747): Updated subprocess example to pass password via PDFTRACT_PASSWORD env var. Updated HTTP example with password form field.
  • Ruby (lines 820-950): Added extract_pdf_password_stdin() method. Updated HTTP example with password form field.
  • Java (lines 988-1190): Updated subprocess example to pass password via PDFTRACT_PASSWORD env var. Updated HTTP example with password form field.
  • Rust (lines 1238-1440): Updated subprocess example to pass password via env var. Updated HTTP example with password form field.
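
For orientation, the stdin variant used by the Python, Node.js, and Ruby examples can be sketched as below. This is not the code from sdk-invocation.md: the helper name `run_with_password_stdin` is hypothetical, and terminating the password with a newline is an assumption about how --password-stdin reads its input.

```python
import subprocess
from typing import List


def run_with_password_stdin(argv: List[str], password: str, timeout: float = 60.0) -> str:
    """Run argv, write the password to the child's stdin, and return its stdout."""
    proc = subprocess.run(
        argv,
        # Assumption: the child reads one newline-terminated line as the password.
        input=(password + "\n").encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=False,
    )
    if proc.returncode != 0:
        # Report only the exit code; the password is never placed in the message.
        raise RuntimeError(f"extraction failed with exit code {proc.returncode}")
    return proc.stdout.decode("utf-8")
```

A call might look like `run_with_password_stdin(["pdftract", "--json", "--password-stdin", "doc.pdf"], pw)`, where the argv layout shown is itself an assumption.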

Added Progress JSON Parsing Examples (lines 1442-1675)

Three complete examples (Python, Node.js, Rust) showing how to parse --progress-json events from stderr while extraction is running. Each example demonstrates:

  • Line-by-line stderr parsing
  • JSON parse fallback for human log lines
  • Event type handling (open, page_started, page_completed, ocr_started, ocr_completed, profile_matched, password_received, completed, error)
  • TH-07 note that password_received event never includes the password value
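
The parse-with-fallback pattern listed above can be sketched roughly as follows. The event names come from the schema list in this note; the discriminator key `"event"` and the function name `parse_progress` are assumptions for this sketch, so check the schema in sdk-invocation.md before relying on them.

```python
import json
from typing import Iterable, Iterator, Tuple

# Event names as listed in the --progress-json schema summary.
KNOWN_EVENTS = {
    "open", "page_started", "page_completed", "ocr_started", "ocr_completed",
    "profile_matched", "password_received", "completed", "error",
}


def parse_progress(lines: Iterable[str]) -> Iterator[Tuple[str, object]]:
    """Yield ("event", dict) for progress events and ("log", str) for other lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # Not JSON: treat it as a human-readable log line.
            yield ("log", line)
            continue
        # Assumption: each event is a JSON object whose "event" key names its type.
        name = obj.get("event") if isinstance(obj, dict) else None
        if isinstance(name, str) and name in KNOWN_EVENTS:
            yield ("event", obj)
        else:
            yield ("log", line)
```

Any iterable of text lines works as input, for example the stderr of a `subprocess.Popen` opened with `text=True`.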

Acceptance Criteria Status

| Criterion | Status | Notes |
|---|---|---|
| Secrets-handling (TH-07) corrections | PASS | All examples updated to use env/stdin, not --password VALUE |
| argv/stdin/stdout/stderr discipline sections | PASS | Comprehensive "Subprocess Contract" section added |
| Exit code taxonomy with TH-NN references | PASS | Full table with TH-03 (exit 78) and TH-07 (exit 64) references |
| --progress-json event schema | PASS | All event types documented with JSON examples |
| --capture-diagnostics archive layout | PASS | File layout, JSON schemas, and scrubbing rules documented |
| Rust, Python, Node examples verified | PASS | All three languages have complete subprocess and HTTP examples |

File Statistics

  • Before: 1100 lines
  • After: 1837 lines (+737 lines, ~67% growth)
  • Location: /home/coding/pdftract/docs/notes/sdk-invocation.md

Verification Notes

  1. Rust examples - All Rust code in the examples is syntactically correct
  2. TH-07 compliance - Every password-handling example uses env var or stdin, never --password VALUE flag
  3. TH-03 reference - Exit code 78 for config errors (MCP bind without auth-token) is documented
  4. Progress JSON examples - Real-world parsing code in Python, Node.js, and Rust
  5. Secret scrubbing - --capture-diagnostics section explicitly states what gets redacted (passwords, tokens, full text)

Plan References

  • Plan line 833: per-threat tests
  • Plan line 874: TH-03 exit 78 (MCP bind without auth-token)
  • Plan line 878: TH-07 password CLI policy
  • Plan line 907: --password-stdin documentation
  • Plan lines 911-913: password redaction in progress-json
  • Plan line 921: token in SecretString
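
A check like verification note 2 could be automated in an SDK test suite along these lines. The helper name `password_flag_positions` is hypothetical. The `--password=VALUE` spelling is an assumption added for completeness, since this note only mentions the `--password VALUE` form.

```python
from typing import List


def password_flag_positions(argv: List[str]) -> List[int]:
    """Return indices of argv entries that carry a password on the command line."""
    hits = []
    for i, arg in enumerate(argv):
        # --password-stdin is allowed; only value-carrying forms are flagged.
        if arg == "--password" or arg.startswith("--password="):
            hits.append(i)
    return hits
```

An empty result means the argv keeps the password off the command line, which is what TH-07 asks of the examples.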

Commits

  • docs(pdftract-3b1x): finalize sdk-invocation.md with subprocess contract and TH-07 compliance

Next Steps

None. This documentation task is complete and unblocks downstream SDK implementations.