jedarden d9d21df157 docs(pdftract-653ah): add runbook integration for pdftract doctor

- Created docs/operations/manual-platform-smoke.md with comprehensive
  smoke test runbook for KU-12 quarterly manual platform testing
- Added troubleshooting table covering all 14 doctor checks
- Cross-referenced runbook from installation.md and quickstart.md
- Added CI gate test (doctor_runbook_coverage.rs) to verify
  troubleshooting table completeness

Acceptance criteria:
✓ Step 1: pdftract doctor as first section in runbook
✓ Troubleshooting table covers all FAIL-capable checks
✓ installation.md mentions pdftract doctor with runbook link
✓ quickstart.md uses pdftract doctor as first example command
✓ CI gate parses runbook and asserts all checks are present
✓ mdBook build succeeds
✓ No broken internal links

Closes: pdftract-653ah

2026-05-24 13:26:31 -04:00

12 KiB

Raw Blame History

Manual Platform Smoke Test (KU-12)

Purpose: This runbook is the canonical smoke test executed before each milestone release on at least one physical macOS machine and one Windows VM. Per KU-12, Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release.

Execution frequency: Quarterly (per KU-12) or before each milestone release.

Executor: Release lead or designated QA engineer.

Step 1: Validate the environment (pdftract doctor)

Before running any extractions, validate the deployment with:

pdftract doctor

Expected output on a fully-provisioned host

Check                         Status  Detail
────────────────────────────────────────────────────────────────────────────────
pdftract binary               OK      0.1.0 (git: abc1234)
Features: OCR, FULL_RENDER, PROFILES, SERVE, MCP, INSPECT, GREP, CACHE, RECEIPTS, MARKDOWN
tesseract install             OK      tesseract 5.3.0 found (major >= 5)
tesseract languages           OK      All required languages present: ["eng", "osd"]
leptonica install             OK      leptonica 1.82.0 found (>= 1.79)
libtiff                       OK      libtiff 4.4.0 found
libopenjp2                    OK      libopenjp2 2.5.0 found
pdfium native lib             OK      pdfium 6555 found (loaded from /usr/lib/x86_64-linux-gnu/libpdfium.so)
network reachability          OK      Network reachable: 200 in 0.23s
cache directory               OK      Layout version 1 (current) at /home/user/.cache/pdftract
profile search path           OK      All 9 profile(s) valid at /home/user/.config/pdftract/profiles
ulimit -n                     OK      File descriptor limit: 65536
available RAM                 OK      16384 MiB available
system locale                 OK      Locale 'en_US.UTF-8' (UTF-8)
temp dir writable             OK      Temp dir writable at /tmp
────────────────────────────────────────────────────────────────────────────────
15 OK, 0 WARN, 0 FAIL

Exit policy

Exit code 0: All checks OK or WARN (no FAIL). Deployment proceeds.
Exit code 1: At least one check reports FAIL. Deployment blocked; resolve FAIL rows before proceeding.

Any FAIL row blocks deployment. See the troubleshooting table below for each FAIL.

For CI integration

Use --json for machine-consumable output:

pdftract doctor --json | jq -e '.summary.fail == 0' || exit 1

Example JSON output:

{
  "summary": {
    "ok": 14,
    "warn": 1,
    "fail": 0,
    "total": 15
  },
  "checks": [
    {
      "name": "pdftract binary",
      "status": "ok",
      "detail": "0.1.0 (git: abc1234)\nFeatures: OCR, FULL_RENDER, ..."
    },
    ...
  ]
}

Interpreting WARN rows

WARN does not block deployment but should be tracked. Recommended action: open a tracking ticket per WARN row for resolution in the next patch release. Common WARN scenarios:

tesseract install (WARN): Tesseract 4.x detected. OCR results may have minor glyph errors. Plan upgrade to 5.x.
cache directory (WARN): Low disk space (< 1 GiB free). Monitor cache growth; add storage if needed.
ulimit -n (WARN): File descriptor limit between 512–1023. May hit limits with batch operations. Increase to 4096+.
available RAM (WARN): Less than 256 MiB free. Risk of OOM with large PDFs. Close other processes or add RAM.
system locale (WARN): Non-UTF-8 locale (e.g., C or POSIX). May cause encoding issues with non-ASCII text. Export LANG=en_US.UTF-8.

Troubleshooting

Check	Common cause	Fix
pdftract binary (FAIL)	Corrupted binary or build artifact	Reinstall: `cargo install pdftract --force` or `pip install --force-reinstall pdftract`
tesseract install (FAIL)	binary missing	`apt install tesseract-ocr` (Debian/Ubuntu) or `brew install tesseract` (macOS)
tesseract install (FAIL)	major <= 3	Upgrade to Tesseract 5.x via package manager
tesseract languages (FAIL)	eng pack missing	`apt install tesseract-ocr-eng` (Debian/Ubuntu) or `brew install tesseract-lang` (macOS)
tesseract languages (WARN)	optional langs missing	Install requested langs: `apt install tesseract-ocr-<lang>`
leptonica install (FAIL)	dev headers missing	`apt install libleptonica-dev` (Debian/Ubuntu) or `brew install leptonica` (macOS)
leptonica install (WARN)	older version (< 1.79)	Upgrade via package manager; WARN may be acceptable for basic OCR
libtiff (FAIL)	CCITT decoding library missing	`apt install libtiff-dev` (Debian/Ubuntu) or `brew install libtiff` (macOS)
libopenjp2 (FAIL)	JPEG2000 decoding library missing	`apt install libopenjp2-7-dev` (Debian/Ubuntu) or `brew install openjpeg` (macOS)
pdfium native lib (FAIL)	PDFium library not found	Install pdfium-render dependencies or compile with bundled PDFium
pdfium native lib (WARN)	older version (< 6555)	Upgrade PDFium; WARN may be acceptable for basic rendering
network reachability (FAIL)	no internet or firewall blocking	Check network connectivity; ensure HTTPS outbound is allowed
network reachability (WARN)	slow response (> 5s) or 3xx redirect	Check proxy settings; 3xx may indicate redirect loop
cache directory (FAIL)	not writable or layout incompatible	Check permissions: `ls -ld ~/.cache/pdftract`; fix ownership or recreate cache
cache directory (WARN)	low disk space (< 1 GiB free)	Clear cache: `pdftract cache clear` or add disk space
profile search path (FAIL)	YAML parse errors or forbidden keys	Run `pdftract profiles validate` for details; fix YAML syntax or remove secrets
profile search path (WARN)	directory empty or no YAML files	Add profiles to `~/.config/pdftract/profiles/` or specify `--profile-dir`
ulimit -n (FAIL)	< 512 (too low for many files)	Increase: `ulimit -n 4096` (temporary) or edit `/etc/security/limits.conf` (permanent)
ulimit -n (WARN)	512–1023 (may hit limits)	Increase to 4096+ for batch operations
available RAM (FAIL)	< 128 MiB free (risk of OOM)	Close other processes or add RAM
available RAM (WARN)	128–255 MiB free (low memory)	Monitor memory usage; add RAM if processing large PDFs
system locale (FAIL)	locale unset (LANG/LC_ALL empty)	Export: `export LANG=en_US.UTF-8` or add to `~/.bashrc`
system locale (WARN)	non-UTF-8 locale (C, POSIX, ISO-8859-1)	Export: `export LANG=en_US.UTF-8`
temp dir writable (FAIL)	TMPDIR/TMP/TEMP not writable	Check permissions: `ls -ld /tmp`; fix ownership or set `export TMPDIR=/var/tmp`
temp dir writable (WARN)	low disk space (< 100 MiB free)	Clear temp files or add disk space

Step 2: Verify extraction (basic smoke test)

After pdftract doctor passes with 0 FAIL, run a basic extraction smoke test:

# Use a fixture from the test suite
git clone https://github.com/jedarden/pdftract.git
cd pdftract
pdftract extract tests/fixtures/hello-world.pdf --output /tmp/smoke-test.json

# Verify JSON is valid
jq . /tmp/smoke-test.json > /dev/null && echo "PASS: extraction produced valid JSON"

Expected result: Valid JSON with at least pages, metadata, and spans keys.

Failure action: If extraction fails or produces invalid JSON, open a bug report with:

Platform (Linux/macOS/Windows, version)
pdftract --version output
pdftract doctor --json output
The fixture file used
The error message or invalid JSON

Step 3: Verify OCR (if ocr feature enabled)

If the binary was built with the ocr feature, test OCR on a scanned document:

# Use a scanned fixture
pdftract extract tests/fixtures/scanned-invoice.pdf --ocr --output /tmp/ocr-test.json

# Verify text was extracted
jq -e '.pages[0].spans | length > 0' /tmp/ocr-test.json && echo "PASS: OCR extracted text"

Expected result: JSON with extracted text from the scanned image.

Failure action: Check tesseract --version and tesseract --list-langs. If Tesseract works from CLI but pdftract fails, file a bug.

Step 4: Verify profiles (if profiles feature enabled)

If the binary was built with the profiles feature, test profile-based extraction:

# List available profiles
pdftract profiles list

# Run extraction with auto-detection
pdftract extract tests/fixtures/invoice.pdf --auto --output /tmp/profile-test.json

# Verify profile was applied
jq -e '.metadata.profile' /tmp/profile-test.json && echo "PASS: profile applied"

Expected result: JSON includes metadata.profile key with detected profile name.

Failure action: Check pdftract doctor output for profile search path check. Ensure profiles are in the correct directory.

Step 5: Verify cache (if cache feature enabled)

If the binary was built with the cache feature, test cache behavior:

# First extraction (cache miss)
time pdftract extract tests/fixtures/large-document.pdf --output /tmp/cache-test-1.json

# Second extraction (cache hit)
time pdftract extract tests/fixtures/large-document.pdf --output /tmp/cache-test-2.json

# Verify both outputs are identical
diff /tmp/cache-test-1.json /tmp/cache-test-2.json && echo "PASS: cache produced consistent results"

# Check cache stats
pdftract cache stats

Expected result: Second extraction is significantly faster; diff produces no output; cache stats reports > 0 entries.

Failure action: Check pdftract doctor output for cache directory check. Verify cache directory is writable and has sufficient space.

Platform-Specific Notes

macOS

Tesseract: brew install tesseract installs the binary; language packs are via brew install tesseract-lang.
libtiff/libopenjp2: brew install libtiff openjpeg.
ulimit: macOS default is often 256. Increase: ulimit -n 4096 (temporary) or add to ~/.zshrc.
locale: macOS default is often UTF-8. Verify with locale.

Windows

Tesseract: Install from UB Mannheim's builds.
libtiff/libopenjp2: Included with the pre-built binary (static linking).
ulimit: Not applicable on Windows (check is skipped).
locale: Set via Control Panel → Region → Administrative → Language for non-Unicode programs.

Linux

Tesseract: apt install tesseract-ocr tesseract-ocr-eng (Debian/Ubuntu).
libtiff/libopenjp2: apt install libtiff5-dev libopenjp2-7-dev.
ulimit: Check with ulimit -n. Increase via /etc/security/limits.conf or systemd drop-in.
locale: Set via /etc/locale.gen and locale-gen.

Completion Criteria

The smoke test passes when:

pdftract doctor reports 0 FAIL (WARN is acceptable if documented above).
Basic extraction produces valid JSON.
OCR extraction (if applicable) produces text from scanned images.
Profile extraction (if applicable) applies a profile.
Cache extraction (if applicable) shows speedup on second run.
All steps complete without crashes or hangs.

The smoke test fails when:

Any FAIL row in pdftract doctor.
Extraction crashes or produces invalid JSON.
OCR produces no text (all-empty spans).
Profile detection fails (no profile applied).
Cache produces inconsistent results between runs.

On failure: Open a bug report with the platform, pdftract --version, pdftract doctor --json, and reproduction steps. The milestone release is blocked until the failure is resolved.

References

Bead: pdftract-653ah (runbook integration)
Plan: Phase 6.10 pdftract doctor (lines 2479–2528 in /docs/plan/plan.md)
Sibling beads:
- pdftract-XXXXX (6.10.1: check registry)
- pdftract-XXXXX (6.10.3: exit code contract)
KU-12: Cross-platform test limitation (manual smoke test per release)

12 KiB Raw Blame History Unescape Escape