pdftract/benches/competitors/run-pdfminer.sh
jedarden 857f928732 feat(pdftract-5omc): implement SDK conformance test runner pattern
Add the conformance test runner pattern that every SDK will implement
to validate itself against the shared test suite.

- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
  * Full test suite loader and executor
  * Comparison engine with min/max, string constraints, tolerances
  * Skip logic for unsupported features and schema versions
  * Report generation in JSON format

- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
  * pdftract compare - Compare actual vs expected with tolerances
  * Cross-language comparison tool to avoid reimplementations

- Documentation (docs/conformance/sdk-contract.md)
  * Complete pattern specification with pseudocode
  * Per-language runner locations
  * CI integration requirements

- Python reference stub (tests/python-conformance/test_conformance.py)
  * Full pytest-based implementation following the pattern

Closes: pdftract-5omc
2026-05-18 01:22:23 -04:00


#!/bin/bash
# Wrapper for pdfminer.six text extraction
# Usage: run-pdfminer.sh <pdf-file>
set -euo pipefail
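# Fail with a usage message if no argument is given (set -u would otherwise
# abort with a bare "unbound variable" error).
PDF_FILE="${1:?Usage: run-pdfminer.sh <pdf-file>}"
if [ ! -f "$PDF_FILE" ]; then
    echo "ERROR: File not found: $PDF_FILE" >&2
    exit 1
fi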
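# Run pdfminer.six high-level text extraction via the Python API
# (pdfminer.high_level.extract_text), not the pdf2txt.py CLI.
# The extracted text is discarded; only the extraction work itself matters here.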
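# The path is passed as an argument (sys.argv[1]) instead of being interpolated
# into the Python source, so quotes in the filename cannot break the snippet.
python3 -c "
import sys
from pdfminer.high_level import extract_text
try:
    text = extract_text(sys.argv[1])
    # Write to stdout to ensure we process the full extraction
    sys.stdout.write(text)
except Exception as e:
    sys.stderr.write(f'ERROR: {e}\n')
    sys.exit(1)
" "$PDF_FILE" > /dev/null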