Competitive Benchmarks

This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.

Purpose

Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:

pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
Performance regressions are caught in CI before merge
Competitive positioning is tracked over time

Corpus

The benchmark corpus consists of 50 representative PDFs:

25 vector PDFs (corpus/vector/) - Text-based PDFs where pdftract should excel
25 raster PDFs (corpus/raster/) - Scanned documents requiring OCR

All documents are committed to the repository; the corpus totals about 10 MB.

Tools

All competitor versions are pinned in requirements.txt to ensure baseline stability:

pdfminer.six==20231228
pypdf==4.2.0
pdfplumber==0.11.0

Updates to these versions require a deliberate PR with manual baseline refresh.
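
Because baseline stability depends on exact pins, it can help to check that nothing in requirements.txt has drifted to a range specifier. The following is a minimal sketch, not part of the repository; the helper name `unpinned_requirements` is hypothetical, and it only handles simple `name==version` lines with optional `#` comments.

```python
def unpinned_requirements(text):
    """Return requirement lines that are not exact `name==version` pins."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue  # skip blank and comment-only lines
        name, sep, version = line.partition("==")
        if not sep or not name.strip() or not version.strip():
            bad.append(line)
    return bad


# The three pins listed above, as they would appear in requirements.txt.
pinned = "pdfminer.six==20231228\npypdf==4.2.0\npdfplumber==0.11.0\n"
print(unpinned_requirements(pinned))  # an empty list means every line is pinned
```

A check like this could run before the benchmarks so that an accidental `>=` never silently changes the competitor baseline.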

Running Benchmarks Locally

Prerequisites

# Install hyperfine
apt-get install hyperfine

# Install competitor tools
pip install -r requirements.txt

# Ensure pdftract is in PATH
which pdftract

Quick Run

cd benches/competitors
./run-benchmarks.sh

Custom Baseline

BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh

CI Integration

The bench-matrix step in .ci/argo-workflows/pdftract-ci.yaml runs these benchmarks on every PR:

Installs hyperfine and competitor tools
Downloads the pdftract binary artifact from build-matrix
Runs the full benchmark suite
Checks regression and 10x-faster gates
Publishes benchmark-results.json as an artifact
Posts a formatted summary as a PR comment

Gates

Regression Gate

Compares pdftract's geometric mean time against the baseline (benches/baselines/main.json):

Threshold: 10% regression
Baseline source: git show main:benches/baselines/main.json
Failure: PR is blocked if regression > 10%
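
The gate arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the CI script itself: the function names are hypothetical, and the baseline is assumed to be the parsed content of benches/baselines/main.json with a `pdftract_geomean` field.

```python
import json


def regression_pct(current_geomean, baseline_geomean):
    """Percent slowdown relative to the baseline (negative means faster)."""
    return (current_geomean / baseline_geomean - 1.0) * 100.0


def regression_gate_passes(current_geomean, baseline, threshold_pct=10.0):
    """Pass unless the regression exceeds the threshold ("blocked if regression > 10%")."""
    return regression_pct(current_geomean, baseline["pdftract_geomean"]) <= threshold_pct


# In CI the JSON text would come from `git show main:benches/baselines/main.json`.
baseline = json.loads('{"pdftract_geomean": 10.0}')
print(regression_gate_passes(10.5, baseline))  # 5% slower than baseline: True
print(regression_gate_passes(12.0, baseline))  # 20% slower than baseline: False
```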

10x-Faster Gate

Ensures pdftract maintains its speed advantage:

Threshold: pdftract_geomean / pdfminer_geomean <= 0.1
Scope: Vector PDFs only (where pdftract should excel)
Failure: PR is blocked if ratio > 0.1 (less than 10x faster)
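
As a rough sketch of this check (function names are hypothetical, and both geomeans are assumed to have been computed over the vector corpus only):

```python
def speed_ratio(pdftract_geomean, pdfminer_geomean):
    """Ratio of pdftract time to pdfminer.six time; smaller means pdftract is faster."""
    return pdftract_geomean / pdfminer_geomean


def ten_x_gate_passes(pdftract_geomean, pdfminer_geomean, max_ratio=0.1):
    """Pass when the ratio is at most 0.1, i.e. pdftract is at least 10x faster."""
    return speed_ratio(pdftract_geomean, pdfminer_geomean) <= max_ratio


print(ten_x_gate_passes(8.0, 100.0))   # ratio 0.08, 12.5x faster: True
print(ten_x_gate_passes(15.0, 100.0))  # ratio 0.15, under 7x faster: False
```

Since the comparison is `<=`, a run that is exactly 10x faster still passes.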

Special Benchmark: pdftract-grep-1000

Runs pdftract grep "the" wikipedia-1000.pdf 5 times with warmup:

Tests search performance on a 1000-page document
Regression > 10% blocks the PR
Independent of the main corpus benchmarks

Output Schema

benchmark-results.json contains an array of objects:

[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]

Crashes are excluded from geometric mean calculations but are recorded for visibility.
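
One way to derive the per-tool geometric means from this file is sketched below. The helper name `geomean_by_tool` is hypothetical and the sample rows are illustrative values, not real measurements.

```python
import math
from collections import defaultdict


def geomean_by_tool(results):
    """Geometric mean of mean_ms per tool, skipping crashed runs."""
    times = defaultdict(list)
    for entry in results:
        if entry.get("crash"):
            continue  # crashed entries carry no timing fields
        times[entry["tool"]].append(entry["mean_ms"])
    # exp(mean(log x)) avoids overflow from multiplying many timings together
    return {
        tool: math.exp(sum(math.log(t) for t in ts) / len(ts))
        for tool, ts in times.items()
    }


results = [
    {"tool": "pdftract", "doc": "a.pdf", "mean_ms": 4.0, "crash": False},
    {"tool": "pdftract", "doc": "b.pdf", "mean_ms": 9.0, "crash": False},
    {"tool": "pdfminer", "doc": "encrypted.pdf", "crash": True},
]
print(geomean_by_tool(results))  # pdftract is about 6.0; pdfminer has no timed runs
```

A tool whose every run crashed simply does not appear in the result, so callers should treat a missing key as "no data" rather than as zero.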

Baseline Schema

benches/baselines/main.json stores the baseline together with the commit SHA it was measured at:

{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}

Noise Reduction

Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:

Hyperfine warmup: 2 warmup runs discarded before timing
Multiple runs: 5 timed runs per (tool, document) pair
Geometric mean: Computed across all documents for each tool
95% CI: Reported in PR comments to show variance
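
The warmup and run counts above map onto hyperfine's `--warmup` and `--runs` options. The sketch below shows one way a wrapper could build the invocation and convert hyperfine's JSON export (which reports seconds) into a row of the output schema (milliseconds). How run-benchmarks.sh actually invokes hyperfine is not shown in this document, so the helper names and the export file name are assumptions.

```python
import shlex


def hyperfine_argv(command, export_path, warmup=2, runs=5):
    """Build a hyperfine invocation: discard `warmup` runs, time `runs` runs, export JSON."""
    return [
        "hyperfine",
        "--warmup", str(warmup),
        "--runs", str(runs),
        "--export-json", export_path,
        command,
    ]


def to_result_row(tool, doc, hyperfine_export):
    """Convert one hyperfine export (seconds) into a benchmark-results.json row (ms)."""
    r = hyperfine_export["results"][0]
    return {
        "tool": tool,
        "doc": doc,
        "mean_ms": r["mean"] * 1000.0,
        "stddev_ms": r["stddev"] * 1000.0,
        "min_ms": r["min"] * 1000.0,
        "max_ms": r["max"] * 1000.0,
        "crash": False,
    }


# Example: the pdftract-grep-1000 command, with a hypothetical export file name.
argv = hyperfine_argv('pdftract grep "the" wikipedia-1000.pdf', "grep.json")
print(shlex.join(argv))
```

The argument list can be passed directly to `subprocess.run(argv, check=True)`; a non-zero exit would then be the natural place to record a crash row instead.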

Updating Baselines

When merging to main, the baseline can be refreshed:

Run benchmarks locally or extract from CI artifacts
Update benches/baselines/main.json with new geomeans
Commit and push to main
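
The second step could be scripted along these lines. This is a sketch only: `build_baseline` is a hypothetical helper, the field names follow the Baseline Schema section, and the geomean values shown are placeholders rather than measured results.

```python
import json
from datetime import datetime, timezone


def build_baseline(geomeans, commit_sha, corpus_size, notes="Baseline from main branch"):
    """Assemble a baseline record with the fields used in benches/baselines/main.json."""
    return {
        "commit_sha": commit_sha,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "pdftract_geomean": geomeans["pdftract"],
        "pdfminer_geomean": geomeans["pdfminer"],
        "pypdf_geomean": geomeans["pypdf"],
        "pdfplumber_geomean": geomeans["pdfplumber"],
        "corpus_size": corpus_size,
        "notes": notes,
    }


# Placeholder geomeans (ms); real values come from a benchmark run on main.
geomeans = {"pdftract": 10.0, "pdfminer": 100.0, "pypdf": 120.0, "pdfplumber": 150.0}
baseline = build_baseline(geomeans, commit_sha="abc123", corpus_size=50)
print(json.dumps(baseline, indent=2))
```

Writing the printed JSON to benches/baselines/main.json and committing it on main completes the refresh.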

Do NOT update baselines on PR branches; PRs must always compare against the baseline from main.

Troubleshooting

Hyperfine not found

apt-get install hyperfine

Python tools not found

pip install -r benches/competitors/requirements.txt

pdftract not found

Ensure the binary is built and in PATH, or use the CI artifact download.

High variance

Ensure CPU is not throttled (cpufreq-info)
Check for background processes consuming CPU
Run with more iterations (raise the --runs 5 value in the script)

References

Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
Quality Targets, Tier 4 (competitive bench hard gate)
Mission (speed differentiator)
