pdftract/benches/competitors/README.md

Competitive Benchmarks

This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.

Purpose

Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:

  1. pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
  2. Performance regressions are caught in CI before merge
  3. Competitive positioning is tracked over time

Corpus

The benchmark corpus consists of 50 representative PDFs:

  • 25 vector PDFs (corpus/vector/) - Text-based PDFs where pdftract should excel
  • 25 raster PDFs (corpus/raster/) - Scanned documents requiring OCR

All documents are committed to the repository; the corpus totals roughly 10 MB.

Tools

All competitor versions are pinned in requirements.txt to ensure baseline stability:

  • pdfminer.six==20231228
  • pypdf==4.2.0
  • pdfplumber==0.11.0

Updating any of these versions requires a deliberate PR and a manual baseline refresh.

Running Benchmarks Locally

Prerequisites

# Install hyperfine
apt-get install hyperfine

# Install competitor tools (run from the repository root)
pip install -r benches/competitors/requirements.txt

# Ensure pdftract is in PATH
which pdftract

Quick Run

cd benches/competitors
./run-benchmarks.sh

Custom Baseline

BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh

CI Integration

The bench-matrix step in .ci/argo-workflows/pdftract-ci.yaml runs these benchmarks on every PR:

  1. Installs hyperfine and competitor tools
  2. Downloads the pdftract binary artifact from build-matrix
  3. Runs the full benchmark suite
  4. Checks regression and 10x-faster gates
  5. Publishes benchmark-results.json as an artifact
  6. Posts a formatted summary as a PR comment

Gates

Regression Gate

Compares pdftract's geometric mean time against the baseline (benches/baselines/main.json):

  • Threshold: 10% regression
  • Baseline source: git show main:benches/baselines/main.json
  • Failure: PR is blocked if regression > 10%
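
The regression check reduces to a percentage comparison between two geometric means. A minimal sketch, assuming both values are in milliseconds; the function names are hypothetical and are not taken from run-benchmarks.sh:

```python
def regression_pct(current_geomean_ms, baseline_geomean_ms):
    """Percentage slowdown of the current run relative to the baseline."""
    return (current_geomean_ms - baseline_geomean_ms) / baseline_geomean_ms * 100.0


def regression_gate_passes(current_geomean_ms, baseline_geomean_ms, threshold_pct=10.0):
    """True when the slowdown does not exceed the threshold (10% per this README)."""
    return regression_pct(current_geomean_ms, baseline_geomean_ms) <= threshold_pct


# Against a 10.0 ms baseline: 10.5 ms is a 5% slowdown, 11.5 ms is a 15% slowdown.
print(regression_gate_passes(10.5, 10.0))
print(regression_gate_passes(11.5, 10.0))
```

A negative percentage means the PR is faster than the baseline, which always passes.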

10x-Faster Gate

Ensures pdftract maintains its speed advantage:

  • Threshold: pdftract_geomean / pdfminer_geomean <= 0.1
  • Scope: Vector PDFs only (where pdftract should excel)
  • Failure: PR is blocked if ratio > 0.1 (less than 10x faster)
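
The ratio test above can be sketched as follows. This assumes both geometric means were computed over the vector PDFs only; the function names are illustrative, not part of the actual tooling:

```python
def speed_ratio(pdftract_geomean_ms, pdfminer_geomean_ms):
    """Ratio of pdftract time to pdfminer.six time; smaller is better."""
    return pdftract_geomean_ms / pdfminer_geomean_ms


def ten_x_gate_passes(pdftract_geomean_ms, pdfminer_geomean_ms, max_ratio=0.1):
    """True when pdftract is at least 1 / max_ratio times faster (10x by default)."""
    return speed_ratio(pdftract_geomean_ms, pdfminer_geomean_ms) <= max_ratio


# 8 ms vs 100 ms is 12.5x faster; 12 ms vs 100 ms is only about 8.3x faster.
print(ten_x_gate_passes(8.0, 100.0))
print(ten_x_gate_passes(12.0, 100.0))
```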

Special Benchmark: pdftract-grep-1000

Runs pdftract grep "the" wikipedia-1000.pdf for 5 timed runs after warmup:

  • Tests search performance on a 1000-page document
  • Regression > 10% blocks the PR
  • Independent of the main corpus benchmarks
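
For illustration, a hyperfine invocation for this benchmark could be assembled as below. The warmup count of 2 is borrowed from the Noise Reduction section and the export filename is made up; neither is confirmed for this specific benchmark. The sketch only builds the argument list and does not run anything:

```python
import shlex


def grep_bench_argv(pdf="wikipedia-1000.pdf", runs=5, warmup=2,
                    export_path="grep-1000.json"):
    """Build a hyperfine argument list for the grep benchmark.

    warmup and export_path are assumptions made for this sketch.
    """
    # hyperfine takes the benchmarked command as a single shell string.
    command = shlex.join(["pdftract", "grep", "the", pdf])
    return [
        "hyperfine",
        "--warmup", str(warmup),
        "--runs", str(runs),
        "--export-json", export_path,
        command,
    ]


argv = grep_bench_argv()
print(argv)
```

Passing the list to subprocess.run(argv, check=True) would execute it, provided hyperfine and pdftract are both on PATH.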

Output Schema

benchmark-results.json contains an array of objects:

[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]

Crashes are excluded from geometric mean calculations but are recorded for visibility.
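
A minimal sketch of how the per-tool geometric mean might be derived from records in this schema, skipping crashed entries. The helper name and the sample data are hypothetical:

```python
import math
from collections import defaultdict


def geomean_by_tool(results):
    """Geometric mean of mean_ms per tool, ignoring crashed entries."""
    logs = defaultdict(list)
    for entry in results:
        if entry.get("crash"):
            continue  # crashed entries carry no timing fields
        logs[entry["tool"]].append(math.log(entry["mean_ms"]))
    # exp(mean(log(x))) equals the geometric mean and avoids overflow on long lists.
    return {tool: math.exp(sum(vals) / len(vals)) for tool, vals in logs.items()}


sample = [
    {"tool": "pdftract", "doc": "a.pdf", "mean_ms": 4.0, "crash": False},
    {"tool": "pdftract", "doc": "b.pdf", "mean_ms": 16.0, "crash": False},
    {"tool": "pdfminer", "doc": "a.pdf", "mean_ms": 100.0, "crash": False},
    {"tool": "pdfminer", "doc": "encrypted.pdf", "crash": True},
]
geomeans = geomean_by_tool(sample)
print(geomeans)
```

In practice the input would come from json.load on benchmark-results.json. A tool that crashes on every document simply has no entry in the result.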

Baseline Schema

benches/baselines/main.json stores the baseline together with the commit SHA it was measured at:

{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}

Noise Reduction

Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:

  1. Hyperfine warmup: 2 warmup runs discarded before timing
  2. Multiple runs: 5 timed runs per (tool, document) pair
  3. Geometric mean: Computed across all documents for each tool
  4. 95% CI: Reported in PR comments to show variance
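
This README does not say how the 95% CI is computed. One common choice, shown here purely as an assumption, is a Student's t interval over the 5 timed runs of a single (tool, document) pair. The function name is hypothetical:

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5).
T_975_DF4 = 2.776


def ci95_five_runs(times_ms):
    """95% confidence interval for the mean of exactly five timings."""
    if len(times_ms) != 5:
        raise ValueError("the hard-coded t value is only valid for 5 samples")
    mean = statistics.mean(times_ms)
    # Half-width = t * sample standard deviation / sqrt(n)
    half_width = T_975_DF4 * statistics.stdev(times_ms) / math.sqrt(len(times_ms))
    return mean - half_width, mean + half_width


low, high = ci95_five_runs([8.0, 8.5, 9.0, 8.5, 8.5])
print(f"{low:.2f} ms .. {high:.2f} ms")
```

With only five samples the interval is wide, which is one reason the gates operate on geometric means across the corpus rather than on single documents.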

Updating Baselines

When merging to main, the baseline can be refreshed:

  1. Run benchmarks locally or extract from CI artifacts
  2. Update benches/baselines/main.json with new geomeans
  3. Commit and push to main

Do NOT update baselines on PR branches; PRs should always compare against main.
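
The refresh steps above could be scripted roughly as follows. This is a sketch under several assumptions: the tool field in the results matches the prefix of each *_geomean key, corpus_size counts distinct documents, and geomeans are taken over all non-crashed entries. None of these is confirmed by this README, and build_baseline is a hypothetical name:

```python
import math
from datetime import datetime, timezone


def build_baseline(results, commit_sha, notes="Baseline from main branch"):
    """Assemble a baseline dict in the Baseline Schema layout from result records."""
    logs_by_tool = {}
    docs = set()
    for entry in results:
        docs.add(entry["doc"])
        if not entry.get("crash"):
            logs_by_tool.setdefault(entry["tool"], []).append(math.log(entry["mean_ms"]))

    baseline = {
        "commit_sha": commit_sha,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    for tool, logs in sorted(logs_by_tool.items()):
        baseline[f"{tool}_geomean"] = math.exp(sum(logs) / len(logs))
    baseline["corpus_size"] = len(docs)  # assumption: number of distinct documents
    baseline["notes"] = notes
    return baseline


records = [
    {"tool": "pdftract", "doc": "a.pdf", "mean_ms": 4.0, "crash": False},
    {"tool": "pdftract", "doc": "b.pdf", "mean_ms": 16.0, "crash": False},
    {"tool": "pdfminer", "doc": "a.pdf", "mean_ms": 100.0, "crash": False},
    {"tool": "pdfminer", "doc": "b.pdf", "crash": True},
]
baseline = build_baseline(records, commit_sha="abc123")
print(baseline)
```

Writing the result with json.dump(baseline, f, indent=2) would produce a file in the layout shown under Baseline Schema.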

Troubleshooting

Hyperfine not found

apt-get install hyperfine

Python tools not found

pip install -r benches/competitors/requirements.txt

pdftract not found

Ensure the binary is built and in PATH, or use the CI artifact download.

High variance

  • Ensure CPU is not throttled (cpufreq-info)
  • Check for background processes consuming CPU
  • Run with more iterations (raise the --runs 5 value in run-benchmarks.sh)

References

  • Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
  • Quality Targets, Tier 4 (competitive bench hard gate)
  • Mission (speed differentiator)