Competitive Benchmarks

This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.

Purpose

Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:

  1. pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
  2. Performance regressions are caught in CI before merge
  3. Competitive positioning is tracked over time

Corpus

The benchmark corpus consists of 50 representative PDFs:

  • 25 vector PDFs (corpus/vector/) - Text-based PDFs where pdftract should excel
  • 25 raster PDFs (corpus/raster/) - Scanned documents requiring OCR

All documents are committed to the repository; the corpus totals ~10 MB.

Tools

All competitor versions are pinned in requirements.txt to ensure baseline stability:

  • pdfminer.six==20231228
  • pypdf==4.2.0
  • pdfplumber==0.11.0

Updating any of these versions requires a deliberate PR with a manual baseline refresh.

Running Benchmarks Locally

Prerequisites

# Install hyperfine
apt-get install hyperfine

# Install competitor tools (run from benches/competitors)
pip install -r requirements.txt

# Ensure pdftract is in PATH
which pdftract

Quick Run

cd benches/competitors
./run-benchmarks.sh

Custom Baseline

BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh

CI Integration

The bench-matrix step in .ci/argo-workflows/pdftract-ci.yaml runs these benchmarks on every PR:

  1. Installs hyperfine and competitor tools
  2. Downloads the pdftract binary artifact from build-matrix
  3. Runs the full benchmark suite
  4. Checks regression and 10x-faster gates
  5. Publishes benchmark-results.json as an artifact
  6. Posts a formatted summary as a PR comment

Gates

Regression Gate

Compares pdftract's geometric mean time against the baseline (benches/baselines/main.json):

  • Threshold: 10% regression
  • Baseline source: git show main:benches/baselines/main.json
  • Failure: PR is blocked if regression > 10%
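
The regression check can be sketched as follows. This is a minimal illustration, assuming the gate compares the current pdftract geomean against the pdftract_geomean key of the baseline file; the function names are hypothetical and are not taken from run-benchmarks.sh.

```python
import json


def regression_pct(current_ms: float, baseline_ms: float) -> float:
    """Return the slowdown relative to the baseline in percent (negative = faster)."""
    if baseline_ms <= 0:
        raise ValueError("baseline geomean must be positive")
    return (current_ms / baseline_ms - 1.0) * 100.0


def regression_gate_passes(current_ms: float, baseline_json: str,
                           threshold_pct: float = 10.0) -> bool:
    """Hypothetical helper: pass unless the slowdown exceeds the threshold."""
    baseline_ms = json.loads(baseline_json)["pdftract_geomean"]
    return regression_pct(current_ms, baseline_ms) <= threshold_pct


# Baseline text as it might come from `git show main:benches/baselines/main.json`.
baseline_text = '{"pdftract_geomean": 10.0}'
print(regression_gate_passes(10.5, baseline_text))  # about 5% slower: passes
print(regression_gate_passes(12.0, baseline_text))  # about 20% slower: fails
```

Values very close to the 10% boundary are subject to floating-point rounding, so a real gate may want a small tolerance.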

10x-Faster Gate

Ensures pdftract maintains its speed advantage:

  • Threshold: pdftract_geomean / pdfminer_geomean <= 0.1
  • Scope: Vector PDFs only (where pdftract should excel)
  • Failure: PR is blocked if ratio > 0.1 (less than 10x faster)
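
A ratio of 0.1 corresponds to a 10x speedup, so the gate can be read either way. The sketch below assumes both geomeans were already computed over the vector PDFs only; speedup_gate is a hypothetical name used for illustration.

```python
def speedup_gate(pdftract_geomean_ms: float, pdfminer_geomean_ms: float,
                 max_ratio: float = 0.1) -> tuple[bool, float]:
    """Return (passes, speedup) for geomeans measured on vector PDFs."""
    if pdftract_geomean_ms <= 0 or pdfminer_geomean_ms <= 0:
        raise ValueError("geomeans must be positive")
    ratio = pdftract_geomean_ms / pdfminer_geomean_ms
    speedup = pdfminer_geomean_ms / pdftract_geomean_ms
    return ratio <= max_ratio, speedup


print(speedup_gate(8.0, 100.0))   # ratio 0.08: gate passes, 12.5x faster
print(speedup_gate(20.0, 100.0))  # ratio 0.20: gate fails, only 5x faster
```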

Special Benchmark: pdftract-grep-1000

Runs pdftract grep "the" wikipedia-1000.pdf 5 times with warmup:

  • Tests search performance on a 1000-page document
  • Regression > 10% blocks the PR
  • Independent of the main corpus benchmarks

Output Schema

benchmark-results.json contains an array of objects, one per (tool, document) pair:

[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]

Crashes are excluded from geometric mean calculations but are recorded for visibility.
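
One way to compute per-tool geometric means from this array while skipping crashed runs is sketched below. It assumes only the fields shown in the schema above; tool_geomeans is a hypothetical helper, not the code used by the runner script.

```python
import math
from collections import defaultdict


def tool_geomeans(results: list) -> tuple:
    """Return ({tool: geomean_ms}, {tool: crash_count}) for a results array."""
    logs = defaultdict(list)
    crashes = defaultdict(int)
    for row in results:
        if row.get("crash", False):
            # Crashed runs carry no timing; count them but keep them out of the mean.
            crashes[row["tool"]] += 1
            continue
        logs[row["tool"]].append(math.log(row["mean_ms"]))
    # Geometric mean = exp of the arithmetic mean of the logs.
    geomeans = {tool: math.exp(sum(v) / len(v)) for tool, v in logs.items()}
    return geomeans, dict(crashes)


sample = [
    {"tool": "pdftract", "doc": "misc-01.pdf", "mean_ms": 8.0, "crash": False},
    {"tool": "pdftract", "doc": "misc-02.pdf", "mean_ms": 2.0, "crash": False},
    {"tool": "pdfminer", "doc": "misc-01.pdf", "mean_ms": 100.0, "crash": False},
    {"tool": "pdfminer", "doc": "encrypted.pdf", "crash": True},
]
geomeans, crashes = tool_geomeans(sample)
print(geomeans, crashes)
```

A tool that crashes on every document gets no entry in the geomean mapping, so callers should check for a missing key before applying a gate.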

Baseline Schema

benches/baselines/main.json stores the baseline recorded for a specific commit SHA:

{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}

Noise Reduction

Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:

  1. Hyperfine warmup: 2 warmup runs discarded before timing
  2. Multiple runs: 5 timed runs per (tool, document) pair
  3. Geometric mean: Computed across all documents for each tool
  4. 95% CI: Reported in PR comments to show variance
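
The warmup and run counts above map onto hyperfine's --warmup and --runs options. The sketch below builds such an invocation and converts hyperfine's JSON export (which reports seconds) into a row of the Output Schema. Whether the runner script uses --export-json in this way is an assumption, and both helper names are hypothetical.

```python
import json


def hyperfine_argv(command: str, export_path: str,
                   warmup: int = 2, runs: int = 5) -> list[str]:
    """Build a hyperfine command line with discarded warmups and timed runs."""
    return ["hyperfine", "--warmup", str(warmup), "--runs", str(runs),
            "--export-json", export_path, command]


def to_result_row(tool: str, doc: str, export_json: str) -> dict:
    """Convert one hyperfine result (seconds) to an Output Schema row (ms)."""
    stats = json.loads(export_json)["results"][0]
    return {
        "tool": tool,
        "doc": doc,
        "mean_ms": stats["mean"] * 1000.0,
        "stddev_ms": stats["stddev"] * 1000.0,
        "min_ms": stats["min"] * 1000.0,
        "max_ms": stats["max"] * 1000.0,
        "crash": False,
    }


argv = hyperfine_argv('pdftract grep "the" wikipedia-1000.pdf', "grep.json")
print(" ".join(argv[:-1]))

# Illustrative export text, not a real measurement.
export = '{"results": [{"mean": 0.0085, "stddev": 0.0003, "min": 0.0081, "max": 0.0092}]}'
print(to_result_row("pdftract", "wikipedia-1000.pdf", export))
```

The argument list can be handed to subprocess.run(argv, check=True); a non-zero exit could then be recorded as a row with "crash": true.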

Updating Baselines

When merging to main, the baseline can be refreshed:

  1. Run benchmarks locally or extract from CI artifacts
  2. Update benches/baselines/main.json with new geomeans
  3. Commit and push to main

Do NOT update baselines on PR branches; PRs must always be compared against the baseline from main.
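
Step 2 can be scripted. The sketch below assembles a baseline object with the keys shown in the Baseline Schema section; build_baseline and its parameters are hypothetical, and the geomean values are placeholders rather than measurements.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_baseline(geomeans: dict, commit_sha: str, corpus_size: int,
                   notes: str = "Baseline from main branch",
                   timestamp: Optional[str] = None) -> dict:
    """Assemble a baseline dict in the layout of benches/baselines/main.json."""
    if timestamp is None:
        # UTC timestamp in the same format as the schema example.
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    baseline = {"commit_sha": commit_sha, "timestamp": timestamp}
    for tool in ("pdftract", "pdfminer", "pypdf", "pdfplumber"):
        baseline[f"{tool}_geomean"] = geomeans[tool]  # KeyError if a tool is missing
    baseline["corpus_size"] = corpus_size
    baseline["notes"] = notes
    return baseline


placeholder = {"pdftract": 10.0, "pdfminer": 100.0, "pypdf": 120.0, "pdfplumber": 150.0}
baseline = build_baseline(placeholder, "abc123", 50, timestamp="2024-01-01T00:00:00Z")
print(json.dumps(baseline, indent=2))
```

Failing loudly on a missing tool keeps a partial benchmark run from silently overwriting a complete baseline.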

Troubleshooting

Hyperfine not found

apt-get install hyperfine

Python tools not found

pip install -r benches/competitors/requirements.txt

pdftract not found

Ensure the binary is built and in PATH, or use the CI artifact download.

High variance

  • Ensure CPU is not throttled (cpufreq-info)
  • Check for background processes consuming CPU
  • Run with more iterations (raise the --runs 5 value in the script)

References

  • Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
  • Quality Targets, Tier 4 (competitive bench hard gate)
  • Mission (speed differentiator)