
jedarden 398ab747fc fix(pdftract-60h): fix bugs in benchmark runner script - Add extraction of pdftract_geomean from tool_geomeans array for regression gate - Fix vector geomean calculation to properly pass bash array values to Python The benchmark infrastructure was complete but had two bugs: 1. $pdftract_geomean was used but never set (line 308) 2. Vector geomean calculation had broken Python code for array expansion These fixes ensure the regression and 10x-faster gates will work correctly once the pdftract binary with extract/grep subcommands is available. Refs pdftract-60h		2026-05-18 01:29:41 -04:00
corpus	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
README.md	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
requirements.txt	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
run-benchmarks.sh	fix(pdftract-60h): fix bugs in benchmark runner script	2026-05-18 01:29:41 -04:00
run-pdfminer.sh	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
run-pdfplumber.sh	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
run-pdftract.sh	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
run-pypdf.sh	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00

README.md

Competitive Benchmarks

This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.

Purpose

Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:

pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
Performance regressions are caught in CI before merge
Competitive positioning is tracked over time

Corpus

The benchmark corpus consists of 50 representative PDFs:

25 vector PDFs (corpus/vector/) - Text-based PDFs where pdftract should excel
25 raster PDFs (corpus/raster/) - Scanned documents requiring OCR

All documents are committed to the repository; the corpus totals about 10 MB.

Tools

All competitor versions are pinned in requirements.txt to ensure baseline stability:

pdfminer.six==20231228
pypdf==4.2.0
pdfplumber==0.11.0

Updating any of these versions requires a deliberate PR with a manual baseline refresh.

Running Benchmarks Locally

Prerequisites

# Install hyperfine
apt-get install hyperfine

# Install competitor tools
pip install -r requirements.txt

# Ensure pdftract is in PATH
which pdftract

Quick Run

cd benches/competitors
./run-benchmarks.sh

Custom Baseline

BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh

CI Integration

The bench-matrix step in .ci/argo-workflows/pdftract-ci.yaml runs these benchmarks on every PR:

Installs hyperfine and competitor tools
Downloads the pdftract binary artifact from build-matrix
Runs the full benchmark suite
Checks regression and 10x-faster gates
Publishes benchmark-results.json as an artifact
Posts a formatted summary as a PR comment

Gates

Regression Gate

Compares pdftract's geometric mean time against the baseline (benches/baselines/main.json):

Threshold: 10% regression
Baseline source: git show main:benches/baselines/main.json
Failure: PR is blocked if regression > 10%
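
The check above can be sketched in a few lines of Python. This is a minimal illustration, not the code in run-benchmarks.sh: it assumes "regression" means the relative increase of the current geomean over the baseline geomean, and the function names are hypothetical.

```python
def regression_pct(current_geomean: float, baseline_geomean: float) -> float:
    """Return the slowdown relative to the baseline, in percent (negative = faster)."""
    if baseline_geomean <= 0:
        raise ValueError("baseline geomean must be positive")
    return (current_geomean - baseline_geomean) / baseline_geomean * 100.0


def passes_regression_gate(current_geomean: float,
                           baseline_geomean: float,
                           threshold_pct: float = 10.0) -> bool:
    """True when the slowdown does not exceed the threshold (assumed definition)."""
    return regression_pct(current_geomean, baseline_geomean) <= threshold_pct


# Illustrative numbers only; 10.0 mirrors the sample baseline file.
baseline_geomean = 10.0
for current in (9.5, 10.8, 11.5):
    pct = regression_pct(current, baseline_geomean)
    verdict = "pass" if passes_regression_gate(current, baseline_geomean) else "BLOCK"
    print(f"{current:.1f} ms vs {baseline_geomean:.1f} ms: {pct:+.1f}% -> {verdict}")
```

A run that is faster than the baseline yields a negative percentage and always passes.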

10x-Faster Gate

Ensures pdftract maintains its speed advantage:

Threshold: pdftract_geomean / pdfminer_geomean <= 0.1
Scope: Vector PDFs only (where pdftract should excel)
Failure: PR is blocked if ratio > 0.1 (less than 10x faster)
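
As a rough sketch of this gate, the ratio can be computed from per-document mean times restricted to the vector corpus. The helper names and the sample timings below are hypothetical; only the `<= 0.1` threshold comes from this README.

```python
import math


def geomean(values):
    """Geometric mean computed as exp(mean(log(x)))."""
    values = list(values)
    if not values:
        raise ValueError("need at least one timing")
    if any(v <= 0 for v in values):
        raise ValueError("timings must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ten_x_gate(pdftract_ms, pdfminer_ms, max_ratio=0.1):
    """Return (passed, ratio) where ratio = pdftract_geomean / pdfminer_geomean."""
    ratio = geomean(pdftract_ms) / geomean(pdfminer_ms)
    return ratio <= max_ratio, ratio


# Hypothetical per-document means (ms) for three vector PDFs.
pdftract_ms = [8.5, 12.0, 6.0]
pdfminer_ms = [110.0, 150.0, 90.0]
ok, ratio = ten_x_gate(pdftract_ms, pdfminer_ms)
print(f"ratio={ratio:.3f} speedup={1 / ratio:.1f}x gate={'pass' if ok else 'BLOCK'}")
```

A ratio of 0.1 corresponds to exactly a 10x speedup, so smaller ratios are better.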

Special Benchmark: pdftract-grep-1000

Runs pdftract grep "the" wikipedia-1000.pdf 5 times with warmup:

Tests search performance on a 1000-page document
Regression > 10% blocks the PR
Independent of the main corpus benchmarks
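
For illustration, the hyperfine invocation for this benchmark might be assembled as below. This is a sketch under assumptions: the warmup count of 2 is borrowed from the Noise Reduction section, the export file name and the `tool` label are hypothetical, and the document path is used exactly as written above. Hyperfine reports times in seconds, so the converter scales them to milliseconds to match the Output Schema.

```python
import json
import shlex
import subprocess


def hyperfine_command(export_path, warmup=2, runs=5):
    """Build the argument list; the benchmarked command is passed as one string."""
    benchmarked = shlex.join(["pdftract", "grep", "the", "wikipedia-1000.pdf"])
    return ["hyperfine", "--warmup", str(warmup), "--runs", str(runs),
            "--export-json", export_path, benchmarked]


def to_record(hyperfine_result, tool, doc):
    """Convert one hyperfine result entry (seconds) into a ms-based record."""
    return {
        "tool": tool,
        "doc": doc,
        "mean_ms": hyperfine_result["mean"] * 1000.0,
        "stddev_ms": hyperfine_result["stddev"] * 1000.0,
        "min_ms": hyperfine_result["min"] * 1000.0,
        "max_ms": hyperfine_result["max"] * 1000.0,
        "crash": False,
    }


def run_grep_benchmark(export_path="grep-1000.json"):
    """Run hyperfine and read back its JSON export (not called in this sketch)."""
    subprocess.run(hyperfine_command(export_path), check=True)
    with open(export_path) as fh:
        result = json.load(fh)["results"][0]
    return to_record(result, tool="pdftract-grep-1000", doc="wikipedia-1000.pdf")


# Print the command rather than executing it, so the sketch runs anywhere.
print(shlex.join(hyperfine_command("grep-1000.json")))
```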

Output Schema

benchmark-results.json contains an array of objects:

[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]

Crashes are excluded from geometric mean calculations but are recorded for visibility.
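
A minimal sketch of how a consumer might summarize this file: timings from non-crashed records feed the geometric mean, while crashes are only counted. The `summarize` helper is hypothetical, and a tool whose every document crashed is reported with no geomean, which is a design choice of this sketch rather than documented behavior.

```python
import json
from collections import defaultdict
from statistics import geometric_mean  # available since Python 3.8

SAMPLE = """
[
  {"tool": "pdftract", "doc": "misc-01.pdf", "mean_ms": 8.5, "stddev_ms": 0.3,
   "min_ms": 8.1, "max_ms": 9.2, "crash": false},
  {"tool": "pdfminer", "doc": "encrypted.pdf", "crash": true}
]
"""


def summarize(results):
    """Return {tool: {"geomean_ms": float or None, "crashes": int}}."""
    times = defaultdict(list)
    crashes = defaultdict(int)
    for rec in results:
        if rec.get("crash"):
            crashes[rec["tool"]] += 1      # recorded for visibility only
        else:
            times[rec["tool"]].append(rec["mean_ms"])
    tools = set(times) | set(crashes)
    return {
        tool: {
            "geomean_ms": geometric_mean(times[tool]) if times[tool] else None,
            "crashes": crashes[tool],
        }
        for tool in tools
    }


summary = summarize(json.loads(SAMPLE))
for tool in sorted(summary):
    print(tool, summary[tool])
```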

Baseline Schema

benches/baselines/main.json stores the baseline along with the commit SHA it was measured at:

{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}

Noise Reduction

Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:

Hyperfine warmup: 2 warmup runs discarded before timing
Multiple runs: 5 timed runs per (tool, document) pair
Geometric mean: Computed across all documents for each tool
95% confidence interval: Reported in PR comments to show variance
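
This README does not say how the 95% interval is computed. One plausible approach, shown here purely as an assumption, is a Student's t interval on the mean of the 5 timed runs, treating `stddev_ms` as a sample standard deviation. The function name and the small lookup table are illustrative.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def ci95(mean_ms, stddev_ms, runs=5):
    """Return (low, high) for the mean, assuming a t distribution with runs - 1 df."""
    df = runs - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no critical value tabulated for {runs} runs")
    half_width = T_CRIT_95[df] * stddev_ms / math.sqrt(runs)
    return mean_ms - half_width, mean_ms + half_width


# Values taken from the sample record in the Output Schema section.
low, high = ci95(mean_ms=8.5, stddev_ms=0.3, runs=5)
print(f"8.5 ms, 95% CI [{low:.2f}, {high:.2f}] ms")
```

With only 5 runs the critical value (2.776) is noticeably larger than the normal-approximation 1.96, so the interval is wider than a naive estimate would suggest.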

Updating Baselines

When merging to main, the baseline can be refreshed:

Run benchmarks locally or extract from CI artifacts
Update benches/baselines/main.json with new geomeans
Commit and push to main
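
The second step could be scripted along these lines. This is a hedged sketch: `build_baseline` is a hypothetical helper, the geomean values are placeholders, and the field names are taken from the Baseline Schema section. It prints the JSON instead of writing the file, so redirecting the output into benches/baselines/main.json remains a deliberate manual step.

```python
import json
from datetime import datetime, timezone

TOOLS = ("pdftract", "pdfminer", "pypdf", "pdfplumber")


def build_baseline(geomeans, commit_sha, corpus_size,
                   notes="Baseline from main branch", now=None):
    """Assemble a baseline dict in the field order of the Baseline Schema."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    baseline = {
        "commit_sha": commit_sha,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    for tool in TOOLS:
        baseline[f"{tool}_geomean"] = geomeans[tool]  # KeyError if a tool is missing
    baseline["corpus_size"] = corpus_size
    baseline["notes"] = notes
    return baseline


# Placeholder values; real geomeans come from a benchmark run on main.
geomeans = {"pdftract": 10.0, "pdfminer": 100.0, "pypdf": 120.0, "pdfplumber": 150.0}
print(json.dumps(build_baseline(geomeans, "<commit sha>", corpus_size=50), indent=2))
```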

Do NOT update baselines for PR branches - they should always compare against main.

Troubleshooting

Hyperfine not found

apt-get install hyperfine

Python tools not found

pip install -r benches/competitors/requirements.txt

pdftract not found

Ensure the binary is built and in PATH, or use the CI artifact download.

High variance

Ensure the CPU is not throttled (check with cpufreq-info)
Check for background processes consuming CPU
Run with more iterations (raise the --runs 5 value in the script)

References

Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
Quality Targets, Tier 4 (competitive bench hard gate)
Mission (speed differentiator)