Competitive Benchmarks
This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.
Purpose
Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:
- pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
- Performance regressions are caught in CI before merge
- Competitive positioning is tracked over time
Corpus
The benchmark corpus consists of 50 representative PDFs:
- 25 vector PDFs (corpus/vector/): text-based PDFs where pdftract should excel
- 25 raster PDFs (corpus/raster/): scanned documents requiring OCR
All documents are committed to the repository; the corpus totals about 10 MB.
Tools
All competitor versions are pinned in requirements.txt to ensure baseline stability:
- pdfminer.six==20231228
- pypdf==4.2.0
- pdfplumber==0.11.0
Updates to these versions require a deliberate PR with manual baseline refresh.
Running Benchmarks Locally
Prerequisites
# Install hyperfine
apt-get install hyperfine
# Install competitor tools
pip install -r requirements.txt
# Ensure pdftract is in PATH
which pdftract
Quick Run
cd benches/competitors
./run-benchmarks.sh
Custom Baseline
BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh
CI Integration
The bench-matrix step in .ci/argo-workflows/pdftract-ci.yaml runs these benchmarks on every PR:
- Installs hyperfine and competitor tools
- Downloads the pdftract binary artifact from build-matrix
- Runs the full benchmark suite
- Checks regression and 10x-faster gates
- Publishes benchmark-results.json as an artifact
- Posts a formatted summary as a PR comment
Gates
Regression Gate
Compares pdftract's geometric mean time against the baseline (benches/baselines/main.json):
- Threshold: 10% regression
- Baseline source: git show main:benches/baselines/main.json
- Failure: PR is blocked if regression > 10%
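The regression check described above can be sketched as follows. This is a minimal illustration, not the code in run-benchmarks.sh: the function names `regression_pct` and `passes_regression_gate` are hypothetical, and the sketch assumes "regression" means the percentage increase of the current pdftract geomean over the baseline's `pdftract_geomean`.

```python
def regression_pct(current_geomean: float, baseline_geomean: float) -> float:
    """Percent change relative to the baseline; positive means slower."""
    if baseline_geomean <= 0:
        raise ValueError("baseline geomean must be positive")
    return (current_geomean - baseline_geomean) / baseline_geomean * 100.0


def passes_regression_gate(current_geomean: float, baseline: dict,
                           threshold_pct: float = 10.0) -> bool:
    """True when the slowdown does not exceed the threshold (10% per this README)."""
    pct = regression_pct(current_geomean, baseline["pdftract_geomean"])
    return pct <= threshold_pct


# Hypothetical numbers, shaped like benches/baselines/main.json.
baseline = {"pdftract_geomean": 10.0}
print(passes_regression_gate(10.5, baseline))  # about 5% slower than baseline
print(passes_regression_gate(11.5, baseline))  # about 15% slower than baseline
```

A speedup yields a negative percentage and always passes under this definition.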
10x-Faster Gate
Ensures pdftract maintains its speed advantage:
- Threshold: pdftract_geomean / pdfminer_geomean <= 0.1
- Scope: Vector PDFs only (where pdftract should excel)
- Failure: PR is blocked if ratio > 0.1 (less than 10x faster)
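As an illustration of the ratio check, the sketch below computes geometric means over per-document times and compares their ratio to 0.1. The helper names (`geomean`, `speed_ratio`, `passes_10x_gate`) and the sample timings are assumptions made for this example; the actual script may compute the values differently.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed via logs to avoid overflow."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speed_ratio(pdftract_ms, pdfminer_ms):
    """pdftract_geomean / pdfminer_geomean over the vector corpus."""
    return geomean(pdftract_ms) / geomean(pdfminer_ms)


def passes_10x_gate(pdftract_ms, pdfminer_ms, max_ratio=0.1):
    """True when pdftract is at least 10x faster (ratio <= 0.1)."""
    return speed_ratio(pdftract_ms, pdfminer_ms) <= max_ratio


# Hypothetical per-document mean times in milliseconds (vector PDFs only).
print(passes_10x_gate([8.0, 10.0], [100.0, 120.0]))
print(passes_10x_gate([20.0, 30.0], [100.0, 120.0]))
```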
Special Benchmark: pdftract-grep-1000
Runs pdftract grep "the" wikipedia-1000.pdf 5 times with warmup:
- Tests search performance on a 1000-page document
- Regression > 10% blocks the PR
- Independent of the main corpus benchmarks
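One way to express this benchmark as a hyperfine invocation is sketched below. The helper name `grep_benchmark_command`, the warmup count of 2 (borrowed from the Noise Reduction section), and the export file name are assumptions for illustration; only `--warmup`, `--runs`, and `--export-json` are standard hyperfine options.

```python
import shlex


def grep_benchmark_command(pdf="wikipedia-1000.pdf", pattern="the",
                           runs=5, warmup=2, export_json="grep-1000.json"):
    """Build the argv list for a hyperfine run of `pdftract grep`."""
    # hyperfine takes the benchmarked command as a single shell string.
    target = shlex.join(["pdftract", "grep", pattern, pdf])
    return [
        "hyperfine",
        "--warmup", str(warmup),       # untimed runs executed before measuring
        "--runs", str(runs),           # number of timed runs
        "--export-json", export_json,  # assumed output file name
        target,
    ]


print(shlex.join(grep_benchmark_command()))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where both hyperfine and pdftract are installed.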
Output Schema
benchmark-results.json contains an array of objects:
[
{
"tool": "pdftract",
"doc": "misc-01.pdf",
"mean_ms": 8.5,
"stddev_ms": 0.3,
"min_ms": 8.1,
"max_ms": 9.2,
"crash": false
},
{
"tool": "pdfminer",
"doc": "encrypted.pdf",
"crash": true
}
]
Crashes are excluded from geometric mean calculations but are recorded for visibility.
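A minimal sketch of that rule, assuming records shaped like the schema above: crashed entries are skipped, and a geometric mean of `mean_ms` is computed per tool. The function name `geomeans_by_tool` is hypothetical and not taken from the benchmark scripts.

```python
import math
from collections import defaultdict


def geomeans_by_tool(results):
    """Map each tool to the geometric mean of mean_ms over its non-crashed runs."""
    times = defaultdict(list)
    for record in results:
        if record.get("crash"):
            continue  # crashed runs carry no timing and are left out
        times[record["tool"]].append(record["mean_ms"])
    return {
        tool: math.exp(sum(math.log(t) for t in ts) / len(ts))
        for tool, ts in times.items()
    }


results = [
    {"tool": "pdftract", "doc": "misc-01.pdf", "mean_ms": 8.5, "crash": False},
    {"tool": "pdfminer", "doc": "encrypted.pdf", "crash": True},
]
print(geomeans_by_tool(results))
```

A tool whose every run crashed does not appear in the result at all, so callers need to handle a missing key. Note also that excluding crashes means tools can end up averaged over different document sets.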
Baseline Schema
benches/baselines/main.json stores the baseline together with the commit SHA it was measured at:
{
"commit_sha": "abc123...",
"timestamp": "2024-01-01T00:00:00Z",
"pdftract_geomean": 10.0,
"pdfminer_geomean": 100.0,
"pypdf_geomean": 120.0,
"pdfplumber_geomean": 150.0,
"corpus_size": 50,
"notes": "Baseline from main branch"
}
Noise Reduction
Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:
- Hyperfine warmup: 2 warmup runs discarded before timing
- Multiple runs: 5 timed runs per (tool, document) pair
- Geometric mean: Computed across all documents for each tool
- 95% CI: Reported in PR comments to show variance
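How the 95% CI is computed is not stated here. One plausible approach, sketched under the assumption that `stddev_ms` is the sample standard deviation over the 5 timed runs, is a Student's t interval around `mean_ms`. The function name `ci95_ms` is hypothetical; the critical values are standard two-sided 95% values for the t-distribution.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def ci95_ms(mean_ms, stddev_ms, runs=5):
    """Return (low, high) bounds of a 95% confidence interval for the mean."""
    df = runs - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no critical value tabulated for {runs} runs")
    half_width = T_CRIT_95[df] * stddev_ms / math.sqrt(runs)
    return (mean_ms - half_width, mean_ms + half_width)


low, high = ci95_ms(8.5, 0.3)  # values from the sample record in Output Schema
print(f"{low:.2f} ms to {high:.2f} ms")
```

With only 5 runs the interval is fairly wide, which is one reason increasing the run count helps on noisy machines.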
Updating Baselines
When merging to main, the baseline can be refreshed:
- Run benchmarks locally or extract from CI artifacts
- Update benches/baselines/main.json with new geomeans
- Commit and push to main
Do NOT update baselines for PR branches - they should always compare against main.
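The refresh step could be scripted along these lines. This is a sketch only: `build_baseline` is a hypothetical helper, and it assumes the per-tool geomeans have already been computed and are keyed by the tool names used in benchmark-results.json.

```python
import json
from datetime import datetime, timezone


def build_baseline(geomeans, commit_sha, corpus_size, notes="Baseline from main branch"):
    """Assemble a record with the fields shown in the Baseline Schema section."""
    return {
        "commit_sha": commit_sha,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "pdftract_geomean": geomeans["pdftract"],
        "pdfminer_geomean": geomeans["pdfminer"],
        "pypdf_geomean": geomeans["pypdf"],
        "pdfplumber_geomean": geomeans["pdfplumber"],
        "corpus_size": corpus_size,
        "notes": notes,
    }


# Hypothetical geomeans in milliseconds; the commit SHA is a placeholder.
baseline = build_baseline(
    {"pdftract": 10.0, "pdfminer": 100.0, "pypdf": 120.0, "pdfplumber": 150.0},
    commit_sha="abc123",
    corpus_size=50,
)
print(json.dumps(baseline, indent=2))
```

Writing the printed JSON to benches/baselines/main.json and committing it on main completes the update.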
Troubleshooting
Hyperfine not found
apt-get install hyperfine
Python tools not found
pip install -r benches/competitors/requirements.txt
Pdftract not found
Ensure the binary is built and in PATH, or use the CI artifact download.
High variance
- Ensure CPU is not throttled (cpufreq-info)
- Check for background processes consuming CPU
- Run with more iterations (modify --runs 5 in script)
References
- Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
- Quality Targets, Tier 4 (competitive bench hard gate)
- Mission (speed differentiator)