pdftract/benches/competitors/README.md


# Competitive Benchmarks
This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.
## Purpose
Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:
1. pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
2. Performance regressions are caught in CI before merge
3. Competitive positioning is tracked over time
## Corpus
The benchmark corpus consists of 50 representative PDFs:
- **25 vector PDFs** (`corpus/vector/`) - Text-based PDFs where pdftract should excel
- **25 raster PDFs** (`corpus/raster/`) - Scanned documents requiring OCR
All documents are committed to the repository; the corpus totals about 10 MB.
## Tools
All competitor versions are pinned in `requirements.txt` to ensure baseline stability:
- `pdfminer.six==20231228`
- `pypdf==4.2.0`
- `pdfplumber==0.11.0`
Updates to these versions require a deliberate PR with manual baseline refresh.
## Running Benchmarks Locally
### Prerequisites
```bash
# Install hyperfine
apt-get install hyperfine
# Install competitor tools
pip install -r requirements.txt
# Ensure pdftract is in PATH
which pdftract
```
### Quick Run
```bash
cd benches/competitors
./run-benchmarks.sh
```
### Custom Baseline
```bash
BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh
```
## CI Integration
The `bench-matrix` step in `.ci/argo-workflows/pdftract-ci.yaml` runs these benchmarks on every PR:
1. Installs hyperfine and competitor tools
2. Downloads the pdftract binary artifact from build-matrix
3. Runs the full benchmark suite
4. Checks regression and 10x-faster gates
5. Publishes `benchmark-results.json` as an artifact
6. Posts a formatted summary as a PR comment
## Gates
### Regression Gate
Compares pdftract's geometric mean time against the baseline (`benches/baselines/main.json`):
- **Threshold:** 10% regression
- **Baseline source:** `git show main:benches/baselines/main.json`
- **Failure:** PR is blocked if regression > 10%
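
The regression check can be sketched as follows. This is a minimal illustration, not the CI script itself: the function names `regression_pct` and `passes_regression_gate` are hypothetical, and it assumes the value being compared is the `pdftract_geomean` field of the baseline file.

```python
def regression_pct(current_geomean: float, baseline_geomean: float) -> float:
    """Percent slowdown relative to the baseline; negative means faster."""
    if baseline_geomean <= 0:
        raise ValueError("baseline geomean must be positive")
    return (current_geomean - baseline_geomean) / baseline_geomean * 100.0


def passes_regression_gate(
    current_geomean: float,
    baseline_geomean: float,
    threshold_pct: float = 10.0,  # threshold stated in this README
) -> bool:
    """Return True when the slowdown does not exceed the threshold."""
    return regression_pct(current_geomean, baseline_geomean) <= threshold_pct


if __name__ == "__main__":
    # Hypothetical numbers: baseline 10.0 ms, PR build 10.5 ms (5% slower).
    print(passes_regression_gate(10.5, 10.0))
```

A PR that is faster than the baseline yields a negative percentage and always passes.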
### 10x-Faster Gate
Ensures pdftract maintains its speed advantage:
- **Threshold:** `pdftract_geomean / pdfminer_geomean <= 0.1`
- **Scope:** Vector PDFs only (where pdftract should excel)
- **Failure:** PR is blocked if ratio > 0.1 (less than 10x faster)
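
As a rough sketch of the ratio check above, assuming both geomeans were computed over the vector PDFs only (the function name `passes_10x_gate` is hypothetical):

```python
def passes_10x_gate(
    pdftract_geomean: float,
    pdfminer_geomean: float,
    max_ratio: float = 0.1,  # threshold stated in this README
) -> bool:
    """Return True when pdftract is at least 10x faster than pdfminer.six."""
    if pdfminer_geomean <= 0:
        raise ValueError("pdfminer geomean must be positive")
    return pdftract_geomean / pdfminer_geomean <= max_ratio


if __name__ == "__main__":
    # Hypothetical numbers: 8 ms vs 100 ms gives a ratio of 0.08.
    print(passes_10x_gate(8.0, 100.0))
```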
### Special Benchmark: pdftract-grep-1000
Runs `pdftract grep "the" wikipedia-1000.pdf` 5 times with warmup:
- Tests search performance on a 1000-page document
- Regression > 10% blocks the PR
- Independent of the main corpus benchmarks
## Output Schema
`benchmark-results.json` contains an array of objects:
```json
[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]
```
Crashes are excluded from geometric mean calculations but are recorded for visibility.
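
One way to derive per-tool geometric means from this file is sketched below. It assumes the geomean is taken over each document's `mean_ms`, which this README does not state explicitly, and the helper name `geomeans_by_tool` is hypothetical.

```python
import json
from collections import defaultdict
from statistics import geometric_mean


def geomeans_by_tool(results: list[dict]) -> dict[str, float]:
    """Geometric mean of mean_ms per tool, skipping crashed runs."""
    times: dict[str, list[float]] = defaultdict(list)
    for entry in results:
        if entry.get("crash"):
            continue  # crashed runs carry no timing fields
        times[entry["tool"]].append(entry["mean_ms"])
    return {tool: geometric_mean(values) for tool, values in times.items()}


if __name__ == "__main__":
    with open("benchmark-results.json") as fh:
        print(geomeans_by_tool(json.load(fh)))
```

A tool whose every run crashed has no entry in the returned dictionary, so callers should treat a missing key as "no data" rather than as zero.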
## Baseline Schema
`benches/baselines/main.json` stores the commit-sha-specific baseline:
```json
{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}
```
## Noise Reduction
Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:
1. **Hyperfine warmup:** 2 warmup runs discarded before timing
2. **Multiple runs:** 5 timed runs per (tool, document) pair
3. **Geometric mean:** Computed across all documents for each tool
4. **95% CI:** Reported in PR comments to show variance
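
The warmup and run counts above map onto hyperfine's `--warmup` and `--runs` flags. The sketch below only builds such a command line; how `run-benchmarks.sh` actually invokes hyperfine is not shown here, so the helper name `hyperfine_command` and the use of `--export-json` are assumptions.

```python
def hyperfine_command(
    command: str,
    export_path: str,
    warmup: int = 2,  # warmup runs discarded before timing
    runs: int = 5,    # timed runs per (tool, document) pair
) -> list[str]:
    """Build an argv list for timing one command with hyperfine."""
    return [
        "hyperfine",
        "--warmup", str(warmup),
        "--runs", str(runs),
        "--export-json", export_path,
        command,
    ]


if __name__ == "__main__":
    # Hypothetical command string and output path, for illustration only.
    print(hyperfine_command("pdftract grep the wikipedia-1000.pdf", "out.json"))
```

The resulting list can be passed to `subprocess.run` on a machine where hyperfine is installed.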
## Updating Baselines
When merging to main, the baseline can be refreshed:
1. Run benchmarks locally or extract from CI artifacts
2. Update `benches/baselines/main.json` with new geomeans
3. Commit and push to main
Do NOT update baselines for PR branches - they should always compare against main.
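
Step 2 could be scripted along these lines. The sketch fills in the fields of the baseline schema; the helper name `make_baseline` is hypothetical, and it assumes the geomeans are supplied as a dictionary keyed by tool name.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_baseline(
    commit_sha: str,
    geomeans: dict[str, float],
    corpus_size: int,
    notes: str = "Baseline from main branch",
    now: Optional[datetime] = None,
) -> dict:
    """Build a baseline record shaped like benches/baselines/main.json."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "commit_sha": commit_sha,
        "timestamp": stamp,
        "pdftract_geomean": geomeans["pdftract"],
        "pdfminer_geomean": geomeans["pdfminer"],
        "pypdf_geomean": geomeans["pypdf"],
        "pdfplumber_geomean": geomeans["pdfplumber"],
        "corpus_size": corpus_size,
        "notes": notes,
    }


if __name__ == "__main__":
    # Hypothetical values; real geomeans come from a benchmark run on main.
    example = {"pdftract": 10.0, "pdfminer": 100.0, "pypdf": 120.0, "pdfplumber": 150.0}
    print(json.dumps(make_baseline("abc123", example, 50), indent=2))
```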
## Troubleshooting
### Hyperfine not found
```bash
apt-get install hyperfine
```
### Python tools not found
```bash
pip install -r benches/competitors/requirements.txt
```
### pdftract not found
Ensure the binary is built and in PATH, or use the CI artifact download.
### High variance
- Ensure CPU is not throttled (`cpufreq-info`)
- Check for background processes consuming CPU
- Run with more iterations (modify `--runs 5` in script)
## References
- Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
- Quality Targets, Tier 4 (competitive bench hard gate)
- Mission (speed differentiator)