# Competitive Benchmarks

This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.

## Purpose

Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:
1. pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
2. Performance regressions are caught in CI before merge
3. Competitive positioning is tracked over time

## Corpus

The benchmark corpus consists of 50 representative PDFs:
- **25 vector PDFs** (`corpus/vector/`) - Text-based PDFs where pdftract should excel
- **25 raster PDFs** (`corpus/raster/`) - Scanned documents requiring OCR

All documents are committed to the repository; the corpus totals about 10 MB.

## Tools

All competitor versions are pinned in `requirements.txt` to ensure baseline stability:
- `pdfminer.six==20231228`
- `pypdf==4.2.0`
- `pdfplumber==0.11.0`

Updates to these versions require a deliberate PR with manual baseline refresh.
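
Because the baseline is only meaningful against the pinned versions, it can help to confirm that the local environment matches `requirements.txt` before benchmarking. The following is a minimal sketch, not part of the benchmark scripts; `parse_pins` and `mismatched_pins` are hypothetical helper names, and only `name==version` lines are handled.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def mismatched_pins(pins):
    """Return {package: (wanted, installed)} for pins the environment misses."""
    bad = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            bad[name] = (wanted, installed)
    return bad
```

An empty result from `mismatched_pins(parse_pins(text))` would suggest the installed competitors match the pins.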

## Running Benchmarks Locally

### Prerequisites

```bash
# Install hyperfine
apt-get install hyperfine

# Install competitor tools (run from benches/competitors, where requirements.txt lives)
pip install -r requirements.txt

# Ensure pdftract is in PATH
which pdftract
```

### Quick Run

```bash
cd benches/competitors
./run-benchmarks.sh
```

### Custom Baseline

```bash
BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh
```

## CI Integration

The `bench-matrix` step in `.ci/argo-workflows/pdftract-ci.yaml` runs these benchmarks on every PR:
1. Installs hyperfine and competitor tools
2. Downloads the pdftract binary artifact from build-matrix
3. Runs the full benchmark suite
4. Checks regression and 10x-faster gates
5. Publishes `benchmark-results.json` as an artifact
6. Posts a formatted summary as a PR comment

## Gates

### Regression Gate

Compares pdftract's geometric mean time against the baseline (`benches/baselines/main.json`):
- **Threshold:** 10% regression
- **Baseline source:** `git show main:benches/baselines/main.json`
- **Failure:** PR is blocked if regression > 10%
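
The comparison behind this gate can be sketched as follows. This is an illustration under assumptions rather than the code in `run-benchmarks.sh`: the function names are hypothetical, and the regression is assumed to be the relative increase of the current geometric mean over the baseline value.

```python
def regression_pct(current_geomean, baseline_geomean):
    """Relative slowdown versus the baseline, in percent (negative = faster)."""
    return (current_geomean - baseline_geomean) / baseline_geomean * 100.0


def regression_gate_passes(current_geomean, baseline_geomean, threshold_pct=10.0):
    """The PR is blocked only when the regression exceeds the threshold."""
    return regression_pct(current_geomean, baseline_geomean) <= threshold_pct
```

With a baseline geomean of 10.0 ms, a PR measuring 10.5 ms would pass under this sketch, while one measuring 11.5 ms would be blocked.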

### 10x-Faster Gate

Ensures pdftract maintains its speed advantage:
- **Threshold:** `pdftract_geomean / pdfminer_geomean <= 0.1`
- **Scope:** Vector PDFs only (where pdftract should excel)
- **Failure:** PR is blocked if ratio > 0.1 (less than 10x faster)
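
As a rough sketch of the ratio check, assuming both geometric means have already been computed over the vector PDFs only (the function names are hypothetical):

```python
def speed_ratio(pdftract_geomean, pdfminer_geomean):
    """Time ratio; 0.1 means pdftract takes one tenth of pdfminer's time."""
    return pdftract_geomean / pdfminer_geomean


def ten_x_gate_passes(pdftract_geomean, pdfminer_geomean, max_ratio=0.1):
    """The PR is blocked when the ratio rises above max_ratio."""
    return speed_ratio(pdftract_geomean, pdfminer_geomean) <= max_ratio
```

For example, geomeans of 8.0 ms and 100.0 ms give a ratio of 0.08 (12.5x faster), which passes; 12.0 ms and 100.0 ms give 0.12, which fails.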

### Special Benchmark: pdftract-grep-1000

Runs `pdftract grep "the" wikipedia-1000.pdf` 5 times with warmup:
- Tests search performance on a 1000-page document
- Regression > 10% blocks the PR
- Independent of the main corpus benchmarks

## Output Schema

`benchmark-results.json` contains an array of objects, one per (tool, document) pair. Entries for crashed runs omit the timing fields:

```json
[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]
```

Crashes are excluded from geometric mean calculations but are recorded for visibility.
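
A minimal sketch of how per-tool geometric means could be derived from records in this schema, skipping crashed entries. The function name `geomean_by_tool` is hypothetical, and the use of `mean_ms` as the per-document value is an assumption.

```python
import math
from collections import defaultdict


def geomean_by_tool(records):
    """Return {tool: geometric mean of mean_ms}, ignoring crashed runs."""
    logs = defaultdict(list)
    for rec in records:
        if rec.get("crash"):
            continue  # crashed entries have no timing fields
        logs[rec["tool"]].append(math.log(rec["mean_ms"]))
    # exp(mean(log x)) equals the geometric mean and stays numerically stable
    return {tool: math.exp(sum(v) / len(v)) for tool, v in logs.items()}
```

A tool whose every run crashed does not appear in the result at all, so callers would need to handle a missing key.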

## Baseline Schema

`benches/baselines/main.json` stores the baseline recorded for a specific commit SHA:

```json
{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}
```

## Noise Reduction

Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:
1. **Hyperfine warmup:** 2 warmup runs discarded before timing
2. **Multiple runs:** 5 timed runs per (tool, document) pair
3. **Geometric mean:** Computed across all documents for each tool
4. **95% confidence interval:** Reported in PR comments to show variance
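
The warmup and run counts above map onto hyperfine's `--warmup` and `--runs` flags. The sketch below shows one way a single (tool, document) measurement might be driven from Python and converted into the output schema. The helper names are hypothetical, the real `run-benchmarks.sh` may work differently, and treating a non-zero hyperfine exit as a crash is an assumption.

```python
import json
import subprocess


def hyperfine_command(tool_cmd, export_path, warmup=2, runs=5):
    """Build a hyperfine argv with the warmup and run counts listed above."""
    return [
        "hyperfine",
        "--warmup", str(warmup),
        "--runs", str(runs),
        "--export-json", export_path,
        tool_cmd,
    ]


def to_record(tool, doc, hf_result):
    """Convert one hyperfine result (times in seconds) to a record in ms."""
    return {
        "tool": tool,
        "doc": doc,
        "mean_ms": hf_result["mean"] * 1000.0,
        "stddev_ms": hf_result["stddev"] * 1000.0,
        "min_ms": hf_result["min"] * 1000.0,
        "max_ms": hf_result["max"] * 1000.0,
        "crash": False,
    }


def run_one(tool, doc, tool_cmd, export_path="hyperfine.json"):
    """Run hyperfine once; a non-zero exit is recorded as a crash (assumption)."""
    proc = subprocess.run(hyperfine_command(tool_cmd, export_path))
    if proc.returncode != 0:
        return {"tool": tool, "doc": doc, "crash": True}
    with open(export_path) as fh:
        return to_record(tool, doc, json.load(fh)["results"][0])
```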

## Updating Baselines

When merging to main, the baseline can be refreshed:

1. Run benchmarks locally or extract from CI artifacts
2. Update `benches/baselines/main.json` with new geomeans
3. Commit and push to main

Do NOT update baselines for PR branches - they should always compare against main.
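
Step 2 could be scripted along these lines. This is a sketch assuming per-tool geomeans are already available as a dict keyed by the tool names used in `benchmark-results.json`; `build_baseline` and `TOOLS` are hypothetical names, and the tool keys for pypdf and pdfplumber are inferred from the baseline schema.

```python
from datetime import datetime, timezone

# Tool keys assumed to match the `<tool>_geomean` fields in the baseline schema
TOOLS = ("pdftract", "pdfminer", "pypdf", "pdfplumber")


def build_baseline(geomeans, commit_sha, corpus_size, notes="Baseline from main branch"):
    """Assemble a dict in the shape of benches/baselines/main.json."""
    baseline = {
        "commit_sha": commit_sha,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    for tool in TOOLS:
        baseline[f"{tool}_geomean"] = geomeans[tool]
    baseline["corpus_size"] = corpus_size
    baseline["notes"] = notes
    return baseline
```

Writing the result with `json.dump(baseline, fh, indent=2)` would produce a file in the same layout as the example in the Baseline Schema section.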

## Troubleshooting

### Hyperfine not found

```bash
apt-get install hyperfine
```

### Python tools not found

```bash
pip install -r benches/competitors/requirements.txt
```

### pdftract not found

Ensure the binary is built and in PATH, or use the CI artifact download.

### High variance

- Ensure CPU is not throttled (`cpufreq-info`)
- Check for background processes consuming CPU
- Run with more iterations (raise the `--runs 5` value in `run-benchmarks.sh`)

## References

- Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
- Quality Targets, Tier 4 (competitive bench hard gate)
- Mission (speed differentiator)