# Competitive Benchmarks

This directory contains the competitive benchmark infrastructure for pdftract, comparing its performance against three popular Python PDF libraries: pdfminer.six, pypdf, and pdfplumber.

## Purpose

Speed is one of pdftract's three differentiators (per the Mission statement). These benchmarks ensure that:
1. pdftract maintains at least a 10x speed advantage over pdfminer.six on vector PDFs
2. Performance regressions are caught in CI before merge
3. Competitive positioning is tracked over time

## Corpus

The benchmark corpus consists of 50 representative PDFs:
- **25 vector PDFs** (`corpus/vector/`) - Text-based PDFs where pdftract should excel
- **25 raster PDFs** (`corpus/raster/`) - Scanned documents requiring OCR

All documents are committed to the repository; the corpus totals roughly 10 MB.

## Tools

All competitor versions are pinned in `requirements.txt` to ensure baseline stability:
- `pdfminer.six==20231228`
- `pypdf==4.2.0`
- `pdfplumber==0.11.0`

Updating any of these versions requires a deliberate PR and a manual baseline refresh.

## Running Benchmarks Locally

### Prerequisites

```bash
# Install hyperfine
apt-get install hyperfine

# Install competitor tools (run from benches/competitors)
pip install -r requirements.txt

# Ensure pdftract is in PATH
which pdftract
```

### Quick Run

```bash
cd benches/competitors
./run-benchmarks.sh
```

### Custom Baseline

```bash
BASELINE=/path/to/baseline.json OUTPUT=results.json ./run-benchmarks.sh
```

## CI Integration

The `bench-matrix` step in `.ci/argo-workflows/pdftract-ci.yaml` runs these benchmarks on every PR:
1. Installs hyperfine and competitor tools
2. Downloads the pdftract binary artifact from build-matrix
3. Runs the full benchmark suite
4. Checks the regression and 10x-faster gates
5. Publishes `benchmark-results.json` as an artifact
6. Posts a formatted summary as a PR comment

## Gates

### Regression Gate

Compares pdftract's geometric mean time against the baseline (`benches/baselines/main.json`):
- **Threshold:** 10% regression
- **Baseline source:** `git show main:benches/baselines/main.json`
- **Failure:** PR is blocked if pdftract's geometric mean is more than 10% slower than the baseline

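The regression check can be sketched as follows. This is a minimal illustration, not the script's actual logic: the function names are hypothetical, and it assumes the regression is measured as the relative increase of the current geometric mean over the baseline's `pdftract_geomean`.

```python
def regression_pct(current_geomean, baseline_geomean):
    """Relative slowdown versus the baseline, in percent (negative = faster)."""
    return (current_geomean - baseline_geomean) / baseline_geomean * 100.0


def regression_gate_passes(current_geomean, baseline_geomean, threshold_pct=10.0):
    """Pass unless the slowdown exceeds the threshold (assumed 10% as stated above)."""
    return regression_pct(current_geomean, baseline_geomean) <= threshold_pct


# Hypothetical values: baseline geomean of 10.0, two candidate PR results.
ok = regression_gate_passes(10.5, 10.0)       # 5% slower
blocked = regression_gate_passes(11.5, 10.0)  # 15% slower
```

A PR that is faster than the baseline yields a negative percentage and always passes.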
### 10x-Faster Gate

Ensures pdftract maintains its speed advantage:
- **Threshold:** `pdftract_geomean / pdfminer_geomean <= 0.1`
- **Scope:** Vector PDFs only (where pdftract should excel)
- **Failure:** PR is blocked if the ratio exceeds 0.1 (pdftract is less than 10x faster)

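As a rough sketch of the threshold above, assuming both geometric means were computed over the vector PDFs only (the function names are illustrative, not taken from the benchmark script):

```python
def speed_ratio(pdftract_geomean, pdfminer_geomean):
    """Ratio of pdftract time to pdfminer.six time; smaller is better."""
    return pdftract_geomean / pdfminer_geomean


def ten_x_gate_passes(pdftract_geomean, pdfminer_geomean, max_ratio=0.1):
    """Pass when pdftract takes at most one tenth of pdfminer.six's time."""
    return speed_ratio(pdftract_geomean, pdfminer_geomean) <= max_ratio


# Hypothetical geomeans over the vector corpus.
passes = ten_x_gate_passes(8.0, 100.0)   # ratio 0.08, i.e. 12.5x faster
fails = ten_x_gate_passes(12.0, 100.0)   # ratio 0.12, i.e. about 8.3x faster
```

The speedup factor reported to readers is simply the inverse of the ratio.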
### Special Benchmark: pdftract-grep-1000

Runs `pdftract grep "the" wikipedia-1000.pdf` 5 times after warmup:
- Tests search performance on a 1000-page document
- Regression > 10% blocks the PR
- Independent of the main corpus benchmarks

## Output Schema

`benchmark-results.json` contains an array of objects:

```json
[
  {
    "tool": "pdftract",
    "doc": "misc-01.pdf",
    "mean_ms": 8.5,
    "stddev_ms": 0.3,
    "min_ms": 8.1,
    "max_ms": 9.2,
    "crash": false
  },
  {
    "tool": "pdfminer",
    "doc": "encrypted.pdf",
    "crash": true
  }
]
```

Crashes are excluded from geometric mean calculations but are recorded for visibility.

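To make the crash handling concrete, here is a minimal sketch of a per-tool geometric mean over entries in this schema. The helper name `geomean_by_tool` and the sample values are illustrative assumptions; the actual aggregation in the benchmark script may differ.

```python
import statistics
from collections import defaultdict


def geomean_by_tool(results):
    """Return {tool: geometric mean of mean_ms}, skipping crashed entries."""
    times = defaultdict(list)
    for entry in results:
        if entry.get("crash"):
            continue  # crashed entries carry no timing fields
        times[entry["tool"]].append(entry["mean_ms"])
    return {tool: statistics.geometric_mean(ts) for tool, ts in times.items()}


# Hypothetical results in the schema shown above.
results = [
    {"tool": "pdftract", "doc": "a.pdf", "mean_ms": 2.0, "crash": False},
    {"tool": "pdftract", "doc": "b.pdf", "mean_ms": 8.0, "crash": False},
    {"tool": "pdfminer", "doc": "encrypted.pdf", "crash": True},
]
geomeans = geomean_by_tool(results)
```

In this sketch a tool whose every run crashed is simply absent from the returned mapping, so callers need to handle a missing key.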
## Baseline Schema

`benches/baselines/main.json` stores the baseline recorded for a specific commit SHA:

```json
{
  "commit_sha": "abc123...",
  "timestamp": "2024-01-01T00:00:00Z",
  "pdftract_geomean": 10.0,
  "pdfminer_geomean": 100.0,
  "pypdf_geomean": 120.0,
  "pdfplumber_geomean": 150.0,
  "corpus_size": 50,
  "notes": "Baseline from main branch"
}
```

## Noise Reduction

Benchmark variance on Spot infrastructure can be high. The following strategies reduce noise:
1. **Hyperfine warmup:** 2 warmup runs discarded before timing
2. **Multiple runs:** 5 timed runs per (tool, document) pair
3. **Geometric mean:** Computed across all documents for each tool
4. **95% CI:** Reported in PR comments to show variance

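One common way to derive a 95% confidence interval from the reported mean and standard deviation is a Student-t interval. The sketch below assumes `stddev_ms` is the sample standard deviation over the timed runs; how the PR comment actually computes its interval is not specified here, and the function name is hypothetical.

```python
import math

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (runs - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262}


def ci95(mean_ms, stddev_ms, runs=5):
    """Return (low, high) bounds of a 95% t-interval around the mean."""
    half_width = T_CRIT_95[runs - 1] * stddev_ms / math.sqrt(runs)
    return (mean_ms - half_width, mean_ms + half_width)


# Values taken from the example entry in the output schema (5 timed runs).
low, high = ci95(8.5, 0.3, runs=5)
```

With only 5 runs the t critical value (2.776) is noticeably larger than the normal-approximation 1.96, so the interval is wider than a z-based one would suggest.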
## Updating Baselines

When merging to main, the baseline can be refreshed:

1. Run benchmarks locally or extract from CI artifacts
2. Update `benches/baselines/main.json` with new geomeans
3. Commit and push to main

Do NOT update baselines on PR branches; PRs should always compare against main.

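Step 2 could be scripted along these lines. This is a sketch under assumptions: `build_baseline` is a hypothetical helper, the per-tool key names follow the baseline schema, and the tool names `pypdf` and `pdfplumber` are assumed to match the `tool` field in the results.

```python
import json
from datetime import datetime, timezone

TOOLS = ("pdftract", "pdfminer", "pypdf", "pdfplumber")


def build_baseline(geomeans, commit_sha, corpus_size, notes="Baseline from main branch"):
    """Assemble a baseline document with the keys used in the baseline schema."""
    baseline = {
        "commit_sha": commit_sha,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    for tool in TOOLS:
        baseline[f"{tool}_geomean"] = geomeans[tool]
    baseline["corpus_size"] = corpus_size
    baseline["notes"] = notes
    return baseline


# Hypothetical geomeans; in practice these come from a fresh benchmark run.
baseline = build_baseline(
    {"pdftract": 10.0, "pdfminer": 100.0, "pypdf": 120.0, "pdfplumber": 150.0},
    commit_sha="abc123",
    corpus_size=50,
)
baseline_json = json.dumps(baseline, indent=2)
```

Writing `baseline_json` to `benches/baselines/main.json` and committing it on main would complete the refresh.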
## Troubleshooting

### Hyperfine not found

```bash
apt-get install hyperfine
```

### Python tools not found

```bash
pip install -r benches/competitors/requirements.txt
```

### pdftract not found

Ensure the binary is built and in PATH, or use the CI artifact download.

### High variance

- Ensure CPU is not throttled (`cpufreq-info`)
- Check for background processes consuming CPU
- Run with more iterations (change `--runs 5` in the script)

## References

- Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
- Quality Targets, Tier 4 (competitive bench hard gate)
- Mission (speed differentiator)