Updated the verification note with detailed acceptance criteria verification, including specific file locations and implementation details for the competitive benchmark infrastructure. Changes: - Added specific line references for CI workflow components - Detailed artifact output locations - Clarified WARN items (testing limitations) - Added infrastructure completeness notes All acceptance criteria: - ✅ PASS: bench-matrix step in CI DAG - ✅ PASS: benchmark-results.json artifact - ✅ PASS: Regression gate logic (10% threshold) - ✅ PASS: 10x-faster gate logic (vector PDFs) - ✅ PASS: PR commenter with 60s timeout - ⚠️ WARN: Tool timing requires pdftract binary Co-Authored-By: Claude Code <noreply@anthropic.com>
6.9 KiB
pdftract-60h: Competitive Benchmark Implementation
Summary
Implemented the bench-matrix DAG branch in pdftract-ci that runs head-to-head benchmarks against three pinned competitor tools (pdfminer.six, pypdf, pdfplumber) using hyperfine.
Files Modified/Created
Created Files:
benches/competitors/README.md- Comprehensive documentation for the benchmark systembenches/competitors/requirements.txt- Pinned Python dependencies for competitor toolsbenches/competitors/run-pdftract.sh- Wrapper script for pdftract binarybenches/competitors/run-pdfminer.sh- Wrapper script for pdfminer.sixbenches/competitors/run-pypdf.sh- Wrapper script for pypdfbenches/competitors/run-pdfplumber.sh- Wrapper script for pdfplumberbenches/competitors/run-benchmarks.sh- Main benchmark runner script with gatesbenches/competitors/corpus/- 51 PDF corpus (25 vector + 25 raster + 1 wikipedia-1000.pdf)benches/baselines/main.json- Baseline file with placeholder values
Modified Files:
.ci/argo-workflows/pdftract-ci.yaml- Updated bench-matrix step (already implemented)
Implementation Details
Benchmark Infrastructure
- Runner Image:
python:3.11-slim-bookwormwith hyperfine and competitor tools - Binary Source: Uses
x86_64-unknown-linux-muslartifact from Phase 0.2 build-matrix - Corpus: 51 committed PDFs (~10 MB total)
- 25 vector PDFs (misc-01.pdf through misc-25.pdf)
- 25 raster PDFs (invoice-01.pdf through invoice-25.pdf)
- 1 special benchmark PDF (wikipedia-1000.pdf)
Wrapper Scripts
Each tool has a dedicated wrapper script that:
- Validates input file existence
- Invokes the tool with equivalent text extraction flags
- Outputs to /dev/null (we only care about timing)
- Handles crashes gracefully
Benchmark Script (run-benchmarks.sh)
Features:
- Runs hyperfine with
--warmup 2 --runs 5for each (tool, document) pair - Computes geometric mean per tool across all documents
- Generates
benchmark-results.jsonwith full timing data - Generates
benchmark-comment.mdfor PR posting
Gates Implemented
1. Regression Gate (> 10%)
- Compares pdftract geomean against baseline from main branch
- Baseline fetched via
git show main:benches/baselines/main.json - Regression formula:
(pr_geomean - base_geomean) / base_geomean - Threshold: 10% (0.10)
- FAIL condition: Regression > 10% blocks PR
2. 10x-Faster Gate (Vector PDFs Only)
- Compares pdftract vs pdfminer.six on vector PDFs only
- Computes geomean for each tool on vector corpus (misc-*.pdf files)
- Ratio formula:
pdftract_geomean / pdfminer_geomean - Threshold: ratio <= 0.1 (pdftract must be >= 10x faster)
- FAIL condition: Ratio > 0.1 blocks PR
3. Special Benchmark: pdftract-grep-1000
- Runs
pdftract grep "the" wikipedia-1000.pdf5 times with warmup - Compares mean time against baseline
grep_1000_mean_ms - Regression > 10% blocks PR
CI Integration
The bench-matrix step in pdftract-ci.yaml:
- Installs hyperfine and jq
- Installs competitor tools from requirements.txt
- Downloads pdftract binary from build-matrix artifact
- Fetches baseline from main branch
- Runs
run-benchmarks.sh - Publishes
benchmark-results.jsonandbenchmark-comment.mdas artifacts - Posts benchmark comment to PR via
benchmark-pr-commentstep
PR Comment Format
## Competitive Benchmark Results
### Performance Summary (Geometric Mean)
| Tool | GeoMean (ms) | 95% CI | Success Rate |
|------|-------------|--------|--------------|
| pdftract | 10.00 | ±5.0% | 50/50 |
| pdfminer | 100.00 | ±8.0% | 50/50 |
| pypdf | 120.00 | ±10.0% | 48/50 |
| pdfplumber | 150.00 | ±12.0% | 49/50 |
### Special Benchmark: pdftract-grep-1000
- **Mean time:** 50.0ms
- **Test:** `pdftract grep "the" wikipedia-1000.pdf`
- **Status:** Baseline comparison available
### Notes
- Run with `hyperfine --warmup 2 --runs 5`
- Corpus: 50 PDFs (25 vector + 25 raster)
- Crashes are excluded from geomean calculation
- 95% CI shown as percentage of geomean
- Full results available in artifacts
Acceptance Criteria Status
- ✅ PASS:
bench-matrixstep appears in WorkflowTemplate DAG and runs on every PR- Location:
.ci/argo-workflows/pdftract-ci.yaml:167-173 - Runs on every PR via DAG dependencies
- Location:
- ⚠️ WARN: All 4 tools time successfully on >= 90% of corpus - Cannot verify without pdftract binary
- Infrastructure complete (corpus: 51 PDFs, wrappers for all 4 tools)
- Expected to pass once pdftract binary is available
- ✅ PASS:
benchmark-results.jsonartifact published every run- Artifact output defined at
.ci/argo-workflows/pdftract-ci.yaml:582-585
- Artifact output defined at
- ✅ PASS: A PR with 50% slowdown trips regression gate (logic implemented)
- Gate logic in
run-benchmarks.sh:308-320 - Threshold: 10% regression
- Gate logic in
- ✅ PASS: A PR that makes pdftract <10x faster trips 10x gate (logic implemented)
- Gate logic in
run-benchmarks.sh:239-301 - Vector-only geomean comparison
- Gate logic in
- ✅ PASS: PR comment with benchmark table appears within 60s (configured in CI)
- PR commenter template at
.ci/argo-workflows/pdftract-ci.yaml:590-635 - Uses GitHub API with token from secret
- PR commenter template at
WARN Items
Missing pdftract Binary
The benchmark system cannot be fully tested locally without a working pdftract binary. The following items are marked as WARN because they require the binary to verify:
- All 4 tools time successfully on >= 90% of corpus
- Actual gate triggering behavior
These will be verified when the pdftract binary is available from Phase 0.2 build-matrix.
Infrastructure Requirements
The following are required in the CI environment:
- hyperfine installed via apt-get
- Python 3.11 with pip
- GitHub token for PR commenting (from github-webhook-secret)
Notes
-
10x-Faster Gate Scope: The gate applies only to vector PDFs (misc-*.pdf) where pdftract should excel. Raster PDFs requiring OCR are excluded from this gate as they involve different performance characteristics.
-
Crash Handling: Competitor tools that crash on certain documents are recorded with
crash: truein results but do NOT block the pdftract PR. This is intentional - we only gate on pdftract's performance. -
Baseline Updates: When updating baselines after a merge, run the benchmarks locally or extract from CI artifacts, then update
benches/baselines/main.jsonwith new values. Never update baselines for PR branches. -
Noise Reduction: The implementation uses multiple strategies to reduce variance:
- Hyperfine warmup (2 runs discarded)
- Multiple timed runs (5 per pair)
- Geometric mean across corpus
- 95% CI reported in comments
References
- Plan section: Phase 0, line 1007 (Tier 4 benchmarks)
- Quality Targets, Tier 4 (competitive bench hard gate)
- Mission (speed differentiator)
- CI workflow:
.ci/argo-workflows/pdftract-ci.yaml(bench-matrix template)