Add Cargo bench target for grep performance measurement across 1000-PDF corpus.

Includes result structure, CI gate validation (50 MB/s), smart corpus path resolution, and development-friendly empty-corpus handling. Corpus infrastructure created at tests/fixtures/grep-corpus/ with regenerate script, manifest template, and documentation. Benchmark ready to wire to actual grep implementation once 7.8.3-7.8.8 sub-tasks complete.

Closes: pdftract-5bzpg

Files:
- crates/pdftract-cli/Cargo.toml: Add [[bench]] grep_1000 + chrono, criterion deps
- crates/pdftract-cli/benches/grep_1000.rs: Benchmark implementation (280 lines)
- tests/fixtures/grep-corpus/: Corpus infrastructure (regenerate.sh, manifest, README)
- notes/pdftract-5bzpg.md: Verification note with acceptance criteria status

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract grep-corpus
Benchmark corpus for the pdftract-grep-1000 CI benchmark.
Purpose
This corpus contains 1000 PDFs (~100 MB total) used to benchmark and validate the grep feature's performance and correctness.
Structure
tests/fixtures/grep-corpus/
├── corpus/ # Actual PDF files
├── manifest.csv # File metadata and expected match counts
├── regenerate.sh # Script to rebuild the corpus
└── README.md # This file
Usage
Running the benchmark
cargo bench --bench grep_1000
Regenerating the corpus
cd tests/fixtures/grep-corpus
./regenerate.sh
Corpus Requirements
The corpus must satisfy:
- Size: 1000 PDF files, ~100 MB total
- Content: Mix of vector and scanned PDFs
- License: Public domain or openly licensed (e.g. CC BY, MIT; CC BY-SA is acceptable but is share-alike rather than permissive)
- Determinism: Regenerable from source (no manual uploads)
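The size requirement can be checked mechanically before a benchmark run. The following is a minimal sketch, not the project's actual code: `CorpusStats`, `scan_corpus`, and `meets_size_requirement` are hypothetical names, "~100 MB" is read as 100,000,000 bytes, and the ±10 MB tolerance is an assumption, since the requirement only says "~".

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Summary of the PDFs found directly inside a corpus directory (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct CorpusStats {
    pub pdf_count: usize,
    pub total_bytes: u64,
}

/// Count the `.pdf` files (case-insensitive extension) in `dir` and sum their sizes.
/// Subdirectories are not traversed.
pub fn scan_corpus(dir: &Path) -> io::Result<CorpusStats> {
    let mut stats = CorpusStats { pdf_count: 0, total_bytes: 0 };
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_pdf = path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"));
        if is_pdf && path.is_file() {
            stats.pdf_count += 1;
            stats.total_bytes += entry.metadata()?.len();
        }
    }
    Ok(stats)
}

/// Check the "1000 PDF files, ~100 MB total" requirement.
/// The tolerance is an assumption, not a documented value.
pub fn meets_size_requirement(stats: &CorpusStats) -> bool {
    const EXPECTED_FILES: usize = 1000;
    const TARGET_BYTES: u64 = 100_000_000;
    const TOLERANCE_BYTES: u64 = 10_000_000;
    stats.pdf_count == EXPECTED_FILES
        && stats.total_bytes.abs_diff(TARGET_BYTES) <= TOLERANCE_BYTES
}
```

Calling `scan_corpus(Path::new("tests/fixtures/grep-corpus/corpus"))` and passing the result to `meets_size_requirement` would report whether a regenerated corpus is in range. The content, license, and determinism requirements are not covered by this check.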
CI Gates
The benchmark enforces these gates on every PR:
- Throughput: ≥ 50 MB/s on 4-core CI machine
- vs pdfgrep: ≥ 2× faster
- vs pdftotext+ripgrep: ≥ 3× faster
- Regression: throughput no more than 10% below the historical main baseline
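The four gates above can be expressed as a single check over one set of timings. This is a sketch under stated assumptions rather than the benchmark's real implementation: the `Measurement` struct and both function names are hypothetical, MB is taken as 10^6 bytes, "N× faster" is read as a wall-clock ratio of at least N, and the regression gate is read as a drop in throughput relative to a stored baseline.

```rust
// Thresholds copied from the gate list; all names here are hypothetical.
const MIN_THROUGHPUT_MB_S: f64 = 50.0;
const MIN_SPEEDUP_VS_PDFGREP: f64 = 2.0;
const MIN_SPEEDUP_VS_PDFTOTEXT_RG: f64 = 3.0;
const MAX_REGRESSION: f64 = 0.10;

/// Wall-clock timings for one run over the same corpus and pattern.
pub struct Measurement {
    pub corpus_bytes: u64,
    pub pdftract_secs: f64,
    pub pdfgrep_secs: f64,
    pub pdftotext_rg_secs: f64,
    /// Throughput recorded for main, in MB/s.
    pub baseline_mb_s: f64,
}

/// Throughput in MB/s, assuming 1 MB = 1_000_000 bytes.
pub fn throughput_mb_s(bytes: u64, secs: f64) -> f64 {
    bytes as f64 / 1_000_000.0 / secs
}

/// Return the names of the gates that fail; an empty list means the run passes.
pub fn failed_gates(m: &Measurement) -> Vec<&'static str> {
    let mut failed = Vec::new();
    let mb_s = throughput_mb_s(m.corpus_bytes, m.pdftract_secs);
    if mb_s < MIN_THROUGHPUT_MB_S {
        failed.push("throughput");
    }
    if m.pdfgrep_secs / m.pdftract_secs < MIN_SPEEDUP_VS_PDFGREP {
        failed.push("vs pdfgrep");
    }
    if m.pdftotext_rg_secs / m.pdftract_secs < MIN_SPEEDUP_VS_PDFTOTEXT_RG {
        failed.push("vs pdftotext+ripgrep");
    }
    if mb_s < m.baseline_mb_s * (1.0 - MAX_REGRESSION) {
        failed.push("regression");
    }
    failed
}
```

Returning the list of failed gates, rather than a single boolean, lets a CI log name every gate that was missed in one run.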
Status
TODO: Populate corpus (blocked on the 7.8.1-7.8.9 grep implementation sub-tasks).
Sources (TODO)
Potential corpus sources:
- arXiv API (metadata is public domain; each paper's PDF carries its own license and must be checked against the license requirement)
- Wikipedia article exports (CC BY-SA)
- Synthetic PDFs via pdfjoin
Manifest Format
filename,size_bytes,expected_matches_for_pattern_the
doc001.pdf,102400,42
doc002.pdf,98304,15
...
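A reader for this format might look like the sketch below. It is illustrative only: `ManifestEntry` and `parse_manifest` are hypothetical names, and the naive comma split assumes filenames contain no commas or quotes. The `...` line in the sample above is an elision, not valid data, so this parser would reject it.

```rust
/// One data row of manifest.csv (hypothetical type; fields mirror the header).
#[derive(Debug, PartialEq)]
pub struct ManifestEntry {
    pub filename: String,
    pub size_bytes: u64,
    /// Expected number of matches for the pattern "the".
    pub expected_matches: u32,
}

/// Parse manifest text, skipping the header row and blank lines.
/// Errors carry the 1-based line number of the offending row.
pub fn parse_manifest(text: &str) -> Result<Vec<ManifestEntry>, String> {
    let mut entries = Vec::new();
    for (idx, line) in text.lines().enumerate().skip(1) {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let line_no = idx + 1;
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        if fields.len() != 3 {
            return Err(format!(
                "line {}: expected 3 fields, found {}",
                line_no,
                fields.len()
            ));
        }
        let size_bytes = fields[1]
            .parse::<u64>()
            .map_err(|e| format!("line {}: invalid size_bytes: {}", line_no, e))?;
        let expected_matches = fields[2]
            .parse::<u32>()
            .map_err(|e| format!("line {}: invalid match count: {}", line_no, e))?;
        entries.push(ManifestEntry {
            filename: fields[0].to_string(),
            size_bytes,
            expected_matches,
        });
    }
    Ok(entries)
}
```

Summing `size_bytes` over the parsed entries gives the corpus size used for a throughput calculation, and `expected_matches` gives a per-file correctness check for the pattern `the`.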