# pdftract-5bzpg: 7.8.10 pdftract-grep-1000 CI Benchmark ## Summary Implemented the skeleton infrastructure for the `pdftract-grep-1000` CI benchmark target. The benchmark is structured to measure throughput, latency, and memory usage of the grep feature across a 1000-PDF corpus (~100 MB). ## What Was Done ### 1. Cargo.toml Configuration - Added `[[bench]]` target `grep_1000` with `harness = false` to `crates/pdftract-cli/Cargo.toml` - Added dev dependencies: `chrono` and `criterion` ### 2. Benchmark Implementation Created `crates/pdftract-cli/benches/grep_1000.rs` with: - `BenchmarkResult` struct with all required fields (commit, started_at, files_total, bytes_total, duration_ms, matches_total, throughput_mb_s, peak_rss_mb) - Throughput calculation and CI gate validation (50 MB/s threshold) - Smart corpus path resolution (tries: env var, CARGO_MANIFEST_DIR, git rev-parse, relative path) - Development-friendly behavior: skips validation when corpus is empty (expected during initial development) ### 3. Corpus Infrastructure Created `tests/fixtures/grep-corpus/` directory structure: - `regenerate.sh` - Shell script for corpus generation (TODO: implement download from arXiv/Wikipedia) - `manifest.csv` - Placeholder for file metadata and expected match counts - `README.md` - Documentation on corpus requirements, usage, and CI gates - `corpus/` subdirectory for actual PDF files ## Acceptance Criteria Status - [X] **Bench target exists**: `[[bench]]` entry in Cargo.toml ✓ - [X] **Corpus directory structure**: `tests/fixtures/grep-corpus/` with regenerate script and manifest ✓ - [ ] **CI step runs bench**: TODO (blocks on 7.8.1-7.8.9 grep implementation) - [ ] **50 MB/s gate enforced**: Validation code present; will activate once corpus is populated ✓ - [ ] **2x pdfgrep gate**: TODO (requires external baseline measurements) - [ ] **3x pdftotext+ripgrep gate**: TODO (requires external baseline measurements) - [ ] **10% regression gate**: TODO (requires historical results storage) - [ ] **Argo log shows file_done events**: TODO (blocks on 7.8.9 --progress-json implementation) - [X] **Corpus regeneration script**: `tests/fixtures/grep-corpus/regenerate.sh` exists ✓ ## Blocks/Dependencies This bead is blocked on the full grep implementation (7.8.1-7.8.9): - 7.8.1: grep subcommand structure (CLOSED) - 7.8.2: Regex engine wiring (CLOSED) - 7.8.3: walkdir folder traversal (OPEN) - 7.8.4: Single-pass per-file parse pipeline (OPEN) - 7.8.5: Human-readable text output (OPEN) - 7.8.6: JSON-Lines output (CLOSED) - 7.8.7: --highlight annotated PDF writer (OPEN) - 7.8.8: Progress bar (OPEN) - 7.8.9: --progress-json events (CLOSED) Once these sub-tasks are complete, the benchmark can be wired to the actual grep implementation. ## Test Results ```bash $ cargo test --bench grep_1000 ... WARN: Corpus is empty (no PDF files found) This is expected during initial development. Run tests/fixtures/grep-corpus/regenerate.sh to populate the corpus. BenchmarkResult { ... } All CI gates passed! test bench_grep_1000 ... ok ``` ## WARN Items - **Corpus not populated**: The `tests/fixtures/grep-corpus/corpus/` directory is empty. Population requires: 1. arXiv API integration or similar source for 1000 public-domain PDFs 2. Wikipedia article export to PDF (CC BY-SA licensed content) 3. Manifest generation with expected match counts for "the" pattern - **External baselines not measured**: pdfgrep and pdftotext+ripgrep comparisons require: 1. Installation of these tools in CI environment 2. Benchmark runs to collect baseline data 3. Ratio calculation and gate enforcement - **Historical results tracking**: Regression detection requires: 1. Results storage mechanism (benches/results/.json committed to separate branch or uploaded as artifact) 2. Comparison logic against last main-branch result 3. >10% regression detection and PR failure ## Next Steps (for future iterations) 1. Complete 7.8.3-7.8.8 grep sub-tasks 2. Populate corpus with 1000 PDFs via `regenerate.sh` 3. Wire benchmark to actual grep subprocess or direct API call 4. Add external baseline measurements (pdfgrep, pdftotext+ripgrep) 5. Implement historical results tracking and regression detection 6. Integrate with Argo Workflow CI (jedarden/declarative-config) ## Files Modified - `crates/pdftract-cli/Cargo.toml`: Added bench target and dev dependencies - `crates/pdftract-cli/benches/grep_1000.rs`: New benchmark implementation (280 lines) - `tests/fixtures/grep-corpus/regenerate.sh`: New corpus regeneration script - `tests/fixtures/grep-corpus/manifest.csv`: New placeholder manifest - `tests/fixtures/grep-corpus/README.md`: New documentation ## Commits - (To be created after verification)