pdftract/tests/fixtures/grep-corpus
jedarden bae41cc771 feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton
Add Cargo bench target for grep performance measurement across 1000-PDF corpus.
Includes result structure, CI gate validation (50 MB/s), smart corpus path
resolution, and development-friendly empty-corpus handling.

Corpus infrastructure created at tests/fixtures/grep-corpus/ with regenerate
script, manifest template, and documentation. Benchmark ready to wire to
actual grep implementation once 7.8.3-7.8.8 sub-tasks complete.

Closes: pdftract-5bzpg

Files:
- crates/pdftract-cli/Cargo.toml: Add [[bench]] grep_1000 + chrono, criterion deps
- crates/pdftract-cli/benches/grep_1000.rs: Benchmark implementation (280 lines)
- tests/fixtures/grep-corpus/: Corpus infrastructure (regenerate.sh, manifest, README)
- notes/pdftract-5bzpg.md: Verification note with acceptance criteria status

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 08:53:23 -04:00

pdftract grep-corpus

Benchmark corpus for pdftract-grep-1000 CI benchmark.

Purpose

This corpus is intended to hold 1000 PDFs (~100 MB total) for benchmarking the grep feature's performance and validating its correctness. It is not yet populated (see Status).

Structure

tests/fixtures/grep-corpus/
├── corpus/              # Actual PDF files
├── manifest.csv         # File metadata and expected match counts
├── regenerate.sh        # Script to rebuild the corpus
└── README.md            # This file

Usage

Running the benchmark

cargo bench --bench grep_1000

Regenerating the corpus

cd tests/fixtures/grep-corpus
./regenerate.sh

Corpus Requirements

The corpus must satisfy:

  • Size: 1000 PDF files, ~100 MB total
  • Content: Mix of vector and scanned PDFs
  • License: Public domain or an open license that permits redistribution (CC BY-SA, MIT, etc.)
  • Determinism: Regenerable from source (no manual uploads)
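
The size requirement above could be checked mechanically before a benchmark run. The following is a minimal sketch, not part of the existing benchmark: the function name, the constants, and especially the 10% tolerance are assumptions, since this README only says "~100 MB", and 1 MB is taken to be 10^6 bytes.

```rust
/// Number of PDFs the corpus must contain (from the requirements list).
pub const EXPECTED_FILES: usize = 1000;
/// Target total size: ~100 MB, assuming 1 MB = 1_000_000 bytes.
pub const TARGET_BYTES: u64 = 100_000_000;
/// Assumed tolerance around the target; the README does not define one.
pub const TOLERANCE: f64 = 0.10;

/// Checks a list of per-file sizes (in bytes) against the size requirement.
/// Hypothetical helper for illustration only.
pub fn check_corpus_size(file_sizes: &[u64]) -> Result<(), String> {
    if file_sizes.len() != EXPECTED_FILES {
        return Err(format!(
            "expected {} files, found {}",
            EXPECTED_FILES,
            file_sizes.len()
        ));
    }
    let total: u64 = file_sizes.iter().sum();
    // Relative deviation of the total from the 100 MB target.
    let deviation = (total as f64 - TARGET_BYTES as f64).abs() / TARGET_BYTES as f64;
    if deviation > TOLERANCE {
        return Err(format!(
            "total size {} bytes is more than {}% away from {} bytes",
            total,
            TOLERANCE * 100.0,
            TARGET_BYTES
        ));
    }
    Ok(())
}
```

The per-file sizes could come from the size_bytes column of manifest.csv or from the file system. The content-mix and license requirements are not covered by this check.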

CI Gates

The benchmark enforces these gates on every PR:

  1. Throughput: ≥ 50 MB/s on a 4-core CI machine
  2. vs pdfgrep: ≥ 2× faster
  3. vs pdftotext+ripgrep: ≥ 3× faster
  4. Regression: ≤ 10% slowdown relative to the historical baseline on main
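
As an illustration of how the four gates combine, here is a minimal sketch. It is not the result structure in benches/grep_1000.rs: the `GrepRun` struct, its field names, and `failed_gates` are hypothetical. It also assumes 1 MB = 10^6 bytes and that the regression gate compares throughput against a stored baseline from main.

```rust
/// Measurements from one benchmark run.
/// Hypothetical structure; the real result type may differ.
#[derive(Debug, Clone, Copy)]
pub struct GrepRun {
    /// Total size of the corpus that was searched, in bytes.
    pub corpus_bytes: u64,
    /// Wall-clock time for pdftract grep, in seconds.
    pub pdftract_secs: f64,
    /// Wall-clock time for pdfgrep on the same corpus and pattern.
    pub pdfgrep_secs: f64,
    /// Wall-clock time for the pdftotext + ripgrep pipeline.
    pub pdftotext_rg_secs: f64,
    /// Throughput previously recorded on main, in MB/s (assumed baseline).
    pub baseline_mb_per_s: f64,
}

/// Returns the names of the gates that failed; an empty list means a pass.
pub fn failed_gates(run: &GrepRun) -> Vec<&'static str> {
    // Assumption: 1 MB = 1_000_000 bytes.
    let mb_per_s = run.corpus_bytes as f64 / 1_000_000.0 / run.pdftract_secs;
    let mut failed = Vec::new();

    // Gate 1: absolute throughput of at least 50 MB/s.
    if mb_per_s < 50.0 {
        failed.push("throughput");
    }
    // Gate 2: at least 2x faster than pdfgrep.
    if run.pdfgrep_secs / run.pdftract_secs < 2.0 {
        failed.push("vs pdfgrep");
    }
    // Gate 3: at least 3x faster than pdftotext + ripgrep.
    if run.pdftotext_rg_secs / run.pdftract_secs < 3.0 {
        failed.push("vs pdftotext+ripgrep");
    }
    // Gate 4: throughput may fall at most 10% below the baseline.
    if mb_per_s < run.baseline_mb_per_s * 0.9 {
        failed.push("regression");
    }
    failed
}
```

Returning every failed gate, rather than stopping at the first, would let a CI log show the full picture in a single run.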

Status

TODO: Populate corpus (blocked on 7.8.1-7.8.9 grep implementation).

Sources (TODO)

Potential corpus sources:

  • arXiv API (metadata is CC0; each paper's PDF carries its own license, which must be checked)
  • Wikipedia article exports (CC BY-SA)
  • Synthetic PDFs via pdfjoin

Manifest Format

filename,size_bytes,expected_matches_for_pattern_the
doc001.pdf,102400,42
doc002.pdf,98304,15
...
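
A manifest in this format could be read with the standard library alone. The sketch below is illustrative, not the benchmark's actual loader: `ManifestEntry` and `parse_manifest` are hypothetical names, and it assumes plain comma-separated fields with no quoting and no commas in filenames.

```rust
/// One data row of manifest.csv.
/// Hypothetical type; field names follow the header columns.
#[derive(Debug, PartialEq, Eq)]
pub struct ManifestEntry {
    pub filename: String,
    pub size_bytes: u64,
    /// Expected number of matches for the pattern "the".
    pub expected_matches: u32,
}

/// Parses manifest text whose first line is the header row.
/// Assumes unquoted fields; a row with the wrong shape is an error.
pub fn parse_manifest(text: &str) -> Result<Vec<ManifestEntry>, String> {
    let mut entries = Vec::new();
    // enumerate() before skip(1) keeps idx aligned with the file's lines,
    // so idx + 1 is the 1-based line number used in error messages.
    for (idx, line) in text.lines().enumerate().skip(1) {
        let line = line.trim();
        if line.is_empty() {
            continue; // tolerate blank lines, e.g. a trailing newline
        }
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != 3 {
            return Err(format!(
                "line {}: expected 3 fields, found {}",
                idx + 1,
                fields.len()
            ));
        }
        let size_bytes = fields[1]
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("line {}: bad size_bytes: {}", idx + 1, e))?;
        let expected_matches = fields[2]
            .trim()
            .parse::<u32>()
            .map_err(|e| format!("line {}: bad match count: {}", idx + 1, e))?;
        entries.push(ManifestEntry {
            filename: fields[0].trim().to_string(),
            size_bytes,
            expected_matches,
        });
    }
    Ok(entries)
}
```

If filenames could ever contain commas or quotes, a dedicated CSV parser would be a safer choice than splitting on commas.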