
jedarden bae41cc771	2026-05-25 08:53:23 -04:00

feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton

Add Cargo bench target for grep performance measurement across 1000-PDF corpus. Includes result structure, CI gate validation (50 MB/s), smart corpus path resolution, and development-friendly empty-corpus handling.

Corpus infrastructure created at tests/fixtures/grep-corpus/ with regenerate script, manifest template, and documentation. Benchmark ready to wire to actual grep implementation once 7.8.3-7.8.8 sub-tasks complete.

Closes: pdftract-5bzpg

Files:
- crates/pdftract-cli/Cargo.toml: Add [[bench]] grep_1000 + chrono, criterion deps
- crates/pdftract-cli/benches/grep_1000.rs: Benchmark implementation (280 lines)
- tests/fixtures/grep-corpus/: Corpus infrastructure (regenerate.sh, manifest, README)
- notes/pdftract-5bzpg.md: Verification note with acceptance criteria status

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
corpus	feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton	2026-05-25 08:53:23 -04:00
manifest.csv	feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton	2026-05-25 08:53:23 -04:00
README.md	feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton	2026-05-25 08:53:23 -04:00
regenerate.sh	feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton	2026-05-25 08:53:23 -04:00

README.md

pdftract grep-corpus

Benchmark corpus for the pdftract-grep-1000 CI benchmark.

Purpose

This corpus is intended to hold 1000 PDFs (~100 MB total) for benchmarking the grep feature's performance and validating its correctness.

Structure

tests/fixtures/grep-corpus/
├── corpus/              # Actual PDF files
├── manifest.csv         # File metadata and expected match counts
├── regenerate.sh        # Script to rebuild the corpus
└── README.md            # This file

Usage

Running the benchmark

cargo bench --bench grep_1000

Regenerating the corpus

cd tests/fixtures/grep-corpus
./regenerate.sh

Corpus Requirements

The corpus must satisfy:

Size: 1000 PDF files, ~100 MB total
Content: Mix of vector and scanned PDFs
License: Public domain or openly licensed (CC BY-SA, MIT, etc.)
Determinism: Regenerable from source (no manual uploads)
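
The size requirement above could be checked mechanically before a benchmark run. The following is a minimal sketch, not the project's actual validation code: the names `check_corpus` and `CorpusError` are hypothetical, "MB" is assumed to mean 10^6 bytes, and the ±10% tolerance around "~100 MB" is an assumption chosen for illustration.

```rust
/// Hypothetical error type for a corpus that misses the stated requirements.
#[derive(Debug, PartialEq)]
pub enum CorpusError {
    WrongFileCount { expected: usize, actual: usize },
    SizeOutOfRange { total_bytes: u64 },
}

/// Targets taken from the requirements list.
const EXPECTED_FILES: usize = 1000;
/// Assumption: "~100 MB" is read as 100 * 10^6 bytes.
const TARGET_BYTES: u64 = 100_000_000;
/// Assumption: "~" is read as within 10% of the target in either direction.
const TOLERANCE_BYTES: u64 = TARGET_BYTES / 10;

/// Checks the file count and total size, given one size in bytes per PDF.
pub fn check_corpus(sizes: &[u64]) -> Result<(), CorpusError> {
    if sizes.len() != EXPECTED_FILES {
        return Err(CorpusError::WrongFileCount {
            expected: EXPECTED_FILES,
            actual: sizes.len(),
        });
    }
    let total: u64 = sizes.iter().sum();
    let lo = TARGET_BYTES - TOLERANCE_BYTES;
    let hi = TARGET_BYTES + TOLERANCE_BYTES;
    if total < lo || total > hi {
        return Err(CorpusError::SizeOutOfRange { total_bytes: total });
    }
    Ok(())
}
```

The sizes could come from the `size_bytes` column of manifest.csv or from the file system; the content mix, license, and determinism requirements are not covered by this check.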

CI Gates

The benchmark enforces these gates on every PR:

Throughput: ≥ 50 MB/s on a 4-core CI machine
vs pdfgrep: ≥ 2× faster
vs pdftotext+ripgrep: ≥ 3× faster
Regression: ≤ 10% slowdown vs historical main
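
The four gates listed above can be expressed as a single check over one set of measurements. This is a sketch under stated assumptions, not the logic in benches/grep_1000.rs: the `Measurement` struct, its field names, and `failed_gates` are hypothetical; "N× faster" is read as a wall-clock time ratio of at least N; the regression gate is read as throughput falling no more than 10% below a stored baseline; and MB is taken as 10^6 bytes.

```rust
/// Measurements for one benchmark run. All names here are hypothetical.
pub struct Measurement {
    /// Total corpus size in bytes.
    pub corpus_bytes: u64,
    /// Wall-clock seconds for pdftract grep over the corpus.
    pub pdftract_secs: f64,
    /// Wall-clock seconds for pdfgrep over the same corpus and pattern.
    pub pdfgrep_secs: f64,
    /// Wall-clock seconds for the pdftotext + ripgrep pipeline.
    pub pdftotext_rg_secs: f64,
    /// Throughput recorded for the historical main baseline, in MB/s.
    pub baseline_mb_per_s: f64,
}

/// Throughput in MB/s, assuming 1 MB = 10^6 bytes.
pub fn throughput_mb_per_s(bytes: u64, secs: f64) -> f64 {
    bytes as f64 / 1_000_000.0 / secs
}

/// Returns the names of the gates that failed; empty means all passed.
pub fn failed_gates(m: &Measurement) -> Vec<&'static str> {
    let mut failures = Vec::new();
    let tput = throughput_mb_per_s(m.corpus_bytes, m.pdftract_secs);
    if tput < 50.0 {
        failures.push("throughput");
    }
    // "2x faster" is interpreted as taking at most half the time.
    if m.pdfgrep_secs / m.pdftract_secs < 2.0 {
        failures.push("vs pdfgrep");
    }
    if m.pdftotext_rg_secs / m.pdftract_secs < 3.0 {
        failures.push("vs pdftotext+ripgrep");
    }
    // A drop of more than 10% below the baseline counts as a regression.
    if tput < m.baseline_mb_per_s * 0.90 {
        failures.push("regression");
    }
    failures
}
```

Returning the list of failed gate names, rather than a single boolean, lets a CI job report every violated gate in one run.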

Status

TODO: Populate corpus (blocked on the 7.8.1-7.8.9 grep implementation).

Sources (TODO)

Potential corpus sources:

arXiv API (metadata is CC0; licenses of the PDFs themselves vary per paper)
Wikipedia article exports (CC BY-SA)
Synthetic PDFs via pdfjoin

Manifest Format

filename,size_bytes,expected_matches_for_pattern_the
doc001.pdf,102400,42
doc002.pdf,98304,15
...
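
A reader for the format above might look like the following sketch. It uses only the standard library; `ManifestEntry` and `parse_manifest` are hypothetical names, and it assumes plain comma-separated fields with no quoting (so filenames must not contain commas). The `...` row in the sample is an elision, not data, and this parser would reject it.

```rust
/// One manifest row. Struct and field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct ManifestEntry {
    pub filename: String,
    pub size_bytes: u64,
    /// Expected number of matches for the pattern "the".
    pub expected_matches: u32,
}

/// Parses manifest text: a header row followed by one row per PDF.
/// Assumption: no quoted fields and no commas inside filenames.
pub fn parse_manifest(text: &str) -> Result<Vec<ManifestEntry>, String> {
    let mut lines = text.lines();
    // The first line must be the header shown in the README.
    match lines.next() {
        Some(h) if h.trim() == "filename,size_bytes,expected_matches_for_pattern_the" => {}
        other => return Err(format!("unexpected header: {:?}", other)),
    }
    let mut entries = Vec::new();
    for (i, line) in lines.enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // tolerate blank lines, e.g. a trailing newline
        }
        let line_no = i + 2; // 1-based, counting the header as line 1
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != 3 {
            return Err(format!("line {}: expected 3 fields, got {}", line_no, fields.len()));
        }
        let size_bytes = fields[1]
            .parse::<u64>()
            .map_err(|e| format!("line {}: size_bytes: {}", line_no, e))?;
        let expected_matches = fields[2]
            .parse::<u32>()
            .map_err(|e| format!("line {}: match count: {}", line_no, e))?;
        entries.push(ManifestEntry {
            filename: fields[0].to_string(),
            size_bytes,
            expected_matches,
        });
    }
    Ok(entries)
}
```

The parsed `expected_matches` values could then be compared against the counts the grep implementation reports for each file, which is one way to cover the correctness half of the corpus's purpose.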
