pdftract/xtask
jedarden 3f8daba449 feat(bf-2he4t): assemble scanned fixtures corpus with ground-truth transcripts
Complete scanned PDF fixtures corpus for OCR testing at 300 DPI with
paired ground-truth transcripts.

Corpus includes:
- receipt-300dpi: Single-page receipt for AS-02 scenario
- invoice-300dpi: Business invoice document
- form-300dpi: Employment application form
- doc-10page-300dpi: 10-page document for performance testing

Each fixture has:
- Vector PDF source (clean text rendering)
- Rasterized scanned PDF (simulated 300 DPI scan)
- Ground-truth transcript for WER verification

Files:
- tests/fixtures/scanned/receipt/receipt-300dpi{-scanned.pdf,.pdf,.txt}
- tests/fixtures/scanned/documents/{invoice,form}-300dpi{-scanned.pdf,.pdf,.txt}
- tests/fixtures/scanned/multi-page/doc-10page-300dpi{-scanned.pdf,.pdf,.txt}

Also added a native Rust generator (xtask/src/bin/gen_scanned_fixtures.rs)
and updated the generation script.

Verification: notes/bf-2he4t.md

Acceptance Criteria:
- [x] Corpus assembled with 4 fixture types
- [x] All fixtures at 300 DPI
- [x] Ground-truth transcripts paired with each fixture
- [x] Files verified present and valid
- [ ] WER < 3% verified with pdftract OCR pipeline (WARN: blocked by compilation errors)
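
For reference, the open WER criterion could be checked along these lines. This is a minimal sketch, not the pdftract OCR pipeline: `word_error_rate` and `passes_wer` are hypothetical helper names, tokenization is plain whitespace splitting with no case or punctuation normalization (an assumption), and WER is taken as word-level edit distance divided by the number of reference words.

```rust
/// Word error rate: word-level Levenshtein distance (substitutions,
/// deletions, insertions) divided by the number of reference words.
/// Hypothetical helper, not part of pdftract.
fn word_error_rate(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        // WER is undefined for an empty reference; by convention here,
        // report 0.0 if the hypothesis is also empty and 1.0 otherwise.
        return if h.is_empty() { 0.0 } else { 1.0 };
    }
    // prev[j] holds the distance between the reference words seen so far
    // and the first j hypothesis words.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut curr = vec![i + 1; h.len() + 1];
        for (j, hw) in h.iter().enumerate() {
            let substitution = prev[j] + usize::from(rw != hw);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        prev = curr;
    }
    prev[h.len()] as f64 / r.len() as f64
}

/// True when the hypothesis stays strictly below `max_wer`
/// (0.03 for the 3% criterion above).
fn passes_wer(reference: &str, hypothesis: &str, max_wer: f64) -> bool {
    word_error_rate(reference, hypothesis) < max_wer
}
```

Here `reference` would be the contents of a fixture's ground-truth `.txt` file and `hypothesis` the OCR output for the matching scanned PDF, so the check becomes `passes_wer(&reference, &hypothesis, 0.03)`.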

Closes bf-2he4t
2026-06-01 09:35:02 -04:00
src feat(bf-2he4t): assemble scanned fixtures corpus with ground-truth transcripts 2026-06-01 09:35:02 -04:00
Cargo.lock fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
Cargo.toml fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00