pdftract/notes/bf-53y8h.md
jedarden 63a2da9f97 docs(bf-53y8h): add verification note for vector CER corpus
Verified that tests/fixtures/vector/ corpus is complete with 10 fixtures,
each containing source.pdf, ground_truth.txt, and README.md. All files
tracked in git and valid for CER testing (< 0.5% target).

Closes bf-53y8h
2026-06-01 08:23:59 -04:00

bf-53y8h: Vector CER Ground-Truth Corpus Verification

Task

Assemble tests/fixtures/vector/ ground-truth corpus (CER gate) — 5-10 clean LaTeX/Word PDFs with paired .txt ground-truth files required for AS-01 and the <0.5% CER Tier 1 gate.
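For context, character error rate is conventionally the Levenshtein edit distance between the extracted text and the ground truth, divided by the ground-truth length. The sketch below illustrates the < 0.5% gate under that definition. The function names are hypothetical, and this note does not state how the project's own harness normalizes text or computes CER, so treat this as an illustration rather than the project's method.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference (ground truth) must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def passes_tier1_gate(reference: str, hypothesis: str, threshold: float = 0.005) -> bool:
    """True when CER is strictly below the threshold (0.5% by default)."""
    return cer(reference, hypothesis) < threshold
```

Under this definition, one wrong character in a 1,000-character ground truth gives a CER of 0.1% and passes, while one wrong character in a 100-character ground truth gives 1% and fails.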

Finding

The corpus already exists and is complete. The directory was populated in a prior commit (9e195a43) as part of the URL fragment routing implementation.

Corpus Summary

Fixtures (10 total)

  1. academic-paper — Academic Paper on Machine Learning (14 lines)
  2. code-documentation — Code Library Documentation (22 lines)
  3. conference-proceedings — Conference Proceedings (13 lines)
  4. financial-report — Q1 Financial Report (13 lines)
  5. legal-contract — Service Agreement (14 lines)
  6. medical-research — Clinical Trial Results (14 lines)
  7. multi-page-academic — Multi-Page Academic Paper (12 lines)
  8. scientific-report — Climate Research Report (14 lines)
  9. technical-documentation — API Documentation (14 lines)
  10. user-manual — Product User Manual (18 lines)

File Structure

Each fixture contains:

  • source.pdf — Clean vector PDF with embedded text (valid %PDF-1.4 headers)
  • ground_truth.txt — Exact text content for CER comparison
  • README.md — Documentation with purpose, files, expected CER (< 0.5%), metadata
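The per-fixture layout described above could be checked programmatically along these lines. This is a minimal sketch, not part of the repository: `check_fixture` and `check_corpus` are hypothetical names, and the only checks are the ones this note mentions (three files present, a `%PDF-1.4` header, non-empty ground truth).

```python
from pathlib import Path

REQUIRED_FILES = ("source.pdf", "ground_truth.txt", "README.md")


def check_fixture(fixture_dir: Path) -> list[str]:
    """Return a list of problems found in one fixture directory (empty if none)."""
    problems = []
    for name in REQUIRED_FILES:
        if not (fixture_dir / name).is_file():
            problems.append(f"missing {name}")

    pdf = fixture_dir / "source.pdf"
    if pdf.is_file():
        with pdf.open("rb") as fh:
            # The header is the first 8 bytes of the file.
            if fh.read(8) != b"%PDF-1.4":
                problems.append("source.pdf lacks %PDF-1.4 header")

    truth = fixture_dir / "ground_truth.txt"
    if truth.is_file() and truth.stat().st_size == 0:
        problems.append("ground_truth.txt is empty")
    return problems


def check_corpus(root: Path) -> dict[str, list[str]]:
    """Map each fixture directory name under root to its list of problems."""
    return {d.name: check_fixture(d) for d in sorted(root.iterdir()) if d.is_dir()}
```

Calling `check_corpus(Path("tests/fixtures/vector"))` would return an empty list for every fixture that matches the layout.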

Statistics

  • Total fixtures: 10 (meets the 5-10 requirement, at its upper bound)
  • Total files: 31 (all tracked in git)
  • Total ground truth lines: 148
  • PDF sizes: 1,014 to 1,541 bytes
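The per-fixture line counts listed under Fixtures do sum to 148. The sketch below shows one way these figures could be recomputed from the directory. `corpus_stats` is a hypothetical helper, and it counts lines with `str.splitlines()`, which can differ by one per file from `wc -l` when a file lacks a trailing newline.

```python
from pathlib import Path


def corpus_stats(root: Path) -> dict:
    """Recompute fixture count, total ground-truth lines, and PDF size range."""
    fixtures = sorted(d for d in root.iterdir() if d.is_dir())
    total_lines = sum(
        len((d / "ground_truth.txt").read_text(encoding="utf-8").splitlines())
        for d in fixtures
    )
    pdf_sizes = [(d / "source.pdf").stat().st_size for d in fixtures]
    return {
        "fixtures": len(fixtures),
        "ground_truth_lines": total_lines,
        "pdf_size_range": (min(pdf_sizes), max(pdf_sizes)),  # bytes
    }
```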

Generation

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py (547 lines), which creates clean vector PDFs with proper Type1 fonts and WinAnsiEncoding for accurate text extraction.
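To illustrate what a PDF with a Type1 font and WinAnsiEncoding involves, here is a minimal hand-rolled writer. It is a sketch of the general technique only and is not taken from generate_vector_cer_corpus.py. The function name, the Helvetica font, the page size, and the text position are all assumptions made for the example.

```python
def build_minimal_pdf(lines: list[str]) -> bytes:
    """Build a one-page PDF 1.4 whose text uses a standard Type1 font (Helvetica)
    with WinAnsiEncoding. Illustrative only; layout values are arbitrary."""

    def esc(text: str) -> str:
        # Backslash and parentheses must be escaped inside PDF literal strings.
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    # Content stream: begin text, select font, set leading, position, one Tj per line.
    ops = ["BT", "/F1 12 Tf", "14 TL", "72 720 Td"]
    ops += [f"({esc(line)}) Tj T*" for line in lines]
    ops.append("ET")
    content = "\n".join(ops).encode("cp1252")  # WinAnsi corresponds to Windows-1252

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica "
        b"/Encoding /WinAnsiEncoding >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, needed by the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # entry for the free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off      # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Because the text is stored as literal strings in a simple single-byte encoding, an extractor can map bytes back to characters directly, which is what makes such files a low-noise baseline for CER measurement.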

Acceptance Criteria

  • PASS: 5-10 clean LaTeX/Word PDFs — 10 fixtures present
  • PASS: Paired .txt ground-truth files — all fixtures have ground_truth.txt
  • PASS: Directory exists at tests/fixtures/vector/ — confirmed
  • PASS: Suitable for CER testing — PDFs have embedded text, ground truth provided

Verification

# All files tracked in git
git ls-files tests/fixtures/vector/ | wc -l  # 31 files

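# PDF headers valid (checked per file; one grep over all files would pass if any single file matched)
for f in tests/fixtures/vector/*/source.pdf; do
  head -c 8 "$f" | grep -q "%PDF-1.4" || echo "$f: bad header"
done  # no output means all pass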

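# Ground truth non-empty (empty files are reported instead of silently skipped)
for f in tests/fixtures/vector/*/ground_truth.txt; do
  name=$(basename "$(dirname "$f")")
  if [ -s "$f" ]; then echo "$name: OK"; else echo "$name: EMPTY"; fi
done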

References

  • Corpus location: tests/fixtures/vector/
  • Generator script: tests/fixtures/vector/generate_vector_cer_corpus.py
  • Related bead: AS-01 (needs this corpus for CER validation)
  • Plan reference: CER Tier 1 gate (< 0.5% character error rate)

Note

The task description stated the directory "lacks ground-truth corpus," but verification shows a complete corpus already exists. This may indicate the task was created before the corpus was generated and committed.