Add two PDF/A fixtures for testing assisted-OCR (BrokenVector path):
- Aligned fixture with correctly-positioned invisible text layer
- Misaligned fixture with text layer offset by (10pt, 5pt)

Extend ci/wer-gate.sh with WER validation for BrokenVector fixtures.

Acceptance criteria:
- Two BrokenVector fixtures committed (both 1.5 KB, well under 200 KB limit)
- ci/wer-gate.sh extended with new fixture invocations
- WER delta tests will skip gracefully when OCR environment unavailable

Closes: pdftract-48ea
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

| File |
|---|
| ground_truth.txt |
| README.md |
| source.pdf |
BrokenVector Misaligned Fixture
This fixture tests the assisted-OCR path with a misaligned invisible text layer.
Fixture Properties
- Page class: BrokenVector
- Text layer: Invisible (Tr=3) text offset by (10pt, 5pt)
- Ground truth: Accurate text content from the scan
- Expected behavior: Assisted OCR WER should stay within 0.5% of blind OCR WER (no significant regression)
Generating source.pdf
This fixture is generated using the generate_brokenvector_fixtures.py script in the parent directory:
cd tests/fixtures/ocr
python generate_brokenvector_fixtures.py
The script:
- Creates a clean text scan of Lorem Ipsum at 300 DPI
- Embeds an invisible text layer (Tr=3) offset by (10pt, 5pt)
- Outputs a PDF/A-1b compliant file
The offset intentionally exceeds the 5pt validation threshold (the horizontal component is 10pt), which triggers the confidence cap.
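As a rough illustration of the second step, the sketch below builds the PDF content-stream operators for an invisible text run shifted by the fixture offset. It is not taken from generate_brokenvector_fixtures.py, which may use a PDF library instead. The function names, the `F1` font resource name, and the 12pt font size are assumptions; only the `3 Tr` render mode and the (10pt, 5pt) offset come from this README.

```python
def pdf_escape(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, dx=10.0, dy=5.0, font="F1", size=12.0):
    """Return content-stream operators drawing `text` invisibly at (x+dx, y+dy).

    `font` is a hypothetical resource name; a real page must declare it
    in its /Resources dictionary.
    """
    return (
        "BT\n"                                # begin text object
        "3 Tr\n"                              # render mode 3: neither fill nor stroke
        f"/{font} {size:g} Tf\n"              # select font and size
        f"{x + dx:g} {y + dy:g} Td\n"         # move to the offset position
        f"({pdf_escape(text)}) Tj\n"          # show the text
        "ET"                                  # end text object
    )
```

Calling `invisible_text_ops("Lorem ipsum", 72, 700)` places the hidden text at (82, 705), that is, 10pt right of and 5pt above the visible glyphs in PDF user space. Passing `dx=0, dy=0` would give the aligned variant.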
Expected WER Delta
- Blind OCR WER: ~2-3% (baseline without position hints)
- Assisted OCR WER: ~2-4% (misaligned position hints are confidence-capped, so no significant regression is expected)
- Delta: Assisted WER should be within 0.5% of blind WER (no significant regression)
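The delta check above can be sketched as follows. This is a minimal illustration, not the logic in ci/wer-gate.sh: it computes WER as word-level edit distance divided by the reference word count, and it assumes the 0.5% tolerance means 0.005 in absolute WER. The function names `word_error_rate` and `passes_delta_gate` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def passes_delta_gate(blind_wer: float, assisted_wer: float,
                      tolerance: float = 0.005) -> bool:
    """True when assisted OCR is no more than `tolerance` worse than blind OCR.

    Assumption: tolerance is an absolute WER difference (0.005 = 0.5%).
    """
    return assisted_wer - blind_wer <= tolerance
```

The gate is one-sided in this sketch: an assisted result that beats blind OCR always passes, and only a regression beyond the tolerance fails.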
Test Coverage
This fixture validates:
- Position validation filter rejects misaligned words (confidence capped at 0.4)
- Assisted OCR falls back gracefully without significant regression
- WER delta gate allows small tolerance for misaligned text layers
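The first item can be pictured with the sketch below, which caps the confidence of a word whose text-layer position disagrees with the OCR position. The 5pt threshold and 0.4 cap come from this README. The `Word` type, the `cap_misaligned` name, and the per-axis comparison are assumptions, and the real filter may measure distance differently.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Word:
    """Hypothetical record pairing a text-layer hint with an OCR detection."""
    text: str
    confidence: float
    hint_xy: Tuple[float, float]  # position from the invisible text layer, in points
    ocr_xy: Tuple[float, float]   # position reported by OCR, in points


def cap_misaligned(word: Word, threshold_pt: float = 5.0, cap: float = 0.4) -> Word:
    """Cap confidence when either axis is off by more than `threshold_pt`."""
    dx = abs(word.hint_xy[0] - word.ocr_xy[0])
    dy = abs(word.hint_xy[1] - word.ocr_xy[1])
    if max(dx, dy) > threshold_pt:
        # Never raise confidence: keep the lower of the current value and the cap.
        return replace(word, confidence=min(word.confidence, cap))
    return word
```

With this fixture's (10pt, 5pt) offset, the horizontal difference alone exceeds the threshold, so a word that arrives with confidence 0.9 leaves with 0.4.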