pdftract/tests/fixtures/ocr/brokenvector_misaligned
jedarden 05be70d36f feat(pdftract-48ea): implement BrokenVector fixtures + WER delta CI gate
Add two PDF/A fixtures for testing assisted-OCR (BrokenVector path):
- Aligned fixture with correctly-positioned invisible text layer
- Misaligned fixture with text layer offset by (10pt, 5pt)

Extend ci/wer-gate.sh with WER validation for BrokenVector fixtures.

Acceptance criteria:
- Two BrokenVector fixtures committed (both 1.5 KB, well under 200 KB limit)
- ci/wer-gate.sh extended with new fixture invocations
- WER delta tests will skip gracefully when OCR environment unavailable

Closes: pdftract-48ea

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 10:52:41 -04:00
ground_truth.txt
README.md
source.pdf

BrokenVector Misaligned Fixture

This fixture tests the assisted-OCR path with a misaligned invisible text layer.

Fixture Properties

  • Page class: BrokenVector
  • Text layer: Invisible (Tr=3) text offset by (10pt, 5pt)
  • Ground truth: Accurate text content from the scan
  • Expected behavior: Assisted OCR WER should stay within 0.5% of blind OCR WER (no significant regression)

Generating source.pdf

This fixture is generated using the generate_brokenvector_fixtures.py script in the parent directory:

cd tests/fixtures/ocr
python generate_brokenvector_fixtures.py

The script:

  1. Creates a clean text scan of Lorem Ipsum at 300 DPI
  2. Embeds an invisible text layer (Tr=3) offset by (10pt, 5pt)
  3. Outputs a PDF/A-1b compliant file
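As a rough illustration of step 2, the sketch below builds the PDF content-stream operators for one invisible, offset text run. It is not taken from generate_brokenvector_fixtures.py; the function name invisible_text_ops, the font resource /F1, the 12pt size, and the direction of the offset are all assumptions made for the example. Only the `3 Tr` operator (text rendering mode 3, invisible) and the (10pt, 5pt) offset come from this README.

```python
def invisible_text_ops(text, x, y, dx=10.0, dy=5.0, font="/F1", size=12):
    """Return content-stream operators that draw `text` invisibly at (x+dx, y+dy).

    Hypothetical helper: coordinates are in PDF points, and (dx, dy) is the
    misalignment applied to the text layer relative to the scanned glyphs.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return (
        "BT\n"                                      # begin text object
        f"{font} {size} Tf\n"                       # select font and size
        "3 Tr\n"                                    # rendering mode 3: invisible
        f"1 0 0 1 {x + dx:g} {y + dy:g} Tm\n"       # position, shifted by the offset
        f"({escaped}) Tj\n"                         # show the text
        "ET\n"                                      # end text object
    )


if __name__ == "__main__":
    print(invisible_text_ops("Lorem ipsum", 72, 720))
```

Passing dx=0 and dy=0 would give the aligned variant of the same run.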

The offset intentionally exceeds the 5pt validation threshold (the 10pt horizontal component alone is twice the threshold), which triggers the confidence cap.
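A minimal sketch of how such a position check might behave, assuming the threshold is applied to the Euclidean distance between the text-layer position and the OCR-detected position. The distance metric and the function name validated_confidence are assumptions, not the pdftract implementation. The 5pt threshold is the one stated above, and the 0.4 cap is the value listed under Test Coverage.

```python
import math

THRESHOLD_PT = 5.0     # validation threshold stated in this README
CONFIDENCE_CAP = 0.4   # confidence cap listed under Test Coverage


def validated_confidence(hint_xy, ocr_xy, confidence):
    """Cap `confidence` when the text-layer hint is too far from the OCR position.

    Hypothetical helper: both positions are (x, y) tuples in PDF points.
    Assumes a Euclidean distance check; the real filter may compare per axis.
    """
    dx = hint_xy[0] - ocr_xy[0]
    dy = hint_xy[1] - ocr_xy[1]
    if math.hypot(dx, dy) > THRESHOLD_PT:
        # Misaligned hint: never report more than the cap.
        return min(confidence, CONFIDENCE_CAP)
    return confidence


if __name__ == "__main__":
    # A (10pt, 5pt) offset is about 11.2pt away, so the cap applies.
    print(validated_confidence((110.0, 205.0), (100.0, 200.0), 0.95))
```

Under this assumption the fixture's offset is about 11.2pt from the true position, so every word in the text layer would be capped.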

Expected WER Delta

  • Blind OCR WER: ~2-3% (baseline without position hints)
  • Assisted OCR WER: ~2-4% (position hints are confidence-capped, but no significant regression)
  • Delta: Assisted WER should be within 0.5 percentage points of blind WER (no significant regression)
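The delta check can be sketched as follows. This is an illustrative word-level WER (edit distance over whitespace-separated words, divided by the reference length) plus a tolerance comparison, not the logic in ci/wer-gate.sh. The names wer and within_tolerance are hypothetical, and reading the 0.5% tolerance as an absolute difference of 0.005 in WER is an assumption.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def within_tolerance(blind_wer, assisted_wer, tolerance=0.005):
    """True if assisted OCR is no more than `tolerance` worse than blind OCR.

    Assumption: tolerance is an absolute WER difference (0.005 = 0.5 points).
    """
    return assisted_wer - blind_wer <= tolerance


if __name__ == "__main__":
    truth = "lorem ipsum dolor sit"
    print(wer(truth, "lorem ipsum dolor sit"), wer(truth, "lorem ipsun dolor sit"))
```

With this shape, an assisted result that beats the blind baseline always passes, since only regressions count against the tolerance.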

Test Coverage

This fixture validates:

  • Position validation filter rejects misaligned words (confidence capped at 0.4)
  • Assisted OCR falls back gracefully without significant regression
  • WER delta gate allows small tolerance for misaligned text layers