Scanned PDF Fixtures for OCR Testing

This directory contains scanned PDF fixtures with ground-truth transcripts for Word Error Rate (WER) testing.

Purpose

These fixtures support:

AS-02 test scenario: Extract a scanned receipt via OCR
Tier 1 OCR gate: WER < 3% on clean 300-DPI scans
Performance testing: 10-page scanned PDF extraction in < 30 seconds
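The 30-second performance budget can be checked with a small timing wrapper. This is a minimal sketch: `run_within_budget` is a hypothetical helper name, and the commented command reuses the `pdftract extract ... --ocr --text` invocation shown in the Verification section with the multi-page fixture path from the directory tree.

```python
import subprocess
import time


def run_within_budget(cmd, budget_seconds=30.0):
    """Run cmd to completion and return (elapsed_seconds, within_budget).

    Raises subprocess.CalledProcessError if the command exits non-zero,
    so a failed extraction is never counted as a fast one.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed < budget_seconds


# Example (not executed here; assumes pdftract is on PATH):
# elapsed, ok = run_within_budget([
#     "pdftract", "extract",
#     "tests/fixtures/scanned/multi-page/doc-10page-300dpi.pdf",
#     "--ocr", "--text",
# ])
# print(f"{elapsed:.1f}s, within budget: {ok}")
```

Wall-clock time depends on the machine, so a single run is only a rough signal; repeating the measurement a few times gives a steadier number.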

Directory Structure

scanned/
├── README.md                     # This file
├── receipt/                      # Single-page receipt fixtures
│   ├── receipt-300dpi.pdf       # Clean receipt at 300 DPI
│   └── receipt-300dpi.txt       # Ground truth transcript
├── documents/                    # Various document type fixtures
│   ├── invoice-300dpi.pdf
│   ├── invoice-300dpi.txt
│   ├── form-300dpi.pdf
│   └── form-300dpi.txt
└── multi-page/                   # Multi-page fixtures for performance testing
    ├── doc-10page-300dpi.pdf
    └── doc-10page-300dpi.txt

Generation Instructions

Use the provided generation script to create scanned PDFs:

# Install dependencies
# Python 3 with reportlab, PIL/Pillow, img2pdf
pip3 install reportlab Pillow img2pdf

# Generate all fixtures
cd tests/fixtures/scanned
python3 generate_scanned_fixtures.py

For manual generation:

Create a PDF from the .txt ground-truth file using a Tesseract-friendly font (for example Arial, Helvetica, or Times New Roman)
Set the font size to 12pt for good OCR readability
Scan or rasterize the pages at 300 DPI
Apply minimal preprocessing (no aggressive compression)
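When rasterizing by hand, it helps to know what the settings above mean in pixels. The sketch below converts points to pixels using the standard 72 points per inch; the US Letter page size (8.5 x 11 in) is an assumption for illustration, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF/PostScript points per inch


def points_to_pixels(points, dpi=300):
    """Convert a length in points to pixels at the given resolution."""
    return round(points * dpi / POINTS_PER_INCH)


def page_size_pixels(width_in, height_in, dpi=300):
    """Return the (width, height) of a page in pixels at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


print(points_to_pixels(12))       # 12pt text at 300 DPI -> 50
print(page_size_pixels(8.5, 11))  # US Letter at 300 DPI -> (2550, 3300)
```

So 12pt text is 50 pixels tall at 300 DPI, which gives the OCR engine plenty of detail to work with.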

WER Targets

Clean 300-DPI scans: WER < 3%
Receipts: WER < 3% (critical for totals, line items)
Multi-page documents: average WER < 3%, with no single page above 5%
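The multi-page target combines two conditions, which can be checked together. This is a minimal sketch, assuming the average is the plain mean of per-page WER values (not a WER pooled over all words); `check_wer_gate` is a hypothetical name, not part of pdftract.

```python
def check_wer_gate(page_wers, avg_limit=0.03, page_limit=0.05):
    """Check per-page WER values (as fractions, e.g. 0.02 for 2%).

    Passes when the mean is strictly below avg_limit and no single
    page exceeds page_limit.
    """
    if not page_wers:
        raise ValueError("at least one page WER is required")
    average = sum(page_wers) / len(page_wers)
    worst = max(page_wers)
    return {
        "average": average,
        "worst_page": worst,
        "passed": average < avg_limit and worst <= page_limit,
    }


result = check_wer_gate([0.01, 0.02, 0.04])
print(result["passed"])  # True: mean is about 2.3%, worst page is 4%
```

A document such as `[0.0, 0.0, 0.06]` fails even though its mean is 2%, because one page is above the 5% per-page limit.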

Verification

To verify WER on a fixture:

# Extract text with pdftract
pdftract extract tests/fixtures/scanned/receipt/receipt-300dpi.pdf --ocr --text > output.txt

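# Compute WER against the ground truth (requires jiwer: pip3 install jiwer)
python3 -c "
from jiwer import wer
# Read both files as UTF-8 so the result does not depend on the system locale
with open('tests/fixtures/scanned/receipt/receipt-300dpi.txt', encoding='utf-8') as f:
    ground_truth = f.read()
with open('output.txt', encoding='utf-8') as f:
    hypothesis = f.read()
print(f'WER: {wer(ground_truth, hypothesis):.2%}')
"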
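Where jiwer is not available, WER can be computed with the standard library. The sketch below uses the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It splits on whitespace and applies no case or punctuation normalization, so results may differ from a tool that normalizes; `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words
print(f"WER: {word_error_rate('total 12.50 usd', 'total 12.60 usd'):.2%}")  # WER: 33.33%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.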

Fixtures Status

Fixture	PDF	Ground Truth	WER Target	Status
receipt-300dpi	❌	✅	< 3%	PDF needed
invoice-300dpi	❌	✅	< 3%	PDF needed
form-300dpi	❌	✅	< 3%	PDF needed
doc-10page-300dpi	❌	✅	< 3% avg	PDF needed
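The PDF and Ground Truth columns above can be derived from the files on disk. This is a minimal sketch, assuming each fixture is a `.pdf` and `.txt` pair with the same name in the same directory; `fixture_status` and the "ready" label are hypothetical.

```python
from pathlib import Path


def fixture_status(root):
    """Return sorted (name, has_pdf, has_txt, status) rows for fixtures under root.

    Any .txt file is treated as a ground-truth transcript, so unrelated
    text files under root would show up as fixtures too.
    """
    root = Path(root)
    keys = {
        (p.parent, p.stem)
        for p in root.rglob("*")
        if p.is_file() and p.suffix in {".pdf", ".txt"}
    }
    rows = []
    for parent, stem in sorted(keys):
        has_pdf = (parent / f"{stem}.pdf").is_file()
        has_txt = (parent / f"{stem}.txt").is_file()
        if has_pdf and has_txt:
            status = "ready"
        elif has_txt:
            status = "PDF needed"
        else:
            status = "ground truth needed"
        rows.append((stem, has_pdf, has_txt, status))
    return rows


if __name__ == "__main__":
    for name, has_pdf, has_txt, status in fixture_status("."):
        print(f"{name}\t{has_pdf}\t{has_txt}\t{status}")
```

Run from tests/fixtures/scanned, this prints one tab-separated row per fixture, which makes it easy to spot a transcript without a PDF or the reverse.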

Adding New Fixtures

Create the ground-truth .txt file with the exact text of the document
Generate the corresponding .pdf, either with the generation script or manually
Add the fixture to the Fixtures Status table in this README
Update the generation script if applicable

Notes

All fixtures are in English and are intended for OCR with the Tesseract eng traineddata
Fonts should be standard: Arial, Helvetica, Times New Roman, or Courier
Avoid decorative fonts, handwriting, or unusual layouts for baseline fixtures
For challenging fixtures, consider creating a separate challenging/ subdirectory