pdftract/tests/fixtures/ocr/eng_fra_mixed
jedarden 7fbb3d54d2 feat(pdftract-315s): implement WER CI gate and OCR CLI flags
Phase 5.4.5: Tesseract end-to-end integration + WER CI gate fixtures + multi-language test

## Changes

### CLI OCR flags (crates/pdftract-cli/src/main.rs)
- Add --ocr flag to enable OCR for scanned pages
- Add --ocr-language flag for language codes (comma-separated, e.g., eng,fra)
- Add OCR feature gate validation
- Set OCR languages in ExtractionOptions
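
The comma-separated form accepted by --ocr-language can be illustrated with a small sketch. This is not the code in crates/pdftract-cli/src/main.rs; `parse_ocr_languages` and `to_tesseract_lang` are hypothetical helper names, and the only outside fact relied on is that Tesseract itself joins multiple languages with `+` (e.g., `eng+fra`).

```python
def parse_ocr_languages(value: str) -> list[str]:
    """Split a comma-separated flag value such as "eng,fra" into language codes.

    Hypothetical helper: trims whitespace, skips empty entries, and drops
    duplicates while preserving the order given on the command line.
    """
    langs: list[str] = []
    for code in value.split(","):
        code = code.strip()
        if code and code not in langs:
            langs.append(code)
    if not langs:
        raise ValueError("at least one OCR language code is required")
    return langs


def to_tesseract_lang(langs: list[str]) -> str:
    """Join codes with '+', the multi-language form Tesseract accepts for -l."""
    return "+".join(langs)
```

For example, `to_tesseract_lang(parse_ocr_languages("eng, fra"))` yields `eng+fra`.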

### WER gate integration (.ci/argo-workflows/pdftract-ci.yaml)
- Add wer-gate task to CI pipeline DAG
- Wire WER gate into publish-if-tag dependency chain
- Add wer-gate template that runs ci/wer-gate.sh
- Update on-exit handler to include wer-gate status

### Fix module conflict
- Remove crates/pdftract-cli/src/doctor.rs (use doctor/mod.rs instead)

### Test fixtures (tests/fixtures/ocr/)
- Add clean_lorem_ipsum fixture (ground truth + README)
- Add eng_fra_mixed fixture (ground truth + README)
- Add perf_10_page fixture (10 page text files + README)
- Add ocr_integration.rs test module
- Add generate_ocr_fixtures.rs script

### WER gate script (ci/wer-gate.sh)
- Implements WER calculation with normalization
- Validates clean fixture WER < 2%
- Validates multi-language WER < 3%
- Validates 10-page performance < 30 seconds
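
The WER calculation with normalization listed above can be sketched as follows. ci/wer-gate.sh is not shown here, so this is only one plausible reading: the normalization steps (Unicode NFKC, lowercasing, punctuation stripping) and the function names are assumptions, and WER is taken as word-level edit distance divided by the number of reference words.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Assumed normalization: NFKC, lowercase, punctuation removed, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Under this sketch a gate would compare the result against the thresholds above, for example `word_error_rate(truth, ocr_text) < 0.02` for the clean fixture and `< 0.03` for the multi-language one.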

## Acceptance Criteria

- Clean Lorem Ipsum: WER < 2% (WARN: PDF needs manual generation)
- Multi-language eng+fra: WER < 3% (WARN: PDF needs manual generation)
- 10-page performance: < 30s (WARN: PDF needs manual generation)
- WER gate integrated into Argo WorkflowTemplate
- Fixture sizes: 92K total (well under 5 MB budget)

Closes: pdftract-315s
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 02:07:27 -04:00
ground_truth.txt
README.md
source.txt

Multi-Language English+French Fixture

This fixture exercises OCR with two language packs loaded together (eng+fra). The target word error rate (WER) is below 3%.

Ground Truth

The ground_truth.txt file contains alternating English and French paragraphs.

Generating source.pdf

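To generate source.pdf (target resolution: 300 DPI; reportlab writes vector text, so the resolution only applies once the pages are rasterized):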

  1. Ensure both English (eng) and French (fra) language packs are installed:

    apt-get install tesseract-ocr-eng tesseract-ocr-fra
    
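  2. Render ground_truth.txt to source.pdf using Python with reportlab:

    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    # Read as UTF-8 so accented French characters survive on every platform.
    with open("ground_truth.txt", encoding="utf-8") as f:
        lines = f.read().splitlines()

    c = canvas.Canvas("source.pdf", pagesize=letter)
    c.setFont("Helvetica", 12)
    y_position = 750  # start near the top of the 792 pt tall letter page

    for line in lines:
        if y_position < 50:
            # Bottom margin reached: start a new page.
            c.showPage()
            # showPage() resets graphics state, so set the font again.
            c.setFont("Helvetica", 12)
            y_position = 750
        c.drawString(50, y_position, line)
        y_position -= 18  # line spacing in points

    c.save()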
    

Expected WER

With both eng+fra language packs loaded, Tesseract should achieve WER < 3%. Missing language packs will result in significantly higher WER.
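
Since a missing language pack inflates WER, it may help to confirm that both packs are visible to Tesseract before running the fixture. The sketch below relies on the `tesseract --list-langs` command; the helper names are hypothetical, and the header-line filter assumes the listing starts with a line beginning "List of", which may differ between Tesseract versions.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumption: the listing is one code per line after a header line
    that starts with "List of".
    """
    lines = (line.strip() for line in output.splitlines())
    return {line for line in lines
            if line and not line.lower().startswith("list of")}


def missing_languages(required: list[str], output: str) -> list[str]:
    """Return the required codes that do not appear in the listing."""
    return sorted(set(required) - parse_list_langs(output))


def installed_tesseract_languages() -> set[str]:
    """Run tesseract and parse its listing (requires tesseract on PATH)."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Some older releases print the listing to stderr, so read both streams.
    return parse_list_langs(result.stdout + result.stderr)
```

An empty result from `missing_languages(["eng", "fra"], listing)` indicates that both packs are available.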