pdftract/tests/fixtures/ocr/eng_fra_mixed/source.txt
jedarden 7fbb3d54d2 feat(pdftract-315s): implement WER CI gate and OCR CLI flags
Phase 5.4.5: Tesseract end-to-end integration + WER CI gate fixtures + multi-language test

## Changes

### CLI OCR flags (crates/pdftract-cli/src/main.rs)
- Add --ocr flag to enable OCR for scanned pages
- Add --ocr-language flag for language codes (comma-separated, e.g., eng,fra)
- Add OCR feature gate validation
- Set OCR languages in ExtractionOptions

### WER gate integration (.ci/argo-workflows/pdftract-ci.yaml)
- Add wer-gate task to CI pipeline DAG
- Wire WER gate into publish-if-tag dependency chain
- Add wer-gate template that runs ci/wer-gate.sh
- Update on-exit handler to include wer-gate status

### Fix module conflict
- Remove crates/pdftract-cli/src/doctor.rs (use doctor/mod.rs instead)

### Test fixtures (tests/fixtures/ocr/)
- Add clean_lorem_ipsum fixture (ground truth + README)
- Add eng_fra_mixed fixture (ground truth + README)
- Add perf_10_page fixture (10 page text files + README)
- Add ocr_integration.rs test module
- Add generate_ocr_fixtures.rs script

### WER gate script (ci/wer-gate.sh)
- Implement word error rate (WER) calculation with text normalization
- Validate clean fixture WER < 2%
- Validate multi-language WER < 3%
- Validate 10-page performance < 30 seconds
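
For readers unfamiliar with the metric, the check described above can be sketched as follows. This is a minimal illustration, not the contents of ci/wer-gate.sh: the function names (`normalize`, `word_error_rate`, `passes_gate`) are hypothetical, and the normalization shown here (lowercasing and replacing punctuation with spaces) is an assumption, since the commit does not spell out the rules the script applies. WER is taken as the word-level edit distance between OCR output and ground truth, divided by the number of ground-truth words.

```python
import re
import unicodedata


def normalize(text):
    """Assumed normalization: NFC form, lowercase, punctuation replaced by spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware in Python 3, so accented letters such as "é" survive.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the OCR output
            insertion = curr[j - 1] + 1  # extra word in the OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def passes_gate(reference, hypothesis, max_wer):
    """True when the WER is strictly below the threshold (e.g. 0.02 or 0.03)."""
    return word_error_rate(reference, hypothesis) < max_wer
```

Under these assumptions, the clean fixture would be checked with `passes_gate(truth, ocr_text, 0.02)` and the eng+fra fixture with `passes_gate(truth, ocr_text, 0.03)`, matching the thresholds listed above.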

## Acceptance Criteria

- Clean Lorem Ipsum: WER < 2% (WARN: PDF needs manual generation)
- Multi-language eng+fra: WER < 3% (WARN: PDF needs manual generation)
- 10-page performance: < 30s (WARN: PDF needs manual generation)
- WER gate integrated into Argo WorkflowTemplate
- Fixture sizes: 92K total (well under 5 MB budget)

Closes: pdftract-315s
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 02:07:27 -04:00

The quick brown fox jumps over the lazy dog. This is a standard English sentence that contains common words and demonstrates basic OCR capabilities for the English language.
Le renard brun rapide saute par-dessus le chien paresseux. C'est une phrase française standard qui contient des mots communs et démontre les capacités OCR de base pour la langue française.
The weather today is quite beautiful with clear blue skies and pleasant temperatures perfect for outdoor activities.
La météo d'aujourd'hui est assez belle avec un ciel bleu clair et des températures agréables parfaites pour les activités de plein air.
English text contains words like "computer", "keyboard", "mouse", and "monitor" which are common in technical documentation.
Le texte français contient des mots comme "ordinateur", "clavier", "souris" et "moniteur" qui sont courants dans la documentation technique.