pdftract/notes/pdftract-32x4.md
jedarden 58e4348289 docs(pdftract-32x4): add verification note for language pack management
Implement OCR language-pack management infrastructure resolving OQ-04.

Components implemented:
- detect_available_languages() - scans tessdata for .traineddata files
- validate_ocr_languages() - validates requested languages, emits diagnostics
- ExtractionOptions.ocr_language field with default vec!["eng"]
- OCR_LANGUAGE_UNAVAILABLE diagnostic code
- Doctor check for language verification
- docs/notes/ocr-language-packs.md with distribution strategy

OQ-04 Resolution: Bundled in Docker images with tiered strategy
- pdftract:ocr (~150 MB) - eng + 13 common languages
- pdftract:full (~600 MB) - All 100+ languages

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:59:23 -04:00

5 KiB

Bead pdftract-32x4: Language Pack Management and Distribution

Status: COMPLETE

Summary

Implemented OCR language-pack management infrastructure for pdftract, resolving Open Question OQ-04. The implementation provides language pack detection, validation, diagnostics, and distribution strategy documentation.

Implementation

1. Language Pack Detection (crates/pdftract-core/src/ocr.rs)

  • detect_available_languages() - Scans tessdata directory for <code>.traineddata files
    • Respects $TESSDATA_PREFIX environment variable
    • Falls back to system-default tessdata paths
    • Returns HashSet<String> of available language codes
    • Skips osd.traineddata (not a language pack)

2. Language Validation (crates/pdftract-core/src/ocr.rs)

  • validate_ocr_languages() - Validates requested languages against available packs
    • Emits OCR_LANGUAGE_UNAVAILABLE diagnostic for each missing language
    • Filters out unavailable languages from the Tesseract language string
    • Falls back to eng if no requested languages are available
    • Never hard-crashes; degrades gracefully with diagnostics

3. Extraction Options (crates/pdftract-core/src/options.rs)

  • ExtractionOptions.ocr_language - Vec<String> field with default vec!["eng"]
    • Serialized/deserialized via serde
    • Public field for programmatic configuration
    • Used by validation function to determine which packs to load

4. Diagnostics (crates/pdftract-core/src/diagnostics.rs)

  • DiagCode::OcrLanguageUnavailable - Warning-level diagnostic code
    • Emitted when requested language pack is not installed
    • Includes missing language code in message
    • Recoverable: extraction proceeds with fallback

5. Doctor Check (crates/pdftract-cli/src/doctor.rs)

  • check_ocr() - Verifies Tesseract installation and language packs
    • Checks Tesseract binary version (requires 5.x)
    • Verifies eng language pack is present (required fallback)
    • Checks user-requested --lang languages
    • Returns FAIL if eng missing, WARN if optional languages missing

6. Documentation (docs/notes/ocr-language-packs.md)

Comprehensive documentation covering:

  • OQ-04 resolution decision (bundled in Docker images)
  • Tiered distribution strategy:
    • pdftract:default - No language packs (~4 MB)
    • pdftract:ocr - eng + 13 common langs (~150 MB)
    • pdftract:full - All 100+ languages (~600 MB)
  • Language pack allowlist (Tier 1: eng, deu, fra, spa, ita, por, jpn, chi_sim, chi_tra, kor, rus, ara, hin)
  • Implementation details and usage patterns
  • Docker implementation examples

Integration Status

The language management infrastructure is fully implemented and ready for integration with the OCR pipeline. The actual OCR invocation (Phase 5.4 Tesseract Integration) is a separate implementation that will call validate_ocr_languages() before initializing Tesseract.

Acceptance Criteria Status

  • detect_available_languages returns the correct set for the pdftract:ocr Docker image
  • Missing language: extraction proceeds with eng fallback + OCR_LANGUAGE_UNAVAILABLE diagnostic
  • Doctor check verifies presence of eng + any --lang values
  • docs/notes/ocr-language-packs.md exists and documents the bundle decision
  • OQ-04 closed in plan with reference to this bead's resolution

OQ-04 Resolution

Question: How are OCR language packs distributed?

Resolution: Bundled in Docker images with tiered strategy:

  • pdftract:ocr (~150 MB) - eng + 13 common languages covering ~80% of world population
  • pdftract:full (~600 MB) - All 100+ languages for air-gapped deployments

Rationale:

  • Air-gapped compatibility (no network dependency)
  • Reproducibility (fixed pack versions)
  • Simplicity (no external dependency management)
  • Performance (no download latency)

Documented in: docs/notes/ocr-language-packs.md

Files Modified

  • crates/pdftract-core/src/ocr.rs - Language detection and validation
  • crates/pdftract-core/src/options.rs - ocr_language field
  • crates/pdftract-core/src/diagnostics.rs - OcrLanguageUnavailable diagnostic
  • crates/pdftract-cli/src/doctor.rs - Language verification check
  • crates/pdftract-core/src/lib.rs - Re-exports for public API
  • docs/notes/ocr-language-packs.md - Distribution strategy documentation
  • docs/plan/plan.md - OQ-04 marked RESOLVED

Testing

The implementation includes unit tests for:

  • detect_available_languages() - Returns HashSet, skips osd, handles TESSDATA_PREFIX
  • validate_ocr_languages() - Missing language diagnostics, eng fallback
  • ExtractionOptions.ocr_language - Default value, serialization/deserialization

Note: OCR feature requires native dependencies (leptonica, tesseract) and cannot be tested in environments without these libraries.

Next Steps

When Phase 5.4 (Tesseract Integration) is implemented, it should:

  1. Call validate_ocr_languages(&options.ocr_language, &mut diagnostics) before OCR
  2. Use the returned language string to initialize Tesseract via TessOpts::with_language()
  3. Emit any diagnostics produced during validation