pdftract/docs/notes/ocr-language-packs.md

OCR Language Pack Distribution Strategy

Status: RESOLVED (OQ-04)
Date: 2026-05-23
Bead: pdftract-32x4

Open Question OQ-04

How are OCR language packs distributed? Bundled in the Docker image (size cost), downloaded on first use (network dependency), or required as an out-of-band install?

Resolution Decision

Language packs are bundled in Docker images with a tiered distribution strategy:

| Docker Image Tag | Language Packs | Size | Use Case |
|---|---|---|---|
| pdftract:default | None (OCR disabled) | ~4 MB | Vector-only extraction, no OCR capability |
| pdftract:ocr | eng + 12 common languages (13 total) | ~150 MB | Standard OCR use case; covers ~80% of world population by native speaker count |
| pdftract:full | All 100+ languages | ~600 MB | Air-gapped deployments, comprehensive coverage |

Rationale

Why bundling?

  1. Air-gapped compatibility: Bundling ensures OCR works in offline/air-gapped environments, with no network access needed to download packs on first use
  2. Reproducibility: Fixed language pack versions guarantee consistent extraction results across deployments
  3. Simplicity: No external dependency management for operators; docker run just works
  4. Performance: No download latency on first OCR request

Size trade-offs

The :ocr variant adds ~150 MB to the image but covers the vast majority of use cases:

  • English (eng) - ~12 MB
  • German (deu) - ~10 MB
  • French (fra) - ~10 MB
  • Spanish (spa) - ~10 MB
  • Italian (ita) - ~9 MB
  • Portuguese (por) - ~10 MB
  • Japanese (jpn) - ~18 MB
  • Simplified Chinese (chi_sim) - ~25 MB
  • Traditional Chinese (chi_tra) - ~22 MB
  • Korean (kor) - ~12 MB
  • Russian (rus) - ~14 MB
  • Arabic (ara) - ~8 MB
  • Hindi (hin) - ~8 MB

Total: ~168 MB (uncompressed) → ~150 MB (after Docker layer compression)

The :full variant bundles all 100+ languages (~600 MB) for specialized deployments requiring comprehensive coverage.

Why not download-on-first-use?

Download-on-first-use was rejected because:

  • Requires network connectivity at OCR time (breaks air-gapped deployments)
  • Adds complexity (pack download, validation, caching)
  • Introduces latency on first OCR request
  • Requires a trusted pack distribution endpoint
  • Version drift between pack downloads across deployments

Why not out-of-band install?

Out-of-band install (e.g., apt-get install tesseract-ocr-all) was rejected because:

  • Platform-specific (Debian vs Alpine vs macOS vs Windows)
  • Version drift across package managers
  • Additional operator setup step
  • Inconsistent pack locations across distros

Language Pack Allowlist

pdftract:ocr bundle (Tier 1 - High Coverage)

| Code | Language | File | Size |
|---|---|---|---|
| eng | English | eng.traineddata | 12 MB |
| deu | German | deu.traineddata | 10 MB |
| fra | French | fra.traineddata | 10 MB |
| spa | Spanish | spa.traineddata | 10 MB |
| ita | Italian | ita.traineddata | 9 MB |
| por | Portuguese | por.traineddata | 10 MB |
| jpn | Japanese | jpn.traineddata | 18 MB |
| chi_sim | Simplified Chinese | chi_sim.traineddata | 25 MB |
| chi_tra | Traditional Chinese | chi_tra.traineddata | 22 MB |
| kor | Korean | kor.traineddata | 12 MB |
| rus | Russian | rus.traineddata | 14 MB |
| ara | Arabic | ara.traineddata | 8 MB |
| hin | Hindi | hin.traineddata | 8 MB |

Total: 13 languages, ~168 MB (uncompressed)
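
As a quick check of the total, the per-pack sizes from the table above can be summed directly. This is only an arithmetic sketch: the constant and function names are hypothetical and do not come from the pdftract codebase.

```rust
/// Tier 1 pack codes with their approximate sizes in MB,
/// copied from the allowlist table (names here are illustrative only).
const TIER1_PACKS: [(&str, u32); 13] = [
    ("eng", 12), ("deu", 10), ("fra", 10), ("spa", 10), ("ita", 9),
    ("por", 10), ("jpn", 18), ("chi_sim", 25), ("chi_tra", 22),
    ("kor", 12), ("rus", 14), ("ara", 8), ("hin", 8),
];

/// Sum of the approximate uncompressed pack sizes, in MB.
fn tier1_total_mb() -> u32 {
    TIER1_PACKS.iter().map(|(_, mb)| mb).sum()
}
```

Calling tier1_total_mb() returns 168, matching the stated uncompressed total for the 13 packs.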

This set covers:

  • All official UN languages (Arabic, Chinese, English, French, Russian, Spanish)
  • Major European languages (German, Italian, Portuguese)
  • Major East and South Asian languages (Japanese, Korean, Hindi)
  • ~80% of world population by native speaker count

pdftract:full bundle (Tier 2 - Complete)

Includes all 100+ language packs from the official Tesseract tessdata repository:

  • All Tier 1 languages
  • Indic languages (ben, guj, kan, mal, tam, tel, etc.)
  • Southeast Asian languages (tha, vie, etc.)
  • Central/Eastern European languages (pol, ces, slk, hun, ron, bul, etc.)
  • Nordic languages (dan, nor, swe, fin)
  • Turkic languages (tur, aze, uzb, etc.)
  • Hebrew (heb)
  • And 60+ others

Total: 100+ languages, ~600 MB (uncompressed)

Implementation

Pack Detection

The detect_available_languages() function in crates/pdftract-core/src/ocr.rs scans the tessdata directory for files named <code>.traineddata, where <code> is the language code (for example eng.traineddata), and returns a HashSet<String> of available language codes.

The function respects the $TESSDATA_PREFIX environment variable and falls back to system-default tessdata paths:

  • Unix: /usr/share/tessdata, /usr/local/share/tessdata
  • Windows: C:\Program Files\Tesseract-OCR\tessdata
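
The scan described above could look roughly like the following. This is a minimal sketch, not the actual contents of ocr.rs: the helper names tessdata_candidates and scan_tessdata_dir are hypothetical, and it assumes $TESSDATA_PREFIX points at the tessdata directory itself and that the first existing candidate directory wins.

```rust
use std::collections::HashSet;
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Candidate tessdata directories: $TESSDATA_PREFIX first (assumed to be
/// the tessdata directory itself), then the platform defaults.
fn tessdata_candidates() -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(prefix) = env::var_os("TESSDATA_PREFIX") {
        dirs.push(PathBuf::from(prefix));
    }
    if cfg!(windows) {
        dirs.push(PathBuf::from(r"C:\Program Files\Tesseract-OCR\tessdata"));
    } else {
        dirs.push(PathBuf::from("/usr/share/tessdata"));
        dirs.push(PathBuf::from("/usr/local/share/tessdata"));
    }
    dirs
}

/// Collect language codes from `<code>.traineddata` files in one directory.
/// A missing or unreadable directory yields an empty set.
fn scan_tessdata_dir(dir: &Path) -> HashSet<String> {
    let mut langs = HashSet::new();
    if let Ok(entries) = fs::read_dir(dir) {
        for entry in entries.flatten() {
            let path = entry.path();
            if path.extension().and_then(|e| e.to_str()) == Some("traineddata") {
                if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                    langs.insert(stem.to_string());
                }
            }
        }
    }
    langs
}

/// Scan the first candidate directory that exists (an assumption; the
/// real function may merge several directories instead).
fn detect_available_languages() -> HashSet<String> {
    tessdata_candidates()
        .iter()
        .find(|dir| dir.is_dir())
        .map(|dir| scan_tessdata_dir(dir))
        .unwrap_or_default()
}
```

Splitting the directory scan from the path resolution keeps the scan testable against a temporary directory without touching the environment.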

Language Validation

When OCR is invoked with a requested language list (from ExtractionOptions.ocr_language), the validate_ocr_languages() function:

  1. Checks which requested languages are available
  2. Emits OCR_LANGUAGE_UNAVAILABLE diagnostics for missing languages
  3. Filters out unavailable languages from the Tesseract language string
  4. Falls back to eng if no requested languages are available

This ensures extraction never hard-crashes due to missing packs; it degrades gracefully and reports the missing languages as diagnostics.
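
The four validation steps can be sketched as below. The signature, the ValidatedLanguages struct, and the diagnostic message format are assumptions made for illustration; only the function name, the OCR_LANGUAGE_UNAVAILABLE code, and the eng fallback come from this note. The "+" separator is Tesseract's standard syntax for combining languages.

```rust
use std::collections::HashSet;

/// Outcome of validating a requested language list (hypothetical type).
#[derive(Debug, PartialEq)]
struct ValidatedLanguages {
    /// Tesseract language string, e.g. "eng+deu".
    tesseract_langs: String,
    /// One diagnostic message per unavailable language.
    diagnostics: Vec<String>,
}

fn validate_ocr_languages(
    requested: &[&str],
    available: &HashSet<String>,
) -> ValidatedLanguages {
    let mut usable = Vec::new();
    let mut diagnostics = Vec::new();
    for &lang in requested {
        if available.contains(lang) {
            // Steps 1 and 3: keep only languages that are installed.
            usable.push(lang);
        } else {
            // Step 2: record a diagnostic instead of failing.
            diagnostics.push(format!("OCR_LANGUAGE_UNAVAILABLE: {lang}"));
        }
    }
    // Step 4: fall back to eng when nothing requested is available.
    if usable.is_empty() {
        usable.push("eng");
    }
    ValidatedLanguages {
        tesseract_langs: usable.join("+"),
        diagnostics,
    }
}
```

With eng and deu installed, a request for deu, fra, eng yields the language string "deu+eng" plus one diagnostic for fra; a request for only jpn falls back to "eng".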

Doctor Check

The pdftract doctor tesseract-langs command verifies:

  1. Tesseract binary is installed (version 5.x)
  2. eng language pack is present (required fallback)
  3. User-requested --lang languages are present

The command exits with code 1 if eng is missing, and with code 0 plus a WARN message if only optional languages are missing.
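
The exit-code rule for the language checks might be modeled as follows. This is a sketch only: DoctorReport and check_tesseract_langs are hypothetical names, the warning text is invented, and the missing-binary case is left out because this note does not specify its exit code.

```rust
use std::collections::HashSet;

/// Result of the language portion of the doctor check (hypothetical type).
#[derive(Debug, PartialEq)]
struct DoctorReport {
    exit_code: i32,
    warnings: Vec<String>,
}

fn check_tesseract_langs(available: &HashSet<String>, requested: &[&str]) -> DoctorReport {
    // eng is the required fallback pack: its absence is a hard failure.
    if !available.contains("eng") {
        return DoctorReport { exit_code: 1, warnings: Vec::new() };
    }
    // Missing optional packs only produce warnings.
    let warnings = requested
        .iter()
        .filter(|lang| !available.contains(**lang))
        .map(|lang| format!("WARN: language pack '{lang}' is not installed"))
        .collect();
    DoctorReport { exit_code: 0, warnings }
}
```

Keeping the decision in a pure function that takes the detected set as input makes the exit-code policy easy to unit test without a Tesseract install.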

Docker Implementation

Dockerfile.ocr (Tier 1)

FROM pdftract:base

# Install Tesseract + Tier 1 language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-eng \
    tesseract-ocr-data-deu \
    tesseract-ocr-data-fra \
    tesseract-ocr-data-spa \
    tesseract-ocr-data-ita \
    tesseract-ocr-data-por \
    tesseract-ocr-data-jpn \
    tesseract-ocr-data-chi_sim \
    tesseract-ocr-data-chi_tra \
    tesseract-ocr-data-kor \
    tesseract-ocr-data-rus \
    tesseract-ocr-data-ara \
    tesseract-ocr-data-hin

# Verify packs are installed
RUN pdftract doctor tesseract-langs --lang eng,deu,fra,spa,ita,por,jpn,chi_sim,chi_tra,kor,rus,ara,hin

Dockerfile.full (Tier 2)

FROM pdftract:base

# Install Tesseract + all language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-all

# Verify packs are installed
RUN pdftract doctor tesseract-langs

Version Policy

Language packs are pinned to Tesseract 5.x series:

  • Base image uses tesseract-ocr 5.3.x from Alpine repos
  • Packs are from the same major version to ensure compatibility
  • Updates follow Alpine's security patch cadence

Per OQ-03, Tesseract version pinning is documented in the Dockerfile comments.
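
One way to enforce the 5.x pin at runtime is to parse the first line of the `tesseract --version` output. The sketch below assumes that line has the form "tesseract 5.3.x" (optionally with a leading "v" before the number); both function names are hypothetical and are not taken from doctor.rs.

```rust
/// Extract the major version from `tesseract --version` output,
/// assuming the first line looks like "tesseract 5.3.4".
fn tesseract_major_version(version_output: &str) -> Option<u32> {
    let first_line = version_output.lines().next()?;
    let version = first_line
        .trim()
        .strip_prefix("tesseract ")?
        .trim_start_matches('v');
    version.split('.').next()?.parse().ok()
}

/// True when the reported major version matches the pinned 5.x series.
fn is_supported_tesseract(version_output: &str) -> bool {
    tesseract_major_version(version_output) == Some(5)
}
```

Unrecognized output returns None rather than guessing, so a caller can treat an unparseable version the same way as an unsupported one.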

References

  • Plan Phase 5.4: Tesseract Integration
  • Plan Open Question OQ-04
  • Bead pdftract-32x4 (implementation)
  • crates/pdftract-core/src/ocr.rs (language detection)
  • crates/pdftract-cli/src/doctor.rs (language verification)

Revision History

| Date | Change |
|---|---|
| 2026-05-23 | Initial resolution; document created with OQ-04 decision |