# OCR Language Pack Distribution Strategy
**Status:** RESOLVED (OQ-04)
**Date:** 2026-05-23
**Bead:** pdftract-32x4
## Open Question OQ-04
> How are OCR language packs distributed? Bundled in the Docker image (size cost), downloaded on first use (network dependency), or required as an out-of-band install?
## Resolution Decision
Language packs are **bundled in Docker images** with a tiered distribution strategy:
| Docker Image Tag | Language Packs | Size | Use Case |
|------------------|----------------|------|----------|
| `pdftract:default` | None (OCR disabled) | ~4 MB | Vector-only extraction, no OCR capability |
| `pdftract:ocr` | eng + 12 common langs | ~150 MB | Standard OCR use case, covers ~80% of world population by native speaker count |
| `pdftract:full` | All 100+ languages | ~600 MB | Air-gapped deployments, comprehensive coverage |
## Rationale
### Why bundling?
1. **Air-gapped compatibility:** Bundling ensures OCR works in offline and air-gapped environments, with no network access needed to fetch packs on first use
2. **Reproducibility:** Fixed language pack versions guarantee consistent extraction results across deployments
3. **Simplicity:** No external dependency management for operators; `docker run` just works
4. **Performance:** No download latency on first OCR request
### Size trade-offs
The `:ocr` variant adds ~150 MB to the image but covers the vast majority of use cases:
- English (eng) - ~12 MB
- German (deu) - ~10 MB
- French (fra) - ~10 MB
- Spanish (spa) - ~10 MB
- Italian (ita) - ~9 MB
- Portuguese (por) - ~10 MB
- Japanese (jpn) - ~18 MB
- Simplified Chinese (chi_sim) - ~25 MB
- Traditional Chinese (chi_tra) - ~22 MB
- Korean (kor) - ~12 MB
- Russian (rus) - ~14 MB
- Arabic (ara) - ~8 MB
- Hindi (hin) - ~8 MB
Total: ~168 MB (uncompressed) → ~150 MB (after Docker layer compression)
The `:full` variant bundles all 100+ languages (~600 MB) for specialized deployments requiring comprehensive coverage.
### Why not download-on-first-use?
Download-on-first-use was rejected because:
- Requires network connectivity at OCR time (breaks air-gapped deployments)
- Adds complexity (pack download, validation, caching)
- Introduces latency on first OCR request
- Requires a trusted pack distribution endpoint
- Allows pack versions to drift between deployments that download at different times
### Why not out-of-band install?
Out-of-band install (e.g., `apt-get install tesseract-ocr-all`) was rejected because:
- Platform-specific (Debian vs Alpine vs macOS vs Windows)
- Version drift across package managers
- Additional operator setup step
- Inconsistent pack locations across distros
## Language Pack Allowlist
### `pdftract:ocr` bundle (Tier 1 - High Coverage)
| Code | Language | File | Size |
|------|----------|------|------|
| eng | English | eng.traineddata | 12 MB |
| deu | German | deu.traineddata | 10 MB |
| fra | French | fra.traineddata | 10 MB |
| spa | Spanish | spa.traineddata | 10 MB |
| ita | Italian | ita.traineddata | 9 MB |
| por | Portuguese | por.traineddata | 10 MB |
| jpn | Japanese | jpn.traineddata | 18 MB |
| chi_sim | Simplified Chinese | chi_sim.traineddata | 25 MB |
| chi_tra | Traditional Chinese | chi_tra.traineddata | 22 MB |
| kor | Korean | kor.traineddata | 12 MB |
| rus | Russian | rus.traineddata | 14 MB |
| ara | Arabic | ara.traineddata | 8 MB |
| hin | Hindi | hin.traineddata | 8 MB |
**Total: 13 languages, ~168 MB (uncompressed)**
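The total can be cross-checked by summing the per-pack sizes from the table. The snippet below is only an arithmetic sketch: the constant and function names are hypothetical, and the sizes are the approximate values listed above, not measured file sizes.

```rust
/// Approximate Tier 1 pack sizes in MB, copied from the allowlist table.
/// (Hypothetical constant for illustration; not part of pdftract.)
const TIER1_PACKS: [(&str, u32); 13] = [
    ("eng", 12),
    ("deu", 10),
    ("fra", 10),
    ("spa", 10),
    ("ita", 9),
    ("por", 10),
    ("jpn", 18),
    ("chi_sim", 25),
    ("chi_tra", 22),
    ("kor", 12),
    ("rus", 14),
    ("ara", 8),
    ("hin", 8),
];

/// Sum the listed sizes to get the uncompressed Tier 1 footprint in MB.
fn tier1_total_mb() -> u32 {
    TIER1_PACKS.iter().map(|&(_, mb)| mb).sum()
}
```

Calling `tier1_total_mb()` yields 168, matching the uncompressed total stated above.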
This set covers:
- All official UN languages (Arabic, Chinese, English, French, Russian, Spanish)
- Major European languages (German, Italian, Portuguese)
- Other major Asian languages (Japanese, Korean, Hindi)
- ~80% of world population by native speaker count
### `pdftract:full` bundle (Tier 2 - Complete)
Includes all 100+ language packs from the official Tesseract tessdata repository:
- All Tier 1 languages
- Indic languages (ben, guj, kan, mal, tam, tel, etc.)
- Southeast Asian languages (tha, vie, etc.)
- Central/Eastern European languages (pol, ces, slk, hun, ron, bul, etc.)
- Nordic languages (dan, nor, swe, fin)
- Turkic languages (tur, aze, uzb, etc.)
- Hebrew (heb)
- And 60+ others
**Total: 100+ languages, ~600 MB (uncompressed)**
## Implementation
### Pack Detection
The `detect_available_languages()` function in `crates/pdftract-core/src/ocr.rs` scans the tessdata directory for `<code>.traineddata` files and returns a `HashSet<String>` of available language codes.
The function respects the `$TESSDATA_PREFIX` environment variable and falls back to system-default tessdata paths:
- Unix: `/usr/share/tessdata`, `/usr/local/share/tessdata`
- Windows: `C:\Program Files\Tesseract-OCR\tessdata`
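The scan described above could look roughly like the following. This is a minimal sketch, not the code in `ocr.rs`: the helper names (`scan_tessdata_dir`, `candidate_dirs`, `detect_available_languages_sketch`) are hypothetical, and it assumes the first candidate directory that contains any packs wins.

```rust
use std::collections::HashSet;
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Collect language codes from `<code>.traineddata` files in one directory.
/// A missing or unreadable directory yields an empty set, not an error.
fn scan_tessdata_dir(dir: &Path) -> HashSet<String> {
    let mut langs = HashSet::new();
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(_) => return langs,
    };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.extension().and_then(|ext| ext.to_str()) != Some("traineddata") {
            continue;
        }
        // `chi_sim.traineddata` has the stem `chi_sim`, which is the language code.
        if let Some(code) = path.file_stem().and_then(|stem| stem.to_str()) {
            langs.insert(code.to_string());
        }
    }
    langs
}

/// `$TESSDATA_PREFIX` first (if set), then the platform defaults listed above.
fn candidate_dirs() -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(prefix) = env::var_os("TESSDATA_PREFIX") {
        dirs.push(PathBuf::from(prefix));
    }
    if cfg!(windows) {
        dirs.push(PathBuf::from(r"C:\Program Files\Tesseract-OCR\tessdata"));
    } else {
        dirs.push(PathBuf::from("/usr/share/tessdata"));
        dirs.push(PathBuf::from("/usr/local/share/tessdata"));
    }
    dirs
}

/// Assumption: use the first candidate directory that contains any packs.
fn detect_available_languages_sketch() -> HashSet<String> {
    candidate_dirs()
        .iter()
        .map(|dir| scan_tessdata_dir(dir))
        .find(|langs| !langs.is_empty())
        .unwrap_or_default()
}
```

Returning an empty set for an unreadable directory keeps detection infallible, so the caller can decide how to report a missing tessdata directory.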
### Language Validation
When OCR is invoked with a requested language list (from `ExtractionOptions.ocr_language`), the `validate_ocr_languages()` function:
1. Checks which requested languages are available
2. Emits `OCR_LANGUAGE_UNAVAILABLE` diagnostics for missing languages
3. Filters out unavailable languages from the Tesseract language string
4. Falls back to `eng` if no requested languages are available
This ensures extraction never hard-crashes due to missing packs: it degrades gracefully and reports diagnostics instead.
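The four validation steps can be sketched as below. The `ValidatedLanguages` type, the function name, and the diagnostic message format are assumptions made for illustration; only the `OCR_LANGUAGE_UNAVAILABLE` code and the `eng` fallback come from this note. Tesseract accepts multiple languages joined with `+`.

```rust
use std::collections::HashSet;

/// Outcome of validating a requested language list (hypothetical type).
#[derive(Debug, PartialEq)]
struct ValidatedLanguages {
    /// Language string to pass to Tesseract, e.g. "eng+deu".
    tesseract_langs: String,
    /// One diagnostic per requested language with no installed pack.
    diagnostics: Vec<String>,
}

fn validate_ocr_languages_sketch(
    requested: &[&str],
    available: &HashSet<String>,
) -> ValidatedLanguages {
    let mut usable: Vec<&str> = Vec::new();
    let mut diagnostics = Vec::new();

    for &lang in requested {
        if available.contains(lang) {
            // Keep request order and drop duplicates.
            if !usable.contains(&lang) {
                usable.push(lang);
            }
        } else {
            diagnostics.push(format!("OCR_LANGUAGE_UNAVAILABLE: {}", lang));
        }
    }

    // Fall back to English when nothing requested is installed.
    if usable.is_empty() {
        usable.push("eng");
    }

    ValidatedLanguages {
        tesseract_langs: usable.join("+"),
        diagnostics,
    }
}
```

With `eng` and `deu` installed, a request for `deu`, `xyz`, `eng` produces the language string `deu+eng` and a single diagnostic for `xyz`.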
### Doctor Check
The `pdftract doctor tesseract-langs` command verifies:
1. Tesseract binary is installed (version 5.x)
2. `eng` language pack is present (required fallback)
3. User-requested `--lang` languages are present
The command exits with code 1 if `eng` is missing, and with code 0 plus a WARN message if only optional languages are missing.
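The exit-code rule can be expressed as a small pure function. This is a hedged sketch: `DoctorReport`, the function name, and the message wording are hypothetical. It leaves out the Tesseract binary check because this note does not state which exit code that failure produces.

```rust
use std::collections::HashSet;

/// Result of the language-pack portion of the doctor check (hypothetical type).
#[derive(Debug, PartialEq)]
struct DoctorReport {
    exit_code: i32,
    messages: Vec<String>,
}

fn check_tesseract_langs_sketch(
    available: &HashSet<String>,
    requested: &[&str],
) -> DoctorReport {
    let mut messages = Vec::new();

    // `eng` is the required fallback, so its absence is the only hard failure.
    let eng_present = available.contains("eng");
    if !eng_present {
        messages.push("ERROR: required fallback pack `eng` is missing".to_string());
    }

    // Any other missing requested language only produces a warning.
    for &lang in requested {
        if lang != "eng" && !available.contains(lang) {
            messages.push(format!("WARN: language pack `{}` is missing", lang));
        }
    }

    DoctorReport {
        exit_code: if eng_present { 0 } else { 1 },
        messages,
    }
}
```

Keeping the decision separate from process exit makes the rule easy to unit-test without invoking the CLI.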
## Docker Implementation
### Dockerfile.ocr (Tier 1)
```dockerfile
FROM pdftract:base

# Install Tesseract + Tier 1 language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-eng \
    tesseract-ocr-data-deu \
    tesseract-ocr-data-fra \
    tesseract-ocr-data-spa \
    tesseract-ocr-data-ita \
    tesseract-ocr-data-por \
    tesseract-ocr-data-jpn \
    tesseract-ocr-data-chi_sim \
    tesseract-ocr-data-chi_tra \
    tesseract-ocr-data-kor \
    tesseract-ocr-data-rus \
    tesseract-ocr-data-ara \
    tesseract-ocr-data-hin

# Verify packs are installed
RUN pdftract doctor tesseract-langs --lang eng,deu,fra,spa,ita,por,jpn,chi_sim,chi_tra,kor,rus,ara,hin
```
### Dockerfile.full (Tier 2)
```dockerfile
FROM pdftract:base

# Install Tesseract + all language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-all

# Verify packs are installed
RUN pdftract doctor tesseract-langs
```
## Version Policy
Language packs are pinned to the Tesseract 5.x series:
- Base image uses `tesseract-ocr 5.3.x` from Alpine repos
- Packs are from the same major version to ensure compatibility
- Updates follow Alpine's security patch cadence
Per OQ-03, Tesseract version pinning is documented in the Dockerfile comments.
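A 5.x check could be implemented by parsing the first line of `tesseract --version`. The sketch below assumes that line has the form `tesseract 5.3.0` (some builds prefix the number with `v`). The function names are hypothetical and are not taken from `doctor.rs`.

```rust
/// Extract the major version from `tesseract --version` output.
/// Assumes the first line looks like "tesseract 5.3.0" or "tesseract v5.3.0".
fn parse_tesseract_major(version_output: &str) -> Option<u32> {
    let first_line = version_output.lines().next()?;
    // Second whitespace-separated token is the version number.
    let version = first_line.split_whitespace().nth(1)?;
    version
        .trim_start_matches('v')
        .split('.')
        .next()?
        .parse()
        .ok()
}

/// True only for the 5.x series that the language packs are pinned to.
fn is_supported_tesseract(version_output: &str) -> bool {
    parse_tesseract_major(version_output) == Some(5)
}
```

Returning `Option` lets the caller distinguish unparseable output from an unsupported major version.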
## References
- Plan Phase 5.4: Tesseract Integration
- Plan Open Question OQ-04
- Bead pdftract-32x4 (implementation)
- crates/pdftract-core/src/ocr.rs (language detection)
- crates/pdftract-cli/src/doctor.rs (language verification)
## Revision History
| Date | Change |
|------|--------|
| 2026-05-23 | Initial resolution; document created with OQ-04 decision |