# OCR Language Pack Distribution Strategy

**Status:** RESOLVED (OQ-04)
**Date:** 2026-05-23
**Bead:** pdftract-32x4

## Open Question OQ-04

> How are OCR language packs distributed? Bundled in the Docker image (size cost), downloaded on first use (network dependency), or required as an out-of-band install?

## Resolution Decision

Language packs are **bundled in Docker images** with a tiered distribution strategy:

| Docker Image Tag | Language Packs | Size | Use Case |
|------------------|----------------|------|----------|
| `pdftract:default` | None (OCR disabled) | ~4 MB | Vector-only extraction, no OCR capability |
| `pdftract:ocr` | eng + 12 common langs (13 total) | ~150 MB | Standard OCR use case, covers the most widely used languages |
| `pdftract:full` | All 100+ languages | ~600 MB | Air-gapped deployments, comprehensive coverage |

## Rationale

### Why bundling?

1. **Air-gapped compatibility:** Bundling ensures OCR works in offline/air-gapped environments, with no network access needed to download packs on first use
2. **Reproducibility:** Fixed language pack versions guarantee consistent extraction results across deployments
3. **Simplicity:** No external dependency management for operators; `docker run` just works
4. **Performance:** No download latency on first OCR request

### Size trade-offs

The `:ocr` variant adds ~150 MB to the image but covers the vast majority of use cases:

- English (eng) - ~12 MB
- German (deu) - ~10 MB
- French (fra) - ~10 MB
- Spanish (spa) - ~10 MB
- Italian (ita) - ~9 MB
- Portuguese (por) - ~10 MB
- Japanese (jpn) - ~18 MB
- Simplified Chinese (chi_sim) - ~25 MB
- Traditional Chinese (chi_tra) - ~22 MB
- Korean (kor) - ~12 MB
- Russian (rus) - ~14 MB
- Arabic (ara) - ~8 MB
- Hindi (hin) - ~8 MB

Total: ~168 MB (uncompressed) → ~150 MB (after Docker layer compression)
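
The uncompressed total can be checked by summing the per-pack figures in the list above. The following is a minimal sketch of that arithmetic; the constant and function names are hypothetical and are not taken from the pdftract codebase.

```rust
/// Approximate Tier 1 pack sizes in MB, copied from the list above.
/// (Hypothetical constant, for illustration only.)
const TIER1_PACKS_MB: [(&str, u32); 13] = [
    ("eng", 12),
    ("deu", 10),
    ("fra", 10),
    ("spa", 10),
    ("ita", 9),
    ("por", 10),
    ("jpn", 18),
    ("chi_sim", 25),
    ("chi_tra", 22),
    ("kor", 12),
    ("rus", 14),
    ("ara", 8),
    ("hin", 8),
];

/// Sum of the approximate uncompressed pack sizes, in MB.
fn tier1_total_mb() -> u32 {
    TIER1_PACKS_MB.iter().map(|(_, mb)| mb).sum()
}
```

Calling `tier1_total_mb()` gives 168, which matches the uncompressed total quoted for the 13 Tier 1 packs.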

The `:full` variant bundles all 100+ languages (~600 MB) for specialized deployments requiring comprehensive coverage.

### Why not download-on-first-use?

Download-on-first-use was rejected because:

- Requires network connectivity at OCR time (breaks air-gapped deployments)
- Adds complexity (pack download, validation, caching)
- Introduces latency on first OCR request
- Requires a trusted pack distribution endpoint
- Version drift between pack downloads across deployments

### Why not out-of-band install?

Out-of-band install (e.g., `apt-get install tesseract-ocr-all`) was rejected because:

- Platform-specific (Debian vs Alpine vs macOS vs Windows)
- Version drift across package managers
- Additional operator setup step
- Inconsistent pack locations across distros

## Language Pack Allowlist

### `pdftract:ocr` bundle (Tier 1 - High Coverage)

| Code | Language | File | Size |
|------|----------|------|------|
| eng | English | eng.traineddata | 12 MB |
| deu | German | deu.traineddata | 10 MB |
| fra | French | fra.traineddata | 10 MB |
| spa | Spanish | spa.traineddata | 10 MB |
| ita | Italian | ita.traineddata | 9 MB |
| por | Portuguese | por.traineddata | 10 MB |
| jpn | Japanese | jpn.traineddata | 18 MB |
| chi_sim | Simplified Chinese | chi_sim.traineddata | 25 MB |
| chi_tra | Traditional Chinese | chi_tra.traineddata | 22 MB |
| kor | Korean | kor.traineddata | 12 MB |
| rus | Russian | rus.traineddata | 14 MB |
| ara | Arabic | ara.traineddata | 8 MB |
| hin | Hindi | hin.traineddata | 8 MB |

**Total: 13 languages, ~168 MB (uncompressed)**

This set covers:

- All official UN languages (Arabic, Chinese, English, French, Russian, Spanish)
- Major European languages (German, Italian, Portuguese)
- Major Asian languages (Japanese, Korean, Hindi)
- A large share of the world's population by speaker count

### `pdftract:full` bundle (Tier 2 - Complete)

Includes all 100+ language packs from the official Tesseract tessdata repository:

- All Tier 1 languages
- Indic languages (ben, guj, kan, mal, tam, tel, etc.)
- Southeast Asian languages (tha, vie, etc.)
- Central/Eastern European languages (pol, ces, slk, hun, ron, bul, etc.)
- Nordic languages (dan, nor, swe, fin)
- Turkic languages (tur, aze, uzb, etc.)
- Hebrew (heb)
- And 60+ others

**Total: 100+ languages, ~600 MB (uncompressed)**

## Implementation

### Pack Detection

The `detect_available_languages()` function in `crates/pdftract-core/src/ocr.rs` scans the tessdata directory for `<code>.traineddata` files and returns a `HashSet<String>` of available language codes.
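
The scan described above could look roughly like the sketch below. It is an illustration, not the actual contents of `ocr.rs`: the function name `detect_available_languages_in`, its directory parameter, and the choice to return an empty set for an unreadable directory are all assumptions.

```rust
use std::collections::HashSet;
use std::fs;
use std::path::Path;

/// Hypothetical sketch: collect language codes from `<code>.traineddata`
/// files found directly inside `tessdata_dir`.
/// Assumption: a missing or unreadable directory yields an empty set.
fn detect_available_languages_in(tessdata_dir: &Path) -> HashSet<String> {
    let mut langs = HashSet::new();
    let entries = match fs::read_dir(tessdata_dir) {
        Ok(entries) => entries,
        Err(_) => return langs,
    };
    // `flatten()` skips directory entries that could not be read.
    for entry in entries.flatten() {
        let path = entry.path();
        let is_pack = path.is_file()
            && path.extension().and_then(|ext| ext.to_str()) == Some("traineddata");
        if is_pack {
            // The file stem is the language code, e.g. `chi_sim`.
            if let Some(code) = path.file_stem().and_then(|stem| stem.to_str()) {
                langs.insert(code.to_owned());
            }
        }
    }
    langs
}
```

With a directory containing `eng.traineddata`, `deu.traineddata`, and an unrelated `README.txt`, this sketch returns a set holding only `eng` and `deu`.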

The function respects the `$TESSDATA_PREFIX` environment variable and falls back to system-default tessdata paths:

- Unix: `/usr/share/tessdata`, `/usr/local/share/tessdata`
- Windows: `C:\Program Files\Tesseract-OCR\tessdata`
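
One way to express this lookup order is sketched below. The helper names are hypothetical, and the sketch assumes that `TESSDATA_PREFIX` points at the tessdata directory itself and that the first existing candidate wins; the real function may resolve paths differently.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper: ordered list of directories to probe.
/// A non-empty prefix (the value of `TESSDATA_PREFIX`) comes first,
/// followed by the system defaults listed above.
fn candidate_tessdata_dirs(prefix: Option<&str>) -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(p) = prefix.filter(|p| !p.is_empty()) {
        dirs.push(PathBuf::from(p));
    }
    if cfg!(windows) {
        dirs.push(PathBuf::from(r"C:\Program Files\Tesseract-OCR\tessdata"));
    } else {
        dirs.push(PathBuf::from("/usr/share/tessdata"));
        dirs.push(PathBuf::from("/usr/local/share/tessdata"));
    }
    dirs
}

/// Hypothetical helper: first candidate that exists as a directory.
fn resolve_tessdata_dir() -> Option<PathBuf> {
    let prefix = env::var("TESSDATA_PREFIX").ok();
    candidate_tessdata_dirs(prefix.as_deref())
        .into_iter()
        .find(|dir| dir.is_dir())
}
```

Keeping the candidate list separate from the filesystem probe makes the ordering testable without touching the environment or the disk.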

### Language Validation

When OCR is invoked with a requested language list (from `ExtractionOptions.ocr_language`), the `validate_ocr_languages()` function:

1. Checks which requested languages are available
2. Emits `OCR_LANGUAGE_UNAVAILABLE` diagnostics for missing languages
3. Filters out unavailable languages from the Tesseract language string
4. Falls back to `eng` if no requested languages are available

This ensures extraction never hard-crashes due to missing packs; it degrades gracefully with diagnostics.
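
The four steps above can be sketched as follows. The `Diagnostic` and `ValidatedLanguages` types, the function name, and its signature are assumptions made for illustration; only the diagnostic code `OCR_LANGUAGE_UNAVAILABLE` and the `eng` fallback come from this document. The `+` separator is the one Tesseract uses for multi-language strings (for example `eng+deu`).

```rust
use std::collections::HashSet;

/// Hypothetical diagnostic record; the real type in pdftract may differ.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    language: String,
}

/// Hypothetical result: the language string to pass to Tesseract
/// plus any diagnostics gathered along the way.
struct ValidatedLanguages {
    tesseract_lang: String,
    diagnostics: Vec<Diagnostic>,
}

/// Sketch of the validation steps; not the actual pdftract implementation.
fn validate_ocr_languages_sketch(
    requested: &[&str],
    available: &HashSet<String>,
) -> ValidatedLanguages {
    let mut usable = Vec::new();
    let mut diagnostics = Vec::new();
    for &lang in requested {
        if available.contains(lang) {
            // Steps 1 and 3: keep only languages that are installed.
            usable.push(lang);
        } else {
            // Step 2: record a diagnostic for each missing language.
            diagnostics.push(Diagnostic {
                code: "OCR_LANGUAGE_UNAVAILABLE",
                language: lang.to_owned(),
            });
        }
    }
    // Step 4: fall back to English when nothing usable remains.
    if usable.is_empty() {
        usable.push("eng");
    }
    ValidatedLanguages {
        tesseract_lang: usable.join("+"),
        diagnostics,
    }
}
```

For a request of `deu`, `xyz`, `eng` against an installation that has only `eng` and `deu`, the sketch yields the language string `deu+eng` and one diagnostic for `xyz`.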

### Doctor Check

The `pdftract doctor tesseract-langs` command verifies:

1. Tesseract binary is installed (version 5.x)
2. `eng` language pack is present (required fallback)
3. User-requested `--lang` languages are present

The command exits with code 1 if `eng` is missing, and with code 0 plus a WARN if only optional languages are missing.
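
The exit-code rule for language packs can be modelled as in the sketch below. It covers only the pack checks (items 2 and 3), not the binary check, and the `DoctorOutcome` enum and function name are hypothetical rather than taken from `doctor.rs`.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of the language-pack part of the doctor check.
#[derive(Debug, PartialEq)]
enum DoctorOutcome {
    /// `eng` and every requested language are present.
    Pass,
    /// `eng` is present, but these requested languages are missing.
    Warn(Vec<String>),
    /// The required `eng` fallback pack is missing.
    Fail,
}

impl DoctorOutcome {
    /// Exit code 1 only when `eng` is missing; warnings still exit 0.
    fn exit_code(&self) -> i32 {
        match self {
            DoctorOutcome::Fail => 1,
            DoctorOutcome::Pass | DoctorOutcome::Warn(_) => 0,
        }
    }
}

/// Sketch of the pack checks; not the actual pdftract implementation.
fn check_language_packs(available: &HashSet<String>, requested: &[&str]) -> DoctorOutcome {
    if !available.contains("eng") {
        return DoctorOutcome::Fail;
    }
    let missing: Vec<String> = requested
        .iter()
        .copied()
        .filter(|lang| !available.contains(*lang))
        .map(str::to_owned)
        .collect();
    if missing.is_empty() {
        DoctorOutcome::Pass
    } else {
        DoctorOutcome::Warn(missing)
    }
}
```

Modelling the result as an enum keeps the WARN case distinct from a clean pass while still mapping both to exit code 0.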

## Docker Implementation

### Dockerfile.ocr (Tier 1)

```dockerfile
FROM pdftract:base

# Install Tesseract + Tier 1 language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-eng \
    tesseract-ocr-data-deu \
    tesseract-ocr-data-fra \
    tesseract-ocr-data-spa \
    tesseract-ocr-data-ita \
    tesseract-ocr-data-por \
    tesseract-ocr-data-jpn \
    tesseract-ocr-data-chi_sim \
    tesseract-ocr-data-chi_tra \
    tesseract-ocr-data-kor \
    tesseract-ocr-data-rus \
    tesseract-ocr-data-ara \
    tesseract-ocr-data-hin

# Verify packs are installed
RUN pdftract doctor tesseract-langs --lang eng,deu,fra,spa,ita,por,jpn,chi_sim,chi_tra,kor,rus,ara,hin
```

### Dockerfile.full (Tier 2)

```dockerfile
FROM pdftract:base

# Install Tesseract + all language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-all

# Verify packs are installed
RUN pdftract doctor tesseract-langs
```

## Version Policy

Language packs are pinned to Tesseract 5.x series:

- Base image uses `tesseract-ocr 5.3.x` from Alpine repos
- Packs are from the same major version to ensure compatibility
- Updates follow Alpine's security patch cadence

Per OQ-03, Tesseract version pinning is documented in the Dockerfile comments.
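
A 5.x pin like this could be checked at runtime by parsing the version banner. The sketch below assumes the first line of `tesseract --version` has the form `tesseract <version>` (for example `tesseract 5.3.4`); the function names are hypothetical, and this document does not say how pdftract performs the check.

```rust
/// Hypothetical helper: extract the major version from the first line
/// of `tesseract --version` output.
/// Assumption: that line looks like `tesseract 5.3.4`, optionally with
/// a leading `v` before the number.
fn tesseract_major_version(version_output: &str) -> Option<u32> {
    let first_line = version_output.lines().next()?;
    let version = first_line.strip_prefix("tesseract ")?.trim();
    version
        .trim_start_matches('v')
        .split('.')
        .next()?
        .parse()
        .ok()
}

/// True only for the 5.x series named in the version policy.
fn is_supported_tesseract(version_output: &str) -> bool {
    tesseract_major_version(version_output) == Some(5)
}
```

Unparseable output yields `None`, so `is_supported_tesseract` treats it as unsupported.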

## References

- Plan Phase 5.4: Tesseract Integration
- Plan Open Question OQ-04
- Bead pdftract-32x4 (implementation)
- `crates/pdftract-core/src/ocr.rs` (language detection)
- `crates/pdftract-cli/src/doctor.rs` (language verification)

## Revision History

| Date | Change |
|------|--------|
| 2026-05-23 | Initial resolution; document created with OQ-04 decision |