
OCR Language Pack Distribution Strategy

Status: RESOLVED (OQ-04)
Date: 2026-05-23
Bead: pdftract-32x4

Open Question OQ-04

How are OCR language packs distributed? Bundled in the Docker image (size cost), downloaded on first use (network dependency), or required as an out-of-band install?

Resolution Decision

Language packs are bundled in Docker images with a tiered distribution strategy:

Docker Image Tag	Language Packs	Size	Use Case
`pdftract:default`	None (OCR disabled)	~4 MB	Vector-only extraction, no OCR capability
`pdftract:ocr`	eng + 12 common langs	~150 MB	Standard OCR use case, covers the most widely used languages
`pdftract:full`	All 100+ languages	~600 MB	Air-gapped deployments, comprehensive coverage

Rationale

Why bundling?

Air-gapped compatibility: Bundling ensures OCR works in offline/air-gapped environments, with no network access needed to download packs on first use
Reproducibility: Fixed language pack versions guarantee consistent extraction results across deployments
Simplicity: No external dependency management for operators; docker run just works
Performance: No download latency on first OCR request

Size trade-offs

The :ocr variant adds ~150 MB to the image but covers the vast majority of use cases:

English (eng) - ~12 MB
German (deu) - ~10 MB
French (fra) - ~10 MB
Spanish (spa) - ~10 MB
Italian (ita) - ~9 MB
Portuguese (por) - ~10 MB
Japanese (jpn) - ~18 MB
Simplified Chinese (chi_sim) - ~25 MB
Traditional Chinese (chi_tra) - ~22 MB
Korean (kor) - ~12 MB
Russian (rus) - ~14 MB
Arabic (ara) - ~8 MB
Hindi (hin) - ~8 MB

Total: ~168 MB (uncompressed) → ~150 MB (after Docker layer compression)
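
As a quick arithmetic check, the per-pack sizes listed above can be summed directly. This is only an illustrative sketch; the constant and function names are hypothetical and do not come from the pdftract codebase.

```rust
/// Approximate Tier 1 pack sizes in MB, copied from the list above.
/// (Hypothetical constant name, for illustration only.)
const TIER1_PACKS_MB: [(&str, u32); 13] = [
    ("eng", 12),
    ("deu", 10),
    ("fra", 10),
    ("spa", 10),
    ("ita", 9),
    ("por", 10),
    ("jpn", 18),
    ("chi_sim", 25),
    ("chi_tra", 22),
    ("kor", 12),
    ("rus", 14),
    ("ara", 8),
    ("hin", 8),
];

/// Sum the uncompressed sizes of all Tier 1 packs.
fn tier1_total_mb() -> u32 {
    TIER1_PACKS_MB.iter().map(|(_, mb)| mb).sum()
}
```

Calling tier1_total_mb() yields 168, matching the uncompressed total.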

The :full variant bundles all 100+ languages (~600 MB) for specialized deployments requiring comprehensive coverage.

Why not download-on-first-use?

Download-on-first-use was rejected because:

Requires network connectivity at OCR time (breaks air-gapped deployments)
Adds complexity (pack download, validation, caching)
Introduces latency on first OCR request
Requires a trusted pack distribution endpoint
Version drift between pack downloads across deployments

Why not out-of-band install?

Out-of-band install (e.g., apt-get install tesseract-ocr-all) was rejected because:

Platform-specific (Debian vs Alpine vs macOS vs Windows)
Version drift across package managers
Additional operator setup step
Inconsistent pack locations across distros

Language Pack Allowlist

`pdftract:ocr` bundle (Tier 1 - High Coverage)

Code	Language	File	Size
eng	English	eng.traineddata	12 MB
deu	German	deu.traineddata	10 MB
fra	French	fra.traineddata	10 MB
spa	Spanish	spa.traineddata	10 MB
ita	Italian	ita.traineddata	9 MB
por	Portuguese	por.traineddata	10 MB
jpn	Japanese	jpn.traineddata	18 MB
chi_sim	Simplified Chinese	chi_sim.traineddata	25 MB
chi_tra	Traditional Chinese	chi_tra.traineddata	22 MB
kor	Korean	kor.traineddata	12 MB
rus	Russian	rus.traineddata	14 MB
ara	Arabic	ara.traineddata	8 MB
hin	Hindi	hin.traineddata	8 MB

Total: 13 languages, ~168 MB (uncompressed)

This set covers:

All official UN languages (Arabic, Chinese, English, French, Russian, Spanish)
Major European languages (German, Italian, Portuguese)
Major East and South Asian languages (Japanese, Korean, Hindi)
~80% of world population by native speaker count

`pdftract:full` bundle (Tier 2 - Complete)

Includes all 100+ language packs from the official Tesseract tessdata repository:

All Tier 1 languages
Indic languages (ben, guj, kan, mal, tam, tel, etc.)
Southeast Asian languages (tha, vie, etc.)
Central/Eastern European languages (pol, ces, slk, hun, ron, bul, etc.)
Nordic languages (dan, nor, swe, fin)
Turkic languages (tur, aze, uzb, etc.)
Hebrew (heb)
And 60+ others

Total: 100+ languages, ~600 MB (uncompressed)

Implementation

Pack Detection

The detect_available_languages() function in crates/pdftract-core/src/ocr.rs scans the tessdata directory for <code>.traineddata files (where <code> is the language code, e.g. eng.traineddata) and returns a HashSet<String> of available language codes.

The function respects the $TESSDATA_PREFIX environment variable and falls back to system-default tessdata paths:

Unix: /usr/share/tessdata, /usr/local/share/tessdata
Windows: C:\Program Files\Tesseract-OCR\tessdata
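
The scan described above might look roughly like the following. This is a minimal sketch rather than the actual contents of ocr.rs: the helper names scan_tessdata_dir and tessdata_candidates are hypothetical, and the "first directory that contains any pack wins" rule is an assumption, since the document does not say how multiple candidate directories are combined. It also assumes $TESSDATA_PREFIX points at the tessdata directory itself.

```rust
use std::collections::HashSet;
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Collect language codes from every `<code>.traineddata` file in `dir`.
/// A missing or unreadable directory yields an empty set.
fn scan_tessdata_dir(dir: &Path) -> HashSet<String> {
    let mut langs = HashSet::new();
    let Ok(entries) = fs::read_dir(dir) else {
        return langs;
    };
    for entry in entries.flatten() {
        let path = entry.path();
        let is_pack = path.extension().and_then(|e| e.to_str()) == Some("traineddata");
        if !is_pack {
            continue;
        }
        // The file stem is the language code, e.g. "chi_sim".
        if let Some(code) = path.file_stem().and_then(|s| s.to_str()) {
            langs.insert(code.to_string());
        }
    }
    langs
}

/// Candidate directories: $TESSDATA_PREFIX first, then platform defaults.
fn tessdata_candidates() -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(prefix) = env::var_os("TESSDATA_PREFIX") {
        dirs.push(PathBuf::from(prefix));
    }
    if cfg!(windows) {
        dirs.push(PathBuf::from(r"C:\Program Files\Tesseract-OCR\tessdata"));
    } else {
        dirs.push(PathBuf::from("/usr/share/tessdata"));
        dirs.push(PathBuf::from("/usr/local/share/tessdata"));
    }
    dirs
}

/// Return the packs from the first candidate directory that has any
/// (assumed policy), or an empty set if none is found.
fn detect_available_languages() -> HashSet<String> {
    tessdata_candidates()
        .iter()
        .map(|dir| scan_tessdata_dir(dir))
        .find(|langs| !langs.is_empty())
        .unwrap_or_default()
}
```

Returning an empty set instead of an error keeps the caller's fallback logic simple: a host without Tesseract looks the same as one with no packs installed.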

Language Validation

When OCR is invoked with a requested language list (from ExtractionOptions.ocr_language), the validate_ocr_languages() function:

Checks which requested languages are available
Emits OCR_LANGUAGE_UNAVAILABLE diagnostics for missing languages
Filters out unavailable languages from the Tesseract language string
Falls back to eng if no requested languages are available

This ensures extraction never hard-crashes due to missing packs; it degrades gracefully with diagnostics.
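
The four steps above can be sketched as follows. The signature and the Diagnostic struct are assumptions made for illustration; the real validate_ocr_languages() and diagnostic types in pdftract-core may be shaped differently. The only external fact relied on is that Tesseract accepts multiple languages joined with '+'.

```rust
use std::collections::HashSet;

/// Hypothetical diagnostic record; the real type is not shown in this document.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    language: String,
}

/// Filter `requested` against `available`, returning the Tesseract language
/// string (codes joined with '+') plus one diagnostic per missing pack.
fn validate_ocr_languages(
    requested: &[&str],
    available: &HashSet<String>,
) -> (String, Vec<Diagnostic>) {
    let mut usable: Vec<&str> = Vec::new();
    let mut diagnostics = Vec::new();

    for &lang in requested {
        if available.contains(lang) {
            // Keep request order, skip duplicates.
            if !usable.contains(&lang) {
                usable.push(lang);
            }
        } else {
            diagnostics.push(Diagnostic {
                code: "OCR_LANGUAGE_UNAVAILABLE",
                language: lang.to_string(),
            });
        }
    }

    // Fall back to English when nothing that was requested is installed.
    if usable.is_empty() {
        usable.push("eng");
    }

    (usable.join("+"), diagnostics)
}
```

For example, requesting deu, tha, and eng on a host that only has eng and deu would produce the language string "deu+eng" and a single OCR_LANGUAGE_UNAVAILABLE diagnostic for tha.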

Doctor Check

The pdftract doctor tesseract-langs command verifies:

Tesseract binary is installed (version 5.x)
eng language pack is present (required fallback)
User-requested --lang languages are present

Exit code 1 if eng is missing; exit code 0 with WARN if optional languages are missing.
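
The exit-code rule can be expressed as a small decision function. This sketch covers only the language-pack part of the check (not the Tesseract binary check), and the DoctorOutcome type and check_langs name are hypothetical, not taken from doctor.rs.

```rust
use std::collections::HashSet;

/// Hypothetical result type for the language-pack portion of the doctor check.
#[derive(Debug, PartialEq)]
enum DoctorOutcome {
    /// eng and every requested pack are present.
    Ok,
    /// eng is present, but these optional packs are missing.
    Warn(Vec<String>),
    /// The required eng fallback pack is missing.
    Fail,
}

impl DoctorOutcome {
    /// Exit code 1 only when eng is missing; warnings still exit 0.
    fn exit_code(&self) -> i32 {
        match self {
            DoctorOutcome::Fail => 1,
            DoctorOutcome::Ok | DoctorOutcome::Warn(_) => 0,
        }
    }
}

/// Compare the installed packs against the `--lang` request list.
fn check_langs(available: &HashSet<String>, requested: &[&str]) -> DoctorOutcome {
    if !available.contains("eng") {
        return DoctorOutcome::Fail;
    }
    let missing: Vec<String> = requested
        .iter()
        .copied()
        .filter(|lang| !available.contains(*lang))
        .map(str::to_string)
        .collect();
    if missing.is_empty() {
        DoctorOutcome::Ok
    } else {
        DoctorOutcome::Warn(missing)
    }
}
```

Keeping the outcome separate from the exit code lets the CLI print the WARN list while still exiting 0.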

Docker Implementation

Dockerfile.ocr (Tier 1)

FROM pdftract:base

# Install Tesseract + Tier 1 language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-eng \
    tesseract-ocr-data-deu \
    tesseract-ocr-data-fra \
    tesseract-ocr-data-spa \
    tesseract-ocr-data-ita \
    tesseract-ocr-data-por \
    tesseract-ocr-data-jpn \
    tesseract-ocr-data-chi_sim \
    tesseract-ocr-data-chi_tra \
    tesseract-ocr-data-kor \
    tesseract-ocr-data-rus \
    tesseract-ocr-data-ara \
    tesseract-ocr-data-hin

# Verify packs are installed
RUN pdftract doctor tesseract-langs --lang eng,deu,fra,spa,ita,por,jpn,chi_sim,chi_tra,kor,rus,ara,hin

Dockerfile.full (Tier 2)

FROM pdftract:base

# Install Tesseract + all language packs
RUN apk add --no-cache \
    tesseract-ocr \
    tesseract-ocr-data-all

# Verify packs are installed
RUN pdftract doctor tesseract-langs

Version Policy

Language packs are pinned to the Tesseract 5.x series:

Base image uses tesseract-ocr 5.3.x from Alpine repos
Packs are from the same major version to ensure compatibility
Updates follow Alpine's security patch cadence

Per OQ-03, Tesseract version pinning is documented in the Dockerfile comments.
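
One way to enforce the 5.x pin at runtime is to parse the major version out of the first line of `tesseract --version`. The sketch below assumes that line has the form "tesseract 5.3.4" (some builds print a leading "v" before the number); both function names are hypothetical, and the document does not state how pdftract performs this check.

```rust
/// Extract the major version from `tesseract --version` output.
/// Assumes the first line looks like "tesseract 5.3.4" or "tesseract v5.3.4".
fn parse_tesseract_major(version_output: &str) -> Option<u32> {
    let first_line = version_output.lines().next()?;
    // Second whitespace-separated token is the version string.
    let version = first_line.split_whitespace().nth(1)?;
    let version = version.trim_start_matches('v');
    version.split('.').next()?.parse().ok()
}

/// True when the reported version belongs to the pinned 5.x series.
fn is_supported_major(version_output: &str) -> bool {
    parse_tesseract_major(version_output) == Some(5)
}
```

Unparseable output is treated as unsupported rather than as an error, so a missing or unexpected binary fails the check in the same way as a wrong version.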

References

Plan Phase 5.4: Tesseract Integration
Plan Open Question OQ-04
Bead pdftract-32x4 (implementation)
crates/pdftract-core/src/ocr.rs (language detection)
crates/pdftract-cli/src/doctor.rs (language verification)

Revision History

Date	Change
2026-05-23	Initial resolution; document created with OQ-04 decision
