pdftract/docs/research/language-detection-and-script-handling.md

# Language Detection and Script Handling in pdftract

## Overview

Multilingual PDF documents expose three distinct problems for a text extraction library: identifying which Unicode script a sequence of codepoints belongs to, reconstructing logical order from glyphs that may have been stored in visual order, and normalizing script-specific presentation variants to canonical Unicode forms. This document covers each problem, the relevant standards, and the implementation strategy for `pdftract`.

---

## 1. Script Detection from Glyph Data

### Unicode Script Property (UAX #24)

Every Unicode codepoint carries a `Script` property defined in UAX #24. The Unicode Character Database (UCD) ships `Scripts.txt` and the companion `ScriptExtensions.txt`. Script extensions matter because some codepoints — most common-use punctuation, digits U+0030–U+0039, and combining marks — are legitimately shared across scripts and carry the `Common` or `Inherited` value rather than a specific script name.

A `pdftract` span classifier should resolve script assignments in this priority order:

1. **Specific script** — codepoints with a single non-`Common`, non-`Inherited` script assignment are classified directly.
2. **Script extensions** — codepoints with multiple entries in `ScriptExtensions.txt` (e.g., U+0300 COMBINING GRAVE ACCENT extends into `Latin`, `Greek`, `Cyrillic`) inherit the script of the surrounding run.
3. **Common/Inherited** — treated as transparent; they attach to the script of the nearest resolved codepoint within the same bidi run.

### Mixed-Script Spans

A single PDF text object can contain codepoints from multiple scripts (e.g., a Japanese sentence with embedded Latin product names). The standard approach is **script-run segmentation**: scan the codepoint sequence left to right, maintaining a current script state, and emit a new span boundary whenever the resolved script changes from one specific value to another. `Common` and `Inherited` codepoints do not trigger boundaries.

The Unicode `ScriptExtensions` data can be used to suppress spurious splits: if a `Common` punctuation character appears between two Latin spans with no intervening RTL text, it should remain in the Latin span rather than producing a one-character `Common` fragment.
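
The segmentation rule above can be sketched in a few lines. This is a minimal illustration, not `pdftract` code: the `Script` enum, `classify`, and `script_runs` are hypothetical names, and the hand-written ranges stand in for a lookup generated from `Scripts.txt`.

```rust
/// Hypothetical script classes for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Cyrillic,
    Han,
    Hiragana,
    /// Common or Inherited: transparent for segmentation.
    Transparent,
}

/// Coarse range lookup (assumption: a real classifier is generated from the UCD).
fn classify(c: char) -> Script {
    match c as u32 {
        0x0041..=0x005A | 0x0061..=0x007A => Script::Latin,
        0x0410..=0x044F => Script::Cyrillic,
        0x3041..=0x3096 => Script::Hiragana,
        0x3400..=0x4DBF | 0x4E00..=0x9FFF => Script::Han,
        _ => Script::Transparent,
    }
}

/// Splits `text` into (script, substring) runs. Transparent characters never
/// open a boundary: they stay in the current run, and leading transparent
/// characters join the first resolved run.
fn script_runs(text: &str) -> Vec<(Script, &str)> {
    let mut runs = Vec::new();
    let mut start = 0;
    let mut current = Script::Transparent;
    for (idx, c) in text.char_indices() {
        let s = classify(c);
        if s == Script::Transparent {
            continue;
        }
        if current == Script::Transparent {
            current = s;
        } else if s != current {
            // Specific script changed: close the run at this byte offset.
            runs.push((current, &text[start..idx]));
            start = idx;
            current = s;
        }
    }
    if !text.is_empty() {
        runs.push((current, &text[start..]));
    }
    runs
}
```

In this sketch, punctuation and spaces trail the run that precedes them, so no one-character `Common` fragments are emitted.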

### CJK Script Identification

CJK text requires distinguishing four scripts that are routinely mixed within the same run:

| Script | Key Ranges |
|--------|-----------|
| Han | U+4E00–U+9FFF (BMP), U+3400–U+4DBF (Extension A), U+20000–U+2A6DF (Extension B) |
| Hiragana | U+3041–U+3096 |
| Katakana | U+30A1–U+30FA, U+31F0–U+31FF |
| Hangul | U+AC00–U+D7A3 (syllables), U+1100–U+11FF (jamo) |

Han is shared across Chinese, Japanese, and Korean. Language detection (Section 7) must disambiguate Han-dominant runs; script detection alone cannot.
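
As a rough illustration, the ranges in the table translate directly into a lookup function. The function name `cjk_script` is hypothetical, and the ranges are only those listed above (a production table would come from the UCD and cover later Han extensions).

```rust
/// Returns the ISO 15924 code for the CJK ranges listed in the table above,
/// or `None` for any codepoint outside them.
fn cjk_script(c: char) -> Option<&'static str> {
    match c as u32 {
        0x4E00..=0x9FFF | 0x3400..=0x4DBF | 0x20000..=0x2A6DF => Some("Hani"),
        0x3041..=0x3096 => Some("Hira"),
        0x30A1..=0x30FA | 0x31F0..=0x31FF => Some("Kana"),
        0xAC00..=0xD7A3 | 0x1100..=0x11FF => Some("Hang"),
        _ => None,
    }
}
```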

### PDF `/Lang` Attribute

Tagged PDFs may carry a `/Lang` entry (BCP 47 language tag) on the document catalog, individual structure elements, or marked-content sequences. When present, `/Lang` is a strong prior:

- `ja` → expect Han + Hiragana + Katakana, writing mode potentially vertical.
- `ar` or `he` → expect RTL bidi direction, visual-order glyph storage likely.
- `zh-TW` vs. `zh-CN` → disambiguates Traditional vs. Simplified Han.

When `/Lang` is absent or when extracted text falls outside the declared language's expected scripts, fall back to character-level detection. Never suppress the fallback entirely: many PDFs carry a top-level `/Lang` that does not apply uniformly to all content (e.g., an English document with a Hebrew quotation).

---

## 2. Unicode Bidirectional Algorithm (UBA, UAX #9)

### Algorithm Structure

UAX #9 defines a multi-pass algorithm over a paragraph of codepoints. Each codepoint has a **bidi character type** (Strong: L/R/AL; Weak: EN/ES/ET/AN/CS/NSM/BN; Neutral: B/S/WS/ON; Explicit: LRE/RLE/LRO/RLO/PDF/LRI/RLI/FSI/PDI).

Key steps:

1. **Paragraph embedding level**: if the first strong character is R or AL, the paragraph is RTL (embedding level 1); otherwise LTR (level 0).
2. **Explicit level runs**: `LRE`/`RLE` push a new embedding level; `PDF` pops. The isolate controls (`LRI`/`RLI`/`FSI`/`PDI`, introduced in Unicode 6.3) create isolated bidi contexts that do not affect the surrounding paragraph's level stack.
3. **Weak type resolution**: sequences of weak types are resolved based on surrounding strong types per a finite-state table.
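4. **Neutral resolution**: neutral characters between two strong runs of the same direction take that direction; otherwise they take the embedding direction of the enclosing level run.
5. **Reorder**: once implicit levels are resolved, reorder each line by reversing every contiguous sequence of characters at a given level or higher, working from the highest level down to the lowest odd level (rule L2), to produce visual order.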
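
Step 1 is simple enough to sketch on its own. The example below is a simplified illustration under stated assumptions: `strong_type` and `paragraph_level` are hypothetical names, the bidi classes are approximated from a few hard-coded ranges (Hebrew and Arabic letters only, every other alphabetic character treated as L), and isolate handling is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strong {
    L,
    R,
    AL,
}

/// Approximate strong bidi type. Assumption: only Hebrew and Arabic letters
/// are recognized as RTL; a real implementation reads the Bidi_Class property.
fn strong_type(c: char) -> Option<Strong> {
    match c as u32 {
        0x05D0..=0x05EA => Some(Strong::R),
        0x0620..=0x064A | 0x0671..=0x06D3 => Some(Strong::AL),
        _ if c.is_alphabetic() => Some(Strong::L),
        _ => None,
    }
}

/// Paragraph embedding level from the first strong character:
/// 1 for R or AL, otherwise 0 (including when no strong character exists).
fn paragraph_level(text: &str) -> u8 {
    match text.chars().find_map(strong_type) {
        Some(Strong::R) | Some(Strong::AL) => 1,
        _ => 0,
    }
}
```

Digits and punctuation are skipped because they are weak or neutral, so a paragraph that opens with a number still takes its level from the first letter.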

### Why PDF Breaks Bidi

PDF authoring tools generally emit glyphs in **visual order** for RTL text rather than in logical (Unicode) order. The content stream places glyphs through the text matrix, and text-showing operators always advance in the writing-mode direction (rightward for horizontal text); nothing in the stream encodes reading direction. An Arabic sentence rendered right-to-left therefore often appears in the content stream starting from its leftmost glyph, which is the last character in logical order.

Consequences for extraction:

- Naively reading content-stream character codes left-to-right from a page produces reversed Arabic/Hebrew words.
- Mixed LTR/RTL content is interleaved in spatial order: the leftmost object on the page comes first in the stream, regardless of its logical position in the paragraph.

### Detecting and Reversing Visual-Order RTL

Detection heuristic: after Unicode recovery, if a run of characters with strong R or AL bidi type appears in left-to-right spatial order (i.e., X coordinates increase as the content-stream position increases), the run is stored in visual order and must be reversed. The threshold for "increasing X" should tolerate per-glyph kerning noise (±2 units in text space).

Reversal procedure:

1. Identify the visual-order run boundaries (the span between two LTR-direction glyphs or page-object boundaries).
2. Reverse the codepoint sequence within each RTL word (space-delimited or width-gap-delimited).
3. Apply UBA to the reassembled logical string to verify paragraph direction.

Note: some PDF producers (notably newer versions of Adobe Acrobat) do store RTL text in logical order with correct ToUnicode. The detection heuristic must be conditional, not unconditional.
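
A minimal sketch of the conditional check for a single run (one word, in the simplest case) follows. The `Glyph` type, the function names, and the hard-coded Hebrew/Arabic letter ranges are assumptions made for illustration; they are not `pdftract` APIs.

```rust
/// A recovered glyph: its Unicode character and X origin (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32,
}

/// Assumption: only basic Hebrew and Arabic letters count as strong RTL here.
fn is_rtl_letter(c: char) -> bool {
    matches!(c as u32, 0x05D0..=0x05EA | 0x0620..=0x064A)
}

/// True when the run contains RTL letters yet its X coordinates rise with
/// content-stream order, allowing `tolerance` units of kerning noise.
fn is_visual_order(run: &[Glyph], tolerance: f32) -> bool {
    let has_rtl = run.iter().any(|g| is_rtl_letter(g.ch));
    let rising = run.windows(2).all(|w| w[1].x >= w[0].x - tolerance);
    has_rtl && run.len() > 1 && rising
}

/// Returns the run's text in logical order, reversing only when the
/// visual-order heuristic fires.
fn to_logical(run: &[Glyph], tolerance: f32) -> String {
    if is_visual_order(run, tolerance) {
        run.iter().rev().map(|g| g.ch).collect()
    } else {
        run.iter().map(|g| g.ch).collect()
    }
}
```

A run already stored in logical order has decreasing X along the stream, so the check fails and the text passes through unchanged.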

---

## 3. Arabic and Hebrew Specifics

### Arabic Shaping and Presentation Forms

Arabic uses a joining model: each base letter has up to four contextual glyph forms (**isolated**, **initial**, **medial**, and **final**) determined by whether the character joins to the preceding and/or following letter. Critically, all four forms map to the same base Unicode codepoint. A PDF font may embed glyphs named `uniFE8D` (isolated alef) or `uniFE8E` (final alef); these are presentation forms from Arabic Presentation Forms-B (U+FE70–U+FEFF), while Presentation Forms-A (U+FB50–U+FDFF) holds ligatures and further contextual forms.

Normalization: apply Unicode compatibility normalization (NFKD or NFKC) to map presentation forms to their base codepoints. In Presentation Forms-A (U+FB50–U+FDFF), a few entries (e.g., U+FD3E ORNATE LEFT PARENTHESIS and U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) have no compatibility decomposition and pass through normalization unchanged; preserve them as-is. After normalization the shaping context is lost, but the logical character identity is recovered, which is what text extraction requires.

Mandatory ligatures such as **lam-alef** (U+0644 + U+0627 and variants) have precomposed forms in the presentation block. These should be expanded back to their two-codepoint sequences during normalization.
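
To make the mapping concrete, the sketch below collapses a handful of presentation forms with a hand-written table. It is an excerpt for illustration only: `collapse_presentation_form` and `normalize_arabic` are hypothetical names, and a real pipeline would more likely rely on a full NFKC implementation (for example the `unicode-normalization` crate) than on a manual table.

```rust
/// Maps a few Arabic Presentation Forms-B codepoints to their base sequences.
/// Assumption: only alef, lam, and the plain lam-alef ligature are covered.
fn collapse_presentation_form(c: char) -> Option<&'static str> {
    match c {
        // Alef: isolated and final forms.
        '\u{FE8D}' | '\u{FE8E}' => Some("\u{0627}"),
        // Lam: isolated, final, initial, and medial forms.
        '\u{FEDD}'..='\u{FEE0}' => Some("\u{0644}"),
        // Lam-alef ligature: isolated and final forms expand to two codepoints.
        '\u{FEFB}' | '\u{FEFC}' => Some("\u{0644}\u{0627}"),
        _ => None,
    }
}

/// Replaces every covered presentation form and leaves other characters as-is.
fn normalize_arabic(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match collapse_presentation_form(c) {
            Some(base) => out.push_str(base),
            None => out.push(c),
        }
    }
    out
}
```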

### Hebrew Vowel Points and Cantillation

Hebrew base letters (U+05D0–U+05EA) may be followed by **nikud** (vowel points, U+05B0–U+05C7) and **cantillation marks** (U+0591–U+05AF). These are combining characters with bidi type NSM (nonspacing mark), so they take the direction of the base letter they follow and stay attached to it in logical order. For plain-text extraction, nikud and cantillation can be optionally stripped or preserved depending on the output mode; `pdftract` should expose a normalization flag `strip_combining_marks: bool` per script.
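
One possible shape for the stripping step is sketched below. The function names are hypothetical; only the `strip_combining_marks` flag comes from the text above. Note that a few codepoints inside U+05B0–U+05C7 are punctuation rather than combining marks (U+05BE MAQAF, U+05C0 PASEQ, U+05C3 SOF PASUQ, U+05C6 NUN HAFUKHA), so the sketch leaves them in place.

```rust
/// True for Hebrew cantillation marks and points that are combining marks.
fn is_hebrew_mark(c: char) -> bool {
    matches!(
        c as u32,
        0x0591..=0x05AF            // cantillation marks
            | 0x05B0..=0x05BD      // points up to meteg
            | 0x05BF               // rafe
            | 0x05C1..=0x05C2      // shin dot, sin dot
            | 0x05C4..=0x05C5      // upper and lower dot
            | 0x05C7               // qamats qatan
    )
}

/// Removes Hebrew combining marks when the flag is set; otherwise returns
/// the text unchanged.
fn strip_hebrew_marks(text: &str, strip_combining_marks: bool) -> String {
    if !strip_combining_marks {
        return text.to_string();
    }
    text.chars().filter(|c| !is_hebrew_mark(*c)).collect()
}
```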

### RTL Word Boundaries Without Spaces

Some Arabic PDFs omit inter-word spaces in the content stream (words are positioned by glyph advances rather than space characters). Word boundary detection falls back to **X-gap analysis**: a gap between adjacent glyphs significantly larger than the average intra-word advance (heuristic: > 0.25 × em) is treated as a word boundary.
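
The gap test can be sketched as follows, assuming glyphs arrive sorted by increasing X with known advances. `Glyph`, `split_words`, and the `font_size` parameter (used as the em size) are hypothetical names for this illustration.

```rust
/// Glyph with its X origin and advance width, both in the same units as
/// `font_size` (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32,
    advance: f32,
}

/// Splits glyphs (assumed sorted by increasing X) into words wherever the
/// gap to the previous glyph's end exceeds 0.25 em.
fn split_words(glyphs: &[Glyph], font_size: f32) -> Vec<String> {
    let threshold = 0.25 * font_size;
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_end: Option<f32> = None;
    for g in glyphs {
        if let Some(end) = prev_end {
            if g.x - end > threshold && !current.is_empty() {
                // Gap is wide enough: close the current word.
                words.push(std::mem::take(&mut current));
            }
        }
        current.push(g.ch);
        prev_end = Some(g.x + g.advance);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

The split operates in spatial order; any visual-to-logical reversal for RTL text is a separate step.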

---

## 4. CJK Handling

### Horizontal vs. Vertical Writing Modes

PDF CMaps carry a `/WMode` entry: `0` = horizontal, `1` = vertical. Predefined CMaps come in pairs: a horizontal CMap (name ending in `-H`) and a vertical CMap (name ending in `-V`). A Type 0 font dictionary names exactly one CMap in its `/Encoding` entry, so a document that sets the same CIDFont in both directions defines two font resources, and the content stream switches between them with the `Tf` operator.

CJK punctuation normalization: the fullwidth forms U+FF01–U+FF5E are compatibility equivalents of ASCII U+0021–U+007E (U+FF5F and U+FF60 map to the white parentheses U+2985 and U+2986 rather than to ASCII). For prose extraction, map fullwidth to halfwidth via NFKC unless the output is destined for layout-sensitive consumers. The `pdftract` normalization pipeline should apply NFKC only to `Common`-script fullwidth/halfwidth punctuation, not to Han or Kana characters: NFKC also folds halfwidth Katakana (U+FF66–U+FF9F) into their fullwidth forms, and those should be preserved as extracted.

### CJK Line-Break Rules (UAX #14)

The Unicode Line Breaking Algorithm (UAX #14) defines **non-starter** characters (closing brackets, closing quotation marks, Japanese small kana: ぁぃぅぇぉっゃゅょ) that cannot begin a line, and **non-ender** characters (opening brackets) that cannot end a line. When `pdftract` reassembles lines from individual glyphs, these rules inform the merge heuristic: a glyph with a non-starter break class that appears at the apparent start of a new line in the spatial layout should be joined to the preceding line.
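
A small sketch of this merge heuristic is shown below. The character list is a short hand-picked excerpt (a few closing brackets plus the small kana named above), not the full UAX #14 class data, and both function names are hypothetical.

```rust
/// Excerpt of characters that must not start a line (assumption: a real
/// implementation looks up the UAX #14 line-break class instead).
fn is_non_starter(c: char) -> bool {
    matches!(
        c,
        '）' | '」' | '』' | '】' | '〕'
            | 'ぁ' | 'ぃ' | 'ぅ' | 'ぇ' | 'ぉ' | 'っ' | 'ゃ' | 'ゅ' | 'ょ'
    )
}

/// Joins any line that begins with a non-starter onto the preceding line.
fn merge_lines(lines: &[&str]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for &line in lines {
        let starts_with_non_starter = line.chars().next().map_or(false, is_non_starter);
        if starts_with_non_starter && !merged.is_empty() {
            let last = merged.len() - 1;
            merged[last].push_str(line);
        } else {
            merged.push(line.to_string());
        }
    }
    merged
}
```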

---

## 5. Vertical Text

### PDF Encoding of Vertical CJK

In vertical writing mode (`/WMode 1`), glyphs stay upright and the text position advances downward rather than rightward: the displacement comes from the CIDFont's vertical metrics (`/W2`, with `/DW2` as the default) instead of the horizontal widths. Some producers instead emulate vertical layout by setting horizontal-mode glyphs under a text matrix rotated by 90 degrees, in which case the glyph's horizontal advance runs along the page's vertical axis.

Detection: examine the `Tm` (text matrix) operator. A matrix of the form `[0 -1 1 0 tx ty]` or `[0 1 -1 0 tx ty]` indicates vertical text. Combined with `/WMode 1` in the CMap, this is a reliable signal.

Reconstruction: to recover horizontal reading order from a vertical column:

1. Sort glyphs by decreasing Y within a column (top-to-bottom).
2. Sort columns by decreasing X (columns progress right-to-left, which is the default for Japanese and Chinese vertical text).
3. Assign direction `ttb` to the span.
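
The detection test and the column reconstruction can be sketched together. Everything here is illustrative: `Glyph`, `ColumnOrder`, `is_rotated_90`, and `read_vertical` are hypothetical names, the column progression is passed in as a parameter rather than assumed, and columns are grouped by a simple X tolerance.

```rust
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnOrder {
    LeftToRight,
    RightToLeft,
}

/// True for text matrices of the form [0 b c 0 tx ty] with b and c of
/// opposite sign, i.e. a 90-degree rotation with any uniform scale.
fn is_rotated_90(tm: [f32; 6]) -> bool {
    let [a, b, c, d, _tx, _ty] = tm;
    let scale = b.abs().max(c.abs());
    scale > 0.0 && a.abs() <= 1e-3 * scale && d.abs() <= 1e-3 * scale && b * c < 0.0
}

/// Groups glyphs into columns by X, orders each column top to bottom
/// (decreasing Y), then concatenates the columns in the requested order.
fn read_vertical(glyphs: &[Glyph], column_tolerance: f32, order: ColumnOrder) -> String {
    let mut sorted = glyphs.to_vec();
    // Sort by X so that glyphs belonging to one column become adjacent.
    sorted.sort_by(|p, q| p.x.total_cmp(&q.x));
    let mut columns: Vec<Vec<Glyph>> = Vec::new();
    for g in sorted {
        let start_new = match columns.last() {
            Some(col) => (g.x - col[0].x).abs() > column_tolerance,
            None => true,
        };
        if start_new {
            columns.push(Vec::new());
        }
        let last = columns.len() - 1;
        columns[last].push(g);
    }
    for col in &mut columns {
        col.sort_by(|p, q| q.y.total_cmp(&p.y));
    }
    if order == ColumnOrder::RightToLeft {
        columns.reverse();
    }
    columns.iter().flatten().map(|g| g.ch).collect()
}
```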

### Tate-Chu-Yoko

Tate-chu-yoko (縦中横) is a typographic convention where a short horizontal sequence (typically 2–4 Latin characters or digits) is set horizontally within a vertical line. In PDF, these glyphs appear without the 90-degree rotation applied to surrounding CJK glyphs. Detection: within a vertical column, glyphs with a non-rotated text matrix and Latin/digit script classification form a tate-chu-yoko inline sequence. They should be extracted as a single horizontal sub-span with direction `ltr`, embedded within the enclosing `ttb` span.

---

## 6. Ligatures and Script-Specific Normalization

### Unicode Normalization Forms

| Form | Definition | Use in pdftract |
|------|-----------|----------------|
| NFC | Canonical decomposition then canonical composition | Default for Latin, Greek, Cyrillic output |
| NFD | Canonical decomposition only | Internal processing of combining marks |
| NFKC | Compatibility decomposition then canonical composition | Arabic presentation forms, fullwidth CJK punctuation |
| NFKD | Compatibility decomposition only | Intermediate step for specific scripts |

Apply NFKC selectively: Arabic (to collapse presentation forms), fullwidth punctuation (U+FF01–U+FF60), and Latin ligatures from the Alphabetic Presentation Forms block (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st).

### Latin Ligatures

The glyphs `fi`, `fl`, `ff`, `ffi`, `ffl` have explicit Unicode codepoints (U+FB01, U+FB02, U+FB00, U+FB03, U+FB04). PDF fonts commonly use these as single glyphs mapped via ToUnicode to either the precomposed ligature or its constituent character sequence. For text search and NLP compatibility, always expand to the constituent characters: `fi` → U+0066 U+0069. Preserve the original ligature codepoint in a `raw_codepoints` field if the consumer needs to reconstruct original layout.
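
A minimal sketch of the expansion follows. The `raw_codepoints` field name comes from the text above; its type, the `Expanded` struct, and the function names are assumptions made for illustration.

```rust
/// Expansion table for the five f-ligatures in U+FB00–U+FB04.
fn expand_ligature(c: char) -> Option<&'static str> {
    match c {
        '\u{FB00}' => Some("ff"),
        '\u{FB01}' => Some("fi"),
        '\u{FB02}' => Some("fl"),
        '\u{FB03}' => Some("ffi"),
        '\u{FB04}' => Some("ffl"),
        _ => None,
    }
}

/// Expanded text plus the codepoints as originally extracted.
struct Expanded {
    text: String,
    raw_codepoints: Vec<u32>,
}

/// Expands ligatures into their constituent letters while recording the
/// original codepoint sequence.
fn expand_ligatures(input: &str) -> Expanded {
    let mut text = String::with_capacity(input.len());
    let mut raw_codepoints = Vec::new();
    for c in input.chars() {
        raw_codepoints.push(c as u32);
        match expand_ligature(c) {
            Some(parts) => text.push_str(parts),
            None => text.push(c),
        }
    }
    Expanded { text, raw_codepoints }
}
```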

### Devanagari Conjunct Consonants

Devanagari conjunct consonants (Sanskrit: saṃyuktākṣara) are encoded in Unicode as a base consonant + virama (U+094D) + following consonant. PDF fonts may embed precomposed conjunct glyphs that have no standard Unicode representation. Recovery requires mapping via the font's glyph name (e.g., `kka` → U+0915 U+094D U+0915) using a glyph-name-to-sequence table. NFD decomposition of Devanagari preserves the logical structure and should be preferred over NFC for output.

---

## 7. Language Detection

### Statistical and Dictionary Approaches

For runs of 50+ characters with a known script, statistical **n-gram language identification** is reliable. The `whatlang` crate (Rust) uses trigram frequency profiles for 69 languages; the `lingua` crate supports 75 languages and combines n-gram models of sizes 1 through 5 for higher accuracy, at the cost of a larger compiled profile set. Both operate on `&str` input and can report a confidence value alongside the detected language.

For shorter spans (10–50 characters), dictionary-based detection — checking whether the top-N most frequent words from a candidate language appear in the span — outperforms n-gram models. Maintain per-script stop-word lists (the 200 most frequent words per language) compiled into the binary.

### Using `/Lang` as a Prior

When the PDF supplies `/Lang`, use it to bias detection: if the extracted text scores at or above 0.4 confidence for the declared language, accept the declaration. If the text scores below 0.4 for the declared language but at or above 0.7 for another, emit a `lang_conflict` warning and use the detected language. If detection confidence is below 0.4 for all candidates, emit `und` (undetermined).

Confidence threshold summary:

| Condition | Output |
|-----------|--------|
| `/Lang` present, detection ≥ 0.4 for declared | Use `/Lang` tag |
| `/Lang` present, conflict detected (other ≥ 0.7) | Use detected tag, warn |
| `/Lang` absent, detection ≥ 0.6 | Use detected tag |
| Any path, confidence < 0.4 | `und` |
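
The table can be read as a small decision function, sketched below. `LangDecision` and `decide_lang` are hypothetical names, and tags are compared by exact string match. The table leaves two bands undefined (`/Lang` absent with confidence in [0.4, 0.6), and `/Lang` present with the declared language below 0.4 and no other candidate at 0.7 or above); the sketch maps both to `und`, which is an assumption rather than something the table specifies.

```rust
#[derive(Debug, PartialEq)]
enum LangDecision {
    /// Use the `/Lang` tag as declared.
    Declared(String),
    /// Use the detector's tag; `conflict` is true when it overrides `/Lang`.
    Detected { lang: String, conflict: bool },
    /// Emit `und`.
    Undetermined,
}

/// `scores` holds (BCP 47 tag, confidence) pairs from some language detector.
fn decide_lang(declared: Option<&str>, scores: &[(&str, f64)]) -> LangDecision {
    let best = scores.iter().copied().max_by(|a, b| a.1.total_cmp(&b.1));
    let (best_lang, best_conf) = match best {
        Some(pair) => pair,
        None => return LangDecision::Undetermined,
    };
    // Any path: no candidate reaches 0.4.
    if best_conf < 0.4 {
        return LangDecision::Undetermined;
    }
    match declared {
        Some(tag) => {
            let declared_conf = scores
                .iter()
                .copied()
                .find(|&(l, _)| l == tag)
                .map_or(0.0, |(_, c)| c);
            if declared_conf >= 0.4 {
                LangDecision::Declared(tag.to_string())
            } else if best_conf >= 0.7 {
                LangDecision::Detected { lang: best_lang.to_string(), conflict: true }
            } else {
                // Assumption: band not covered by the table.
                LangDecision::Undetermined
            }
        }
        None => {
            if best_conf >= 0.6 {
                LangDecision::Detected { lang: best_lang.to_string(), conflict: false }
            } else {
                // Assumption: the 0.4..0.6 band is not covered by the table.
                LangDecision::Undetermined
            }
        }
    }
}
```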

---

## 8. Output Metadata on Spans and Blocks

Each extracted `Span` and `Block` in the `pdftract` JSON output carries the following language and script metadata:

```json
{
  "text": "مرحباً بالعالم",
  "lang": "ar",
  "script": "Arab",
  "direction": "rtl",
  "normalization": ["nfkc", "visual_order_reversed"],
  "lang_confidence": 0.92,
  "writing_mode": "horizontal"
}
```

Field definitions:

- **`lang`** — BCP 47 language tag (e.g., `ar`, `he`, `ja`, `zh-TW`, `und`). Sourced from `/Lang` or detection.
- **`script`** — ISO 15924 four-letter script code (e.g., `Arab`, `Hebr`, `Hani`, `Hira`, `Hang`, `Deva`, `Thai`, `Latn`). Derived from UAX #24 per-codepoint classification, taking the dominant script of the span.
- **`direction`** — One of `ltr`, `rtl`, or `ttb`. Derived from UBA paragraph direction for horizontal text; `ttb` set when vertical writing mode is detected via text-matrix analysis and `/WMode 1` (Section 5).
- **`normalization`** — Array of normalization operations applied, in application order. Valid values: `nfc`, `nfkc`, `nfd`, `nfkd`, `visual_order_reversed`, `ligature_expanded`, `presentation_forms_collapsed`, `combining_marks_stripped`.
- **`lang_confidence`** — Float in [0.0, 1.0] from the language detector. Omitted when `lang` is sourced from `/Lang` and no conflict was detected. Set to `null` when `lang` is `und`.
- **`writing_mode`** — `horizontal` or `vertical`. `vertical` implies `direction` is `ttb`; tate-chu-yoko sub-spans within a vertical block carry `direction: ltr` and `writing_mode: horizontal`.

Blocks aggregate span metadata: the `script` and `lang` of a block are the modal values across its constituent spans. Blocks containing spans from more than one script carry a `mixed_script: true` flag and list all scripts in a `scripts` array alongside the dominant `script` field.
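
Block-level aggregation of the `script` field might look like the sketch below. `BlockScripts` and `aggregate_scripts` are hypothetical names; the struct fields mirror the `script`, `scripts`, and `mixed_script` keys described above. Tie-breaking between equally frequent scripts is not defined in this document, so the sketch simply takes whatever `max_by_key` returns.

```rust
use std::collections::BTreeMap;

/// Script metadata for a block, mirroring the JSON keys described above.
struct BlockScripts {
    script: String,
    scripts: Vec<String>,
    mixed_script: bool,
}

/// Computes the modal script over the spans' ISO 15924 codes.
/// Returns `None` for a block with no spans.
fn aggregate_scripts(span_scripts: &[&str]) -> Option<BlockScripts> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for &s in span_scripts {
        *counts.entry(s).or_insert(0) += 1;
    }
    // On a tie, `max_by_key` keeps the last maximum in iteration order,
    // which for a BTreeMap is the alphabetically last code.
    let (dominant, _) = counts.iter().max_by_key(|(_, n)| **n)?;
    Some(BlockScripts {
        script: dominant.to_string(),
        scripts: counts.keys().map(|s| s.to_string()).collect(),
        mixed_script: counts.len() > 1,
    })
}
```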