# Language Detection and Script Handling in pdftract

## Overview

Multilingual PDF documents expose three distinct problems for a text extraction library: identifying which Unicode script a sequence of codepoints belongs to, reconstructing logical order from glyphs that may have been stored in visual order, and normalizing script-specific presentation variants to canonical Unicode forms. This document covers each problem, the relevant standards, and the implementation strategy for `pdftract`.

---

## 1. Script Detection from Glyph Data

### Unicode Script Property (UAX #24)

Every Unicode codepoint carries a `Script` property defined in UAX #24. The Unicode Character Database (UCD) ships `Scripts.txt` and the companion `ScriptExtensions.txt`. Script extensions matter because some codepoints (most common-use punctuation, the digits U+0030–U+0039, and combining marks) are legitimately shared across scripts and carry the `Common` or `Inherited` value rather than a specific script name.

A `pdftract` span classifier should resolve script assignments in this priority order:

1. **Specific script**: codepoints with a single non-`Common`, non-`Inherited` script assignment are classified directly.
2. **Script extensions**: codepoints with multiple entries in `ScriptExtensions.txt` (e.g., U+0300 COMBINING GRAVE ACCENT extends into `Latin`, `Greek`, `Cyrillic`) take the script of the surrounding run, provided that script is in their extension set.
3. **Common/Inherited**: treated as transparent; they attach to the script of the nearest resolved codepoint within the same bidi run.

### Mixed-Script Spans

A single PDF text object can contain codepoints from multiple scripts (e.g., a Japanese sentence with embedded Latin product names). The standard approach is **script-run segmentation**: scan the codepoint sequence in logical order, maintaining a current script state, and emit a new span boundary whenever the resolved script changes from one specific value to another. `Common` and `Inherited` codepoints do not trigger boundaries.

The `ScriptExtensions` data and the transparency rule together suppress spurious splits: if a `Common` punctuation character appears between two Latin spans with no intervening RTL text, it stays in the Latin span rather than producing a one-character `Common` fragment.

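
The segmentation rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the `pdftract` implementation: the `Script` enum, `script_of`, and `script_runs` are hypothetical names, and the range table covers only a handful of codepoints instead of the full `Scripts.txt` data. Leading `Common` codepoints join the first run and all others join the run that is already open, which is one simple reading of "nearest resolved codepoint".

```rust
/// Coarse script classes; a real classifier would be generated from Scripts.txt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Han,
    Hiragana,
    Katakana,
    Common, // also stands in for Inherited and anything not listed below
}

/// Hypothetical range lookup covering only the ranges needed for this example.
fn script_of(c: char) -> Script {
    match c as u32 {
        0x0041..=0x005A | 0x0061..=0x007A => Script::Latin,
        0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0x20000..=0x2A6DF => Script::Han,
        0x3041..=0x3096 => Script::Hiragana,
        0x30A1..=0x30FA | 0x31F0..=0x31FF => Script::Katakana,
        _ => Script::Common,
    }
}

/// Split `text` into (script, substring) runs. Common codepoints never open a
/// boundary: they stay in whichever run is currently open.
fn script_runs(text: &str) -> Vec<(Script, &str)> {
    let mut runs = Vec::new();
    let mut start = 0; // byte offset where the current run begins
    let mut current = Script::Common;
    for (idx, ch) in text.char_indices() {
        let s = script_of(ch);
        if s == Script::Common {
            continue;
        }
        if current != Script::Common && s != current {
            runs.push((current, &text[start..idx]));
            start = idx;
        }
        current = s;
    }
    if !text.is_empty() {
        runs.push((current, &text[start..]));
    }
    runs
}
```

With this sketch, the comma and spaces in `"abc, def 漢字"` stay inside the Latin run, and the only boundary falls at the first Han character.
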
### CJK Script Identification

CJK text requires distinguishing four scripts that routinely co-occur in the same run:

| Script | Key Ranges |
|--------|-----------|
| Han | U+4E00–U+9FFF (CJK Unified Ideographs), U+3400–U+4DBF (Extension A), U+20000–U+2A6DF (Extension B) |
| Hiragana | U+3041–U+3096 |
| Katakana | U+30A1–U+30FA, U+31F0–U+31FF |
| Hangul | U+AC00–U+D7A3 (syllables), U+1100–U+11FF (jamo) |

Han is shared across Chinese, Japanese, and Korean. Language detection (Section 7) must disambiguate Han-dominant runs; script detection alone cannot.

### PDF `/Lang` Attribute

Tagged PDFs may carry a `/Lang` entry (BCP 47 language tag) on the document catalog, individual structure elements, or marked-content sequences. When present, `/Lang` is a strong prior:

- `ja` → expect Han + Hiragana + Katakana, writing mode potentially vertical.
- `ar` or `he` → expect RTL bidi direction, visual-order glyph storage likely.
- `zh-TW` vs. `zh-CN` → disambiguates Traditional vs. Simplified Han.

When `/Lang` is absent or when extracted text falls outside the declared language's expected scripts, fall back to character-level detection. Never suppress the fallback entirely: many PDFs carry a top-level `/Lang` that does not apply uniformly to all content (e.g., an English document with a Hebrew quotation).

---

## 2. Unicode Bidirectional Algorithm (UBA, UAX #9)

### Algorithm Structure

UAX #9 defines a multi-pass algorithm over a paragraph of codepoints. Each codepoint has a **bidi character type** (Strong: L/R/AL; Weak: EN/ES/ET/AN/CS/NSM/BN; Neutral: B/S/WS/ON; Explicit: LRE/RLE/LRO/RLO/PDF/LRI/RLI/FSI/PDI).

Key steps:

1. **Paragraph embedding level** (rules P2–P3): if the first strong character is R or AL, the paragraph is RTL (embedding level 1); otherwise it is LTR (level 0). Characters inside an isolate are skipped during this search.
2. **Explicit level runs** (rules X1–X8): `LRE`/`RLE` (and the overrides `LRO`/`RLO`) push a new embedding level; `PDF` pops it. The isolate controls (`LRI`/`RLI`/`FSI`/`PDI`, introduced in Unicode 6.3) create isolated bidi contexts that do not affect the surrounding paragraph's level stack.
3. **Weak type resolution** (rules W1–W7): sequences of weak types are resolved from the surrounding strong types by a fixed sequence of rules.
4. **Neutral resolution** (rules N0–N2): neutral characters between two strong runs of the same direction take that direction; otherwise they take the embedding direction of their level run.
5. **Reorder** (rule L2): for each line, from the highest level down to the lowest odd level, reverse every contiguous sequence of characters at that level or higher to produce visual order.

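
Step 1 is simple enough to sketch directly. The following Rust is an illustration under stated assumptions: `strong_type` is a hypothetical stand-in that recognizes only ASCII letters, Hebrew letters, and the main Arabic letter ranges, whereas a real build would take bidi classes from the UCD. The isolate handling follows rule P2, which skips everything between an isolate initiator and its matching `PDI`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strong {
    L,
    R,
    AL,
}

/// Very rough strong-type lookup (assumption: real data comes from the UCD).
fn strong_type(c: char) -> Option<Strong> {
    match c as u32 {
        0x0041..=0x005A | 0x0061..=0x007A => Some(Strong::L),
        0x05D0..=0x05EA => Some(Strong::R), // Hebrew letters
        0x0620..=0x064A | 0x0671..=0x06D3 => Some(Strong::AL), // Arabic letters
        _ => None,
    }
}

/// Rules P2 and P3: level 1 if the first strong character is R or AL, else 0.
/// Text inside an isolate (LRI/RLI/FSI up to the matching PDI) is skipped.
fn paragraph_level(text: &str) -> u8 {
    let mut isolate_depth = 0u32;
    for c in text.chars() {
        match c {
            '\u{2066}' | '\u{2067}' | '\u{2068}' => isolate_depth += 1,
            '\u{2069}' => isolate_depth = isolate_depth.saturating_sub(1),
            _ if isolate_depth == 0 => match strong_type(c) {
                Some(Strong::L) => return 0,
                Some(Strong::R) | Some(Strong::AL) => return 1,
                None => {}
            },
            _ => {}
        }
    }
    0 // no strong character found: default to LTR
}
```

Digits and punctuation are not strong, so a paragraph such as `123` followed by Arabic letters still resolves to level 1.
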
### Why PDF Breaks Bidi

PDF authoring tools generally emit glyphs in **visual order** for RTL text rather than in logical (Unicode) order. The content stream places glyphs through the text matrix, and the implicit advance after each glyph always moves in the same direction along the baseline, so nothing in the stream encodes reading direction. An Arabic sentence rendered right-to-left therefore typically appears in the content stream starting from its leftmost glyph, which is the last character in logical order.

Consequences for extraction:

- Naively reading content-stream character codes in stream order produces reversed Arabic/Hebrew words.
- Mixed LTR/RTL content is interleaved in spatial order: the leftmost object on the page typically comes first in the stream, regardless of its logical position in the paragraph.

### Detecting and Reversing Visual-Order RTL

Detection heuristic: after Unicode recovery, if a run of characters with strong R or AL bidi type appears in left-to-right spatial order (i.e., X coordinates increase as the content-stream position increases), the run is stored in visual order and must be reversed. The threshold for "increasing X" should tolerate per-glyph kerning noise (±2 units in text space).

Reversal procedure:

1. Identify the visual-order run boundaries (the span between two LTR-direction glyphs or page-object boundaries).
2. Reverse the run: reverse the order of its words (space-delimited or width-gap-delimited) and the codepoint sequence within each RTL word. Digit sequences are displayed left-to-right even inside RTL text, so they keep their stored order.
3. Apply UBA to the reassembled logical string to verify paragraph direction.

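
A minimal sketch of the detection heuristic and reversal step might look like the following. The `Glyph` struct and both function names are assumptions made for illustration, the RTL test covers only Hebrew and the main Arabic letter ranges, and combining marks are not regrouped with their base letters, which a production version would need to handle.

```rust
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32, // glyph origin X, listed in content-stream order
}

fn is_rtl_strong(c: char) -> bool {
    matches!(c as u32, 0x05D0..=0x05EA | 0x0620..=0x064A | 0x0671..=0x06D3)
}

/// A run counts as visual-order RTL when it contains RTL letters and no X
/// coordinate steps backwards by more than `tolerance`.
fn is_visual_order_rtl(run: &[Glyph], tolerance: f32) -> bool {
    run.iter().any(|g| is_rtl_strong(g.ch))
        && run.windows(2).all(|w| w[1].x >= w[0].x - tolerance)
}

/// Reverse the whole run (word order and letter order together), then put
/// ASCII digit sequences back into their stored left-to-right order.
fn to_logical(run: &[Glyph]) -> String {
    let mut chars: Vec<char> = run.iter().rev().map(|g| g.ch).collect();
    let mut i = 0;
    while i < chars.len() {
        if chars[i].is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            chars[start..i].reverse();
        } else {
            i += 1;
        }
    }
    chars.into_iter().collect()
}
```

Reversing the full run handles word order and letter order in one pass; the digit fix-up afterwards is what keeps a year such as 2024 from coming out as 4202.
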
Note: some PDF producers (notably newer versions of Adobe Acrobat) store RTL text in logical order with a correct `ToUnicode` map. Reversal must therefore be applied only when the detection heuristic fires, never unconditionally.

---

## 3. Arabic and Hebrew Specifics

### Arabic Shaping and Presentation Forms

Arabic uses a joining model: each base letter has up to four contextual glyph forms (**isolated**, **initial**, **medial**, and **final**) determined by whether the character joins to the preceding and/or following letter. Critically, all four forms map to the same base Unicode codepoint. A PDF font may embed glyphs named `uniFE8D` (isolated alef) or `uniFE8E` (final alef); these names refer to codepoints in the Arabic Presentation Forms blocks, U+FB50–U+FDFF (Presentation Forms-A) and U+FE70–U+FEFF (Presentation Forms-B).

Normalization: apply Unicode compatibility normalization (NFKD or NFKC) to map presentation forms to their base codepoints. In Presentation Forms-A (U+FB50–U+FDFF), some entries (e.g., U+FD3E ORNATE LEFT PARENTHESIS) have no compatibility decomposition and should be preserved as-is. After normalization, the shaping context is lost, but the logical character identity is recovered, which is what text extraction requires.

Mandatory ligatures such as **lam-alef** (U+0644 + U+0627 and variants) have precomposed forms in Presentation Forms-B (U+FEF5–U+FEFC). These should be expanded back to their two-codepoint sequences during normalization.

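
To make the mapping concrete, here is a small hand-written sketch covering a few presentation forms. It is not a substitute for real NFKC: the function names are hypothetical and the table lists only alef, beh, lam, and the plain lam-alef ligature, whereas a full table would be derived from the UCD or supplied by a normalization library.

```rust
/// Map a presentation-form codepoint to its base sequence. Only a handful of
/// entries are listed here; everything else returns None.
fn collapse_presentation_form(c: char) -> Option<&'static str> {
    match c as u32 {
        0xFE8D | 0xFE8E => Some("\u{0627}"),         // alef: isolated, final
        0xFE8F..=0xFE92 => Some("\u{0628}"),         // beh: all four forms
        0xFEDD..=0xFEE0 => Some("\u{0644}"),         // lam: all four forms
        0xFEFB | 0xFEFC => Some("\u{0644}\u{0627}"), // lam-alef ligature
        _ => None,
    }
}

/// Replace every listed presentation form with its base codepoints and pass
/// all other characters through unchanged.
fn collapse_arabic(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match collapse_presentation_form(c) {
            Some(base) => out.push_str(base),
            None => out.push(c), // includes forms with no decomposition
        }
    }
    out
}
```

Note that the lam-alef entry expands one input codepoint into two, so callers that track per-glyph offsets need to account for the length change.
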
### Hebrew Vowel Points and Cantillation

Hebrew base letters (U+05D0–U+05EA) may be followed by **nikud** (vowel points, U+05B0–U+05C7) and **cantillation marks** (U+0591–U+05AF). These are combining characters with bidi type NSM (nonspacing mark), which means they take the direction of the preceding base letter and stay attached to it in logical order. For plain-text extraction, nikud and cantillation can be optionally stripped or preserved depending on the output mode; `pdftract` should expose a normalization flag `strip_combining_marks: bool` per script.

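
A sketch of what the flag could do for Hebrew is shown below. Only `strip_combining_marks` comes from the text above; the function names are hypothetical. The ranges quoted for nikud also contain four punctuation characters (maqaf, paseq, sof pasuq, nun hafukha), which this sketch deliberately keeps.

```rust
/// Hebrew points and cantillation marks, using the ranges quoted above minus
/// the punctuation characters that sit inside them.
fn is_hebrew_mark(c: char) -> bool {
    match c as u32 {
        0x05BE | 0x05C0 | 0x05C3 | 0x05C6 => false, // maqaf, paseq, sof pasuq, nun hafukha
        0x0591..=0x05AF | 0x05B0..=0x05C7 => true,
        _ => false,
    }
}

/// Drop Hebrew combining marks when the flag is set; otherwise return the
/// text unchanged.
fn normalize_hebrew(text: &str, strip_combining_marks: bool) -> String {
    if strip_combining_marks {
        text.chars().filter(|&c| !is_hebrew_mark(c)).collect()
    } else {
        text.to_owned()
    }
}
```
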
### RTL Word Boundaries Without Spaces

Some Arabic PDFs omit inter-word spaces in the content stream (words are positioned by glyph advances rather than space characters). Word boundary detection falls back to **X-gap analysis**: a gap between adjacent glyphs significantly larger than the average intra-word advance (heuristic: > 0.25 × em) is treated as a word boundary.

---

## 4. CJK Handling

### Horizontal vs. Vertical Writing Modes

PDF CMaps carry a `/WMode` entry: `0` = horizontal, `1` = vertical. The predefined CMaps come in pairs, one horizontal (name ending in `-H`) and one vertical (name ending in `-V`). A Type 0 font dictionary names exactly one CMap in its `/Encoding` entry, so a document that uses both modes defines two font resources over the same CIDFont, and the content stream selects between them with the `Tf` operator.

CJK punctuation normalization: the fullwidth forms U+FF01–U+FF5E are compatibility equivalents of ASCII U+0021–U+007E (U+FF5F and U+FF60 map to the white parentheses U+2985 and U+2986). For prose extraction, map fullwidth to halfwidth via NFKC unless the output is destined for layout-sensitive consumers. The `pdftract` normalization pipeline should apply NFKC only to `Common`-script fullwidth/halfwidth punctuation, not to Han or Kana characters (NFKC folds halfwidth Katakana into the standard Katakana block, a distinction that should be preserved).

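
Because the fullwidth ASCII variants sit at a fixed offset of 0xFEE0 from their ASCII counterparts, the selective fold can be sketched without a normalization library. The function name is hypothetical, and the choice to fold digits along with punctuation is an assumption based on their `Common` script value; fullwidth Latin letters are left alone because their script is `Latin`.

```rust
/// Fold fullwidth ASCII variants (U+FF01..=U+FF5E) to ASCII, but only when
/// the result is punctuation or a digit. Fullwidth Latin letters, Han, and
/// Kana pass through unchanged.
fn fold_fullwidth_punctuation(text: &str) -> String {
    text.chars()
        .map(|c| match c as u32 {
            cp @ 0xFF01..=0xFF5E => {
                // Fixed offset between the fullwidth block and ASCII.
                let ascii = char::from_u32(cp - 0xFEE0).unwrap_or(c);
                if ascii.is_ascii_punctuation() || ascii.is_ascii_digit() {
                    ascii
                } else {
                    c
                }
            }
            _ => c,
        })
        .collect()
}
```
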
### CJK Line-Break Rules (UAX #14)

The Unicode Line Breaking Algorithm (UAX #14) assigns each character a line-break class. Characters that cannot begin a line include closing brackets (classes `CL` and `CP`), closing quotation marks, and nonstarters (class `NS`, together with the Japanese small kana ぁぃぅぇぉっゃゅょ, class `CJ`, which strict line breaking treats as `NS`). Opening brackets (class `OP`) cannot end a line. When `pdftract` reassembles lines from individual glyphs, these rules inform the merge heuristic: a glyph with a non-starter break class that appears at the apparent start of a new line in the spatial layout should be joined to the preceding line.

---

## 5. Vertical Text

### PDF Encoding of Vertical CJK

In vertical writing mode (`/WMode 1`), no rotation is involved: glyphs stay upright and the text position advances down the page after each glyph. The advance comes from the CIDFont's vertical metrics (`/W2`, with `/DW2` as the default) rather than from the horizontal widths, and each glyph's origin sits at the top center of its box instead of at the left end of the baseline. A rotated text matrix is a separate case: it turns horizontal-mode text on its side, as used for sideways Latin runs or rotated labels.

Detection: `/WMode 1` in the CMap is the primary signal for vertical text. As a secondary signal, examine the `Tm` (text matrix) operator: a matrix of the form `[0 -1 1 0 tx ty]` or `[0 1 -1 0 tx ty]` indicates text rotated by 90 degrees, which advances along the page's Y axis even though the font is in horizontal mode. Horizontal-mode glyphs placed one at a time at a constant X and decreasing Y are a third pattern worth checking.

Reconstruction: to recover logical reading order from vertical columns:

1. Sort glyphs by decreasing Y within a column (top-to-bottom).
2. Sort columns by decreasing X (columns of vertical Japanese and Chinese text run right-to-left across the page).
3. Assign direction `ttb` to the span.

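
The reconstruction steps can be sketched as follows. The `PlacedGlyph` struct, the function name, and the `col_tolerance` parameter are all assumptions for illustration; the sketch orders columns right to left, which is the usual column progression for vertical Japanese and Chinese text, and clusters columns by simple X proximity to the first glyph seen in each column.

```rust
#[derive(Debug, Clone, Copy)]
struct PlacedGlyph {
    ch: char,
    x: f32,
    y: f32, // PDF page space: larger Y is higher on the page
}

/// Group glyphs into columns by X (within `col_tolerance`), order columns
/// right to left, and read each column top to bottom.
fn read_vertical(glyphs: &[PlacedGlyph], col_tolerance: f32) -> String {
    let mut sorted: Vec<PlacedGlyph> = glyphs.to_vec();
    sorted.sort_by(|a, b| b.x.total_cmp(&a.x)); // rightmost first

    let mut columns: Vec<Vec<PlacedGlyph>> = Vec::new();
    for g in sorted {
        let same_column = columns
            .last()
            .map_or(false, |col| (col[0].x - g.x).abs() <= col_tolerance);
        if same_column {
            columns.last_mut().unwrap().push(g);
        } else {
            columns.push(vec![g]);
        }
    }

    let mut out = String::new();
    for col in &mut columns {
        col.sort_by(|a, b| b.y.total_cmp(&a.y)); // top of page first
        out.extend(col.iter().map(|g| g.ch));
    }
    out
}
```
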
### Tate-Chu-Yoko

Tate-chu-yoko (縦中横) is a typographic convention where a short horizontal sequence (typically 2–4 Latin characters or digits) is set horizontally within a vertical line. In PDF, these glyphs typically advance left-to-right on a shared baseline, unlike the surrounding CJK glyphs, which advance down the column. Detection: within a vertical column, a short sequence of glyphs with Latin/digit script classification that advances horizontally forms a tate-chu-yoko inline sequence. It should be extracted as a single horizontal sub-span with direction `ltr`, embedded within the enclosing `ttb` span.

---

## 6. Ligatures and Script-Specific Normalization

### Unicode Normalization Forms

| Form | Definition | Use in pdftract |
|------|-----------|----------------|
| NFC | Canonical decomposition then canonical composition | Default for Latin, Greek, Cyrillic output |
| NFD | Canonical decomposition only | Internal processing of combining marks |
| NFKC | Compatibility decomposition then canonical composition | Arabic presentation forms, fullwidth CJK punctuation |
| NFKD | Compatibility decomposition only | Intermediate step for specific scripts |

Apply NFKC selectively: Arabic (to collapse presentation forms), fullwidth punctuation (U+FF01–U+FF60), and Latin ligatures from the Alphabetic Presentation Forms block (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st).

### Latin Ligatures

The glyphs `fi`, `fl`, `ff`, `ffi`, `ffl` have explicit Unicode codepoints (U+FB01, U+FB02, U+FB00, U+FB03, U+FB04). PDF fonts commonly use these as single glyphs mapped via ToUnicode to either the precomposed ligature or the constituent character sequence. For text search and NLP compatibility, always expand to the constituent characters: U+FB01 → U+0066 U+0069. Preserve the original ligature codepoint in a `raw_codepoints` field if the consumer needs to reconstruct original layout.

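
A minimal sketch of the expansion is shown below. The `Expanded` struct and function name are hypothetical; only the `raw_codepoints` field name comes from the text. U+FB05 (long s + t) is mapped to `st` here because full compatibility decomposition also folds the long s to `s`.

```rust
/// Result of expanding ligatures; `raw_codepoints` keeps the input as stored.
#[derive(Debug, PartialEq)]
struct Expanded {
    text: String,
    raw_codepoints: Vec<u32>,
}

/// Expand the Latin ligatures in U+FB00..=U+FB06 to their constituent letters.
fn expand_latin_ligatures(input: &str) -> Expanded {
    let mut text = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\u{FB00}' => text.push_str("ff"),
            '\u{FB01}' => text.push_str("fi"),
            '\u{FB02}' => text.push_str("fl"),
            '\u{FB03}' => text.push_str("ffi"),
            '\u{FB04}' => text.push_str("ffl"),
            '\u{FB05}' | '\u{FB06}' => text.push_str("st"),
            other => text.push(other),
        }
    }
    Expanded {
        text,
        raw_codepoints: input.chars().map(|c| c as u32).collect(),
    }
}
```
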
### Devanagari Conjunct Consonants

Devanagari conjunct consonants (Sanskrit: saṃyuktākṣara) are encoded in Unicode as a base consonant + virama (U+094D) + following consonant. PDF fonts may embed precomposed conjunct glyphs that have no standard Unicode representation. Recovery requires mapping via the font's glyph name (e.g., `kka` → U+0915 U+094D U+0915) using a glyph-name-to-sequence table. NFD decomposition of Devanagari preserves the logical structure and should be preferred over NFC for output.

---

## 7. Language Detection

### Statistical and Dictionary Approaches

For runs of 50+ characters with a known script, statistical **n-gram language identification** is reliable. The `whatlang` crate (Rust) uses trigram frequency profiles for 69 languages; the `lingua` crate supports 75 languages and combines n-grams of length 1 through 5 for higher accuracy, at the cost of a larger compiled profile set. Both crates accept `&str` and can report a confidence value alongside the detected language; their language identifiers must be mapped to BCP 47 tags for output.

For shorter spans (10–50 characters), dictionary-based detection (checking whether the top-N most frequent words from a candidate language appear in the span) outperforms n-gram models. Maintain per-script stop-word lists (the 200 most frequent words per language) compiled into the binary.

### Using `/Lang` as a Prior

When the PDF supplies `/Lang`, use it to bias detection: if the extracted text scores above 0.4 confidence for the declared language, accept the declaration. If the text scores below 0.4 for the declared language but above 0.7 for another, emit a `lang_conflict` warning and use the detected language. If detection confidence is below 0.4 for all candidates, emit `und` (undetermined).

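
The decision rules can be sketched as a single function. Everything here is illustrative: `LangDecision`, `decide_lang`, and the `(tag, confidence)` input shape are hypothetical and independent of any particular detector crate. The sketch also covers the `/Lang`-absent case using the 0.6 threshold from the summary table, and it resolves every combination the rules leave unspecified to `und`, which is an assumption rather than documented behavior.

```rust
#[derive(Debug, PartialEq)]
struct LangDecision {
    lang: String,
    lang_conflict: bool,
}

/// `declared`: the `/Lang` tag, if any. `scores`: (tag, confidence) pairs
/// from whatever detector is in use (hypothetical input shape).
fn decide_lang(declared: Option<&str>, scores: &[(&str, f64)]) -> LangDecision {
    let score_of =
        |tag: &str| scores.iter().find(|(t, _)| *t == tag).map_or(0.0, |(_, s)| *s);
    let best = scores.iter().copied().max_by(|a, b| a.1.total_cmp(&b.1));
    let decision = |lang: &str, lang_conflict: bool| LangDecision {
        lang: lang.to_string(),
        lang_conflict,
    };

    match (declared, best) {
        // Declared language is plausible: trust the declaration.
        (Some(d), _) if score_of(d) >= 0.4 => decision(d, false),
        // Declared language is implausible and another one is a strong match.
        (Some(d), Some((tag, s))) if tag != d && s >= 0.7 => decision(tag, true),
        // No declaration: require a somewhat higher bar for the detector.
        (None, Some((tag, s))) if s >= 0.6 => decision(tag, false),
        // Assumption: every remaining case is reported as undetermined.
        _ => decision("und", false),
    }
}
```
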
Confidence threshold summary:

| Condition | Output |
|-----------|--------|
| `/Lang` present, detection ≥ 0.4 for declared | Use `/Lang` tag |
| `/Lang` present, conflict detected (other ≥ 0.7) | Use detected tag, warn |
| `/Lang` absent, detection ≥ 0.6 | Use detected tag |
| Any path, confidence < 0.4 | `und` |

---

## 8. Output Metadata on Spans and Blocks

Each extracted `Span` and `Block` in the `pdftract` JSON output carries the following language and script metadata:

```json
{
  "text": "مرحباً بالعالم",
  "lang": "ar",
  "script": "Arab",
  "direction": "rtl",
  "normalization": ["nfkc", "visual_order_reversed"],
  "lang_confidence": 0.92,
  "writing_mode": "horizontal"
}
```

Field definitions:

- **`lang`**: BCP 47 language tag (e.g., `ar`, `he`, `ja`, `zh-TW`, `und`). Sourced from `/Lang` or detection.
- **`script`**: ISO 15924 four-letter script code (e.g., `Arab`, `Hebr`, `Hani`, `Hira`, `Hang`, `Deva`, `Thai`, `Latn`). Derived from UAX #24 per-codepoint classification, taking the dominant script of the span.
- **`direction`**: One of `ltr`, `rtl`, or `ttb`. Derived from UBA paragraph direction for horizontal text; `ttb` is set when vertical writing mode is detected via `/WMode 1` or text-matrix analysis.
- **`normalization`**: Array of normalization operations applied, in application order. Valid values: `nfc`, `nfkc`, `nfd`, `nfkd`, `visual_order_reversed`, `ligature_expanded`, `presentation_forms_collapsed`, `combining_marks_stripped`.
- **`lang_confidence`**: Float in [0.0, 1.0] from the language detector. Omitted when `lang` is sourced from `/Lang` and no conflict was detected. Set to `null` when `lang` is `und`.
- **`writing_mode`**: `horizontal` or `vertical`. `vertical` implies `direction` is `ttb`; tate-chu-yoko sub-spans within a vertical block carry `direction: ltr` and `writing_mode: horizontal`.

Blocks aggregate span metadata: the `script` and `lang` of a block are the modal values across its constituent spans. Blocks containing spans from more than one script carry a `mixed_script: true` flag and list all scripts in a `scripts` array alongside the dominant `script` field.

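
The block-level aggregation for `script` might be sketched as below. The `BlockScripts` struct and function name are hypothetical, and two details are assumptions the text does not settle: each span counts once regardless of its length, and ties for the modal script go to the alphabetically first code so that output is deterministic.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
struct BlockScripts {
    script: String,       // modal script across spans
    mixed_script: bool,   // true when more than one script occurs
    scripts: Vec<String>, // every script seen, sorted alphabetically
}

/// `span_scripts`: the ISO 15924 code of each span in the block.
/// Returns None for a block with no spans.
fn aggregate_scripts(span_scripts: &[&str]) -> Option<BlockScripts> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for s in span_scripts {
        *counts.entry(*s).or_insert(0) += 1;
    }
    // max_by_key keeps the last maximum, so iterating in reverse makes ties
    // resolve to the alphabetically first code.
    let (modal, _) = counts.iter().rev().max_by_key(|(_, n)| **n)?;
    Some(BlockScripts {
        script: modal.to_string(),
        mixed_script: counts.len() > 1,
        scripts: counts.keys().map(|s| s.to_string()).collect(),
    })
}
```

The same counting approach would apply to `lang`; weighting by character count instead of span count is a reasonable alternative when spans differ greatly in length.
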