jedarden b805593973 Add six research documents covering output-side extraction topics

- table-structure-reconstruction: line detection, gap analysis, Hough
  transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table,
  symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi,
  CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font
  clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion,
  paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy,
  sliding window overlap, table chunking strategies, token budget, late chunking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 14:56:25 -04:00

Language Detection and Script Handling in pdftract

Overview

Multilingual PDF documents expose three distinct problems for a text extraction library: identifying which Unicode script a sequence of codepoints belongs to, reconstructing logical order from glyphs that may have been stored in visual order, and normalizing script-specific presentation variants to canonical Unicode forms. This document covers each problem, the relevant standards, and the implementation strategy for pdftract.

1. Script Detection from Glyph Data

Unicode Script Property (UAX #24)

Every Unicode codepoint carries a Script property defined in UAX #24. The Unicode Character Database (UCD) ships Scripts.txt and the companion ScriptExtensions.txt. Script extensions matter because some codepoints — most common-use punctuation, digits U+0030–U+0039, and combining marks — are legitimately shared across scripts and carry the Common or Inherited value rather than a specific script name.

A pdftract span classifier should resolve script assignments in this priority order:

Specific script — codepoints with a single non-Common, non-Inherited script assignment are classified directly.
Script extensions — codepoints with multiple entries in ScriptExtensions.txt (e.g., U+0300 COMBINING GRAVE ACCENT extends into Latin, Greek, Cyrillic) inherit the script of the surrounding run.
Common/Inherited — treated as transparent; they attach to the script of the nearest resolved codepoint within the same bidi run.

Mixed-Script Spans

A single PDF text object can contain codepoints from multiple scripts (e.g., a Japanese sentence with embedded Latin product names). The standard approach is script-run segmentation: scan the codepoint sequence left to right, maintaining a current script state, and emit a new span boundary whenever the resolved script changes from one specific value to another. Common and Inherited codepoints do not trigger boundaries.

The Unicode ScriptExtensions data can be used to suppress spurious splits: if a Common punctuation character appears between two Latin spans with no intervening RTL text, it should remain in the Latin span rather than producing a one-character Common fragment.
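
A minimal sketch of script-run segmentation as described above. The `SCRIPT_RANGES` table is a small illustrative subset, not the full Scripts.txt data, and the names `script_of` and `script_runs` are hypothetical. As a simplification, Common characters attach to the preceding run rather than to the nearest resolved codepoint.

```python
# Illustrative subset of Scripts.txt; a real classifier would load the full UCD data.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"), (0x0061, 0x007A, "Latn"),
    (0x05D0, 0x05EA, "Hebr"), (0x0620, 0x064A, "Arab"),
    (0x3041, 0x3096, "Hira"), (0x30A1, 0x30FA, "Kana"),
    (0x3400, 0x4DBF, "Hani"), (0x4E00, 0x9FFF, "Hani"),
    (0xAC00, 0xD7A3, "Hang"),
]

def script_of(ch):
    """Return an ISO 15924 code, or 'Zyyy' (Common) for anything not in the table."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Zyyy"

def script_runs(text):
    """Split text into (script, substring) runs; Common never opens a new run."""
    runs = []      # list of [script, text]
    pending = ""   # Common characters seen before the first specific script
    for ch in text:
        sc = script_of(ch)
        if sc == "Zyyy":
            if runs:
                runs[-1][1] += ch      # transparent: stays in the current run
            else:
                pending += ch
        elif runs and runs[-1][0] == sc:
            runs[-1][1] += ch
        else:
            runs.append([sc, pending + ch])
            pending = ""
    if pending:                        # text contained only Common characters
        runs.append(["Zyyy", pending])
    return [(sc, chunk) for sc, chunk in runs]
```

With this table, the space and digits in a Japanese sentence containing "iPhone 15" stay inside the Latin run instead of producing Common fragments.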

CJK Script Identification

CJK text requires distinguishing four scripts, whose principal codepoint ranges are:

Script	Key Ranges
Han	U+4E00–U+9FFF (BMP), U+3400–U+4DBF (Extension A), U+20000–U+2A6DF (Extension B)
Hiragana	U+3041–U+3096
Katakana	U+30A1–U+30FA, U+31F0–U+31FF
Hangul	U+AC00–U+D7A3 (syllables), U+1100–U+11FF (jamo)

Han is shared across Chinese, Japanese, and Korean. Language detection (Section 7) must disambiguate Han-dominant runs; script detection alone cannot.

PDF `/Lang` Attribute

Tagged PDFs may carry a /Lang entry (BCP 47 language tag) on the document catalog, individual structure elements, or marked-content sequences. When present, /Lang is a strong prior:

ja → expect Han + Hiragana + Katakana, writing mode potentially vertical.
ar or he → expect RTL bidi direction, visual-order glyph storage likely.
zh-TW vs. zh-CN → disambiguates Traditional vs. Simplified Han.

When /Lang is absent or when extracted text falls outside the declared language's expected scripts, fall back to character-level detection. Never suppress the fallback entirely: many PDFs carry a top-level /Lang that does not apply uniformly to all content (e.g., an English document with a Hebrew quotation).
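
The fallback rule can be sketched as a check of observed scripts against the scripts expected for the declared language. The `EXPECTED_SCRIPTS` table and the function name are assumptions made for illustration; the table is deliberately partial.

```python
# Hypothetical, partial mapping from primary language subtag to expected scripts.
EXPECTED_SCRIPTS = {
    "en": {"Latn"},
    "ja": {"Hani", "Hira", "Kana", "Latn"},
    "zh": {"Hani", "Latn"},
    "ar": {"Arab"},
    "he": {"Hebr"},
}

def needs_character_level_detection(declared_lang, observed_scripts):
    """True when /Lang is absent, unknown, or does not cover every observed script."""
    if not declared_lang:
        return True
    primary = declared_lang.split("-")[0].lower()
    expected = EXPECTED_SCRIPTS.get(primary)
    if expected is None:
        return True
    return not set(observed_scripts) <= expected
```

An English document with a Hebrew quotation reports `{"Latn", "Hebr"}` for the quoted block, so the declared `en` does not suppress detection there.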

2. Unicode Bidirectional Algorithm (UBA, UAX #9)

Algorithm Structure

UAX #9 defines a multi-pass algorithm over a paragraph of codepoints. Each codepoint has a bidi character type (Strong: L/R/AL; Weak: EN/ES/ET/AN/CS/NSM/BN; Neutral: B/S/WS/ON; Explicit: LRE/RLE/LRO/RLO/PDF/LRI/RLI/FSI/PDI).

Key steps:

Paragraph embedding level: if the first strong character is R or AL, the paragraph is RTL (embedding level 1); otherwise LTR (level 0).
Explicit level runs: LRE/RLE (and the overrides LRO/RLO) push a new embedding level; PDF (POP DIRECTIONAL FORMATTING, unrelated to the file format) pops it. The isolate controls (LRI/RLI/FSI/PDI, introduced in Unicode 6.3) create isolated bidi contexts that do not affect the surrounding paragraph's level stack.
Weak type resolution: sequences of weak types are resolved based on surrounding strong types per a finite-state table.
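Neutral resolution: neutral characters between two same-direction strong runs take that direction; between opposing runs they take the embedding direction, which equals the paragraph direction when no explicit embeddings are in effect.
Reorder: from the highest embedding level down to the lowest odd level, reverse each contiguous sequence of characters at that level or higher to produce visual order.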
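
The first and last of these steps are small enough to sketch directly. The code below is a simplified illustration, not a conforming UAX #9 implementation: `paragraph_level` ignores isolates, and `reorder_visual` assumes the embedding levels have already been resolved by the intermediate steps. Both function names are hypothetical.

```python
import unicodedata

def paragraph_level(text):
    """Paragraph embedding level from the first strong character (isolates ignored)."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return 1
        if bidi == "L":
            return 0
    return 0  # no strong character: default to LTR

def reorder_visual(chars, levels):
    """Reverse runs from the highest level down to the lowest odd level."""
    out = list(chars)
    odd = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd:
        return "".join(out)
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(out):
            if levels[i] >= level:
                j = i
                while j < len(out) and levels[j] >= level:
                    j += 1
                out[i:j] = reversed(out[i:j])  # reverse one contiguous sequence
                i = j
            else:
                i += 1
    return "".join(out)
```

In the second assertion-style example, `reorder_visual("AB12", [1, 1, 2, 2])` keeps the digits in left-to-right order while the surrounding RTL letters are reversed, which is why digits need their own level.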

Why PDF Breaks Bidi

PDF authoring tools generally emit glyphs in visual order for RTL text rather than in logical (Unicode) order. The content stream places glyphs through the text matrix and per-glyph advances, and in horizontal writing mode an advance normally moves the text position toward increasing X; nothing in the stream records reading direction. An Arabic sentence stored this way therefore begins in the content stream with its leftmost glyph, which is the last character in logical order.

Consequences for extraction:

Naively reading content-stream character codes left-to-right from a page produces reversed Arabic/Hebrew words.
Mixed LTR/RTL content is interleaved in spatial order: the leftmost object on the page comes first in the stream, regardless of its logical position in the paragraph.

Detecting and Reversing Visual-Order RTL

Detection heuristic: after Unicode recovery, if a run of characters with strong R or AL bidi type appears in left-to-right spatial order (i.e., X coordinates increase as the content-stream position increases), the run is stored in visual order and must be reversed. The threshold for "increasing X" should tolerate per-glyph kerning noise (±2 units in text space).

Reversal procedure:

Identify the visual-order run boundaries (the span between two LTR-direction glyphs or page-object boundaries).
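Reverse the codepoint sequence of the run so that both the letters within each RTL word (space-delimited or width-gap-delimited) and the order of the words are restored; keep each combining mark after its base letter and leave digit sequences in their original left-to-right order.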
Apply UBA to the reassembled logical string to verify paragraph direction.

Note: some PDF producers (notably newer versions of Adobe Acrobat) do store RTL text in logical order with correct ToUnicode. The detection heuristic must be conditional, not unconditional.
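
A minimal sketch of the conditional check, assuming each glyph arrives as a hypothetical `(char, x)` pair in content-stream order. It handles only a run of strong RTL base letters; combining marks, digits, and mixed-direction runs need additional handling that is omitted here.

```python
import unicodedata

def is_strong_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")

def stored_in_visual_order(run, tol=2.0):
    """True when X increases along the stream, allowing tol units of kerning noise."""
    xs = [x for _, x in run]
    if len(xs) < 2:
        return False
    non_decreasing = all(b >= a - tol for a, b in zip(xs, xs[1:]))
    return non_decreasing and xs[-1] > xs[0]

def to_logical(run, tol=2.0):
    """Reverse the run only if it is all strong RTL and laid out left to right."""
    text = "".join(ch for ch, _ in run)
    if all(is_strong_rtl(ch) for ch in text) and stored_in_visual_order(run, tol):
        return text[::-1]
    return text
```

A run whose X coordinates decrease along the stream is already in logical order and is returned unchanged, which covers producers that store RTL text logically.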

3. Arabic and Hebrew Specifics

Arabic Shaping and Presentation Forms

Arabic uses a joining model: each base letter has up to four contextual glyph forms — isolated, initial, medial, and final — determined by whether the character joins to the preceding and/or following letter. Critically, all four forms map to the same base Unicode codepoint. A PDF font may embed glyphs named uniFE8D (isolated alef) or uniFE8E (final alef), which are Arabic Presentation Forms from the block U+FB50–U+FDFF (Presentation Forms-A) and U+FE70–U+FEFF (Presentation Forms-B).

Normalization: apply Unicode compatibility normalization (NFKD or NFKC) to map presentation forms to their base codepoints. In the Presentation Forms-A block (U+FB50–U+FDFF), some entries (e.g., U+FD3E ORNATE LEFT PARENTHESIS and U+FD3F ORNATE RIGHT PARENTHESIS) have no compatibility decomposition and should be preserved as-is. After normalization, the shaping context is lost, but the logical character identity is recovered, which is what text extraction requires.

Mandatory ligatures such as lam-alef (U+0644 + U+0627 and variants) have precomposed forms in the presentation block. These should be expanded back to their two-codepoint sequences during normalization.
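
As an illustration, Python's standard `unicodedata.normalize` shows the effect of restricting NFKC to the two presentation-form blocks. The function names are hypothetical, and normalizing one character at a time is a simplification that is adequate for these blocks.

```python
import unicodedata

def in_presentation_forms(ch):
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF

def collapse_presentation_forms(text):
    """Apply NFKC only to Arabic presentation forms; leave everything else untouched."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if in_presentation_forms(ch) else ch
        for ch in text
    )
```

For example, the three glyph codes U+FEB3, U+FEFC, U+FEE1 (initial seen, final lam-alef ligature, isolated meem) collapse to the four base letters U+0633 U+0644 U+0627 U+0645. Codepoints in these blocks that have no compatibility decomposition pass through unchanged.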

Hebrew Vowel Points and Cantillation

Hebrew base letters (U+05D0–U+05EA) may be followed by nikud (vowel points, U+05B0–U+05C7) and cantillation marks (U+0591–U+05AF). These are combining characters with bidi type NSM (nonspacing mark), which take on the direction of the preceding base letter and therefore stay attached to it in logical order. For plain-text extraction, nikud and cantillation can be optionally stripped or preserved depending on the output mode; pdftract should expose a normalization flag strip_combining_marks: bool per script.
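
A sketch of what the stripping mode might do for Hebrew, assuming the filter is applied per character. Filtering on the general category `Mn` rather than on the range alone matters, because the range also contains punctuation such as U+05BE MAQAF that should survive. The function name is hypothetical.

```python
import unicodedata

def strip_hebrew_marks(text):
    """Drop nonspacing marks in U+0591..U+05C7; keep letters and punctuation."""
    return "".join(
        ch for ch in text
        if not (0x0591 <= ord(ch) <= 0x05C7 and unicodedata.category(ch) == "Mn")
    )
```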

RTL Word Boundaries Without Spaces

Some Arabic PDFs omit inter-word spaces in the content stream (words are positioned by glyph advances rather than space characters). Word boundary detection falls back to X-gap analysis: a gap between adjacent glyphs significantly larger than the average intra-word advance (heuristic: > 0.25 × em) is treated as a word boundary.
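
The gap heuristic can be sketched as follows, assuming hypothetical `(char, x_start, x_end)` glyph tuples on one baseline, sorted by increasing X, with the em taken to be the font size in the same units. For RTL text the resulting word list is in visual order and still needs the reversal described in Section 2.

```python
def split_words_by_gap(glyphs, font_size, factor=0.25):
    """Start a new word wherever the gap to the previous glyph exceeds factor * em."""
    words, current, prev_end = [], "", None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > factor * font_size:
            words.append(current)
            current = ""
        current += ch
        prev_end = x_end
    if current:
        words.append(current)
    return words
```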

4. CJK Handling

Horizontal vs. Vertical Writing Modes

PDF CMaps carry a /WMode entry: 0 = horizontal, 1 = vertical. Predefined CMaps come in pairs: a horizontal CMap (name ending in -H) and a vertical CMap (name ending in -V). A Type 0 font dictionary names one CMap in its /Encoding entry, so the writing mode in effect for a run is determined by the font that the content stream selects with the Tf operator.

CJK punctuation normalization: the fullwidth forms U+FF01–U+FF5E are compatibility equivalents of ASCII U+0021–U+007E, and U+FF5F–U+FF60 map to the white parentheses U+2985–U+2986. For prose extraction, map fullwidth to halfwidth via NFKC unless the output is destined for layout-sensitive consumers. The pdftract normalization pipeline should apply NFKC only to Common-script fullwidth/halfwidth punctuation, not to Han or Kana characters (NFKC rewrites some compatibility Kana, such as halfwidth Katakana, which should be preserved).
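
A sketch of the selective rule, assuming "punctuation" is taken to mean general categories P* and S* within U+FF01–U+FF60. That reading is an assumption; it leaves fullwidth letters and digits untouched, and halfwidth Katakana falls outside the range entirely. The function name is hypothetical.

```python
import unicodedata

def normalize_fullwidth_punct(text):
    """NFKC only for fullwidth punctuation and symbols; other characters pass through."""
    out = []
    for ch in text:
        is_fullwidth = 0xFF01 <= ord(ch) <= 0xFF60
        if is_fullwidth and unicodedata.category(ch)[0] in "PS":
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```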

CJK Line-Break Rules (UAX #14)

The Unicode Line Breaking Algorithm (UAX #14) defines non-starter characters (closing brackets, closing quotation marks, Japanese small kana: ぁぃぅぇぉっゃゅょ) that cannot begin a line, and non-ender characters (opening brackets) that cannot end a line. When pdftract reassembles lines from individual glyphs, these rules inform the merge heuristic: a glyph with a non-starter break class that appears at the apparent start of a new line in the spatial layout should be joined to the preceding line.
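
The merge heuristic might look like the sketch below. The `NON_STARTERS` set is a small illustrative subset (the small kana named above plus a few closing marks), not the UAX #14 class data, and the function name is hypothetical.

```python
# Illustrative subset only; a real implementation would use UAX #14 line-break classes.
NON_STARTERS = set("ぁぃぅぇぉっゃゅょ）」』】、。")

def fix_line_starts(lines):
    """Move leading non-starter characters back onto the preceding line."""
    fixed = []
    for line in lines:
        while fixed and line and line[0] in NON_STARTERS:
            fixed[-1] += line[0]
            line = line[1:]
        if line:
            fixed.append(line)
    return fixed
```

A line that consisted only of non-starters disappears after the merge, and a non-starter at the very first line is left alone because there is nothing to attach it to.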

5. Vertical Text

PDF Encoding of Vertical CJK

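In vertical writing mode, a PDF producer can make text advance down the page in two ways. With a vertical CMap (/WMode 1), glyphs stay upright and each glyph's displacement comes from the CIDFont's vertical metrics (/DW2 and /W2) rather than its horizontal width. Alternatively, the producer sets a horizontal-mode font under a text matrix rotated by 90 degrees, so that the ordinary horizontal advance runs downward instead of rightward; the glyph's width in the font metrics then acts as its vertical advance.

Detection: examine the Tm (text matrix) operator and the CMap. A matrix of the form [0 -1 1 0 tx ty] or [0 1 -1 0 tx ty] (possibly scaled by the font size) indicates a rotated run, and /WMode 1 in the CMap indicates vertical writing mode. Either is a signal of vertical text; a /WMode 1 font normally appears under an unrotated matrix, so the two need not occur together.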

Reconstruction: to recover horizontal reading order from a vertical column:

Sort glyphs by decreasing Y within a column (top-to-bottom).
Sort columns by decreasing X (columns run right-to-left in vertical Japanese and Chinese text; scripts such as traditional Mongolian, whose columns run left-to-right, need the opposite order).
Assign direction ttb to the span.
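
A sketch of the matrix test and the column sort, assuming hypothetical `(char, x, y)` glyph tuples in page space with Y increasing upward. Column direction is left as an explicit parameter, and `column_tol` (the X distance within which glyphs are treated as one column) is an assumed tuning value.

```python
def is_rotated_90(tm, eps=1e-6):
    """True for [0 -s s 0 tx ty] or [0 s -s 0 tx ty] with any non-zero scale s."""
    a, b, c, d = tm[:4]
    scale = max(abs(b), abs(c))
    if scale == 0:
        return False
    return abs(a) <= eps * scale and abs(d) <= eps * scale and b * c < 0

def read_vertical(glyphs, column_tol, columns_right_to_left):
    """Group glyphs into columns by X, read each top to bottom, then order columns."""
    columns = []  # each entry: [reference_x, [(y, char), ...]]
    for ch, x, y in sorted(glyphs, key=lambda g: g[1]):
        if columns and abs(x - columns[-1][0]) <= column_tol:
            columns[-1][1].append((y, ch))
        else:
            columns.append([x, [(y, ch)]])
    if columns_right_to_left:
        columns.reverse()
    return [
        "".join(ch for _, ch in sorted(col, key=lambda p: -p[0]))  # decreasing Y
        for _, col in columns
    ]
```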

Tate-Chu-Yoko

Tate-chu-yoko (縦中横) is a typographic convention where a short horizontal sequence (typically 2–4 Latin characters or digits) is set horizontally within a vertical line. In PDF, these glyphs appear without the 90-degree rotation applied to surrounding CJK glyphs. Detection: within a vertical column, glyphs with a non-rotated text matrix and Latin/digit script classification form a tate-chu-yoko inline sequence. They should be extracted as a single horizontal sub-span with direction ltr, embedded within the enclosing ttb span.
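
A sketch of the sub-span split, assuming each glyph in a column arrives top to bottom as a hypothetical `(char, rotated)` pair, where `rotated` records whether the glyph shares the rotation of the surrounding CJK glyphs. ASCII letters and digits stand in for the Latin/digit script classification, and the typical 2–4 character length is not enforced here.

```python
def split_tate_chu_yoko(column):
    """Return spans with direction 'ttb', or 'ltr' for unrotated Latin/digit runs."""
    spans = []
    for ch, rotated in column:
        is_tcy = (not rotated) and ch.isascii() and ch.isalnum()
        direction = "ltr" if is_tcy else "ttb"
        if spans and spans[-1]["direction"] == direction:
            spans[-1]["text"] += ch
        else:
            spans.append({"text": ch, "direction": direction})
    return spans
```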

6. Ligatures and Script-Specific Normalization

Unicode Normalization Forms

Form	Definition	Use in pdftract
NFC	Canonical decomposition then canonical composition	Default for Latin, Greek, Cyrillic output
NFD	Canonical decomposition only	Internal processing of combining marks
NFKC	Compatibility decomposition then canonical composition	Arabic presentation forms, fullwidth CJK punctuation
NFKD	Compatibility decomposition only	Intermediate step for specific scripts

Apply NFKC selectively: Arabic (to collapse presentation forms), fullwidth punctuation (U+FF01–U+FF60), and Latin ligatures from the Alphabetic Presentation Forms block (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st).

Latin Ligatures

The glyphs fi, fl, ff, ffi, ffl have explicit Unicode codepoints (U+FB01, U+FB02, U+FB00, U+FB03, U+FB04). PDF fonts commonly use these as single glyphs mapped via ToUnicode to either the precomposed ligature or the two-character sequence. For text search and NLP compatibility, always expand to the constituent characters: fi → U+0066 U+0069. Preserve the original ligature codepoint in a raw_codepoints field if the consumer needs to reconstruct original layout.
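
A sketch of the expansion for the five ligatures listed above. The shape of the `raw_codepoints` field (a list of integer codepoints for the unexpanded text) is an assumption; the source names the field but does not define its layout.

```python
LATIN_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

def expand_ligatures(text):
    """Expand Latin ligatures and keep the original codepoints alongside."""
    expanded = "".join(LATIN_LIGATURES.get(ch, ch) for ch in text)
    return {"text": expanded, "raw_codepoints": [ord(ch) for ch in text]}
```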

Devanagari Conjunct Consonants

Devanagari conjunct consonants (Sanskrit: saṃyuktākṣara) are encoded in Unicode as a base consonant + virama (U+094D) + following consonant. PDF fonts may embed precomposed conjunct glyphs that have no standard Unicode representation. Recovery requires mapping via the font's glyph name (e.g., kka → U+0915 U+094D U+0915) using a glyph-name-to-sequence table. NFD decomposition of Devanagari preserves the logical structure and should be preferred over NFC for output.

7. Language Detection

Statistical and Dictionary Approaches

For runs of 50+ characters with a known script, statistical n-gram language identification is reliable. The whatlang crate (Rust) uses trigram frequency profiles for 69 languages; the lingua crate supports 75 languages with a higher-accuracy model built on n-grams of length 1 through 5, at the cost of a larger compiled profile set. Both crates accept &str; whatlang returns the detected language with a confidence score, and lingua can report confidence values per candidate language.

For shorter spans (10–50 characters), dictionary-based detection — checking whether the top-N most frequent words from a candidate language appear in the span — outperforms n-gram models. Maintain per-script stop-word lists (the 200 most frequent words per language) compiled into the binary.
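
A toy version of the dictionary approach, with a handful of stop words per language standing in for the 200-word lists. The word lists, the whitespace tokenizer, and the scoring (fraction of tokens that are stop words) are all assumptions made for illustration.

```python
# Truncated illustrative lists; real lists would hold about 200 words per language.
STOP_WORDS = {
    "en": {"the", "is", "on", "and", "of", "to", "in"},
    "fr": {"le", "la", "est", "sur", "et", "de", "les"},
    "de": {"der", "das", "ist", "und", "auf", "ein", "nicht"},
}

def detect_by_stop_words(text):
    """Return (lang, score), where score is the share of tokens found in the list."""
    tokens = text.lower().split()
    if not tokens:
        return None, 0.0
    scores = {
        lang: sum(tok in words for tok in tokens) / len(tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, 0.0
    return best, scores[best]
```

Whitespace tokenization only works for scripts that separate words with spaces, so this approach does not carry over to Han or Kana runs without a different tokenizer.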

Using `/Lang` as a Prior

When the PDF supplies /Lang, use it to bias detection: if the extracted text scores above 0.4 confidence for the declared language, accept the declaration. If the text scores below 0.4 for the declared language but above 0.7 for another, emit a lang_conflict warning and use the detected language. If detection confidence is below 0.4 for all candidates, emit und (undetermined).

Confidence threshold summary:

Condition	Output
`/Lang` present, detection ≥ 0.4 for declared	Use `/Lang` tag
`/Lang` present, conflict detected (other ≥ 0.7)	Use detected tag, warn
`/Lang` absent, detection ≥ 0.6	Use detected tag
Any path, confidence < 0.4	`und`
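
The table can be read as a small decision function. This sketch assumes the detector yields a hypothetical dict of tag to confidence. Cases the table does not cover (for example `/Lang` absent with confidence between 0.4 and 0.6) fall through to `und` here; that choice is an assumption, not something the thresholds above specify.

```python
def resolve_lang(declared, scores):
    """Return (tag, lang_conflict) given a /Lang value (or None) and detector scores."""
    best, best_conf = max(scores.items(), key=lambda kv: kv[1], default=(None, 0.0))
    if declared:
        if scores.get(declared, 0.0) >= 0.4:
            return declared, False
        if best != declared and best_conf >= 0.7:
            return best, True        # caller emits the lang_conflict warning
        return "und", False          # assumption: unspecified case
    if best_conf >= 0.6:
        return best, False
    return "und", False              # includes the unspecified 0.4..0.6 band
```

A real implementation would also need to reconcile tag granularity, since a declared `zh-TW` will not match a detector key of `zh` in this direct lookup.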

8. Output Metadata on Spans and Blocks

Each extracted Span and Block in the pdftract JSON output carries the following language and script metadata:

{
  "text": "مرحباً بالعالم",
  "lang": "ar",
  "script": "Arab",
  "direction": "rtl",
  "normalization": ["nfkc", "visual_order_reversed"],
  "lang_confidence": 0.92,
  "writing_mode": "horizontal"
}

Field definitions:

lang — BCP 47 language tag (e.g., ar, he, ja, zh-TW, und). Sourced from /Lang or detection.
script — ISO 15924 four-letter script code (e.g., Arab, Hebr, Hani, Hira, Hang, Deva, Thai, Latn). Derived from UAX #24 per-codepoint classification, taking the dominant script of the span.
direction — One of ltr, rtl, or ttb. Derived from UBA paragraph direction for horizontal text; ttb set when vertical writing mode is detected via CTM analysis and /WMode 1.
normalization — Array of normalization operations applied, in application order. Valid values: nfc, nfkc, nfd, nfkd, visual_order_reversed, ligature_expanded, presentation_forms_collapsed, combining_marks_stripped.
lang_confidence — Float in [0.0, 1.0] from the language detector. Omitted when lang is sourced from /Lang and no conflict was detected. Set to null when lang is und.
writing_mode — horizontal or vertical. vertical implies direction is ttb; tate-chu-yoko sub-spans within a vertical block carry direction: ltr and writing_mode: horizontal.

Blocks aggregate span metadata: the script and lang of a block are the modal values across its constituent spans. Blocks containing spans from more than one script carry a mixed_script: true flag and list all scripts in a scripts array alongside the dominant script field.
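
Block aggregation can be sketched with a counter over span metadata. Taking the mode by span count (rather than by character count) and sorting the `scripts` array are assumptions; the function name is hypothetical and the input is assumed to be non-empty.

```python
from collections import Counter

def aggregate_block(spans):
    """Derive block-level script/lang as modal values across a non-empty span list."""
    scripts = Counter(span["script"] for span in spans)
    langs = Counter(span["lang"] for span in spans)
    block = {
        "script": scripts.most_common(1)[0][0],
        "lang": langs.most_common(1)[0][0],
    }
    if len(scripts) > 1:
        block["mixed_script"] = True
        block["scripts"] = sorted(scripts)
    return block
```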
