jedarden 04b60a1cf7 Add three research documents: CJK encoding, pipeline synthesis, linearization

- cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0
  composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1,
  Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via
  Adobe CID tables, full-width normalization, vertical text detection
- extraction-pipeline-overview: end-to-end 9-stage synthesis referencing
  all 36 research documents; stages: file open, metadata, page classification,
  content extraction (4 sub-paths), font pipeline, span assembly, normalization
  and quality, supplementary content, output serialization; ASCII data-flow
  diagram
- linearized-pdf-and-streaming: linearization dict keys, hint stream
  bitfield tables, first-page xref lazy parsing, HTTP range request pattern,
  staleness validation, incremental update interaction, NDJSON streaming,
  partial file extraction, lazy PageIter API with rayon par_bridge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:26:36 -04:00


CJK and Asian Script Encoding in PDF

CJK documents (Chinese, Japanese, Korean) are among the most common non-Latin PDFs in the wild. Their encoding pipelines differ fundamentally from Latin-script PDFs: multi-byte code spaces, large predefined CMaps, identity mappings, and character sets defined by national standards bodies rather than by the Unicode Consortium. This document covers the full encoding stack that pdftract must understand to produce readable text from CJK sources.

1. CJK Encoding Systems Overview

CJK text in PDF derives from legacy national encoding standards that predate Unicode by decades.

Japanese PDFs draw on two byte encodings and the character set underlying both:

Shift-JIS (SJIS): variable-width 1–2 byte encoding. The dominant encoding for Japanese Windows software. Covers JIS X 0208 kanji plus hiragana, katakana, and half-width katakana.
EUC-JP: Extended Unix Code, a variable-width (1–3 byte) encoding used on Unix systems. JIS X 0208 characters take two bytes, both in the 0xA1–0xFE range, which gives a simpler scheme than Shift-JIS.
JIS X 0208: the underlying 94×94 character table that Shift-JIS and EUC-JP both reference; not itself a byte encoding but the source character set.

Chinese Simplified uses:

GB2312 (1981): 6,763 characters in a 94×94 table, 2-byte EUC-style encoding (lead 0xA1–0xFE).
GBK (1995): extends GB2312 to cover the 20,902 CJK ideographs of Unicode 1.1; lead bytes 0x81–0xFE, trail 0x40–0xFE (excluding 0x7F).
GB18030 (2000, mandatory): extends GBK with 4-byte sequences, covering all Unicode planes.

Chinese Traditional uses:

Big5: 2-byte encoding covering ~13,000 traditional characters, widely used in Taiwan and Hong Kong.
Big5-HKSCS: Hong Kong government extension adding characters for Cantonese and Hong Kong-specific usage.

Korean uses:

EUC-KR: 2-byte encoding based on KS X 1001 (formerly KSC 5601), covering ~2,350 Hangul syllables plus Hanja.
CP949 (Unified Hangul Code): Microsoft extension of EUC-KR covering all 11,172 modern Hangul syllables.

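Unicode-based encodings in PDF: the Identity-H and Identity-V CMaps treat each 2-byte character code directly as a CID. In most modern CJK PDFs that embed subset OpenType or TrueType fonts, that CID is tied to the glyph order of the embedded font, so it carries no Unicode meaning by itself and a ToUnicode CMap is needed for extraction. The Unicode-keyed predefined CMaps (for example UniJIS-UTF16-H) are different: there the character code itself is a UTF-16 or UCS-2 value.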

Because CJK character sets have thousands to tens of thousands of codepoints, they cannot fit in a Type 1 or TrueType simple font (limited to 256 glyphs). This is why CJK PDFs overwhelmingly use Type 0 composite fonts. A Type 0 font references a CIDFont (a font whose glyph space is indexed by character IDs rather than a 256-entry encoding vector) and a CMap that maps byte sequences to CIDs.

2. Type 0 Composite Font Structure

A Type 0 font dictionary contains:

/Type /Font
/Subtype /Type0
/BaseFont /HeiseiKakuGo-W5          % or a subset tag + font name
/Encoding /90ms-RKSJ-H              % a CMap name or inline stream
/DescendantFonts [<<...>>]          % always an array of exactly one CIDFont dict
/ToUnicode stream                   % optional but critical for text extraction

The CIDFont dictionary (the single element in DescendantFonts) contains:

/CIDSystemInfo: dictionary with /Registry (e.g., Adobe), /Ordering (e.g., Japan1), /Supplement (integer). This identifies the character collection and its version. Key values: Adobe/Japan1, Adobe/CNS1, Adobe/GB1, Adobe/Korea1.
/DW: default glyph width in glyph-space units (1/1000 of a text unit). Typically 1000 for full-width CJK glyphs.
/W: width exceptions array. Format: [startCID [w1 w2 ... wn]] or [startCID endCID w]. Essential for correct glyph advance computation.
/CIDToGIDMap: either the name /Identity (CID equals GID in the embedded font file) or a stream of 2-byte big-endian GID values indexed by CID.
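The two /W entry forms described above can be modeled directly. The following is a minimal sketch, assuming the array has already been parsed out of the PDF object tree; `WidthEntry` and `glyph_width` are hypothetical names for illustration, not an existing pdftract API.

```rust
/// One entry of a CIDFont /W array, already parsed from PDF objects.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone)]
pub enum WidthEntry {
    /// `startCID [w1 w2 ... wn]`: consecutive CIDs with individual widths.
    List { start: u32, widths: Vec<f32> },
    /// `startCID endCID w`: every CID in the inclusive range shares one width.
    Range { start: u32, end: u32, width: f32 },
}

/// Returns the advance width for `cid` in glyph-space units,
/// falling back to the /DW value when no /W entry covers the CID.
pub fn glyph_width(cid: u32, entries: &[WidthEntry], default_width: f32) -> f32 {
    for entry in entries {
        match entry {
            WidthEntry::List { start, widths } => {
                if cid >= *start {
                    if let Some(w) = widths.get((cid - *start) as usize) {
                        return *w;
                    }
                }
            }
            WidthEntry::Range { start, end, width } => {
                if (*start..=*end).contains(&cid) {
                    return *width;
                }
            }
        }
    }
    default_width
}
```

A linear scan is adequate for the short /W arrays typical of subset fonts; a sorted table with binary search would be the natural next step for large ones.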

The encoding pipeline for a CJK text string is:

raw bytes → CMap lookup → CID → CIDToGIDMap → GID → glyph in font file

For text extraction, pdftract needs: raw bytes → character code → ToUnicode (if present) → Unicode codepoint (a ToUnicode CMap is keyed by character code, not by CID), or, lacking ToUnicode, raw bytes → CMap lookup → CID → compiled-in CID-to-Unicode table for the given CIDSystemInfo.
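The resolution order can be sketched as a single function that tries ToUnicode first and falls back to the CID route. This is an illustrative sketch only: `resolve_text` is a hypothetical name, the ToUnicode map is assumed to be pre-parsed into a `HashMap` keyed by character code, and the two lookups are passed in as closures so the example stays self-contained.

```rust
use std::collections::HashMap;

/// Resolves one character code to Unicode text.
/// 1. ToUnicode (keyed by character code) wins when it has an entry.
/// 2. Otherwise: code -> CID via the font's CMap, then CID -> Unicode
///    via the table chosen from CIDSystemInfo.
pub fn resolve_text<F, G>(
    code: u32,
    to_unicode: Option<&HashMap<u32, String>>,
    code_to_cid: F,
    cid_to_unicode: G,
) -> Option<String>
where
    F: Fn(u32) -> Option<u32>,
    G: Fn(u32) -> Option<char>,
{
    if let Some(map) = to_unicode {
        if let Some(text) = map.get(&code) {
            return Some(text.clone());
        }
    }
    let cid = code_to_cid(code)?;
    cid_to_unicode(cid).map(|c| c.to_string())
}
```

The ToUnicode value is a `String` rather than a `char` because a single code may map to several code points (ligatures, for example).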

3. Predefined CMap Names

ISO 32000-1 (section 9.7.5.2, "Predefined CMaps") lists the predefined CMaps that a conforming PDF processor must know without an embedded stream. These must be compiled into pdftract as lookup tables.

Japanese (Adobe/Japan1):

CMap Name	Encoding	Direction
`83pv-RKSJ-H`	Shift-JIS (Mac OS, KanjiTalk 6 extensions)	horizontal
`90ms-RKSJ-H`	Shift-JIS (MS Windows)	horizontal
`90ms-RKSJ-V`	Shift-JIS (MS Windows)	vertical
`90msp-RKSJ-H`	Shift-JIS proportional	horizontal
`EUC-H`	EUC-JP	horizontal
`EUC-V`	EUC-JP	vertical
`UniJIS-UTF16-H`	UTF-16 → Japan1 CIDs	horizontal
`UniJIS-UTF16-V`	UTF-16 → Japan1 CIDs	vertical
`UniJIS2004-UTF32-H`	UTF-32 → Japan1 CIDs (JIS2004 glyph forms)	horizontal

Chinese Simplified (Adobe/GB1):

CMap Name	Encoding
`GB-EUC-H`	GB2312 EUC
`GBT-EUC-H`	GB/T 12345 (traditional forms) EUC
`UniGB-UCS2-H`	UCS-2 → GB1 CIDs
`UniGB-UTF16-H`	UTF-16 → GB1 CIDs

Chinese Traditional (Adobe/CNS1):

CMap Name	Encoding
`ETen-B5-H`	Big5 (ETen extension)
`ETen-B5-V`	Big5 (ETen extension), vertical
`UniCNS-UCS2-H`	UCS-2 → CNS1 CIDs
`UniCNS-UTF16-H`	UTF-16 → CNS1 CIDs

Korean (Adobe/Korea1):

CMap Name	Encoding
`KSC-EUC-H`	EUC-KR
`KSC-EUC-V`	EUC-KR, vertical
`UniKS-UCS2-H`	UCS-2 → Korea1 CIDs
`UniKS-UTF16-H`	UTF-16 → Korea1 CIDs

Universal pass-throughs: Identity-H and Identity-V treat the 2-byte big-endian character code directly as the CID. Used by modern tools generating Unicode-mapped CJK fonts.

Implementation: store each predefined CMap as a sorted &[(u16, u16)] slice of (code, cid) pairs in a static array. For variable-width CMaps (Shift-JIS, GB18030), represent the codespace as a trie or range table keyed on the lead byte.
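Because Adobe's CMap resources express most mappings as contiguous ranges, a range table is a compact alternative to one pair per code. The sketch below assumes a table sorted by range start with no overlaps; `CidRange` and `lookup_cid` are hypothetical names, and the sample values in the usage note are illustrative rather than copied from a real CMap.

```rust
/// One range entry: codes `lo..=hi` map to consecutive CIDs starting at `cid`.
#[derive(Debug, Clone, Copy)]
pub struct CidRange {
    pub lo: u32,
    pub hi: u32,
    pub cid: u32,
}

/// Looks up `code` in a table sorted by `lo` with non-overlapping ranges.
pub fn lookup_cid(table: &[CidRange], code: u32) -> Option<u32> {
    // Index of the first range whose `lo` is greater than `code`;
    // the candidate range, if any, is the one just before it.
    let idx = table.partition_point(|r| r.lo <= code);
    let range = table.get(idx.checked_sub(1)?)?;
    if code <= range.hi {
        Some(range.cid + (code - range.lo))
    } else {
        None
    }
}
```

With a table such as `[CidRange { lo: 0x20, hi: 0x7E, cid: 231 }, CidRange { lo: 0x8140, hi: 0x817E, cid: 633 }]`, the lookup costs one binary search regardless of how many codes each range covers. Splitting the byte string into codes (the codespace step) still has to happen before this lookup.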

4. Shift-JIS Encoding in Detail

Shift-JIS is a variable-width encoding:

Single-byte 0x00–0x7F: ASCII-compatible.
Single-byte 0xA1–0xDF: half-width katakana and punctuation (｡–ﾟ, U+FF61–U+FF9F, 63 characters). No second byte follows.
Lead bytes 0x81–0x9F and 0xE0–0xFC: introduce a 2-byte sequence. The trail byte range is 0x40–0x7E and 0x80–0xFC (i.e., anything except 0x7F).
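The byte classes above are enough to split a Shift-JIS string into character codes before any CMap lookup. A minimal sketch follows; `split_sjis` is a hypothetical helper, and rejecting the whole string on a bad trail byte is an assumption made for brevity (a real extractor would more likely emit a replacement and resynchronize).

```rust
/// Splits a Shift-JIS byte string into 1-byte and 2-byte character codes.
/// Returns `None` if a lead byte is followed by a missing or invalid trail byte.
pub fn split_sjis(bytes: &[u8]) -> Option<Vec<u16>> {
    let mut codes = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        match b {
            // Lead byte of a 2-byte sequence.
            0x81..=0x9F | 0xE0..=0xFC => {
                let trail = *bytes.get(i + 1)?;
                if !matches!(trail, 0x40..=0x7E | 0x80..=0xFC) {
                    return None;
                }
                codes.push(u16::from_be_bytes([b, trail]));
                i += 2;
            }
            // ASCII, half-width katakana, and any other single byte.
            _ => {
                codes.push(u16::from(b));
                i += 1;
            }
        }
    }
    Some(codes)
}
```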

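The 2-byte pairs map to 7-bit JIS X 0208 byte values (each in 0x21–0x7E) via the formulas below; subtract 0x20 from each result to obtain the 1-based row and column indices: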

row = (lead - (lead < 0xA0 ? 0x70 : 0xB0)) * 2 - (trail < 0x9F ? 1 : 0)
col = trail - (trail < 0x7F ? 0x1F : trail < 0x9F ? 0x20 : 0x7E)

Each JIS X 0208 cell maps to a Unicode codepoint via the published 94×94 table. The full Shift-JIS→Unicode mapping has approximately 6,879 entries.
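The conversion can be written out as a small function. This sketch applies the formulas given above and then subtracts 0x20 to return 1-based row/column (ku/ten) indices; `sjis_to_kuten` is a hypothetical name, and restricting lead bytes to 0x81–0x9F and 0xE0–0xEF is an assumption that limits the result to the 94 rows of JIS X 0208 proper (vendor extension rows are left to a separate table).

```rust
/// Converts a Shift-JIS lead/trail pair to 1-based JIS X 0208 (row, column).
/// Returns `None` for bytes outside the JIS X 0208 portion of Shift-JIS.
pub fn sjis_to_kuten(lead: u8, trail: u8) -> Option<(u8, u8)> {
    if !matches!(lead, 0x81..=0x9F | 0xE0..=0xEF) {
        return None;
    }
    if !matches!(trail, 0x40..=0x7E | 0x80..=0xFC) {
        return None;
    }
    // Each lead byte covers two rows; a trail byte below 0x9F selects the odd one.
    let row_base: u8 = if lead < 0xA0 { 0x70 } else { 0xB0 };
    let odd_row = trail < 0x9F;
    let jis_row = (lead - row_base) * 2 - odd_row as u8;
    let jis_col = trail
        - if trail < 0x7F {
            0x1F
        } else if trail < 0x9F {
            0x20
        } else {
            0x7E
        };
    // jis_row and jis_col are 7-bit JIS bytes (0x21..=0x7E); shift to 1..=94.
    Some((jis_row - 0x20, jis_col - 0x20))
}
```

For example, the pair 0x82 0xA0 (あ) yields row 4, column 2, and 0x88 0x9F (亜) yields row 16, column 1.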

CP932 (Windows Shift-JIS): adds NEC special characters (0x8740–0x879C), NEC-selected IBM extensions (0xED40–0xEEFC), and IBM extension characters (0xFA40–0xFC4B), and maps 0x5C to U+005C (backslash), which Japanese fonts often draw as a yen sign. pdftract should treat 90ms-RKSJ-H as CP932 specifically, not plain Shift-JIS, as it targets Windows-generated PDFs.

5. GB18030 Encoding

GB18030 is China's mandatory national standard since 2000. It is a multi-length encoding:

1-byte 0x00–0x7F: ASCII.
2-byte: lead 0x81–0xFE, trail 0x40–0xFE (excluding 0x7F). Covers GBK characters (~20,000 codepoints).
4-byte: lead 0x81–0xFE, second 0x30–0x39, third 0x81–0xFE, fourth 0x30–0x39. Covers the remainder of Unicode through plane 16.

The 4-byte space provides a linear mapping to Unicode codepoints via a range table: GB18030 4-byte values map to Unicode in monotonically increasing order, enabling binary search over ~1,787 range entries from Adobe's published GB18030→Unicode table.
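The byte ranges above imply a mixed-radix linear index (126 × 10 × 126 × 10 positions), which is the key used for the range-table search. The sketch below computes that index and decodes the supplementary-plane portion, which is a purely linear run starting at 0x90 0x30 0x81 0x30 (U+10000); the BMP portion still needs the range table and is not shown. Function names are hypothetical.

```rust
/// Linear position of a GB18030 4-byte sequence, or `None` if any byte
/// falls outside its allowed range.
pub fn gb18030_linear_index(b: [u8; 4]) -> Option<u32> {
    let valid = matches!(b[0], 0x81..=0xFE)
        && matches!(b[1], 0x30..=0x39)
        && matches!(b[2], 0x81..=0xFE)
        && matches!(b[3], 0x30..=0x39);
    if !valid {
        return None;
    }
    let (b1, b2, b3, b4) = (b[0] as u32, b[1] as u32, b[2] as u32, b[3] as u32);
    Some((((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30))
}

/// Decodes a 4-byte sequence that maps to a supplementary-plane code point.
/// Sequences below 0x90308130 (the BMP part) return `None` here.
pub fn gb18030_supplementary(b: [u8; 4]) -> Option<char> {
    // Linear index of 0x90 0x30 0x81 0x30, which maps to U+10000.
    const FIRST_SUPPLEMENTARY_INDEX: u32 = 189_000;
    let index = gb18030_linear_index(b)?;
    let offset = index.checked_sub(FIRST_SUPPLEMENTARY_INDEX)?;
    // `char::from_u32` rejects anything past U+10FFFF.
    char::from_u32(0x1_0000 + offset)
}
```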

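In PDF, Simplified Chinese content is identified by CIDSystemInfo with /Registry (Adobe) /Ordering (GB1); the ordering names the character collection, not the byte encoding. The predefined CMap GBK2K-H handles GB18030-encoded strings, while UniGB-UTF16-H maps UTF-16 codes to Adobe/GB1 CIDs. The GB1 character collection contains 30,284 glyphs as of supplement 5.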

6. Big5 and Big5-HKSCS

Big5 is a 2-byte encoding:

Lead bytes: 0xA1–0xFE.
Trail bytes: 0x40–0x7E and 0xA1–0xFE (gap at 0x7F–0xA0).
Total: ~13,053 Traditional Chinese characters, mapped to Unicode via the CNS 11643 standard.

The ETen extension (used by ETen-B5-H) adds characters at lead bytes 0xC6–0xC8 and 0xF9 ranges, commonly seen in Taiwanese documents.

Big5-HKSCS (Hong Kong Supplementary Character Set, 2016 edition) adds:

Characters in 0x8740–0xA0FE (lead bytes below the standard Big5 range).
Additional characters in 0xC6A1–0xC8FE.
Maps to Unicode including characters outside the BMP (requires surrogate pairs in UTF-16 or 4-byte UTF-8).

Detected via CIDSystemInfo /Ordering (CNS1). The CNS1 collection covers planes 1–7 of CNS 11643. pdftract should carry both the base Big5 mapping table and the HKSCS delta table (~5,000 additional entries).

7. ToUnicode CMaps for CJK

When present, a ToUnicode stream is the most reliable path to Unicode output. CJK ToUnicode CMaps commonly use beginbfrange to cover large contiguous blocks:

beginbfrange
<A1A1> <A1FE> [<U1> <U2> ... <U94>]   % row A1 of the 94×94 table
...
<F7A1> <F7FE> [...]                     % last row
endbfrange

Some CMaps use a simpler linear bfrange when the Unicode mapping is contiguous:

<4E00> <9FFF> <4E00>   % CJK Unified Ideographs: character code == Unicode codepoint
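A linear bfrange entry like the one above reduces to an offset calculation. The following sketch handles only the single-destination form with a BMP target; the array form and surrogate-pair destinations are deliberately left out. `BfRange` and `bfrange_lookup` are hypothetical names.

```rust
/// A linear `bfrange` entry: source codes `lo..=hi` map to consecutive
/// UTF-16 code units starting at `dst`.
pub struct BfRange {
    pub lo: u32,
    pub hi: u32,
    pub dst: u16,
}

/// Maps a character code through the first range that contains it.
/// Only BMP destinations are supported in this sketch.
pub fn bfrange_lookup(ranges: &[BfRange], code: u32) -> Option<char> {
    let range = ranges.iter().find(|r| (r.lo..=r.hi).contains(&code))?;
    let unit = u32::from(range.dst) + (code - range.lo);
    if unit > 0xFFFF {
        return None; // would overflow a single UTF-16 code unit
    }
    // `char::from_u32` also rejects lone surrogates (U+D800..=U+DFFF).
    char::from_u32(unit)
}
```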

Unicode block coverage to expect in CJK ToUnicode CMaps:

U+3040–U+309F: Hiragana
U+30A0–U+30FF: Katakana
U+4E00–U+9FFF: CJK Unified Ideographs
U+3400–U+4DBF: CJK Extension A
U+20000–U+2A6DF: CJK Extension B (requires surrogate pairs in UTF-16 bfrange entries)
U+AC00–U+D7AF: Hangul Syllables
U+F900–U+FAFF: CJK Compatibility Ideographs

Validate extracted CJK codepoints against these ranges relative to CIDSystemInfo /Ordering. A Japanese PDF should not produce Hangul; if it does, the CMap was probably misread. With the Unicode-keyed predefined CMaps (UniJIS-UTF16-H and its counterparts), which modern OpenType-based tools often use, the character code in the content stream is already UTF-16; ToUnicode is often omitted in these files and the code can be decoded directly to a Unicode scalar value. This shortcut does not apply to Identity-H, where the code is a CID with no inherent Unicode meaning.
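One cheap form of this check is to measure how much of a run falls in a script the collection should not produce. The sketch below only flags Hangul syllables outside Korea1, since kana and Latin letters legitimately occur in the Chinese and Korean national character sets; the function name and any threshold applied to its result are assumptions.

```rust
/// Fraction of characters in `text` that are implausible for the given
/// CIDSystemInfo /Ordering. Currently flags only Hangul syllables
/// (U+AC00..=U+D7AF) appearing under a non-Korean ordering.
pub fn unexpected_script_ratio(ordering: &str, text: &str) -> f32 {
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let unexpected = text
        .chars()
        .filter(|c| {
            let is_hangul = ('\u{AC00}'..='\u{D7AF}').contains(c);
            is_hangul && ordering != "Korea1"
        })
        .count();
    unexpected as f32 / total as f32
}
```

A caller might treat a ratio above some small threshold as a signal to retry decoding with a different CMap interpretation.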

8. Missing ToUnicode Recovery for CJK

Many CJK PDFs, especially older ones produced by Japanese or Chinese desktop publishing software, omit ToUnicode. Recovery requires:

Identify the character collection from CIDSystemInfo: Registry + Ordering + Supplement determines which Adobe CID table applies.
Look up CID in the compiled-in table: Adobe publishes CID-to-Unicode mapping files for each collection:
- Adobe-Japan1-UCS2.txt: ~14,664 entries mapping Japan1 CIDs to Unicode.
- Adobe-CNS1-UCS2.txt: ~18,964 entries for CNS1.
- Adobe-GB1-UCS2.txt: ~30,284 entries for GB1.
- Adobe-Korea1-UCS2.txt: ~18,352 entries for Korea1.

These files are freely redistributable. Compile each into a sorted &[(u16, u32)] static slice (CID → Unicode scalar). At runtime, binary-search by CID. For CIDs mapping to multiple Unicode codepoints (compatibility variants), store the primary mapping.
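The static-slice lookup described above is a few lines with the standard library. In this sketch the table contents are illustrative placeholders, not values copied from Adobe's files, and `cid_to_char` is a hypothetical name.

```rust
/// Illustrative stand-in for a generated table; real data would be produced
/// at build time from the collection's CID-to-Unicode mapping.
pub static SAMPLE_CID_TABLE: &[(u16, u32)] = &[
    (1, 0x0020),
    (2, 0x0021),
    (843, 0x3042),
];

/// Binary-searches a table sorted by CID and returns the primary mapping.
pub fn cid_to_char(table: &[(u16, u32)], cid: u16) -> Option<char> {
    let idx = table.binary_search_by_key(&cid, |&(c, _)| c).ok()?;
    char::from_u32(table[idx].1)
}
```

Storing the scalar as `u32` and converting with `char::from_u32` means a corrupted table entry produces `None` instead of undefined behavior.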

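For the largest tables (GB1, Japan1), each (u16, u32) entry occupies 8 bytes after padding, so a full table runs from roughly 100 KB to 250 KB. A memory-mapped file loaded once at startup is an alternative to a static array if binary size matters. The adobe-cid-tables crate is named here as a possible source of compiled-in data; confirm that it exists and is maintained before depending on it.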

9. Full-Width and Half-Width Normalization

CJK documents routinely mix full-width and half-width character forms:

Full-width ASCII/Latin: U+FF01 (！) through U+FF5E (～). Appear in Japanese text for typographic consistency.
Full-width currency symbols: U+FFE0–U+FFE6 (e.g., U+FFE5 ￥).
Half-width katakana: U+FF65–U+FF9F. Commonly appear in older Japanese documents and data entry.
Full-width katakana: U+30A0–U+30FF. The standard form in modern Japanese.

For search and indexing, apply NFKC normalization: full-width Latin → ASCII, half-width katakana → full-width katakana. This ensures Ａ (U+FF21) matches A (U+0041) in search.
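Full NFKC needs Unicode data tables (in Rust, typically via a crate such as unicode-normalization). As a dependency-free illustration of the full-width Latin part only, the sketch below folds U+FF01–U+FF5E and the ideographic space to their ASCII counterparts; it does not handle half-width katakana or any other NFKC mapping, and `fold_fullwidth_ascii` is a hypothetical name.

```rust
/// Folds full-width ASCII variants (U+FF01..=U+FF5E) and the ideographic
/// space (U+3000) to ASCII. This is a small subset of NFKC, not a replacement.
pub fn fold_fullwidth_ascii(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // The full-width block mirrors ASCII 0x21..=0x7E at a fixed offset.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            '\u{3000}' => ' ',
            other => other,
        })
        .collect()
}
```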

For display output, preserve the original forms. pdftract should expose a normalization flag; the default for its text extraction output should be to preserve, with NFKC normalization available as a post-processing step.

10. Vertical CJK Text Extraction

Japanese documents—books, newspapers, legal documents—frequently use vertical writing mode (top-to-bottom, right-to-left column order).

Detection:

The CMap name ends in -V (e.g., 90ms-RKSJ-V, UniJIS-UTF16-V). Check the /Encoding value in the Type 0 font dictionary.
The CMap stream contains /WMode 1 in its dictionary section.
The CTM (current transformation matrix) for text-drawing operators shows a 90° rotation (approximately [0 -1 1 0 tx ty] or [0 1 -1 0 tx ty]).
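The first two signals can be combined in a small predicate. This sketch gives an explicit /WMode from an embedded CMap priority over the name suffix, which is an assumption about precedence; the CTM-rotation signal is not covered because a rotation alone does not distinguish vertical writing from rotated horizontal text. `is_vertical` is a hypothetical name.

```rust
/// Decides whether a Type 0 font writes vertically.
/// `encoding_name` is the /Encoding name, if it is a name;
/// `cmap_wmode` is the /WMode value from an embedded CMap stream, if any.
pub fn is_vertical(encoding_name: Option<&str>, cmap_wmode: Option<i64>) -> bool {
    // An explicit /WMode is taken as authoritative.
    if let Some(wmode) = cmap_wmode {
        return wmode == 1;
    }
    // Otherwise fall back to the predefined-CMap naming convention.
    encoding_name.map_or(false, |name| name.ends_with("-V"))
}
```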

Vertical glyph substitutions: vertical CMaps substitute specific glyphs, so brackets, parentheses, and punctuation rotate to their vertical forms. CIDs in the vertical range (e.g., Japan1 CIDs 8284–8285 for vertical brackets) should map to the same Unicode codepoint as their horizontal counterparts (U+FF08/U+FF09), not to a separate codepoint. Unicode does define compatibility presentation forms for vertical punctuation (e.g., U+FE35/U+FE36), but extracted text should carry the logical character rather than the presentation form.

Tate-chu-yoko (縦中横): short sequences of Latin characters or digits (e.g., "20", "AB") typeset horizontally inline within vertical text. These appear as a horizontal text run with a rotation in the CTM. Detect by the surrounding WMode context and the CTM rotation reversal; output the characters inline in logical order.

Column reconstruction: vertical Japanese text reads top-to-bottom within a column, and columns read right-to-left. After extracting character positions in PDF user space (where Y increases upward), group glyphs into columns by X position, order the columns by X descending (right column first), then sort by Y descending (top first) within each column. Expose writing_mode: "ttb" in the per-page metadata so downstream consumers can reflow correctly.
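A strict two-key sort breaks down when glyphs in one column differ slightly in X, so the sketch below first clusters by X within a tolerance and then sorts each column by Y. It assumes PDF user-space coordinates (Y increasing upward) and a caller-chosen tolerance; `Glyph` and `vertical_reading_order` are hypothetical names.

```rust
/// A positioned glyph in PDF user space (Y grows upward).
#[derive(Debug, Clone, PartialEq)]
pub struct Glyph {
    pub ch: char,
    pub x: f32,
    pub y: f32,
}

/// Returns the characters in vertical reading order: columns right to left,
/// top to bottom within each column.
pub fn vertical_reading_order(mut glyphs: Vec<Glyph>, column_tolerance: f32) -> String {
    // Right-most glyphs first.
    glyphs.sort_by(|a, b| b.x.total_cmp(&a.x));

    // Start a new column whenever X drifts past the tolerance from the
    // first glyph of the current column.
    let mut columns: Vec<Vec<Glyph>> = Vec::new();
    for glyph in glyphs {
        let start_new = match columns.last() {
            Some(col) => (col[0].x - glyph.x).abs() > column_tolerance,
            None => true,
        };
        if start_new {
            columns.push(Vec::new());
        }
        if let Some(col) = columns.last_mut() {
            col.push(glyph);
        }
    }

    let mut out = String::new();
    for col in &mut columns {
        // Larger Y is higher on the page, so sort descending.
        col.sort_by(|a, b| b.y.total_cmp(&a.y));
        out.extend(col.iter().map(|g| g.ch));
    }
    out
}
```

A reasonable starting tolerance is a fraction of the font size, though the right value depends on the document's layout.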

Implementation Priority

For pdftract, the recommended implementation order:

Embed predefined CMap lookup tables as static byte slices compiled from Adobe's published CMap resources.
Implement Shift-JIS (CP932) and EUC-JP decoders; these cover the majority of Japanese PDF traffic.
Implement GBK/GB18030 decoder for Chinese Simplified.
Implement Big5/ETen decoder for Chinese Traditional.
Implement EUC-KR/CP949 decoder for Korean.
Compile Adobe CID-to-Unicode tables as static sorted arrays for ToUnicode-absent recovery.
Add WMode detection and vertical text column sorting.
Expose normalization flags for full-width/half-width conversion.

Each encoding decoder should take a byte slice and the current position, return the next character code, and advance the position by 1, 2, or 4 bytes. Feed that code to the CMap lookup to obtain a CID, then pass the CID to the Unicode resolution layer, which yields Option<char> (or an iterator of char).
