# PDF Fonts and Encoding: Technical Reference for Text Extraction

This document describes every font type found in PDF files, how character codes are decoded to Unicode, and the data structures a Rust extraction engine must interpret. References are to the PDF 1.7 specification (ISO 32000-1:2008) and Adobe technical notes where applicable.

---

## 1. Font Types

### 1.1 Type 1 (Simple Font)

Type 1 fonts originate from the Adobe Type 1 format, distributed as PFB (binary) or PFA (ASCII) font programs. In a PDF the font dictionary has `/Subtype /Type1`.

**Glyph storage.** The font program is a PostScript charstring program. When embedded, it appears in the `/FontDescriptor` as the stream value of `/FontFile` (Type 1 format) or of `/FontFile3` with `/Subtype /Type1C` (CFF format). The charstrings are keyed by glyph name, not by a numeric glyph ID.

**Character code interpretation.** A one-byte character code from the content stream is mapped through the font's `/Encoding` to a glyph name, then the glyph name is looked up in the charstring dictionary. See §2 for encoding details.

**Widths.** The `/Widths` array contains `LastChar - FirstChar + 1` entries, each giving the horizontal advance width in glyph-space units (1000 units correspond to 1 unit of text space). `/FirstChar` and `/LastChar` define the range. Codes outside this range use `/MissingWidth` from the font descriptor. These entries are required except for the standard 14 fonts, where older files may omit them.

**Standard 14 fonts.** PDF readers are expected to handle the 14 standard Type 1 fonts (Helvetica, Times-Roman, Courier, Symbol, ZapfDingbats, and their variants) without an embedded font program, supplying the metrics and a suitable substitute font themselves. These fonts can nevertheless be embedded like any other font, and PDF 1.5 onward deprecates their special treatment, so an extractor must handle both cases.

### 1.2 Type 3 (Simple Font)

`/Subtype /Type3`. Glyphs are defined as PDF content streams directly in the font dictionary under `/CharProcs`, a dictionary from glyph name to stream. There is no external font program.

**Character code interpretation.** One-byte code → glyph name via `/Encoding` → content stream in `/CharProcs`.
Because glyph names are arbitrary (user-defined), there is often no reliable path to Unicode without a `/ToUnicode` CMap. If `/ToUnicode` is absent, extraction must fall back to glyph name heuristics or report the text as unresolvable.

**Widths.** `/Widths`, `/FirstChar`, `/LastChar` as in Type 1, except that the values are in the font's own glyph space. The required `/FontMatrix` entry maps glyph space to text space, so widths must be scaled by it. Type 1 fonts implicitly use `[0.001 0 0 0.001 0 0]`, but Type 3 fonts frequently use `[1 0 0 1 0 0]` with glyph streams drawn at full size.

### 1.3 TrueType (Simple Font)

`/Subtype /TrueType`. The embedded program is a TrueType font binary under `/FontFile2` in the font descriptor.

**Glyph storage.** Glyphs are stored by integer glyph ID (GID) inside the `glyf` table. The `cmap` table maps Unicode codepoints (or platform-specific codes) to GIDs.

**Character code interpretation.** One-byte code → glyph name via `/Encoding` → GID via the font's `cmap`. When the encoding is a standard PDF encoding (`WinAnsiEncoding`, `MacRomanEncoding`, etc.), the implementation maps code → glyph name → Unicode codepoint → GID using the `cmap` subtable for platform 3, encoding 1 (Windows Unicode BMP). If the font's `cmap` contains only a platform 1 (Macintosh) subtable, the glyph name is instead converted to a Mac OS Roman code and looked up there. This distinction is a common source of extraction errors.

**Widths.** Same `/Widths` array mechanism as Type 1. The TrueType `hmtx` table carries the advance widths of the font program itself; the PDF `/Widths` array should be consistent with it but may differ in broken documents.

### 1.4 Type 0 (Composite Font)

`/Subtype /Type0`. This is the container for multi-byte (CJK and other large character set) text. The font dictionary has:

- `/Encoding`: a CMap name (e.g., `Identity-H`) or a stream containing a CMap program.
- `/DescendantFonts`: a one-element array holding a CIDFont dictionary.

**Character code interpretation.** The multi-byte content stream codes are fed through the CMap named in `/Encoding`, which maps character codes to CIDs. The CIDFont then maps CIDs to GIDs.
See §4.

**Widths.** Widths are specified in the CIDFont descendant, not in the Type 0 dictionary itself.

### 1.5 CIDFont Type 0 (CFF-Based)

`/Subtype /CIDFontType0` inside a `/DescendantFonts` array. The font program is a CFF (Compact Font Format) font, which uses Type 2 charstrings, embedded under `/FontFile3` with `/Subtype /CIDFontType0C` or `/Subtype /OpenType`.

**Glyph storage.** CFF stores charstrings keyed by GID (integer index). GIDs map directly to charstrings; glyph names may or may not be present depending on the CFF variant.

**Widths.** The CIDFont dictionary uses `/DW` (default width, default 1000) and `/W` (array of per-CID widths). The `/W` array is a sequence of groups in one of two forms: `c [w1 w2 ...]` assigns the listed widths to consecutive CIDs starting at `c`, and `c1 c2 w` assigns the uniform width `w` to every CID from `c1` through `c2`.

### 1.6 CIDFont Type 2 (TrueType-Based)

`/Subtype /CIDFontType2`. The embedded program is a TrueType or OpenType/TT font under `/FontFile2` (TrueType) or `/FontFile3` with `/Subtype /OpenType`.

**CID-to-GID mapping.** The `/CIDToGIDMap` entry in the CIDFont dictionary is critical:

- If the value is the name `/Identity`, the CID is used directly as the GID (CID = GID). This is also the default when the entry is absent.
- Otherwise it is a stream of 2-byte entries: the GID for CID `n` is the 16-bit big-endian value at byte offset `2n`. A table covering all 65,536 CIDs is 2×65536 bytes long; a stream may be shorter if it covers fewer CIDs.

**Widths.** Same `/DW` and `/W` mechanism as CIDFont Type 0.

### 1.7 OpenType in PDF

OpenType fonts are embedded as `/FontFile3` streams with `/Subtype /OpenType`. An OpenType font may contain either CFF outlines (`CFF` table present → CIDFont Type 0) or TrueType outlines (`glyf` table present → CIDFont Type 2), and the handling follows the respective CIDFont rules. Simple Type 1 and TrueType font dictionaries may also embed their programs this way. The PDF spec does not define a separate font subtype for OpenType; it is identified by the `/Subtype` of the font file stream.

---

## 2. Encoding Mechanisms

### 2.1 Predefined Encodings

The PDF spec defines four named encodings for simple fonts (Annex D, PDF 1.7):

| Name | Character set | Typical use |
|------|--------------|-------------|
| `StandardEncoding` | Adobe standard Latin character set | Default for Type 1 fonts that omit `/Encoding` |
| `MacRomanEncoding` | Mac OS Roman 256 code points | Older Mac-generated PDFs |
| `WinAnsiEncoding` | Windows-1252 (cp1252) | Windows-generated PDFs; most common |
| `MacExpertEncoding` | Expert font character set (fractions, small caps) | Rare; expert-set fonts |

`PDFDocEncoding` is a PDF-internal encoding used for text strings outside content streams (info dictionary entries, annotation contents, bookmark titles) but **not** for font encoding; it must not be confused with font encodings. It is based on Latin-1 and assigns additional characters to codes in 0x18–0x1F and 0x80–0x9F, which Latin-1 reserves for control codes.

`Symbol` and `ZapfDingbats` fonts use built-in symbol encodings defined in the respective AFM files. They do **not** use the standard named encodings; their code-to-glyph mapping is font-specific and must be looked up against the tables provided in PDF Annex D.

### 2.2 The `/Encoding` Dictionary and `/Differences` Array

When a font's `/Encoding` value is a dictionary rather than a name, the dictionary may contain:

- `/Type /Encoding` (optional)
- `/BaseEncoding`: a name (`MacRomanEncoding`, `MacExpertEncoding`, or `WinAnsiEncoding`) designating the starting table. If absent, the base depends on the font: an embedded font program or a symbolic font uses the font's built-in encoding; a nonsymbolic font without an embedded program uses `StandardEncoding`.
- `/Differences`: an array of the form `[code name code name ...]` or `[code name name name ...]`. Each number sets the current code; each following name overrides the slot at the current code, which then advances by one. Example: `[32 /space /exclam /quotedbl]` overrides slots 32, 33, 34.

Encoding resolution algorithm for simple fonts:

1. Start from the BaseEncoding table.
2. Apply each `/Differences` entry, replacing the glyph name at the given code position.
3. Resolve each resulting glyph name to Unicode via the Adobe Glyph List (§5).

### 2.3 Symbol and ZapfDingbats

These two standard fonts carry the `Symbolic` flag (bit 3 of `/Flags` in the font descriptor). Their encoding is defined entirely by the glyph names in the font program; the predefined named encodings do not apply. Extraction must use the font's own encoding vector together with a glyph list: Symbol glyph names resolve through the AGL, whereas ZapfDingbats names (`a1`, `a2`, ...) are not in the main AGL table and need the separate ZapfDingbats glyph list that Adobe publishes alongside it. The ZapfDingbats glyph names are documented in the PDF spec Annex D.6.

---

## 3. ToUnicode CMaps

### 3.1 CMap Stream Format

A ToUnicode CMap is a PostScript-inspired stream embedded directly in the PDF. The structure (PDF §9.10.3):

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (UCS) def
  /Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0041> <0041>   % code 0x0041 → U+0041 (A)
<00A0> <00A0>   % code 0x00A0 → U+00A0 (no-break space)
<F001> <FB01>   % code 0xF001 → U+FB01 (fi ligature)
<F002> <FB02>   % code 0xF002 → U+FB02 (fl ligature)
endbfchar
1 beginbfrange
<0061> <007A> <0061>   % codes 0x61–0x7A → U+0061–U+007A (a–z)
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```

**`beginbfchar` / `endbfchar`:** Each entry is a pair `<srcCode> <dstString>`. The destination is UTF-16BE hex bytes; a surrogate pair encodes a codepoint above U+FFFF, and a destination may hold several code units, so one code can map to a multi-character string.

**`beginbfrange` / `endbfrange`:** The form `<srcLo> <srcHi> <dstLo>` maps a contiguous code range to a contiguous Unicode range starting at `<dstLo>`. Alternatively, `<srcLo> <srcHi> [<dst1> <dst2> ...]` maps each code in the range to the corresponding Unicode string in the array.

**`begincidrange` / `endcidrange`:** Used in the `/Encoding` CMaps of Type 0 fonts (not in ToUnicode CMaps) to map code ranges to CID ranges; see §4.

### 3.2 Embedding in PDF

The ToUnicode CMap appears as the value of the `/ToUnicode` key in the font dictionary (both simple and composite fonts). It is a stream object, usually with `/Filter /FlateDecode`.
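The `bfchar` and `bfrange` rules from §3.1 can be sketched as a small parser. This is a minimal illustration, not a complete CMap interpreter: the names `Tok`, `tokenize`, `code_of`, `units_of`, and `parse_to_unicode` are hypothetical, the input is assumed to be the already-decompressed stream text, and `<<`/`>>` dictionaries, `usecmap`, and code space ranges are ignored.

```rust
use std::collections::HashMap;

/// Lexical tokens of a CMap program that this sketch cares about.
enum Tok { Hex(Vec<u8>), Open, Close, Word(String) }

/// Splits CMap text into tokens and drops `%` comments.
fn tokenize(src: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            // A comment runs to the end of the line.
            '%' => while let Some(c) = chars.next() { if c == '\n' { break; } },
            '<' => {
                chars.next();
                let mut nibbles: Vec<u8> = Vec::new();
                while let Some(c) = chars.next() {
                    if c == '>' { break; }
                    if let Some(d) = c.to_digit(16) { nibbles.push(d as u8); }
                }
                // An odd digit count is padded with a trailing zero digit.
                if nibbles.len() % 2 == 1 { nibbles.push(0); }
                toks.push(Tok::Hex(nibbles.chunks(2).map(|p| (p[0] << 4) | p[1]).collect()));
            }
            '[' => { chars.next(); toks.push(Tok::Open); }
            ']' => { chars.next(); toks.push(Tok::Close); }
            c if c == '>' || c.is_whitespace() => { chars.next(); }
            _ => {
                let mut word = String::new();
                while let Some(&c) = chars.peek() {
                    if c.is_whitespace() || "<>[]%".contains(c) { break; }
                    word.push(c);
                    chars.next();
                }
                toks.push(Tok::Word(word));
            }
        }
    }
    toks
}

/// Reads bytes as one big-endian integer (the character code).
fn code_of(bytes: &[u8]) -> u32 {
    bytes.iter().fold(0, |acc, &b| (acc << 8) | u32::from(b))
}

/// Reads bytes as UTF-16BE code units; a lone trailing byte is zero-padded.
fn units_of(bytes: &[u8]) -> Vec<u16> {
    bytes.chunks(2)
        .map(|p| u16::from_be_bytes([p[0], p.get(1).copied().unwrap_or(0)]))
        .collect()
}

/// Builds a code → text table from the bfchar and bfrange sections.
fn parse_to_unicode(src: &str) -> HashMap<u32, String> {
    let toks = tokenize(src);
    let mut map = HashMap::new();
    let mut i = 0;
    while i < toks.len() {
        let section = match &toks[i] { Tok::Word(w) => w.as_str(), _ => "" };
        i += 1;
        if section == "beginbfchar" {
            // Entries are pairs: <src> <dst>.
            while let (Some(Tok::Hex(s)), Some(Tok::Hex(d))) = (toks.get(i), toks.get(i + 1)) {
                map.insert(code_of(s), String::from_utf16_lossy(&units_of(d)));
                i += 2;
            }
        } else if section == "beginbfrange" {
            while let (Some(Tok::Hex(lo)), Some(Tok::Hex(hi))) = (toks.get(i), toks.get(i + 1)) {
                let (lo, hi) = (code_of(lo), code_of(hi));
                i += 2;
                match toks.get(i) {
                    // <lo> <hi> <dst>: the last code unit increases by one per code.
                    Some(Tok::Hex(d)) => {
                        let base = units_of(d);
                        for (k, code) in (lo..=hi).enumerate() {
                            let mut units = base.clone();
                            if let Some(last) = units.last_mut() {
                                *last = last.wrapping_add(k as u16);
                            }
                            map.insert(code, String::from_utf16_lossy(&units));
                        }
                        i += 1;
                    }
                    // <lo> <hi> [<dst0> <dst1> ...]: one destination per code.
                    Some(Tok::Open) => {
                        i += 1;
                        let mut code = lo;
                        while let Some(Tok::Hex(d)) = toks.get(i) {
                            map.insert(code, String::from_utf16_lossy(&units_of(d)));
                            code = code.wrapping_add(1);
                            i += 1;
                        }
                        i += 1; // step over the closing bracket
                    }
                    _ => break,
                }
            }
        }
    }
    map
}
```

Calling `parse_to_unicode` on the decoded stream text yields a `HashMap<u32, String>` keyed by character code. A production parser would also bound the size of each range, since a hostile file can declare a range spanning the whole 32-bit code space.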
### 3.3 When ToUnicode is Absent or Wrong

**Absent:** Extraction must fall back to encoding → glyph name → AGL lookup (simple fonts) or CID-to-Unicode tables derived from the predefined CMap ordering (composite fonts). Many PDFs produced by older tools (TeX-based pipelines, some CAD exporters) omit `/ToUnicode`; for simple fonts the AGL fallback is then the main remaining route.

**Wrong or incomplete:** Some generators emit a `/ToUnicode` CMap with missing entries or incorrect mappings. A bfchar entry with destination `<0000>` or `<FFFD>` usually signals an intentionally unmapped glyph. An implementation should not blindly trust all mappings; NUL and replacement-character destinations should be treated as absent.

**Implications for extraction:** Without a `/ToUnicode` map, ligature glyphs named `fi`, `fl`, `ffi`, etc. resolve through the AGL to single presentation-form codepoints (see §5), while underscore-joined names such as `f_f_i` resolve to multi-character strings; either result is usually acceptable. Private Use Area (PUA) codepoints require a `/ToUnicode` map to resolve; without one the extracted text should preserve the PUA codepoint but flag it as unresolved.

---

## 4. CID-to-GID Mapping (Composite Fonts)

### 4.1 Decoding Path

For a Type 0 composite font, the decoding pipeline is:

```
content-stream bytes
  → CMap (named in /Encoding) → CID
  → GID (via CIDToGIDMap or CFF charset)
  → glyph outline
```

The `/Encoding` CMap converts multi-byte character codes (1–4 bytes) to CIDs. The CMap may be:

- A name referring to a predefined CMap (see §4.2).
- A stream object containing a CMap program.

### 4.2 Predefined CMaps

Adobe distributes predefined CMaps for CJK encodings (listed in PDF §9.7.5.2).
Key examples:

| Name | Script | Code space | Notes |
|------|--------|-----------|-------|
| `Identity-H` | any (horizontal) | 2-byte | CID = code (identity) |
| `Identity-V` | any (vertical) | 2-byte | CID = code, vertical writing |
| `90ms-RKSJ-H` | Japanese | Shift-JIS | Maps SJIS codes → Adobe-Japan1 CIDs |
| `GBK-EUC-H` | Simplified Chinese | GBK/EUC | Maps GBK → Adobe-GB1 CIDs |
| `UniGB-UTF16-H` | Simplified Chinese | UTF-16BE | Unicode input → Adobe-GB1 CIDs |
| `UniJIS-UTF16-H` | Japanese | UTF-16BE | Unicode input → Adobe-Japan1 CIDs |

For `Identity-H`/`Identity-V`, the CID equals the raw 2-byte code value, and if `/CIDToGIDMap /Identity`, the GID equals the CID. These are the simplest cases for TrueType-based CIDFonts.

### 4.3 CIDSystemInfo

Every CIDFont and its associated CMap must declare `/CIDSystemInfo`, a dictionary with `/Registry` (string), `/Ordering` (string), and `/Supplement` (integer). This identifies the CID character collection, e.g., Adobe-Japan1-6. The CIDFont and its CMap must share the same Registry and Ordering, except that the `Identity-H`/`Identity-V` CMaps are compatible with any collection. Implementations should use this to select fallback Unicode tables when `/ToUnicode` is absent (Adobe publishes CID→Unicode mappings for its standard collections).

---

## 5. Glyph Name to Unicode (Adobe Glyph List)

### 5.1 The AGL

The Adobe Glyph List (AGL, distributed as `glyphlist.txt`) maps glyph names to Unicode scalar values; `aglfn.txt` is the smaller companion list of names recommended for new fonts. An implementation should embed the AGL as a static hash table (approximately 4,000 entries).

**Algorithmic mapping** (AGL specification §2): The table lookup is one step of a longer procedure that also covers names absent from the table:

1. Drop everything from the first `.` onward (e.g., `A.sc` → `A`).
2. Split the remainder at underscores and map each component separately, concatenating the results (e.g., `f_f_i` → "ffi").
3. If the component is in the AGL table, use that mapping.
4. Otherwise, if the component is `uni` followed by one or more groups of four uppercase hex digits, each group is one UTF-16 code unit outside the surrogate range: `uni0041` → U+0041.
5. Otherwise, if the component is `u` followed by four to six uppercase hex digits, parse them as a Unicode scalar: `u1F600` → U+1F600.
6. If none of the above applies, the component is unmapped.

**Ligatures.** `fi` → U+FB01, `fl` → U+FB02, `ffi` → U+FB03, `ffl` → U+FB04.
These are single AGL entries mapping to single Unicode codepoints. Many extraction engines prefer to expand ligatures to their component characters (fi → "fi") for searchability; this is a policy choice, not a spec requirement.

**`.notdef`.** The glyph named `.notdef` is the fallback glyph for unmapped codes. It has no Unicode mapping. Extractors should silently skip or emit U+FFFD for `.notdef`.

**`afii` names.** Legacy glyph names starting with `afii` (e.g., `afii57506`) appear in older Arabic and Hebrew fonts. The AGL maps these to their correct Unicode codepoints; no special handling beyond AGL lookup is needed.

---

## 6. Font Descriptors

The `/FontDescriptor` dictionary (§9.8, PDF 1.7) is referenced by the font dictionary via `/FontDescriptor`. It provides metrics and the embedded font binary.

### 6.1 Key Entries

| Key | Type | Description |
|-----|------|-------------|
| `/FontName` | name | PostScript name of the font |
| `/FontBBox` | rectangle | Glyph bounding box in glyph-space units |
| `/Flags` | integer | Bitfield describing font characteristics |
| `/ItalicAngle` | number | Dominant italic angle in degrees |
| `/Ascent` | number | Maximum ascent above baseline |
| `/Descent` | number | Maximum descent below baseline (negative) |
| `/CapHeight` | number | Height of capital letters |
| `/XHeight` | number | Height of lowercase letters |
| `/StemV` | number | Dominant vertical stem width |
| `/FontFile` | stream | Type 1 PFB data |
| `/FontFile2` | stream | TrueType binary |
| `/FontFile3` | stream | CFF, OpenType, or CIDFontType0C binary (identified by stream `/Subtype`) |

### 6.2 Flags Bitfield

The `/Flags` integer is a 32-bit field; bits are numbered from 1 (LSB).
Key bits:

| Bit | Mask | Meaning |
|-----|------|---------|
| 1 | 0x0001 | FixedPitch |
| 2 | 0x0002 | Serif |
| 3 | 0x0004 | Symbolic: font uses a private encoding; standard encodings do not apply |
| 4 | 0x0008 | Script (cursive) |
| 6 | 0x0020 | Nonsymbolic: font uses a standard Latin encoding |
| 7 | 0x0040 | Italic |
| 17 | 0x10000 | AllCap |
| 18 | 0x20000 | SmallCap |
| 19 | 0x40000 | ForceBold |

The `Symbolic` (bit 3) and `Nonsymbolic` (bit 6) flags are mutually exclusive and affect encoding resolution: a symbolic font's encoding is its own built-in table; a nonsymbolic font follows the standard named encoding fallback rules.

### 6.3 Inferring Unicode When CMap Data Is Absent

When both `/ToUnicode` and a useful `/Encoding` are missing, the following heuristics apply, in order:

1. If the embedded font is TrueType (`/FontFile2`) and the `/Flags` `Nonsymbolic` bit is set, use the font's `cmap` table with the `WinAnsiEncoding` assumption (platform 3, encoding 1).
2. If the font is name-keyed CFF (`/FontFile3` with `/Subtype /Type1C`), the CFF `charset` table may supply glyph names; apply AGL. In CID-keyed CFF (`/Subtype /CIDFontType0C`) the `charset` holds CIDs rather than names, so this step does not apply.
3. If `/FontName` identifies a known standard font (e.g., `Symbol`, `ZapfDingbats`), apply the font-specific encoding table from PDF Annex D.
4. Otherwise, emit PUA codepoints or U+FFFD and flag the text as requiring post-processing.

The font descriptor `/FontBBox` and `/Flags` provide no path to Unicode; they are useful only for layout heuristics (detecting whitespace, line boundaries) when Unicode resolution fails.
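The flag numbering and the ordered heuristics of §6.3 can be sketched as follows. This is an illustrative outline only: `flag_mask`, `has_flag`, `FontProgram`, `Fallback`, and `choose_fallback` are hypothetical names, the caller is assumed to have already determined which font program is embedded, and the stripping of a six-letter subset tag (`ABCDEF+Name`) is an added assumption about how `/FontName` values appear in practice.

```rust
/// Mask for a 1-based `/Flags` bit number, matching the numbering in §6.2.
/// `bit` must be in 1..=32.
fn flag_mask(bit: u32) -> u32 {
    1 << (bit - 1)
}

/// True when the given 1-based bit is set in `flags`.
fn has_flag(flags: u32, bit: u32) -> bool {
    (flags & flag_mask(bit)) != 0
}

/// Bit number of the Nonsymbolic flag.
const NONSYMBOLIC_BIT: u32 = 6;

/// Kind of embedded font program found in the descriptor (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum FontProgram {
    None,
    TrueType, // /FontFile2
    Cff,      // /FontFile3 carrying CFF data
}

/// Which fallback route to try, in the order given in §6.3.
#[derive(Debug, PartialEq)]
enum Fallback {
    TrueTypeCmap,
    CffCharsetViaAgl,
    StandardFontTable,
    Unresolved,
}

fn choose_fallback(flags: u32, program: FontProgram, font_name: &str) -> Fallback {
    // Assumption: subset fonts carry a six-letter uppercase tag and '+' before the name.
    let base_name = match font_name.split_once('+') {
        Some((tag, rest)) if tag.len() == 6 && tag.bytes().all(|b| b.is_ascii_uppercase()) => rest,
        _ => font_name,
    };
    if program == FontProgram::TrueType && has_flag(flags, NONSYMBOLIC_BIT) {
        Fallback::TrueTypeCmap
    } else if program == FontProgram::Cff {
        Fallback::CffCharsetViaAgl
    } else if matches!(base_name, "Symbol" | "ZapfDingbats") {
        Fallback::StandardFontTable
    } else {
        Fallback::Unresolved
    }
}
```

Returning an enum rather than performing the lookup keeps the decision separate from the table access, so each route can be tested on its own and the `Unresolved` case can be logged for post-processing.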

---

## Appendix: Key Dictionary Locations

```
/Font dictionary
  /Subtype         → Type1 | Type3 | TrueType | Type0 | CIDFontType0 | CIDFontType2
  /Encoding        → name or dictionary (simple); CMap name or stream (Type0)
  /ToUnicode       → stream (CMap program)
  /FontDescriptor  → dictionary
    /Flags         → integer (bitfield)
    /FontFile      → stream (Type 1)
    /FontFile2     → stream (TrueType)
    /FontFile3     → stream (CFF/OpenType; /Subtype in stream dict)
  /Widths          → array (simple fonts)
  /FirstChar       → integer
  /LastChar        → integer
  /DescendantFonts → array [ CIDFont dict ] (Type0 only)

CIDFont dictionary (inside /DescendantFonts)
  /Subtype         → CIDFontType0 | CIDFontType2
  /CIDSystemInfo   → dict (/Registry /Ordering /Supplement)
  /DW              → integer (default advance width)
  /W               → array (per-CID widths)
  /CIDToGIDMap     → /Identity or stream (CIDFontType2 only)
  /FontDescriptor  → dictionary (as above)
```

---

*Spec references: ISO 32000-1:2008 §9 (Fonts), Annex D (Character Sets and Encodings), §9.7.5.2 (Predefined CMaps); Adobe Glyph List Specification v1.7; Adobe Type 1 Font Format (Black Book); Adobe CMap and CIDFont Files Specification v1.0.*