pdftract/docs/research/pdf-fonts-and-encoding.md
jedarden c2870e6640 Add research docs and SDK invocation notes
Four research documents covering PDF spec fundamentals, font types and
encoding, glyph Unicode recovery, and tagged PDF structure/reading order.
SDK invocation notes with subprocess and HTTP examples for Python, Node.js,
Go, Ruby, Java, Rust, and Bash.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:33:34 -04:00

# PDF Fonts and Encoding: Technical Reference for Text Extraction
This document describes every font type found in PDF files, how character codes are decoded to Unicode, and the data structures a Rust extraction engine must interpret. References are to the PDF 1.7 specification (ISO 32000-1:2008) and Adobe technical notes where applicable.
---
## 1. Font Types
### 1.1 Type 1 (Simple Font)
Type 1 fonts originate from the Adobe Type 1 format, stored as PFB (binary) or PFA (ASCII) font programs. In a PDF the font dictionary has `/Subtype /Type1`.
**Glyph storage.** The font program is a PostScript charstring program. When embedded, it appears under `/FontDescriptor` as the stream value of `/FontFile` (Type 1 binary). The charstrings are keyed by glyph name, not by a numeric glyph ID.
**Character code interpretation.** A one-byte character code from the content stream is mapped through the font's `/Encoding` to a glyph name, then the glyph name is looked up in the charstring dictionary. See §2 for encoding details.
**Widths.** The `/Widths` array (required except for the standard 14 fonts) contains `LastChar - FirstChar + 1` entries, each giving the horizontal advance width in glyph-space units (1/1000 of a text-space unit). `/FirstChar` and `/LastChar` define the range. Codes outside this range use `/MissingWidth` from the font descriptor.
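The width lookup described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `SimpleFontWidths` struct and its field names are hypothetical, and falling back to `/MissingWidth` for a too-short `/Widths` array is an assumption about how to treat malformed files.

```rust
/// Width data for a simple font, in the units of the /Widths array.
/// Hypothetical type for illustration; not taken from any library.
pub struct SimpleFontWidths {
    pub first_char: u32,
    pub last_char: u32,
    pub widths: Vec<f32>,
    /// /MissingWidth from the font descriptor (0 when the key is absent).
    pub missing_width: f32,
}

impl SimpleFontWidths {
    /// Returns the advance width for a one-byte character code.
    pub fn advance(&self, code: u8) -> f32 {
        let code = u32::from(code);
        if code < self.first_char || code > self.last_char {
            return self.missing_width;
        }
        let index = (code - self.first_char) as usize;
        // Assumption: a /Widths array shorter than declared also falls back
        // to /MissingWidth instead of being treated as a hard error.
        self.widths.get(index).copied().unwrap_or(self.missing_width)
    }
}
```

With `/FirstChar 32`, `/LastChar 34` and `/Widths [250 333 408]`, code 33 resolves to 333 and code 31 resolves to the missing width.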
**Standard 14 fonts.** PDF readers must support the 14 standard Type 1 fonts (Helvetica, Times-Roman, Courier, Symbol, ZapfDingbats, and their variants) without an embedded font program. These fonts are frequently not embedded, in which case the reader supplies the metrics and glyphs itself.
### 1.2 Type 3 (Simple Font)
`/Subtype /Type3`. Glyphs are defined as PDF content streams directly in the font dictionary under `/CharProcs`, a dictionary from glyph name to stream. There is no external font program.
**Character code interpretation.** One-byte code → glyph name via `/Encoding` → content stream in `/CharProcs`. Because glyph names are arbitrary (user-defined), there is often no reliable path to Unicode without a `/ToUnicode` CMap. If `/ToUnicode` is absent, extraction must fall back to glyph name heuristics or report the text as unresolvable.
**Widths.** `/Widths`, `/FirstChar`, `/LastChar` as in Type 1. Additionally, `/FontMatrix` (required for Type 3) transforms glyph-space coordinates to text space. Type 1 fonts implicitly use `[0.001 0 0 0.001 0 0]`, but Type 3 fonts may use other matrices such as `[1 0 0 1 0 0]` with glyph streams drawn at full size, so their `/Widths` values must be scaled by the font's own matrix.
### 1.3 TrueType (Simple Font)
`/Subtype /TrueType`. The embedded program is a TrueType font binary under `/FontFile2` in the font descriptor.
**Glyph storage.** Glyphs are stored by integer glyph ID (GID) inside the `glyf` table. The `cmap` table maps Unicode codepoints (or platform-specific codes) to GIDs.
**Character code interpretation.** One-byte code → glyph name via `/Encoding` → GID via the font's `cmap`. When the encoding is a standard PDF encoding (WinAnsiEncoding, MacRomanEncoding, etc.), the implementation maps code → Unicode codepoint → GID using `cmap` platform/encoding subtable (platform 3, encoding 1: Windows Unicode BMP). If the font's `cmap` contains only platform 1 (Macintosh), platform-specific code mappings apply. This is a common source of extraction errors.
**Widths.** Same `/Widths` array mechanism as Type 1. The `hmtx` TrueType table provides the authoritative advance widths; the PDF `/Widths` array should match but may differ in broken documents.
### 1.4 Type 0 (Composite Font)
`/Subtype /Type0`. This is the container for multi-byte (CJK and other large character set) text. The font dictionary has:
- `/Encoding` — a CMap name (e.g., `Identity-H`) or a stream containing a CMap program.
- `/DescendantFonts` — a one-element array holding a CIDFont dictionary.
**Character code interpretation.** The multi-byte content stream codes are fed through the CMap named in `/Encoding`, which maps character codes to CIDs. The CIDFont then maps CIDs to GIDs. See §4.
**Widths.** Widths are specified in the CIDFont descendant, not in the Type 0 dictionary itself.
### 1.5 CIDFont Type 0 (CFF-Based)
`/Subtype /CIDFontType0` inside a `/DescendantFonts` array. The font program is a CFF (Compact Font Format) font, whose glyphs are Type 2 charstrings, embedded under `/FontFile3` with `/Subtype /CIDFontType0C` or `/Subtype /OpenType`.
**Glyph storage.** CFF stores charstrings keyed by GID (integer index). GIDs map directly to charstrings; glyph names may or may not be present depending on the CFF variant.
**Widths.** The CIDFont dictionary uses `/DW` (default width, default 1000) and `/W` (array of per-CID widths). The `/W` array is a sequence of groups in either of two forms: `c [w1 w2 ...]` assigns the listed widths to consecutive CIDs starting at `c`, and `c_first c_last w` assigns the single width `w` to every CID in the range.
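A `/W` array can be expanded into a per-CID table roughly as follows. This is a sketch under stated assumptions: `Obj` is a hypothetical stand-in for a parsed PDF object (a real parser would have a richer type), and rejecting CIDs above 0xFFFF is a defensive choice made here, not something the text above prescribes.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a parsed PDF object (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Obj {
    Num(f64),
    Array(Vec<Obj>),
}

/// Expands a /W array into a CID -> width map.
/// Returns None if the array follows neither of the two forms.
pub fn parse_w_array(w: &[Obj]) -> Option<HashMap<u32, f64>> {
    let mut out = HashMap::new();
    let mut i = 0;
    while i < w.len() {
        let first = match &w[i] {
            Obj::Num(n) => *n as u32,
            _ => return None,
        };
        if first > 0xFFFF {
            return None; // assumption: CIDs are limited to 16 bits
        }
        match w.get(i + 1)? {
            // Form 1: c [w1 w2 ...] covers consecutive CIDs starting at c.
            Obj::Array(list) => {
                for (k, item) in list.iter().enumerate() {
                    match item {
                        Obj::Num(width) => {
                            out.insert(first + k as u32, *width);
                        }
                        _ => return None,
                    }
                }
                i += 2;
            }
            // Form 2: c_first c_last w gives one width to the whole range.
            Obj::Num(last) => {
                let last = *last as u32;
                if last < first || last > 0xFFFF {
                    return None;
                }
                let width = match w.get(i + 2)? {
                    Obj::Num(width) => *width,
                    _ => return None,
                };
                for cid in first..=last {
                    out.insert(cid, width);
                }
                i += 3;
            }
        }
    }
    Some(out)
}
```

CIDs missing from the resulting map take the `/DW` value.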
### 1.6 CIDFont Type 2 (TrueType-Based)
`/Subtype /CIDFontType2`. The embedded program is a TrueType or OpenType/TT font under `/FontFile2` (TrueType) or `/FontFile3` with `/Subtype /OpenType`.
**CID-to-GID mapping.** The `/CIDToGIDMap` entry in the CIDFont dictionary is critical:
- If the value is the name `/Identity`, CID equals GID directly (CID = GID).
- Otherwise it is a stream of 2-byte entries: the GID for CID `n` is the 16-bit big-endian value at byte offset `2n`. A table covering every 2-byte CID is 2×65536 bytes, but the stream may be shorter.
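The two cases can be modelled as a small enum. This is an illustrative sketch: the `CidToGidMap` type is hypothetical, the stream is assumed to be already decoded (filters removed), and mapping CIDs beyond the end of the stream to GID 0 is an assumption.

```rust
/// The two forms of /CIDToGIDMap (hypothetical type for illustration).
pub enum CidToGidMap {
    /// The name /Identity: GID equals CID.
    Identity,
    /// Decoded stream bytes: one big-endian u16 per CID.
    Table(Vec<u8>),
}

impl CidToGidMap {
    /// Returns the glyph ID for a CID.
    pub fn gid(&self, cid: u16) -> u16 {
        match self {
            CidToGidMap::Identity => cid,
            CidToGidMap::Table(bytes) => {
                let offset = 2 * usize::from(cid);
                match bytes.get(offset..offset + 2) {
                    Some(pair) => u16::from_be_bytes([pair[0], pair[1]]),
                    // Assumption: a CID past the end of the stream maps to
                    // GID 0, the conventional .notdef glyph.
                    None => 0,
                }
            }
        }
    }
}
```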
**Widths.** Same `/DW` and `/W` mechanism as CIDFont Type 0.
### 1.7 OpenType in PDF
OpenType fonts are embedded as `/FontFile3` streams with `/Subtype /OpenType`. An OpenType font may contain either CFF outlines (`CFF` table present → CIDFont Type 0) or TrueType outlines (`glyf` table present → CIDFont Type 2). The handling follows the respective CIDFont rules. The PDF spec does not treat OpenType as a separate subtype; it is identified by the stream subtype.
---
## 2. Encoding Mechanisms
### 2.1 Predefined Encodings
The PDF spec defines four named encodings for simple fonts (§D.1–D.4, PDF 1.7):
| Name | Character set | Typical use |
|------|--------------|-------------|
| `StandardEncoding` | 229 glyphs from the Adobe standard | Default for Type 1 fonts that omit `/Encoding` |
| `MacRomanEncoding` | Mac OS Roman 256 code points | Older Mac-generated PDFs |
| `WinAnsiEncoding` | Windows-1252 (cp1252) | Windows-generated PDFs; most common |
| `MacExpertEncoding` | Expert font character set (fractions, small caps) | Rare; expert-set fonts |
`PDFDocEncoding` is a PDF-internal encoding used for text strings in the document catalog (info dictionary, annotations) but **not** for font encoding; it must not be confused with font encodings. It extends Latin-1 by filling 0x18–0x1F and 0x80–0x9F with additional characters.
`Symbol` and `ZapfDingbats` fonts use built-in symbol encodings defined in the respective AFM files. They do **not** use the standard named encodings; their code-to-glyph mapping is private and must be looked up against the font-specific tables provided in PDF Annex D.
### 2.2 The `/Encoding` Dictionary and `/Differences` Array
When a font's `/Encoding` value is a dictionary rather than a name, the dictionary may contain:
- `/Type /Encoding` (optional)
- `/BaseEncoding` — a name (`MacRomanEncoding`, `MacExpertEncoding`, or `WinAnsiEncoding`) designating the starting table. If absent, the base is the font program's built-in encoding when the font is embedded or symbolic, and `StandardEncoding` for a non-embedded nonsymbolic font.
- `/Differences` — an array of the form `[code name code name ...]` or `[code name name name ...]`. Starting from the numeric code, each following name overrides successive slots. Example: `[32 /space /exclam /quotedbl]` overrides slots 32, 33, 34.
Encoding resolution algorithm for simple fonts:
1. Start from the BaseEncoding table.
2. Apply each `/Differences` entry, replacing the glyph name at the given code position.
3. Resolve each resulting glyph name to Unicode via the Adobe Glyph List (§5).
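Step 2 of the algorithm above might look like the following sketch. `DiffItem` and `apply_differences` are hypothetical names, and the base table is assumed to be a 256-slot list of optional glyph names that the caller has already filled from the base encoding.

```rust
/// One element of a /Differences array (hypothetical type).
pub enum DiffItem {
    Code(u8),
    Name(String),
}

/// Applies /Differences on top of a base encoding table. Each slot holds
/// a glyph name, or None when the code is unassigned.
pub fn apply_differences(base: &mut [Option<String>], diffs: &[DiffItem]) {
    // Code that the next name will be written to; unset until a code is seen.
    let mut next: Option<usize> = None;
    for item in diffs {
        match item {
            DiffItem::Code(c) => next = Some(usize::from(*c)),
            DiffItem::Name(name) => {
                // Assumption: names before the first code, or past the end
                // of the table, are ignored instead of reported as errors.
                if let Some(slot) = next {
                    if slot < base.len() {
                        base[slot] = Some(name.clone());
                    }
                    next = Some(slot + 1);
                }
            }
        }
    }
}
```

For `[32 /space /exclam /quotedbl]` this overwrites slots 32, 33 and 34 and leaves every other slot at its base-encoding value.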
### 2.3 Symbol and ZapfDingbats
These two standard fonts carry the `Symbolic` flag (bit 3 of `/Flags` in the font descriptor). Their encoding is defined entirely by the glyph names in the font program; the predefined named encodings do not apply. Extraction must use the AGL or the font's own encoding vector. ZapfDingbats glyph names are documented in the PDF spec Annex D.6.
---
## 3. ToUnicode CMaps
### 3.1 CMap Stream Format
A ToUnicode CMap is a PostScript-inspired stream embedded directly in the PDF. The structure (PDF §9.10.3):
```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (UCS) def
/Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0041> <0041> % code 0x41 → U+0041 (A)
<00A0> <00A0>
<F001> <FB01> % code 0xF001 → U+FB01 (fi ligature)
<F002> <FB02> % code 0xF002 → U+FB02 (fl ligature)
endbfchar
1 beginbfrange
<0061> <007A> <0061> % codes 0x61–0x7A → U+0061–U+007A (a–z)
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```
**`beginbfchar` / `endbfchar`:** Each entry is a pair `<src-code> <dst-unicode>`. The destination is UTF-16BE hex bytes; a surrogate pair encodes a codepoint above U+FFFF.
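Decoding a destination value can be sketched like this. The function name is hypothetical, the input is assumed to be the hex digits between the angle brackets with whitespace already removed, and returning `None` for malformed input is a choice made for this example.

```rust
/// Decodes the hex digits of a ToUnicode destination as UTF-16BE.
/// Returns None on a length that is not a multiple of four, on non-hex
/// characters, or on unpaired surrogates.
pub fn decode_utf16be_hex(hex: &str) -> Option<String> {
    if hex.len() % 4 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Every four hex digits form one UTF-16 code unit.
    let units: Option<Vec<u16>> = hex
        .as_bytes()
        .chunks(4)
        .map(|chunk| {
            let digits = std::str::from_utf8(chunk).ok()?;
            u16::from_str_radix(digits, 16).ok()
        })
        .collect();
    // String::from_utf16 combines surrogate pairs and rejects lone ones.
    String::from_utf16(&units?).ok()
}
```

`FB01` decodes to the single fi ligature character, `00660069` to the two-character string "fi", and `D83DDE00` to U+1F600.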
**`beginbfrange` / `endbfrange`:** Range `<lo> <hi> <start-unicode>` maps a contiguous code range to a contiguous Unicode range. Alternatively, `<lo> <hi> [<u1> <u2> ...]` maps each code in the range to the corresponding Unicode string in the array.
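The first range form can be expanded as in the sketch below. It covers only `<lo> <hi> <start-unicode>`, not the array form. `expand_bfrange` is a hypothetical helper, the destination is assumed to be already split into UTF-16 code units, and incrementing only the final code unit is the usual interpretation, stated here as an assumption.

```rust
/// Expands `<lo> <hi> <start>` into (code, text) pairs.
/// `start` holds the UTF-16 code units of the first destination.
pub fn expand_bfrange(lo: u32, hi: u32, start: &[u16]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let (last, prefix) = match start.split_last() {
        Some((last, prefix)) => (*last, prefix),
        None => return out,
    };
    if hi < lo || hi - lo > 0xFFFF {
        return out; // malformed or implausibly large range
    }
    for (offset, code) in (lo..=hi).enumerate() {
        // Only the final code unit is incremented; stop if it would overflow.
        let unit = match u16::try_from(offset).ok().and_then(|o| last.checked_add(o)) {
            Some(unit) => unit,
            None => break,
        };
        let mut units = prefix.to_vec();
        units.push(unit);
        // Entries that do not form valid UTF-16 are skipped.
        if let Ok(text) = String::from_utf16(&units) {
            out.push((code, text));
        }
    }
    out
}
```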
**`begincidrange` / `endcidrange`:** Used in Type 0 CMaps (not ToUnicode) to map codes to CID ranges; see §4.
### 3.2 Embedding in PDF
The ToUnicode CMap appears as the value of the `/ToUnicode` key in the font dictionary (both simple and composite fonts). It is a stream object, usually with `/Filter /FlateDecode`.
### 3.3 When ToUnicode is Absent or Wrong
**Absent:** Extraction must fall back to encoding → glyph name → AGL lookup (simple fonts) or CID-to-Unicode tables derived from the predefined CMap ordering (composite fonts). Many PDFs produced by older tools (TeX-based pipelines, some CAD exporters) omit `/ToUnicode`; the AGL fallback is the only reliable option.
**Wrong or incomplete:** Some generators emit a `/ToUnicode` CMap with missing entries or incorrect mappings. A bfchar entry with destination `<0000>` or `<FFFD>` signals an intentionally unmapped glyph. An implementation should not blindly trust all mappings; NUL and replacement-character destinations should be treated as absent.
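One way to apply that rule is a small filter run on each decoded destination before it is stored. The function name is hypothetical, and treating an empty destination as unmapped is an extra assumption.

```rust
/// Returns the destination text if it is usable, or None when it should be
/// treated as if the /ToUnicode entry were absent.
pub fn usable_destination(dst: &str) -> Option<&str> {
    // NUL and U+FFFD destinations signal an unmapped glyph.
    let unmapped = dst.is_empty()
        || dst.chars().all(|c| c == '\u{0000}' || c == '\u{FFFD}');
    if unmapped {
        None
    } else {
        Some(dst)
    }
}
```

A `None` result would send the code down the same fallback path as a font with no `/ToUnicode` at all.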
**Implications for extraction:** Without a `/ToUnicode` map, ligature glyphs (`fi`, `fl`, `ffi`, etc.) are decoded through the AGL to single presentation-form codepoints (U+FB01, U+FB02, U+FB03), which is faithful to the glyph but may need expansion for search (see §5.1). Private Use Area (PUA) codepoints require a `/ToUnicode` map to resolve; without one the extracted text should preserve the PUA codepoint but flag it as unresolved.
---
## 4. CID-to-GID Mapping (Composite Fonts)
### 4.1 Decoding Path
For a Type 0 composite font, the decoding pipeline is:
```
content-stream bytes
→ CMap (named in /Encoding)
→ CID
→ GID (via CIDToGIDMap or CFF index)
→ glyph outline
```
The `/Encoding` CMap converts multi-byte character codes (1–4 bytes) to CIDs. The CMap may be:
- A name referring to a predefined CMap (see §4.2).
- A stream object containing a CMap program.
### 4.2 Predefined CMaps
Adobe distributes predefined CMaps for CJK encodings (listed in ISO 32000-1 §9.7.5.2). Key examples:
| Name | Script | Code space | Notes |
|------|--------|-----------|-------|
| `Identity-H` | any (horizontal) | 2-byte | CID = code (identity) |
| `Identity-V` | any (vertical) | 2-byte | CID = code, vertical writing |
| `90ms-RKSJ-H` | Japanese | Shift-JIS | Maps SJIS codes → Adobe-Japan1 CIDs |
| `GBK-EUC-H` | Simplified Chinese | GBK/EUC | Maps GBK → Adobe-GB1 CIDs |
| `UniGB-UTF16-H` | Simplified Chinese | UTF-16BE | Unicode input → Adobe-GB1 CIDs |
| `UniJIS-UTF16-H` | Japanese | UTF-16BE | Unicode input → Adobe-Japan1 CIDs |
For `Identity-H`/`Identity-V`, the CID equals the raw 2-byte code value, and if `/CIDToGIDMap /Identity`, the GID equals the CID. These are the simplest cases for TrueType-based CIDFonts.
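For the identity case, splitting a string operand into CIDs needs no CMap tables at all. A minimal sketch, with a hypothetical function name and the assumption that a trailing odd byte is silently dropped:

```rust
/// Splits the bytes of a shown string into 2-byte big-endian codes.
/// Under Identity-H or Identity-V these codes are the CIDs themselves.
pub fn identity_cids(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2) // assumption: an incomplete final code is ignored
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}
```

When `/CIDToGIDMap` is also `/Identity`, the values returned here are glyph IDs as well, so the only remaining step for text extraction is the lookup from code to Unicode.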
### 4.3 CIDSystemInfo
Every CIDFont and its associated CMap must declare `/CIDSystemInfo`, a dictionary with `/Registry` (string), `/Ordering` (string), and `/Supplement` (integer). This identifies the CID character collection, e.g., Adobe-Japan1-6. The CIDFont and its CMap must share the same Registry and Ordering. Implementations should use this to select fallback Unicode tables when `/ToUnicode` is absent (Adobe publishes CID→Unicode mappings for its standard collections).
---
## 5. Glyph Name to Unicode (Adobe Glyph List)
### 5.1 The AGL
The Adobe Glyph List (AGL, distributed as `glyphlist.txt`) maps glyph names to Unicode scalar values; `aglfn.txt` (version 1.7) is the smaller companion list of names recommended for new fonts. An implementation should embed the AGL as a static hash table (approximately 4,000 entries).
**Algorithmic fallback** (AGL specification §2): If a glyph name is not in the AGL table:
1. Strip any trailing `.<suffix>` (e.g., `A.sc` → `A`).
2. If the name starts with `uni`, parse the following hex digits as UTF-16BE codepoint(s): `uni0041` → U+0041.
3. If the name starts with `u`, parse the following hex as a Unicode scalar: `u1F600` → U+1F600.
4. If none of the above, the glyph is unmapped.
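The fallback steps can be sketched as below. This is a simplified illustration: the AGL is passed in as a plain `HashMap` (a real build would embed the full table), the function names are hypothetical, and requiring uppercase hex digits with 4 to 6 digits after `u` is an assumption about how strictly to parse.

```rust
use std::collections::HashMap;

/// Resolves a glyph name to Unicode text: AGL lookup first, then the
/// algorithmic fallback. `agl` maps glyph names to their Unicode text.
pub fn glyph_name_to_unicode(name: &str, agl: &HashMap<&str, &str>) -> Option<String> {
    if let Some(text) = agl.get(name) {
        return Some((*text).to_string());
    }
    // Step 1: strip the suffix, then retry the table with the base name.
    let base = name.split('.').next().unwrap_or("");
    if base.is_empty() {
        return None; // covers ".notdef"
    }
    if let Some(text) = agl.get(base) {
        return Some((*text).to_string());
    }
    // Step 2: "uni" followed by groups of four hex digits (UTF-16 units).
    if let Some(hex) = base.strip_prefix("uni") {
        if !hex.is_empty() && hex.len() % 4 == 0 && is_upper_hex(hex) {
            let units: Vec<u16> = hex
                .as_bytes()
                .chunks(4)
                .filter_map(|c| std::str::from_utf8(c).ok())
                .filter_map(|s| u16::from_str_radix(s, 16).ok())
                .collect();
            return String::from_utf16(&units).ok();
        }
    }
    // Step 3: "u" followed by one Unicode scalar value in hex.
    if let Some(hex) = base.strip_prefix('u') {
        if (4..=6).contains(&hex.len()) && is_upper_hex(hex) {
            return u32::from_str_radix(hex, 16)
                .ok()
                .and_then(char::from_u32)
                .map(String::from);
        }
    }
    // Step 4: unmapped.
    None
}

fn is_upper_hex(s: &str) -> bool {
    s.bytes().all(|b| b.is_ascii_digit() || (b'A'..=b'F').contains(&b))
}
```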
**Ligatures.** `fi` → U+FB01, `fl` → U+FB02, `ffi` → U+FB03, `ffl` → U+FB04. These are single AGL entries mapping to single Unicode codepoints. Many extraction engines prefer to expand ligatures to their component characters (fi → "fi") for searchability; this is a policy choice, not a spec requirement.
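If the expansion policy is chosen, it can be applied as a post-processing pass over the decoded text. A small sketch with a hypothetical function name; it covers only the Latin f-ligatures U+FB00 to U+FB04 and leaves everything else untouched.

```rust
/// Replaces the Latin f-ligature presentation forms with their component
/// letters. Other characters are copied unchanged.
pub fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            other => out.push(other),
        }
    }
    out
}
```

Keeping this as a separate pass leaves the raw decoding faithful to the font and makes the policy easy to switch off.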
**`.notdef`.** The glyph named `.notdef` is the fallback glyph for unmapped codes. It has no Unicode mapping. Extractors should silently skip or emit U+FFFD for `.notdef`.
**`afii` names.** Legacy glyph names starting with `afii` (e.g., `afii57506`) appear in older Arabic and Hebrew fonts. The AGL maps these to their correct Unicode codepoints; no special handling beyond AGL lookup is needed.
---
## 6. Font Descriptors
The `/FontDescriptor` dictionary (§9.8, PDF 1.7) is referenced by the font dictionary via `/FontDescriptor`. It provides metrics and the embedded font binary.
### 6.1 Key Entries
| Key | Type | Description |
|-----|------|-------------|
| `/FontName` | name | PostScript name of the font |
| `/FontBBox` | rectangle | Glyph bounding box in glyph-space units |
| `/Flags` | integer | Bitfield describing font characteristics |
| `/ItalicAngle` | number | Dominant italic angle in degrees |
| `/Ascent` | number | Maximum ascent above baseline |
| `/Descent` | number | Maximum descent below baseline (negative) |
| `/CapHeight` | number | Height of capital letters |
| `/XHeight` | number | Height of lowercase letters |
| `/StemV` | number | Dominant vertical stem width |
| `/FontFile` | stream | Type 1 font program |
| `/FontFile2` | stream | TrueType binary |
| `/FontFile3` | stream | CFF, OpenType, or CIDFontType0C binary (identified by stream `/Subtype`) |
### 6.2 Flags Bitfield
The `/Flags` integer is a 32-bit field; bits are numbered from 1 (LSB). Key bits:
| Bit | Mask | Meaning |
|-----|------|---------|
| 1 | 0x0001 | FixedPitch |
| 2 | 0x0002 | Serif |
| 3 | 0x0004 | Symbolic — font uses a private encoding; standard encodings do not apply |
| 4 | 0x0008 | Script (cursive) |
| 6 | 0x0020 | Nonsymbolic — font uses a standard Latin encoding |
| 7 | 0x0040 | Italic |
| 17 | 0x10000 | AllCap |
| 18 | 0x20000 | SmallCap |
| 19 | 0x40000 | ForceBold |
The `Symbolic` (bit 3) and `Nonsymbolic` (bit 6) flags are mutually exclusive and affect encoding resolution: a symbolic font's encoding is its own built-in table; a nonsymbolic font follows the standard named encoding fallback rules.
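The bit numbering translates to masks of the form `1 << (bit - 1)`. The sketch below mirrors the table; the `FontFlags` wrapper and its constant names are hypothetical.

```rust
/// Wrapper around the /Flags integer from a font descriptor.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FontFlags(pub u32);

impl FontFlags {
    pub const FIXED_PITCH: u32 = 1 << 0; // bit 1
    pub const SERIF: u32 = 1 << 1; // bit 2
    pub const SYMBOLIC: u32 = 1 << 2; // bit 3
    pub const SCRIPT: u32 = 1 << 3; // bit 4
    pub const NONSYMBOLIC: u32 = 1 << 5; // bit 6
    pub const ITALIC: u32 = 1 << 6; // bit 7
    pub const ALL_CAP: u32 = 1 << 16; // bit 17
    pub const SMALL_CAP: u32 = 1 << 17; // bit 18
    pub const FORCE_BOLD: u32 = 1 << 18; // bit 19

    /// Mask for a 1-based bit number; `n` must be in 1..=32.
    pub const fn bit(n: u32) -> u32 {
        1 << (n - 1)
    }

    /// True if every bit in `mask` is set.
    pub fn has(self, mask: u32) -> bool {
        (self.0 & mask) == mask
    }
}
```

A descriptor with `/Flags 34` (0x22) is therefore a serif, nonsymbolic font.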
### 6.3 Inferring Unicode When CMap Data Is Absent
When both `/ToUnicode` and a useful `/Encoding` are missing, the following heuristics apply, in order:
1. If the embedded font is TrueType (`/FontFile2`) and the `/Flags` `Nonsymbolic` bit is set, use the font's `cmap` table with the `WinAnsiEncoding` assumption (platform 3, encoding 1).
2. If the font is name-keyed CFF (`/FontFile3` with `/Subtype /Type1C`), the CFF `charset` table may supply glyph names; apply AGL. A CID-keyed CFF (`/Subtype /CIDFontType0C`) stores CIDs there instead of names.
3. If `/FontName` identifies a known standard font (e.g., `Symbol`, `ZapfDingbats`), apply the font-specific encoding table from PDF Annex D.
4. Otherwise, emit PUA codepoints or U+FFFD and flag the text as requiring post-processing.
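The ordering of these heuristics can be captured in a small decision function. This is only a sketch of the control flow: `Program`, `Fallback`, and `choose_fallback` are hypothetical names, the font name is assumed to have any subset prefix already removed, and only `Symbol` and `ZapfDingbats` are checked in step 3.

```rust
/// Kind of embedded font program, as far as this sketch distinguishes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Program {
    TrueType,
    Cff,
    Other,
}

/// Which fallback path to use when /ToUnicode and /Encoding are unusable.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Fallback {
    TrueTypeCmap,      // step 1: the font's own cmap table
    CffCharsetAgl,     // step 2: CFF charset names through the AGL
    StandardFontTable, // step 3: font-specific table for a known font
    Unresolved,        // step 4: PUA or U+FFFD, flagged for post-processing
}

/// Picks the first heuristic that applies, in the order listed above.
pub fn choose_fallback(program: Program, nonsymbolic: bool, font_name: &str) -> Fallback {
    if program == Program::TrueType && nonsymbolic {
        Fallback::TrueTypeCmap
    } else if program == Program::Cff {
        Fallback::CffCharsetAgl
    } else if font_name == "Symbol" || font_name == "ZapfDingbats" {
        Fallback::StandardFontTable
    } else {
        Fallback::Unresolved
    }
}
```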
The font descriptor `/FontBBox` and `/Flags` provide no path to Unicode; they are useful only for layout heuristics (detecting whitespace, line boundaries) when Unicode resolution fails.
---
## Appendix: Key Dictionary Locations
```
/Font dictionary
  /Subtype          → Type1 | Type3 | TrueType | Type0 | CIDFontType0 | CIDFontType2
  /Encoding         → name or dictionary (simple); CMap name or stream (Type0)
  /ToUnicode        → stream (CMap program)
  /FontDescriptor   → dictionary
    /Flags          → integer (bitfield)
    /FontFile       → stream (Type 1)
    /FontFile2      → stream (TrueType)
    /FontFile3      → stream (CFF/OpenType; /Subtype in stream dict)
  /Widths           → array (simple fonts)
  /FirstChar        → integer
  /LastChar         → integer
  /DescendantFonts  → array [ CIDFont dict ] (Type0 only)

CIDFont dictionary (inside /DescendantFonts)
  /Subtype          → CIDFontType0 | CIDFontType2
  /CIDSystemInfo    → dict (/Registry /Ordering /Supplement)
  /DW               → integer (default advance width)
  /W                → array (per-CID widths)
  /CIDToGIDMap      → /Identity or stream (CIDFontType2 only)
  /FontDescriptor   → dictionary (as above)
```
---
*Spec references: ISO 32000-1:2008 §9 (Fonts), Annex D (Character Sets and Encodings), §9.7.5.2 (Predefined CMaps); Adobe Glyph List Specification v1.7; Adobe Type 1 Font Format (Black Book); Adobe CMap and CIDFont Files Specification v1.0.*