# PDF Fonts and Encoding: Technical Reference for Text Extraction

This document describes every font type found in PDF files, how character codes are decoded to Unicode, and the data structures a Rust extraction engine must interpret. References are to the PDF 1.7 specification (ISO 32000-1:2008) and Adobe technical notes where applicable.

---
## 1. Font Types

### 1.1 Type 1 (Simple Font)

Type 1 fonts originate from the Adobe Type 1 format, stored on disk as PFB (binary) or PFA (ASCII) font programs. In a PDF the font dictionary has `/Subtype /Type1`.

**Glyph storage.** The font program is a PostScript font program containing Type 1 charstrings. When embedded, it is the stream value of `/FontFile` in the font descriptor (or of `/FontFile3` with `/Subtype /Type1C` when stored as CFF). The charstrings are keyed by glyph name, not by a numeric glyph ID.

**Character code interpretation.** A one-byte character code from the content stream is mapped through the font's `/Encoding` to a glyph name, then the glyph name is looked up in the charstring dictionary. See §2 for encoding details.

**Widths.** The `/Widths` array contains `LastChar - FirstChar + 1` entries, each giving the horizontal advance width in glyph-space units (1/1000 of a text-space unit). `/FirstChar` and `/LastChar` define the range. Codes outside this range use `/MissingWidth` from the font descriptor (default 0). `/Widths` is required except for the standard 14 fonts.

**Standard 14 fonts.** PDF readers must support the 14 standard Type 1 fonts (Helvetica, Times-Roman, Courier, Symbol, ZapfDingbats, and their variants) without an embedded font program. These fonts are frequently left unembedded, in which case the font dictionary may omit `/Widths` and the reader supplies metrics from its own built-in data. A document may still embed them.
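
The `/Widths` lookup described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `SimpleFontWidths` struct and its field names are hypothetical and simply mirror the PDF keys, and it assumes the dictionary entries have already been parsed into plain Rust values.

```rust
/// Width-related entries of a simple font dictionary.
/// (Hypothetical struct; field names mirror the PDF keys.)
pub struct SimpleFontWidths {
    pub first_char: u32,    // /FirstChar
    pub last_char: u32,     // /LastChar
    pub widths: Vec<f64>,   // /Widths, in glyph-space units
    pub missing_width: f64, // /MissingWidth from the font descriptor (0 if absent)
}

impl SimpleFontWidths {
    /// Advance width for a one-byte character code, in glyph-space units.
    pub fn advance(&self, code: u8) -> f64 {
        let code = u32::from(code);
        if code < self.first_char || code > self.last_char {
            return self.missing_width;
        }
        let idx = (code - self.first_char) as usize;
        // A truncated /Widths array (broken file) also falls back to /MissingWidth.
        self.widths.get(idx).copied().unwrap_or(self.missing_width)
    }
}
```

Falling back to `/MissingWidth` for a truncated array is a defensive choice made for this sketch, not something the spec prescribes.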
### 1.2 Type 3 (Simple Font)

`/Subtype /Type3`. Glyphs are defined as PDF content streams directly in the font dictionary under `/CharProcs`, a dictionary mapping glyph names to streams. There is no external font program.

**Character code interpretation.** One-byte code → glyph name via `/Encoding` → content stream in `/CharProcs`. Because glyph names are arbitrary (user-defined), there is often no reliable path to Unicode without a `/ToUnicode` CMap. If `/ToUnicode` is absent, extraction must fall back to glyph name heuristics or report the text as unresolvable.

**Widths.** `/Widths`, `/FirstChar`, `/LastChar` as in Type 1, except that the widths are expressed in the font's own glyph space. The required `/FontMatrix` maps glyph space to text space: Type 1 fonts always use `[0.001 0 0 0.001 0 0]`, but a Type 3 font may declare any matrix, for example `[1 0 0 1 0 0]` with glyph streams drawn at full size. Widths must therefore be scaled by the font's own matrix rather than by an assumed 1/1000.
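
As a small illustration of the scaling rule above, the following sketch converts a `/Widths` entry to a text-space advance. The function name is hypothetical, and it assumes horizontal writing and a `/FontMatrix` already parsed into six numbers.

```rust
/// Horizontal advance in text-space units for one glyph.
/// `width` comes from /Widths (glyph space); `font_matrix` is the
/// six-number /FontMatrix array [a b c d e f].
pub fn advance_in_text_space(width: f64, font_matrix: [f64; 6]) -> f64 {
    // The displacement vector (width, 0) transformed by the matrix has
    // x component width * a; the translation (e, f) does not apply to vectors.
    width * font_matrix[0]
}
```

With the Type 1 matrix a width of 500 becomes 0.5 text-space units, while with `[1 0 0 1 0 0]` a width of 0.5 is already in text space.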
### 1.3 TrueType (Simple Font)

`/Subtype /TrueType`. The embedded program is a TrueType font binary under `/FontFile2` in the font descriptor.

**Glyph storage.** Glyphs are stored by integer glyph ID (GID) inside the `glyf` table. The `cmap` table maps Unicode codepoints (or platform-specific codes) to GIDs.

**Character code interpretation.** For a nonsymbolic font: one-byte code → glyph name via `/Encoding` → GID via the font's `cmap`. When the encoding is a standard PDF encoding (`WinAnsiEncoding`, `MacRomanEncoding`, optionally with `/Differences`), the implementation maps glyph name → Unicode codepoint (via the AGL) → GID using the `cmap` subtable for platform 3, encoding 1 (Windows Unicode BMP). If the font's `cmap` offers only a platform 1 (Macintosh) subtable, the glyph name is instead mapped to its Mac OS Roman code and looked up there. A symbolic font skips glyph names and looks the code up directly in the (3, 0) or (1, 0) subtable. Mismatches between the declared encoding and the subtables actually present are a common source of extraction errors.

**Widths.** Same `/Widths` array mechanism as Type 1. The PDF `/Widths` array is what positions text on the page; it should match the advance widths in the font's `hmtx` table but may differ in broken documents.
### 1.4 Type 0 (Composite Font)

`/Subtype /Type0`. This is the container for multi-byte (CJK and other large character set) text. The font dictionary has:

- `/Encoding`: a CMap name (e.g., `Identity-H`) or a stream containing a CMap program.
- `/DescendantFonts`: a one-element array holding a CIDFont dictionary.

**Character code interpretation.** The multi-byte content stream codes are fed through the CMap named in `/Encoding`, which maps character codes to CIDs. The CIDFont then maps CIDs to GIDs. See §4.

**Widths.** Widths are specified in the CIDFont descendant, not in the Type 0 dictionary itself.
### 1.5 CIDFont Type 0 (CFF-Based)

`/Subtype /CIDFontType0` inside a `/DescendantFonts` array. The font program is a CFF (Compact Font Format) font, which uses Type 2 charstrings, embedded under `/FontFile3` with `/Subtype /CIDFontType0C` or `/Subtype /OpenType`.

**Glyph storage.** CFF stores charstrings keyed by GID (integer index). In a CID-keyed CFF font the `charset` table records the CID of each GID instead of a glyph name, so glyph names are present only in name-keyed CFF fonts.

**Widths.** The CIDFont dictionary uses `/DW` (default width; 1000 if absent) and `/W` (array of per-CID widths). The `/W` array is a sequence of groups, each in one of two forms: `c [w1 w2 ...]` assigns widths to consecutive CIDs starting at `c`, and `c1 c2 w` assigns the uniform width `w` to every CID from `c1` through `c2`.
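
The two `/W` group forms can be expanded into a per-CID table roughly as follows. This is a sketch under stated assumptions: `WItem`, `expand_w`, and `cid_width` are hypothetical names, and the array is assumed to have been parsed already into numbers and nested number arrays.

```rust
use std::collections::HashMap;

/// One element of a parsed /W array (hypothetical representation).
#[derive(Debug, Clone, PartialEq)]
pub enum WItem {
    Num(f64),
    Array(Vec<f64>),
}

/// Expand a /W array into a CID → width table.
pub fn expand_w(items: &[WItem]) -> Result<HashMap<u32, f64>, String> {
    let mut widths = HashMap::new();
    let mut i = 0;
    while i < items.len() {
        let first = match &items[i] {
            WItem::Num(n) => *n as u32,
            WItem::Array(_) => return Err(format!("expected a CID at index {}", i)),
        };
        match items.get(i + 1) {
            // Form `c [w1 w2 ...]`: consecutive CIDs starting at c.
            Some(WItem::Array(ws)) => {
                for (k, w) in ws.iter().enumerate() {
                    widths.insert(first + k as u32, *w);
                }
                i += 2;
            }
            // Form `c1 c2 w`: one width for the whole range.
            Some(WItem::Num(last)) => {
                let last = *last as u32;
                let w = match items.get(i + 2) {
                    Some(WItem::Num(w)) => *w,
                    _ => return Err(format!("range at index {} lacks a width", i)),
                };
                for cid in first..=last {
                    widths.insert(cid, w);
                }
                i += 3;
            }
            None => return Err(format!("dangling CID at index {}", i)),
        }
    }
    Ok(widths)
}

/// Width for a CID, falling back to /DW.
pub fn cid_width(widths: &HashMap<u32, f64>, dw: f64, cid: u32) -> f64 {
    widths.get(&cid).copied().unwrap_or(dw)
}
```

For example, the array `[1 [500 600] 10 12 250]` yields width 500 for CID 1, 600 for CID 2, 250 for CIDs 10 through 12, and `/DW` for everything else.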
### 1.6 CIDFont Type 2 (TrueType-Based)

`/Subtype /CIDFontType2`. The embedded program is a TrueType font under `/FontFile2`, or an OpenType font with TrueType outlines under `/FontFile3` with `/Subtype /OpenType`.

**CID-to-GID mapping.** The `/CIDToGIDMap` entry in the CIDFont dictionary is critical:

- If the value is the name `/Identity` (also the default when the entry is absent), CID equals GID directly (CID = GID).
- Otherwise it is a stream holding one 2-byte entry per CID (at most 2×65536 bytes): the GID for CID `n` is the 16-bit big-endian value at byte offset `2n`.

**Widths.** Same `/DW` and `/W` mechanism as CIDFont Type 0.
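
A minimal sketch of the `/CIDToGIDMap` lookup described above. The `CidToGidMap` enum is a hypothetical representation, and the stream is assumed to be already decoded (filters removed). Treating a CID beyond the end of the stream as unmapped is a choice made for this sketch.

```rust
/// Parsed /CIDToGIDMap entry (hypothetical representation).
pub enum CidToGidMap {
    /// The name /Identity: GID = CID.
    Identity,
    /// Decoded stream bytes: 2 bytes per CID, big-endian.
    Stream(Vec<u8>),
}

impl CidToGidMap {
    /// GID for a CID, or None when the stream is too short to cover it.
    pub fn gid(&self, cid: u16) -> Option<u16> {
        match self {
            CidToGidMap::Identity => Some(cid),
            CidToGidMap::Stream(bytes) => {
                let off = 2 * usize::from(cid);
                let pair = bytes.get(off..off + 2)?;
                Some(u16::from_be_bytes([pair[0], pair[1]]))
            }
        }
    }
}
```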
### 1.7 OpenType in PDF

OpenType fonts are embedded as `/FontFile3` streams with `/Subtype /OpenType`. An OpenType font contains either CFF outlines (`CFF` table present → CIDFont Type 0) or TrueType outlines (`glyf` table present → CIDFont Type 2). The handling follows the respective CIDFont rules. The PDF spec does not define a separate font subtype for OpenType; it is identified only by the stream subtype.

---
## 2. Encoding Mechanisms

### 2.1 Predefined Encodings

The PDF spec defines four named encodings for simple fonts (Annex D, PDF 1.7):

| Name | Character set | Typical use |
|------|--------------|-------------|
| `StandardEncoding` | Adobe standard Latin set (only part of its 229 glyphs is encoded) | Built-in encoding of most nonsymbolic Type 1 fonts; applies when `/Encoding` is omitted |
| `MacRomanEncoding` | Mac OS Roman 256 code points | Older Mac-generated PDFs |
| `WinAnsiEncoding` | Windows-1252 (cp1252) | Windows-generated PDFs; most common |
| `MacExpertEncoding` | Expert font character set (fractions, small caps) | Rare; expert-set fonts |

`PDFDocEncoding` is a PDF-internal encoding used for text strings outside content streams (info dictionary, bookmarks, annotations) but **not** for font encoding; it must not be confused with font encodings. It largely follows Latin-1 and fills 0x18–0x1F and 0x80–0x9F with additional characters.

`Symbol` and `ZapfDingbats` fonts use built-in symbol encodings defined in the respective AFM files. They do **not** use the standard named encodings; their code-to-glyph mapping is private and must be looked up against the font-specific tables provided in PDF Annex D (D.5 and D.6).
### 2.2 The `/Encoding` Dictionary and `/Differences` Array

When a font's `/Encoding` value is a dictionary rather than a name, the dictionary may contain:

- `/Type /Encoding` (optional)
- `/BaseEncoding`: a name designating the starting table. The spec permits `MacRomanEncoding`, `MacExpertEncoding`, and `WinAnsiEncoding`; `StandardEncoding` also appears in real files and should be accepted. If absent, the base is the font program's built-in encoding for an embedded or symbolic font, and `StandardEncoding` for an unembedded nonsymbolic font.
- `/Differences`: an array of the form `[code name code name ...]` or `[code name name name ...]`. Starting from the numeric code, each following name overrides successive slots. Example: `[32 /space /exclam /quotedbl]` overrides slots 32, 33, 34.

Encoding resolution algorithm for simple fonts:

1. Start from the base encoding table (`/BaseEncoding`, or the default described above).
2. Apply each `/Differences` entry, replacing the glyph name at the given code position.
3. Resolve each resulting glyph name to Unicode via the Adobe Glyph List (§5).
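
Step 2 of the resolution algorithm above, applying `/Differences` to a base table, can be sketched as follows. `DiffItem` and `apply_differences` are hypothetical names; the base table is assumed to be a 256-slot list of optional glyph names already filled from the base encoding.

```rust
/// One element of a parsed /Differences array (hypothetical representation).
#[derive(Debug, Clone, PartialEq)]
pub enum DiffItem {
    Code(u8),
    Name(String),
}

/// Overlay /Differences onto a 256-entry code → glyph-name table.
pub fn apply_differences(table: &mut [Option<String>], diffs: &[DiffItem]) {
    // Slot that the next name will be written to; None until a code is seen.
    let mut next: Option<usize> = None;
    for item in diffs {
        match item {
            DiffItem::Code(c) => next = Some(usize::from(*c)),
            DiffItem::Name(n) => {
                // A name that appears before any code is malformed and ignored here.
                if let Some(code) = next {
                    if code < table.len() {
                        table[code] = Some(n.clone());
                    }
                    next = Some(code + 1);
                }
            }
        }
    }
}
```

Applied to the example `[32 /space /exclam /quotedbl]`, this overwrites slots 32, 33, and 34 and leaves every other slot of the base table unchanged.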
### 2.3 Symbol and ZapfDingbats

These two standard fonts carry the `Symbolic` flag (bit 3 of `/Flags` in the font descriptor). Their encoding is defined entirely by the font program's built-in encoding; the predefined named encodings do not apply. Extraction must map the resulting glyph names through the AGL (Symbol) or through the separate ZapfDingbats glyph list that Adobe publishes alongside it, because ZapfDingbats names such as `a1` are not in the AGL. The code tables are documented in PDF Annex D.5 (Symbol) and D.6 (ZapfDingbats).

---
## 3. ToUnicode CMaps

### 3.1 CMap Stream Format

A ToUnicode CMap is a PostScript-inspired stream embedded directly in the PDF. The structure (PDF §9.10.3):

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (UCS) def
  /Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>              % every source code is 2 bytes long
endcodespacerange
4 beginbfchar
<0041> <0041>              % code 0x41 → U+0041 (A)
<00A0> <00A0>
<F001> <FB01>              % code 0xF001 → U+FB01 (fi ligature)
<F002> <FB02>              % code 0xF002 → U+FB02 (fl ligature)
endbfchar
1 beginbfrange
<0061> <007A> <0061>       % codes 0x61–0x7A → U+0061–U+007A (a–z)
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```

**`begincodespacerange` / `endcodespacerange`:** Declares the valid source-code ranges and, through the length of the hex strings, how many bytes each code occupies.

**`beginbfchar` / `endbfchar`:** Each entry is a pair `<src-code> <dst-unicode>`. The destination is UTF-16BE hex bytes; a surrogate pair encodes a codepoint above U+FFFF.

**`beginbfrange` / `endbfrange`:** Range `<lo> <hi> <start-unicode>` maps a contiguous code range to a contiguous Unicode range, incrementing the last byte of the destination for each successive code. Alternatively, `<lo> <hi> [<u1> <u2> ...]` maps each code in the range to the corresponding Unicode string in the array.

**`begincidrange` / `endcidrange`:** Used in Type 0 CMaps (not ToUnicode) to map codes to CID ranges; see §4.
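
The `bfchar` and `bfrange` sections described above can be read with a small hand-written scanner. The following is a simplified sketch, not a full CMap interpreter: all names (`Tok`, `tokenize`, `parse_to_unicode`) are hypothetical, it ignores `codespacerange`, it does not handle `<< >>` dictionaries or `%` inside string literals, and it increments the last UTF-16 unit of a range destination rather than strictly the last byte.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Tok { Hex(Vec<u8>), Open, Close, Word(String) }

/// Split CMap source into hex strings, array brackets, and bare words.
fn tokenize(src: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            // Comment: skip to end of line.
            '%' => { for c in chars.by_ref() { if c == '\n' { break; } } }
            '<' => {
                chars.next();
                let mut nibbles: Vec<u8> = Vec::new();
                for c in chars.by_ref() {
                    if c == '>' { break; }
                    if let Some(d) = c.to_digit(16) { nibbles.push(d as u8); }
                }
                if nibbles.len() % 2 == 1 { nibbles.push(0); }
                toks.push(Tok::Hex(nibbles.chunks(2).map(|p| (p[0] << 4) | p[1]).collect()));
            }
            '[' => { chars.next(); toks.push(Tok::Open); }
            ']' => { chars.next(); toks.push(Tok::Close); }
            c if c.is_whitespace() || c == '>' => { chars.next(); }
            _ => {
                let mut word = String::new();
                while let Some(&c) = chars.peek() {
                    if c.is_whitespace() || "<>[]%".contains(c) { break; }
                    word.push(c);
                    chars.next();
                }
                toks.push(Tok::Word(word));
            }
        }
    }
    toks
}

/// Big-endian integer value of a source code.
fn code_of(bytes: &[u8]) -> u32 {
    bytes.iter().fold(0, |acc, b| (acc << 8) | u32::from(*b))
}

/// UTF-16BE bytes → UTF-16 code units (an odd trailing byte is zero-padded).
fn utf16_units(bytes: &[u8]) -> Vec<u16> {
    bytes.chunks(2).map(|p| u16::from_be_bytes([p[0], p.get(1).copied().unwrap_or(0)])).collect()
}

/// Collect code → Unicode string mappings from bfchar and bfrange sections.
pub fn parse_to_unicode(src: &str) -> HashMap<u32, String> {
    let toks = tokenize(src);
    let mut map = HashMap::new();
    let mut i = 0;
    while i < toks.len() {
        let section = match &toks[i] { Tok::Word(w) => w.as_str(), _ => "" };
        i += 1;
        if section == "beginbfchar" {
            while let (Some(Tok::Hex(code)), Some(Tok::Hex(dst))) = (toks.get(i), toks.get(i + 1)) {
                map.insert(code_of(code), String::from_utf16_lossy(&utf16_units(dst)));
                i += 2;
            }
        } else if section == "beginbfrange" {
            while let (Some(Tok::Hex(lo)), Some(Tok::Hex(hi))) = (toks.get(i), toks.get(i + 1)) {
                let (lo, hi) = (code_of(lo), code_of(hi));
                match toks.get(i + 2) {
                    Some(Tok::Hex(start)) => {
                        let mut units = utf16_units(start);
                        // Cap the range length so a malformed file cannot loop for long.
                        for code in lo..=hi.min(lo.saturating_add(0xFFFF)) {
                            map.insert(code, String::from_utf16_lossy(&units));
                            if let Some(last) = units.last_mut() { *last = last.wrapping_add(1); }
                        }
                        i += 3;
                    }
                    Some(Tok::Open) => {
                        i += 3;
                        let mut code = Some(lo);
                        while let Some(Tok::Hex(dst)) = toks.get(i) {
                            if let Some(c) = code.filter(|c| *c <= hi) {
                                map.insert(c, String::from_utf16_lossy(&utf16_units(dst)));
                            }
                            code = code.and_then(|c| c.checked_add(1));
                            i += 1;
                        }
                        if toks.get(i) == Some(&Tok::Close) { i += 1; }
                    }
                    _ => break,
                }
            }
        }
    }
    map
}
```

Storing each destination as a `String` rather than a single `char` keeps multi-character destinations (for example a ligature mapped to "fi") intact.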
### 3.2 Embedding in PDF

The ToUnicode CMap appears as the value of the `/ToUnicode` key in the font dictionary (both simple and composite fonts). It is a stream object, usually with `/Filter /FlateDecode`.
### 3.3 When ToUnicode is Absent or Wrong

**Absent:** Extraction must fall back to encoding → glyph name → AGL lookup (simple fonts) or CID-to-Unicode tables derived from the predefined CMap ordering (composite fonts). Many PDFs produced by older tools (TeX-based pipelines, some CAD exporters) omit `/ToUnicode`; for these the AGL fallback is often the only option available.

**Wrong or incomplete:** Some generators emit a `/ToUnicode` CMap with missing entries or incorrect mappings. A bfchar entry with destination `<0000>` or `<FFFD>` usually signals an intentionally unmapped glyph. An implementation should not blindly trust all mappings; NUL and replacement-character destinations should be treated as absent.

**Implications for extraction:** Without a `/ToUnicode` map, ligature glyphs named `fi`, `fl`, `ffi`, etc. are decoded through the AGL to single presentation-form codepoints (U+FB01 and so on; see §5.1), while component-joined names such as `f_i` decompose to the multi-character string "fi". Either result is usually acceptable. Private Use Area (PUA) codepoints require a `/ToUnicode` map to resolve; without one the extracted text should preserve the PUA codepoint but flag it as unresolved.

---
## 4. CID-to-GID Mapping (Composite Fonts)

### 4.1 Decoding Path

For a Type 0 composite font, the decoding pipeline is:

```
content-stream bytes
  → CMap (named in /Encoding)
  → CID
  → GID (via /CIDToGIDMap or the CFF font's charset)
  → glyph outline
```

The `/Encoding` CMap converts multi-byte character codes (1–4 bytes) to CIDs. The CMap may be:

- A name referring to a predefined CMap (see §4.2).
- A stream object containing a CMap program.
### 4.2 Predefined CMaps

Adobe distributes predefined CMaps for CJK encodings (ISO 32000-1 §9.7.5.2). Key examples:

| Name | Script | Code space | Notes |
|------|--------|-----------|-------|
| `Identity-H` | any (horizontal) | 2-byte | CID = code (identity) |
| `Identity-V` | any (vertical) | 2-byte | CID = code, vertical writing |
| `90ms-RKSJ-H` | Japanese | Shift-JIS | Maps SJIS codes → Adobe-Japan1 CIDs |
| `GBK-EUC-H` | Simplified Chinese | GBK/EUC | Maps GBK → Adobe-GB1 CIDs |
| `UniGB-UTF16-H` | Simplified Chinese | UTF-16BE | Unicode input → Adobe-GB1 CIDs |
| `UniJIS-UTF16-H` | Japanese | UTF-16BE | Unicode input → Adobe-Japan1 CIDs |

For `Identity-H`/`Identity-V`, the CID equals the raw 2-byte code value, and if `/CIDToGIDMap /Identity`, the GID equals the CID. These are the simplest cases for TrueType-based CIDFonts. Because such codes are just glyph indices, they carry no character identity of their own, and a `/ToUnicode` CMap is needed to recover text.
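
For the identity CMaps, splitting a content-stream string into CIDs is a matter of reading big-endian 2-byte codes. A minimal sketch, with a hypothetical function name; dropping a trailing odd byte is a choice made here, not a rule from the spec.

```rust
/// Decode a string shown with an Identity-H or Identity-V font into CIDs.
/// Each CID is the big-endian value of one 2-byte code.
pub fn identity_cids(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2) // a trailing odd byte is silently dropped
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}
```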
### 4.3 CIDSystemInfo

Every CIDFont and its associated CMap must declare `/CIDSystemInfo`, a dictionary with `/Registry` (string), `/Ordering` (string), and `/Supplement` (integer). This identifies the CID character collection, e.g., Adobe-Japan1-6. The CIDFont and its CMap must share the same Registry and Ordering, except when the CMap is `Identity-H` or `Identity-V`. Implementations should use this to select fallback Unicode tables when `/ToUnicode` is absent (Adobe publishes CID→Unicode mappings for its standard collections).

---
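## 5. Glyph Name to Unicode (Adobe Glyph List)

### 5.1 The AGL

The Adobe Glyph List (AGL, `glyphlist.txt`) maps glyph names to Unicode scalar values. The much smaller Adobe Glyph List for New Fonts (AGLFN, `aglfn.txt`, version 1.7) lists only the names recommended for new fonts and is not sufficient for decoding existing PDFs. An implementation should embed the full AGL as a static hash table (more than 4,000 entries).

**Algorithmic fallback** (AGL specification §2): Names that a plain table lookup misses are handled by the full mapping procedure:

1. Drop everything from the first `.` onward (e.g., `A.sc` → `A`).
2. Split the remainder at underscores into components (e.g., `f_i` → `f`, `i`), map each component separately, and concatenate the results.
3. If a component is in the AGL, use its mapping.
4. Otherwise, if the component is `uni` followed by one or more groups of four uppercase hex digits, each group is one codepoint (surrogate values are not allowed): `uni0041` → U+0041.
5. Otherwise, if the component is `u` followed by four to six uppercase hex digits, parse them as a Unicode scalar: `u1F600` → U+1F600.
6. If none of the above applies, the component is unmapped.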
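
The procedure above can be sketched as follows. The function names are hypothetical, and the `agl` parameter stands in for the embedded glyph list; here it is an ordinary `HashMap` from glyph name to Unicode string so that the example stays self-contained.

```rust
use std::collections::HashMap;

/// Map one underscore-separated component to Unicode ("" when unmapped).
fn component_to_unicode(comp: &str, agl: &HashMap<&str, &str>) -> String {
    if let Some(s) = agl.get(comp) {
        return (*s).to_string();
    }
    let is_hex = |s: &str| {
        !s.is_empty() && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'A'..=b'F'))
    };
    // `uni` + groups of four uppercase hex digits, one codepoint per group.
    if let Some(hex) = comp.strip_prefix("uni") {
        if is_hex(hex) && hex.len() % 4 == 0 {
            let mut out = String::new();
            for i in (0..hex.len()).step_by(4) {
                let value = u32::from_str_radix(&hex[i..i + 4], 16).unwrap_or(0xD800);
                match char::from_u32(value) {
                    Some(c) => out.push(c),
                    None => return String::new(), // surrogate value: reject the name
                }
            }
            return out;
        }
    }
    // `u` + four to six uppercase hex digits, a single Unicode scalar.
    if let Some(hex) = comp.strip_prefix('u') {
        if is_hex(hex) && (4..=6).contains(&hex.len()) {
            if let Some(c) = u32::from_str_radix(hex, 16).ok().and_then(char::from_u32) {
                return c.to_string();
            }
        }
    }
    String::new()
}

/// Glyph name → Unicode string following the steps listed above.
pub fn glyph_name_to_unicode(name: &str, agl: &HashMap<&str, &str>) -> String {
    // Step 1: drop everything from the first period.
    let base = name.split('.').next().unwrap_or("");
    // Step 2: map each underscore-separated component and concatenate.
    base.split('_').map(|c| component_to_unicode(c, agl)).collect()
}
```

With a table containing `A`, `f`, `i`, and `fi`, the name `A.sc` resolves to "A", `f_i` to "fi", `fi` to U+FB01, and `.notdef` to the empty string.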

**Ligatures.** `fi` → U+FB01, `fl` → U+FB02, `ffi` → U+FB03, `ffl` → U+FB04. These are single AGL entries mapping to single Unicode codepoints. Many extraction engines prefer to expand ligatures to their component characters (fi → "fi") for searchability; this is a policy choice, not a spec requirement.

**`.notdef`.** The glyph named `.notdef` is the fallback glyph for unmapped codes. It has no Unicode mapping. Extractors should silently skip or emit U+FFFD for `.notdef`.

**`afii` names.** Legacy glyph names starting with `afii` (e.g., `afii57506`) appear in older Arabic and Hebrew fonts. The AGL maps these to their correct Unicode codepoints; no special handling beyond AGL lookup is needed.

---
## 6. Font Descriptors

The `/FontDescriptor` dictionary (§9.8, PDF 1.7) is referenced by the font dictionary via `/FontDescriptor`. It provides metrics and the embedded font binary.

### 6.1 Key Entries

| Key | Type | Description |
|-----|------|-------------|
| `/FontName` | name | PostScript name of the font |
| `/FontBBox` | rectangle | Glyph bounding box in glyph-space units |
| `/Flags` | integer | Bitfield describing font characteristics |
| `/ItalicAngle` | number | Dominant italic angle in degrees |
| `/Ascent` | number | Maximum ascent above baseline |
| `/Descent` | number | Maximum descent below baseline (negative) |
| `/CapHeight` | number | Height of capital letters |
| `/XHeight` | number | Height of lowercase letters without ascenders (x-height) |
| `/StemV` | number | Dominant vertical stem width |
| `/FontFile` | stream | Type 1 font program (PFB segment headers removed; portions delimited by `/Length1`–`/Length3`) |
| `/FontFile2` | stream | TrueType binary |
| `/FontFile3` | stream | CFF or OpenType font program; the stream `/Subtype` is `Type1C`, `CIDFontType0C`, or `OpenType` |
### 6.2 Flags Bitfield

The `/Flags` integer is a 32-bit field; bits are numbered from 1 (LSB). Key bits:

| Bit | Mask | Meaning |
|-----|------|---------|
| 1 | 0x0001 | FixedPitch |
| 2 | 0x0002 | Serif |
| 3 | 0x0004 | Symbolic: font uses a private encoding; standard encodings do not apply |
| 4 | 0x0008 | Script (cursive) |
| 6 | 0x0020 | Nonsymbolic: font uses a standard Latin encoding |
| 7 | 0x0040 | Italic |
| 17 | 0x10000 | AllCap |
| 18 | 0x20000 | SmallCap |
| 19 | 0x40000 | ForceBold |

The `Symbolic` (bit 3) and `Nonsymbolic` (bit 6) flags are mutually exclusive and affect encoding resolution: a symbolic font's encoding is its own built-in table; a nonsymbolic font follows the standard named encoding fallback rules.
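
Because the spec numbers bits from 1, an off-by-one in the mask is an easy mistake. The following sketch wraps the integer in a hypothetical `FontFlags` type whose accessors follow the table above; the consistency check for the two symbolic flags is an addition made for this example.

```rust
/// Wrapper around the /Flags integer of a font descriptor (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FontFlags(pub u32);

impl FontFlags {
    /// Test a flag using the spec's 1-based bit numbering (bit 1 = LSB).
    pub fn bit(self, n: u32) -> bool {
        (1..=32).contains(&n) && (self.0 & (1u32 << (n - 1))) != 0
    }
    pub fn is_fixed_pitch(self) -> bool { self.bit(1) }
    pub fn is_serif(self) -> bool { self.bit(2) }
    pub fn is_symbolic(self) -> bool { self.bit(3) }
    pub fn is_nonsymbolic(self) -> bool { self.bit(6) }
    pub fn is_italic(self) -> bool { self.bit(7) }
    /// True when exactly one of Symbolic / Nonsymbolic is set.
    pub fn symbolic_flags_consistent(self) -> bool {
        self.is_symbolic() != self.is_nonsymbolic()
    }
}
```

A descriptor with `/Flags 34` (0x22) is a nonsymbolic serif font; one with both bits 3 and 6 set fails the consistency check and probably deserves a warning.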
### 6.3 Inferring Unicode When CMap Data Is Absent

When both `/ToUnicode` and a useful `/Encoding` are missing, the following heuristics apply, in order:

1. If the embedded font is TrueType (`/FontFile2`) and the `/Flags` `Nonsymbolic` bit is set, assume `WinAnsiEncoding` and resolve codes through the font's `cmap` table (platform 3, encoding 1).
2. If the font is a name-keyed CFF font (`/FontFile3` with `/Subtype /Type1C`), the CFF `charset` table may supply glyph names; apply AGL. CID-keyed CFF (`/Subtype /CIDFontType0C`) carries CIDs instead of names and does not help here.
3. If `/FontName` identifies a known standard font (e.g., `Symbol`, `ZapfDingbats`), apply the font-specific encoding table from PDF Annex D.
4. Otherwise, emit PUA codepoints or U+FFFD and flag the text as requiring post-processing.

The font descriptor `/FontBBox` and `/Flags` provide no path to Unicode; they are useful only for layout heuristics (detecting whitespace, line boundaries) when Unicode resolution fails.

---
## Appendix: Key Dictionary Locations

```
/Font dictionary
  /Subtype          → Type1 | Type3 | TrueType | Type0
  /Encoding         → name or dictionary (simple); CMap name or stream (Type0)
  /ToUnicode        → stream (CMap program)
  /FontDescriptor   → dictionary
    /Flags          → integer (bitfield)
    /FontFile       → stream (Type 1)
    /FontFile2      → stream (TrueType)
    /FontFile3      → stream (CFF/OpenType; /Subtype in stream dict)
  /Widths           → array (simple fonts)
  /FirstChar        → integer
  /LastChar         → integer
  /DescendantFonts  → array [ CIDFont dict ] (Type0 only)

CIDFont dictionary (inside /DescendantFonts)
  /Subtype          → CIDFontType0 | CIDFontType2
  /CIDSystemInfo    → dict (/Registry /Ordering /Supplement)
  /DW               → integer (default advance width)
  /W                → array (per-CID widths)
  /CIDToGIDMap      → /Identity or stream (CIDFontType2 only)
  /FontDescriptor   → dictionary (as above)
```

---

*Spec references: ISO 32000-1:2008 §9 (Fonts), §9.7.5 (CMaps), Annex D (Character Sets and Encodings); Adobe Glyph List Specification v1.7; Adobe Type 1 Font Format (Black Book); Adobe CMap and CIDFont Files Specification v1.0.*