jedarden cf8f04e3ec docs(pdftract-26r8): finalize glyph recognition research note v1.0

- Reorganize around the four-level Unicode recovery cascade from plan
- Document all cascade levels with confidence scores:
  - Level 1: ToUnicode CMap (1.0)
  - Level 2: Encoding + AGL (0.9)
  - Level 3: Font fingerprint cache (0.85)
  - Level 4: Glyph shape recognition (0.7)
- Add shape database design (pHash algorithm, query, format)
- Document pHash collision tie-break rules (frequency-based)
- Add Type 3 font handling section
- Cross-reference Phase 2.2, 2.4, 2.5 and OQ-02

File grows from 112 to 210 lines. Covers all acceptance criteria.

Closes: pdftract-26r8

2026-05-24 02:10:06 -04:00


Glyph Recognition and Unicode Recovery in PDF Text Extraction

Overview

PDF text extraction depends on the font's encoding machinery to map raw glyph identifiers — character codes in a content stream — to Unicode codepoints. When that machinery is absent, broken, or intentionally obscured, a robust extractor must fall back through a layered series of heuristics. This document specifies the four-level Unicode recovery cascade implemented in pdftract Phase 2.2, the Type 3 fallback in Phase 2.4, and the shape database design from Phase 2.5.

The Four-Level Recovery Cascade

Each level attempts to recover Unicode from character codes, proceeding from highest-confidence to lowest-confidence. The first non-empty result wins. All levels emit a confidence score in [0, 1] and a unicode_source tag for diagnostics.

Level 1: ToUnicode CMap (confidence = 1.0)

Happy path — the font explicitly declares the mapping.

Parse the /ToUnicode stream as a PDF CMap program. The CMap syntax to implement:

beginbfchar / endbfchar: Single-character mappings. Format: <srcCode> <dstHex> where <dstHex> may be a UTF-16BE multi-codepoint sequence (e.g., fi ligature → <00660069>).
beginbfrange / endbfrange: Range mappings. Two forms:
- <lo> <hi> <dst>: Contiguous range where each code maps to an incrementing Unicode codepoint.
- <lo> <hi> [<d0> <d1> ...]: Explicit array for non-contiguous targets.
usecmap: Inherit from a named CMap (e.g., Adobe-Japan1-UCS2).
Comments (%) stripped.

Result: unicode_source = "to_unicode", confidence = 1.0.
Fall-through: If the CMap maps a code to U+FFFD or U+0000, treat as missing and proceed to Level 2.

Entry point: pdftract-core::font::resolve_to_unicode() (Phase 2.2).
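
The destination handling described above can be sketched as follows. This is a minimal illustration, not the pdftract implementation: the function names parse_utf16_units, decode_utf16be_hex, and expand_bfrange are hypothetical, and the sketch assumes the hex strings have already been extracted from between the angle brackets by a tokenizer that is not shown. For the range form it assumes that incrementing the last UTF-16 code unit is the intended behavior.

```rust
/// Parse a hex string into UTF-16 code units (four hex digits per unit).
/// Returns None for empty input, a length that is not a multiple of four,
/// or any non-hex character.
fn parse_utf16_units(hex: &str) -> Option<Vec<u16>> {
    if hex.is_empty() || hex.len() % 4 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(4)
        .map(|i| u16::from_str_radix(&hex[i..i + 4], 16).ok())
        .collect()
}

/// Decode a bfchar destination such as "00660069" (UTF-16BE) into a String.
/// Unpaired surrogates make the decode fail.
fn decode_utf16be_hex(hex: &str) -> Option<String> {
    String::from_utf16(&parse_utf16_units(hex)?).ok()
}

/// Expand the `<lo> <hi> <dst>` bfrange form: code `lo + i` maps to `dst`
/// with its last code unit incremented by `i` (assumption of this sketch).
fn expand_bfrange(lo: u32, hi: u32, dst_hex: &str) -> Option<Vec<(u32, String)>> {
    if lo > hi {
        return None;
    }
    let base = parse_utf16_units(dst_hex)?;
    let mut out = Vec::new();
    for code in lo..=hi {
        let delta = code - lo;
        if delta > u16::MAX as u32 {
            return None; // range too wide for a single code unit increment
        }
        let mut units = base.clone();
        let last = units.last_mut()?;
        *last = last.checked_add(delta as u16)?; // overflow means a malformed range
        out.push((code, String::from_utf16(&units).ok()?));
    }
    Some(out)
}
```

With these helpers, the fi ligature destination 00660069 decodes to the two-character string "fi", and a range from 0x41 to 0x43 with destination 0061 expands to the letters a, b, and c.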

Level 2: Encoding Vector + AGL (confidence = 0.9)

Second-most-common path — glyph names are available via the font's /Encoding.

Map character code → glyph name via the font's /Encoding dictionary:

Named encodings (hardcoded tables):
- WinAnsiEncoding — Windows ANSI (superset of ISO-8859-1)
- MacRomanEncoding — classic Mac OS Roman
- MacExpertEncoding — expert set for old-style figures
- StandardEncoding — PDF standard encoding
- SymbolEncoding — Symbol font (see note below)
- ZapfDingbatsEncoding — ZapfDingbats font (see note below)
/Differences array: Sparse overlay on base encoding. Format: [n /GlyphName1 /GlyphName2 ...] where n is the starting code position.
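
A minimal sketch of how a /Differences array might be overlaid on a base encoding, assuming the array has already been parsed into a flat list of integers and names. The DiffItem type and the apply_differences function are hypothetical names introduced for illustration only.

```rust
/// One parsed element of a /Differences array (hypothetical representation).
#[derive(Clone, Copy, Debug)]
enum DiffItem<'a> {
    /// An integer: the code position for the next glyph name.
    Code(u8),
    /// A glyph name, assigned to the current position, which then advances.
    Name(&'a str),
}

/// Overlay `diffs` on a 256-entry code -> glyph-name table.
fn apply_differences<'a>(base: &mut [Option<&'a str>; 256], diffs: &[DiffItem<'a>]) {
    // u16 so that "one past code 255" can be represented without overflow.
    let mut next: Option<u16> = None;
    for item in diffs {
        match *item {
            DiffItem::Code(c) => next = Some(c as u16),
            DiffItem::Name(name) => {
                // Names before the first integer, or past code 255, are ignored.
                if let Some(code) = next {
                    if code < 256 {
                        base[code as usize] = Some(name);
                        next = Some(code + 1);
                    }
                }
            }
        }
    }
}
```

Entries of the base table that the array does not mention stay as they were, which is what makes the overlay sparse.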

Map glyph name → Unicode via the Adobe Glyph List (AGL) algorithm, as defined in the AGL Specification (version 1.7, see References):

Direct AGL table lookup (~4,400 entries, compiled as a static phf::Map).
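If name is uni followed by one or more groups of exactly four uppercase hex digits (uniXXXX for a single codepoint, uniXXXXYYYY for a sequence such as a ligature), return the corresponding codepoint sequence, provided no group falls in the surrogate range U+D800 to U+DFFF.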
If name is uXXXXXX (four to six uppercase hex digits), return U+XXXXXX, provided the codepoint is valid (not surrogate, not above U+10FFFF).
If name contains a period (.), strip suffix and retry on base name.
Otherwise, unrecognized → fall through to Level 3.
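
The lookup order above can be sketched as a single function. This is an illustration under stated assumptions: agl_lookup is a hypothetical three-entry stand-in for the full static table, and glyph_name_to_unicode is a hypothetical name rather than the resolve_agl() entry point. Returning None corresponds to falling through to Level 3.

```rust
/// Stand-in for the static AGL table (tiny hypothetical subset).
fn agl_lookup(name: &str) -> Option<&'static str> {
    match name {
        "A" => Some("A"),
        "space" => Some(" "),
        "fi" => Some("\u{FB01}"),
        _ => None,
    }
}

/// True if `s` consists only of digits and uppercase A-F.
fn is_upper_hex(s: &str) -> bool {
    s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'A'..=b'F'))
}

/// Resolve a glyph name to Unicode text following the steps listed above.
fn glyph_name_to_unicode(name: &str) -> Option<String> {
    // Step 1: direct table lookup.
    if let Some(s) = agl_lookup(name) {
        return Some(s.to_string());
    }
    // Step 2: "uni" + groups of four uppercase hex digits.
    if let Some(digits) = name.strip_prefix("uni") {
        if !digits.is_empty() && digits.len() % 4 == 0 && is_upper_hex(digits) {
            let mut out = String::new();
            for i in (0..digits.len()).step_by(4) {
                let cp = u32::from_str_radix(&digits[i..i + 4], 16).ok()?;
                out.push(char::from_u32(cp)?); // from_u32 rejects surrogates
            }
            return Some(out);
        }
    }
    // Step 3: "u" + four to six uppercase hex digits.
    if let Some(digits) = name.strip_prefix('u') {
        if (4..=6).contains(&digits.len()) && is_upper_hex(digits) {
            let cp = u32::from_str_radix(digits, 16).ok()?;
            // from_u32 rejects surrogates and values above U+10FFFF.
            return char::from_u32(cp).map(String::from);
        }
    }
    // Step 4: strip everything from the first period and retry on the base name.
    if let Some((base, _suffix)) = name.split_once('.') {
        if !base.is_empty() {
            return glyph_name_to_unicode(base);
        }
    }
    None
}
```

For example, under this sketch uni00660069 resolves to the two-character string "fi", A.swash resolves to "A" via suffix stripping, and an arbitrary name such as g123 yields None.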

Special cases: ZapfDingbats and Symbol have their own mappings defined in ISO 32000-2 Section 9.10.2. Do NOT apply AGL to these fonts.

Result: unicode_source = "agl", confidence = 0.9.
Entry point: pdftract-core::font::resolve_agl() (Phase 2.2).

Level 3: Font Fingerprint Cache (confidence = 0.85)

Known-font database — identify the embedded font and use its standard mapping.

Hash the embedded font program: SHA-256 of the font program bytes (the contents of the /FontFile, /FontFile2, or /FontFile3 stream after stream filters have been decoded). Look up in a bundled database of known font checksums → per-glyph Unicode mapping tables.

Database spec:

Compile-time phf::Map<[u8; 32], &'static [(u16, char)]>
Key: 32-byte SHA-256 of raw font program bytes
Value: Slice of (glyph_id, unicode_char) pairs covering every mapped glyph
Generated from build/font-fingerprints.json via build.rs using phf_codegen
Binary footprint: ~500 KB (approved allocation within 4 MB budget)

Curation pipeline: See OQ-02 (Open Questions) and docs/research/font-fingerprinting.md. Initially populated with ~200 common commercial fonts from Adobe and Google Fonts metric data.

Guard: If the font has no embedded program (Standard-14 fonts or no /FontFile*), skip Level 3 and proceed to Level 4.

Result: unicode_source = "fingerprint", confidence = 0.85.
Entry point: pdftract-core::font::resolve_fingerprint() (Phase 2.2).
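
A minimal sketch of the lookup step, assuming the SHA-256 digest has already been computed elsewhere (hashing is not shown, since it needs a crate outside the standard library). To stay self-contained, the sketch swaps the phf::Map for a sorted static slice with binary search; the FINGERPRINTS table, its two placeholder digests, and the function name resolve_by_fingerprint are all hypothetical.

```rust
/// Per-font table of (glyph_id, unicode_char) pairs.
type GlyphMap = &'static [(u16, char)];

/// Hypothetical stand-in for the bundled database, sorted by digest.
/// The digests here are placeholders, not hashes of real fonts.
static FINGERPRINTS: &[([u8; 32], GlyphMap)] = &[
    ([0x11; 32], &[(1, 'A'), (2, 'B')]),
    ([0x22; 32], &[(7, 'x')]),
];

/// Look up `glyph_id` in the table for the font whose program hashes to `digest`.
/// Returns None if the font is unknown or the glyph is not covered.
fn resolve_by_fingerprint(digest: &[u8; 32], glyph_id: u16) -> Option<char> {
    // Exact-match search on the 32-byte key.
    let idx = FINGERPRINTS
        .binary_search_by(|(key, _)| key.cmp(digest))
        .ok()?;
    let map = FINGERPRINTS[idx].1;
    map.iter()
        .find(|(gid, _)| *gid == glyph_id)
        .map(|&(_, ch)| ch)
}
```

A None result here corresponds to falling through to Level 4.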

Level 4: Glyph Shape Recognition (confidence = 0.7)

Fallback — match the rendered glyph shape against a pre-computed database.

Render the glyph to a 32×32 grayscale bitmap and compute a perceptual hash. Look up in a bundled shape→Unicode database.

Perceptual hash algorithm (pHash):

Rasterize glyph to 32×32 grayscale bitmap:
- TrueType/OpenType: use fontdue rasterizer
- Type 3 glyphs: use Type 3 content stream renderer (Phase 2.4)
Apply 32×32 Discrete Cosine Transform (DCT)
Retain the top-left 8×8 block of low-frequency coefficients (64 values; this block includes the DC term at position (0,0), so it is not purely AC)
Threshold against median of those 64 values → 64-bit integer hash

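Because every glyph is normalized to the same 32×32 raster and only low-frequency coefficients are kept, the hash is scale-invariant and robust to minor rendering differences.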
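
The hashing steps above can be sketched as follows, starting from an already rasterized 32×32 bitmap (rasterization is not shown). This is an illustration, not the pdftract implementation: the function name phash and the bit layout (bit v*8+u for coefficient (u, v)) are assumptions, and the sketch uses f64 arithmetic with an orthonormal DCT-II for clarity. A build pipeline that must be bit-for-bit reproducible across platforms may prefer fixed-point arithmetic.

```rust
use std::f64::consts::PI;

const N: usize = 32; // bitmap side length
const K: usize = 8; // side length of the retained low-frequency block

/// Compute a 64-bit perceptual hash of a 32x32 grayscale bitmap.
/// Bit (v * 8 + u) is set when DCT coefficient (u, v) exceeds the median.
fn phash(pixels: &[[u8; N]; N]) -> u64 {
    // basis[k][x] = alpha(k) * cos((2x + 1) * k * pi / (2N)), orthonormal DCT-II.
    let mut basis = [[0.0f64; N]; K];
    for k in 0..K {
        let alpha = if k == 0 {
            (1.0 / N as f64).sqrt()
        } else {
            (2.0 / N as f64).sqrt()
        };
        for x in 0..N {
            let angle = (2 * x + 1) as f64 * k as f64 * PI / (2.0 * N as f64);
            basis[k][x] = alpha * angle.cos();
        }
    }

    // Only the 8x8 low-frequency block is needed, so compute just those terms.
    let mut coeffs = [0.0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    sum += pixels[y][x] as f64 * basis[u][x] * basis[v][y];
                }
            }
            coeffs[v * K + u] = sum;
        }
    }

    // Median of 64 values: mean of the two middle elements after sorting.
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[31] + sorted[32]) / 2.0;

    let mut hash = 0u64;
    for (i, c) in coeffs.iter().enumerate() {
        if *c > median {
            hash |= 1u64 << i;
        }
    }
    hash
}
```

Because the threshold is the median and the comparison is strict, at most 32 of the 64 bits can be set.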

Database format:

Compile-time &'static [(u64, char)] — sorted slice of (pHash, char) pairs
Generated from build/glyph-shapes.json via build.rs (emitted as static array)
NOT phf::Map because we need nearest-neighbor scan, not exact lookup
Binary footprint: ~300 KB for ~5,000 common glyphs (Latin, Greek, Cyrillic, extended Latin)

Query algorithm:

Linear scan over all entries computing (query_hash XOR entry_hash).count_ones()
Collect entries with Hamming distance ≤ 8
Select entry with smallest distance
Tie-break: If several entries share the smallest distance, apply the tie-break rules below, ending with the Unicode frequency rank from the companion table (&'static [(u64, u32)] sorted by pHash)
If no entry within threshold, fall through to failure

Performance: 5,000 entries × ~8 ns per XOR+popcount ≈ 40 µs worst-case scan.
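
A minimal sketch of the query, using only the frequency rank as tie-breaker. The function name nearest_glyph and the MAX_DISTANCE constant are hypothetical, and the sketch assumes that a lower rank means a more frequent character and that a hash missing from the companion table ranks last.

```rust
/// Maximum Hamming distance accepted as a match (threshold from the text above).
const MAX_DISTANCE: u32 = 8;

/// Linear nearest-neighbor scan over (pHash, char) pairs.
/// `freq_rank` is the companion table of (pHash, rank), sorted by pHash.
fn nearest_glyph(db: &[(u64, char)], freq_rank: &[(u64, u32)], query: u64) -> Option<char> {
    // Best candidate so far as (distance, rank, char).
    let mut best: Option<(u32, u32, char)> = None;
    for &(hash, ch) in db {
        let dist = (query ^ hash).count_ones();
        if dist > MAX_DISTANCE {
            continue;
        }
        // Missing rank entries sort last (assumption of this sketch).
        let rank = freq_rank
            .binary_search_by_key(&hash, |&(h, _)| h)
            .map(|i| freq_rank[i].1)
            .unwrap_or(u32::MAX);
        let better = match best {
            None => true,
            Some((best_dist, best_rank, _)) => (dist, rank) < (best_dist, best_rank),
        };
        if better {
            best = Some((dist, rank, ch));
        }
    }
    best.map(|(_, _, ch)| ch)
}
```

A None result means no entry lies within the threshold, which corresponds to falling through to the failure mode.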

Tie-break rules for visually similar glyphs:

When pHash distance is tied or ambiguous (e.g., l vs I vs |, O vs 0):

Prefer digits if previous span resolved to digits (monospaced font context)
Prefer lowercase letters if surrounding text is mostly lowercase
Use Unicode frequency rank as final tie-breaker (common chars preferred)
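
One way these rules might be combined is sketched below. The Context enum and the break_tie function are hypothetical; in particular, the sketch assumes the caller has already classified the surrounding text into a single context and that a lower frequency rank means a more common character.

```rust
/// Coarse description of the surrounding text (hypothetical classification).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Context {
    /// The previous span resolved to digits.
    AfterDigits,
    /// The surrounding text is mostly lowercase.
    MostlyLowercase,
    /// No usable context.
    Unknown,
}

/// Choose among candidates that are tied on Hamming distance.
/// Each candidate is (char, frequency_rank).
fn break_tie(candidates: &[(char, u32)], ctx: Context) -> Option<char> {
    let preferred = |c: char| match ctx {
        Context::AfterDigits => c.is_ascii_digit(),
        Context::MostlyLowercase => c.is_lowercase(),
        Context::Unknown => false,
    };
    candidates
        .iter()
        // false sorts before true, so context-preferred candidates come first;
        // the frequency rank decides among the remaining ties.
        .min_by_key(|&&(c, rank)| (!preferred(c), rank))
        .map(|&(c, _)| c)
}
```

With candidates O (rank 10) and 0 (rank 20), this sketch picks 0 after a digit span and O when no context is available.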

Result: unicode_source = "shape_match", confidence = 0.7.
Entry point: pdftract-core::font::resolve_shape_match() (Phase 2.2) and Type 3 fallback (Phase 2.4).

Confidence Scoring Formula

Each level emits a confidence score:

Level	Source	Confidence
1	ToUnicode CMap	1.0
2	AGL lookup	0.9
3	Font fingerprint	0.85
4	Shape match	0.7

Cascade behavior: The first non-empty result wins. Confidence is emitted in diagnostic metadata but does NOT override cascade priority.
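
The first-non-empty-wins behavior can be sketched as a loop over an ordered list of resolvers. Everything here is illustrative: the Resolved struct, the Level tuple type, run_cascade, and the two stub resolvers are hypothetical names, and the real levels take a font as well as a character code.

```rust
/// Outcome of the cascade for one character code.
#[derive(Debug, Clone, PartialEq)]
struct Resolved {
    text: String,
    unicode_source: &'static str,
    confidence: f32,
}

/// One cascade level: (unicode_source tag, confidence, resolver).
type Level = (&'static str, f32, fn(u32) -> Option<String>);

/// Try each level in order; the first non-empty result wins.
/// Confidence is attached to the result but never reorders the levels.
fn run_cascade(levels: &[Level], code: u32) -> Resolved {
    for &(source, confidence, resolve) in levels {
        if let Some(text) = resolve(code) {
            if !text.is_empty() {
                return Resolved { text, unicode_source: source, confidence };
            }
        }
    }
    // All levels failed: emit the replacement character.
    Resolved {
        text: "\u{FFFD}".to_string(),
        unicode_source: "unknown",
        confidence: 0.0,
    }
}

/// Stub standing in for Level 1: only knows code 0x41.
fn stub_to_unicode(code: u32) -> Option<String> {
    if code == 0x41 { Some("A".to_string()) } else { None }
}

/// Stub standing in for Level 2: only knows code 0x42.
fn stub_agl(code: u32) -> Option<String> {
    if code == 0x42 { Some("B".to_string()) } else { None }
}

/// Two-level demo cascade built from the stubs above.
fn demo_levels() -> [Level; 2] {
    [("to_unicode", 1.0, stub_to_unicode), ("agl", 0.9, stub_agl)]
}
```

Keeping the levels in a plain ordered slice makes the priority explicit: a lower-confidence level is only consulted when every earlier level returned nothing.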

Post-cascade context rescoring (optional): For results below 0.8 confidence, apply character n-gram or dictionary-based validation using surrounding high-confidence characters. This can downgrade ambiguous matches to U+FFFD with a SHAPE_AMBIGUOUS diagnostic.

Type 3 Font Handling

Type 3 fonts define each glyph as a PDF content stream in /CharProcs. The cascade applies with Level 3 skipped, because a Type 3 font has no embedded font program (/FontFile*) to fingerprint:

Check /ToUnicode first (Level 1)
If absent, attempt /Encoding glyph name lookup (Level 2)
If glyph name is non-standard (arbitrary user name), rasterize the content stream to 32×32 bitmap and apply shape recognition (Level 4)

Type 3 rasterization (Phase 2.4):

Execute the glyph's content stream as a constrained sub-content-stream
Track graphics state (CTM, fill/stroke state) using the Phase 3 graphics state machine
Record stroke/fill operations to a 32×32 grayscale bitmap
Support operators: m l c v y h re (path construction), S s f F f* B B* b b* (path painting), q Q cm (graphics state), Do (form XObject, recursive)
Stack depth limit: 20 levels (same as form XObject limit)

Entry point: pdftract-core::font::rasterize_type3_glyph() (Phase 2.4).

Database Licensing and Provenance

Font fingerprint database (build/font-fingerprints.json):

Source: Adobe's public font databases, Google Fonts cmap metric exports
License: Fonts used are SIL Open Font License or similar permissive licenses
Curation: Maintainer-owned; see OQ-02 and docs/research/font-fingerprinting.md

Glyph shape database (build/glyph-shapes.json):

Source: Glyph bitmaps rendered from open-source fonts (Google Fonts corpus, SIL OFL fonts)
License: SIL Open Font License fonts are free of Unicode licensing entanglements
Curation: Offline hash pipeline; JSON is the authoritative artifact

Reproducibility: The same glyph shape MUST always hash to the same bucket and yield the same Unicode value. No float nondeterminism, no random seeds.

Failure Mode

If all four levels fail:

Emit U+FFFD (REPLACEMENT CHARACTER)
Set unicode_source = "unknown", confidence = 0.0
Log GLYPH_UNMAPPED diagnostic with font ID and character code

Cross-References

Phase 2.2 (font recognition coordinator): Entry point for the four-level cascade
Phase 2.4 (Type 3 charstring renderer): Type 3 fallback to Level 4
Phase 2.5 (shape database bundling): Build artifact generation
OQ-02 (plan line 513): Font-fingerprint database curation pipeline ownership

References

ISO 32000-2:2020 (PDF 2.0), Section 9 (Text) and Annex D (Character Sets and Encodings)
Adobe Glyph List Specification, version 1.7 — adobe-type-tools/agl-specification
Adobe Glyph List for New Fonts (aglfn) — adobe-type-tools/agl-aglfn
Adobe Type 1 Font Format specification (Black Book), Chapter 6 (Charstrings)
Apple TrueType Reference Manual — glyf table specification
OpenType Specification 1.9, Microsoft Typography (CFF / CFF2 charstring formats)
Unicode Standard Annex #29 (Unicode Text Segmentation)
pdftract plan: Phase 2.2 (line 1340), Phase 2.4 (line 1416), Phase 2.5 (line 1434)
