Glyph Recognition and Unicode Recovery in PDF Text Extraction
Overview
PDF text extraction depends on the font's encoding machinery to map raw glyph identifiers — character codes in a content stream — to Unicode codepoints. When that machinery is absent, broken, or intentionally obscured, a robust extractor must fall back through a layered series of heuristics. This document specifies the four-level Unicode recovery cascade implemented in pdftract Phase 2.2, the Type 3 fallback in Phase 2.4, and the shape database design from Phase 2.5.
The Four-Level Recovery Cascade
Each level attempts to recover Unicode from character codes, proceeding from highest-confidence to lowest-confidence. The first non-empty result wins. All levels emit a confidence score in [0, 1] and a unicode_source tag for diagnostics.
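The first-non-empty-wins rule can be sketched as a small driver. This is a minimal illustration, not the pdftract implementation: the `Recovered` struct, the `Resolver` alias, and `run_cascade` are hypothetical names, and each level is reduced to a plain function from character code to an optional string.

```rust
/// Outcome of one recovery attempt (hypothetical type, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Recovered {
    pub text: String,
    pub unicode_source: &'static str,
    pub confidence: f32,
}

/// A level returns `None` (or an empty string) when it cannot map the code.
pub type Resolver = fn(code: u32) -> Option<String>;

/// Try each `(source tag, confidence, resolver)` in priority order.
/// The first non-empty result wins, whatever confidence is attached to it.
pub fn run_cascade(code: u32, levels: &[(&'static str, f32, Resolver)]) -> Recovered {
    for &(source, confidence, resolve) in levels {
        if let Some(text) = resolve(code) {
            if !text.is_empty() {
                return Recovered { text, unicode_source: source, confidence };
            }
        }
    }
    // Every level failed: emit U+FFFD with zero confidence.
    Recovered {
        text: "\u{FFFD}".to_string(),
        unicode_source: "unknown",
        confidence: 0.0,
    }
}
```

Keeping the levels in an ordered slice makes the priority explicit: confidence is carried along as metadata and never consulted when choosing a winner.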
Level 1: ToUnicode CMap (confidence = 1.0)
Happy path — the font explicitly declares the mapping.
Parse the /ToUnicode stream as a PDF CMap program. The CMap syntax to implement:
- beginbfchar/endbfchar: Single-character mappings. Format: <srcCode> <dstHex>, where <dstHex> may be a UTF-16BE multi-codepoint sequence (e.g., the fi ligature → <00660069>).
- beginbfrange/endbfrange: Range mappings. Two forms:
  - <lo> <hi> <dst>: Contiguous range where each code maps to an incrementing Unicode codepoint.
  - <lo> <hi> [<d0> <d1> ...]: Explicit array for non-contiguous targets.
- usecmap: Inherit from a named CMap (e.g., Adobe-Japan1-UCS2).
- Comments (%) stripped.
Result: unicode_source = "to_unicode", confidence = 1.0.
Fall-through: If the CMap maps a code to U+FFFD or U+0000, treat as missing and proceed to Level 2.
Entry point: pdftract-core::font::resolve_to_unicode() (Phase 2.2).
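The destination decoding and the contiguous bfrange form could look roughly like the sketch below. It assumes the CMap has already been tokenized and that hex strings arrive without their angle brackets; `decode_dst` and `expand_bfrange` are hypothetical helper names, not the actual pdftract API. Destinations are assumed to be whole UTF-16 code units (a multiple of four hex digits).

```rust
/// Split a hex string such as "00660069" into UTF-16BE code units.
fn parse_utf16be_units(hex: &str) -> Option<Vec<u16>> {
    if hex.is_empty() || hex.len() % 4 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(4)
        .map(|i| u16::from_str_radix(&hex[i..i + 4], 16).ok())
        .collect()
}

/// Decode a bfchar destination. U+FFFD and U+0000 are reported as missing
/// so the caller can fall through to Level 2.
pub fn decode_dst(hex: &str) -> Option<String> {
    let s = String::from_utf16(&parse_utf16be_units(hex)?).ok()?;
    if s == "\u{FFFD}" || s == "\u{0}" { None } else { Some(s) }
}

/// Expand `<lo> <hi> <dst>`: code `lo + k` maps to `dst` with its final
/// UTF-16 code unit advanced by `k`. Entries that overflow map to `None`.
pub fn expand_bfrange(lo: u32, hi: u32, dst_hex: &str) -> Vec<(u32, Option<String>)> {
    let base = match parse_utf16be_units(dst_hex) {
        Some(units) if hi >= lo => units,
        _ => return Vec::new(),
    };
    let mut out = Vec::new();
    for code in lo..=hi {
        let mut units = base.clone();
        let last = units.len() - 1;
        let bumped = u16::try_from(code - lo)
            .ok()
            .and_then(|k| units[last].checked_add(k));
        let text = match bumped {
            Some(u) => {
                units[last] = u;
                String::from_utf16(&units).ok()
            }
            None => None,
        };
        out.push((code, text));
    }
    out
}
```

For example, `decode_dst("00660069")` yields the two-character string "fi", while `decode_dst("FFFD")` yields `None` and triggers the fall-through.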
Level 2: Encoding Vector + AGL (confidence = 0.9)
Second-most-common path — glyph names are available via the font's /Encoding.
Map character code → glyph name via the font's /Encoding dictionary:
- Named encodings (hardcoded tables):
  - WinAnsiEncoding: Windows ANSI (superset of ISO-8859-1)
  - MacRomanEncoding: classic Mac OS Roman
  - MacExpertEncoding: expert set for old-style figures
  - StandardEncoding: PDF standard encoding
  - SymbolEncoding: Symbol font (see note below)
  - ZapfDingbatsEncoding: ZapfDingbats font (see note below)
- /Differences array: Sparse overlay on base encoding. Format: [n /GlyphName1 /GlyphName2 ...], where n is the starting code position.
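The /Differences overlay can be sketched as follows, assuming the array has already been parsed into integers and names. `DiffItem` and `apply_differences` are hypothetical names chosen for this illustration, and the base encoding is modeled as a 256-entry table of optional glyph names.

```rust
/// One element of a parsed /Differences array (hypothetical representation).
pub enum DiffItem<'a> {
    /// An integer: the code position for the next glyph name.
    Code(u32),
    /// A glyph name, assigned to the current code position.
    Name(&'a str),
}

/// Overlay /Differences onto a base encoding (code -> glyph name).
/// Each name is stored at the current position, which then advances by one.
pub fn apply_differences<'a>(base: &mut [Option<&'a str>; 256], diffs: &[DiffItem<'a>]) {
    let mut next: Option<u32> = None;
    for item in diffs {
        match *item {
            DiffItem::Code(n) => next = Some(n),
            DiffItem::Name(name) => {
                // A name before any starting code is malformed; skip it.
                if let Some(code) = next {
                    // Codes outside 0..=255 are ignored rather than panicking.
                    if let Some(slot) = base.get_mut(code as usize) {
                        *slot = Some(name);
                    }
                    next = Some(code.saturating_add(1));
                }
            }
        }
    }
}
```

Codes not mentioned in the array keep whatever the base encoding assigned, which is what makes the overlay sparse.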
Map glyph name → Unicode via the Adobe Glyph List (AGL 1.4) algorithm:
- Direct AGL table lookup (~4,400 entries, compiled as a static phf::Map).
- If name is uni followed by one or more groups of exactly four uppercase hex digits (uniXXXX, uniXXXXYYYY, ...), return one codepoint per group; several groups encode a sequence (ligatures). Surrogate values are rejected.
- If name is uXXXXXX (four to six uppercase hex digits), return U+XXXXXX, provided the codepoint is valid (not surrogate, not above U+10FFFF).
- If name contains a period (.), strip suffix and retry on base name.
- Otherwise, unrecognized → fall through to Level 3.
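The name-to-Unicode steps above can be sketched as a single function. This is an illustration under stated assumptions: a `HashMap` stands in for the static phf::Map, `glyph_name_to_unicode` is a hypothetical name, the steps run in the order listed here, and underscore-joined ligature components (e.g. f_i) are not handled.

```rust
use std::collections::HashMap;

fn is_upper_hex(b: u8) -> bool {
    b.is_ascii_digit() || (b'A'..=b'F').contains(&b)
}

/// `uni` followed by one or more groups of four uppercase hex digits.
fn parse_uni(name: &str) -> Option<String> {
    let hex = name.strip_prefix("uni")?;
    if hex.is_empty() || hex.len() % 4 != 0 || !hex.bytes().all(is_upper_hex) {
        return None;
    }
    // `char::from_u32` rejects surrogates, so a bad group fails the whole name.
    (0..hex.len())
        .step_by(4)
        .map(|i| u32::from_str_radix(&hex[i..i + 4], 16).ok().and_then(char::from_u32))
        .collect()
}

/// `u` followed by four to six uppercase hex digits.
fn parse_u(name: &str) -> Option<String> {
    let hex = name.strip_prefix('u')?;
    if !(4..=6).contains(&hex.len()) || !hex.bytes().all(is_upper_hex) {
        return None;
    }
    // `char::from_u32` also rejects values above U+10FFFF.
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32).map(String::from)
}

/// Resolve a glyph name; `None` means fall through to Level 3.
pub fn glyph_name_to_unicode(name: &str, agl: &HashMap<&str, &str>) -> Option<String> {
    if let Some(s) = agl.get(name) {
        return Some(s.to_string());
    }
    if let Some(s) = parse_uni(name).or_else(|| parse_u(name)) {
        return Some(s);
    }
    // Strip the suffix at the first period and retry on the base name.
    match name.split_once('.') {
        Some((base, _)) if !base.is_empty() => glyph_name_to_unicode(base, agl),
        _ => None,
    }
}
```

With a table containing only A, the name A.swash resolves to "A", uni00660069 resolves to "fi", and a subset-style name such as g123 falls through.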
Special cases: ZapfDingbats and Symbol have their own mappings defined in ISO 32000-2 Section 9.10.2. Do NOT apply AGL to these fonts.
Result: unicode_source = "agl", confidence = 0.9.
Entry point: pdftract-core::font::resolve_agl() (Phase 2.2).
Level 3: Font Fingerprint Cache (confidence = 0.85)
Known-font database — identify the embedded font and use its standard mapping.
Hash the embedded font program: compute SHA-256 over the font program bytes, that is, the /FontFile, /FontFile2, or /FontFile3 stream after its stream filters have been decoded. Look up the digest in a bundled database of known font checksums → per-glyph Unicode mapping tables.
Database spec:
- Compile-time phf::Map<[u8; 32], &'static [(u16, char)]>
- Key: 32-byte SHA-256 of raw font program bytes
- Value: Slice of (glyph_id, unicode_char) pairs covering every mapped glyph
- Generated from build/font-fingerprints.json via build.rs using phf_codegen
- Binary footprint: ~500 KB (approved allocation within 4 MB budget)
Curation pipeline: See OQ-02 (Open Questions) and docs/research/font-fingerprinting.md. Initially populated with ~200 common commercial fonts from Adobe and Google Fonts metric data.
Guard: If the font has no embedded program (Standard-14 fonts or no /FontFile*), skip Level 3 and proceed to Level 4.
Result: unicode_source = "fingerprint", confidence = 0.85.
Entry point: pdftract-core::font::resolve_fingerprint() (Phase 2.2).
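A lookup against such a database might look like the sketch below. Several things here are assumptions rather than documented behavior: a `HashMap` stands in for the compile-time phf::Map, the SHA-256 digest is computed elsewhere and passed in, each per-font table is assumed to be sorted by glyph ID so it can be binary-searched, and `resolve_by_fingerprint` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Per-font table of (glyph_id, unicode_char) pairs, assumed sorted by glyph_id.
pub type GlyphTable = &'static [(u16, char)];

/// Look up one glyph in the fingerprint database.
/// `digest` is `None` when the font has no embedded /FontFile* program,
/// in which case Level 3 is skipped.
pub fn resolve_by_fingerprint(
    digest: Option<[u8; 32]>,
    glyph_id: u16,
    db: &HashMap<[u8; 32], GlyphTable>,
) -> Option<char> {
    let table = db.get(&digest?)?;
    table
        .binary_search_by_key(&glyph_id, |&(gid, _)| gid)
        .ok()
        .map(|i| table[i].1)
}
```

Returning `None` for an unknown digest, a missing font program, or an unmapped glyph keeps all three cases on the same fall-through path to Level 4.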
Level 4: Glyph Shape Recognition (confidence = 0.7)
Fallback — match the rendered glyph shape against a pre-computed database.
Render the glyph to a 32×32 grayscale bitmap and compute a perceptual hash. Look up in a bundled shape→Unicode database.
Perceptual hash algorithm (pHash):
- Rasterize glyph to 32×32 grayscale bitmap:
  - TrueType/OpenType: use fontdue rasterizer
  - Type 3 glyphs: use Type 3 content stream renderer (Phase 2.4)
- Apply 32×32 Discrete Cosine Transform (DCT)
- Retain the top-left 8×8 block of low-frequency coefficients (64 values, including the DC term)
- Threshold against median of those 64 values → 64-bit integer hash
This yields a scale-invariant hash robust to minor rendering differences.
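The hashing steps can be sketched as below, starting from an already rasterized bitmap. This is a minimal illustration with assumptions: it uses an orthonormal DCT-II in f64, computes only the 8×8 block that is retained, takes the median as the mean of the two middle values, and sets a bit when a coefficient is strictly greater than the median. `phash` and `hamming` are hypothetical names. A build that must be bit-identical across platforms would likely replace the runtime `cos` calls with a precomputed table or fixed-point arithmetic.

```rust
use std::f64::consts::PI;

const N: usize = 32; // bitmap side length
const K: usize = 8; // side length of the retained low-frequency block

/// Orthonormal DCT-II scale factor for frequency index `k`.
fn alpha(k: usize) -> f64 {
    if k == 0 { (1.0 / N as f64).sqrt() } else { (2.0 / N as f64).sqrt() }
}

/// 64-bit perceptual hash of a row-major 32×32 grayscale bitmap.
pub fn phash(bitmap: &[u8; N * N]) -> u64 {
    // basis[u][x] = cos((2x + 1) * u * PI / (2N)), for the K lowest frequencies.
    let mut basis = [[0.0f64; N]; K];
    for u in 0..K {
        for x in 0..N {
            basis[u][x] = (((2 * x + 1) * u) as f64 * PI / (2.0 * N as f64)).cos();
        }
    }
    // Compute only the top-left K×K coefficients of the 2-D DCT.
    let mut coeffs = [0.0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    sum += bitmap[y * N + x] as f64 * basis[u][x] * basis[v][y];
                }
            }
            coeffs[v * K + u] = alpha(u) * alpha(v) * sum;
        }
    }
    // Median of the 64 values: mean of the two middle elements.
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[31] + sorted[32]) / 2.0;
    // Bit i is set when coefficient i is strictly above the median.
    coeffs
        .iter()
        .enumerate()
        .fold(0u64, |h, (i, &c)| if c > median { h | (1u64 << i) } else { h })
}

/// Hamming distance between two hashes.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Because every glyph is normalized to the same 32×32 grid before hashing, two renderings of the same outline at different sizes are expected to land within a small Hamming distance of each other.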
Database format:
- Compile-time &'static [(u64, char)]: sorted slice of (pHash, char) pairs
- Generated from build/glyph-shapes.json via build.rs (emitted as static array)
- NOT phf::Map because we need nearest-neighbor scan, not exact lookup
- Binary footprint: ~300 KB for ~5,000 common glyphs (Latin, Greek, Cyrillic, extended Latin)
Query algorithm:
- Linear scan over all entries computing (query_hash XOR entry_hash).count_ones()
- Collect entries with Hamming distance ≤ 8
- Select entry with smallest distance
- Tie-break: Use Unicode frequency rank from companion table (&'static [(u64, u32)] sorted by pHash)
- If no entry within threshold, fall through to failure
Performance: 5,000 entries × ~8 ns per XOR+popcount ≈ 40 µs worst-case scan.
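The scan can be sketched in a few lines. Two details are assumptions of this illustration and not stated above: the companion table is taken to be row-parallel to the shape table (entry i of one describes entry i of the other), and a lower rank value is taken to mean a more common character. `match_shape` and `MAX_DISTANCE` are hypothetical names.

```rust
/// Maximum Hamming distance accepted as a match.
pub const MAX_DISTANCE: u32 = 8;

/// Nearest-neighbor lookup over the shape table.
/// `shapes` holds (pHash, char); `ranks` holds (pHash, frequency rank).
pub fn match_shape(query: u64, shapes: &[(u64, char)], ranks: &[(u64, u32)]) -> Option<char> {
    shapes
        .iter()
        .zip(ranks.iter())
        // XOR + popcount gives the Hamming distance to each entry.
        .map(|(&(hash, ch), &(_, rank))| ((query ^ hash).count_ones(), rank, ch))
        .filter(|&(dist, _, _)| dist <= MAX_DISTANCE)
        // Smallest distance first, then the more common character on ties.
        .min_by_key(|&(dist, rank, _)| (dist, rank))
        .map(|(_, _, ch)| ch)
}
```

`min_by_key` returns the first minimum it encounters, so the result depends only on table order and stays deterministic between runs.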
Tie-break rules for visually similar glyphs:
When pHash distance is tied or ambiguous (e.g., l vs I vs |, O vs 0):
- Prefer digits if previous span resolved to digits (monospaced font context)
- Prefer lowercase letters if surrounding text is mostly lowercase
- Use Unicode frequency rank as final tie-breaker (common chars preferred)
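One way to encode these three rules is as a lexicographic sort key, sketched below. The `Context` struct, its fields, and `break_tie` are hypothetical, and the sketch assumes a lower rank value means a more common character.

```rust
/// Context gathered from neighboring text (hypothetical structure).
pub struct Context {
    pub prev_span_is_digits: bool,
    pub mostly_lowercase: bool,
}

/// Choose among equally distant candidates, given as (char, frequency rank).
pub fn break_tie(candidates: &[(char, u32)], ctx: &Context) -> Option<char> {
    candidates
        .iter()
        .min_by_key(|&&(ch, rank)| {
            // `false` sorts before `true`, so a candidate that satisfies a
            // rule (no penalty) is preferred over one that does not.
            let digit_penalty = !(ctx.prev_span_is_digits && ch.is_ascii_digit());
            let lower_penalty = !(ctx.mostly_lowercase && ch.is_lowercase());
            (digit_penalty, lower_penalty, rank)
        })
        .map(|&(ch, _)| ch)
}
```

Ordering the key as (digit rule, lowercase rule, rank) mirrors the priority of the list above: frequency only decides when neither context rule separates the candidates.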
Result: unicode_source = "shape_match", confidence = 0.7.
Entry point: pdftract-core::font::resolve_shape_match() (Phase 2.2) and Type 3 fallback (Phase 2.4).
Confidence Scoring Formula
Each level emits a confidence score:
| Level | Source | Confidence |
|---|---|---|
| 1 | ToUnicode CMap | 1.0 |
| 2 | AGL lookup | 0.9 |
| 3 | Font fingerprint | 0.85 |
| 4 | Shape match | 0.7 |
Cascade behavior: The first non-empty result wins. Confidence is emitted in diagnostic metadata but does NOT override cascade priority.
Post-cascade context rescoring (optional): For results below 0.8 confidence, apply character n-gram or dictionary-based validation using surrounding high-confidence characters. This can downgrade ambiguous matches to U+FFFD with a SHAPE_AMBIGUOUS diagnostic.
Type 3 Font Handling
Type 3 fonts define each glyph as a PDF content stream in /CharProcs. The same cascade applies, except that Level 3 is skipped because a Type 3 font has no /FontFile* program to hash:
- Check /ToUnicode first (Level 1)
- If absent, attempt /Encoding glyph name lookup (Level 2)
- If glyph name is non-standard (arbitrary user name), rasterize the content stream to 32×32 bitmap and apply shape recognition (Level 4)
Type 3 rasterization (Phase 2.4):
- Execute the glyph's content stream as a constrained sub-content-stream
- Track graphics state (CTM, fill/stroke state) using the Phase 3 graphics state machine
- Record stroke/fill operations to a 32×32 grayscale bitmap
- Support operators: m l c v y h re (path construction), S s f F f* B B* b b* (path painting), q Q cm (state), Do (form XObject, recursive)
- Stack depth limit: 20 levels (same as form XObject limit)
Entry point: pdftract-core::font::rasterize_type3_glyph() (Phase 2.4).
Database Licensing and Provenance
Font fingerprint database (build/font-fingerprints.json):
- Source: Adobe's public font databases, Google Fonts cmap metric exports
- License: Fonts used are SIL Open Font License or similar permissive licenses
- Curation: Maintainer-owned; see OQ-02 and docs/research/font-fingerprinting.md
Glyph shape database (build/glyph-shapes.json):
- Source: Glyph bitmaps rendered from open-source fonts (Google Fonts corpus, SIL OFL fonts)
- License: SIL Open Font License fonts are free of Unicode licensing entanglements
- Curation: Offline hash pipeline; JSON is the authoritative artifact
Reproducibility: Same glyph shape MUST always hash to the same bucket and yield the same Unicode value. No float nondeterminism, no random seeds.
Failure Mode
If all four levels fail:
- Emit U+FFFD (REPLACEMENT CHARACTER)
- Set unicode_source = "unknown", confidence = 0.0
- Log GLYPH_UNMAPPED diagnostic with font ID and character code
Cross-References
- Phase 2.2 (font recognition coordinator): Entry point for the four-level cascade
- Phase 2.4 (Type 3 charstring renderer): Type 3 fallback to Level 4
- Phase 2.5 (shape database bundling): Build artifact generation
- OQ-02 (plan line 513): Font-fingerprint database curation pipeline ownership
References
- ISO 32000-2:2020 (PDF 2.0), Section 9 (Text) and Annex D (Character Sets and Encodings)
- Adobe Glyph List Specification, version 1.7: adobe-type-tools/agl-specification
- Adobe Glyph List for New Fonts (aglfn): adobe-type-tools/agl-aglfn
- Adobe Type 1 Font Format specification (Black Book), Chapter 6 (Charstrings)
- Apple TrueType Reference Manual: glyf table specification
- OpenType Specification 1.9, Microsoft Typography (CFF / CFF2 charstring formats)
- Unicode Standard Annex #29 (Unicode Text Segmentation)
- pdftract plan: Phase 2.2 (line 1340), Phase 2.4 (line 1416), Phase 2.5 (line 1434)