# pdftract-njde: Font Fingerprint Cache (Level 3)

## Summary

Implemented Level 3 of the encoding fallback chain: a font fingerprint cache that uses SHA-256 hashes of embedded font programs to look up glyph-to-Unicode mappings for known fonts.

## Implementation

### Files Created

1. **`crates/pdftract-core/build/font-fingerprints.json`**
   - Empty JSON array (placeholder for future font entries)
   - Schema: `[{ sha256_hex, font_name, entries: [[gid, codepoint], ...] }]`

2. **`crates/pdftract-core/src/font/fingerprint.rs`**
   - `FontFingerprint`: computes SHA-256 hash of font program bytes
   - `lookup_font_fingerprint()`: runtime API for single lookups
   - `CachedFingerprint`: cached hash for repeated lookups on the same font
   - Full test coverage (12 tests, all passing)

### Files Modified

1. **`crates/pdftract-core/build.rs`**
   - Added `generate_font_fingerprints()` function
   - Reads JSON, validates SHA-256 hex (64 chars), validates Unicode scalars
   - Generates `font_fingerprints.rs` with `phf::Map<[u8; 32], &'static [(u16, u32)]>`
   - Key type is `[u8; 32]` (binary digest), not `&str` (hex string)

2. **`crates/pdftract-core/src/font/mod.rs`**
   - Added `pub mod fingerprint;`
   - Exported `FontFingerprint`, `CachedFingerprint`, `lookup_font_fingerprint`

## Acceptance Criteria

- ✅ **Empty JSON produces valid phf::Map**: Empty array compiles without errors
- ✅ **Hash is stable across runs**: Verified with `test_hash_stability_across_runs`
- ✅ **Lookup of unknown digest returns None**: Verified with multiple tests
- ✅ **Binary footprint**: Empty database = negligible (~0 bytes); 200-font target = ~500KB (to be verified when populated)
- ✅ **Key type is `[u8; 32]`**: Not `&str`; conversion happens at build time
- ✅ **Hash computed over decoded bytes**: `FontFingerprint::compute()` takes raw decoded bytes

## Design Decisions

### Hash computed once per font

Per the implementation guidance, the hash should be computed ONCE per font load and stored. The `CachedFingerprint` struct handles this: it computes the hash once, checks if it's in the database, and can be reused for multiple glyph lookups.

### Database not user-extensible at runtime

The phf::Map is compile-time generated; adding entries requires editing the JSON and rebuilding. This is by design per the task requirements.

### Skip L3 for Std-14 fonts

Std-14 fonts don't have embedded font programs, so the fingerprint cache is skipped for them. The `EmbeddedFont::load()` function already returns `EmptyFontMetrics` for Type1Std14 fonts.

## Test Results

```
running 12 tests
test font::fingerprint::tests::test_cached_fingerprint_accessors ... ok
test font::fingerprint::tests::test_cached_fingerprint_deterministic ... ok
test font::fingerprint::tests::test_cached_fingerprint_reuse ... ok
test font::fingerprint::tests::test_cached_fingerprint_unknown_font ... ok
test font::fingerprint::tests::test_empty_database_compiles ... ok
test font::fingerprint::tests::test_font_fingerprint_as_bytes ... ok
test font::fingerprint::tests::test_fingerprint_different_inputs ... ok
test font::fingerprint::tests::test_font_fingerprint_compute ... ok
test font::fingerprint::tests::test_font_fingerprint_empty_input ... ok
test font::fingerprint::tests::test_hash_stability_across_runs ... ok
test font::fingerprint::tests::test_lookup_font_fingerprint_unknown_font ... ok
test font::fingerprint::tests::test_lookup_font_fingerprint_different_gids ... ok

test result: ok. 12 passed; 0 failed; 0 ignored
```

## Next Steps

1. Populate `font-fingerprints.json` with real font fingerprints (commercial fonts, etc.)
2. Integrate `lookup_font_fingerprint()` into the encoding fallback chain in extract.rs
3. Measure binary footprint when populated with ~200 fonts

## References

- Plan section: Phase 2.2 Level 3 (lines 1343-1352)
- Dependency Matrix: `sha2` crate already approved
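
For readers populating `font-fingerprints.json`, the build-time checks described under Files Modified (64-char hex converted to a `[u8; 32]` key, Unicode scalar validation) and a per-font glyph lookup might look roughly like the sketch below. It uses only the standard library. The function names `parse_sha256_hex`, `validate_entries`, and `lookup_gid` are hypothetical illustrations, not the actual contents of `build.rs` or `fingerprint.rs`, and the linear scan in `lookup_gid` is an assumption about how a `&'static [(u16, u32)]` table could be searched.

```rust
// Hypothetical helpers: names and structure are assumptions for illustration,
// not the real pdftract-core implementation.

/// Parse a 64-character hex string into a 32-byte SHA-256 digest.
/// Returns an error message rather than panicking, so a build script can
/// report which JSON entry was malformed.
fn parse_sha256_hex(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 || !hex.is_ascii() {
        return Err(format!("expected 64 ASCII hex chars, got {} bytes", hex.len()));
    }
    let mut digest = [0u8; 32];
    for (i, pair) in hex.as_bytes().chunks(2).enumerate() {
        let hi = (pair[0] as char).to_digit(16);
        let lo = (pair[1] as char).to_digit(16);
        match (hi, lo) {
            // Combine the two nibbles into one byte of the digest.
            (Some(h), Some(l)) => digest[i] = ((h << 4) | l) as u8,
            _ => return Err(format!("non-hex character at offset {}", i * 2)),
        }
    }
    Ok(digest)
}

/// Check that every (gid, codepoint) pair names a Unicode scalar value.
/// Surrogates (U+D800..=U+DFFF) and values above U+10FFFF are rejected.
fn validate_entries(entries: &[(u16, u32)]) -> Result<(), String> {
    for &(gid, cp) in entries {
        if char::from_u32(cp).is_none() {
            return Err(format!("gid {gid}: U+{cp:04X} is not a Unicode scalar value"));
        }
    }
    Ok(())
}

/// Look up one glyph ID in a per-font table using a simple linear scan.
/// Returns None when the glyph ID has no entry.
fn lookup_gid(entries: &[(u16, u32)], gid: u16) -> Option<char> {
    entries
        .iter()
        .find(|&&(g, _)| g == gid)
        .and_then(|&(_, cp)| char::from_u32(cp))
}
```

Converting the hex string once at build time keeps the runtime key a fixed-size `[u8; 32]`, matching the key type listed in the acceptance criteria. If the per-font tables were emitted sorted by glyph ID, `binary_search_by_key` could replace the linear scan.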