pdftract/notes/pdftract-njde.md
jedarden a20647a4a6 feat(pdftract-njde): implement font fingerprint cache (Level 3)
Implement Level 3 of the encoding fallback chain. Hash the raw decoded
font program bytes (/FontFile, /FontFile2, /FontFile3) with SHA-256
and look up the 32-byte digest in a compile-time phf::Map.

- build.rs: generate_font_fingerprints() reads JSON, builds phf::Map
- src/font/fingerprint.rs: FontFingerprint, CachedFingerprint, lookup API
- build/font-fingerprints.json: empty database (placeholder)

Acceptance criteria:
- Empty JSON produces valid phf::Map
- Hash is stable across runs
- Lookup of unknown digest returns None
- Binary footprint < 500KB for 200-font DB (empty = negligible)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 21:27:24 -04:00


pdftract-njde: Font Fingerprint Cache (Level 3)

Summary

Implemented Level 3 of the encoding fallback chain: a font fingerprint cache that hashes each embedded font program with SHA-256 and uses the digest to look up glyph-to-Unicode mappings for known fonts.

Implementation

Files Created

  1. crates/pdftract-core/build/font-fingerprints.json

    • Empty JSON array (placeholder for future font entries)
    • Schema: [{ sha256_hex, font_name, entries: [[gid, codepoint], ...] }]
  2. crates/pdftract-core/src/font/fingerprint.rs

    • FontFingerprint: computes SHA-256 hash of font program bytes
    • lookup_font_fingerprint(): runtime API for single lookups
    • CachedFingerprint: cached hash for repeated lookups on the same font
    • 12 unit tests, all passing (listed under Test Results)

Files Modified

  1. crates/pdftract-core/build.rs

    • Added generate_font_fingerprints() function
    • Reads the JSON, checks that each sha256_hex is exactly 64 hex characters, and checks that each codepoint is a valid Unicode scalar value
    • Generates font_fingerprints.rs with phf::Map<[u8; 32], &'static [(u16, u32)]>
    • Key type is [u8; 32] (binary digest), not &str (hex string)
  2. crates/pdftract-core/src/font/mod.rs

    • Added pub mod fingerprint;
    • Exported FontFingerprint, CachedFingerprint, lookup_font_fingerprint
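
The two build-time checks described above can be sketched with the standard library alone. This is a minimal illustration, not the code in build.rs: the function names `parse_sha256_hex` and `is_unicode_scalar` are hypothetical, and the real build script may report errors differently (for example by panicking with the offending entry).

```rust
/// Decode a 64-character hex string into a 32-byte digest.
/// Returns None for a wrong length or any non-hex character.
/// (Hypothetical helper; the name is not taken from build.rs.)
fn parse_sha256_hex(hex: &str) -> Option<[u8; 32]> {
    let bytes = hex.as_bytes();
    if bytes.len() != 64 {
        return None;
    }
    let mut digest = [0u8; 32];
    for (i, pair) in bytes.chunks_exact(2).enumerate() {
        // Each pair of hex digits becomes one byte of the digest.
        let hi = (pair[0] as char).to_digit(16)?;
        let lo = (pair[1] as char).to_digit(16)?;
        digest[i] = ((hi << 4) | lo) as u8;
    }
    Some(digest)
}

/// A code point is accepted only if it is a Unicode scalar value,
/// i.e. not a surrogate and not above U+10FFFF.
fn is_unicode_scalar(codepoint: u32) -> bool {
    char::from_u32(codepoint).is_some()
}
```

Doing the hex-to-binary conversion here is what lets the generated map use [u8; 32] keys, so no hex parsing is needed at runtime.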

Acceptance Criteria

  • Empty JSON produces valid phf::Map: Empty array compiles without errors
  • Hash is stable across runs: Verified with test_hash_stability_across_runs
  • Lookup of unknown digest returns None: Verified with multiple tests
  • Binary footprint: Empty database = negligible (~0 bytes); 200-font target < 500KB (to be verified when populated)
  • Key type is [u8; 32]: not &str; the hex-to-binary conversion happens at build time
  • Hash computed over decoded bytes: FontFingerprint::compute() takes raw decoded bytes
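
As a rough sanity check on the footprint target, the per-font entry budget can be estimated from the size of one (u16, u32) pair. This is a back-of-the-envelope sketch only: it reads 500KB as 500 * 1024 bytes, counts the 32-byte key per font, and ignores phf index tables and slice headers, so the result is an upper bound rather than a measurement. The function name `entries_per_font` is hypothetical.

```rust
use std::mem::size_of;

/// Size of one (gid, codepoint) pair as stored in the generated slices.
/// On typical targets this is 8 bytes (6 bytes of data padded to u32 alignment).
const ENTRY_BYTES: usize = size_of::<(u16, u32)>();

/// Rough number of entries per font that fit in `budget_bytes` spread
/// evenly over `fonts` fonts, after subtracting the 32-byte key per font.
fn entries_per_font(budget_bytes: usize, fonts: usize) -> usize {
    if fonts == 0 {
        return 0;
    }
    let per_font = budget_bytes / fonts;
    per_font.saturating_sub(32) / ENTRY_BYTES
}
```

Under these assumptions, `entries_per_font(500 * 1024, 200)` gives a few hundred mapped glyphs per font, which is the figure to compare against real fonts once the database is populated.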

Design Decisions

Hash computed once per font

Per the implementation guidance, the hash is computed once per font load and stored. The CachedFingerprint struct handles this: it computes the hash once, checks whether the digest is in the database, and can then be reused for multiple glyph lookups.
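
The compute-once pattern can be sketched as follows. This is an illustration under stated assumptions, not the actual CachedFingerprint: the type `CachedLookup`, the helper `find_table`, and the `DEMO_DB` slice are hypothetical stand-ins, the digest is assumed to have been computed already (SHA-256 of the decoded font program), and a plain slice replaces the generated phf::Map so the example needs no external crates.

```rust
/// Sorted (gid, codepoint) pairs for one font.
type Table = &'static [(u16, u32)];

/// Hypothetical stand-in for the generated database (the real one is a phf::Map).
static DEMO_DB: &[([u8; 32], Table)] = &[([0x11; 32], &[(1, 0x41), (2, 0x42)])];

/// Find the table for a digest, if the font is known.
fn find_table(digest: &[u8; 32]) -> Option<Table> {
    DEMO_DB
        .iter()
        .find(|(key, _)| key == digest)
        .map(|(_, table)| *table)
}

/// Holds the digest computed at font load plus the resolved table, if any.
struct CachedLookup {
    digest: [u8; 32],
    table: Option<Table>,
}

impl CachedLookup {
    /// Resolve the database entry once, when the font is loaded.
    fn new(digest: [u8; 32]) -> Self {
        Self { digest, table: find_table(&digest) }
    }

    fn digest(&self) -> &[u8; 32] {
        &self.digest
    }

    fn is_known(&self) -> bool {
        self.table.is_some()
    }

    /// Map a glyph id to a character without rehashing the font.
    /// Relies on the table being sorted by gid.
    fn lookup(&self, gid: u16) -> Option<char> {
        let table = self.table?;
        let idx = table.binary_search_by_key(&gid, |&(g, _)| g).ok()?;
        char::from_u32(table[idx].1)
    }
}
```

With this shape, an unknown font costs one failed database probe at load time, and every later glyph lookup on that font returns None immediately.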

Database not user-extensible at runtime

The phf::Map is compile-time generated; adding entries requires editing the JSON and rebuilding. This is by design per the task requirements.

Skip L3 for Std-14 fonts

Std-14 fonts don't have embedded font programs, so the fingerprint cache is skipped for them. The EmbeddedFont::load() function already returns EmptyFontMetrics for Type1Std14 fonts.
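
A minimal sketch of that gate, assuming a simplified font classification: the `FontKind` enum and the `fingerprint_input` function are hypothetical and only mirror the Type1Std14 case named above. The actual types around EmbeddedFont::load() may differ.

```rust
/// Hypothetical, simplified font classification.
enum FontKind {
    /// One of the standard 14 fonts: no embedded font program to hash.
    Type1Std14,
    /// Any font that may carry a /FontFile, /FontFile2, or /FontFile3 stream.
    Embedded,
}

/// Return the bytes Level 3 should hash, or None when Level 3 is skipped.
fn fingerprint_input<'a>(kind: &FontKind, program: Option<&'a [u8]>) -> Option<&'a [u8]> {
    match kind {
        // Nothing to fingerprint, so fall through to the next level.
        FontKind::Type1Std14 => None,
        // Hash the decoded font program if one is present.
        FontKind::Embedded => program,
    }
}
```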

Test Results

running 12 tests
test font::fingerprint::tests::test_cached_fingerprint_accessors ... ok
test font::fingerprint::tests::test_cached_fingerprint_deterministic ... ok
test font::fingerprint::tests::test_cached_fingerprint_reuse ... ok
test font::fingerprint::tests::test_cached_fingerprint_unknown_font ... ok
test font::fingerprint::tests::test_empty_database_compiles ... ok
test font::fingerprint::tests::test_font_fingerprint_as_bytes ... ok
test font::fingerprint::tests::test_fingerprint_different_inputs ... ok
test font::fingerprint::tests::test_font_fingerprint_compute ... ok
test font::fingerprint::tests::test_font_fingerprint_empty_input ... ok
test font::fingerprint::tests::test_hash_stability_across_runs ... ok
test font::fingerprint::tests::test_lookup_font_fingerprint_unknown_font ... ok
test font::fingerprint::tests::test_lookup_font_fingerprint_different_gids ... ok

test result: ok. 12 passed; 0 failed; 0 ignored

Next Steps

  1. Populate font-fingerprints.json with real font fingerprints (commercial fonts, etc.)
  2. Integrate lookup_font_fingerprint() into the encoding fallback chain in extract.rs
  3. Measure binary footprint when populated with ~200 fonts

References

  • Plan section: Phase 2.2 Level 3 (lines 1343-1352)
  • Dependency Matrix: sha2 crate already approved