Implement compile-time phf::Set of 20,000 common English words for
dictionary coverage scoring in readability analysis (Phase 4.7).
Key changes:
- Added wordlist-en-20k.txt (20k frequency-sorted English words)
- Extended build.rs to generate phf::Set from wordlist
- Added layout/wordlist.rs module with is_english_word() API
- Added wordlist benchmarks (< 100 ns lookup achieved)
Test results:
- All 9 unit tests pass
- Benchmarks: 13-62 ns per lookup (well under 100 ns requirement)
- Binary size: Estimated ~200-220 KB (within 250 KB limit)
Closes: pdftract-9wevc
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement Level 3 of the encoding fallback chain. Hash the raw decoded
font program bytes (/FontFile, /FontFile2, /FontFile3) with SHA-256
and look up the 32-byte digest in a compile-time phf::Map.
- build.rs: generate_font_fingerprints() reads JSON, builds phf::Map
- src/font/fingerprint.rs: FontFingerprint, CachedFingerprint, lookup API
- build/font-fingerprints.json: empty database (placeholder)
Acceptance criteria:
- Empty JSON produces valid phf::Map
- Hash is stable across runs
- Lookup of unknown digest returns None
- Binary footprint < 500KB for 200-font DB (empty = negligible)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the 6 named-encoding character-code-to-glyph-name lookup
tables required by Level 2 of the encoding fallback chain:
- WinAnsiEncoding (Windows-1252 superset of StandardEncoding)
- MacRomanEncoding (Mac OS Roman encoding)
- MacExpertEncoding (Mac OS Expert character set)
- StandardEncoding (Adobe Standard encoding)
- SymbolEncoding (Symbol font encoding)
- ZapfDingbatsEncoding (Zapf Dingbats font encoding)
These tables map character codes (0-255) to glyph names, which are then
mapped to Unicode via the Adobe Glyph List (AGL).
Acceptance criteria:
- All 6 tables compile into static arrays with binary footprint < 30 KB
- WIN_ANSI[0x92] == Some("quoteright") (canonical WinAnsi test)
- MAC_ROMAN[0xD2] == Some("quotedblleft") and MAC_ROMAN[0xD3] == Some("quotedblright")
- STANDARD[0x20] == Some("space")
- NamedEncoding::from_name("WinAnsiEncoding") == Some(NamedEncoding::WinAnsi)
Files:
- crates/pdftract-core/build/named-encodings.json - Source data from ISO 32000-1 Annex D
- crates/pdftract-core/src/font/encoding.rs - Public API with NamedEncoding enum
- crates/pdftract-core/build.rs - Build script updates for encoding generation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add build.rs that generates compile-time std14 metrics from JSON
- Add std14.rs module with Std14Metrics struct and get_std14_metrics()
- Add build/std14-metrics.json with AFM-derived widths for all 14 fonts
- Re-export Std14Metrics, NamedEncoding, get_std14_metrics in lib.rs
Acceptance criteria:
- All 14 Standard fonts (Courier, Helvetica, Times, Symbol, ZapfDingbats
and their variants) return valid metrics from the registry
- Subset-prefixed names (ABCDEF+Helvetica) resolve via strip_subset_prefix()
- Width tables match Adobe AFM data within rounding tolerance
- Binary footprint < 60 KB (generated source: 20 KB, actual data ~8 KB)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>