Commit graph

8 commits

Author SHA1 Message Date
jedarden
4f651ca9b8 feat(bf-1vv5n): add Roboto font fingerprint entries to font-fingerprints.json
- Add SHA-256 hash of Roboto-Regular.ttf (56a45233d29f11b4dfb86d248e921939d115778f87325e7ae8cc108383d6664d)
- Map glyph IDs 1-95 to ASCII codepoints 32-126 (space through tilde)
- Enables Level 3 Unicode recovery via font fingerprint matching
- Verify: cargo build -p pdftract-core passes, checksum verified

Closes bf-1vv5n.
2026-06-08 20:31:30 -04:00
jedarden
54fe6c1964 feat(pdftract-1xf4d): implement TH-06 supply-chain gate
- Add minimum version requirements to deny.toml (ring >= 0.17.5, rustls >= 0.23)
- Create build/CHECKSUMS.sha256 for build-time data file integrity
- Update build.rs to verify checksums on every build
- Add tampering detection tests (th06_checksum_test.rs)
- Create nightly supply-chain scan workflow (pdftract-nightly-supply-chain.yaml)
- Update audit.toml with advisory exceptions

Closes: pdftract-1xf4d
Refs: plan lines 877, 883-896, 906-913
2026-05-26 17:31:13 -04:00
jedarden
b96c3bfd37 feat(pdftract-9wevc): implement 20k English wordlist for readability scoring
Implement compile-time phf::Set of 20,000 common English words for
dictionary coverage scoring in readability analysis (Phase 4.7).

Key changes:
- Added wordlist-en-20k.txt (20k frequency-sorted English words)
- Extended build.rs to generate phf::Set from wordlist
- Added layout/wordlist.rs module with is_english_word() API
- Added wordlist benchmarks (< 100 ns lookup achieved)

Test results:
- All 9 unit tests pass
- Benchmarks: 13-62 ns per lookup (well under 100 ns requirement)
- Binary size: Estimated ~200-220 KB (within 250 KB limit)

Closes: pdftract-9wevc

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:29:13 -04:00
jedarden
f804887a86 feat(pdftract-43ry): implement predefined CMap registry
Implement a registry of the 9 named CMaps PDF readers MUST support
without an embedded CMap stream: Identity-H, Identity-V, and 8 UTF16
CMaps (UniJIS-UTF16-H/V, UniGB-UTF16-H/V, UniCNS-UTF16-H/V,
UniKS-UTF16-H/V).

- Added PredefinedCMap struct with name, is_vertical, collection fields
- from_name() resolves all 10 predefined CMap names
- decode_bytes() reads 2-byte big-endian codes as CIDs
- cid_to_unicode() maps CIDs to Unicode codepoints (None for Identity-H/V)
- Build-time generation of PHF maps from JSON files
- Feature flag 'cjk' controls ~1.2 MB UCS2 map inclusion (default off)

Acceptance criteria:
- All 10 names resolve via from_name()
- Identity-H decodes [0x00, 0x41] to CID 65
- UniJIS-UTF16-H decodes CID 236 to U+3042 (あ)
- Vertical (V) variant returns identical CID->Unicode as Horizontal (H)
- Unknown name returns None
- Feature flag 'cjk' controls UCS2 map inclusion

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:00:59 -04:00
jedarden
a20647a4a6 feat(pdftract-njde): implement font fingerprint cache (Level 3)
Implement Level 3 of the encoding fallback chain. Hash the raw decoded
font program bytes (/FontFile, /FontFile2, /FontFile3) with SHA-256
and look up the 32-byte digest in a compile-time phf::Map.

- build.rs: generate_font_fingerprints() reads JSON, builds phf::Map
- src/font/fingerprint.rs: FontFingerprint, CachedFingerprint, lookup API
- build/font-fingerprints.json: empty database (placeholder)

Acceptance criteria:
- Empty JSON produces valid phf::Map
- Hash is stable across runs
- Lookup of unknown digest returns None
- Binary footprint < 500KB for 200-font DB (empty = negligible)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 21:27:24 -04:00
jedarden
566cac2aea feat(pdftract-28m6): implement AGL compile-time phf::Map
Add Adobe Glyph List (AGL) 1.4 and AGLFN 1.7 compile-time lookup using phf::Map.

- Add generate_agl.py to parse AGL source files and generate agl.json
- Add aglfn.txt (AGLFN 1.7, ~770 entries) and glyphlist.txt (AGL 1.4, ~4400 entries)
- Add build.rs function to generate two phf::Map structures:
  - AGL: 4,200 single-codepoint entries
  - AGL_MULTI: 81 multi-codepoint entries (Hebrew/Arabic)
- Add src/font/agl.rs with public API:
  - unicode_for_glyph_name() - handles algorithmic patterns (uniXXXX, uXXXXXX), variant stripping, AGL lookup
  - unicode_for_glyph_name_multi() - for multi-codepoint ligatures

All 21 acceptance criteria tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 18:44:47 -04:00
jedarden
09c3498cf4 feat(pdftract-3dwu): implement named encoding tables
Implements the 6 named-encoding character-code-to-glyph-name lookup
tables required by Level 2 of the encoding fallback chain:
- WinAnsiEncoding (Windows-1252 superset of StandardEncoding)
- MacRomanEncoding (Mac OS Roman encoding)
- MacExpertEncoding (Mac OS Expert character set)
- StandardEncoding (Adobe Standard encoding)
- SymbolEncoding (Symbol font encoding)
- ZapfDingbatsEncoding (Zapf Dingbats font encoding)

These tables map character codes (0-255) to glyph names, which are then
mapped to Unicode via the Adobe Glyph List (AGL).

Acceptance criteria:
- All 6 tables compile into static arrays with binary footprint < 30 KB
- WIN_ANSI[0x92] == Some("quoteright") (canonical WinAnsi test)
- MAC_ROMAN[0xD2] == Some("quotedblleft") and MAC_ROMAN[0xD3] == Some("quotedblright")
- STANDARD[0x20] == Some("space")
- NamedEncoding::from_name("WinAnsiEncoding") == Some(NamedEncoding::WinAnsi)

Files:
- crates/pdftract-core/build/named-encodings.json - Source data from ISO 32000-1 Annex D
- crates/pdftract-core/src/font/encoding.rs - Public API with NamedEncoding enum
- crates/pdftract-core/build.rs - Build script updates for encoding generation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 18:00:05 -04:00
jedarden
7429a67d08 feat(pdftract-juc): implement Standard 14 font metrics registry
- Add build.rs that generates compile-time std14 metrics from JSON
- Add std14.rs module with Std14Metrics struct and get_std14_metrics()
- Add build/std14-metrics.json with AFM-derived widths for all 14 fonts
- Re-export Std14Metrics, NamedEncoding, get_std14_metrics in lib.rs

Acceptance criteria:
- All 14 Standard fonts (Courier, Helvetica, Times, Symbol, ZapfDingbats
  and their variants) return valid metrics from the registry
- Subset-prefixed names (ABCDEF+Helvetica) resolve via strip_subset_prefix()
- Width tables match Adobe AFM data within rounding tolerance
- Binary footprint < 60 KB (generated source: 20 KB, actual data ~8 KB)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 14:04:02 -04:00