3 commits

Author SHA1 Message Date
jedarden
dd2d3502c6 feat(glyph-shape): implement font corpus fetch script and shape DB generation
Implemented scripts/fetch-shape-corpus.sh for downloading an open-licensed
font corpus and generating the glyph shape database used for L4 recognition.

- Script downloads fonts from build/shape-corpus-manifest.txt
- Copies LICENSE files to build/font-licenses/ for compliance
- Idempotent: skips already-present fonts
- Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32)
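
For illustration, a minimal sketch of how a centering helper can stay
in bounds when a glyph is larger than the canvas. Only the name
center_bitmap_32x32 comes from this commit; the signature, the row-major
u8 layout, and the crop-around-center policy are assumptions. The sketch
guards against unsigned offsets such as (32 - width) / 2, which
underflow in Rust when width > 32.

```rust
/// Side length of the square target canvas.
const SIDE: usize = 32;

/// Hypothetical signature: center a row-major grayscale bitmap on a
/// 32x32 canvas. Glyphs larger than the canvas are cropped around their
/// center instead of producing a negative (underflowing) offset.
fn center_bitmap_32x32(src: &[u8], width: usize, height: usize) -> [u8; SIDE * SIDE] {
    let mut out = [0u8; SIDE * SIDE];
    if width == 0 || height == 0 || src.len() < width * height {
        return out; // nothing usable to copy
    }
    // Size of the region that is actually copied.
    let copy_w = width.min(SIDE);
    let copy_h = height.min(SIDE);
    // Source offsets are nonzero only when cropping; destination
    // offsets are nonzero only when padding. Neither can underflow.
    let src_x = (width - copy_w) / 2;
    let src_y = (height - copy_h) / 2;
    let dst_x = (SIDE - copy_w) / 2;
    let dst_y = (SIDE - copy_h) / 2;
    for row in 0..copy_h {
        let s = (src_y + row) * width + src_x;
        let d = (dst_y + row) * SIDE + dst_x;
        out[d..d + copy_w].copy_from_slice(&src[s..s + copy_w]);
    }
    out
}
```

Clamping the copied region first keeps every subtraction non-negative,
so no checked or saturating arithmetic is needed in the copy loop.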

Generated build/glyph-shapes.json with 9,141 glyphs (target: > 4,500):
  - DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic)
  - Roboto: 2,392 glyphs (Latin Basic, extended)
  - JetBrains Mono: 1,176 glyphs (monospace)
  - Source Code Pro: 1,124 glyphs (monospace)

build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis
for pHash data redistribution.

Closes: pdftract-1i8n
2026-05-24 09:48:29 -04:00
jedarden
6b730fc824 feat(pdftract-1sms): implement build.rs emitter for glyph shape database
Extend build.rs to read build/glyph-shapes.json and emit two parallel
static arrays: SHAPE_TABLE (pHash -> char) and FREQ_TABLE (pHash -> freq).
Generated file written to OUT_DIR/shape_db.rs and included in shape.rs.
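
A rough sketch of what such an emitter could look like, separated from
file I/O so it can be tested. The Entry struct, the render_shape_db
name, and the keep-first policy for duplicate hashes are assumptions;
only the table names and the JSON field names come from this commit.

```rust
use std::fmt::Write;

/// Hypothetical in-memory form of one JSON entry
/// (phash_hex already parsed into a u64).
struct Entry {
    phash: u64,
    ch: char,
    frequency_rank: u32,
}

/// Render Rust source for the two parallel tables. Returns the source
/// text and the number of duplicate-pHash entries that were dropped.
fn render_shape_db(mut entries: Vec<Entry>) -> (String, usize) {
    // Stable sort, so the first entry for a given pHash survives dedup.
    entries.sort_by_key(|e| e.phash);
    let before = entries.len();
    entries.dedup_by_key(|e| e.phash);
    let dropped = before - entries.len();

    let n = entries.len();
    let mut src = String::new();
    writeln!(src, "pub static SHAPE_TABLE: [(u64, char); {n}] = [").unwrap();
    for e in &entries {
        // {:?} on a char yields a valid, escaped Rust char literal.
        writeln!(src, "    ({:#018x}, {:?}),", e.phash, e.ch).unwrap();
    }
    src.push_str("];\n");
    writeln!(src, "pub static FREQ_TABLE: [(u64, u32); {n}] = [").unwrap();
    for e in &entries {
        writeln!(src, "    ({:#018x}, {}),", e.phash, e.frequency_rank).unwrap();
    }
    src.push_str("];\n");
    // Compile-time check that the two tables stay parallel.
    src.push_str("const _: () = assert!(SHAPE_TABLE.len() == FREQ_TABLE.len());\n");
    (src, dropped)
}
```

In a real build.rs the returned text would be written under OUT_DIR, and
a nonzero drop count could be reported with a cargo:warning= line.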

Key changes:
- Add generate_shape_db() function to build.rs
- Parse JSON entries with phash_hex, char, frequency_rank
- Sort by pHash ascending and validate for duplicates
- Use Rust's Debug formatter for proper char escaping
- Include compile-time length assertion
- Handle missing JSON gracefully (empty tables + warning)
- Update shape_database() to return SHAPE_TABLE
- Update lookup_shape() to work with &[(u64, char)]
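
Because the table is sorted by pHash, a lookup over &[(u64, char)] can
use binary search. The sketch below is an assumption about how
lookup_shape() might work, not the committed code; lookup_nearest and
its Hamming-distance tolerance are purely hypothetical additions.

```rust
/// Exact-match lookup in a table sorted by pHash ascending.
fn lookup_shape(table: &[(u64, char)], phash: u64) -> Option<char> {
    table
        .binary_search_by_key(&phash, |&(h, _)| h)
        .ok()
        .map(|i| table[i].1)
}

/// Hypothetical fuzzy variant: return the character whose pHash has the
/// smallest Hamming distance to `phash`, if within `max_distance` bits.
/// This is a linear scan; sorting does not help for Hamming distance.
fn lookup_nearest(table: &[(u64, char)], phash: u64, max_distance: u32) -> Option<char> {
    table
        .iter()
        .map(|&(h, c)| ((h ^ phash).count_ones(), c))
        .filter(|&(d, _)| d <= max_distance)
        .min_by_key(|&(d, _)| d)
        .map(|(_, c)| c)
}
```

The exact lookup is O(log n); the fuzzy scan is O(n), which is still
cheap for a table of a few thousand entries.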

Acceptance criteria:
- Build with empty JSON -> empty tables: PASS
- Build with 4-entry JSON -> sorted entries: PASS
- Rebuild without changes -> no rebuild: PASS
- Duplicate detection -> warning: PASS
- Binary size < 300 KB: PASS (~200 KB estimated)

Closes: pdftract-1sms

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 06:21:54 -04:00
jedarden
f08369bbf0 feat(xtask): implement gen-shape-db subcommand for glyph pHash database
Add cargo xtask gen-shape-db command that walks font directories,
rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs
build/glyph-shapes.json.

Implementation details:
- Fontdue integration for TrueType/OpenType font loading
- 32x32 bitmap rasterization with centering
- DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold)
- Character frequency data for collision resolution
- Deduplication by (phash, char) pairs
- Cross-character collision handling (keep higher-frequency char)
- Sorted output by pHash ascending
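
The pHash step above can be sketched as follows. This is a minimal
illustration, not the xtask code: the function name, the unnormalized
DCT-II, keeping the DC term in the 8x8 block, and the bit order are all
assumptions, and other pHash variants make different choices.

```rust
use std::f64::consts::PI;

const N: usize = 32; // input bitmap side
const K: usize = 8; // low-frequency block side

/// Hypothetical pHash: 32x32 DCT-II, keep the 8x8 low-frequency block,
/// set bit i when coefficient i is above the block's median.
fn phash_32x32(pixels: &[u8; N * N]) -> u64 {
    // cos_table[u][x] = cos((2x + 1) * u * pi / (2N)), first K frequencies only.
    let mut cos_table = [[0f64; N]; K];
    for u in 0..K {
        for x in 0..N {
            cos_table[u][x] = ((2 * x + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
        }
    }
    // Row pass: transform each row along x.
    let mut tmp = [[0f64; K]; N];
    for y in 0..N {
        for u in 0..K {
            tmp[y][u] = (0..N)
                .map(|x| pixels[y * N + x] as f64 * cos_table[u][x])
                .sum::<f64>();
        }
    }
    // Column pass: transform along y, giving the K x K block.
    let mut coeffs = [0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            coeffs[v * K + u] = (0..N).map(|y| tmp[y][u] * cos_table[v][y]).sum::<f64>();
        }
    }
    // Median of the 64 coefficients (mean of the two middle values).
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[K * K / 2 - 1] + sorted[K * K / 2]) / 2.0;
    // One bit per coefficient, row-major within the block.
    let mut hash = 0u64;
    for (i, &c) in coeffs.iter().enumerate() {
        if c > median {
            hash |= 1u64 << i;
        }
    }
    hash
}
```

The separable row-then-column form computes only the 8 lowest
frequencies per axis, so the full 32x32 coefficient matrix is never
built.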

Artifacts:
- build/frequency.json: Character frequency rankings
- build/README.md: Command documentation and usage

Acceptance criteria:
- ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON
- ✅ Deterministic output (byte-identical on same inputs)
- ✅ Fontdue integration and 32x32 rasterization
- ✅ pHash computation via DCT
- ⚠️ No system fonts for full integration test (documented)

Closes: pdftract-2aq0
2026-05-24 05:40:44 -04:00