# Glyph Shape Database Generation

## Overview

The `cargo xtask gen-shape-db` command generates a perceptual hash (pHash) database from TrueType/OpenType font files. This database is used for glyph shape recognition in PDF text extraction.

## Usage

```bash
# From workspace root
cargo xtask gen-shape-db <fonts-dir> [output-path]

# Example
cargo xtask gen-shape-db /path/to/fonts build/glyph-shapes.json
```

## Arguments

- `fonts-dir`: Path to a directory containing `.ttf` or `.otf` font files (searched recursively)
- `output-path`: Optional output path (default: `build/glyph-shapes.json`)

## Font Requirements

Fonts MUST be open-licensed:

- Google Fonts (Apache 2.0 / OFL)
- SIL Open Font License fonts
- Other permissive licenses compatible with PDF extraction

## Output Format

The output is a JSON array of glyph entries:

```json
[
  {
    "phash_hex": "0123456789abcdef",
    "char": "A",
    "source_font": "LiberationSans-Regular.ttf",
    "frequency_rank": 30
  },
  ...
]
```

## Character Frequency

The command reads `build/frequency.json` for character frequency rankings. If the file is not found, all characters are assigned rank 0.

Format: `{"A": 30, "B": 47, ...}`, where higher values mean more common characters.

## Suggested Fonts

For comprehensive coverage, use these open-licensed fonts:

- Liberation Sans
- DejaVu Sans
- Source Code Pro
- Noto Sans (covers Latin, Greek, Cyrillic)
- Roboto

## Example Setup

```bash
# Download Google Fonts
git clone https://github.com/google/fonts.git /tmp/fonts

# Generate database
cargo xtask gen-shape-db /tmp/fonts/ofl/liberationsans build/glyph-shapes.json

# Expected: ~5000 glyphs covering Latin, Greek, Cyrillic, symbols
```

## License Attribution

Font license texts should be stored in `build/font-licenses/`, together with a `README.md` documenting the source and license terms of each font used.

## Algorithm

1. Load each font file using fontdue
2. For each Unicode codepoint (0x0000-0xFFFF):
   - Check whether the font has a glyph for the character
   - Rasterize at 32x32 pixels
   - Center the bitmap on a 32x32 canvas
   - Compute pHash via 32x32 DCT → 8x8 low-freq coefficients → median threshold
3. Deduplicate by (pHash, char) pairs
4. Resolve cross-character collisions (the same pHash for different characters) by keeping the character with the higher frequency rank
5. Sort by pHash ascending and output JSON

## Determinism

The output is byte-identical when re-run on the same input fonts and frequency data.
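
## Example: pHash and Collision Sketch

The hashing, deduplication, and sorting steps listed under Algorithm can be illustrated with a minimal, standard-library-only sketch. It is not the xtask's actual code. The names `low_freq_dct`, `phash`, `Entry`, and `resolve_collisions` are hypothetical, and several details are assumptions: the DCT is orthonormal DCT-II, the DC coefficient is included in the median, bits are packed most-significant first, and ties in frequency rank keep the first entry seen.

```rust
use std::collections::BTreeMap;
use std::f64::consts::PI;

const N: usize = 32; // canvas side length in pixels
const K: usize = 8; // side length of the low-frequency block

/// Orthonormal DCT-II scaling factor for frequency index `k`.
fn alpha(k: usize) -> f64 {
    if k == 0 { (1.0 / N as f64).sqrt() } else { (2.0 / N as f64).sqrt() }
}

/// Top-left K x K coefficients of the 2D DCT-II of an N x N row-major image.
fn low_freq_dct(img: &[f64]) -> [f64; K * K] {
    assert_eq!(img.len(), N * N);
    let mut out = [0.0; K * K];
    for u in 0..K {
        for v in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                let cy = ((2 * y + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
                for x in 0..N {
                    let cx = ((2 * x + 1) as f64 * v as f64 * PI / (2.0 * N as f64)).cos();
                    sum += img[y * N + x] * cy * cx;
                }
            }
            out[u * K + v] = alpha(u) * alpha(v) * sum;
        }
    }
    out
}

/// 64-bit hash: bit i (most significant first) is set when coefficient i
/// exceeds the median of all 64 coefficients. Assumption: DC term included.
fn phash(img: &[f64]) -> u64 {
    let coeffs = low_freq_dct(img);
    let mut sorted = coeffs; // arrays of f64 are Copy, so this is a copy
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[31] + sorted[32]) / 2.0;
    coeffs.iter().enumerate().fold(0u64, |h, (i, &c)| {
        if c > median { h | (1u64 << (63 - i)) } else { h }
    })
}

/// Hypothetical in-memory form of one output entry.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    phash: u64,
    ch: char,
    source_font: String,
    frequency_rank: u32,
}

/// Keeps one entry per pHash, preferring the higher frequency rank.
/// Assumption: on equal rank the first entry seen wins.
/// A BTreeMap yields the result sorted by pHash ascending.
fn resolve_collisions(entries: Vec<Entry>) -> Vec<Entry> {
    let mut best: BTreeMap<u64, Entry> = BTreeMap::new();
    for e in entries {
        let replace = match best.get(&e.phash) {
            Some(cur) => e.frequency_rank > cur.frequency_rank,
            None => true,
        };
        if replace {
            best.insert(e.phash, e);
        }
    }
    best.into_values().collect()
}
```

A `phash_hex` string in the style of the output format could then be produced with `format!("{:016x}", phash(&pixels))`, where `pixels` holds the 1024 grayscale values of the centered glyph bitmap. Using an ordered map and a fixed tie-break rule is one way to keep the result independent of hash-map iteration order, which matters for the byte-identical output described under Determinism.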