Add cargo xtask gen-shape-db command that walks font directories, rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs build/glyph-shapes.json.

Implementation details:
- Fontdue integration for TrueType/OpenType font loading
- 32x32 bitmap rasterization with centering
- DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold)
- Character frequency data for collision resolution
- Deduplication by (phash, char) pairs
- Cross-character collision handling (keep higher-frequency char)
- Sorted output by pHash ascending

Artifacts:
- build/frequency.json: Character frequency rankings
- build/README.md: Command documentation and usage

Acceptance criteria:
- ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON
- ✅ Deterministic output (byte-identical on same inputs)
- ✅ Fontdue integration and 32x32 rasterization
- ✅ pHash computation via DCT
- ⚠️ No system fonts for full integration test (documented)

Closes: pdftract-2aq0

# Glyph Shape Database Generation

## Overview

The `cargo xtask gen-shape-db` command generates a perceptual hash (pHash) database
from TrueType/OpenType font files. This database is used for glyph shape recognition
in PDF text extraction.

## Usage

```bash
# From workspace root
cargo xtask gen-shape-db <fonts-dir> [output-path]

# Example
cargo xtask gen-shape-db /path/to/fonts build/glyph-shapes.json
```

## Arguments

- `fonts-dir`: Path to directory containing `.ttf` or `.otf` font files (recursively searched)
- `output-path`: Optional output path (default: `build/glyph-shapes.json`)

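The argument handling and the recursive font search described above could look roughly like the following. This is a minimal sketch using only the standard library. The names `parse_args`, `is_font_file`, and `collect_fonts` are hypothetical and not taken from the xtask source, and both the case-insensitive extension match and the sorted result are assumptions.

```rust
use std::path::{Path, PathBuf};

/// Default output location, taken from the argument description above.
const DEFAULT_OUTPUT: &str = "build/glyph-shapes.json";

/// Hypothetical helper: split the positional arguments (program and
/// subcommand names already removed) into fonts directory and output path.
fn parse_args(args: &[String]) -> Result<(PathBuf, PathBuf), String> {
    match args {
        [fonts] => Ok((PathBuf::from(fonts), PathBuf::from(DEFAULT_OUTPUT))),
        [fonts, output] => Ok((PathBuf::from(fonts), PathBuf::from(output))),
        _ => Err("usage: cargo xtask gen-shape-db <fonts-dir> [output-path]".to_string()),
    }
}

/// True for paths ending in .ttf or .otf (ignoring case is an assumption).
fn is_font_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("ttf") || ext.eq_ignore_ascii_case("otf"))
        .unwrap_or(false)
}

/// Recursively collect font files under `dir`. The result is sorted so that
/// later processing does not depend on directory iteration order.
fn collect_fonts(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![dir.to_path_buf()];
    while let Some(current) = pending.pop() {
        for entry in std::fs::read_dir(&current)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else if is_font_file(&path) {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

Sorting the collected paths is one way to keep the font processing order stable across file systems, which matters for the byte-identical output the command promises.
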
## Font Requirements

Fonts MUST be open-licensed:
- Google Fonts (Apache 2.0 / OFL)
- SIL Open Font License fonts
- Other permissive licenses compatible with PDF extraction

## Output Format

The output is a JSON array of glyph entries:

```json
[
  {
    "phash_hex": "0123456789abcdef",
    "char": "A",
    "source_font": "LiberationSans-Regular.ttf",
    "frequency_rank": 30
  },
  ...
]
```

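The sample `phash_hex` value has 16 hex digits, which matches a 64-bit hash. A minimal sketch of converting between the numeric hash and its hex form, assuming lowercase, zero-padded encoding. The function names are hypothetical and not taken from the xtask source.

```rust
/// Encode a 64-bit pHash as 16 lowercase hex digits, zero-padded on the left.
fn phash_to_hex(hash: u64) -> String {
    format!("{:016x}", hash)
}

/// Decode a `phash_hex` value back into the 64-bit hash.
fn phash_from_hex(hex: &str) -> Result<u64, std::num::ParseIntError> {
    u64::from_str_radix(hex, 16)
}
```

With a fixed width and a single letter case, sorting the hex strings gives the same order as sorting the numeric hashes.
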
## Character Frequency

The command reads `build/frequency.json` for character frequency rankings.
If the file is not found, all characters are assigned rank 0.

Format: `{"A": 30, "B": 47, ...}`, where a higher value means a more common character.

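The fallback to 0 can be pictured as a map lookup with a default. A minimal sketch, assuming the JSON has already been parsed into a map. The `FrequencyMap` alias and both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Frequency values keyed by character, mirroring the shape of
/// `build/frequency.json`. An empty map stands in for a missing file.
type FrequencyMap = HashMap<char, u32>;

/// Value for `ch`, or 0 when the character (or the whole file) is absent.
fn frequency_of(freq: &FrequencyMap, ch: char) -> u32 {
    freq.get(&ch).copied().unwrap_or(0)
}

/// True when `a` is strictly more common than `b` (higher value = more common).
fn more_common(freq: &FrequencyMap, a: char, b: char) -> bool {
    frequency_of(freq, a) > frequency_of(freq, b)
}
```

When the file is missing, every character has the value 0, so no character counts as more common than another.
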
## Suggested Fonts

For comprehensive coverage, use these open-licensed fonts:
- Liberation Sans
- DejaVu Sans
- Source Code Pro
- Noto Sans (covers Latin, Greek, Cyrillic)
- Roboto

## Example Setup

```bash
# Download Google Fonts
git clone https://github.com/google/fonts.git /tmp/fonts

# Generate database
cargo xtask gen-shape-db /tmp/fonts/ofl/liberationsans build/glyph-shapes.json

# Expected: ~5000 glyphs covering Latin, Greek, Cyrillic, symbols
```

## License Attribution

Font license texts should be stored in `build/font-licenses/` with a README.md
documenting the source and license terms for each font used.

## Algorithm

1. Load each font file using fontdue
2. For each Unicode codepoint (U+0000 to U+FFFF, the Basic Multilingual Plane):
   - Check if the font has a glyph for the character
   - Rasterize at 32x32 pixels
   - Center the bitmap on a 32x32 canvas
   - Compute pHash via 32x32 DCT → 8x8 low-freq coefficients → median threshold
3. Deduplicate by (pHash, char) pairs
4. Handle cross-character collisions by keeping the higher-frequency character
5. Sort by pHash ascending and output JSON

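The centering and hashing steps can be sketched with the standard library alone. fontdue's `Font::rasterize` returns glyph metrics (including `width` and `height`) together with a row-major coverage bitmap, which is the input shape assumed here. Several details below are assumptions rather than the command's confirmed behavior: the orthonormal DCT-II scaling, taking the median over all 64 coefficients (DC term included), and packing bits row by row with the first coefficient in the most significant bit. The function names are hypothetical.

```rust
use std::f64::consts::PI;

const N: usize = 32; // canvas side length
const K: usize = 8; // side length of the low-frequency block

/// Paste a `w` x `h` coverage bitmap (row-major, one byte per pixel) into the
/// middle of an N x N canvas. Pixels that would fall outside are dropped.
fn center_on_canvas(bitmap: &[u8], w: usize, h: usize) -> [[f64; N]; N] {
    let mut canvas = [[0.0; N]; N];
    let dx = (N as isize - w as isize) / 2;
    let dy = (N as isize - h as isize) / 2;
    for y in 0..h {
        for x in 0..w {
            let cx = x as isize + dx;
            let cy = y as isize + dy;
            if (0..N as isize).contains(&cx) && (0..N as isize).contains(&cy) {
                canvas[cy as usize][cx as usize] = bitmap[y * w + x] as f64;
            }
        }
    }
    canvas
}

/// 64-bit pHash: 2D DCT-II of the canvas, keep the K x K low-frequency
/// corner, then set one bit per coefficient that exceeds the median.
fn phash(canvas: &[[f64; N]; N]) -> u64 {
    // basis[k][n]: orthonormal DCT-II basis for frequency k at sample n.
    let mut basis = [[0.0f64; N]; K];
    for k in 0..K {
        let scale = if k == 0 { (1.0 / N as f64).sqrt() } else { (2.0 / N as f64).sqrt() };
        for n in 0..N {
            basis[k][n] = scale * (PI * (2 * n + 1) as f64 * k as f64 / (2 * N) as f64).cos();
        }
    }
    // Row pass: transform each row, keeping only the first K frequencies.
    let mut rows = [[0.0f64; K]; N];
    for y in 0..N {
        for u in 0..K {
            rows[y][u] = (0..N).map(|x| canvas[y][x] * basis[u][x]).sum();
        }
    }
    // Column pass: transform each kept column, again keeping K frequencies.
    let mut coeffs = [0.0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            coeffs[v * K + u] = (0..N).map(|y| rows[y][u] * basis[v][y]).sum();
        }
    }
    // Median of an even count: mean of the two middle values (assumption).
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[K * K / 2 - 1] + sorted[K * K / 2]) / 2.0;
    // Pack bits row by row, first coefficient in the most significant bit.
    coeffs.iter().fold(0u64, |hash, &c| (hash << 1) | (c > median) as u64)
}
```

Because the DCT is separable, the sketch transforms rows first and columns second and only ever computes the 8 lowest frequencies in each direction, rather than a full 32x32 transform that is then cropped.
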
## Determinism

The output is byte-identical when re-run on the same input fonts and frequency data.

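One way to get a stable result from the deduplication, collision, and sorting steps is to key the entries by pHash in an ordered map and pick winners with a total order, so that the input order never affects the output. This is a sketch under assumptions: the `GlyphEntry` struct, `resolve`, and `beats` are hypothetical names, and the tie-breaks (smaller character, then smaller font name) are not specified by the command's documentation.

```rust
use std::cmp::Reverse;
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct GlyphEntry {
    phash: u64,
    ch: char,
    source_font: String,
    frequency_rank: u32,
}

/// True when `new` should replace `kept` for the same pHash. The higher
/// frequency value wins. Ties fall back to the smaller character and then
/// the smaller font name (both tie-breaks are assumptions).
fn beats(new: &GlyphEntry, kept: &GlyphEntry) -> bool {
    (Reverse(new.frequency_rank), new.ch, &new.source_font)
        < (Reverse(kept.frequency_rank), kept.ch, &kept.source_font)
}

/// Keep one entry per pHash and return the survivors in ascending pHash
/// order. Keying by pHash also collapses repeated (pHash, char) pairs.
fn resolve(entries: Vec<GlyphEntry>) -> Vec<GlyphEntry> {
    let mut by_hash: BTreeMap<u64, GlyphEntry> = BTreeMap::new();
    for entry in entries {
        let replace = by_hash
            .get(&entry.phash)
            .map_or(true, |kept| beats(&entry, kept));
        if replace {
            by_hash.insert(entry.phash, entry);
        }
    }
    // BTreeMap iterates in key order, so the result is sorted by pHash.
    by_hash.into_values().collect()
}
```

Feeding the same entries in a different order yields the same vector, which is the property a byte-identical JSON file depends on.
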