pdftract/build/README.md
jedarden f08369bbf0 feat(xtask): implement gen-shape-db subcommand for glyph pHash database
Add cargo xtask gen-shape-db command that walks font directories,
rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs
build/glyph-shapes.json.

Implementation details:
- Fontdue integration for TrueType/OpenType font loading
- 32x32 bitmap rasterization with centering
- DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold)
- Character frequency data for collision resolution
- Deduplication by (phash, char) pairs
- Cross-character collision handling (keep higher-frequency char)
- Sorted output by pHash ascending

Artifacts:
- build/frequency.json: Character frequency rankings
- build/README.md: Command documentation and usage

Acceptance criteria:
- cargo xtask gen-shape-db --fonts <dir> produces valid JSON
- Deterministic output (byte-identical on same inputs)
- Fontdue integration and 32x32 rasterization
- pHash computation via DCT
- ⚠️ No system fonts for full integration test (documented)

Closes: pdftract-2aq0
2026-05-24 05:40:44 -04:00


# Glyph Shape Database Generation
## Overview
The `cargo xtask gen-shape-db` command generates a perceptual hash (pHash) database
from TrueType/OpenType font files. This database is used for glyph shape recognition
in PDF text extraction.
## Usage
```bash
# From workspace root
cargo xtask gen-shape-db <fonts-dir> [output-path]
# Example
cargo xtask gen-shape-db /path/to/fonts build/glyph-shapes.json
```
## Arguments
- `fonts-dir`: Path to a directory containing `.ttf` or `.otf` font files (searched recursively)
- `output-path`: Optional output path (default: `build/glyph-shapes.json`)
## Font Requirements
Fonts MUST be open-licensed:
- Google Fonts (Apache 2.0 / OFL)
- SIL Open Font License fonts
- Other permissive licenses compatible with PDF extraction
## Output Format
The output is a JSON array of glyph entries:
```json
[
  {
    "phash_hex": "0123456789abcdef",
    "char": "A",
    "source_font": "LiberationSans-Regular.ttf",
    "frequency_rank": 30
  },
  ...
]
```
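As a minimal sketch of how a consumer might handle the `phash_hex` field: the 8x8 thresholded block described under Algorithm yields 64 bits, which matches the 16 hex digits in the example above. The helper names `parse_phash` and `format_phash` are hypothetical, and lowercase zero-padded hex is assumed from the sample entry rather than confirmed by the implementation.

```rust
/// Parse a 16-digit `phash_hex` value into a 64-bit hash.
/// Hypothetical helper; not part of the xtask itself.
fn parse_phash(hex: &str) -> Option<u64> {
    // Require exactly 16 hex digits so that short values or a leading
    // sign (which `from_str_radix` would accept) are rejected.
    if hex.len() != 16 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u64::from_str_radix(hex, 16).ok()
}

/// Format a 64-bit hash as zero-padded lowercase hex
/// (assumed to match the style of the sample entry).
fn format_phash(hash: u64) -> String {
    format!("{:016x}", hash)
}
```

Keeping the hash as a `u64` in memory makes sorting and equality checks cheap; the hex form is only needed at the JSON boundary.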
## Character Frequency
The command reads `build/frequency.json` for character frequency rankings.
If the file is not found, all characters are assigned rank 0.
Format: a JSON object mapping each character to its rank, e.g. `{"A": 30, "B": 47, ...}`, where a higher value means a more common character.
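The rank lookup and its use in collision handling can be sketched as follows. This is an illustration only: `rank_of` and `resolve_collision` are hypothetical names, falling back to rank 0 for a character missing from the map is an assumption (the text above only specifies rank 0 when the whole file is missing), and the lower-codepoint tie-break for equal ranks is also an assumption.

```rust
use std::collections::HashMap;

/// Look up a character's frequency rank.
/// Assumption: characters absent from the map get rank 0.
fn rank_of(freq: &HashMap<char, u32>, c: char) -> u32 {
    freq.get(&c).copied().unwrap_or(0)
}

/// Pick which of two characters keeps a shared pHash:
/// the one with the higher rank (higher value = more common).
/// Assumption: on equal ranks, the lower codepoint wins so that
/// the choice does not depend on input order.
fn resolve_collision(freq: &HashMap<char, u32>, a: char, b: char) -> char {
    let (ra, rb) = (rank_of(freq, a), rank_of(freq, b));
    if ra > rb || (ra == rb && a <= b) {
        a
    } else {
        b
    }
}
```

With the sample data `{"A": 30, "B": 47}`, a collision between `A` and `B` would keep `B`. Some explicit tie-break matters when `frequency.json` is absent, because every character then has rank 0.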
## Suggested Fonts
For comprehensive coverage, use these open-licensed fonts:
- Liberation Sans
- DejaVu Sans
- Source Code Pro
- Noto Sans (covers Latin, Greek, Cyrillic)
- Roboto
## Example Setup
```bash
# Download Google Fonts
git clone https://github.com/google/fonts.git /tmp/fonts
# Generate database
cargo xtask gen-shape-db /tmp/fonts/ofl/liberationsans build/glyph-shapes.json
# Expected: ~5000 glyphs covering Latin, Greek, Cyrillic, symbols
```
## License Attribution
Font license texts should be stored in `build/font-licenses/` with a README.md
documenting the source and license terms for each font used.
## Algorithm
1. Load each font file using fontdue
2. For each Unicode codepoint (0x0000-0xFFFF):
   - Check if the font has a glyph for the character
   - Rasterize at 32x32 pixels
   - Center the bitmap on a 32x32 canvas
   - Compute pHash via 32x32 DCT → 8x8 low-freq coefficients → median threshold
3. Deduplicate by (pHash, char) pairs
4. Resolve cross-character collisions (different characters with the same pHash) by keeping the higher-frequency character
5. Sort by pHash ascending and output JSON
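The pHash step (32x32 DCT → 8x8 low-frequency block → median threshold) can be sketched as follows, using only the standard library. This is not the xtask's actual code. The function names, the unnormalized DCT-II, the inclusion of the DC coefficient in the median, and the most-significant-bit-first packing are all assumptions, so hashes from this sketch may not match `build/glyph-shapes.json`.

```rust
use std::f64::consts::PI;

const N: usize = 32; // raster size in pixels
const K: usize = 8; // side of the low-frequency block that is kept

/// Top-left K x K block of the 2-D DCT-II of an N x N bitmap.
/// Assumption: scale factors are omitted (unnormalized transform).
fn dct_low_freq(pixels: &[[f64; N]; N]) -> [f64; K * K] {
    let mut out = [0.0; K * K];
    for u in 0..K {
        for v in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    let cy = ((2 * y + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
                    let cx = ((2 * x + 1) as f64 * v as f64 * PI / (2.0 * N as f64)).cos();
                    sum += pixels[y][x] * cy * cx;
                }
            }
            out[u * K + v] = sum;
        }
    }
    out
}

/// 64-bit perceptual hash: one bit per coefficient, set when the
/// coefficient is strictly greater than the median of all 64.
fn phash(pixels: &[[f64; N]; N]) -> u64 {
    let coeffs = dct_low_freq(pixels);
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Median of an even-length list: mean of the two middle values.
    let median = (sorted[K * K / 2 - 1] + sorted[K * K / 2]) / 2.0;
    let mut hash = 0u64;
    for (i, &c) in coeffs.iter().enumerate() {
        if c > median {
            // Assumption: coefficient 0 maps to the most significant bit.
            hash |= 1u64 << (63 - i);
        }
    }
    hash
}
```

The direct four-loop DCT is slow but easy to check against the formula; a separable or FFT-based transform would give the same coefficients faster. Because the threshold is the median, at most half of the 64 bits can be set.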
## Determinism
The output is byte-identical when re-run on the same input fonts and frequency data.
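One way to get order-independent output is to put entries into a canonical order before serializing. The sketch below shows the idea. The `GlyphEntry` struct and `canonicalize` function are hypothetical and mirror the JSON fields above. Breaking pHash ties by character and then by font name is an assumption, as is keeping the first font name for duplicate (pHash, char) pairs.

```rust
/// Hypothetical in-memory form of one output entry.
/// Field order matters: the derived `Ord` compares fields top to bottom,
/// so entries sort by pHash first, then char, then font name.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct GlyphEntry {
    phash: u64,
    ch: char,
    source_font: String,
    frequency_rank: u32,
}

/// Sort entries into a total order and drop duplicate (pHash, char)
/// pairs, so the result does not depend on directory walk order.
fn canonicalize(mut entries: Vec<GlyphEntry>) -> Vec<GlyphEntry> {
    entries.sort();
    // After sorting, duplicates are adjacent; the first of each run
    // (the smallest font name) is the one that is kept.
    entries.dedup_by(|later, earlier| later.phash == earlier.phash && later.ch == earlier.ch);
    entries
}
```

If the serializer then writes fields in a fixed order, two runs over the same fonts and frequency data should produce the same bytes regardless of the order in which the filesystem returns files.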