pdftract/build/frequency.json
jedarden f08369bbf0 feat(xtask): implement gen-shape-db subcommand for glyph pHash database
Add cargo xtask gen-shape-db command that walks font directories,
rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs
build/glyph-shapes.json.

Implementation details:
- Fontdue integration for TrueType/OpenType font loading
- 32x32 bitmap rasterization with centering
- DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold)
- Character frequency data for collision resolution
- Deduplication by (phash, char) pairs
- Cross-character collision handling (keep higher-frequency char)
- Sorted output by pHash ascending
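
The pHash step listed above (32x32 DCT → 8x8 low-freq → median threshold) can be sketched roughly as follows. This is a minimal illustration and not the code in this commit: the names `dct_low_freq` and `phash`, the unnormalized DCT-II, the inclusion of the DC term, and the bit order are all assumptions.

```rust
use std::f64::consts::PI;

const N: usize = 32; // raster size named in the commit message
const K: usize = 8; // side of the low-frequency block that is kept

/// Naive, unnormalized 2D DCT-II, computing only the top-left K x K coefficients.
/// (Hypothetical helper; a real implementation may normalize or use a fast DCT.)
fn dct_low_freq(pixels: &[[f64; N]; N]) -> [[f64; K]; K] {
    let mut out = [[0.0; K]; K];
    for u in 0..K {
        for v in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    let cy = ((2 * y + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
                    let cx = ((2 * x + 1) as f64 * v as f64 * PI / (2.0 * N as f64)).cos();
                    sum += pixels[y][x] * cy * cx;
                }
            }
            out[u][v] = sum;
        }
    }
    out
}

/// 64-bit perceptual hash: bit i is set when the i-th low-frequency
/// coefficient (row-major) is strictly greater than the median of all 64.
fn phash(pixels: &[[f64; N]; N]) -> u64 {
    let coeffs = dct_low_freq(pixels);
    let flat: Vec<f64> = coeffs.iter().flatten().copied().collect();

    // Median of 64 values: mean of the two middle elements after sorting.
    let mut sorted = flat.clone();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[31] + sorted[32]) / 2.0;

    let mut hash = 0u64;
    for (i, c) in flat.iter().enumerate() {
        if *c > median {
            hash |= 1u64 << i;
        }
    }
    hash
}
```

The input here is a plain `[[f64; N]; N]` grid so the sketch stays dependency-free; in the xtask the pixel values would presumably come from the fontdue rasterizer after centering.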

Artifacts:
- build/frequency.json: Character frequency rankings
- build/README.md: Command documentation and usage
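
As a rough sketch of how the rankings in build/frequency.json could drive collision resolution: when two characters share a pHash, keep the one with the lower rank number (rank 1 being the most frequent). The function name `resolve_collisions`, the `(u64, char)` entry shape, the treatment of unranked characters, and the code-point tie-break are assumptions made for illustration, not taken from the commit.

```rust
use std::collections::{BTreeMap, HashMap};

/// Collapse (phash, char) entries to one char per pHash, preferring the
/// character with the lower frequency rank. Output is sorted by pHash ascending.
fn resolve_collisions(entries: &[(u64, char)], rank: &HashMap<char, u32>) -> Vec<(u64, char)> {
    // Unranked characters sort last; the char itself breaks ties so the
    // result does not depend on input order (assumed tie-break rule).
    let key = |c: char| (rank.get(&c).copied().unwrap_or(u32::MAX), c);

    // BTreeMap keeps keys ordered, which gives the ascending pHash order for free.
    let mut best: BTreeMap<u64, char> = BTreeMap::new();
    for &(hash, ch) in entries {
        best.entry(hash)
            .and_modify(|cur| {
                if key(ch) < key(*cur) {
                    *cur = ch;
                }
            })
            .or_insert(ch);
    }
    best.into_iter().collect()
}
```

Exact duplicate `(phash, char)` pairs collapse naturally, since re-inserting the same character never changes the stored winner.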

Acceptance criteria:
- ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON
- ✅ Deterministic output (byte-identical on same inputs)
- ✅ Fontdue integration and 32x32 rasterization
- ✅ pHash computation via DCT
- ⚠️ Full integration test not run: no system fonts available (documented)

Closes: pdftract-2aq0
2026-05-24 05:40:44 -04:00

99 lines
1 KiB
JSON

{
" ": 1,
"e": 2,
"t": 3,
"a": 4,
"o": 5,
"i": 6,
"n": 7,
"s": 8,
"h": 9,
"r": 10,
"d": 11,
"l": 12,
"c": 13,
"u": 14,
"m": 15,
"w": 16,
"f": 17,
"g": 18,
"y": 19,
"p": 20,
"b": 21,
"v": 22,
"k": 23,
"j": 24,
"x": 25,
"q": 26,
"z": 27,
"E": 28,
"T": 29,
"A": 30,
"O": 31,
"I": 32,
"N": 33,
"S": 34,
"H": 35,
"R": 36,
"D": 37,
"L": 38,
"C": 39,
"U": 40,
"M": 41,
"W": 42,
"F": 43,
"G": 44,
"Y": 45,
"P": 46,
"B": 47,
"V": 48,
"K": 49,
"J": 50,
"X": 51,
"Q": 52,
"Z": 53,
"0": 54,
"1": 55,
"2": 56,
"3": 57,
"4": 58,
"5": 59,
"6": 60,
"7": 61,
"8": 62,
"9": 63,
".": 64,
",": 65,
";": 66,
":": 67,
"?": 68,
"!": 69,
"-": 70,
"(": 71,
")": 72,
"[": 73,
"]": 74,
"{": 75,
"}": 76,
"'": 77,
"\"": 78,
"/": 79,
"\\": 80,
"@": 81,
"#": 82,
"$": 83,
"%": 84,
"^": 85,
"&": 86,
"*": 87,
"+": 88,
"=": 89,
"_": 90,
"|": 91,
"~": 92,
"`": 93,
"<": 94,
">": 95,
"\n": 96,
"\t": 97
}