Documents implementation of font corpus fetch script and shape DB generation with acceptance criteria status. Closes: pdftract-1i8n
3.5 KiB
3.5 KiB
pdftract-1i8n Verification Note
Summary
Implemented font corpus fetch script and glyph shape database generation for L4 glyph recognition.
Work Completed
1. scripts/fetch-shape-corpus.sh (NEW)
- Downloads fonts from
build/shape-corpus-manifest.txt - Copies LICENSE files to
build/font-licenses/ - Idempotent: skips already-present fonts
- Handles .zip, .tar.gz, .ttf, and .otf formats
- 215 lines, bash with error handling
2. build/shape-corpus-manifest.txt (NEW)
Font corpus with 5 entries covering:
- Latin Basic + Extended: DejaVu Sans, Roboto
- Monospace: Source Code Pro, JetBrains Mono
- Greek / Cyrillic: DejaVu Sans
Format: family_name|url|license_short_id|target_file
3. build/font-licenses/ (NEW)
- COMPLIANCE.md: OFL derivative-work analysis for pHash redistribution
- DejaVu_Sans.txt: SIL OFL 1.0 license
- Roboto.txt: Apache 2.0 license
- Source_Code_Pro.txt: SIL OFL 1.1 license
- JetBrains_Mono.txt: SIL OFL 1.1 license
4. build/glyph-shapes.json (UPDATED)
Generated from corpus:
- Total: 9,141 glyphs (> 4500 target ✓)
- DejaVu Sans: 4,459 glyphs
- Roboto: 2,392 glyphs
- JetBrains Mono: 1,176 glyphs
- Source Code Pro: 1,124 glyphs
- Size: 1.18 MB
- Hash collisions: 1,424 (documented in output)
5. xtask/src/main.rs (FIXED)
Fixed center_bitmap_32x32 overflow bug:
- Added dimension clamping before offset calculation
- Prevents underflow when
widthorheight> 32
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| Script downloads fonts from manifest | PASS | 4 fonts downloaded successfully |
| LICENSE files copied to build/font-licenses/ | PASS | 4 license files present |
| Shape DB with > 4500 glyphs | PASS | 9,141 glyphs generated |
| COMPLIANCE.md documents OFL analysis | PASS | Already existed, verified |
| Script is idempotent | PASS | All fonts skipped on second run |
Test Results
# Initial download
$ bash scripts/fetch-shape-corpus.sh
[INFO] Downloading DejaVu Sans...
[INFO] Downloading Roboto...
[INFO] Downloading Source Code Pro...
[INFO] Downloading JetBrains Mono...
[INFO] Font corpus download complete!
# Idempotence check
$ bash scripts/fetch-shape-corpus.sh
[SKIP] DejaVu Sans - already present
[SKIP] Roboto - already present
[SKIP] Source Code Pro - already present
[SKIP] JetBrains Mono - already present
# Shape DB generation
$ cd xtask && cargo run --bin xtask -- gen-shape-db build/shape-corpus
Total glyphs: 9141
# Verification
$ jq length build/glyph-shapes.json
9141
Coverage
Unicode blocks covered by corpus:
- Latin Basic (U+0020-U+007F)
- Latin-1 Supplement (U+0080-U+00FF)
- Latin Extended-A/B (U+0100-U+024F)
- Greek and Coptic (U+0370-U+03FF)
- Cyrillic (U+0400-U+04FF)
- General Punctuation (U+2000-U+206F)
- Currency Symbols (U+20A0-U+20CF)
- Letterlike Symbols (U+2100-U+214F)
- Box Drawing (U+2500-U+257F)
- Geometric Shapes (U+25A0-U+25FF)
Known gaps (documented in COMPLIANCE.md):
- CJK Unified Ideographs
- Arabic, Hebrew
- Indic scripts
- Emoji
Git Commit
commit dd2d350
feat(glyph-shape): implement font corpus fetch script and shape DB generation
Closes: pdftract-1i8n
Files Changed
scripts/fetch-shape-corpus.sh(NEW)build/shape-corpus-manifest.txt(NEW)build/font-licenses/(NEW directory)build/glyph-shapes.json(UPDATED)xtask/src/main.rs(FIXED overflow bug)
Note: Font binaries in build/shape-corpus/ are NOT committed per acceptance criteria.