Implemented scripts/fetch-shape-corpus.sh for downloading open-licensed font corpus and generating glyph shape database for L4 recognition.

- Script downloads fonts from build/shape-corpus-manifest.txt
- Copies LICENSE files to build/font-licenses/ for compliance
- Idempotent: skips already-present fonts
- Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32)

Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target):

- DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic)
- Roboto: 2,392 glyphs (Latin Basic, extended)
- JetBrains Mono: 1,176 glyphs (monospace)
- Source Code Pro: 1,124 glyphs (monospace)

build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis for pHash data redistribution.

Closes: pdftract-1i8n
Font License Compliance for Glyph Shape Database
Overview
The glyph shape database (build/glyph-shapes.json) is generated from open-licensed font files. This document explains the legal basis for redistributing the derived shape data and documents compliance with each font's license terms.
OFL Derivative Work Analysis
What We Derive
For each font in the corpus, we:
- Rasterize glyph outlines to 32×32 grayscale bitmaps using fontdue
- Compute a perceptual hash (pHash) via DCT → 8×8 low-frequency coefficients → median threshold
- Store only the 64-bit hash value and character association
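The hashing step above can be sketched with the standard library alone. This is a minimal illustration, not the xtask implementation, which this document does not show: the function names `dct_low_freq` and `phash`, the unnormalized DCT-II, the row-major bit order, and the strict greater-than comparison against the median are all assumptions.

```rust
use std::f64::consts::PI;

const N: usize = 32; // side length of the rasterized glyph bitmap
const K: usize = 8; // side length of the low-frequency block that is kept

/// Compute the K x K lowest-frequency coefficients of an unnormalized
/// 2-D DCT-II over an N x N grayscale bitmap stored in row-major order.
/// (Assumption: the real code may normalize or drop the DC term.)
fn dct_low_freq(pixels: &[u8; N * N]) -> [f64; K * K] {
    // basis[u][x] = cos((2x + 1) * u * pi / (2N))
    let mut basis = [[0.0f64; N]; K];
    for (u, row) in basis.iter_mut().enumerate() {
        for (x, cell) in row.iter_mut().enumerate() {
            *cell = ((2 * x + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
        }
    }

    let mut out = [0.0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    sum += f64::from(pixels[y * N + x]) * basis[u][x] * basis[v][y];
                }
            }
            out[v * K + u] = sum;
        }
    }
    out
}

/// Reduce a bitmap to a 64-bit hash: bit i is set when coefficient i
/// is strictly greater than the median of all 64 coefficients.
fn phash(pixels: &[u8; N * N]) -> u64 {
    let coeffs = dct_low_freq(pixels);

    // Median of an even-length list: mean of the two middle values.
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[K * K / 2 - 1] + sorted[K * K / 2]) / 2.0;

    coeffs.iter().enumerate().fold(0u64, |hash, (i, &c)| {
        if c > median {
            hash | (1u64 << i)
        } else {
            hash
        }
    })
}
```

Calling `phash(&bitmap)` on a `[u8; 1024]` buffer yields the single `u64` stored per glyph. Because the comparison is strict, at most 32 of the 64 bits can be set, and an all-black bitmap hashes to 0.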
We do NOT distribute:
- The original font files (.ttf, .otf)
- The rasterized glyph bitmaps
- Any vector outline data
- Any hinting instructions or metadata
Legal Basis: SIL Open Font License 1.1
The OFL permits derivative works under specific conditions:
"Permission is hereby granted... to use, study, copy, merge, embed, modify, redistribute, and sell modified and unmodified copies of the Font Software..."
Key Clauses
- Derivative Works (Clauses 2 and 5): The OFL explicitly allows modified versions, provided that each copy carries the original copyright notice and license (Clause 2) and that the Font Software is distributed entirely under the OFL (Clause 5).
- "Reserved Font Name" (Clause 3): We do not use any reserved font names in our distributed data. The shape database uses only neutral identifiers (pHash values, character codes).
- Embedding (Clause 5): Embedding usually refers to including fonts in documents, and Clause 5 exempts documents created using the Font Software from the same-license requirement. Our use case is analogous: we embed derived data (pHash values) into our binary, not the font software itself.
Why pHash Data is a Compliant Derivative Work
- Transformative Nature: The pHash is a lossy mathematical transformation of the rasterized glyph that discards almost all visual detail. The original font cannot be reconstructed from the pHash.
- Minimal Extraction: We extract only 64 bits per glyph, a very coarse "fingerprint", compared with the full outline data the font stores for each glyph.
- No Redistribution: The original font binaries are never distributed with pdftract; users must download them separately using scripts/fetch-shape-corpus.sh.
Apache License 2.0 (Roboto)
Roboto uses the Apache License 2.0, which explicitly permits:
"You may reproduce and distribute copies of the Work or Derivative Works thereof..."
Apache 2.0 has fewer restrictions than OFL: no "Reserved Font Name" clause, no requirement to rename derivative works. Our pHash extraction is clearly within the scope of permitted derivative works.
Corpus Coverage
Unicode Blocks Covered
| Block | Range | Source Fonts |
|---|---|---|
| Basic Latin | U+0020-U+007F | All fonts |
| Latin-1 Supplement | U+0080-U+00FF | Liberation Sans, DejaVu Sans, Noto Sans |
| Latin Extended-A | U+0100-U+017F | DejaVu Sans, Noto Sans |
| Latin Extended-B | U+0180-U+024F | DejaVu Sans, Noto Sans |
| Greek and Coptic | U+0370-U+03FF | DejaVu Sans, Noto Sans |
| Cyrillic | U+0400-U+04FF | DejaVu Sans, Noto Sans |
| Cyrillic Supplement | U+0500-U+052F | DejaVu Sans, Noto Sans |
| General Punctuation | U+2000-U+206F | All fonts |
| Superscripts/Subscripts | U+2070-U+209F | DejaVu Sans, Noto Sans |
| Currency Symbols | U+20A0-U+20CF | DejaVu Sans, Noto Sans |
| Letterlike Symbols | U+2100-U+214F | DejaVu Sans, Noto Sans |
| Arrows | U+2190-U+21FF | Noto Sans Symbols, DejaVu Sans |
| Mathematical Operators | U+2200-U+22FF | STIX Two Math, Latin Modern Math |
| Miscellaneous Technical | U+2300-U+23FF | DejaVu Sans, Noto Sans |
| Box Drawing | U+2500-U+257F | DejaVu Sans, Noto Sans |
| Geometric Shapes | U+25A0-U+25FF | DejaVu Sans, Noto Sans |
| Miscellaneous Symbols | U+2600-U+26FF | Noto Sans Symbols |
| Dingbats | U+2700-U+27BF | Noto Sans Symbols |
| Mathematical Alphanumeric Symbols | U+1D400-U+1D7FF | STIX Two Math, Latin Modern Math |
Known Coverage Gaps
The following blocks are NOT covered by the current corpus:
- CJK Unified Ideographs (U+4E00-U+9FFF): Would require CJK fonts; ~20K+ characters
- Arabic (U+0600-U+06FF): Would require Arabic-specific fonts with proper ligature support
- Hebrew (U+0590-U+05FF): Would require Hebrew-specific fonts
- Indic Scripts (Devanagari, Bengali, etc.): Each requires specialized fonts
- Emoji (U+1F000-U+1FFFF): Would require emoji-specific fonts (Noto Color Emoji)
These gaps represent known limitations: L4 glyph recognition will always fail (return U+FFFD) for characters in these blocks.
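To make the coverage boundary concrete, the sketch below encodes the ranges from the table as data and checks whether a code point falls inside any of them. The names `COVERED`, `in_corpus_blocks`, and `best_case_l4_output` are hypothetical and are not part of pdftract. Block membership is only a necessary condition: a font may still lack individual glyphs inside a covered block.

```rust
/// Inclusive code-point ranges copied from the "Unicode Blocks Covered" table.
const COVERED: &[(u32, u32)] = &[
    (0x0020, 0x007F),   // Basic Latin (printable part)
    (0x0080, 0x00FF),   // Latin-1 Supplement
    (0x0100, 0x017F),   // Latin Extended-A
    (0x0180, 0x024F),   // Latin Extended-B
    (0x0370, 0x03FF),   // Greek and Coptic
    (0x0400, 0x04FF),   // Cyrillic
    (0x0500, 0x052F),   // Cyrillic Supplement
    (0x2000, 0x206F),   // General Punctuation
    (0x2070, 0x209F),   // Superscripts/Subscripts
    (0x20A0, 0x20CF),   // Currency Symbols
    (0x2100, 0x214F),   // Letterlike Symbols
    (0x2190, 0x21FF),   // Arrows
    (0x2200, 0x22FF),   // Mathematical Operators
    (0x2300, 0x23FF),   // Miscellaneous Technical
    (0x2500, 0x257F),   // Box Drawing
    (0x25A0, 0x25FF),   // Geometric Shapes
    (0x2600, 0x26FF),   // Miscellaneous Symbols
    (0x2700, 0x27BF),   // Dingbats
    (0x1D400, 0x1D7FF), // Mathematical Alphanumeric Symbols
];

/// True when `c` lies in one of the blocks the corpus targets.
fn in_corpus_blocks(c: char) -> bool {
    let cp = c as u32;
    COVERED.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}

/// Hypothetical helper for test expectations: the best output L4 could
/// produce for a ground-truth character, U+FFFD when its block is uncovered.
fn best_case_l4_output(truth: char) -> char {
    if in_corpus_blocks(truth) {
        truth
    } else {
        char::REPLACEMENT_CHARACTER
    }
}
```

A helper like this could be used to exclude uncovered characters from accuracy measurements, so that expected failures on CJK or Hebrew text are not counted as regressions.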
Attribution
For each font in the corpus, the full license text is stored in build/font-licenses/<family_slug>.txt. The build/shape-corpus-manifest.txt file documents:
- Font family name
- Download URL (source repository)
- License identifier (e.g., OFL-1.1, Apache-2.0)
- Target filename
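As an illustration of how these four fields map to the license path, here is a small parser sketch. The on-disk format of build/shape-corpus-manifest.txt is not specified in this document, so the pipe-separated layout, the `#` comment convention, the slug rule, and all names below (`ManifestEntry`, `parse_manifest_line`, `family_slug`, `license_path`) are assumptions made for the example.

```rust
/// One manifest record with the four documented fields.
#[derive(Debug, PartialEq)]
struct ManifestEntry {
    family: String,
    url: String,
    license: String,
    filename: String,
}

/// Parse a line of the assumed form `family | url | license | filename`.
/// Returns None for blank lines, `#` comments, and malformed records.
fn parse_manifest_line(line: &str) -> Option<ManifestEntry> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let fields: Vec<&str> = line.split('|').map(str::trim).collect();
    if fields.len() != 4 || fields.iter().any(|f| f.is_empty()) {
        return None;
    }
    Some(ManifestEntry {
        family: fields[0].to_string(),
        url: fields[1].to_string(),
        license: fields[2].to_string(),
        filename: fields[3].to_string(),
    })
}

/// Assumed slug rule: lowercase ASCII letters and digits, all else becomes '-'.
fn family_slug(family: &str) -> String {
    family
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_lowercase()
            } else {
                '-'
            }
        })
        .collect()
}

/// Location of the stored license text for an entry.
fn license_path(entry: &ManifestEntry) -> String {
    format!("build/font-licenses/{}.txt", family_slug(&entry.family))
}
```

With these assumptions, a line such as `Source Code Pro | https://example.invalid/SourceCodePro-Regular.ttf | OFL-1.1 | SourceCodePro-Regular.ttf` (placeholder URL and filename) would map to build/font-licenses/source-code-pro.txt.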
Regenerating the Database
To regenerate glyph-shapes.json from the corpus:
    # 1. Download the font corpus
    bash scripts/fetch-shape-corpus.sh

    # 2. Generate the shape database
    cargo xtask gen-shape-db build/shape-corpus/ build/glyph-shapes.json

    # 3. Verify coverage (> 4500 glyphs expected)
    jq length build/glyph-shapes.json
References
- SIL Open Font License 1.1
- Apache License 2.0
- fontdue - Font rasterization library
- jiwer - Word Error Rate (WER) measurement (for OCR validation)