pdftract/build/font-licenses/COMPLIANCE.md
jedarden dd2d3502c6 feat(glyph-shape): implement font corpus fetch script and shape DB generation
Implemented scripts/fetch-shape-corpus.sh for downloading open-licensed
font corpus and generating glyph shape database for L4 recognition.

- Script downloads fonts from build/shape-corpus-manifest.txt
- Copies LICENSE files to build/font-licenses/ for compliance
- Idempotent: skips already-present fonts
- Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32)

Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target):
  - DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic)
  - Roboto: 2,392 glyphs (Latin Basic, extended)
  - JetBrains Mono: 1,176 glyphs (monospace)
  - Source Code Pro: 1,124 glyphs (monospace)

build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis
for pHash data redistribution.

Closes: pdftract-1i8n
2026-05-24 09:48:29 -04:00


Font License Compliance for Glyph Shape Database

Overview

The glyph shape database (build/glyph-shapes.json) is generated from open-licensed font files. This document explains the legal basis for redistributing the derived shape data and documents compliance with each font's license terms.

OFL Derivative Work Analysis

What We Derive

For each font in the corpus, we:

  1. Rasterize glyph outlines to 32×32 grayscale bitmaps using fontdue
  2. Compute a perceptual hash (pHash) via DCT → 8×8 low-frequency coefficients → median threshold
  3. Store only the 64-bit hash value and character association
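
The three steps above can be sketched in a few lines of Python. This is only an illustration of the DCT → 8×8 → median-threshold idea, not the project's implementation (which runs in Rust on fontdue output). The unnormalized DCT-II, the inclusion of the DC coefficient, and the bit order are all assumptions made for this sketch.

```python
import math
from statistics import median

N = 32  # bitmap side length, from the steps above
K = 8   # side length of the low-frequency block, from the steps above

def dct_low_freq(bitmap):
    """Return the K x K lowest-frequency 2D DCT-II coefficients of an N x N bitmap."""
    # basis[u][x] = cos((2x + 1) * u * pi / (2N)); scaling factors are omitted (assumption).
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
             for u in range(K)]
    # Horizontal pass: project every row onto the first K cosine basis functions.
    rows = [[sum(row[x] * basis[v][x] for x in range(N)) for v in range(K)]
            for row in bitmap]
    # Vertical pass: project every column of the intermediate result.
    return [[sum(rows[y][v] * basis[u][y] for y in range(N)) for v in range(K)]
            for u in range(K)]

def phash64(bitmap):
    """Pack 'coefficient > median' into a 64-bit integer, first coefficient in the top bit."""
    coeffs = [c for row in dct_low_freq(bitmap) for c in row]
    threshold = median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > threshold else 0)
    return bits
```

Calling `phash64` on a 32×32 list of grayscale rows yields a single integer below 2**64, which is the only per-glyph value (together with the character association) that the steps above describe storing.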

We do NOT distribute:

  • The original font files (.ttf, .otf)
  • The rasterized glyph bitmaps
  • Any vector outline data
  • Any hinting instructions or metadata

The OFL permits derivative works under specific conditions:

"Permission is hereby granted... to use, study, copy, merge, embed, modify, redistribute, and sell modified and unmodified copies of the Font Software..."

Key Clauses

  1. Derivative Works (Clauses 2 and 5): The OFL allows Modified Versions to be redistributed, provided each copy carries the original copyright notice and license text (Clause 2) and the Font Software itself stays entirely under the OFL (Clause 5).

  2. "Reserved Font Name" (Clause 3): We do not use any Reserved Font Names in our distributed data. The shape database uses only neutral identifiers (pHash values, character codes).

  3. Embedding (Clause 5): While embedding usually refers to including fonts in documents, our use case is analogous: we embed derived data (pHash values) into our binary, not the font software itself.

Why pHash Data is a Compliant Derivative Work

  • Transformative Nature: The pHash is a lossy mathematical transformation of the rasterized glyph that discards almost all visual detail. Neither the glyph outline nor the original font can be reconstructed from the pHash.
  • Minimal Extraction: We extract only 64 bits per glyph, a coarse fingerprint that is far smaller than the outline data the font file stores for the same glyph.
  • No Redistribution: The original font binaries are never distributed with pdftract; users must download them separately using scripts/fetch-shape-corpus.sh.
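
As a rough worked example of how much information the hash discards, the arithmetic below compares the intermediate raster with the stored hash. It assumes 8 bits per pixel for the grayscale bitmap; that depth is an assumption for illustration, not a figure stated in this document.

```python
BITMAP_SIDE = 32     # from the rasterization step
BITS_PER_PIXEL = 8   # assumption: 8-bit grayscale coverage values
HASH_BITS = 64       # from the storage step

# Size of one rasterized glyph versus its stored fingerprint.
raster_bits = BITMAP_SIDE * BITMAP_SIDE * BITS_PER_PIXEL
reduction = raster_bits // HASH_BITS

print(f"{raster_bits} raster bits -> {HASH_BITS} hash bits ({reduction}:1)")
```

Because there are far more possible bitmaps than 64-bit values, many distinct bitmaps necessarily share each hash, which is why a hash cannot be inverted to a unique glyph image.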

Apache License 2.0 (Roboto)

Roboto uses the Apache License 2.0, which explicitly permits:

"You may reproduce and distribute copies of the Work or Derivative Works thereof..."

Apache 2.0 has fewer restrictions than the OFL: it has no "Reserved Font Name" clause and does not require derivative works to be renamed. We therefore treat our pHash extraction as within the scope of permitted derivative works, subject to the license's own redistribution conditions.

Corpus Coverage

Unicode Blocks Covered

| Block | Range | Source Fonts |
| --- | --- | --- |
| Latin Basic | U+0020-U+007F | All fonts |
| Latin-1 Supplement | U+0080-U+00FF | Liberation Sans, DejaVu Sans, Noto Sans |
| Latin Extended-A | U+0100-U+017F | DejaVu Sans, Noto Sans |
| Latin Extended-B | U+0180-U+024F | DejaVu Sans, Noto Sans |
| Greek and Coptic | U+0370-U+03FF | DejaVu Sans, Noto Sans |
| Cyrillic | U+0400-U+04FF | DejaVu Sans, Noto Sans |
| Cyrillic Supplement | U+0500-U+052F | DejaVu Sans, Noto Sans |
| General Punctuation | U+2000-U+206F | All fonts |
| Superscripts/Subscripts | U+2070-U+209F | DejaVu Sans, Noto Sans |
| Currency Symbols | U+20A0-U+20CF | DejaVu Sans, Noto Sans |
| Letterlike Symbols | U+2100-U+214F | DejaVu Sans, Noto Sans |
| Mathematical Operators | U+2200-U+22FF | STIX Two Math, Latin Modern Math |
| Miscellaneous Technical | U+2300-U+23FF | DejaVu Sans, Noto Sans |
| Box Drawing | U+2500-U+257F | DejaVu Sans, Noto Sans |
| Geometric Shapes | U+25A0-U+25FF | DejaVu Sans, Noto Sans |
| Miscellaneous Symbols | U+2600-U+26FF | Noto Sans Symbols |
| Dingbats | U+2700-U+27BF | Noto Sans Symbols |
| Arrows | U+2190-U+21FF | Noto Sans Symbols, DejaVu Sans |
| Mathematical Alphanumeric Symbols | U+1D400-U+1D7FF | STIX Two Math, Latin Modern Math |
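
To check whether a character falls inside one of the listed blocks, a simple range lookup is enough. The sketch below copies the ranges from the coverage table; `COVERED_BLOCKS` and `covering_block` are hypothetical names for illustration. Being inside a covered block does not guarantee that the specific glyph is present in the database.

```python
# (block name, first code point, last code point), copied from the coverage table.
COVERED_BLOCKS = [
    ("Latin Basic", 0x0020, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended-B", 0x0180, 0x024F),
    ("Greek and Coptic", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Cyrillic Supplement", 0x0500, 0x052F),
    ("General Punctuation", 0x2000, 0x206F),
    ("Superscripts/Subscripts", 0x2070, 0x209F),
    ("Currency Symbols", 0x20A0, 0x20CF),
    ("Letterlike Symbols", 0x2100, 0x214F),
    ("Arrows", 0x2190, 0x21FF),
    ("Mathematical Operators", 0x2200, 0x22FF),
    ("Miscellaneous Technical", 0x2300, 0x23FF),
    ("Box Drawing", 0x2500, 0x257F),
    ("Geometric Shapes", 0x25A0, 0x25FF),
    ("Miscellaneous Symbols", 0x2600, 0x26FF),
    ("Dingbats", 0x2700, 0x27BF),
    ("Mathematical Alphanumeric Symbols", 0x1D400, 0x1D7FF),
]

def covering_block(ch):
    """Return the name of the covered block containing ch, or None if uncovered."""
    cp = ord(ch)
    for name, first, last in COVERED_BLOCKS:
        if first <= cp <= last:
            return name
    return None
```

For example, `covering_block("λ")` falls in Greek and Coptic, while a CJK ideograph such as `"中"` returns `None`.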

Known Coverage Gaps

The following blocks are NOT covered by the current corpus:

  • CJK Unified Ideographs (U+4E00-U+9FFF): Would require CJK fonts; ~20K+ characters
  • Arabic (U+0600-U+06FF): Would require Arabic-specific fonts with proper ligature support
  • Hebrew (U+0590-U+05FF): Would require Hebrew-specific fonts
  • Indic Scripts (Devanagari, Bengali, etc.): Each requires specialized fonts
  • Emoji (U+1F000-U+1FFFF): Would require emoji-specific fonts (Noto Color Emoji)

These gaps represent known limitations: L4 glyph recognition will always fail (return U+FFFD) for characters in these blocks.
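
The fallback behaviour can be illustrated with a nearest-hash lookup. This document does not specify how L4 recognition compares hashes, so the Hamming-distance match, the `MAX_DISTANCE` threshold, and the list-of-pairs database layout below are all assumptions for the sake of a sketch; only the U+FFFD fallback comes from the text.

```python
REPLACEMENT = "\ufffd"  # returned when no glyph matches, as described above
MAX_DISTANCE = 6        # hypothetical threshold, not taken from the project

def hamming(a, b):
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

def recognize(query_hash, shape_db, max_distance=MAX_DISTANCE):
    """Return the character whose hash is closest to query_hash, or U+FFFD.

    shape_db is assumed to be an iterable of (hash, character) pairs.
    """
    best_char, best_dist = REPLACEMENT, max_distance + 1
    for known_hash, ch in shape_db:
        dist = hamming(query_hash, known_hash)
        if dist < best_dist:
            best_char, best_dist = ch, dist
    return best_char
```

A glyph from an uncovered script has no nearby entry in the database, so under these assumptions the lookup falls through to `REPLACEMENT`.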

Attribution

For each font in the corpus, the full license text is stored in build/font-licenses/<family_slug>.txt. The build/shape-corpus-manifest.txt file documents:

  • Font family name
  • Download URL (source repository)
  • License identifier (e.g., OFL-1.1, Apache-2.0)
  • Target filename
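
A quick way to audit attribution is to confirm that every corpus font has its license text in place. The helper below is a minimal sketch: it takes the family slugs as an argument rather than parsing build/shape-corpus-manifest.txt, because the manifest's exact line format is not described here, and the function name is hypothetical.

```python
from pathlib import Path

def missing_license_files(family_slugs, license_dir="build/font-licenses"):
    """Return the slugs that have no <family_slug>.txt file in license_dir."""
    directory = Path(license_dir)
    return [slug for slug in family_slugs
            if not (directory / f"{slug}.txt").is_file()]
```

An empty result means every listed family has a license file; any returned slug points at a font whose license text still needs to be copied.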

Regenerating the Database

To regenerate glyph-shapes.json from the corpus:

# 1. Download the font corpus
bash scripts/fetch-shape-corpus.sh

# 2. Generate the shape database
cargo xtask gen-shape-db build/shape-corpus/ build/glyph-shapes.json

# 3. Verify coverage (> 4500 glyphs expected)
jq length build/glyph-shapes.json
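
Where jq is not available, the same count check can be done with Python's standard library. This sketch assumes only that the top-level JSON value is an array or object with one entry per glyph, which is what `jq length` counts; the function names are hypothetical.

```python
import json

def glyph_count(path="build/glyph-shapes.json"):
    """Count top-level entries, mirroring `jq length` for arrays and objects."""
    with open(path, encoding="utf-8") as f:
        return len(json.load(f))

def meets_target(count, target=4500):
    """True when the count exceeds the target from the verification step."""
    return count > target
```

Running `meets_target(glyph_count())` from the repository root after regeneration gives a pass/fail answer for the coverage target.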

References