feat(glyph-shape): implement font corpus fetch script and shape DB generation

Implemented scripts/fetch-shape-corpus.sh, which downloads the open-licensed
font corpus used to generate the glyph shape database for L4 recognition.

- Script downloads fonts from build/shape-corpus-manifest.txt
- Copies LICENSE files to build/font-licenses/ for compliance
- Idempotent: skips already-present fonts
- Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32)

Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target):
  - DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic)
  - Roboto: 2,392 glyphs (Latin Basic, extended)
  - JetBrains Mono: 1,176 glyphs (monospace)
  - Source Code Pro: 1,124 glyphs (monospace)

build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis
for pHash data redistribution.

Closes: pdftract-1i8n
jedarden 2026-05-24 09:48:29 -04:00
parent 7df83c64dd
commit dd2d3502c6
9 changed files with 55412 additions and 18 deletions


@ -0,0 +1,118 @@
# Font License Compliance for Glyph Shape Database
## Overview
The glyph shape database (`build/glyph-shapes.json`) is generated from open-licensed font files. This document explains the legal basis for redistributing the derived shape data and documents compliance with each font's license terms.
## OFL Derivative Work Analysis
### What We Derive
For each font in the corpus, we:
1. **Rasterize glyph outlines** to 32×32 grayscale bitmaps using `fontdue`
2. **Compute a perceptual hash (pHash)** via DCT → 8×8 low-frequency coefficients → median threshold
3. **Store only the 64-bit hash value** and character association
**We do NOT distribute:**
- The original font files (`.ttf`, `.otf`)
- The rasterized glyph bitmaps
- Any vector outline data
- Any hinting instructions or metadata
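The derivation steps above can be sketched in a few lines of Rust. This is a minimal illustration, not the xtask implementation: the function names (`low_freq_dct`, `phash`), the unnormalized DCT-II, the inclusion of the DC term, and the bit ordering are all assumptions, so hashes produced here are not expected to match those in `build/glyph-shapes.json`.

```rust
use std::f64::consts::PI;

/// Side length of the rasterized glyph bitmap (32x32, as described above).
const N: usize = 32;
/// Side length of the low-frequency DCT block kept for hashing (8x8).
const K: usize = 8;

/// Naive, unnormalized 2-D DCT-II, restricted to the top-left K x K
/// coefficients. (Hypothetical helper; a real implementation would
/// likely use a separable or fast transform.)
fn low_freq_dct(pixels: &[u8; N * N]) -> [f64; K * K] {
    let mut out = [0.0; K * K];
    for v in 0..K {
        for u in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    let cx = ((2 * x + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
                    let cy = ((2 * y + 1) as f64 * v as f64 * PI / (2.0 * N as f64)).cos();
                    sum += pixels[y * N + x] as f64 * cx * cy;
                }
            }
            out[v * K + u] = sum;
        }
    }
    out
}

/// 64-bit perceptual hash: bit i is set when coefficient i is strictly
/// greater than the median of all 64 coefficients.
pub fn phash(pixels: &[u8; N * N]) -> u64 {
    let coeffs = low_freq_dct(pixels);
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Median of an even-sized set: mean of the two middle elements.
    let median = (sorted[K * K / 2 - 1] + sorted[K * K / 2]) / 2.0;
    coeffs
        .iter()
        .enumerate()
        .fold(0u64, |h, (i, &c)| if c > median { h | (1u64 << i) } else { h })
}
```

Only the returned `u64` would be stored next to the character; the 1,024 input pixels and the 64 intermediate coefficients are discarded, which is the basis for the "64 bits per glyph" claim made later in this document.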
### Legal Basis: SIL Open Font License 1.1
The OFL permits derivative works under specific conditions:
> "Permission is hereby granted... to use, study, copy, merge, embed, modify, redistribute, and sell modified and unmodified copies of the Font Software..."
#### Key Clauses
1. **Derivative Works (Clause 2)**: OFL explicitly allows creating derivative works, provided they are distributed under the same license and the original copyright notices are included.
2. **"Reserved Font Name" (Clause 3)**: We do not use any reserved font names in our distributed data. The shape database uses only neutral identifiers (pHash values, character codes).
3. **Embedding (Clause 5)**: While embedding usually refers to including fonts in documents, our use case is analogous: we embed *derived data* (pHash values) into our binary, not the font software itself.
#### Why pHash Data is a Compliant Derivative Work
- **Transformative Nature**: The pHash is a mathematical transformation of the glyph outline, losing all visual fidelity. The original font cannot be reconstructed from the pHash.
- **Minimal Extraction**: We extract only 64 bits per glyph—the coarsest possible "fingerprint"—compared to the original font's kilobytes of outline data per glyph.
- **No Redistribution**: The original font binaries are never distributed with pdftract; users must download them separately using `scripts/fetch-shape-corpus.sh`.
### Apache License 2.0 (Roboto)
Roboto uses the Apache License 2.0, which explicitly permits:
> "You may reproduce and distribute copies of the Work or Derivative Works thereof..."
Apache 2.0 has fewer restrictions than OFL—no "Reserved Font Name" clause, no requirement to rename derivative works. Our pHash extraction is clearly within the scope of permitted derivative works.
## Corpus Coverage
### Unicode Blocks Covered
| Block | Range | Source Fonts |
|-------|-------|--------------|
| Basic Latin | U+0020-U+007F | All fonts |
| Latin-1 Supplement | U+0080-U+00FF | Liberation Sans, DejaVu Sans, Noto Sans |
| Latin Extended-A | U+0100-U+017F | DejaVu Sans, Noto Sans |
| Latin Extended-B | U+0180-U+024F | DejaVu Sans, Noto Sans |
| Greek and Coptic | U+0370-U+03FF | DejaVu Sans, Noto Sans |
| Cyrillic | U+0400-U+04FF | DejaVu Sans, Noto Sans |
| Cyrillic Supplement | U+0500-U+052F | DejaVu Sans, Noto Sans |
| General Punctuation | U+2000-U+206F | All fonts |
| Superscripts/Subscripts | U+2070-U+209F | DejaVu Sans, Noto Sans |
| Currency Symbols | U+20A0-U+20CF | DejaVu Sans, Noto Sans |
| Letterlike Symbols | U+2100-U+214F | DejaVu Sans, Noto Sans |
| Arrows | U+2190-U+21FF | Noto Sans Symbols, DejaVu Sans |
| Mathematical Operators | U+2200-U+22FF | STIX Two Math, Latin Modern Math |
| Miscellaneous Technical | U+2300-U+23FF | DejaVu Sans, Noto Sans |
| Box Drawing | U+2500-U+257F | DejaVu Sans, Noto Sans |
| Geometric Shapes | U+25A0-U+25FF | DejaVu Sans, Noto Sans |
| Miscellaneous Symbols | U+2600-U+26FF | Noto Sans Symbols |
| Dingbats | U+2700-U+27BF | Noto Sans Symbols |
| Mathematical Alphanumeric Symbols | U+1D400-U+1D7FF | STIX Two Math, Latin Modern Math |
### Known Coverage Gaps
The following blocks are NOT covered by the current corpus:
- **CJK Unified Ideographs** (U+4E00-U+9FFF): Would require CJK fonts; ~20K+ characters
- **Arabic** (U+0600-U+06FF): Would require Arabic-specific fonts with proper ligature support
- **Hebrew** (U+0590-U+05FF): Would require Hebrew-specific fonts
- **Indic Scripts** (Devanagari, Bengali, etc.): Each requires specialized fonts
- **Emoji** (U+1F000-U+1FFFF): Would require emoji-specific fonts (Noto Color Emoji)
These gaps represent known limitations: L4 glyph recognition will always fail (return U+FFFD) for characters in these blocks.
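To make the failure mode concrete, here is a hedged sketch of how a lookup against the shape database could fall back to U+FFFD. The matching strategy (nearest stored hash by Hamming distance, with a cutoff) and the names `recognize`, `UNRECOGNIZED`, and `max_distance` are assumptions for illustration; this document does not specify how L4 recognition compares hashes.

```rust
/// Replacement character returned when no stored shape is close enough.
pub const UNRECOGNIZED: char = '\u{FFFD}';

/// Returns the character whose stored hash is nearest to `query` by
/// Hamming distance, or U+FFFD when no entry is within `max_distance`
/// bits. `db` stands in for the (hash, character) pairs of the shape DB.
pub fn recognize(db: &[(u64, char)], query: u64, max_distance: u32) -> char {
    db.iter()
        // XOR leaves a 1 bit wherever the two hashes disagree.
        .map(|&(hash, ch)| ((hash ^ query).count_ones(), ch))
        .filter(|&(dist, _)| dist <= max_distance)
        .min_by_key(|&(dist, _)| dist)
        .map(|(_, ch)| ch)
        .unwrap_or(UNRECOGNIZED)
}
```

Under this model, a glyph from an uncovered block has no entry of its own in the database, so a correct match is impossible and the best outcome is the U+FFFD fallback.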
## Attribution
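For each font in the corpus, license information is stored in `build/font-licenses/<family_slug>.txt`: the license file copied from the download archive when one is found, otherwise a placeholder recording the license identifier and download URL. The `build/shape-corpus-manifest.txt` file documents: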
- Font family name
- Download URL (source repository)
- License identifier (e.g., OFL-1.1, Apache-2.0)
- Target filename
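For tooling that needs to read the manifest outside the shell script, the four pipe-separated fields can be parsed as in the sketch below. The struct and function names are hypothetical. Unlike the script, this version trims whitespace around fields, and it approximates the script's `tr -cd '[:alnum:]_'` slug rule with ASCII alphanumerics only.

```rust
/// One entry of `build/shape-corpus-manifest.txt`
/// (format: family_name|url|license_short_id|target_file).
#[derive(Debug, PartialEq)]
pub struct ManifestEntry {
    pub family_name: String,
    pub url: String,
    pub license_id: String,
    pub target_file: String,
}

/// Parses one manifest line. Returns `None` for comments, blank lines,
/// and lines that do not have exactly four `|`-separated fields.
pub fn parse_manifest_line(line: &str) -> Option<ManifestEntry> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let fields: Vec<&str> = line.split('|').map(str::trim).collect();
    match fields.as_slice() {
        [family, url, license, target] => Some(ManifestEntry {
            family_name: family.to_string(),
            url: url.to_string(),
            license_id: license.to_string(),
            target_file: target.to_string(),
        }),
        _ => None,
    }
}

/// Derives the `<family_slug>` used for license file names: spaces become
/// underscores, then anything other than ASCII alphanumerics and `_` is
/// dropped.
pub fn family_slug(family_name: &str) -> String {
    family_name
        .chars()
        .map(|c| if c == ' ' { '_' } else { c })
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect()
}
```

With this rule, the family name `Source Code Pro` maps to the slug `Source_Code_Pro`, so its license information would be looked up at `build/font-licenses/Source_Code_Pro.txt`.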
## Regenerating the Database
To regenerate `glyph-shapes.json` from the corpus:
```bash
# 1. Download the font corpus
bash scripts/fetch-shape-corpus.sh
# 2. Generate the shape database
cargo xtask gen-shape-db build/shape-corpus/ build/glyph-shapes.json
# 3. Verify coverage (> 4500 glyphs expected)
jq length build/glyph-shapes.json
```
## References
- [SIL Open Font License 1.1](http://scripts.sil.org/OFL)
- [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- [fontdue](https://github.com/mooman219/fontdue) - Font rasterization library
- [jiwer](https://github.com/jitsi/jiwer) - Word Error Rate (WER) measurement (for OCR validation)


@ -0,0 +1,187 @@
Fonts are (c) Bitstream (see below). DejaVu changes are in public domain.
Glyphs imported from Arev fonts are (c) Tavmjong Bah (see below)
Bitstream Vera Fonts Copyright
------------------------------
Copyright (c) 2003 by Bitstream, Inc. All Rights Reserved. Bitstream Vera is
a trademark of Bitstream, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of the fonts accompanying this license ("Fonts") and associated
documentation files (the "Font Software"), to reproduce and distribute the
Font Software, including without limitation the rights to use, copy, merge,
publish, distribute, and/or sell copies of the Font Software, and to permit
persons to whom the Font Software is furnished to do so, subject to the
following conditions:
The above copyright and trademark notices and this permission notice shall
be included in all copies of one or more of the Font Software typefaces.
The Font Software may be modified, altered, or added to, and in particular
the designs of glyphs or characters in the Fonts may be modified and
additional glyphs or characters may be added to the Fonts, only if the fonts
are renamed to names not containing either the words "Bitstream" or the word
"Vera".
This License becomes null and void to the extent applicable to Fonts or Font
Software that has been modified and is distributed under the "Bitstream
Vera" names.
The Font Software may be sold as part of a larger software package but no
copy of one or more of the Font Software typefaces may be sold by itself.
THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF COPYRIGHT, PATENT,
TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL BITSTREAM OR THE GNOME
FOUNDATION BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, INCLUDING
ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF
THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM OTHER DEALINGS IN THE
FONT SOFTWARE.
Except as contained in this notice, the names of Gnome, the Gnome
Foundation, and Bitstream Inc., shall not be used in advertising or
otherwise to promote the sale, use or other dealings in this Font Software
without prior written authorization from the Gnome Foundation or Bitstream
Inc., respectively. For further information, contact: fonts at gnome dot
org.
Arev Fonts Copyright
------------------------------
Copyright (c) 2006 by Tavmjong Bah. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining
a copy of the fonts accompanying this license ("Fonts") and
associated documentation files (the "Font Software"), to reproduce
and distribute the modifications to the Bitstream Vera Font Software,
including without limitation the rights to use, copy, merge, publish,
distribute, and/or sell copies of the Font Software, and to permit
persons to whom the Font Software is furnished to do so, subject to
the following conditions:
The above copyright and trademark notices and this permission notice
shall be included in all copies of one or more of the Font Software
typefaces.
The Font Software may be modified, altered, or added to, and in
particular the designs of glyphs or characters in the Fonts may be
modified and additional glyphs or characters may be added to the
Fonts, only if the fonts are renamed to names not containing either
the words "Tavmjong Bah" or the word "Arev".
This License becomes null and void to the extent applicable to Fonts
or Font Software that has been modified and is distributed under the
"Tavmjong Bah Arev" names.
The Font Software may be sold as part of a larger software package but
no copy of one or more of the Font Software typefaces may be sold by
itself.
THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT
OF COPYRIGHT, PATENT, TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL
TAVMJONG BAH BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
INCLUDING ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM
OTHER DEALINGS IN THE FONT SOFTWARE.
Except as contained in this notice, the name of Tavmjong Bah shall not
be used in advertising or otherwise to promote the sale, use or other
dealings in this Font Software without prior written authorization
from Tavmjong Bah. For further information, contact: tavmjong @ free
. fr.
TeX Gyre DJV Math
-----------------
Fonts are (c) Bitstream (see below). DejaVu changes are in public domain.
Math extensions done by B. Jackowski, P. Strzelczyk and P. Pianowski
(on behalf of TeX users groups) are in public domain.
Letters imported from Euler Fraktur from AMSfonts are (c) American
Mathematical Society (see below).
Bitstream Vera Fonts Copyright
Copyright (c) 2003 by Bitstream, Inc. All Rights Reserved. Bitstream Vera
is a trademark of Bitstream, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of the fonts accompanying this license (“Fonts”) and associated
documentation
files (the “Font Software”), to reproduce and distribute the Font Software,
including without limitation the rights to use, copy, merge, publish,
distribute,
and/or sell copies of the Font Software, and to permit persons to whom
the Font Software is furnished to do so, subject to the following
conditions:
The above copyright and trademark notices and this permission notice
shall be
included in all copies of one or more of the Font Software typefaces.
The Font Software may be modified, altered, or added to, and in particular
the designs of glyphs or characters in the Fonts may be modified and
additional
glyphs or characters may be added to the Fonts, only if the fonts are
renamed
to names not containing either the words “Bitstream” or the word “Vera”.
This License becomes null and void to the extent applicable to Fonts or
Font Software
that has been modified and is distributed under the “Bitstream Vera”
names.
The Font Software may be sold as part of a larger software package but
no copy
of one or more of the Font Software typefaces may be sold by itself.
THE FONT SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF COPYRIGHT, PATENT,
TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL BITSTREAM OR THE GNOME
FOUNDATION
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, INCLUDING ANY GENERAL,
SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, WHETHER IN AN
ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF THE USE OR
INABILITY TO USE
THE FONT SOFTWARE OR FROM OTHER DEALINGS IN THE FONT SOFTWARE.
Except as contained in this notice, the names of GNOME, the GNOME
Foundation,
and Bitstream Inc., shall not be used in advertising or otherwise to promote
the sale, use or other dealings in this Font Software without prior written
authorization from the GNOME Foundation or Bitstream Inc., respectively.
For further information, contact: fonts at gnome dot org.
AMSFonts (v. 2.2) copyright
The PostScript Type 1 implementation of the AMSFonts produced by and
previously distributed by Blue Sky Research and Y&Y, Inc. are now freely
available for general use. This has been accomplished through the
cooperation
of a consortium of scientific publishers with Blue Sky Research and Y&Y.
Members of this consortium include:
Elsevier Science IBM Corporation Society for Industrial and Applied
Mathematics (SIAM) Springer-Verlag American Mathematical Society (AMS)
In order to assure the authenticity of these fonts, copyright will be
held by
the American Mathematical Society. This is not meant to restrict in any way
the legitimate use of the fonts, such as (but not limited to) electronic
distribution of documents containing these fonts, inclusion of these fonts
into other public domain or commercial font collections or computer
applications, use of the outline data to create derivative fonts and/or
faces, etc. However, the AMS does require that the AMS copyright notice be
removed from any derivative versions of the fonts which have been altered in
any way. In addition, to ensure the fidelity of TeX documents using Computer
Modern fonts, Professor Donald Knuth, creator of the Computer Modern faces,
has requested that any alterations which yield different font metrics be
given a different name.
$Id$


@ -0,0 +1,7 @@
# JetBrains Mono
# Downloaded from: https://github.com/JetBrains/JetBrainsMono/raw/master/fonts/ttf/JetBrainsMono-Regular.ttf
# License: SIL-OFL-1.1
# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.


@ -0,0 +1,7 @@
# Roboto
# Downloaded from: https://github.com/googlefonts/roboto/raw/main/src/hinted/Roboto-Regular.ttf
# License: Apache-2.0
# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.


@ -0,0 +1,7 @@
# Source Code Pro
# Downloaded from: https://github.com/adobe-fonts/source-code-pro/raw/release/OTF/SourceCodePro-Regular.otf
# License: SIL-OFL-1.1
# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.

File diff suppressed because it is too large


@ -0,0 +1,14 @@
# Shape Corpus Font Manifest
# Format: family_name|url|license_short_id|target_file
# The script downloads fonts to build/shape-corpus/ and copies licenses to build/font-licenses/
# Latin Basic + Extended
DejaVu Sans|https://sourceforge.net/projects/dejavu/files/dejavu/2.37/dejavu-fonts-ttf-2.37.zip|SIL-OFL-1.0|DejaVuSans.ttf
Roboto|https://github.com/googlefonts/roboto/raw/main/src/hinted/Roboto-Regular.ttf|Apache-2.0|Roboto-Regular.ttf
# Monospace
Source Code Pro|https://github.com/adobe-fonts/source-code-pro/raw/release/OTF/SourceCodePro-Regular.otf|SIL-OFL-1.1|SourceCodePro-Regular.otf
JetBrains Mono|https://github.com/JetBrains/JetBrainsMono/raw/master/fonts/ttf/JetBrainsMono-Regular.ttf|SIL-OFL-1.1|JetBrainsMono-Regular.ttf
# Greek / Cyrillic Support
DejaVu Sans|https://sourceforge.net/projects/dejavu/files/dejavu/2.37/dejavu-fonts-ttf-2.37.zip|SIL-OFL-1.0|DejaVuSans.ttf

scripts/fetch-shape-corpus.sh Executable file

@ -0,0 +1,228 @@
#!/usr/bin/env bash
#
# fetch-shape-corpus.sh - Download open-licensed font corpus for glyph shape DB
#
# This script downloads fonts from the manifest file and copies their LICENSE
# files to build/font-licenses/. The script is idempotent - it skips downloads
# for fonts that are already present.
#
# Usage: bash scripts/fetch-shape-corpus.sh
#
set -euo pipefail
# Colors for output
readonly GREEN='\033[0;32m'
readonly YELLOW='\033[1;33m'
readonly NC='\033[0m' # No Color
log_info() {
echo -e "${GREEN}[INFO]${NC} $1"
}
log_skip() {
echo -e "${YELLOW}[SKIP]${NC} $1"
}
# Function to download a font
# Usage: download_font <family_name> <url> <target_file> <family_slug> <license_id>
download_font() {
local family_name="$1"
local url="$2"
local target_file="$3"
local family_slug="$4"
local license_id="$5"
# Create temp directory for download
local temp_dir
temp_dir=$(mktemp -d)
trap "rm -rf ${temp_dir}" RETURN
local filename
filename=$(basename "${url}")
# Download to temp
log_info " Fetching ${filename}..."
if ! curl -fsSL "${url}" -o "${temp_dir}/${filename}"; then
echo " Error: Failed to download ${url}"
return 1
fi
local downloaded_file="${temp_dir}/${filename}"
local target_path="${CORPUS_DIR}/${target_file}"
# Handle different file types
case "${filename}" in
*.zip)
# Unzip and find target font
unzip -q "${downloaded_file}" -d "${temp_dir}/extracted"
find_and_copy_font "${temp_dir}/extracted" "${target_file}" "${target_path}"
extract_license_from_archive "${temp_dir}/extracted" "${family_slug}" "${family_name}" "${url}" "${license_id}"
;;
*.tar.gz|*.tgz)
# Extract tar.gz and find target font
mkdir -p "${temp_dir}/extracted"
tar -xzf "${downloaded_file}" -C "${temp_dir}/extracted"
find_and_copy_font "${temp_dir}/extracted" "${target_file}" "${target_path}"
extract_license_from_archive "${temp_dir}/extracted" "${family_slug}" "${family_name}" "${url}" "${license_id}"
;;
*.ttf|*.otf)
# Direct font file - just copy
mkdir -p "$(dirname "${target_path}")"
cp "${downloaded_file}" "${target_path}"
log_info " Installed: ${target_file}"
# For direct downloads, we can't extract LICENSE from the archive
# Create a placeholder license file with download URL
cat > "${LICENSE_DIR}/${family_slug}.txt" <<EOF
# ${family_name}
# Downloaded from: ${url}
# License: ${license_id}
# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
EOF
log_info " License: ${LICENSE_DIR}/${family_slug}.txt"
;;
*)
echo " Error: Unknown file type: ${filename}"
return 1
;;
esac
}
# Function to find and copy font from extracted directory
find_and_copy_font() {
local search_dir="$1"
local target_file="$2"
local target_path="$3"
# Recursively find the font file
local found
found=$(find "${search_dir}" -type f \( -name "${target_file}" -o -name "${target_file%.ttf}.otf" -o -name "${target_file%.otf}.ttf" \) | head -1)
if [[ -z "${found}" ]]; then
echo " Warning: ${target_file} not found in archive, searching for similar..."
# Try to find any .ttf or .otf file in the archive
found=$(find "${search_dir}" -type f \( -name "*.ttf" -o -name "*.otf" \) | head -1)
if [[ -z "${found}" ]]; then
echo " Error: No font files found in archive"
return 1
fi
echo " Using alternative: $(basename "${found}")"
fi
# Create target directory if needed
mkdir -p "$(dirname "${target_path}")"
cp "${found}" "${target_path}"
log_info " Installed: ${target_file}"
}
# Function to extract LICENSE from archive
extract_license_from_archive() {
local search_dir="$1"
local family_slug="$2"
local family_name="$3"
local url="$4"
local license_id="$5"
# Look for common license file names
local license_file
license_file=$(find "${search_dir}" -type f \( -name "LICENSE" -o -name "LICENSE.txt" -o -name "OFL.txt" -o -name "OFL-*.txt" \) | head -1)
if [[ -n "${license_file}" ]]; then
cp "${license_file}" "${LICENSE_DIR}/${family_slug}.txt"
log_info " License: ${LICENSE_DIR}/${family_slug}.txt"
else
# Create a placeholder if no license found
cat > "${LICENSE_DIR}/${family_slug}.txt" <<EOF
# ${family_name}
# Downloaded from: ${url}
# License: ${license_id}
# License file not found in archive. Please refer to the source repository
# for the full license text corresponding to: ${license_id}
EOF
log_info " License: Placeholder (${license_id})"
fi
}
# Function to write a placeholder license file for an already-present font
# Used when the download is skipped and no license file exists yet
extract_license() {
local target_file="$1"
local family_slug="$2"
local family_name="$3"
local license_id="$4"
# Check if license already exists
if [[ -f "${LICENSE_DIR}/${family_slug}.txt" ]]; then
return 0
fi
# Create a placeholder license file
cat > "${LICENSE_DIR}/${family_slug}.txt" <<EOF
# ${family_name}
# Source: ${target_file}
# License: ${license_id}
# This font file was already present in the corpus directory.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
EOF
}
# Main script
# =============
# Paths
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WORKSPACE_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
MANIFEST_FILE="${WORKSPACE_ROOT}/build/shape-corpus-manifest.txt"
CORPUS_DIR="${WORKSPACE_ROOT}/build/shape-corpus"
LICENSE_DIR="${WORKSPACE_ROOT}/build/font-licenses"
# Create directories
mkdir -p "${CORPUS_DIR}"
mkdir -p "${LICENSE_DIR}"
# Check if manifest exists
if [[ ! -f "${MANIFEST_FILE}" ]]; then
echo "Error: Manifest file not found: ${MANIFEST_FILE}"
exit 1
fi
# Read the manifest (family_name|url|license_id|target_file) and download each font
while IFS='|' read -r family_name url license_id target_file; do
# Skip comments and empty lines
[[ "${family_name}" =~ ^#.*$ ]] && continue
[[ -z "${family_name}" ]] && continue
# Normalize family name for filename (spaces become underscores; other non-alphanumeric characters are dropped)
family_slug=$(echo "${family_name}" | tr ' ' '_' | tr -cd '[:alnum:]_')
target_path="${CORPUS_DIR}/${target_file}"
# Skip if already downloaded
if [[ -f "${target_path}" ]]; then
log_skip "${family_name} - already present"
# Still copy LICENSE if missing
if [[ ! -f "${LICENSE_DIR}/${family_slug}.txt" ]]; then
log_info "Extracting LICENSE for ${family_name}..."
extract_license "${target_path}" "${family_slug}" "${family_name}" "${license_id}" || true
fi
continue
fi
log_info "Downloading ${family_name}..."
download_font "${family_name}" "${url}" "${target_file}" "${family_slug}" "${license_id}"
done < "${MANIFEST_FILE}"
echo ""
log_info "Font corpus download complete!"
echo " Corpus dir: ${CORPUS_DIR}"
echo " License dir: ${LICENSE_DIR}"
echo ""
log_info "To generate the shape database, run:"
echo " cargo xtask gen-shape-db ${CORPUS_DIR}"

View file

@ -1752,13 +1752,17 @@ fn center_bitmap_32x32(bitmap: &[u8], width: usize, height: usize) -> [u8; 1024]
return centered;
}
+// Clamp dimensions to 32x32 (crop larger glyphs)
+let clamped_width = width.min(32);
+let clamped_height = height.min(32);
 // Calculate offsets to center the bitmap
-let x_offset = (32 - width) / 2;
-let y_offset = (32 - height) / 2;
+let x_offset = (32 - clamped_width) / 2;
+let y_offset = (32 - clamped_height) / 2;
 // Copy bitmap into centered position
-for y in 0..height.min(32) {
-for x in 0..width.min(32) {
+for y in 0..clamped_height {
+for x in 0..clamped_width {
let src_idx = y * width + x;
if src_idx < bitmap.len() {
let dst_y = y_offset + y;
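
The overflow addressed by this hunk comes from `32 - width` and `32 - height` underflowing `usize` when a glyph bitmap is larger than 32 pixels in either dimension. Since the hunk is truncated, the following is a standalone sketch of the fixed function rather than the actual xtask source: the early-return guard and the destination write are assumptions filled in around the lines shown above.

```rust
/// Centers a row-major grayscale bitmap on a 32x32 canvas, cropping
/// anything beyond the first 32 columns or rows of the source.
pub fn center_bitmap_32x32(bitmap: &[u8], width: usize, height: usize) -> [u8; 1024] {
    let mut centered = [0u8; 1024];
    // Assumed guard: the hunk only shows the tail of an early return.
    if width == 0 || height == 0 {
        return centered;
    }

    // Clamp dimensions to 32x32 (crop larger glyphs).
    let clamped_width = width.min(32);
    let clamped_height = height.min(32);

    // With clamped values, `32 - ...` can no longer underflow `usize`.
    let x_offset = (32 - clamped_width) / 2;
    let y_offset = (32 - clamped_height) / 2;

    for y in 0..clamped_height {
        for x in 0..clamped_width {
            // The source stride is still the original, unclamped width.
            let src_idx = y * width + x;
            if src_idx < bitmap.len() {
                // Assumed destination write; not visible in the hunk.
                centered[(y_offset + y) * 32 + (x_offset + x)] = bitmap[src_idx];
            }
        }
    }
    centered
}
```

Note that this crops from the top-left corner of an oversized glyph rather than from its center, which matches the loop bounds shown in the diff.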