feat(glyph-shape): implement font corpus fetch script and shape DB generation

Implemented scripts/fetch-shape-corpus.sh for downloading the open-licensed font corpus and generating the glyph shape database for L4 recognition.

- Script downloads fonts from build/shape-corpus-manifest.txt
- Copies LICENSE files to build/font-licenses/ for compliance
- Idempotent: skips already-present fonts
- Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32)

Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target):

- DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic)
- Roboto: 2,392 glyphs (Latin Basic, extended)
- JetBrains Mono: 1,176 glyphs (monospace)
- Source Code Pro: 1,124 glyphs (monospace)

build/font-licenses/COMPLIANCE.md documents the OFL derivative-work analysis for pHash data redistribution.

Closes: pdftract-1i8n
Parent: 7df83c64dd
Commit: dd2d3502c6
9 changed files with 55412 additions and 18 deletions
118
build/font-licenses/COMPLIANCE.md
Normal file
@@ -0,0 +1,118 @@
# Font License Compliance for Glyph Shape Database

## Overview

The glyph shape database (`build/glyph-shapes.json`) is generated from open-licensed font files. This document explains the legal basis for redistributing the derived shape data and documents compliance with each font's license terms.

## OFL Derivative Work Analysis

### What We Derive

For each font in the corpus, we:

1. **Rasterize glyph outlines** to 32×32 grayscale bitmaps using `fontdue`
2. **Compute a perceptual hash (pHash)** via DCT → 8×8 low-frequency coefficients → median threshold
3. **Store only the 64-bit hash value** and character association
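
The three steps above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the xtask implementation: it assumes the glyph is already rasterized into a row-major 32×32 array, uses an unnormalized DCT-II, includes the DC term when taking the median, and sets bit `i` when coefficient `i` exceeds the median. The name `phash_32x32` and the bit order are hypothetical.

```rust
use std::f64::consts::PI;

/// Side length of the rasterized glyph bitmap.
const N: usize = 32;
/// Side length of the low-frequency block kept from the DCT.
const K: usize = 8;

/// Computes a 64-bit perceptual hash from a row-major 32x32 grayscale bitmap.
/// Sketch only: unnormalized DCT-II, DC term included in the median.
pub fn phash_32x32(pixels: &[u8; N * N]) -> u64 {
    // Cosine basis for the first K frequencies: basis[u][x] = cos((2x + 1) * u * pi / 2N).
    let mut basis = [[0.0f64; N]; K];
    for u in 0..K {
        for x in 0..N {
            basis[u][x] = ((2 * x + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
        }
    }

    // Row pass: transform each row, keeping only the K lowest horizontal frequencies.
    let mut rows = [[0.0f64; K]; N];
    for y in 0..N {
        for u in 0..K {
            rows[y][u] = (0..N).map(|x| pixels[y * N + x] as f64 * basis[u][x]).sum();
        }
    }

    // Column pass: keep only the K lowest vertical frequencies, giving an 8x8 block.
    let mut coeffs = [0.0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            coeffs[v * K + u] = (0..N).map(|y| rows[y][u] * basis[v][y]).sum();
        }
    }

    // Median of the 64 coefficients (mean of the two middle values).
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[31] + sorted[32]) / 2.0;

    // Median threshold: bit i is set when coefficient i is above the median.
    coeffs.iter().enumerate().fold(0u64, |hash, (i, &c)| {
        if c > median {
            hash | (1u64 << i)
        } else {
            hash
        }
    })
}
```

Two hashes produced this way can be compared by their Hamming distance, `(a ^ b).count_ones()`.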

**We do NOT distribute:**

- The original font files (`.ttf`, `.otf`)
- The rasterized glyph bitmaps
- Any vector outline data
- Any hinting instructions or metadata

### Legal Basis: SIL Open Font License 1.1

Source Code Pro and JetBrains Mono are licensed under the OFL 1.1, which grants the following permissions subject to five numbered conditions (DejaVu Sans is distributed under the Bitstream Vera license reproduced in `build/font-licenses/DejaVu_Sans.txt`, not the OFL):

> "Permission is hereby granted... to use, study, copy, merge, embed, modify, redistribute, and sell modified and unmodified copies of the Font Software..."

#### Key Clauses

1. **Bundling and redistribution (Condition 2)**: Original or modified versions of the Font Software may be bundled, redistributed, or sold with any software, provided each copy contains the copyright notice and the license.

2. **"Reserved Font Name" (Condition 3)**: Modified versions may not use a Reserved Font Name without written permission. We do not use any reserved font names in our distributed data. The shape database uses only neutral identifiers (pHash values, character codes).

3. **Same-license requirement (Condition 5)**: Font Software, modified or unmodified, must be distributed entirely under the OFL, but that requirement does not apply to documents created using the font. Our use case is analogous to the document case: we embed *derived data* (pHash values) into our binary, not the font software itself.

#### Why pHash Data is a Compliant Derivative Work

- **Transformative Nature**: The pHash is a mathematical transformation of the glyph outline, losing all visual fidelity. The original font cannot be reconstructed from the pHash.
- **Minimal Extraction**: We extract only 64 bits per glyph, a coarse "fingerprint", compared to the original font's kilobytes of outline data per glyph.
- **No Redistribution**: The original font binaries are never distributed with pdftract; users must download them separately using `scripts/fetch-shape-corpus.sh`.

### Apache License 2.0 (Roboto)

Roboto uses the Apache License 2.0, which explicitly permits:

> "You may reproduce and distribute copies of the Work or Derivative Works thereof..."

Apache 2.0 has fewer naming restrictions than OFL: no "Reserved Font Name" clause and no requirement to rename derivative works. Redistributing a Derivative Work does require passing on a copy of the license and retaining attribution notices (Section 4). Our pHash extraction falls within the scope of permitted derivative works.

## Corpus Coverage

### Unicode Blocks Covered

| Block | Range | Source Fonts |
|-------|-------|--------------|
| Basic Latin | U+0020-U+007F | All fonts |
| Latin-1 Supplement | U+0080-U+00FF | Liberation Sans, DejaVu Sans, Noto Sans |
| Latin Extended-A | U+0100-U+017F | DejaVu Sans, Noto Sans |
| Latin Extended-B | U+0180-U+024F | DejaVu Sans, Noto Sans |
| Greek and Coptic | U+0370-U+03FF | DejaVu Sans, Noto Sans |
| Cyrillic | U+0400-U+04FF | DejaVu Sans, Noto Sans |
| Cyrillic Supplement | U+0500-U+052F | DejaVu Sans, Noto Sans |
| General Punctuation | U+2000-U+206F | All fonts |
| Superscripts and Subscripts | U+2070-U+209F | DejaVu Sans, Noto Sans |
| Currency Symbols | U+20A0-U+20CF | DejaVu Sans, Noto Sans |
| Letterlike Symbols | U+2100-U+214F | DejaVu Sans, Noto Sans |
| Arrows | U+2190-U+21FF | Noto Sans Symbols, DejaVu Sans |
| Mathematical Operators | U+2200-U+22FF | STIX Two Math, Latin Modern Math |
| Miscellaneous Technical | U+2300-U+23FF | DejaVu Sans, Noto Sans |
| Box Drawing | U+2500-U+257F | DejaVu Sans, Noto Sans |
| Geometric Shapes | U+25A0-U+25FF | DejaVu Sans, Noto Sans |
| Miscellaneous Symbols | U+2600-U+26FF | Noto Sans Symbols |
| Dingbats | U+2700-U+27BF | Noto Sans Symbols |
| Mathematical Alphanumeric Symbols | U+1D400-U+1D7FF | STIX Two Math, Latin Modern Math |

### Known Coverage Gaps

The following blocks are NOT covered by the current corpus:

- **CJK Unified Ideographs** (U+4E00-U+9FFF): Would require CJK fonts; ~20K+ characters
- **Arabic** (U+0600-U+06FF): Would require Arabic-specific fonts with proper ligature support
- **Hebrew** (U+0590-U+05FF): Would require Hebrew-specific fonts
- **Indic Scripts** (Devanagari, Bengali, etc.): Each requires specialized fonts
- **Emoji** (U+1F000-U+1FFFF): Would require emoji-specific fonts (Noto Color Emoji)

These gaps represent known limitations: L4 glyph recognition will always fail (return U+FFFD) for characters in these blocks.
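
The fallback described above could look like the following Rust sketch. Everything here is an assumption made for illustration: the `ShapeEntry` layout, the `recognize` function, the use of Hamming distance for matching, and the `max_distance` cutoff are hypothetical and are not taken from the pdftract sources. The only behaviour grounded in this document is that an unmatched glyph yields U+FFFD.

```rust
/// One database entry: a 64-bit pHash and the character it was generated from.
/// Hypothetical layout; the real JSON schema is not shown in this document.
#[derive(Debug, Clone, Copy)]
pub struct ShapeEntry {
    pub phash: u64,
    pub ch: char,
}

/// Returns the character whose hash is nearest to `query` by Hamming distance,
/// or U+FFFD when no entry lies within `max_distance` differing bits.
pub fn recognize(db: &[ShapeEntry], query: u64, max_distance: u32) -> char {
    db.iter()
        // Hamming distance = number of bits that differ between the two hashes.
        .map(|e| ((e.phash ^ query).count_ones(), e.ch))
        .filter(|&(distance, _)| distance <= max_distance)
        .min_by_key(|&(distance, _)| distance)
        .map(|(_, ch)| ch)
        // No candidate close enough: report the Unicode replacement character.
        .unwrap_or(char::REPLACEMENT_CHARACTER)
}
```

Under this sketch, a glyph from an uncovered block has no nearby hash in the database, so the lookup falls through to `char::REPLACEMENT_CHARACTER`.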

## Attribution

For each font in the corpus, license information is stored in `build/font-licenses/<family_slug>.txt`: the full license text when the download archive contains one, otherwise a placeholder that records the download URL and license identifier. The `build/shape-corpus-manifest.txt` file documents:

- Font family name
- Download URL (source repository)
- License identifier (e.g., OFL-1.1, Apache-2.0)
- Target filename
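
As an illustration of the manifest layout, the sketch below parses the `family_name|url|license_short_id|target_file` format and derives a `<family_slug>` the way the fetch script does (spaces become underscores, anything else outside letters, digits, and `_` is dropped). The type and function names are hypothetical, and two details are assumptions: lines with fewer than four fields are skipped here, and only ASCII alphanumerics are kept, which matches `tr`'s `[:alnum:]` class in the C locale.

```rust
/// One manifest row. Field names follow the manifest's own format comment.
#[derive(Debug, PartialEq, Eq)]
pub struct ManifestEntry {
    pub family_name: String,
    pub url: String,
    pub license_id: String,
    pub target_file: String,
}

/// Parses manifest text, ignoring blank lines and `#` comments.
/// Rows that do not have four `|`-separated fields are dropped (an assumption).
pub fn parse_manifest(text: &str) -> Vec<ManifestEntry> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            let mut fields = line.splitn(4, '|');
            Some(ManifestEntry {
                family_name: fields.next()?.to_string(),
                url: fields.next()?.to_string(),
                license_id: fields.next()?.to_string(),
                target_file: fields.next()?.to_string(),
            })
        })
        .collect()
}

/// Builds the slug used for `build/font-licenses/<family_slug>.txt`.
pub fn family_slug(family_name: &str) -> String {
    family_name
        .chars()
        .map(|c| if c == ' ' { '_' } else { c })
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect()
}
```

For example, `family_slug("Source Code Pro")` yields `Source_Code_Pro`, which matches the license file name `Source_Code_Pro.txt` in this commit.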

## Regenerating the Database

To regenerate `glyph-shapes.json` from the corpus:

```bash
# 1. Download the font corpus
bash scripts/fetch-shape-corpus.sh

# 2. Generate the shape database
cargo xtask gen-shape-db build/shape-corpus/ build/glyph-shapes.json

# 3. Verify coverage (> 4500 glyphs expected)
jq length build/glyph-shapes.json
```

## References

- [SIL Open Font License 1.1](http://scripts.sil.org/OFL)
- [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- [fontdue](https://github.com/mooman219/fontdue) - Font rasterization library
- [jiwer](https://github.com/jitsi/jiwer) - Word Error Rate (WER) measurement (for OCR validation)
187
build/font-licenses/DejaVu_Sans.txt
Normal file
@@ -0,0 +1,187 @@
Fonts are (c) Bitstream (see below). DejaVu changes are in public domain.
Glyphs imported from Arev fonts are (c) Tavmjong Bah (see below)


Bitstream Vera Fonts Copyright
------------------------------

Copyright (c) 2003 by Bitstream, Inc. All Rights Reserved. Bitstream Vera is
a trademark of Bitstream, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of the fonts accompanying this license ("Fonts") and associated
documentation files (the "Font Software"), to reproduce and distribute the
Font Software, including without limitation the rights to use, copy, merge,
publish, distribute, and/or sell copies of the Font Software, and to permit
persons to whom the Font Software is furnished to do so, subject to the
following conditions:

The above copyright and trademark notices and this permission notice shall
be included in all copies of one or more of the Font Software typefaces.

The Font Software may be modified, altered, or added to, and in particular
the designs of glyphs or characters in the Fonts may be modified and
additional glyphs or characters may be added to the Fonts, only if the fonts
are renamed to names not containing either the words "Bitstream" or the word
"Vera".

This License becomes null and void to the extent applicable to Fonts or Font
Software that has been modified and is distributed under the "Bitstream
Vera" names.

The Font Software may be sold as part of a larger software package but no
copy of one or more of the Font Software typefaces may be sold by itself.

THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF COPYRIGHT, PATENT,
TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL BITSTREAM OR THE GNOME
FOUNDATION BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, INCLUDING
ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF
THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM OTHER DEALINGS IN THE
FONT SOFTWARE.

Except as contained in this notice, the names of Gnome, the Gnome
Foundation, and Bitstream Inc., shall not be used in advertising or
otherwise to promote the sale, use or other dealings in this Font Software
without prior written authorization from the Gnome Foundation or Bitstream
Inc., respectively. For further information, contact: fonts at gnome dot
org.

Arev Fonts Copyright
------------------------------

Copyright (c) 2006 by Tavmjong Bah. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining
a copy of the fonts accompanying this license ("Fonts") and
associated documentation files (the "Font Software"), to reproduce
and distribute the modifications to the Bitstream Vera Font Software,
including without limitation the rights to use, copy, merge, publish,
distribute, and/or sell copies of the Font Software, and to permit
persons to whom the Font Software is furnished to do so, subject to
the following conditions:

The above copyright and trademark notices and this permission notice
shall be included in all copies of one or more of the Font Software
typefaces.

The Font Software may be modified, altered, or added to, and in
particular the designs of glyphs or characters in the Fonts may be
modified and additional glyphs or characters may be added to the
Fonts, only if the fonts are renamed to names not containing either
the words "Tavmjong Bah" or the word "Arev".

This License becomes null and void to the extent applicable to Fonts
or Font Software that has been modified and is distributed under the
"Tavmjong Bah Arev" names.

The Font Software may be sold as part of a larger software package but
no copy of one or more of the Font Software typefaces may be sold by
itself.

THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT
OF COPYRIGHT, PATENT, TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL
TAVMJONG BAH BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
INCLUDING ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM
OTHER DEALINGS IN THE FONT SOFTWARE.

Except as contained in this notice, the name of Tavmjong Bah shall not
be used in advertising or otherwise to promote the sale, use or other
dealings in this Font Software without prior written authorization
from Tavmjong Bah. For further information, contact: tavmjong @ free
. fr.

TeX Gyre DJV Math
-----------------
Fonts are (c) Bitstream (see below). DejaVu changes are in public domain.

Math extensions done by B. Jackowski, P. Strzelczyk and P. Pianowski
(on behalf of TeX users groups) are in public domain.

Letters imported from Euler Fraktur from AMSfonts are (c) American
Mathematical Society (see below).
Bitstream Vera Fonts Copyright
Copyright (c) 2003 by Bitstream, Inc. All Rights Reserved. Bitstream Vera
is a trademark of Bitstream, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of the fonts accompanying this license (“Fonts”) and associated
documentation
files (the “Font Software”), to reproduce and distribute the Font Software,
including without limitation the rights to use, copy, merge, publish,
distribute,
and/or sell copies of the Font Software, and to permit persons to whom
the Font Software is furnished to do so, subject to the following
conditions:

The above copyright and trademark notices and this permission notice
shall be
included in all copies of one or more of the Font Software typefaces.

The Font Software may be modified, altered, or added to, and in particular
the designs of glyphs or characters in the Fonts may be modified and
additional
glyphs or characters may be added to the Fonts, only if the fonts are
renamed
to names not containing either the words “Bitstream” or the word “Vera”.

This License becomes null and void to the extent applicable to Fonts or
Font Software
that has been modified and is distributed under the “Bitstream Vera”
names.

The Font Software may be sold as part of a larger software package but
no copy
of one or more of the Font Software typefaces may be sold by itself.

THE FONT SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF COPYRIGHT, PATENT,
TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL BITSTREAM OR THE GNOME
FOUNDATION
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, INCLUDING ANY GENERAL,
SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, WHETHER IN AN
ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF THE USE OR
INABILITY TO USE
THE FONT SOFTWARE OR FROM OTHER DEALINGS IN THE FONT SOFTWARE.
Except as contained in this notice, the names of GNOME, the GNOME
Foundation,
and Bitstream Inc., shall not be used in advertising or otherwise to promote
the sale, use or other dealings in this Font Software without prior written
authorization from the GNOME Foundation or Bitstream Inc., respectively.
For further information, contact: fonts at gnome dot org.

AMSFonts (v. 2.2) copyright

The PostScript Type 1 implementation of the AMSFonts produced by and
previously distributed by Blue Sky Research and Y&Y, Inc. are now freely
available for general use. This has been accomplished through the
cooperation
of a consortium of scientific publishers with Blue Sky Research and Y&Y.
Members of this consortium include:

Elsevier Science IBM Corporation Society for Industrial and Applied
Mathematics (SIAM) Springer-Verlag American Mathematical Society (AMS)

In order to assure the authenticity of these fonts, copyright will be
held by
the American Mathematical Society. This is not meant to restrict in any way
the legitimate use of the fonts, such as (but not limited to) electronic
distribution of documents containing these fonts, inclusion of these fonts
into other public domain or commercial font collections or computer
applications, use of the outline data to create derivative fonts and/or
faces, etc. However, the AMS does require that the AMS copyright notice be
removed from any derivative versions of the fonts which have been altered in
any way. In addition, to ensure the fidelity of TeX documents using Computer
Modern fonts, Professor Donald Knuth, creator of the Computer Modern faces,
has requested that any alterations which yield different font metrics be
given a different name.

$Id$
7
build/font-licenses/JetBrains_Mono.txt
Normal file
@@ -0,0 +1,7 @@
# JetBrains Mono
# Downloaded from: https://github.com/JetBrains/JetBrainsMono/raw/master/fonts/ttf/JetBrainsMono-Regular.ttf
# License: SIL-OFL-1.1

# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
7
build/font-licenses/Roboto.txt
Normal file
@@ -0,0 +1,7 @@
# Roboto
# Downloaded from: https://github.com/googlefonts/roboto/raw/main/src/hinted/Roboto-Regular.ttf
# License: Apache-2.0

# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
7
build/font-licenses/Source_Code_Pro.txt
Normal file
@@ -0,0 +1,7 @@
# Source Code Pro
# Downloaded from: https://github.com/adobe-fonts/source-code-pro/raw/release/OTF/SourceCodePro-Regular.otf
# License: SIL-OFL-1.1

# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
54850
build/glyph-shapes.json
File diff suppressed because it is too large
14
build/shape-corpus-manifest.txt
Normal file
@@ -0,0 +1,14 @@
# Shape Corpus Font Manifest
# Format: family_name|url|license_short_id|target_file
# The script downloads fonts to build/shape-corpus/ and copies licenses to build/font-licenses/

# Latin Basic + Extended
DejaVu Sans|https://sourceforge.net/projects/dejavu/files/dejavu/2.37/dejavu-fonts-ttf-2.37.zip|Bitstream-Vera|DejaVuSans.ttf
Roboto|https://github.com/googlefonts/roboto/raw/main/src/hinted/Roboto-Regular.ttf|Apache-2.0|Roboto-Regular.ttf

# Monospace
Source Code Pro|https://github.com/adobe-fonts/source-code-pro/raw/release/OTF/SourceCodePro-Regular.otf|SIL-OFL-1.1|SourceCodePro-Regular.otf
JetBrains Mono|https://github.com/JetBrains/JetBrainsMono/raw/master/fonts/ttf/JetBrainsMono-Regular.ttf|SIL-OFL-1.1|JetBrainsMono-Regular.ttf

# Greek / Cyrillic Support
DejaVu Sans|https://sourceforge.net/projects/dejavu/files/dejavu/2.37/dejavu-fonts-ttf-2.37.zip|Bitstream-Vera|DejaVuSans.ttf
228
scripts/fetch-shape-corpus.sh
Executable file
@@ -0,0 +1,228 @@
#!/usr/bin/env bash
#
# fetch-shape-corpus.sh - Download open-licensed font corpus for glyph shape DB
#
# This script downloads fonts from the manifest file and copies their LICENSE
# files to build/font-licenses/. The script is idempotent - it skips downloads
# for fonts that are already present.
#
# Usage: bash scripts/fetch-shape-corpus.sh
#

set -euo pipefail

# Colors for output
readonly GREEN='\033[0;32m'
readonly YELLOW='\033[1;33m'
readonly NC='\033[0m' # No Color

log_info() {
    echo -e "${GREEN}[INFO]${NC} $1"
}

log_skip() {
    echo -e "${YELLOW}[SKIP]${NC} $1"
}

# Function to download a font
# Usage: download_font <family_name> <url> <target_file> <family_slug> <license_id>
download_font() {
    local family_name="$1"
    local url="$2"
    local target_file="$3"
    local family_slug="$4"
    local license_id="$5"

    # Create temp directory for download
    local temp_dir
    temp_dir=$(mktemp -d)
    # Quote the path so a TMPDIR containing spaces cannot split the rm arguments
    trap "rm -rf -- '${temp_dir}'" RETURN

    local filename
    filename=$(basename "${url}")

    # Download to temp
    log_info " Fetching ${filename}..."
    if ! curl -fsSL "${url}" -o "${temp_dir}/${filename}"; then
        echo " Error: Failed to download ${url}" >&2
        return 1
    fi

    local downloaded_file="${temp_dir}/${filename}"
    local target_path="${CORPUS_DIR}/${target_file}"

    # Handle different file types
    case "${filename}" in
        *.zip)
            # Unzip and find target font
            unzip -q "${downloaded_file}" -d "${temp_dir}/extracted"
            find_and_copy_font "${temp_dir}/extracted" "${target_file}" "${target_path}"
            extract_license_from_archive "${temp_dir}/extracted" "${family_slug}" "${family_name}" "${url}" "${license_id}"
            ;;
        *.tar.gz|*.tgz)
            # Extract tar.gz and find target font
            mkdir -p "${temp_dir}/extracted"
            tar -xzf "${downloaded_file}" -C "${temp_dir}/extracted"
            find_and_copy_font "${temp_dir}/extracted" "${target_file}" "${target_path}"
            extract_license_from_archive "${temp_dir}/extracted" "${family_slug}" "${family_name}" "${url}" "${license_id}"
            ;;
        *.ttf|*.otf)
            # Direct font file - just copy
            mkdir -p "$(dirname "${target_path}")"
            cp "${downloaded_file}" "${target_path}"
            log_info " Installed: ${target_file}"
            # For direct downloads there is no archive to extract a LICENSE from,
            # so create a placeholder license file with the download URL
            cat > "${LICENSE_DIR}/${family_slug}.txt" <<EOF
# ${family_name}
# Downloaded from: ${url}
# License: ${license_id}

# This font was downloaded directly as a pre-built binary file.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
EOF
            log_info " License: ${LICENSE_DIR}/${family_slug}.txt"
            ;;
        *)
            echo " Error: Unknown file type: ${filename}" >&2
            return 1
            ;;
    esac
}

# Function to find and copy font from extracted directory
find_and_copy_font() {
    local search_dir="$1"
    local target_file="$2"
    local target_path="$3"

    # Recursively find the font file
    local found
    found=$(find "${search_dir}" -type f \( -name "${target_file}" -o -name "${target_file%.ttf}.otf" -o -name "${target_file%.otf}.ttf" \) | head -1)

    if [[ -z "${found}" ]]; then
        echo " Warning: ${target_file} not found in archive, searching for similar..."
        # Try to find any .ttf or .otf file in the archive
        found=$(find "${search_dir}" -type f \( -name "*.ttf" -o -name "*.otf" \) | head -1)
        if [[ -z "${found}" ]]; then
            echo " Error: No font files found in archive"
            return 1
        fi
        echo " Using alternative: $(basename "${found}")"
    fi

    # Create target directory if needed
    mkdir -p "$(dirname "${target_path}")"
    cp "${found}" "${target_path}"
    log_info " Installed: ${target_file}"
}

# Function to extract LICENSE from archive
extract_license_from_archive() {
    local search_dir="$1"
    local family_slug="$2"
    local family_name="$3"
    local url="$4"
    local license_id="$5"

    # Look for common license file names
    local license_file
    license_file=$(find "${search_dir}" -type f \( -name "LICENSE" -o -name "LICENSE.txt" -o -name "OFL.txt" -o -name "OFL-*.txt" \) | head -1)

    if [[ -n "${license_file}" ]]; then
        cp "${license_file}" "${LICENSE_DIR}/${family_slug}.txt"
        log_info " License: ${LICENSE_DIR}/${family_slug}.txt"
    else
        # Create a placeholder if no license found
        cat > "${LICENSE_DIR}/${family_slug}.txt" <<EOF
# ${family_name}
# Downloaded from: ${url}
# License: ${license_id}

# License file not found in archive. Please refer to the source repository
# for the full license text corresponding to: ${license_id}
EOF
        log_info " License: Placeholder (${license_id})"
    fi
}

# Function to write a placeholder license file for an already-present font file
# (no license text is extracted). This is used when skipping downloads
extract_license() {
    local target_file="$1"
    local family_slug="$2"
    local family_name="$3"
    local license_id="$4"

    # Check if license already exists
    if [[ -f "${LICENSE_DIR}/${family_slug}.txt" ]]; then
        return 0
    fi

    # Create a placeholder license file
    cat > "${LICENSE_DIR}/${family_slug}.txt" <<EOF
# ${family_name}
# Source: ${target_file}
# License: ${license_id}

# This font file was already present in the corpus directory.
# For the full license text, please refer to the source repository
# and the license identifier specified above.
EOF
}

# Main script
# =============

# Paths
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WORKSPACE_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
MANIFEST_FILE="${WORKSPACE_ROOT}/build/shape-corpus-manifest.txt"
CORPUS_DIR="${WORKSPACE_ROOT}/build/shape-corpus"
LICENSE_DIR="${WORKSPACE_ROOT}/build/font-licenses"

# Create directories
mkdir -p "${CORPUS_DIR}"
mkdir -p "${LICENSE_DIR}"

# Check if manifest exists
if [[ ! -f "${MANIFEST_FILE}" ]]; then
    echo "Error: Manifest file not found: ${MANIFEST_FILE}" >&2
    exit 1
fi

# Read manifest and download fonts
# Manifest format: family_name|url|license_short_id|target_file
while IFS='|' read -r family_name url license_id target_file; do
    # Skip comments and empty lines
    [[ "${family_name}" =~ ^#.*$ ]] && continue
    [[ -z "${family_name}" ]] && continue

    # Normalize family name for filename (spaces become underscores, other non-alphanumerics are dropped)
    family_slug=$(echo "${family_name}" | tr ' ' '_' | tr -cd '[:alnum:]_')
    target_path="${CORPUS_DIR}/${target_file}"

    # Skip if already downloaded
    if [[ -f "${target_path}" ]]; then
        log_skip "${family_name} - already present"
        # Still write a placeholder LICENSE if missing
        if [[ ! -f "${LICENSE_DIR}/${family_slug}.txt" ]]; then
            log_info "Writing placeholder LICENSE for ${family_name}..."
            extract_license "${target_path}" "${family_slug}" "${family_name}" "${license_id}" || true
        fi
        continue
    fi

    log_info "Downloading ${family_name}..."
    download_font "${family_name}" "${url}" "${target_file}" "${family_slug}" "${license_id}"

done < "${MANIFEST_FILE}"

echo ""
log_info "Font corpus download complete!"
echo " Corpus dir: ${CORPUS_DIR}"
echo " License dir: ${LICENSE_DIR}"
echo ""
log_info "To generate the shape database, run:"
echo " cargo xtask gen-shape-db ${CORPUS_DIR}"

@@ -1752,13 +1752,17 @@ fn center_bitmap_32x32(bitmap: &[u8], width: usize, height: usize) -> [u8; 1024]
         return centered;
     }
 
+    // Clamp dimensions to 32x32 (crop larger glyphs)
+    let clamped_width = width.min(32);
+    let clamped_height = height.min(32);
+
     // Calculate offsets to center the bitmap
-    let x_offset = (32 - width) / 2;
-    let y_offset = (32 - height) / 2;
+    let x_offset = (32 - clamped_width) / 2;
+    let y_offset = (32 - clamped_height) / 2;
 
     // Copy bitmap into centered position
-    for y in 0..height.min(32) {
-        for x in 0..width.min(32) {
+    for y in 0..clamped_height {
+        for x in 0..clamped_width {
             let src_idx = y * width + x;
             if src_idx < bitmap.len() {
                 let dst_y = y_offset + y;
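
For context, the hunk above can be read against the following self-contained sketch of the fixed function. The lines outside the hunk are assumptions: the early-return guard for empty glyphs and the final copy into `centered` are reconstructed for illustration and may differ from the actual xtask source. With the old code, `32 - width` underflows on `usize` whenever `width > 32`; clamping first keeps the subtraction in range and crops oversized glyphs to their top-left 32×32 region.

```rust
/// Centers a row-major grayscale bitmap inside a 32x32 canvas.
/// Glyphs larger than 32 pixels in either dimension are cropped, not scaled.
pub fn center_bitmap_32x32(bitmap: &[u8], width: usize, height: usize) -> [u8; 1024] {
    let mut centered = [0u8; 1024];

    // Assumed guard (not shown in the hunk): nothing to copy for an empty glyph.
    if width == 0 || height == 0 {
        return centered;
    }

    // Clamp dimensions to 32x32 so the offset subtraction cannot underflow.
    let clamped_width = width.min(32);
    let clamped_height = height.min(32);

    // Offsets that center the (possibly cropped) bitmap.
    let x_offset = (32 - clamped_width) / 2;
    let y_offset = (32 - clamped_height) / 2;

    for y in 0..clamped_height {
        for x in 0..clamped_width {
            // The source stride is the original width, not the clamped one.
            let src_idx = y * width + x;
            if src_idx < bitmap.len() {
                let dst_y = y_offset + y;
                let dst_x = x_offset + x;
                centered[dst_y * 32 + dst_x] = bitmap[src_idx];
            }
        }
    }
    centered
}
```

A 40×40 input, which would have triggered the underflow before the fix, now fills the canvas with its top-left 32×32 pixels, while a 2×2 input lands at rows and columns 15 and 16.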