The indent trigger was using .abs(), which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average, i.e., a new paragraph starting after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
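The difference between the old and new check can be sketched as below. This is a minimal illustration, not the actual patch: the function names and the `f32` coordinate type are assumptions, and only the expression `line_x0 - block_avg_x0 > threshold` comes from the commit message.

```rust
/// Hypothetical helper mirroring the fixed check: split only when the
/// current line starts further right than the block average.
fn indent_triggers_split(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    // Signed difference: positive only for increased indent.
    line_x0 - block_avg_x0 > threshold
}

/// Hypothetical reconstruction of the old behaviour, kept for comparison:
/// the absolute value also fires when the line moves back to the left.
fn indent_triggers_split_abs(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    (line_x0 - block_avg_x0).abs() > threshold
}
```

With an indented first line at x0 = 72 and a flush-left continuation at x0 = 54, the signed check leaves the paragraph in one block, while the `.abs()` variant would split it.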
Glyph Shape Database Generation
Overview
The cargo xtask gen-shape-db command generates a perceptual hash (pHash) database
from TrueType/OpenType font files. This database is used for glyph shape recognition
in PDF text extraction.
Usage
# From workspace root
cargo xtask gen-shape-db <fonts-dir> [output-path]
# Example
cargo xtask gen-shape-db /path/to/fonts build/glyph-shapes.json
Arguments
- fonts-dir: Path to directory containing .ttf or .otf font files (recursively searched)
- output-path: Optional output path (default: build/glyph-shapes.json)
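As a rough sketch of how these two arguments could be handled, assuming the task receives them as a plain slice of strings: the function names below are hypothetical and are not taken from the xtask source, and the case-insensitive extension match is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Default output path, as documented above.
const DEFAULT_OUTPUT: &str = "build/glyph-shapes.json";

/// Hypothetical argument parser: one required fonts-dir, one optional output-path.
fn parse_gen_shape_db_args(args: &[&str]) -> Result<(PathBuf, PathBuf), String> {
    match args {
        [fonts_dir] => Ok((PathBuf::from(*fonts_dir), PathBuf::from(DEFAULT_OUTPUT))),
        [fonts_dir, output] => Ok((PathBuf::from(*fonts_dir), PathBuf::from(*output))),
        _ => Err("usage: cargo xtask gen-shape-db <fonts-dir> [output-path]".to_string()),
    }
}

/// Hypothetical filter for the recursive search: accepts .ttf and .otf.
/// Matching the extension case-insensitively is an assumption.
fn is_font_file(path: &Path) -> bool {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some(ext) => ext.eq_ignore_ascii_case("ttf") || ext.eq_ignore_ascii_case("otf"),
        None => false,
    }
}
```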
Font Requirements
Fonts MUST be open-licensed:
- Google Fonts (Apache 2.0 / OFL)
- SIL Open Font License fonts
- Other permissive licenses compatible with PDF extraction
Output Format
The output is a JSON array of glyph entries:
[
  {
    "phash_hex": "0123456789abcdef",
    "char": "A",
    "source_font": "LiberationSans-Regular.ttf",
    "frequency_rank": 30
  },
  ...
]
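The phash_hex field in the example is 16 hex digits, which matches a 64-bit hash. A minimal sketch of converting between that string and a `u64`, assuming lowercase, zero-padded hex as in the example (the helper names are hypothetical):

```rust
/// Format a 64-bit hash as 16 lowercase hex digits, zero-padded.
fn phash_to_hex(phash: u64) -> String {
    format!("{:016x}", phash)
}

/// Parse a 16-digit hex string back into a 64-bit hash.
/// Returns None for the wrong length or for non-hex characters.
fn phash_from_hex(hex: &str) -> Option<u64> {
    if hex.len() != 16 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u64::from_str_radix(hex, 16).ok()
}
```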
Character Frequency
The command reads build/frequency.json for character frequency rankings.
If not found, all characters are assigned rank 0.
Format: {"A": 30, "B": 47, ...} where higher values = more common.
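The fallback can be sketched as follows. The function name is hypothetical, and treating a character that is missing from an existing table the same way as a missing file (rank 0) is an assumption; the text above only specifies the missing-file case.

```rust
use std::collections::HashMap;

/// Look up a character's frequency rank.
/// `table` is None when build/frequency.json was not found.
/// Assumption: characters absent from a present table also get rank 0.
fn frequency_rank(table: Option<&HashMap<char, u32>>, ch: char) -> u32 {
    table.and_then(|t| t.get(&ch).copied()).unwrap_or(0)
}
```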
Suggested Fonts
For comprehensive coverage, use these open-licensed fonts:
- Liberation Sans
- DejaVu Sans
- Source Code Pro
- Noto Sans (covers Latin, Greek, Cyrillic)
- Roboto
Example Setup
# Download Google Fonts
git clone https://github.com/google/fonts.git /tmp/fonts
# Generate database
cargo xtask gen-shape-db /tmp/fonts/ofl/liberationsans build/glyph-shapes.json
# Expected: ~5000 glyphs covering Latin, Greek, Cyrillic, symbols
License Attribution
Font license texts should be stored in build/font-licenses/ with a README.md
documenting the source and license terms for each font used.
Algorithm
- Load each font file using fontdue
- For each Unicode codepoint (0x0000-0xFFFF):
  - Check if font has a glyph for the character
  - Rasterize at 32x32 pixels
  - Center the bitmap on a 32x32 canvas
  - Compute pHash via 32x32 DCT → 8x8 low-freq coefficients → median threshold
- Deduplicate by (pHash, char) pairs
- Handle cross-character collisions (same pHash, different characters) by keeping the higher-frequency character
- Sort by pHash ascending and output JSON
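The hashing step above (32x32 DCT → 8x8 low-frequency coefficients → median threshold) can be sketched in plain Rust as below. This covers only the hash computation, not font loading or rasterization. It is a sketch under assumptions that may differ from the real command: the DC coefficient is included in the 64 values, the median is the mean of the two middle values, a bit is set when a coefficient is strictly above the median, and bits are packed most-significant first.

```rust
use std::f64::consts::PI;

const N: usize = 32; // canvas width and height
const K: usize = 8; // side of the low-frequency block that is kept

/// Sketch of a DCT-based perceptual hash over a 32x32 grayscale canvas
/// (row-major, one byte per pixel).
fn phash_32x32(pixels: &[u8; N * N]) -> u64 {
    // Orthonormal DCT-II scale factors.
    let alpha = |k: usize| {
        if k == 0 {
            (1.0 / N as f64).sqrt()
        } else {
            (2.0 / N as f64).sqrt()
        }
    };

    // cos_table[u][x] = cos((2x + 1) * u * pi / (2N)), only for the K lowest frequencies.
    let mut cos_table = [[0.0f64; N]; K];
    for u in 0..K {
        for x in 0..N {
            cos_table[u][x] = ((2 * x + 1) as f64 * u as f64 * PI / (2.0 * N as f64)).cos();
        }
    }

    // Compute only the top-left KxK block of the 2D DCT.
    let mut coeffs = [0.0f64; K * K];
    for v in 0..K {
        for u in 0..K {
            let mut sum = 0.0;
            for y in 0..N {
                for x in 0..N {
                    sum += pixels[y * N + x] as f64 * cos_table[u][x] * cos_table[v][y];
                }
            }
            coeffs[v * K + u] = alpha(u) * alpha(v) * sum;
        }
    }

    // Median of the 64 coefficients (mean of the two middle values).
    let mut sorted = coeffs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[K * K / 2 - 1] + sorted[K * K / 2]) / 2.0;

    // One bit per coefficient: 1 if strictly above the median.
    let mut hash = 0u64;
    for (i, &c) in coeffs.iter().enumerate() {
        if c > median {
            hash |= 1u64 << (63 - i);
        }
    }
    hash
}
```

An all-black canvas hashes to 0 under these assumptions, because every coefficient equals the median and none is strictly above it.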
Determinism
The output is byte-identical when re-run on the same input fonts and frequency data.
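One way to get this property is to make the collision rule independent of input order and to emit entries from an ordered map. The sketch below illustrates that idea only. The `Entry` type, the function names, the tie-breaking rule (smaller character, then font name), and keeping one entry per pHash are all assumptions, not the command's actual code.

```rust
use std::cmp::Reverse;
use std::collections::BTreeMap;

/// Hypothetical in-memory form of one output entry.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Entry {
    phash: u64,
    ch: char,
    source_font: String,
    frequency_rank: u32,
}

/// True if `cand` should replace `cur` for the same pHash.
/// Higher frequency rank wins; ties fall back to the smaller character,
/// then the smaller font name, so the winner never depends on input order.
fn is_better(cand: &Entry, cur: &Entry) -> bool {
    (cand.frequency_rank, Reverse(cand.ch), Reverse(&cand.source_font))
        > (cur.frequency_rank, Reverse(cur.ch), Reverse(&cur.source_font))
}

/// Keep one entry per pHash and return them sorted by pHash ascending.
fn dedup_and_sort(entries: Vec<Entry>) -> Vec<Entry> {
    let mut best: BTreeMap<u64, Entry> = BTreeMap::new();
    for e in entries {
        let replace = match best.get(&e.phash) {
            Some(cur) => is_better(&e, cur),
            None => true,
        };
        if replace {
            best.insert(e.phash, e);
        }
    }
    // BTreeMap iterates in key order, so the result is sorted by pHash.
    best.into_values().collect()
}
```

Because `BTreeMap` iterates in key order and the comparison is a total order over the entry fields, feeding the same entries in a different order yields the same output vector.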