jedarden d0f52751ce	2026-06-07 13:43:19 -04:00
fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs

The indent trigger was using .abs() which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average, i.e., a new paragraph starting after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
font-licenses	feat(glyph-shape): implement font corpus fetch script and shape DB generation	2026-05-24 09:48:29 -04:00
shape-corpus	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
CHECKSUMS.sha256	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
font-fingerprints.json	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
frequency.json	feat(xtask): implement gen-shape-db subcommand for glyph pHash database	2026-05-24 05:40:44 -04:00
gen_fingerprint_entry.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
glyph-shapes.json	feat(glyph-shape): implement font corpus fetch script and shape DB generation	2026-05-24 09:48:29 -04:00
README.md	feat(xtask): implement gen-shape-db subcommand for glyph pHash database	2026-05-24 05:40:44 -04:00
shape-corpus-manifest.txt	feat(glyph-shape): implement font corpus fetch script and shape DB generation	2026-05-24 09:48:29 -04:00

README.md

Glyph Shape Database Generation

Overview

The cargo xtask gen-shape-db command generates a perceptual hash (pHash) database from TrueType/OpenType font files. This database is used for glyph shape recognition in PDF text extraction.

Usage

# From workspace root
cargo xtask gen-shape-db <fonts-dir> [output-path]

# Example
cargo xtask gen-shape-db /path/to/fonts build/glyph-shapes.json

Arguments

fonts-dir: Path to a directory containing .ttf or .otf font files (searched recursively)
output-path: Optional path for the generated database (default: build/glyph-shapes.json)

Font Requirements

Fonts MUST be open-licensed:

Google Fonts (Apache 2.0 / OFL)
SIL Open Font License fonts
Other permissive licenses compatible with PDF extraction

Output Format

The output is a JSON array of glyph entries:

[
  {
    "phash_hex": "0123456789abcdef",
    "char": "A",
    "source_font": "LiberationSans-Regular.ttf",
    "frequency_rank": 30
  },
  ...
]
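The README does not say how consumers match against these entries. As a minimal sketch, assuming a nearest-neighbour lookup by Hamming distance over the 64-bit hashes (a common way to compare pHashes, not necessarily what the extractor does), reading the output could look like this. The helper names `hamming` and `nearest`, the second sample entry, and the query hash are hypothetical.

```python
import json

def hamming(a_hex: str, b_hex: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")

def nearest(entries: list, query_hex: str) -> dict:
    """Return the entry whose phash_hex is closest to query_hex (assumed matching rule)."""
    return min(entries, key=lambda e: hamming(e["phash_hex"], query_hex))

# Two entries in the documented format; the second one is made up for illustration.
entries = json.loads("""
[
  {"phash_hex": "0123456789abcdef", "char": "A",
   "source_font": "LiberationSans-Regular.ttf", "frequency_rank": 30},
  {"phash_hex": "fedcba9876543210", "char": "B",
   "source_font": "LiberationSans-Regular.ttf", "frequency_rank": 47}
]
""")

# A query hash that differs from the first entry in a single bit.
match = nearest(entries, "0123456789abcdee")
print(match["char"], match["source_font"])
```

In practice the file would be loaded with `json.load` from the output path passed to the command.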

Character Frequency

The command reads character frequency rankings from build/frequency.json. If the file is not found, every character is assigned rank 0.

Format: a JSON object mapping each character to its rank, e.g. {"A": 30, "B": 47, ...}. Higher values mean more common characters.
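The fallback described above can be sketched as follows. This is an illustration, not the command's actual Rust code: the function names `load_frequency` and `rank_of` are hypothetical, and giving rank 0 to a character that is missing from an existing file is an assumption (the README only specifies the behaviour when the whole file is missing).

```python
import json
from pathlib import Path

def load_frequency(path: str = "build/frequency.json") -> dict:
    """Load the {char: rank} mapping, or return an empty mapping if the file is absent."""
    p = Path(path)
    if not p.is_file():
        return {}
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)

def rank_of(frequency: dict, ch: str) -> int:
    """Rank for a character; 0 when unknown (assumed default for missing keys)."""
    return frequency.get(ch, 0)

frequency = load_frequency()
print(rank_of(frequency, "A"))
```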

Suggested Fonts

For comprehensive coverage, use these open-licensed fonts:

Liberation Sans
DejaVu Sans
Source Code Pro
Noto Sans (covers Latin, Greek, Cyrillic)
Roboto

Example Setup

# Download Google Fonts
git clone https://github.com/google/fonts.git /tmp/fonts

# Generate database
cargo xtask gen-shape-db /tmp/fonts/ofl/liberationsans build/glyph-shapes.json

# Expected: ~5000 glyphs covering Latin, Greek, Cyrillic, symbols

License Attribution

Font license texts should be stored in build/font-licenses/ with a README.md documenting the source and license terms for each font used.

Algorithm

1. Load each font file using fontdue.
2. For each Unicode codepoint (0x0000-0xFFFF):
- Check whether the font has a glyph for the character
- Rasterize at 32x32 pixels
- Center the bitmap on a 32x32 canvas
- Compute pHash via 32x32 DCT → 8x8 low-freq coefficients → median threshold
3. Deduplicate by (pHash, char) pairs.
4. Resolve cross-character collisions (different characters with the same pHash) by keeping the higher-frequency character.
5. Sort by pHash ascending and output JSON.
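The hashing and post-processing steps can be sketched in plain Python. This is an illustration under stated assumptions, not the xtask's code: rasterization with fontdue is omitted (the input is an already centered 32x32 grid of pixel values), and the DCT normalization, the inclusion of the DC term in the median, the bit order, and the tie-breaking on equal ranks are all guesses. The names `phash_hex` and `build_entries` are hypothetical.

```python
import math
from statistics import median

K = 8  # side of the low-frequency block, per the README

def dct_1d(vec):
    """Unnormalized DCT-II of a sequence (normalization is an assumption)."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i, v in enumerate(vec))
        for k in range(n)
    ]

def dct_2d(pixels):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in pixels]
    cols = [dct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def phash_hex(pixels):
    """64-bit hash: top-left 8x8 DCT coefficients thresholded at their median."""
    coeffs = dct_2d(pixels)
    low = [coeffs[u][v] for u in range(K) for v in range(K)]
    med = median(low)
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > med else 0)  # assumed bit order: row-major, MSB first
    return f"{bits:016x}"

def build_entries(candidates, frequency):
    """candidates: iterable of (phash_hex, char, source_font) tuples.

    Keeps one entry per pHash, preferring the character with the higher
    frequency value; on a tie the first candidate seen wins (assumption).
    """
    best = {}
    for ph, ch, font in candidates:
        rank = frequency.get(ch, 0)
        current = best.get(ph)
        if current is None or rank > current["frequency_rank"]:
            best[ph] = {"phash_hex": ph, "char": ch,
                        "source_font": font, "frequency_rank": rank}
    # Fixed-width lowercase hex sorts lexicographically in numeric order.
    return [best[ph] for ph in sorted(best)]

# A 32x32 test pattern: left half lit, right half dark.
half = [[1.0 if x < 16 else 0.0 for x in range(32)] for _ in range(32)]
print(phash_hex(half))
```

Keeping a single entry per pHash covers both the (pHash, char) deduplication and the cross-character collision rule in one pass; an implementation that keeps them as separate steps would produce the same entries under these assumptions.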

Determinism

The output is byte-identical when re-run on the same input fonts and frequency data.
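One way to check this claim is to run the command twice with different output paths and compare SHA-256 digests of the results. The sketch below is a hypothetical helper, not part of the repository; the file names in the usage comment are placeholders.

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_bytes(path_a: str, path_b: str) -> bool:
    """True when both files have identical contents (compared via digest)."""
    return sha256_of(path_a) == sha256_of(path_b)

# Usage, after two runs written to placeholder paths:
#   same_bytes("run1/glyph-shapes.json", "run2/glyph-shapes.json")
```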