Add cargo xtask gen-shape-db command that walks font directories,
rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs
build/glyph-shapes.json.
Implementation details:
- Fontdue integration for TrueType/OpenType font loading
- 32x32 bitmap rasterization with centering
- DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold)
- Character frequency data for collision resolution
- Deduplication by (phash, char) pairs
- Cross-character collision handling (keep higher-frequency char)
- Sorted output by pHash ascending
Artifacts:
- build/frequency.json: Character frequency rankings
- build/README.md: Command documentation and usage
Acceptance criteria:
- ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON
- ✅ Deterministic output (byte-identical on same inputs)
- ✅ Fontdue integration and 32x32 rasterization
- ✅ pHash computation via DCT
- ⚠️ No system fonts for full integration test (documented)
Closes: pdftract-2aq0
Implement per-word validation filter for assisted-OCR BrokenVector path.
Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
- 5pt distance threshold for position validation
- 0.4 confidence cap for rejected words
- Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter
Closes: pdftract-3s2i
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.
Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields
Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences
All 27 tests pass including proptests for fuzzing robustness (INV-8).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>