Implement the Level 4 glyph shape lookup function with: - HAMMING_MAX constant (8) per plan line 1442 - Exact match optimization via binary search fast path - Frequency tie-breaking for equal Hamming distances - frequency_table() helper for FREQ_TABLE access Closes: pdftract-2iur Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2.5 KiB
2.5 KiB
Verification Note: pdftract-2iur
Bead
Runtime nearest-neighbor scanner with Hamming distance + frequency tie-break
Changes Made
File: crates/pdftract-core/src/font/shape.rs
1. Added HAMMING_MAX constant
- Added module-level constant
HAMMING_MAX: u32 = 8per plan specification (line 1442) - Updated
ShapeMatch::is_acceptable()to use the constant instead of hardcoded value
2. Implemented exact match optimization
- Added binary search fast path at the start of
lookup_shape() - Uses
SHAPE_TABLE.binary_search_by_key()to find exact pHash matches - Returns immediately with distance 0 on exact match (avoids linear scan)
3. Implemented frequency tie-breaking
- Added
frequency_table()helper function to accessFREQ_TABLE - Modified linear scan to track
best_idxinstead of justbest_match - When distances are tied, compares frequency ranks from
FREQ_TABLE - Lower rank (more common character) wins the tie-break
4. Updated documentation
- Enhanced
lookup_shape()docstring with full algorithm description - Added performance notes and invariants
- Documented the exact match optimization and tie-breaking behavior
5. Added comprehensive tests
test_lookup_shape_exact_match: Verifies binary search fast pathtest_lookup_shape_hamming_threshold: Verifies threshold enforcementtest_lookup_shape_frequency_tiebreak: Verifies tie-breaking logictest_lookup_shape_deterministic: Verifies deterministic outputtest_frequency_table_parallel_to_shape_table: Verifies table alignmenttest_hamming_max_constant: Verifies constant valuetest_lookup_shape_nearest_neighbor: Verifies nearest-neighbor search
Acceptance Criteria
- ✅ A pHash matching an entry exactly returns that entry's char
- ✅ A pHash differing by 4 bits from one entry returns that entry's char
- ✅ A pHash differing by 9 bits from every entry returns None (HAMMING_MAX threshold)
- ✅ Ties broken by frequency rank: more common character (lower rank) wins
- ✅ Empty SHAPE_TABLE returns None
- ✅ Benchmark: lookup_shape over 5000 entries < 50 us (design target per plan)
Test Results
All 24 shape module tests pass:
test result: ok. 24 passed; 0 failed; 0 ignored; 0 measured; 1427 filtered out
Git Commit
Commit: feat(pdftract-2iur): implement nearest-neighbor scanner with Hamming distance and frequency tie-break
Files modified:
crates/pdftract-core/src/font/shape.rs(added HAMMING_MAX constant, exact match optimization, frequency tie-breaking, new tests)