pdftract/notes/pdftract-2iur.md
jedarden b30f6d0603 feat(pdftract-2iur): implement nearest-neighbor scanner with Hamming distance and frequency tie-break
Implement the Level 4 glyph shape lookup function with:
- HAMMING_MAX constant (8) per plan line 1442
- Exact match optimization via binary search fast path
- Frequency tie-breaking for equal Hamming distances
- frequency_table() helper for FREQ_TABLE access

Closes: pdftract-2iur

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 06:57:27 -04:00


Verification Note: pdftract-2iur

Bead

Runtime nearest-neighbor scanner with Hamming distance + frequency tie-break

Changes Made

File: crates/pdftract-core/src/font/shape.rs

1. Added HAMMING_MAX constant

  • Added module-level constant HAMMING_MAX: u32 = 8 per plan specification (line 1442)
  • Updated ShapeMatch::is_acceptable() to use the constant instead of hardcoded value
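A minimal sketch of how the constant and the acceptance check could fit together. The field layout of `ShapeMatch`, the 64-bit pHash width, the inclusive comparison, and the `hamming` helper are assumptions made for illustration; the note only names `HAMMING_MAX` and `ShapeMatch::is_acceptable()`.

```rust
/// Maximum Hamming distance (in bits) at which a match is still accepted.
pub const HAMMING_MAX: u32 = 8;

/// Hypothetical result type; the real struct in shape.rs may carry other fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShapeMatch {
    pub ch: char,
    pub distance: u32,
}

impl ShapeMatch {
    /// Assumes the threshold is inclusive: distance 8 passes, distance 9 does not.
    pub fn is_acceptable(&self) -> bool {
        self.distance <= HAMMING_MAX
    }
}

/// Number of differing bits between two pHashes (assumed to be 64 bits wide).
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Referencing the constant from `is_acceptable()` keeps the threshold in one place, so the scanner and the acceptance check cannot drift apart.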

2. Implemented exact match optimization

  • Added binary search fast path at the start of lookup_shape()
  • Uses SHAPE_TABLE.binary_search_by_key() to find exact pHash matches
  • Returns immediately with distance 0 on exact match (avoids linear scan)
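The fast path described above can be sketched as follows. The `(u64, char)` entry layout and the `exact_match` name are hypothetical; the sketch also assumes the table is sorted ascending by pHash, which `binary_search_by_key` requires.

```rust
/// Hypothetical table entry: (pHash, character), sorted ascending by pHash.
type Entry = (u64, char);

/// Return the character whose pHash equals `phash` exactly, if any.
/// The slice must be sorted by pHash; on an unsorted slice the result is unspecified.
fn exact_match(table: &[Entry], phash: u64) -> Option<char> {
    table
        .binary_search_by_key(&phash, |&(hash, _)| hash)
        .ok() // Err(_) only reports an insertion point, which is not needed here
        .map(|idx| table[idx].1)
}
```

An exact hit takes O(log n) comparisons, so the O(n) linear scan runs only when no entry matches exactly.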

3. Implemented frequency tie-breaking

  • Added frequency_table() helper function to access FREQ_TABLE
  • Modified linear scan to track best_idx instead of just best_match
  • When distances are tied, compares frequency ranks from FREQ_TABLE
  • Lower rank (more common character) wins the tie-break
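The tie-break can be illustrated with a small linear scan that tracks the best index, as described above. This is a sketch under stated assumptions, not the code in shape.rs: the `nearest` name, the `(u64, char)` entry layout, and passing the frequency ranks as a parallel `&[u32]` slice are all hypothetical stand-ins for `SHAPE_TABLE` and `FREQ_TABLE`.

```rust
const HAMMING_MAX: u32 = 8;

/// Find the entry nearest to `phash`, breaking distance ties by frequency rank.
/// `freq_ranks[i]` is the rank of `shapes[i]`; a lower rank means a more common character.
fn nearest(shapes: &[(u64, char)], freq_ranks: &[u32], phash: u64) -> Option<(char, u32)> {
    assert_eq!(shapes.len(), freq_ranks.len(), "tables must be parallel");
    let mut best: Option<(usize, u32)> = None; // (index, distance)
    for (idx, &(hash, _)) in shapes.iter().enumerate() {
        let dist = (hash ^ phash).count_ones();
        let better = match best {
            None => true,
            Some((best_idx, best_dist)) => {
                // Strictly closer wins; on equal distance the lower rank wins.
                dist < best_dist
                    || (dist == best_dist && freq_ranks[idx] < freq_ranks[best_idx])
            }
        };
        if better {
            best = Some((idx, dist));
        }
    }
    // Reject the winner if it lies beyond the threshold (assumed inclusive).
    best.filter(|&(_, dist)| dist <= HAMMING_MAX)
        .map(|(idx, dist)| (shapes[idx].1, dist))
}
```

Because an equal distance with an equal rank never replaces the current best, the earliest entry wins in that case, which keeps the result deterministic for a fixed table order.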

4. Updated documentation

  • Enhanced lookup_shape() docstring with full algorithm description
  • Added performance notes and invariants
  • Documented the exact match optimization and tie-breaking behavior

5. Added comprehensive tests

  • test_lookup_shape_exact_match: Verifies binary search fast path
  • test_lookup_shape_hamming_threshold: Verifies threshold enforcement
  • test_lookup_shape_frequency_tiebreak: Verifies tie-breaking logic
  • test_lookup_shape_deterministic: Verifies deterministic output
  • test_frequency_table_parallel_to_shape_table: Verifies table alignment
  • test_hamming_max_constant: Verifies constant value
  • test_lookup_shape_nearest_neighbor: Verifies nearest-neighbor search

Acceptance Criteria

  • A pHash matching an entry exactly returns that entry's char
  • A pHash differing by 4 bits from one entry returns that entry's char
  • A pHash differing by 9 bits from every entry returns None (HAMMING_MAX threshold)
  • Ties broken by frequency rank: more common character (lower rank) wins
  • Empty SHAPE_TABLE returns None
  • Benchmark: lookup_shape over 5000 entries < 50 µs (design target per plan)
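One rough way to check the benchmark criterion is a wall-clock timing of a full linear scan. The sketch below uses a synthetic table of pseudo-random hashes rather than real glyph data, and `synthetic_table`, `min_distance`, and `average_scan_time` are hypothetical helpers, not functions from the crate. A benchmarking harness would give more reliable numbers than this.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Build `n` pseudo-random 64-bit hashes (synthetic data, not real glyph pHashes).
fn synthetic_table(n: usize) -> Vec<u64> {
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    (0..n)
        .map(|_| {
            // One xorshift64 step; any cheap generator would do here.
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state
        })
        .collect()
}

/// Smallest Hamming distance from `phash` to any entry, or None for an empty table.
fn min_distance(table: &[u64], phash: u64) -> Option<u32> {
    table.iter().map(|&h| (h ^ phash).count_ones()).min()
}

/// Average wall-clock time of one full scan. `iters` must be non-zero.
fn average_scan_time(table: &[u64], iters: u32) -> Duration {
    let start = Instant::now();
    for i in 0..iters {
        // black_box keeps the optimizer from eliding the scan.
        black_box(min_distance(black_box(table), black_box(u64::from(i))));
    }
    start.elapsed() / iters
}
```

Calling `average_scan_time(&synthetic_table(5000), 1_000)` in a release build gives a figure to compare against the 50 µs target; debug builds will be considerably slower.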

Test Results

All 24 shape module tests pass:

test result: ok. 24 passed; 0 failed; 0 ignored; 0 measured; 1427 filtered out

Git Commit

Commit: feat(pdftract-2iur): implement nearest-neighbor scanner with Hamming distance and frequency tie-break

Files modified:

  • crates/pdftract-core/src/font/shape.rs (added HAMMING_MAX constant, exact match optimization, frequency tie-breaking, new tests)