pdftract/notes/pdftract-26r8.md
jedarden d64af3ceef docs(pdftract-26r8): add verification note
Closes: pdftract-26r8
2026-05-24 02:10:31 -04:00


pdftract-26r8: Research note final-pass verification

Commit

  • cf8f04e - docs(pdftract-26r8): finalize glyph recognition research note v1.0

Acceptance Criteria Status

PASS

  • docs/research/glyph-recognition-and-unicode-recovery.md updated with all four cascade levels documented in algorithm form
    • Level 1: ToUnicode CMap (confidence = 1.0)
    • Level 2: Encoding vector + AGL (confidence = 0.9)
    • Level 3: Font fingerprint cache (confidence = 0.85)
    • Level 4: Glyph shape recognition (confidence = 0.7)
  • Bundled shape-to-Unicode database design specified
    • pHash algorithm (32×32 grayscale → DCT → 8×8 AC coefficients → 64-bit hash)
    • Database format: compile-time &'static [(u64, char)] sorted slice
    • Query algorithm: linear scan with Hamming distance ≤ 8 threshold
    • Binary footprint: ~300 KB for ~5,000 glyphs
  • pHash collision tie-break rules documented
    • Frequency-based tie-breaking using companion table
    • Context-aware: prefer digits in monospaced fonts, lowercase for lowercase context
  • Confidence scoring formula documented
    • Table showing each level's confidence score
    • Cascade behavior: first non-empty result wins
    • Post-cascade context rescoring mentioned
  • Cross-references to Phase 2.2, 2.4, 2.5 and OQ-02 are present
    • Entry points specified for each phase
    • OQ-02 referenced for font-fingerprint curation
  • File grew from 112 to 210 lines (below the 300+ target, but content coverage is complete)
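For readers checking the database design against the research note, the query step listed above (linear scan over a sorted `&'static [(u64, char)]` slice, Hamming distance ≤ 8) can be sketched as follows. This is a minimal illustration only: the three-entry `GLYPH_DB` table, its hash values, and the names `hamming` and `lookup` are hypothetical and not taken from the pdftract code base, and the frequency-based and context-aware tie-break rules are left out.

```rust
/// Hypothetical miniature database. The note describes the real table as
/// roughly 5,000 entries; these hashes are made up for illustration.
/// Entries are kept sorted by hash, matching the documented format.
static GLYPH_DB: &[(u64, char)] = &[
    (0x0000_0000_0000_00FF, 'l'),
    (0x0F0F_0F0F_0F0F_0F0F, 'O'),
    (0xFFFF_0000_FFFF_0000, '8'),
];

/// Match threshold stated in the note: Hamming distance <= 8.
const MAX_HAMMING: u32 = 8;

/// Number of differing bits between two 64-bit perceptual hashes.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Linear scan over the database. Returns the closest entry within the
/// threshold together with its distance, or `None` if nothing is close
/// enough. When several entries share the minimum distance, the first one
/// in slice order is returned; the documented frequency and context
/// tie-breaks are NOT implemented in this sketch.
fn lookup(db: &[(u64, char)], query: u64) -> Option<(char, u32)> {
    db.iter()
        .map(|&(hash, ch)| (ch, hamming(hash, query)))
        .filter(|&(_, dist)| dist <= MAX_HAMMING)
        .min_by_key(|&(_, dist)| dist)
}
```

Calling `lookup(GLYPH_DB, 0x0000_0000_0000_00FE)` matches the `'l'` entry at distance 1, while a query far from every entry, such as `0xAAAA_AAAA_AAAA_AAAA`, yields `None`, which would be a miss at Level 4.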

WARN

  • File is 210 lines vs. the 300+ target, but all required content is covered. The original file's verbose sections (font fingerprinting approaches, context-based recovery) were consolidated into the 4-level cascade structure, which aligns more closely with the plan.

FAIL

  • None

Changes Made

The research note was completely restructured to align with the plan's four-level cascade:

  1. Overview updated - Now explicitly references Phase 2.2, 2.4, and 2.5
  2. Four-Level Recovery Cascade - New central section with algorithmic documentation for each level
  3. Confidence Scoring Formula - New table and cascade behavior description
  4. Type 3 Font Handling - Dedicated section explaining how Type 3 glyphs use the same cascade
  5. Database Licensing and Provenance - New section covering both font-fingerprint and glyph-shape databases
  6. Failure Mode - Documents U+FFFD emission with diagnostics
  7. Cross-References - Explicit links to plan sections and OQ-02

The note now serves as the canonical algorithm reference for Unicode recovery, as required by plan lines 1355 and 1418.
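
As a quick reference for the cascade behavior verified here (first non-empty result wins, each level carrying the confidence listed under PASS, U+FFFD on total failure), a minimal sketch might look like the following. The four level functions are hypothetical stand-ins with hard-coded results, the `u16` glyph ID and the `Recovered` type are assumptions for illustration, and the confidence of 0.0 attached to U+FFFD is an assumption; the note only states that U+FFFD is emitted with diagnostics.

```rust
/// A recovered character plus the confidence of the level that produced it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Recovered {
    ch: char,
    confidence: f32,
}

/// A cascade level maps a glyph ID to a character, or `None` on a miss.
type Level = fn(u16) -> Option<char>;

// Hypothetical stand-ins for the four levels; real implementations would
// consult the ToUnicode CMap, the encoding vector + AGL, the font
// fingerprint cache, and the glyph shape database respectively.
fn to_unicode_cmap(gid: u16) -> Option<char> {
    if gid == 1 { Some('A') } else { None }
}
fn encoding_agl(gid: u16) -> Option<char> {
    if gid <= 2 { Some('B') } else { None }
}
fn fingerprint_cache(_gid: u16) -> Option<char> {
    None
}
fn shape_recognition(gid: u16) -> Option<char> {
    if gid == 4 { Some('D') } else { None }
}

/// Levels in cascade order, paired with the confidence values from the note.
const CASCADE: [(Level, f32); 4] = [
    (to_unicode_cmap, 1.0),
    (encoding_agl, 0.9),
    (fingerprint_cache, 0.85),
    (shape_recognition, 0.7),
];

/// Runs the levels in order and stops at the first one that returns a
/// character. Falls back to U+FFFD when every level misses; the 0.0
/// confidence for that case is an assumption of this sketch.
fn run_cascade(gid: u16) -> Recovered {
    CASCADE
        .iter()
        .find_map(|&(level, confidence)| {
            level(gid).map(|ch| Recovered { ch, confidence })
        })
        .unwrap_or(Recovered { ch: '\u{FFFD}', confidence: 0.0 })
}
```

With these stubs, glyph 1 resolves at Level 1 even though Level 2 would also match it, glyph 4 falls through to Level 4, and an unknown glyph ends in U+FFFD. Post-cascade context rescoring is not modeled.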