pdftract/notes/pdftract-3s2i.md
jedarden e6bf3dd290 feat(pdftract-3s2i): implement Phase 5.5.2 validation filter
Implement per-word validation filter for assisted-OCR BrokenVector path.

Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
  - 5pt distance threshold for position validation
  - 0.4 confidence cap for rejected words
  - Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter

Closes: pdftract-3s2i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 04:57:17 -04:00


pdftract-3s2i: Phase 5.5.2 Validation Filter Implementation

Summary

Implemented the per-word validation filter for the assisted-OCR BrokenVector path (Phase 5.5.2). The filter compares the center of each Tesseract word with the nearest vector glyph bbox center: words within 5 pt keep their full OCR confidence, and all other words have their confidence capped at 0.4.

Changes Made

1. Added SpanSource::OcrAssisted variant (crates/pdftract-core/src/hybrid.rs)

  • Extended the SpanSource enum to include OcrAssisted for position-validated OCR spans
  • Added Span::ocr_assisted() helper method
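
As a rough illustration of this change, the sketch below shows what the new variant and helper might look like. Only `SpanSource::OcrAssisted` and `Span::ocr_assisted()` are named in these notes; the other variants, the `Span` fields, and the helper's parameters are assumptions made for the example and may not match hybrid.rs.

```rust
/// Origin of a span's text. Only `OcrAssisted` comes from these notes;
/// `Vector` and `Ocr` are assumed stand-ins for the existing variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanSource {
    Vector,
    Ocr,
    /// OCR text whose position was checked against vector glyph hints.
    OcrAssisted,
}

/// Simplified span; the field names and bbox layout are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    /// Assumed layout: [x0, y0, x1, y1] in PDF points.
    pub bbox: [f32; 4],
    pub confidence: f32,
    pub source: SpanSource,
}

impl Span {
    /// Builds a span tagged as position-validated OCR output.
    pub fn ocr_assisted(text: impl Into<String>, bbox: [f32; 4], confidence: f32) -> Self {
        Span {
            text: text.into(),
            bbox,
            confidence,
            source: SpanSource::OcrAssisted,
        }
    }
}
```

A dedicated constructor keeps the source tag in one place, so callers in the validation filter cannot accidentally emit an assisted span with a different `SpanSource`.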

2. Implemented validation filter (crates/pdftract-core/src/ocr.rs)

  • Added validate_ocr_with_position_hints() function
  • Constants:
    • ASSISTED_OCR_DISTANCE_PT = 5.0 (distance threshold in PDF points)
    • ASSISTED_OCR_CONFIDENCE_CAP = 0.4 (confidence cap for rejected words)
    • ASSISTED_OCR_KDTREE_THRESHOLD = 100 (glyph count above which the deferred KD-tree lookup is meant to replace the linear scan)
  • Algorithm:
    1. Extract vector glyph bbox centers from position hints
    2. For each OCR word: compute the word center and find the nearest glyph center (linear scan)
    3. If distance < 5pt: accept with full OCR confidence
    4. If distance >= 5pt, or there are no glyph centers at all: cap confidence at 0.4, i.e. min(OCR confidence, 0.4)
    5. Return Vec<Span> with SpanSource::OcrAssisted
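
The steps above can be sketched as follows. This is a minimal illustration, not the code in ocr.rs: the types `Point`, `OcrWord`, and `ValidatedWord` and the function name `validate_words_sketch` are hypothetical, and the sketch returns a small stand-in struct instead of `Vec<Span>`. The constants use the values listed above.

```rust
pub const ASSISTED_OCR_DISTANCE_PT: f32 = 5.0;
pub const ASSISTED_OCR_CONFIDENCE_CAP: f32 = 0.4;

/// A point in PDF points (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub x: f32,
    pub y: f32,
}

/// One OCR word, reduced to what the filter needs (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct OcrWord {
    pub text: String,
    pub center: Point,
    pub confidence: f32,
}

/// Stand-in for the `Span` the real function returns (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct ValidatedWord {
    pub text: String,
    pub confidence: f32,
}

/// Linear scan: distance from `p` to the closest glyph center,
/// or `None` when there are no glyph centers.
fn nearest_distance(p: Point, glyph_centers: &[Point]) -> Option<f32> {
    glyph_centers
        .iter()
        .map(|g| (g.x - p.x).hypot(g.y - p.y))
        .min_by(|a, b| a.total_cmp(b))
}

/// Keeps the OCR confidence for words strictly within the distance
/// threshold; otherwise caps it. Output order follows input order.
pub fn validate_words_sketch(words: &[OcrWord], glyph_centers: &[Point]) -> Vec<ValidatedWord> {
    words
        .iter()
        .map(|w| {
            let accepted = matches!(
                nearest_distance(w.center, glyph_centers),
                Some(d) if d < ASSISTED_OCR_DISTANCE_PT
            );
            let confidence = if accepted {
                w.confidence
            } else {
                w.confidence.min(ASSISTED_OCR_CONFIDENCE_CAP)
            };
            ValidatedWord { text: w.text.clone(), confidence }
        })
        .collect()
}
```

Using `min` for the cap means a rejected word that was already below 0.4 is left unchanged, and mapping over the input slice preserves the order in which the words arrived.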

3. Unit tests (assisted_ocr_tests module)

  • test_validation_filter_near_glyph: Words near glyphs get full confidence
  • test_validation_filter_far_from_glyph: Words far from glyphs are capped at 0.4
  • test_validation_filter_confidence_already_below_cap: Low-confidence words stay as-is
  • test_validation_filter_no_glyphs: No position hints → all words capped
  • test_validation_filter_multiple_words_preserves_order: HOCR document order preserved
  • test_validation_filter_distance_threshold: 5pt boundary behavior
  • test_assisted_ocr_constants: Verify constants match spec

Acceptance Criteria

PASS

  • Unit test: vector glyph at (100, 200); Tesseract word at (102, 201) → accepted full conf
  • Unit test: word at (110, 210) (distance > 5 pt) → cap at 0.4
  • Reproducibility: same inputs → identical Span outputs
  • Code compiles: cargo check --all-targets passes
  • Code formatted: cargo fmt applied
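
The two coordinate cases above can be checked by hand with the Euclidean distance between centers. The helper names below are hypothetical and exist only to make the arithmetic explicit.

```rust
/// Euclidean distance between two centers given as (x, y) in PDF points.
pub fn center_distance(a: (f32, f32), b: (f32, f32)) -> f32 {
    (a.0 - b.0).hypot(a.1 - b.1)
}

/// Strict comparison, matching "distance < 5pt: accept" in the algorithm.
pub fn is_accepted(distance_pt: f32) -> bool {
    distance_pt < 5.0
}
```

For the glyph at (100, 200) and the word at (102, 201), the distance is sqrt(2² + 1²) = sqrt(5) ≈ 2.24 pt, so the word is accepted. For the word at (110, 210), it is sqrt(10² + 10²) = sqrt(200) ≈ 14.14 pt, so the confidence is capped. Because the comparison is strict, a word exactly 5 pt away is also capped.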

WARN (environmental issues, out of scope)

  • ⚠️ Critical-fixture test (PDF/A with invisible text layer) requires OCR feature + Tesseract installation
  • ⚠️ WER comparison tests require full integration pipeline

FAIL (true blockers)

  • None

Technical Notes

  • Performance: nearest-neighbor lookup is a linear scan for now, O(NM) for N OCR words and M glyphs; the KD-tree optimization (O(N log M)) is deferred until M > 100 glyphs (ASSISTED_OCR_KDTREE_THRESHOLD)
  • The 5pt threshold is approximately one space-character width at 12pt font
  • The 0.4 confidence cap is below the 0.5 threshold used in bbox-merge (Phase 5.2.4), so OCR words that fail position validation won't override legitimate vector spans
  • HOCR document order is preserved in the output
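
To give a feel for the linear-scan versus KD-tree trade-off noted above, the sketch below estimates the number of distance evaluations for N OCR words and M glyphs. The function names are hypothetical, and the KD-tree figure is only an order-of-magnitude estimate: it ignores the cost of building the tree and all constant factors.

```rust
/// Distance evaluations for a linear scan: every word against every glyph.
pub fn linear_scan_cost(words: usize, glyphs: usize) -> usize {
    words * glyphs
}

/// Rough estimate for KD-tree lookups: about log2(M) node visits per word.
/// Tree construction and constant factors are deliberately ignored.
pub fn kd_tree_cost(words: usize, glyphs: usize) -> f64 {
    words as f64 * (glyphs.max(1) as f64).log2()
}
```

For 200 words and 100 glyphs, the linear scan performs 20,000 distance evaluations, while the KD-tree estimate is about 200 × log2(100) ≈ 1,330. On pages with only a handful of glyphs the gap shrinks to almost nothing, which is consistent with keeping the simpler scan for small inputs.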

References

  • Plan section: Phase 5.5 pipeline step 3 (line 1935)
  • Bead ID: pdftract-3s2i