pdftract/notes/pdftract-3s2i.md
jedarden e6bf3dd290 feat(pdftract-3s2i): implement Phase 5.5.2 validation filter
Implement per-word validation filter for assisted-OCR BrokenVector path.

Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
  - 5pt distance threshold for position validation
  - 0.4 confidence cap for rejected words
  - Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter

Closes: pdftract-3s2i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 04:57:17 -04:00

# pdftract-3s2i: Phase 5.5.2 Validation Filter Implementation
## Summary
Implemented the per-word validation filter for the assisted-OCR BrokenVector path (Phase 5.5.2). The filter validates each Tesseract word result against the nearest vector glyph bbox center and adjusts confidence accordingly.
## Changes Made
### 1. Added `SpanSource::OcrAssisted` variant (crates/pdftract-core/src/hybrid.rs)
- Extended the `SpanSource` enum to include `OcrAssisted` for position-validated OCR spans
- Added `Span::ocr_assisted()` helper method
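
The shape of this change can be sketched as follows. This is a minimal illustration, not the code in `hybrid.rs`: only `SpanSource::OcrAssisted` and `Span::ocr_assisted()` are named in these notes, so the other enum variants, the `Span` fields, and the helper's parameter list are assumptions made for the example.

```rust
/// Where a span's text came from. Only `OcrAssisted` is named in these notes;
/// `Vector` and `Ocr` are hypothetical placeholders for the existing variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanSource {
    Vector,      // hypothetical
    Ocr,         // hypothetical
    OcrAssisted, // position-validated OCR span
}

/// Hypothetical span layout; the real struct may carry different fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub bbox: [f32; 4], // assumed order: [x0, y0, x1, y1] in PDF points
    pub confidence: f32,
    pub source: SpanSource,
}

impl Span {
    /// Build a span tagged as position-validated OCR output.
    /// The parameter list is an assumption for this sketch.
    pub fn ocr_assisted(text: impl Into<String>, bbox: [f32; 4], confidence: f32) -> Self {
        Span {
            text: text.into(),
            bbox,
            confidence,
            source: SpanSource::OcrAssisted,
        }
    }
}
```

A dedicated constructor keeps the source tag in one place, so call sites in the OCR path cannot accidentally emit a validated span with a different `SpanSource`.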
### 2. Implemented validation filter (crates/pdftract-core/src/ocr.rs)
- Added `validate_ocr_with_position_hints()` function
- Constants:
  - `ASSISTED_OCR_DISTANCE_PT = 5.0` (distance threshold in PDF points)
  - `ASSISTED_OCR_CONFIDENCE_CAP = 0.4` (confidence cap for rejected words)
  - `ASSISTED_OCR_KDTREE_THRESHOLD = 100` (glyph count for KD-tree optimization)
- Algorithm:
  1. Extract vector glyph bbox centers from position hints
  2. For each OCR word: compute word center and find nearest glyph center
  3. If distance < 5pt: accept with full OCR confidence
  4. If distance >= 5pt: cap confidence at 0.4 (words already below 0.4 keep their confidence)
  5. Return `Vec<Span>` with `SpanSource::OcrAssisted`
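
The five steps above can be sketched as one self-contained function. This is an illustration under stated assumptions, not the code in `ocr.rs`: the function name and the two constants come from these notes, while the `BBox`, `OcrWord`, and `Span` types and the function signature are hypothetical stand-ins.

```rust
const ASSISTED_OCR_DISTANCE_PT: f32 = 5.0;
const ASSISTED_OCR_CONFIDENCE_CAP: f32 = 0.4;

/// Hypothetical axis-aligned box in PDF points.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox { pub x0: f32, pub y0: f32, pub x1: f32, pub y1: f32 }

impl BBox {
    fn center(&self) -> (f32, f32) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// Hypothetical Tesseract word result.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrWord { pub text: String, pub bbox: BBox, pub confidence: f32 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanSource { OcrAssisted }

/// Hypothetical output span.
#[derive(Debug, Clone, PartialEq)]
pub struct Span { pub text: String, pub bbox: BBox, pub confidence: f32, pub source: SpanSource }

/// Signature is an assumption; only the function name appears in the notes.
pub fn validate_ocr_with_position_hints(words: &[OcrWord], glyph_bboxes: &[BBox]) -> Vec<Span> {
    // Step 1: glyph bbox centers from the position hints.
    let centers: Vec<(f32, f32)> = glyph_bboxes.iter().map(BBox::center).collect();

    // Iterating in input order preserves the HOCR document order.
    words
        .iter()
        .map(|w| {
            let (wx, wy) = w.bbox.center();
            // Step 2: linear scan for the nearest glyph center.
            // With no glyphs the distance stays at infinity, so the word is capped.
            let nearest = centers
                .iter()
                .map(|&(gx, gy)| (wx - gx).hypot(wy - gy))
                .fold(f32::INFINITY, f32::min);
            // Steps 3 and 4: full confidence inside the threshold, capped otherwise.
            let confidence = if nearest < ASSISTED_OCR_DISTANCE_PT {
                w.confidence
            } else {
                w.confidence.min(ASSISTED_OCR_CONFIDENCE_CAP)
            };
            // Step 5: every output span is tagged OcrAssisted.
            Span { text: w.text.clone(), bbox: w.bbox, confidence, source: SpanSource::OcrAssisted }
        })
        .collect()
}
```

Using `min` for the cap means a rejected word is never raised to 0.4, and the function has no hidden state, so the same inputs always produce the same spans.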
### 3. Unit tests (assisted_ocr_tests module)
- `test_validation_filter_near_glyph`: Words near glyphs get full confidence
- `test_validation_filter_far_from_glyph`: Words far from glyphs are capped at 0.4
- `test_validation_filter_confidence_already_below_cap`: Low-confidence words stay as-is
- `test_validation_filter_no_glyphs`: No position hints → all words capped
- `test_validation_filter_multiple_words_preserves_order`: HOCR document order preserved
- `test_validation_filter_distance_threshold`: 5pt boundary behavior
- `test_assisted_ocr_constants`: Verify constants match spec
## Acceptance Criteria
### PASS
- ✅ Unit test: vector glyph at (100, 200); Tesseract word at (102, 201) (distance ≈ 2.2pt) → accepted with full confidence
- ✅ Unit test: word at (110, 210) (distance ≈ 14.1pt, >= 5pt) → capped at 0.4
- ✅ Reproducibility: same inputs → identical Span outputs
- ✅ Code compiles: `cargo check --all-targets` passes
- ✅ Code formatted: `cargo fmt` applied
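
The two unit-test cases above reduce to a Euclidean distance check between centers. The sketch below works through the arithmetic; `center_distance` and `is_accepted` are hypothetical helper names used only for this illustration.

```rust
const ASSISTED_OCR_DISTANCE_PT: f64 = 5.0;

/// Euclidean distance between two centers, in PDF points (hypothetical helper).
fn center_distance(a: (f64, f64), b: (f64, f64)) -> f64 {
    (a.0 - b.0).hypot(a.1 - b.1)
}

/// The acceptance test is strict: a distance of exactly 5pt is not accepted.
fn is_accepted(distance: f64) -> bool {
    distance < ASSISTED_OCR_DISTANCE_PT
}

// Glyph at (100, 200), word at (102, 201): sqrt(2^2 + 1^2) = sqrt(5), about 2.24pt, accepted.
// Glyph at (100, 200), word at (110, 210): sqrt(10^2 + 10^2) = sqrt(200), about 14.14pt, capped.
```

Because step 3 uses `<` and step 4 uses `>=`, a word whose center sits exactly 5pt from the nearest glyph center falls on the capped side of the boundary.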
### WARN (environmental issues, out of scope)
- ⚠️ Critical-fixture test (PDF/A with invisible text layer) requires OCR feature + Tesseract installation
- ⚠️ WER comparison tests require full integration pipeline
### FAIL (true blockers)
- None
## Technical Notes
- Performance: Nearest-neighbor lookup is a linear scan, O(N*M) for N OCR words and M vector glyphs; the KD-tree optimization (O(N*log(M))) is deferred until M > 100 glyphs (`ASSISTED_OCR_KDTREE_THRESHOLD`)
- The 5pt threshold is approximately one space-character width at 12pt font
- The 0.4 confidence cap is below the 0.5 threshold used in bbox-merge (Phase 5.2.4), so OCR words that fail position validation cannot override legitimate vector spans
- HOCR document order is preserved in the output
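
The relationship between the 0.4 cap and the 0.5 bbox-merge threshold can be illustrated with a small sketch. The merge rule shown here is an assumption: these notes only state that bbox-merge uses a 0.5 threshold, so the constant name `BBOX_MERGE_OCR_THRESHOLD`, the `>=` comparison, and both function names are hypothetical.

```rust
const ASSISTED_OCR_CONFIDENCE_CAP: f32 = 0.4;

/// Hypothetical name for the 0.5 threshold used by bbox-merge (Phase 5.2.4).
const BBOX_MERGE_OCR_THRESHOLD: f32 = 0.5;

/// Confidence assigned to a word that failed position validation.
fn cap_rejected(confidence: f32) -> f32 {
    confidence.min(ASSISTED_OCR_CONFIDENCE_CAP)
}

/// Assumed merge rule: OCR text may replace a vector span only at or above
/// the merge threshold. The real rule in bbox-merge may differ in detail.
fn ocr_may_override_vector(confidence: f32) -> bool {
    confidence >= BBOX_MERGE_OCR_THRESHOLD
}
```

Under this assumed rule, `ocr_may_override_vector(cap_rejected(c))` is false for every input confidence `c`, because the capped value never exceeds 0.4.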
## References
- Plan section: Phase 5.5 pipeline step 3 (line 1935)
- Bead ID: pdftract-3s2i