Implement Phase 5.2.4 Hybrid page handling:
- OcrCallback trait for OCR abstraction
- process_hybrid_page() main entry point
- Cell rendering: render once, crop per cell
- Merge rule: IoU > 0.5 + vector_conf >= 0.5 -> vector wins

Tests:
- OCR runs only on scanned cells (48 not 64)
- IoU 0.6 -> vector kept
- IoU 0.3 -> both kept
- IoU 0.6 + low vector conf -> OCR kept
- No duplicate text from overlap

All 40 hybrid tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-4y9l: Hybrid Page Routing Implementation
Summary
Implemented Phase 5.2.4 Hybrid page handling pipeline with per-cell OCR routing and bbox overlap merge rule.
Changes Made
File: crates/pdftract-core/src/hybrid.rs
Added imports:
- `PageClass`, to fix a compilation error
Added types and functions:
- `OcrCallback` trait - Abstracts the OCR implementation (Phase 5.3 preprocessing + 5.4 Tesseract)
- `MockOcrCallback` struct - Mock OCR callback for testing that tracks call counts
- `process_hybrid_page()` - Main entry point for hybrid page handling
Added tests:
- `test_process_hybrid_page_ocr_only_on_scanned_cells()` - Verifies OCR runs only on scanned cells (48 cells, not 64)
- `test_process_hybrid_page_no_duplicate_text_from_overlap()` - Verifies no duplicate text from overlapping regions
- `test_process_hybrid_page_low_vector_confidence_ocr_wins()` - Verifies OCR is preferred over a low-confidence vector span
- `test_process_hybrid_page_non_hybrid_classification()` - Verifies non-hybrid pages skip OCR
- `test_process_hybrid_page_empty_hybrid_cells()` - Verifies an empty hybrid_cells list skips OCR
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| OCR runs only on scanned cells | PASS | Test test_process_hybrid_page_ocr_only_on_scanned_cells verifies 48 calls for 6 rows, not 64 for full page |
| Merge unit tests (IoU 0.6, 0.3, 0.6 with low confidence) | PASS | Tests test_merge_iou_06_vector_kept, test_merge_iou_03_both_kept, test_merge_iou_06_low_vector_confidence_ocr_kept |
| No duplicate text from overlap | PASS | Test test_process_hybrid_page_no_duplicate_text_from_overlap verifies single span after merge |
| Performance (Hybrid < Scanned by 30%) | WARN | Not yet verified; requires an integration benchmark with an actual PDF fixture |
Implementation Details
Cell Rendering Strategy
- Render full page once at selected DPI
- Crop per cell from rendered raster (cheaper than re-rendering)
- Cell dimensions: `cell_w = page_w_px / 8`, `cell_h = page_h_px / 8`
- Cell coordinates: `[c*cell_w, r*cell_h, (c+1)*cell_w, (r+1)*cell_h]`
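
The crop arithmetic above can be sketched as follows. This is a minimal illustration, not the code in `hybrid.rs`: the function name `cell_rect`, the `PxRect` alias, and the use of `u32` pixel units are assumptions made for the example.

```rust
/// Pixel rectangle as [x0, y0, x1, y1], following the layout in the notes.
/// (Hypothetical alias, introduced only for this sketch.)
type PxRect = [u32; 4];

/// The notes describe an 8x8 grid.
const GRID: u32 = 8;

/// Compute the crop rectangle for cell (row, col) of the rendered page raster.
/// Hypothetical helper; the real function name and signature may differ.
fn cell_rect(page_w_px: u32, page_h_px: u32, row: u32, col: u32) -> PxRect {
    let cell_w = page_w_px / GRID;
    let cell_h = page_h_px / GRID;
    [
        col * cell_w,
        row * cell_h,
        (col + 1) * cell_w,
        (row + 1) * cell_h,
    ]
}

fn main() {
    // A 1600x2400 px raster gives 200x300 px cells.
    let rect = cell_rect(1600, 2400, 2, 3);
    println!("{:?}", rect); // [600, 600, 800, 900]
}
```

With integer division, a page whose pixel size is not a multiple of 8 leaves a few remainder pixels outside the last row and column in this sketch; whether the actual implementation handles that case is not stated in these notes.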
Merge Rule (IoU-based)
- For each OCR span O: find vector span V with IoU(O.bbox, V.bbox) > 0.5
- If found AND V.confidence >= 0.5: drop O (vector wins)
- If found AND V.confidence < 0.5: keep O (OCR preferred over bad vector)
- If not found: keep O
- Return all V + retained O sorted by reading order
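
A rough sketch of the merge rule, under stated assumptions: the `Span` and `Source` shapes, the `[x0, y0, x1, y1]` bbox layout, and the names `iou`, `span`, and `merge` are illustrative and may not match `hybrid.rs`. When several vector spans overlap one OCR span, this sketch drops the OCR span if any of them is confident enough; the notes do not say how that case is resolved. The final reading-order sort is left out to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    Vector,
    Ocr,
}

/// Hypothetical span shape; bbox is assumed to be [x0, y0, x1, y1].
#[derive(Debug, Clone)]
struct Span {
    text: String,
    bbox: [f32; 4],
    confidence: f32,
    source: Source,
}

/// Convenience constructor for the example.
fn span(text: &str, bbox: [f32; 4], confidence: f32, source: Source) -> Span {
    Span { text: text.to_string(), bbox, confidence, source }
}

/// Intersection over union of two axis-aligned boxes.
fn iou(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let w = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let h = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = w * h;
    let union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Keep every vector span; keep an OCR span unless a confident vector span
/// (confidence >= 0.5) overlaps it with IoU > 0.5.
fn merge(vector: Vec<Span>, ocr: Vec<Span>) -> Vec<Span> {
    let kept_ocr: Vec<Span> = ocr
        .into_iter()
        .filter(|o| {
            !vector
                .iter()
                .any(|v| iou(&o.bbox, &v.bbox) > 0.5 && v.confidence >= 0.5)
        })
        .collect();
    let mut out = vector;
    out.extend(kept_ocr);
    out // reading-order sort omitted in this sketch
}

fn main() {
    // IoU = 60 / 100 = 0.6 and the vector span is confident: OCR is dropped.
    let v = span("Total", [0.0, 0.0, 10.0, 10.0], 0.9, Source::Vector);
    let o = span("Tota1", [0.0, 0.0, 10.0, 6.0], 0.8, Source::Ocr);
    for s in merge(vec![v], vec![o]) {
        println!("{:?} {}", s.source, s.text);
    }
}
```

Note that a low-confidence vector span is still returned alongside the OCR span that overlaps it, since the rule returns all V plus the retained O.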
IoU Formula
IoU = area(A ∩ B) / area(A ∪ B)
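
For axis-aligned boxes the two areas reduce to a few min/max operations. The expansion below assumes each bbox is stored as $[x_0, y_0, x_1, y_1]$ with $x_0 \le x_1$ and $y_0 \le y_1$; that layout is an assumption for illustration, not something these notes confirm.

$$
\begin{aligned}
w &= \max\bigl(0,\ \min(x_1^A, x_1^B) - \max(x_0^A, x_0^B)\bigr) \\
h &= \max\bigl(0,\ \min(y_1^A, y_1^B) - \max(y_0^A, y_0^B)\bigr) \\
\operatorname{area}(A \cap B) &= w \cdot h \\
\operatorname{area}(A \cup B) &= \operatorname{area}(A) + \operatorname{area}(B) - w \cdot h
\end{aligned}
$$

As an illustrative example, take $A = [0, 0, 10, 10]$ and $B = [5, 0, 15, 10]$. Then $w = 5$, $h = 10$, the intersection is $50$, the union is $100 + 100 - 50 = 150$, and $\mathrm{IoU} = 50 / 150 \approx 0.33$. That is below the 0.5 threshold, so under the merge rule both spans would be kept.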
Reading Order
Spans are sorted top-to-bottom, then left-to-right (descending Y, then ascending X in PDF coordinates).
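
One way to express this ordering is a two-key comparator. The sketch below assumes each span exposes a single `(x, y)` anchor point; the `Span` struct and the `reading_order` function are hypothetical and stand in for whatever `hybrid.rs` actually uses.

```rust
use std::cmp::Ordering;

/// Hypothetical minimal span: text plus an anchor point in PDF coordinates,
/// where Y grows upward from the bottom of the page.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    x: f32,
    y: f32,
}

/// Descending Y first (top of page first), then ascending X (left to right).
fn reading_order(a: &Span, b: &Span) -> Ordering {
    b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x))
}

fn main() {
    let mut spans = vec![
        Span { text: "world".into(), x: 50.0, y: 700.0 },
        Span { text: "footer".into(), x: 10.0, y: 50.0 },
        Span { text: "Hello".into(), x: 10.0, y: 700.0 },
    ];
    spans.sort_by(reading_order);
    let texts: Vec<&str> = spans.iter().map(|s| s.text.as_str()).collect();
    println!("{:?}", texts); // ["Hello", "world", "footer"]
}
```

Because the comparison on Y is exact, spans on the same visual line with slightly different Y values would not be grouped by this sketch; a tolerance band is a common refinement, though these notes do not say whether the implementation uses one.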
Test Results
All 40 hybrid tests pass:
running 40 tests
test hybrid::tests::test_compute_cell_crops ... ok
test hybrid::tests::test_compute_iou_contained ... ok
test hybrid::tests::test_compute_iou_half_overlap ... ok
test hybrid::tests::test_compute_iou_identical ... ok
test hybrid::tests::test_compute_iou_no_overlap ... ok
test hybrid::tests::test_get_hybrid_cells_non_hybrid ... ok
test hybrid::tests::test_get_hybrid_cells_with_cells ... ok
test hybrid::tests::test_merge_iou_03_both_kept ... ok
test hybrid::tests::test_merge_iou_06_low_vector_confidence_ocr_kept ... ok
test hybrid::tests::test_merge_iou_06_vector_kept ... ok
test hybrid::tests::test_merge_multiple_ocr_spans ... ok
test hybrid::tests::test_merge_no_overlap ... ok
test hybrid::tests::test_merge_reading_order ... ok
test hybrid::tests::test_merge_sorting ... ok
test hybrid::tests::test_process_hybrid_page_empty_hybrid_cells ... ok
test hybrid::tests::test_crop_cell_from_page ... ok
test hybrid::tests::test_process_hybrid_page_low_vector_confidence_ocr_wins ... ok
test hybrid::tests::test_process_hybrid_page_non_hybrid_classification ... ok
test hybrid::tests::test_process_hybrid_page_no_duplicate_text_from_overlap ... ok
test hybrid::tests::test_span_dimensions ... ok
test hybrid::tests::test_span_new ... ok
test hybrid::tests::test_span_ocr ... ok
test hybrid::tests::test_span_source_equality ... ok
test hybrid::tests::test_span_vector ... ok
test hybrid::tests::test_process_hybrid_page_ocr_only_on_scanned_cells ... ok
test result: ok. 40 passed; 0 failed; 0 ignored; 0 measured; 929 filtered out; finished in 0.06s
Reusable Patterns
- Callback trait for external dependency: `OcrCallback` trait abstracts the Tesseract dependency for testing
- Atomic call tracking: `Arc<AtomicUsize>` for counting calls across test boundaries
- Cell-based grid processing: 8x8 grid with flat index mapping `(row, col) -> row*8 + col`
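
The three patterns combine roughly as shown below. The trait method `recognize`, its parameters, and the helpers `flat_index` and `ocr_cells` are assumptions made for this sketch; the real `OcrCallback` signature in `hybrid.rs` may differ. The choice of rows 0 to 5 as the scanned rows is also only illustrative.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical trait shape: abstracts the OCR engine behind a callback.
trait OcrCallback {
    fn recognize(&self, cell_index: usize, pixels: &[u8]) -> String;
}

/// Mock that counts calls through a shared atomic counter, so a test can
/// read the count after handing the mock to the code under test.
struct MockOcrCallback {
    calls: Arc<AtomicUsize>,
}

impl OcrCallback for MockOcrCallback {
    fn recognize(&self, cell_index: usize, _pixels: &[u8]) -> String {
        self.calls.fetch_add(1, Ordering::SeqCst);
        format!("cell-{cell_index}")
    }
}

/// Flat index into the 8x8 grid.
fn flat_index(row: usize, col: usize) -> usize {
    row * 8 + col
}

/// Run OCR only on the listed (row, col) cells. Pixel data is left empty here.
fn ocr_cells(ocr: &dyn OcrCallback, scanned: &[(usize, usize)]) -> Vec<String> {
    scanned
        .iter()
        .map(|&(r, c)| ocr.recognize(flat_index(r, c), &[]))
        .collect()
}

fn main() {
    let calls = Arc::new(AtomicUsize::new(0));
    let mock = MockOcrCallback { calls: Arc::clone(&calls) };

    // 6 scanned rows x 8 columns = 48 cells, versus 64 for the full grid.
    let scanned: Vec<(usize, usize)> = (0..6)
        .flat_map(|r| (0..8).map(move |c| (r, c)))
        .collect();

    let out = ocr_cells(&mock, &scanned);
    println!("{} {}", out.len(), calls.load(Ordering::SeqCst)); // 48 48
}
```

Sharing the counter through `Arc` lets the test keep its own handle after the mock is passed by reference or moved elsewhere, which is presumably why the notes call it out as reusable.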