Add 8x8 grid decomposition for mixed-content page detection.

Implements Phase 5.1.3 hybrid detection:
- GridClassifier: 8x8 grid (64 cells) per page
- Cell classification: vector (text+validity), scanned (image, no-text), mixed
- Hybrid trigger: >=10 vector cells AND >=10 scanned cells (>=15% each)
- Returns scanned cell indexes for downstream OCR-only-on-cells routing

Acceptance criteria:
- PASS: Critical test (text header + scanned body) -> Hybrid with correct cells
- PASS: Below threshold (9+9 cells) -> NOT Hybrid
- PASS: Determinism (BTreeSet for stable serialization)
- PASS: Cells exposed for Phase 5.2 OCR routing

Refs: bead pdftract-347, plan line 1838
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Verification Note: pdftract-347
Task
5.1.3: Hybrid grid-cell evaluator (8x8 decomposition + >=15% rule)
Summary
Implemented the per-region Hybrid evaluator that detects mixed-content pages by 8x8 grid decomposition. The implementation is in crates/pdftract-core/src/classify.rs and includes all required types and tests.
Acceptance Criteria
PASS: Critical test - hybrid page with text header (top 2 rows) + scanned body (bottom 6 rows)
- Test: `test_critical_hybrid_page_text_header_scanned_body`
- Result: PASS
- Verifies:
  - Classification is `PageClass::Hybrid`
  - `hybrid_cells` contains exactly 48 cells (6 rows × 8 cols)
  - All scanned cells are from rows 2-7 only (no vector header cells included)
PASS: Unit test - below threshold (9 vector + 9 scanned cells)
- Test: `test_grid_classifier_below_threshold`
- Result: PASS
- Verifies:
  - Page is NOT classified as Hybrid (below 10-cell threshold)
  - `hybrid_cells` is None for non-Hybrid pages
PASS: Determinism - classify twice produces byte-identical serialization
- Test: `test_determinism_classify_twice`
- Result: PASS
- Uses `BTreeSet` (not `HashSet`) for deterministic ordering
- Verifies JSON serialization is byte-identical across runs
PASS: Cells exposed for 5.2 OCR routing
- `PageClassification.hybrid_cells: Option<BTreeSet<usize>>`
- Contains flat cell indices (0-63) for scanned cells
- Ready for downstream OCR-only-on-cells routing in Phase 5.2
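The determinism criterion rests on `BTreeSet` iterating in sorted key order, so the same set of cell indices always renders identically regardless of insertion order. A minimal sketch of that property follows; `collect_cells` and `render_cells` are hypothetical helpers written for illustration (the hand-rolled renderer stands in for whatever JSON serializer the crate actually uses), not functions from `classify.rs`.

```rust
use std::collections::BTreeSet;

/// Collects flat cell indices (0-63) into an ordered set.
/// Hypothetical helper for illustration; duplicates collapse automatically.
fn collect_cells(indices: &[usize]) -> BTreeSet<usize> {
    indices.iter().copied().collect()
}

/// Renders the set as a JSON-style array.
/// Stand-in for a real serializer: BTreeSet iterates in ascending order,
/// so the output depends only on the set's contents.
fn render_cells(cells: &BTreeSet<usize>) -> String {
    let parts: Vec<String> = cells.iter().map(|i| i.to_string()).collect();
    format!("[{}]", parts.join(","))
}
```

Two sets built from the same indices in different insertion orders render to the same string, which is the property a byte-identical serialization check depends on. A `HashSet` gives no such ordering guarantee.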
Implementation Details
Grid Decomposition
- 8 rows × 8 cols = 64 cells
- Cell index: `row * 8 + col` (0-63)
- Row 0 = top of page (after rotation applied)
- Col 0 = left of page
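The indexing scheme above can be sketched as a pair of small functions. This is an illustration under stated assumptions, not the code in `classify.rs`: the signature of `point_to_cell` shown here is a guess, coordinates are assumed to be post-rotation with a top-left origin and y growing downward, and `cell_to_row_col` is a hypothetical inverse helper.

```rust
const GRID_DIM: usize = 8;

/// Maps a post-rotation point to a flat cell index (row * 8 + col).
/// Assumed signature; assumes top-left origin with y growing downward.
/// Returns None for points outside the page or for a degenerate page size.
fn point_to_cell(x: f64, y: f64, width: f64, height: f64) -> Option<usize> {
    if !(width > 0.0 && height > 0.0) {
        return None;
    }
    if !(0.0..=width).contains(&x) || !(0.0..=height).contains(&y) {
        return None;
    }
    // Clamp so points exactly on the right or bottom edge land in the last cell.
    let col = (((x / width) * GRID_DIM as f64) as usize).min(GRID_DIM - 1);
    let row = (((y / height) * GRID_DIM as f64) as usize).min(GRID_DIM - 1);
    Some(row * GRID_DIM + col)
}

/// Hypothetical inverse: recovers (row, col) from a flat index in 0..64.
fn cell_to_row_col(index: usize) -> (usize, usize) {
    (index / GRID_DIM, index % GRID_DIM)
}
```

On a 612 × 792 page, the top-left corner maps to cell 0 and the bottom-right corner to cell 63. Index 16 decodes to row 2, column 0, the first cell below a two-row header.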
Cell Classification Rules
- Vector: `text_op_count > 0 AND char_validity > 0.6`
- Scanned: `image_coverage > 0.80 AND text_op_count == 0`
- Mixed: neither condition met (empty or ambiguous)
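The three rules translate directly into a small decision function. The sketch below uses the field names from the rules, but the `CellStats` struct, the `CellClass` enum, and the field types are assumptions made for illustration; the actual types in `classify.rs` may differ.

```rust
/// Assumed per-cell classification outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellClass {
    Vector,
    Scanned,
    Mixed,
}

/// Assumed per-cell statistics; field names follow the rules above.
struct CellStats {
    text_op_count: u32,
    char_validity: f64,
    image_coverage: f64,
}

/// Applies the Vector and Scanned rules in order; anything else is Mixed.
fn classify_cell(stats: &CellStats) -> CellClass {
    if stats.text_op_count > 0 && stats.char_validity > 0.6 {
        CellClass::Vector
    } else if stats.image_coverage > 0.80 && stats.text_op_count == 0 {
        CellClass::Scanned
    } else {
        // Empty cells and ambiguous cells (for example text with low validity) land here.
        CellClass::Mixed
    }
}
```

Both comparisons are strict, so a cell with `char_validity` of exactly 0.6 or `image_coverage` of exactly 0.80 falls through to Mixed. The two rules cannot both match, because Vector requires text operators and Scanned forbids them.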
Hybrid Detection Rule
- Hybrid when: `vector_cell_count >= 10 AND scanned_cell_count >= 10`
- Confidence: `min(vector_ratio, scanned_ratio)` where `ratio = count / 64`
- Returns `hybrid_cells` set containing scanned cell indexes
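The 10-cell threshold is the smallest count that satisfies the 15% rule on a 64-cell grid: 10 / 64 ≈ 15.6%, while 9 / 64 ≈ 14.1%. A minimal sketch of the detection rule follows, assuming the 64 cells have already been classified. `evaluate_hybrid`, its return shape, and the `CellClass` enum are hypothetical names chosen for illustration rather than the API in `classify.rs`.

```rust
use std::collections::BTreeSet;

/// Assumed per-cell classification outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellClass {
    Vector,
    Scanned,
    Mixed,
}

const CELL_COUNT: usize = 64;
const MIN_CELLS: usize = 10;

/// Returns (confidence, scanned cell indices) when the page is hybrid,
/// or None when either count is below the 10-cell threshold.
fn evaluate_hybrid(cells: &[CellClass; CELL_COUNT]) -> Option<(f64, BTreeSet<usize>)> {
    let vector = cells.iter().filter(|c| **c == CellClass::Vector).count();
    // BTreeSet keeps the scanned indices in ascending order.
    let scanned: BTreeSet<usize> = cells
        .iter()
        .enumerate()
        .filter(|(_, c)| **c == CellClass::Scanned)
        .map(|(i, _)| i)
        .collect();

    if vector >= MIN_CELLS && scanned.len() >= MIN_CELLS {
        let ratio = |n: usize| n as f64 / CELL_COUNT as f64;
        let confidence = ratio(vector).min(ratio(scanned.len()));
        Some((confidence, scanned))
    } else {
        None
    }
}
```

For the critical test layout (rows 0-1 vector, rows 2-7 scanned), this sketch yields 48 scanned indices and a confidence of min(16/64, 48/64) = 0.25. With 9 vector and 9 scanned cells it returns None.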
Rotation Handling
- `GridClassifier` stores rotation (0, 90, 180, 270)
- Width/height are expected to be post-rotation values
- Coordinates should be transformed by rotation matrix before `point_to_cell()`
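One way to perform the coordinate transform is sketched below. The conventions are assumptions, since this note does not specify them: points use a top-left origin with y growing downward, rotation is clockwise, and the `width` and `height` arguments here are the unrotated page dimensions. `rotate_point` and `rotated_size` are hypothetical helpers, not functions from `classify.rs`.

```rust
/// Maps a point from unrotated page space into post-rotation space.
/// Assumes top-left origin, y down, clockwise rotation.
/// `width` and `height` are the UNROTATED page dimensions.
/// Returns None for rotations other than 0, 90, 180, or 270.
fn rotate_point(x: f64, y: f64, width: f64, height: f64, rotation: u16) -> Option<(f64, f64)> {
    match rotation {
        0 => Some((x, y)),
        90 => Some((height - y, x)),
        180 => Some((width - x, height - y)),
        270 => Some((y, width - x)),
        _ => None,
    }
}

/// Post-rotation page size: width and height swap for 90 and 270 degrees.
fn rotated_size(width: f64, height: f64, rotation: u16) -> (f64, f64) {
    match rotation {
        90 | 270 => (height, width),
        _ => (width, height),
    }
}
```

Under these assumptions, a 90-degree rotation sends the original bottom-left corner to the new top-left, so it lands in row 0, column 0 of the grid. The rotated point and the rotated size are what a cell lookup would then receive.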
Test Results
running 32 tests
test classify::tests::test_critical_hybrid_page_text_header_scanned_body ... ok
test classify::tests::test_grid_classifier_below_threshold ... ok
test classify::tests::test_determinism_classify_twice ... ok
test classify::tests::test_grid_classifier_hybrid_detection ... ok
test classify::tests::test_exactly_10_cells_threshold ... ok
... (28 more classify tests) ...
test result: ok. 32 passed; 0 failed
Files Modified/Created
- `crates/pdftract-core/src/classify.rs` (new file, 705 lines)
- `crates/pdftract-core/src/lib.rs` (already exports `classify` module)
No WARN Items
All acceptance criteria met without environmental blockers.