Add 8x8 grid decomposition for mixed-content page detection.

Implements Phase 5.1.3 hybrid detection:
- GridClassifier: 8x8 grid (64 cells) per page
- Cell classification: vector (text+validity), scanned (image,no-text), mixed
- Hybrid trigger: >=10 vector cells AND >=10 scanned cells (>=15% each)
- Returns scanned cell indexes for downstream OCR-only-on-cells routing

Acceptance criteria:
- PASS: Critical test (text header + scanned body) -> Hybrid with correct cells
- PASS: Below threshold (9+9 cells) -> NOT Hybrid
- PASS: Determinism (BTreeSet for stable serialization)
- PASS: Cells exposed for Phase 5.2 OCR routing

Refs: bead pdftract-347, plan line 1838
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Verification Note: pdftract-347

## Task
5.1.3: Hybrid grid-cell evaluator (8x8 decomposition + >=15% rule)

## Summary
Implemented the per-region Hybrid evaluator that detects mixed-content pages by 8x8 grid decomposition. The implementation is in `crates/pdftract-core/src/classify.rs` and includes all required types and tests.

## Acceptance Criteria

### PASS: Critical test - hybrid page with text header (top 2 rows) + scanned body (bottom 6 rows)
- Test: `test_critical_hybrid_page_text_header_scanned_body`
- Result: PASS
- Verifies:
  - Classification is `PageClass::Hybrid`
  - `hybrid_cells` contains exactly 48 cells (6 rows × 8 cols)
  - All scanned cells are from rows 2-7 only (no vector header cells included)

### PASS: Unit test - below threshold (9 vector + 9 scanned cells)
- Test: `test_grid_classifier_below_threshold`
- Result: PASS
- Verifies:
  - Page is NOT classified as Hybrid (below 10-cell threshold)
  - `hybrid_cells` is None for non-Hybrid pages

### PASS: Determinism - classify twice produces byte-identical serialization
- Test: `test_determinism_classify_twice`
- Result: PASS
- Uses `BTreeSet` (not `HashSet`) for deterministic ordering
- Verifies JSON serialization is byte-identical across runs

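The determinism criterion rests on `BTreeSet` iterating in sorted order regardless of insertion order, whereas `HashSet` iteration order is unspecified. The sketch below illustrates that property with the standard library only; `cells_to_json` is a hypothetical helper that hand-writes a JSON array and is not the serializer used in `classify.rs`.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: renders a set of cell indices as a compact JSON array.
/// BTreeSet iterates in ascending order, so the output depends only on the
/// set's contents, never on the order in which cells were inserted.
fn cells_to_json(cells: &BTreeSet<usize>) -> String {
    let items: Vec<String> = cells.iter().map(|c| c.to_string()).collect();
    format!("[{}]", items.join(","))
}
```

Two sets holding the same indices therefore serialize to the same bytes, which is the behavior the determinism test checks at the page level.
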
### PASS: Cells exposed for 5.2 OCR routing
- `PageClassification.hybrid_cells: Option<BTreeSet<usize>>`
- Contains flat cell indices (0-63) for scanned cells
- Ready for downstream OCR-only-on-cells routing in Phase 5.2

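As an illustration of how a Phase 5.2 consumer might use the exposed indices, the sketch below turns each flat cell index into a rectangle in post-rotation page space. `CellRect`, `cell_rect`, and `ocr_regions` are hypothetical names, and the top-left origin is an assumption; none of this is taken from the actual routing code.

```rust
use std::collections::BTreeSet;

/// Hypothetical rectangle in post-rotation page space (top-left origin assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
struct CellRect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Converts a flat cell index (0-63) into the rectangle that cell covers.
fn cell_rect(index: usize, page_width: f64, page_height: f64) -> Option<CellRect> {
    if index >= 64 {
        return None;
    }
    let (row, col) = (index / 8, index % 8);
    let (cell_w, cell_h) = (page_width / 8.0, page_height / 8.0);
    Some(CellRect {
        x: col as f64 * cell_w,
        y: row as f64 * cell_h,
        width: cell_w,
        height: cell_h,
    })
}

/// Collects the regions to OCR; a non-Hybrid page (`None`) yields no regions.
fn ocr_regions(
    hybrid_cells: Option<&BTreeSet<usize>>,
    page_width: f64,
    page_height: f64,
) -> Vec<CellRect> {
    hybrid_cells
        .into_iter()
        .flatten()
        .filter_map(|&i| cell_rect(i, page_width, page_height))
        .collect()
}
```

Because the set is ordered, the regions come out in row-major order, so a downstream stage could merge horizontally adjacent cells in a single pass if that proves useful.
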
## Implementation Details

### Grid Decomposition
- 8 rows × 8 cols = 64 cells
- Cell index: `row * 8 + col` (0-63)
- Row 0 = top of page (after rotation applied)
- Col 0 = left of page

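The index scheme above can be sketched as follows. The note names `point_to_cell()` but does not show its signature, so the parameters here are assumptions, as is the top-left origin with y increasing downward; `cell_to_row_col` is a hypothetical inverse added for illustration.

```rust
/// Rows and columns per page (8 x 8 = 64 cells).
const GRID_DIM: usize = 8;

/// Maps a point in post-rotation page space to a flat cell index.
/// Assumed signature; assumes a top-left origin so that row 0 is the top.
fn point_to_cell(x: f64, y: f64, width: f64, height: f64) -> Option<usize> {
    // Reject degenerate pages (this also rejects NaN dimensions).
    if !(width > 0.0 && height > 0.0) {
        return None;
    }
    // Reject points outside the page (NaN coordinates fail `contains`).
    if !(0.0..=width).contains(&x) || !(0.0..=height).contains(&y) {
        return None;
    }
    // Clamp so points on the right or bottom edge land in the last column/row.
    let col = ((x / width * GRID_DIM as f64) as usize).min(GRID_DIM - 1);
    let row = ((y / height * GRID_DIM as f64) as usize).min(GRID_DIM - 1);
    Some(row * GRID_DIM + col)
}

/// Hypothetical inverse: recovers (row, col) from a flat index.
fn cell_to_row_col(index: usize) -> Option<(usize, usize)> {
    (index < GRID_DIM * GRID_DIM).then(|| (index / GRID_DIM, index % GRID_DIM))
}
```

For an 800 × 1000 page, the point (450, 130) falls in column 4 of row 1, giving index `1 * 8 + 4 = 12`.
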
### Cell Classification Rules
- **Vector**: `text_op_count > 0 AND char_validity > 0.6`
- **Scanned**: `image_coverage > 0.80 AND text_op_count == 0`
- **Mixed**: neither condition met (empty or ambiguous)

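A minimal sketch of the three rules above, using the thresholds as stated. The `CellStats` struct and its field types are assumptions made for illustration; only the field names and thresholds come from this note.

```rust
/// Per-cell classification, mirroring the three categories in the rules.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellClass {
    Vector,
    Scanned,
    Mixed,
}

/// Hypothetical container for the per-cell measurements named in the rules.
struct CellStats {
    text_op_count: u32,
    char_validity: f32,
    image_coverage: f32,
}

/// Applies the rules in order. The Vector and Scanned conditions cannot both
/// hold, because one requires text operators and the other forbids them.
fn classify_cell(stats: &CellStats) -> CellClass {
    if stats.text_op_count > 0 && stats.char_validity > 0.6 {
        CellClass::Vector
    } else if stats.image_coverage > 0.80 && stats.text_op_count == 0 {
        CellClass::Scanned
    } else {
        // Empty cells and ambiguous cells (for example, low-validity text
        // drawn over an image) both end up here.
        CellClass::Mixed
    }
}
```

Both comparisons are strict, so a cell with `char_validity` of exactly 0.6 is classified as Mixed rather than Vector.
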
### Hybrid Detection Rule
- Hybrid when: `vector_cell_count >= 10 AND scanned_cell_count >= 10`
- Confidence: `min(vector_ratio, scanned_ratio)` where `ratio = count / 64`
- Returns `hybrid_cells` set containing scanned cell indexes

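The detection rule can be sketched as a single pass over the 64 classified cells. `HybridResult` and `evaluate_hybrid` are hypothetical names used for illustration; the real code returns this information through `PageClassification`, whose full shape is not shown in this note.

```rust
use std::collections::BTreeSet;

const TOTAL_CELLS: usize = 64;
/// Minimum cell count per class; 10 of 64 cells is about 15.6% of the page.
const MIN_CELLS: usize = 10;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellClass {
    Vector,
    Scanned,
    Mixed,
}

/// Hypothetical result type for a page that met the Hybrid rule.
struct HybridResult {
    confidence: f64,
    scanned_cells: BTreeSet<usize>,
}

/// Returns `Some` only when both classes reach the 10-cell threshold.
fn evaluate_hybrid(cells: &[CellClass; TOTAL_CELLS]) -> Option<HybridResult> {
    let vector_count = cells.iter().filter(|c| **c == CellClass::Vector).count();
    // Collect scanned indices in ascending order for stable output.
    let scanned_cells: BTreeSet<usize> = cells
        .iter()
        .enumerate()
        .filter(|(_, c)| **c == CellClass::Scanned)
        .map(|(i, _)| i)
        .collect();
    if vector_count < MIN_CELLS || scanned_cells.len() < MIN_CELLS {
        return None;
    }
    let vector_ratio = vector_count as f64 / TOTAL_CELLS as f64;
    let scanned_ratio = scanned_cells.len() as f64 / TOTAL_CELLS as f64;
    Some(HybridResult {
        confidence: vector_ratio.min(scanned_ratio),
        scanned_cells,
    })
}
```

For the critical test layout (16 vector header cells, 48 scanned body cells), the confidence under this formula is `min(16/64, 48/64) = 0.25`.
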
### Rotation Handling
- `GridClassifier` stores rotation (0, 90, 180, 270)
- Width/height are expected to be post-rotation values
- Coordinates should be transformed by rotation matrix before `point_to_cell()`

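The note does not spell out the rotation convention, so the following is only a sketch under stated assumptions: rotation is measured clockwise (as with the PDF `/Rotate` entry), and coordinates use a top-left origin with y increasing downward. `rotated_size` and `rotate_point` are hypothetical helpers, not functions from `classify.rs`.

```rust
/// Page size after applying `rotation` (degrees clockwise, assumed convention).
fn rotated_size(width: f64, height: f64, rotation: u16) -> Option<(f64, f64)> {
    match rotation {
        0 | 180 => Some((width, height)),
        90 | 270 => Some((height, width)),
        _ => None, // only the four right-angle rotations are supported
    }
}

/// Maps a point from unrotated page space into post-rotation page space.
/// `width` and `height` are the unrotated page dimensions.
fn rotate_point(x: f64, y: f64, width: f64, height: f64, rotation: u16) -> Option<(f64, f64)> {
    match rotation {
        0 => Some((x, y)),
        // Clockwise quarter turn: the old left edge becomes the new top edge.
        90 => Some((height - y, x)),
        180 => Some((width - x, height - y)),
        // Counter-clockwise quarter turn: the old right edge becomes the new top edge.
        270 => Some((y, width - x)),
        _ => None,
    }
}
```

Under these assumptions a caller would first compute the post-rotation size, then rotate each coordinate, and only then look up the grid cell, so that row 0 always corresponds to the top of the page as displayed.
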
## Test Results
```
running 32 tests
test classify::tests::test_critical_hybrid_page_text_header_scanned_body ... ok
test classify::tests::test_grid_classifier_below_threshold ... ok
test classify::tests::test_determinism_classify_twice ... ok
test classify::tests::test_grid_classifier_hybrid_detection ... ok
test classify::tests::test_exactly_10_cells_threshold ... ok
... (28 more classify tests) ...
test result: ok. 32 passed; 0 failed
```

## Files Modified/Created
- `crates/pdftract-core/src/classify.rs` (new file, 705 lines)
- `crates/pdftract-core/src/lib.rs` (already exports `classify` module)

## No WARN Items
All acceptance criteria met without environmental blockers.