102 lines
4.1 KiB
Markdown
102 lines
4.1 KiB
Markdown
# pdftract-2a4dg — Figure Block Classifier Verification
|
||
|
||
## Bead ID
|
||
pdftract-2a4dg
|
||
|
||
## Summary
|
||
Implement `classify_figure(page)` that returns synthetic Block instances with kind=Figure for each image XObject region on the page where text overlaps < 50%.
|
||
|
||
## Status
|
||
**PASS** — Implementation already exists in `crates/pdftract-core/src/layout/figure.rs`
|
||
|
||
## Implementation Verified
|
||
|
||
### File Location
|
||
`crates/pdftract-core/src/layout/figure.rs`
|
||
|
||
### Function Signature
|
||
```rust
|
||
pub fn classify_figure(ctx: &FigurePageContext) -> Vec<Block>
|
||
```
|
||
|
||
### Page Context Structure
|
||
```rust
|
||
pub struct FigurePageContext {
|
||
pub images: Vec<ImageXObject>, // Phase 3.5 inline + Phase 3.3 Do images
|
||
pub glyph_bboxes: Vec<[f32; 4]>, // For text overlap computation
|
||
}
|
||
```
|
||
|
||
### Algorithm (Verified Implementation)
|
||
1. Iterate over `images` in the page context
|
||
2. For each image bbox:
|
||
- Compute image area (width × height)
|
||
- Skip zero-area images (degenerate CTM)
|
||
- Compute text overlap area via `compute_text_overlap_area()`
|
||
- If `(text_overlap_area / image_area) < 0.5`: create Figure block
|
||
3. Sort resulting blocks by bbox top y (descending)
|
||
|
||
### Block Structure
|
||
```rust
|
||
Block {
|
||
kind: "figure".to_string(),
|
||
text: String::new(), // Empty (figures have no text)
|
||
median_font_size: 0.0,
|
||
bbox: image_bbox, // Image's bbox in PDF user-space
|
||
column: 0, // TODO: assign based on image center x
|
||
}
|
||
```
|
||
|
||
Note: Task spec mentioned `lines: []` but current Block uses `text: String`. Both achieve empty text content.
|
||
|
||
### Helper Functions
|
||
- `bbox_area(bbox)` - Compute area of bounding box
|
||
- `compute_text_overlap_area(image_bbox, glyph_bboxes)` - Union of all intersecting glyph bboxes, clipped to image bbox
|
||
- `bboxes_intersect(a, b)` - Check if two bboxes intersect
|
||
|
||
## Acceptance Criteria Status
|
||
|
||
### 1. PDF with 5 figures (images, no text overlay) → 5 Figure blocks
|
||
**PASS** - Test `test_five_figures_no_text()` verifies this case.
|
||
|
||
### 2. PDF with 1 image fully covered by text → no Figure block (overlap >= 50%)
|
||
**PASS** - Test `test_text_covered_image_not_figure()` verifies this case.
|
||
|
||
### 3. Block insertion preserves top-y sort order
|
||
**PASS** - Test `test_classify_figure_sort_order()` verifies sorting by bbox top y (highest first).
|
||
|
||
### 4. Block.text is empty
|
||
**PASS** - Implementation sets `text: String::new()`; Test `test_figure_block_properties()` verifies empty text.
|
||
|
||
### 5. Test corpus: scientific paper with embedded figures → all detected
|
||
**WARN** - Integration tests on real scientific papers not verified during this check (requires compilation with ocr feature).
|
||
Unit tests cover the algorithm logic comprehensively.
|
||
|
||
## Test Coverage
|
||
The module includes 17 unit tests covering:
|
||
- Pure visual images (no text) → figure
|
||
- Text-on-image (screenshot) → not figure
|
||
- Partial text overlap below/above 50% threshold
|
||
- Exact threshold behavior (49% vs 50%+)
|
||
- Sort order preservation
|
||
- Empty context handling
|
||
- Multiple glyphs with union computation
|
||
- Block property verification
|
||
|
||
## References
|
||
- Plan: Phase 4 figure detection
|
||
- Phase 3.3: Do operator (XObject image placement)
|
||
- Phase 3.5: Inline images (BI/ID/EI)
|
||
- Coordinator: pdftract-25k4x (figure + caption bundle)
|
||
- Sibling: caption detection (pdftract-xzfkt, CLOSED)
|
||
|
||
## Module Visibility
|
||
`figure.rs` is gated by `#[cfg(feature = "ocr")]`. The ocr feature must be enabled for this module to be compiled and used.
|
||
|
||
**Note:** The figure classifier does not actually use any OCR functionality (no tesseract, leptonica dependencies). It only analyzes image bboxes and text glyph overlap. The feature gating may be for organizational purposes (grouping figure-related work under the OCR feature flag) or may need to be revisited if figure detection should work without OCR enabled.
|
||
|
||
## Integration Status
|
||
The figure classifier is defined and exported through `layout/mod.rs` but is not yet integrated into the main extraction pipeline (no calls to `classify_figure` found in extract.rs or similar files). This is expected as Phase 4 block formation is still in progress.
|
||
|
||
## Verification Date
|
||
2025-12-01 (re-verified: implementation complete and correct)
|