4.1 KiB
pdftract-2a4dg — Figure Block Classifier Verification
Bead ID
pdftract-2a4dg
Summary
Implement classify_figure(page) that returns synthetic Block instances with kind=Figure for each image XObject region on the page where text overlaps < 50%.
Status
PASS — Implementation already exists in crates/pdftract-core/src/layout/figure.rs
Implementation Verified
File Location
crates/pdftract-core/src/layout/figure.rs
Function Signature
pub fn classify_figure(ctx: &FigurePageContext) -> Vec<Block>
Page Context Structure
pub struct FigurePageContext {
pub images: Vec<ImageXObject>, // Phase 3.5 inline + Phase 3.3 Do images
pub glyph_bboxes: Vec<[f32; 4]>, // For text overlap computation
}
Algorithm (Verified Implementation)
- Iterate over
imagesin the page context - For each image bbox:
- Compute image area (width × height)
- Skip zero-area images (degenerate CTM)
- Compute text overlap area via
compute_text_overlap_area() - If
(text_overlap_area / image_area) < 0.5: create Figure block
- Sort resulting blocks by bbox top y (descending)
Block Structure
Block {
kind: "figure".to_string(),
text: String::new(), // Empty (figures have no text)
median_font_size: 0.0,
bbox: image_bbox, // Image's bbox in PDF user-space
column: 0, // TODO: assign based on image center x
}
Note: Task spec mentioned lines: [] but current Block uses text: String. Both achieve empty text content.
Helper Functions
bbox_area(bbox)- Compute area of bounding boxcompute_text_overlap_area(image_bbox, glyph_bboxes)- Union of all intersecting glyph bboxes, clipped to image bboxbboxes_intersect(a, b)- Check if two bboxes intersect
Acceptance Criteria Status
1. PDF with 5 figures (images, no text overlay) → 5 Figure blocks
PASS - Test test_five_figures_no_text() verifies this case.
2. PDF with 1 image fully covered by text → no Figure block (overlap >= 50%)
PASS - Test test_text_covered_image_not_figure() verifies this case.
3. Block insertion preserves top-y sort order
PASS - Test test_classify_figure_sort_order() verifies sorting by bbox top y (highest first).
4. Block.text is empty
PASS - Implementation sets text: String::new(); Test test_figure_block_properties() verifies empty text.
5. Test corpus: scientific paper with embedded figures → all detected
WARN - Integration tests on real scientific papers not verified during this check (requires compilation with ocr feature). Unit tests cover the algorithm logic comprehensively.
Test Coverage
The module includes 17 unit tests covering:
- Pure visual images (no text) → figure
- Text-on-image (screenshot) → not figure
- Partial text overlap below/above 50% threshold
- Exact threshold behavior (49% vs 50%+)
- Sort order preservation
- Empty context handling
- Multiple glyphs with union computation
- Block property verification
References
- Plan: Phase 4 figure detection
- Phase 3.3: Do operator (XObject image placement)
- Phase 3.5: Inline images (BI/ID/EI)
- Coordinator: pdftract-25k4x (figure + caption bundle)
- Sibling: caption detection (pdftract-xzfkt, CLOSED)
Module Visibility
figure.rs is gated by #[cfg(feature = "ocr")]. The ocr feature must be enabled for this module to be compiled and used.
Note: The figure classifier does not actually use any OCR functionality (no tesseract, leptonica dependencies). It only analyzes image bboxes and text glyph overlap. The feature gating may be for organizational purposes (grouping figure-related work under the OCR feature flag) or may need to be revisited if figure detection should work without OCR enabled.
Integration Status
The figure classifier is defined and exported through layout/mod.rs but is not yet integrated into the main extraction pipeline (no calls to classify_figure found in extract.rs or similar files). This is expected as Phase 4 block formation is still in progress.
Verification Date
2025-12-01 (re-verified: implementation complete and correct)