pdftract-2a4dg — Figure Block Classifier Verification

Bead ID

pdftract-2a4dg

Summary

Implement classify_figure(page), which returns a synthetic Block with kind=Figure for each image XObject region on the page where text covers less than 50% of the image area.

Status

PASS — Implementation already exists in crates/pdftract-core/src/layout/figure.rs

Implementation Verified

File Location

crates/pdftract-core/src/layout/figure.rs

Function Signature

pub fn classify_figure(ctx: &FigurePageContext) -> Vec<Block>

Page Context Structure

pub struct FigurePageContext {
    pub images: Vec<ImageXObject>,        // Phase 3.5 inline + Phase 3.3 Do images
    pub glyph_bboxes: Vec<[f32; 4]>,      // For text overlap computation
}

Algorithm (Verified Implementation)

  1. Iterate over images in the page context
  2. For each image bbox:
    • Compute image area (width × height)
    • Skip zero-area images (degenerate CTM)
    • Compute text overlap area via compute_text_overlap_area()
    • If (text_overlap_area / image_area) < 0.5: create Figure block
  3. Sort the resulting blocks by bbox top y, descending (highest top y first)
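
The steps above can be sketched as follows. This is a minimal illustration, not the code in figure.rs: the `bbox` field on `ImageXObject`, the `[x0, y0, x1, y1]` bbox layout, the `column: usize` type, and the name `classify_figure_with` are all assumptions, and the overlap computation is passed in as a closure only so the sketch stays self-contained.

```rust
// Hypothetical stand-ins for the real pdftract-core types.
#[derive(Debug, Clone, PartialEq)]
pub struct ImageXObject {
    pub bbox: [f32; 4], // assumed layout: [x0, y0, x1, y1] in PDF user space
}

pub struct FigurePageContext {
    pub images: Vec<ImageXObject>,
    pub glyph_bboxes: Vec<[f32; 4]>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub kind: String,
    pub text: String,
    pub median_font_size: f32,
    pub bbox: [f32; 4],
    pub column: usize, // assumed integer type
}

/// Top edge of a bbox; PDF user space has y growing upward.
fn top_y(bbox: &[f32; 4]) -> f32 {
    bbox[1].max(bbox[3])
}

/// Sketch of the classification loop. `text_overlap_area` stands in for
/// compute_text_overlap_area(image_bbox, glyph_bboxes).
pub fn classify_figure_with<F>(ctx: &FigurePageContext, text_overlap_area: F) -> Vec<Block>
where
    F: Fn(&[f32; 4], &[[f32; 4]]) -> f32,
{
    let mut blocks = Vec::new();
    for image in &ctx.images {
        let [x0, y0, x1, y1] = image.bbox;
        let image_area = (x1 - x0).abs() * (y1 - y0).abs();
        // Skip zero-area (or NaN) images, e.g. from a degenerate CTM.
        if !(image_area > 0.0) {
            continue;
        }
        let covered = text_overlap_area(&image.bbox, ctx.glyph_bboxes.as_slice());
        // Strictly below 50% text coverage counts as a figure.
        if covered / image_area < 0.5 {
            blocks.push(Block {
                kind: "figure".to_string(),
                text: String::new(),
                median_font_size: 0.0,
                bbox: image.bbox,
                column: 0,
            });
        }
    }
    // Highest top y first.
    blocks.sort_by(|a, b| top_y(&b.bbox).total_cmp(&top_y(&a.bbox)));
    blocks
}
```

Calling `classify_figure_with(&ctx, |_, _| 0.0)` models a page with no text over any image, so every image with non-zero area becomes a figure block.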

Block Structure

Block {
    kind: "figure".to_string(),
    text: String::new(),              // Empty (figures have no text)
    median_font_size: 0.0,
    bbox: image_bbox,                 // Image's bbox in PDF user-space
    column: 0,                        // TODO: assign based on image center x
}

Note: The task spec called for lines: [], but the current Block type has text: String instead. Either form represents empty text content.

Helper Functions

  • bbox_area(bbox) - Compute area of bounding box
  • compute_text_overlap_area(image_bbox, glyph_bboxes) - Area of the union of all glyph bboxes that intersect the image, clipped to the image bbox
  • bboxes_intersect(a, b) - Check if two bboxes intersect
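
For illustration, the three helpers could look roughly like the sketch below. The signatures, the `BBox` alias, and the `[x0, y0, x1, y1]` layout with x0 <= x1 and y0 <= y1 are assumptions, and the union area is computed here by coordinate compression, which may not be the technique figure.rs uses. Treating boxes that only touch along an edge as non-intersecting is also an assumption.

```rust
/// Assumed bbox layout: [x0, y0, x1, y1] with x0 <= x1 and y0 <= y1.
pub type BBox = [f32; 4];

/// Area of a bbox; inverted boxes count as zero.
pub fn bbox_area(b: &BBox) -> f32 {
    (b[2] - b[0]).max(0.0) * (b[3] - b[1]).max(0.0)
}

/// True when the two boxes share interior area (edge contact is not enough).
pub fn bboxes_intersect(a: &BBox, b: &BBox) -> bool {
    a[0] < b[2] && b[0] < a[2] && a[1] < b[3] && b[1] < a[3]
}

/// Clip a glyph bbox to the image bbox (hypothetical helper for this sketch).
fn clip(glyph: &BBox, image: &BBox) -> BBox {
    [
        glyph[0].max(image[0]),
        glyph[1].max(image[1]),
        glyph[2].min(image[2]),
        glyph[3].min(image[3]),
    ]
}

/// Area of the union of the glyph boxes that intersect the image, clipped to
/// the image. Overlapping glyphs are counted once, not once per glyph.
pub fn compute_text_overlap_area(image_bbox: &BBox, glyph_bboxes: &[BBox]) -> f32 {
    let clipped: Vec<BBox> = glyph_bboxes
        .iter()
        .filter(|g| bboxes_intersect(image_bbox, g))
        .map(|g| clip(g, image_bbox))
        .collect();

    // Coordinate compression: every rectangle edge becomes a grid line.
    let mut xs: Vec<f32> = clipped.iter().flat_map(|r| [r[0], r[2]]).collect();
    let mut ys: Vec<f32> = clipped.iter().flat_map(|r| [r[1], r[3]]).collect();
    xs.sort_by(f32::total_cmp);
    ys.sort_by(f32::total_cmp);
    xs.dedup();
    ys.dedup();

    // A grid cell is covered if its center lies inside any clipped glyph box.
    let mut area = 0.0;
    for xw in xs.windows(2) {
        for yw in ys.windows(2) {
            let cx = (xw[0] + xw[1]) / 2.0;
            let cy = (yw[0] + yw[1]) / 2.0;
            let covered = clipped
                .iter()
                .any(|r| r[0] <= cx && cx <= r[2] && r[1] <= cy && cy <= r[3]);
            if covered {
                area += (xw[1] - xw[0]) * (yw[1] - yw[0]);
            }
        }
    }
    area
}
```

The union matters because summing per-glyph intersections would double-count overlapping glyphs and could push the ratio past 50% for an image that is mostly uncovered. This sketch is cubic in the number of glyphs in the worst case; a sweep-line would scale better on text-dense pages.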

Acceptance Criteria Status

1. PDF with 5 figures (images, no text overlay) → 5 Figure blocks

PASS - Test test_five_figures_no_text() verifies this case.

2. PDF with 1 image fully covered by text → no Figure block (overlap >= 50%)

PASS - Test test_text_covered_image_not_figure() verifies this case.

3. Block insertion preserves top-y sort order

PASS - Test test_classify_figure_sort_order() verifies sorting by bbox top y (highest first).

4. Block.text is empty

PASS - Implementation sets text: String::new(); Test test_figure_block_properties() verifies empty text.

5. Test corpus: scientific paper with embedded figures → all detected

WARN - Integration tests on real scientific papers were not run during this check, because they require compiling with the ocr feature. The unit tests cover the algorithm logic but not behavior on real documents.

Test Coverage

The module includes 17 unit tests covering:

  • Pure visual images (no text) → figure
  • Text-on-image (screenshot) → not figure
  • Partial text overlap below/above 50% threshold
  • Exact threshold behavior (49% vs 50%+)
  • Sort order preservation
  • Empty context handling
  • Multiple glyphs with union computation
  • Block property verification
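
As an illustration of the threshold cases listed above, the boundary can be pinned down with a few small tests. The helper `is_figure` and the test names below are hypothetical and exist only for this sketch; the real tests exercise classify_figure directly.

```rust
/// Hypothetical helper isolating the threshold decision; not part of figure.rs.
/// An image is a figure when text covers strictly less than 50% of its area.
pub fn is_figure(image_area: f32, text_overlap_area: f32) -> bool {
    image_area > 0.0 && text_overlap_area / image_area < 0.5
}

#[test]
fn overlap_at_49_percent_is_figure() {
    assert!(is_figure(100.0, 49.0));
}

#[test]
fn overlap_at_exactly_50_percent_is_not_figure() {
    assert!(!is_figure(100.0, 50.0));
}

#[test]
fn zero_area_image_is_never_figure() {
    assert!(!is_figure(0.0, 0.0));
}
```

The exact-50% case is the one most likely to regress if the comparison is ever changed from `<` to `<=`, so it is worth keeping as its own test.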

References

  • Plan: Phase 4 figure detection
  • Phase 3.3: Do operator (XObject image placement)
  • Phase 3.5: Inline images (BI/ID/EI)
  • Coordinator: pdftract-25k4x (figure + caption bundle)
  • Sibling: caption detection (pdftract-xzfkt, CLOSED)

Module Visibility

figure.rs is gated by #[cfg(feature = "ocr")]. The ocr feature must be enabled for this module to be compiled and used.

Note: The figure classifier uses no OCR functionality (no tesseract or leptonica dependencies); it only analyzes image bboxes and their overlap with text glyphs. The gating may be purely organizational (grouping figure-related work under the OCR feature flag), and it should be revisited if figure detection is meant to work without OCR enabled.

Integration Status

The figure classifier is defined and exported through layout/mod.rs but is not yet integrated into the main extraction pipeline (no calls to classify_figure found in extract.rs or similar files). This is expected as Phase 4 block formation is still in progress.

Verification Date

2025-12-01 (re-verified: implementation complete and correct)