jedarden/pdftract

jedarden 0691c3f543 docs(pdftract-4bgp): add verification note for /EmbeddedFiles name tree walker + /AF fallback

2026-06-01 07:26:35 -04:00

pdftract-2a4dg — Figure Block Classifier Verification

Bead ID

pdftract-2a4dg

Summary

Implement classify_figure(page), which returns a synthetic Block with kind=Figure for each image XObject region on the page where text covers less than 50% of the image area.

Status

PASS — Implementation already exists in crates/pdftract-core/src/layout/figure.rs

Implementation Verified

File Location

crates/pdftract-core/src/layout/figure.rs

Function Signature

pub fn classify_figure(ctx: &FigurePageContext) -> Vec<Block>

Page Context Structure

pub struct FigurePageContext {
    pub images: Vec<ImageXObject>,        // Phase 3.5 inline + Phase 3.3 Do images
    pub glyph_bboxes: Vec<[f32; 4]>,      // For text overlap computation
}

Algorithm (Verified Implementation)

1. Iterate over the images in the page context.
2. For each image bbox:
- Compute the image area (width × height)
- Skip zero-area images (degenerate CTM)
- Compute the text overlap area via compute_text_overlap_area()
- If (text_overlap_area / image_area) < 0.5: create a Figure block
3. Sort the resulting blocks by bbox top y (descending).
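The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in figure.rs: the Block struct here is a cut-down stand-in, the function name classify_figure_sketch is hypothetical, the bbox layout [x0, y0, x1, y1] (with y increasing upward, as in PDF user-space) is an assumption, and the overlap computation is passed in as a closure so the block stays self-contained.

```rust
/// Hypothetical, cut-down stand-in for the crate's `Block` (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub kind: String,
    pub text: String,
    /// Assumed layout: [x0, y0, x1, y1] in PDF user-space.
    pub bbox: [f32; 4],
}

/// Sketch of the classification loop: area check, 50% threshold, top-y sort.
/// `overlap_area` stands in for `compute_text_overlap_area()`.
pub fn classify_figure_sketch<F>(
    images: &[[f32; 4]],
    glyphs: &[[f32; 4]],
    overlap_area: F,
) -> Vec<Block>
where
    F: Fn(&[f32; 4], &[[f32; 4]]) -> f32,
{
    let mut blocks = Vec::new();
    for bbox in images {
        let area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1]);
        // Skip degenerate images; the negated comparison also rejects NaN.
        if !(area > 0.0) {
            continue;
        }
        let covered = overlap_area(bbox, glyphs);
        // Strictly less than 50% text coverage counts as a figure.
        if covered / area < 0.5 {
            blocks.push(Block {
                kind: "figure".to_string(),
                text: String::new(),
                bbox: *bbox,
            });
        }
    }
    // Highest top edge first (descending top y).
    blocks.sort_by(|a, b| b.bbox[3].total_cmp(&a.bbox[3]));
    blocks
}
```

Because the comparison is strict, an image whose text coverage is exactly 50% is not classified as a figure, which matches acceptance criterion 2 (overlap >= 50% yields no Figure block).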

Block Structure

Block {
    kind: "figure".to_string(),
    text: String::new(),              // Empty (figures have no text)
    median_font_size: 0.0,
    bbox: image_bbox,                 // Image's bbox in PDF user-space
    column: 0,                        // TODO: assign based on image center x
}

Note: The task spec called for lines: [], but the current Block struct stores its content in text: String. Either way the figure block carries no text.

Helper Functions

bbox_area(bbox) - Compute area of bounding box
compute_text_overlap_area(image_bbox, glyph_bboxes) - Area of the union of all glyph bboxes that intersect the image, clipped to the image bbox
bboxes_intersect(a, b) - Check if two bboxes intersect
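For illustration, here is one way the three helpers could be written. The function names follow the list above, but the bodies are an assumption rather than a copy of figure.rs: the bbox layout [x0, y0, x1, y1] is assumed, touching edges are treated as non-intersecting, and the union area is computed with a simple slab decomposition over the x coordinates so that overlapping glyphs are counted once.

```rust
/// Assumed bbox layout: [x0, y0, x1, y1].
pub type BBox = [f32; 4];

pub fn bbox_area(b: &BBox) -> f32 {
    (b[2] - b[0]).max(0.0) * (b[3] - b[1]).max(0.0)
}

/// Strict test: boxes that only share an edge do not intersect.
pub fn bboxes_intersect(a: &BBox, b: &BBox) -> bool {
    a[0] < b[2] && b[0] < a[2] && a[1] < b[3] && b[1] < a[3]
}

fn clip(b: &BBox, to: &BBox) -> BBox {
    [b[0].max(to[0]), b[1].max(to[1]), b[2].min(to[2]), b[3].min(to[3])]
}

/// Area of the union of glyph boxes, clipped to the image box.
pub fn compute_text_overlap_area(image: &BBox, glyphs: &[BBox]) -> f32 {
    let clipped: Vec<BBox> = glyphs
        .iter()
        .filter(|g| bboxes_intersect(image, g))
        .map(|g| clip(g, image))
        .collect();

    // Every distinct x edge bounds a vertical slab.
    let mut xs: Vec<f32> = clipped.iter().flat_map(|b| [b[0], b[2]]).collect();
    xs.sort_by(f32::total_cmp);
    xs.dedup();

    let mut total = 0.0;
    for w in xs.windows(2) {
        let (x0, x1) = (w[0], w[1]);
        // y intervals of the boxes that span this whole slab.
        let mut ys: Vec<(f32, f32)> = clipped
            .iter()
            .filter(|b| b[0] <= x0 && b[2] >= x1)
            .map(|b| (b[1], b[3]))
            .collect();
        ys.sort_by(|a, b| a.0.total_cmp(&b.0));

        // Merge overlapping intervals and sum their lengths.
        let mut covered = 0.0;
        let mut cur: Option<(f32, f32)> = None;
        for (lo, hi) in ys {
            match cur {
                Some((clo, chi)) if lo <= chi => cur = Some((clo, chi.max(hi))),
                Some((clo, chi)) => {
                    covered += chi - clo;
                    cur = Some((lo, hi));
                }
                None => cur = Some((lo, hi)),
            }
        }
        if let Some((clo, chi)) = cur {
            covered += chi - clo;
        }
        total += covered * (x1 - x0);
    }
    total
}
```

Taking the union matters because summing per-glyph intersections would double-count overlapping glyph boxes and could push an image over the 50% threshold incorrectly. The slab approach is quadratic in the number of glyphs in the worst case; whether that is acceptable for dense pages is a separate question this sketch does not address.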

Acceptance Criteria Status

1. PDF with 5 figures (images, no text overlay) → 5 Figure blocks

PASS - Test test_five_figures_no_text() verifies this case.

2. PDF with 1 image fully covered by text → no Figure block (overlap >= 50%)

PASS - Test test_text_covered_image_not_figure() verifies this case.

3. Block insertion preserves top-y sort order

PASS - Test test_classify_figure_sort_order() verifies sorting by bbox top y (highest first).

4. Block.text is empty

PASS - Implementation sets text: String::new(); Test test_figure_block_properties() verifies empty text.

5. Test corpus: scientific paper with embedded figures → all detected

WARN - Integration tests on real scientific papers were not run during this check, because they require compiling with the ocr feature. The unit tests listed under Test Coverage exercise the algorithm logic.

Test Coverage

The module includes 17 unit tests covering:

Pure visual images (no text) → figure
Text-on-image (screenshot) → not figure
Partial text overlap below/above 50% threshold
Exact threshold behavior (49% vs 50%+)
Sort order preservation
Empty context handling
Multiple glyphs with union computation
Block property verification

References

Plan: Phase 4 figure detection
Phase 3.3: Do operator (XObject image placement)
Phase 3.5: Inline images (BI/ID/EI)
Coordinator: pdftract-25k4x (figure + caption bundle)
Sibling: caption detection (pdftract-xzfkt, CLOSED)

Module Visibility

figure.rs is gated by #[cfg(feature = "ocr")]. The ocr feature must be enabled for this module to be compiled and used.

Note: The figure classifier uses no OCR functionality (no tesseract or leptonica dependencies); it only analyzes image bboxes and glyph overlap. The gate may exist for organizational reasons (grouping figure-related work under the OCR feature flag), and it should be revisited if figure detection is meant to work without OCR enabled.
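As a rough illustration of what the gate means in practice, the sketch below shows a module compiled only when an ocr feature is enabled, plus a compile-time probe. The module body and the function names classifier_marker and figure_classifier_compiled are hypothetical; the actual declaration in layout/mod.rs was not reproduced in this note.

```rust
// Hypothetical shape of the gate; the real declaration may differ.
#[cfg(feature = "ocr")]
pub mod figure {
    /// Placeholder standing in for the gated classifier code.
    pub fn classifier_marker() -> &'static str {
        "figure classifier compiled"
    }
}

/// Reports whether the gated module was built into this binary.
pub fn figure_classifier_compiled() -> bool {
    cfg!(feature = "ocr")
}
```

Without the feature, any caller that references the gated module fails to compile unless the call site carries the same cfg attribute, which is one reason the gate is worth revisiting before pipeline integration.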

Integration Status

The figure classifier is defined and exported through layout/mod.rs but is not yet integrated into the main extraction pipeline (no calls to classify_figure found in extract.rs or similar files). This is expected as Phase 4 block formation is still in progress.

Verification Date

2025-12-01 (re-verified: implementation complete and correct)
