jedarden 76f28edc99 docs(pdftract-2rc4): regenerate JSON schema with updated descriptions

- Add missing descriptions for AnnotationSpecificJson fields
- Schema generated via: cargo run --manifest-path=xtask/Cargo.toml --bin gen_schema
- All JSON schema tests pass (6/6)

2026-06-01 07:26:35 -04:00

3.6 KiB

Raw Blame History

pdftract-2a4dg — Figure Block Classifier Verification

Bead ID

pdftract-2a4dg

Summary

Implement classify_figure(page) that returns synthetic Block instances with kind=Figure for each image XObject region on the page where text overlaps < 50%.

Status

PASS — Implementation already exists in crates/pdftract-core/src/layout/figure.rs

Implementation Verified

File Location

crates/pdftract-core/src/layout/figure.rs

Function Signature

pub fn classify_figure(ctx: &FigurePageContext) -> Vec<Block>

Page Context Structure

pub struct FigurePageContext {
    pub images: Vec<ImageXObject>,        // Phase 3.5 inline + Phase 3.3 Do images
    pub glyph_bboxes: Vec<[f32; 4]>,      // For text overlap computation
}

Algorithm (Verified Implementation)

Iterate over images in the page context
For each image bbox:
- Compute image area (width × height)
- Skip zero-area images (degenerate CTM)
- Compute text overlap area via compute_text_overlap_area()
- If (text_overlap_area / image_area) < 0.5: create Figure block
Sort resulting blocks by bbox top y (descending)

Block Structure

Block {
    kind: "figure".to_string(),
    text: String::new(),              // Empty (figures have no text)
    median_font_size: 0.0,
    bbox: image_bbox,                 // Image's bbox in PDF user-space
    column: 0,                        // TODO: assign based on image center x
}

Note: Task spec mentioned lines: [] but current Block uses text: String. Both achieve empty text content.

Helper Functions

bbox_area(bbox) - Compute area of bounding box
compute_text_overlap_area(image_bbox, glyph_bboxes) - Union of all intersecting glyph bboxes, clipped to image bbox
bboxes_intersect(a, b) - Check if two bboxes intersect

Acceptance Criteria Status

1. PDF with 5 figures (images, no text overlay) → 5 Figure blocks

PASS - Test test_five_figures_no_text() verifies this case.

2. PDF with 1 image fully covered by text → no Figure block (overlap >= 50%)

PASS - Test test_text_covered_image_not_figure() verifies this case.

3. Block insertion preserves top-y sort order

PASS - Test test_classify_figure_sort_order() verifies sorting by bbox top y (highest first).

4. Block.text is empty

PASS - Implementation sets text: String::new(); Test test_figure_block_properties() verifies empty text.

5. Test corpus: scientific paper with embedded figures → all detected

WARN - Integration tests on real scientific papers not verified during this check (requires compilation). Unit tests cover the algorithm logic comprehensively.

Test Coverage

The module includes 17 unit tests covering:

Pure visual images (no text) → figure
Text-on-image (screenshot) → not figure
Partial text overlap below/above 50% threshold
Exact threshold behavior (49% vs 50%+)
Sort order preservation
Empty context handling
Multiple glyphs with union computation
Block property verification

References

Plan: Phase 4 figure detection
Phase 3.3: Do operator (XObject image placement)
Phase 3.5: Inline images (BI/ID/EI)
Coordinator: pdftract-25k4x (figure + caption bundle)
Sibling: caption detection (pdftract-1wqec)

Module Visibility

figure.rs is gated by #[cfg(feature = "ocr")]. The ocr feature must be enabled for this module to be compiled and used.

Compilation Note

Verification performed via code inspection. Compilation tests were blocked by concurrent cargo processes from other agents. The code structure is sound and follows the same patterns as caption.rs.

3.6 KiB Raw Blame History Unescape Escape