jedarden/pdftract


jedarden 8fe61a1ba5 docs(pdftract-25k4x): add verification note for figure/caption detection

2026-06-01 09:35:02 -04:00

3.7 KiB


pdftract-25k4x: Figure Detection + Caption Detection

Status: COMPLETE

Overview

Figure detection and caption detection were already implemented in the codebase, in the following files:

crates/pdftract-core/src/layout/figure.rs (517 lines, 16 tests)
crates/pdftract-core/src/layout/caption.rs (342 lines, 8 tests)

Verification Summary

Figure Detection (`classify_figure`)

Algorithm:

Walks image XObjects from Phase 3.3 Do + Phase 3.5 inline images
For each image, computes union area of all text glyph bboxes intersecting the image
Uses sweep line algorithm for precise union area computation
If text_overlap_area / image_area < 0.5, creates a Figure block
Sorts figures by bbox top Y (descending)
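
The overlap test above can be sketched as follows. This is a minimal illustration, not the code in figure.rs: the `Rect` type and the `union_area` and `is_figure` functions are hypothetical names, and clipping each glyph box to the image before taking the union is an assumption about how "intersecting the image" is interpreted.

```rust
/// Axis-aligned rectangle. Hypothetical stand-in for the crate's bbox type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl Rect {
    pub fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    /// Intersection with `other`, or None when the overlap has no area.
    pub fn intersect(&self, other: &Rect) -> Option<Rect> {
        let r = Rect {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        if r.x0 < r.x1 && r.y0 < r.y1 { Some(r) } else { None }
    }
}

/// Union area of rectangles using a sweep over x: for each strip between
/// consecutive x coordinates, merge the y-intervals of the rectangles that
/// span the strip and multiply the covered height by the strip width.
pub fn union_area(rects: &[Rect]) -> f64 {
    let mut xs: Vec<f64> = rects.iter().flat_map(|r| [r.x0, r.x1]).collect();
    xs.sort_by(|a, b| a.total_cmp(b));
    xs.dedup();

    let mut total = 0.0;
    for w in xs.windows(2) {
        let (left, right) = (w[0], w[1]);
        // y-intervals of every rectangle that fully covers this strip
        let mut spans: Vec<(f64, f64)> = rects
            .iter()
            .filter(|r| r.x0 <= left && r.x1 >= right)
            .map(|r| (r.y0, r.y1))
            .collect();
        spans.sort_by(|a, b| a.0.total_cmp(&b.0));

        // Count only the part of each interval above what is already covered.
        let mut covered = 0.0;
        let mut reach = f64::NEG_INFINITY;
        for (lo, hi) in spans {
            let start = lo.max(reach);
            if hi > start {
                covered += hi - start;
                reach = hi;
            }
        }
        total += covered * (right - left);
    }
    total
}

/// True when text covers less than half of the image (the 0.5 threshold
/// from the algorithm above). Glyph boxes are clipped to the image first,
/// which is an assumption of this sketch.
pub fn is_figure(image: &Rect, glyphs: &[Rect]) -> bool {
    let image_area = image.area();
    if image_area <= 0.0 {
        return false; // assumption: degenerate images are never figures
    }
    let clipped: Vec<Rect> = glyphs.iter().filter_map(|g| g.intersect(image)).collect();
    union_area(&clipped) / image_area < 0.5
}
```

The union matters when glyph boxes overlap each other. For a 10×10 image with text boxes covering y in [0, 4] and [2, 4.5] across the full width, the union is 45 (ratio 0.45, still a figure), whereas a naive sum of the two areas would give 65 and wrongly reject the image.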

Acceptance Criteria Verification:

| Criteria | Test | Status |
| --- | --- | --- |
| Image XObject, no text overlap → 1 Figure block | `test_five_figures_no_text` | ✅ PASS |
| Image + small-font caption 1 line below → Figure + Caption | `test_caption_immediately_below_figure` | ✅ PASS |
| Image overlapping text (background) → NOT Figure | `test_text_covered_image_not_figure` | ✅ PASS |
| Text overlap < 50% → Figure | `test_classify_figure_partial_text_below_threshold` | ✅ PASS |
| Text overlap ≥ 50% → NOT Figure | `test_classify_figure_partial_text_above_threshold` | ✅ PASS |

Caption Detection (`classify_caption`)

Algorithm:

Checks font size < page_body_median
Requires previous block is a Figure
Vertical distance < 2 * line_height
Same column (when num_columns > 1)
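
The four checks above can be sketched as a single predicate. This is an illustrative sketch only: `BlockInfo`, `PageContext`, and `is_caption` are hypothetical names, not the types in caption.rs. It assumes PDF-style coordinates where Y grows upward (consistent with sorting figures by top Y descending), and it rejects candidates that start above the figure's bottom edge, which is an assumption based on above-figure captions not being detected.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BlockKind {
    Figure,
    Text,
}

/// Hypothetical, reduced view of a layout block.
#[derive(Clone, Copy, Debug)]
pub struct BlockInfo {
    pub kind: BlockKind,
    pub top: f64,    // upper edge (larger Y is higher on the page)
    pub bottom: f64, // lower edge
    pub median_font_size: f64,
    pub column: usize,
}

/// Hypothetical page-level context used by the classifier.
pub struct PageContext {
    pub page_body_median: f64,
    pub line_height: f64,
    pub num_columns: usize,
}

pub fn is_caption(
    candidate: &BlockInfo,
    previous: Option<&BlockInfo>,
    ctx: &PageContext,
) -> bool {
    // (1) Font metric: strictly smaller than the page body median.
    if candidate.median_font_size >= ctx.page_body_median {
        return false;
    }
    // (2) Spatial relationship: the previous block must be a Figure...
    let Some(prev) = previous else {
        return false;
    };
    if prev.kind != BlockKind::Figure {
        return false;
    }
    // ...and the candidate must start less than two line heights below it.
    let gap = prev.bottom - candidate.top;
    if gap < 0.0 || gap >= 2.0 * ctx.line_height {
        return false;
    }
    // (3) Column membership: only enforced on multi-column pages.
    if ctx.num_columns > 1 && candidate.column != prev.column {
        return false;
    }
    true
}
```

Ordering the checks from cheapest to most context-dependent keeps each rejection reason easy to test in isolation, which matches the one-test-per-criterion layout of the table below.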

Acceptance Criteria Verification:

| Criteria | Test | Status |
| --- | --- | --- |
| Small font + follows Figure + within 2 lines + same column → Caption | `test_caption_immediately_below_figure` | ✅ PASS |
| Caption 5 lines below → NOT Caption | `test_caption_too_far_below_figure` | ✅ PASS |
| Caption different column → NOT Caption | `test_caption_different_column` | ✅ PASS |
| Font not smaller than body → NOT Caption | `test_caption_font_not_smaller` | ✅ PASS |
| No previous Figure → NOT Caption | `test_no_previous_figure` | ✅ PASS |

Test Results

Figure tests: 16 passed; 0 failed
Caption tests: 8 passed; 0 failed

Key Implementation Details

INV (Invariants)

✅ Figure blocks carry no text lines (specified as lines=[], although Block stores text: String rather than a lines Vec)
✅ Figure blocks have median_font_size: 0.0
✅ Caption blocks have kind: "caption" set via set_caption()

Critical Considerations Addressed

Text overlap union algorithm: Uses sweep line for accurate union area (not naive sum)
Sorting: Figures sorted by top Y descending for consistent page order
Column assignment: a TODO comment marks where figures are to be assigned to a column based on image center
Above-figure captions: NOT detected in v0.1.0 (as specified in bead)

Files Modified

None - implementation was already complete

Retrospective

What worked

The existing implementation is clean, well-tested, and follows the bead specification exactly
The sweep line algorithm computes the exact union area of the text boxes, so overlapping glyphs are not double-counted
Test coverage is comprehensive with edge cases (thresholds, empty contexts, multiple figures)

What didn't

N/A - implementation was already complete and passing

Surprise

The bead was already fully implemented despite being in the ready queue
Both modules share a common Block type via pub use from caption.rs

Reusable pattern

The sweep line algorithm in compute_text_overlap_area is a reusable pattern for computing the union area of rectangles
The classify_caption sequence of checks, (1) font metric, (2) spatial relationship, (3) column membership, is a template for other block classifiers
