pdftract/notes/pdftract-25k4x.md

pdftract-25k4x: Figure Detection + Caption Detection

Status: COMPLETE

Overview

Figure detection and caption detection were already implemented in the codebase:

  • crates/pdftract-core/src/layout/figure.rs (517 lines, 16 tests)
  • crates/pdftract-core/src/layout/caption.rs (342 lines, 8 tests)

Verification Summary

Figure Detection (classify_figure)

Algorithm:

  1. Walks image XObjects from Phase 3.3 Do + Phase 3.5 inline images
  2. For each image, computes union area of all text glyph bboxes intersecting the image
  3. Uses sweep line algorithm for precise union area computation
  4. If text_overlap_area / image_area < 0.5, creates a Figure block
  5. Sorts figures by bbox top Y (descending)
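
The overlap test in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the code in figure.rs: `Rect`, `text_overlap_area`, and `is_figure` are hypothetical names, the sweep is a simple strip-based variant over x-coordinates, and treating a zero-area image as "not a figure" is an assumption.

```rust
/// Axis-aligned rectangle. Field names are illustrative, not pdftract's.
#[derive(Clone, Copy, Debug)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl Rect {
    pub fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    /// Intersection with `other`, or None when the overlap has no area.
    pub fn clip(&self, other: &Rect) -> Option<Rect> {
        let r = Rect {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        (r.x0 < r.x1 && r.y0 < r.y1).then_some(r)
    }
}

/// Union area of the glyph boxes, restricted to the image, via a sweep over x.
pub fn text_overlap_area(image: &Rect, glyphs: &[Rect]) -> f64 {
    let clipped: Vec<Rect> = glyphs.iter().filter_map(|g| g.clip(image)).collect();

    // Every vertical rectangle edge is a sweep event.
    let mut xs: Vec<f64> = clipped.iter().flat_map(|r| [r.x0, r.x1]).collect();
    xs.sort_by(f64::total_cmp);
    xs.dedup();

    let mut area = 0.0;
    for w in xs.windows(2) {
        let (left, right) = (w[0], w[1]);

        // Y-intervals of the rectangles that span this vertical strip.
        let mut spans: Vec<(f64, f64)> = clipped
            .iter()
            .filter(|r| r.x0 <= left && r.x1 >= right)
            .map(|r| (r.y0, r.y1))
            .collect();
        spans.sort_by(|a, b| a.0.total_cmp(&b.0));

        // Merge overlapping intervals so shared height is counted once.
        let mut covered = 0.0;
        let mut cur: Option<(f64, f64)> = None;
        for (lo, hi) in spans {
            match cur {
                Some((s, e)) if lo <= e => cur = Some((s, e.max(hi))),
                Some((s, e)) => {
                    covered += e - s;
                    cur = Some((lo, hi));
                }
                None => cur = Some((lo, hi)),
            }
        }
        if let Some((s, e)) = cur {
            covered += e - s;
        }
        area += covered * (right - left);
    }
    area
}

/// Step 4: an image is a figure when text covers less than half of it.
/// Assumption: a degenerate (zero-area) image is never a figure.
pub fn is_figure(image: &Rect, glyphs: &[Rect]) -> bool {
    let image_area = image.area();
    image_area > 0.0 && text_overlap_area(image, glyphs) / image_area < 0.5
}
```

For a 10 × 10 image with two glyph boxes (0, 0, 6, 5) and (4, 0, 10, 5), the union area is 50, whereas a naive sum would give 60. The ratio is exactly 0.5, so the image is not classified as a figure.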

Acceptance Criteria Verification:

| Criterion | Test | Status |
|---|---|---|
| Image XObject, no text overlap → 1 Figure block | test_five_figures_no_text | PASS |
| Image + small-font caption 1 line below → Figure + Caption | test_caption_immediately_below_figure | PASS |
| Image overlapping text (background) → NOT Figure | test_text_covered_image_not_figure | PASS |
| Text overlap < 50% → Figure | test_classify_figure_partial_text_below_threshold | PASS |
| Text overlap ≥ 50% → NOT Figure | test_classify_figure_partial_text_above_threshold | PASS |

Caption Detection (classify_caption)

Algorithm:

  1. Font size is smaller than page_body_median
  2. Previous block is a Figure
  3. Vertical distance to the figure is less than 2 * line_height
  4. Block is in the same column as the figure (checked only when num_columns > 1)
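
The four checks can be sketched as a single predicate. This is a hedged illustration rather than the code in caption.rs: `BlockInfo`, `BlockKind`, `PageContext`, and `is_caption` are hypothetical names. It assumes that y grows upward (so the gap is the figure's bottom minus the candidate's top), that line_height is the candidate's own line height, and that a negative gap is rejected so above-figure text never qualifies.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BlockKind {
    Figure,
    Text,
}

/// Hypothetical per-block summary; the real Block type differs.
#[derive(Clone, Copy, Debug)]
pub struct BlockInfo {
    pub kind: BlockKind,
    pub top: f64,
    pub bottom: f64,
    pub font_size: f64,
    pub line_height: f64,
    pub column: usize,
}

/// Hypothetical page-level statistics.
pub struct PageContext {
    pub body_median_font_size: f64,
    pub num_columns: usize,
}

pub fn is_caption(
    candidate: &BlockInfo,
    previous: Option<&BlockInfo>,
    page: &PageContext,
) -> bool {
    // (2) The previous block must exist and be a Figure.
    let Some(prev) = previous else { return false };
    if prev.kind != BlockKind::Figure {
        return false;
    }

    // (1) The font must be strictly smaller than the page's body median.
    if candidate.font_size >= page.body_median_font_size {
        return false;
    }

    // (3) The candidate must sit below the figure, closer than two line heights.
    let gap = prev.bottom - candidate.top;
    if gap < 0.0 || gap >= 2.0 * candidate.line_height {
        return false;
    }

    // (4) The column check applies only to multi-column pages.
    page.num_columns <= 1 || candidate.column == prev.column
}
```

Ordering the checks from cheapest and most selective (previous block kind) to most contextual (column membership) keeps the early exits easy to read and mirrors the negative test cases one by one.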

Acceptance Criteria Verification:

| Criterion | Test | Status |
|---|---|---|
| Small font + follows Figure + within 2 lines + same column → Caption | test_caption_immediately_below_figure | PASS |
| Caption 5 lines below → NOT Caption | test_caption_too_far_below_figure | PASS |
| Caption different column → NOT Caption | test_caption_different_column | PASS |
| Font not smaller than body → NOT Caption | test_caption_font_not_smaller | PASS |
| No previous Figure → NOT Caption | test_no_previous_figure | PASS |

Test Results

Figure tests: 16 passed; 0 failed
Caption tests: 8 passed; 0 failed

Key Implementation Details

INV (Invariants)

  • Figure blocks carry no text lines (the invariant is stated as lines=[], but Block uses text: String rather than a lines Vec)
  • Figure blocks have median_font_size: 0.0
  • Caption blocks have kind: "caption" set via set_caption()

Critical Considerations Addressed

  • Text overlap union algorithm: Uses sweep line for accurate union area (not naive sum)
  • Sorting: Figures sorted by top Y descending for consistent page order
  • Column assignment: still a TODO; a comment marks column assignment based on image center as future work
  • Above-figure captions: NOT detected in v0.1.0 (as specified in bead)
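
The sorting consideration can be illustrated with a short sketch. It assumes the default PDF coordinate system, where y grows upward, so a descending sort on top Y lists figures from the top of the page down. `FigureBox` and `sort_top_down` are hypothetical names, not identifiers from figure.rs.

```rust
/// Hypothetical figure summary holding only the vertical extent.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct FigureBox {
    pub top: f64,
    pub bottom: f64,
}

/// Sort figures by top Y, descending. `total_cmp` gives a total order on
/// f64, and `sort_by` is stable, so ties keep their extraction order.
pub fn sort_top_down(figures: &mut [FigureBox]) {
    figures.sort_by(|a, b| b.top.total_cmp(&a.top));
}
```

Using `total_cmp` rather than `partial_cmp(...).unwrap()` avoids a panic if a malformed bbox ever produces a NaN coordinate.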

Files Modified

None - implementation was already complete

Retrospective

What worked

  • The existing implementation is clean, well-tested, and follows the bead specification exactly
  • Sweep line algorithm for text overlap union is mathematically correct
  • Test coverage is comprehensive with edge cases (thresholds, empty contexts, multiple figures)

What didn't

  • N/A - implementation was already complete and passing

Surprise

  • The bead was already fully implemented despite being in the ready queue
  • Both modules share a common Block type, re-exported via pub use from caption.rs

Reusable pattern

  • The sweep line algorithm in compute_text_overlap_area is a reusable pattern for union rectangle area computation
  • The classify_caption sequence of checks, (1) font metric, (2) spatial relationship, (3) column membership, is a template for other block classifiers