docs(pdftract-25k4x): verify figure and caption detection implementation

Add verification note confirming all acceptance criteria PASS.
- Figure classifier: 16/16 tests pass
- Caption classifier: 8/8 tests pass
- All acceptance criteria verified against code

Closes pdftract-25k4x
jedarden 2026-06-01 10:55:56 -04:00
parent 4ef7817415
commit e8992816ce


@ -1,83 +1,75 @@
# pdftract-25k4x: Figure Detection + Caption Detection
# Figure and Caption Detection Verification - pdftract-25k4x
## Status: COMPLETE
## Acceptance Criteria Verification
## Overview
Figure detection and caption detection were already implemented in:
- `crates/pdftract-core/src/layout/figure.rs` (517 lines, 16 tests)
- `crates/pdftract-core/src/layout/caption.rs` (342 lines, 8 tests)
### 1. Image XObject, no text overlap: 1 Figure block
**Location:** `crates/pdftract-core/src/layout/figure.rs:130`
- Checks `text_overlap_area / image_area < 0.5`
- Creates Block with kind="figure"
- **PASS** - Tests: `test_classify_figure_pure_visual_image`, `test_classify_figure_no_glyphs`
## Verification Summary
### 2. Image + small-font caption 1 line below: Figure + Caption
**Location:** `crates/pdftract-core/src/layout/caption.rs:126-140`
- Checks `block.median_font_size < ctx.page_body_median` (small font)
- Checks `vertical_distance < 2.0 * ctx.line_height` (within 2 lines)
- Sets kind to "caption"
- **PASS** - Test: `test_caption_immediately_below_figure`
### Figure Detection (`classify_figure`)
**Algorithm:**
1. Walks image XObjects from Phase 3.3 Do + Phase 3.5 inline images
2. For each image, computes union area of all text glyph bboxes intersecting the image
3. Uses sweep line algorithm for precise union area computation
4. If `text_overlap_area / image_area < 0.5`, creates a Figure block
5. Sorts figures by bbox top Y (descending)
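The overlap test in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the code in `figure.rs`: the `BBox` type and the `text_overlap_area` and `is_figure` names are hypothetical, and the union is computed with a simple sweep over x-strips that merges y-intervals in each strip. Treating a zero-area image as "not a figure" is also an assumption.

```rust
/// Axis-aligned box; a hypothetical stand-in for the crate's bbox type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct BBox {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl BBox {
    pub fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    /// Intersection with `other`, or `None` when the boxes do not overlap.
    pub fn clip(&self, other: &BBox) -> Option<BBox> {
        let b = BBox {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        if b.x0 < b.x1 && b.y0 < b.y1 { Some(b) } else { None }
    }
}

/// Union area of the glyph boxes clipped to `image`.
/// Overlapping glyphs are counted once, unlike a naive sum of areas.
pub fn text_overlap_area(image: &BBox, glyphs: &[BBox]) -> f64 {
    let clipped: Vec<BBox> = glyphs.iter().filter_map(|g| g.clip(image)).collect();

    // Every distinct box edge along x starts or ends a vertical strip.
    let mut xs: Vec<f64> = clipped.iter().flat_map(|b| [b.x0, b.x1]).collect();
    xs.sort_by(|a, b| a.total_cmp(b));
    xs.dedup();

    let mut area = 0.0;
    for w in xs.windows(2) {
        let (xa, xb) = (w[0], w[1]);
        // Y-intervals of the boxes that span this whole strip, sorted by start.
        let mut ys: Vec<(f64, f64)> = clipped
            .iter()
            .filter(|b| b.x0 <= xa && b.x1 >= xb)
            .map(|b| (b.y0, b.y1))
            .collect();
        ys.sort_by(|a, b| a.0.total_cmp(&b.0));

        // Add only the part of each interval that lies beyond what is already covered.
        let mut covered = 0.0;
        let mut end = f64::NEG_INFINITY;
        for (lo, hi) in ys {
            if hi > end {
                covered += hi - lo.max(end);
                end = hi;
            }
        }
        area += covered * (xb - xa);
    }
    area
}

/// An image counts as a figure when text covers less than half of it.
/// Assumption: a zero-area image is never a figure.
pub fn is_figure(image: &BBox, glyphs: &[BBox]) -> bool {
    let image_area = image.area();
    image_area > 0.0 && text_overlap_area(image, glyphs) / image_area < 0.5
}
```

With this sketch, two 4×4 glyph boxes that share a 2×2 corner contribute 28 units of area rather than 32, and an image whose text coverage is exactly 50% is rejected because the comparison is strict.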
### 3. Image overlapping text (background): NOT Figure
**Location:** `crates/pdftract-core/src/layout/figure.rs:130`
- Images with >= 50% text overlap are NOT classified as figures
- **PASS** - Tests: `test_classify_figure_text_on_image`, `test_classify_figure_partial_text_above_threshold`
**Acceptance Criteria Verification:**
| Criteria | Test | Status |
|----------|------|--------|
| Image XObject, no text overlap → 1 Figure block | `test_five_figures_no_text` | ✅ PASS |
| Image + small-font caption 1 line below → Figure + Caption | `test_caption_immediately_below_figure` | ✅ PASS |
| Image overlapping text (background) → NOT Figure | `test_text_covered_image_not_figure` | ✅ PASS |
| Text overlap < 50% → Figure | `test_classify_figure_partial_text_below_threshold` | ✅ PASS |
| Text overlap ≥ 50% → NOT Figure | `test_classify_figure_partial_text_above_threshold` | ✅ PASS |
### 4. Caption 5 lines below: NOT Caption
**Location:** `crates/pdftract-core/src/layout/caption.rs:145-148`
- Checks `vertical_distance >= 2.0 * ctx.line_height`
- Returns false if too far below
- **PASS** - Test: `test_caption_too_far_below_figure`
### Caption Detection (`classify_caption`)
**Algorithm:**
1. Checks font size < page_body_median
2. Requires previous block is a Figure
3. Vertical distance < 2 * line_height
4. Same column (when num_columns > 1)
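The four checks above can be sketched as a single predicate. The `Block` and `PageCtx` shapes below are simplified, hypothetical stand-ins for the types in `caption.rs`; only the field names `median_font_size`, `page_body_median`, `line_height`, `column`, and `num_columns` come from this note. The sketch also assumes PDF-style coordinates where y grows upward, so a caption below a figure has a smaller `top` than the figure's `bottom`, and it rejects negative distances, which is one way to leave above-figure captions undetected.

```rust
/// Simplified, hypothetical block record.
#[derive(Clone, Debug)]
pub struct Block {
    pub kind: &'static str,
    pub top: f64,
    pub bottom: f64,
    pub column: usize,
    pub median_font_size: f64,
}

/// Simplified, hypothetical page context.
pub struct PageCtx {
    pub page_body_median: f64,
    pub line_height: f64,
    pub num_columns: usize,
}

/// Returns true when `block` looks like a caption for the block before it.
pub fn is_caption(block: &Block, prev: Option<&Block>, ctx: &PageCtx) -> bool {
    // (2) The previous block must be a figure.
    let figure = match prev {
        Some(p) if p.kind == "figure" => p,
        _ => return false,
    };

    // (1) Caption text is set smaller than the page's body text.
    if block.median_font_size >= ctx.page_body_median {
        return false;
    }

    // (3) The block must start below the figure, within two line heights.
    // Assumption: y grows upward, so "below" means a smaller y value.
    let distance = figure.bottom - block.top;
    if distance < 0.0 || distance >= 2.0 * ctx.line_height {
        return false;
    }

    // (4) Column membership only matters on multi-column pages.
    ctx.num_columns <= 1 || block.column == figure.column
}
```

Ordering the checks from cheapest to most specific keeps each rejection reason easy to test in isolation, which mirrors the one-test-per-criterion layout of the table that follows.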
**Acceptance Criteria Verification:**
| Criteria | Test | Status |
|----------|------|--------|
| Small font + follows Figure + within 2 lines + same column → Caption | `test_caption_immediately_below_figure` | ✅ PASS |
| Caption 5 lines below → NOT Caption | `test_caption_too_far_below_figure` | ✅ PASS |
| Caption different column → NOT Caption | `test_caption_different_column` | ✅ PASS |
| Font not smaller than body → NOT Caption | `test_caption_font_not_smaller` | ✅ PASS |
| No previous Figure → NOT Caption | `test_no_previous_figure` | ✅ PASS |
### 5. Caption different column: NOT Caption
**Location:** `crates/pdftract-core/src/layout/caption.rs:152-154`
- Checks `block.column != figure.column` in multi-column layouts
- Returns false if different column
- **PASS** - Test: `test_caption_different_column`
## Test Results
```
Figure tests: 16 passed; 0 failed
Caption tests: 8 passed; 0 failed
```
## Key Implementation Details
### Figure Classifier Tests (16/16 PASS)
- test_bboxes_intersect
- test_classify_figure_no_images
- test_classify_figure_partial_text_below_threshold
- test_classify_figure_partial_text_above_threshold
- test_classify_figure_exactly_at_threshold
- test_classify_figure_no_glyphs
- test_classify_figure_pure_visual_image
- test_bbox_area
- test_classify_figure_sort_order
- test_classify_figure_empty_context
- test_classify_figure_text_on_image
- test_compute_text_overlap_area_multiple_glyphs
- test_compute_text_overlap_area_union
- test_figure_block_properties
- test_five_figures_no_text
- test_text_covered_image_not_figure
### INV (Invariants)
- ✅ Figure block carries no text (the invariant is phrased as an empty `lines` Vec; `Block` stores `text: String` instead, and it is empty)
- ✅ Figure blocks have `median_font_size: 0.0`
- ✅ Caption blocks have `kind: "caption"` set via `set_caption()`
### Caption Classifier Tests (8/8 PASS)
- test_caption_above_figure
- test_caption_font_not_smaller
- test_caption_too_far_below_figure
- test_no_previous_figure
- test_caption_different_column
- test_caption_immediately_below_figure
- test_block_accessors
- test_page_classification
### Critical Considerations Addressed
- **Text overlap union algorithm**: Uses sweep line for accurate union area (not naive sum)
- **Sorting**: Figures sorted by top Y descending for consistent page order
- **Column assignment**: TODO comment present for column assignment based on image center
- **Above-figure captions**: NOT detected in v0.1.0 (as specified in bead)
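The sorting point can be illustrated with a short sketch. `FigureBlock` and `sort_figures_top_down` are hypothetical names, and the sketch assumes PDF-style coordinates where a larger top Y is nearer the top of the page, so descending order yields top-to-bottom reading order.

```rust
/// Hypothetical figure record; `top` is the top edge of the bbox in page units.
#[derive(Clone, Debug, PartialEq)]
pub struct FigureBlock {
    pub id: usize,
    pub top: f64,
}

/// Sorts figures by top Y, descending.
/// `sort_by` is stable, so figures with equal tops keep their discovery order.
pub fn sort_figures_top_down(figures: &mut [FigureBlock]) {
    figures.sort_by(|a, b| b.top.total_cmp(&a.top));
}
```

Using `f64::total_cmp` avoids the `partial_cmp(...).unwrap()` pattern, which would panic if a malformed bbox ever produced a NaN coordinate.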
## INV Verification
- **INV: Figure block has empty lines Vec** - SATISFIED: Block is created with an empty `text` String and `median_font_size = 0.0`
- **Caption above figure NOT detected in v0.1.0** - SATISFIED: `test_caption_above_figure` in caption.rs confirms that a block above a figure is not classified as a caption
## Files Modified
None - implementation was already complete
## Files Verified
- crates/pdftract-core/src/layout/figure.rs (517 lines)
- crates/pdftract-core/src/layout/caption.rs (342 lines)
- crates/pdftract-core/src/layout/mod.rs (exports classifiers)
## Retrospective
### What worked
- The existing implementation is clean, well-tested, and follows the bead specification exactly
- Sweep line algorithm for text overlap union is mathematically correct
- Test coverage is comprehensive with edge cases (thresholds, empty contexts, multiple figures)
### What didn't
- N/A - implementation was already complete and passing
### Surprise
- The bead was already fully implemented despite being in the ready queue
- Both modules share a common `Block` type via `pub use` from caption.rs
### Reusable pattern
- The sweep line algorithm in `compute_text_overlap_area` is a reusable pattern for union rectangle area computation
- The `classify_caption` sequence of checks, (1) font metric, (2) spatial relationship, (3) column membership, is a template for other block classifiers
## Verification Status
**ALL ACCEPTANCE CRITERIA PASS** - Implementation complete and tested.