pdftract/notes/pdftract-fy89c.md
jedarden 508ca5d0bb feat(pdftract-fy89c): implement line-to-block heuristic detector with 5 ordered triggers
Implement Phase 4.4 block formation with 5 ordered heuristics for grouping
lines into semantic blocks (paragraphs, headings, etc.):

1. Vertical gap > 1.5 * line_height → new block
2. Indent change > 0.03 * column_width → new block
3. Font size change > 1pt → new block
4. Rendering mode change → new block
5. Column boundary → MANDATORY block break

Changes:
- Extended Line<S> with median_font_size, rendering_mode, column fields
- Added LineMetadata trait for abstracting line representations
- Added Block<S> and BlockInput<L> structs for block representation
- Implemented group_lines_into_blocks() with column-aware sorting

All acceptance criteria tests pass (21/21).

Closes: pdftract-fy89c
2026-05-24 06:14:43 -04:00


# Verification Note: pdftract-fy89c
## Bead
Line-to-block heuristic detector (5 break triggers in order)
## Implementation
### Files Modified
- `crates/pdftract-core/src/layout/line.rs`
- `crates/pdftract-core/src/layout/mod.rs`
### Changes Made
1. **Extended `Line<S>` struct** with new fields:
   - `median_font_size: f32` - median font size of spans in the line
   - `rendering_mode: Option<u32>` - PDF text rendering mode (Tr operator)
   - `column: Option<usize>` - column index assigned by Phase 4.3
2. **Added `LineMetadata` trait** - abstracts over different line representations for block formation
3. **Added `Block<S>` struct** - represents a block of text composed of one or more lines
4. **Added `BlockInput<L>` struct** - internal block representation used during formation
5. **Implemented `group_lines_into_blocks()` function** with 5 ordered heuristics:
   - **Trigger 1:** Vertical gap > 1.5 * line_height → new block
   - **Trigger 2:** Indent change > 0.03 * column_width → new block
   - **Trigger 3:** Font size change > 1pt → new block
   - **Trigger 4:** Rendering mode change → new block
   - **Trigger 5:** Column boundary → MANDATORY block break
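
The five triggers can be read as one ordered predicate. The following is a minimal sketch, not the code in `line.rs`: `LineInfo`, `BlockState`, `BreakReason`, and `break_reason` are hypothetical names. It assumes that `line_height` and `column_width` are supplied by the caller, that indent and font size are compared as absolute differences against the current block's tracked values, and that the triggers are checked in the listed order.

```rust
/// Hypothetical, simplified view of the per-line fields the triggers read.
#[derive(Debug, Clone, Copy)]
struct LineInfo {
    x0: f32,
    baseline: f32,
    median_font_size: f32,
    rendering_mode: Option<u32>,
    column: Option<usize>,
}

/// Hypothetical snapshot of the block currently being built.
#[derive(Debug, Clone, Copy)]
struct BlockState {
    avg_x0: f32,
    median_font_size: f32,
    rendering_mode: Option<u32>,
    column: Option<usize>,
    last_baseline: f32,
}

/// Which trigger fired (names are illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BreakReason {
    VerticalGap,
    IndentChange,
    FontSizeChange,
    RenderingModeChange,
    ColumnBoundary,
}

/// Returns the first trigger that fires for `line`, or `None` if the line
/// should be appended to the current block.
fn break_reason(
    block: &BlockState,
    line: &LineInfo,
    line_height: f32,
    column_width: f32,
) -> Option<BreakReason> {
    // Trigger 1: previous baseline minus current baseline exceeds 1.5 line heights.
    let gap = block.last_baseline - line.baseline;
    if gap > 1.5 * line_height {
        return Some(BreakReason::VerticalGap);
    }
    // Trigger 2: left edge moved by more than 3% of the column width (assumed absolute).
    if (line.x0 - block.avg_x0).abs() > 0.03 * column_width {
        return Some(BreakReason::IndentChange);
    }
    // Trigger 3: font size differs by more than 1pt (assumed absolute).
    if (line.median_font_size - block.median_font_size).abs() > 1.0 {
        return Some(BreakReason::FontSizeChange);
    }
    // Trigger 4: text rendering mode differs.
    if line.rendering_mode != block.rendering_mode {
        return Some(BreakReason::RenderingModeChange);
    }
    // Trigger 5: column differs; this break is mandatory.
    if line.column != block.column {
        return Some(BreakReason::ColumnBoundary);
    }
    None
}
```

Because every trigger ends the block, the order only affects which reason is reported, not whether a break happens.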
### Key Implementation Details
- Lines are sorted by (column ASC, baseline DESC) before processing, so each column is visited top to bottom (baselines decrease down the page)
- Column changes are MANDATORY block breaks (per the invariant, INV, in the bead description)
- Line height is computed as baseline-to-baseline distance
- Vertical gap is computed as previous baseline minus current baseline
- Block state (avg_x0, median_font_size, rendering_mode, column) is tracked per block
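
The sort order and gap computation described above can be sketched as follows. This is an illustration under assumptions, not the repository's code: `SortableLine`, `reading_order`, `sort_for_block_formation`, and `vertical_gap` are hypothetical names, and lines without a column (`None`) sort before numbered columns only because that is how `Option` orders in Rust.

```rust
use std::cmp::Ordering;

/// Hypothetical minimal line: only the two fields the sort key needs.
#[derive(Debug, Clone, PartialEq)]
struct SortableLine {
    column: Option<usize>,
    baseline: f32,
}

/// Orders by column ascending, then baseline descending.
/// `f32::total_cmp` gives a total order, so NaN baselines cannot panic the sort.
fn reading_order(a: &SortableLine, b: &SortableLine) -> Ordering {
    a.column
        .cmp(&b.column)
        .then_with(|| b.baseline.total_cmp(&a.baseline))
}

/// Sorts lines in place into the order used for block formation.
fn sort_for_block_formation(lines: &mut [SortableLine]) {
    lines.sort_by(reading_order);
}

/// Vertical gap between consecutive lines: previous baseline minus current.
/// Positive when `cur` sits below `prev` on the page.
fn vertical_gap(prev: &SortableLine, cur: &SortableLine) -> f32 {
    prev.baseline - cur.baseline
}
```

After this sort, a column change always appears between two adjacent lines, which is what lets the column check run as a simple comparison against the current block's column.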
### Tests Added
All acceptance criteria tests pass:
1. `test_five_lines_equal_spacing_one_block` - 5 lines with equal spacing/font → 1 block ✓
2. `test_thirty_pt_gap_creates_two_blocks` - 30pt gap → 2 blocks ✓
3. `test_heading_18pt_above_12pt_body_two_blocks` - Font size change (18pt vs 12pt) → 2 blocks ✓
4. `test_two_column_separate_blocks` - Column boundary → 2 blocks ✓
5. `test_indented_first_line_new_block` - Indent change (>9pt offset, 300pt column_width) → 2 blocks ✓
6. `test_rendering_mode_change_creates_new_block` - Rendering mode change → 2 blocks ✓
7. `test_empty_lines_returns_empty_blocks` - Empty input → empty blocks ✓
8. `test_single_line_returns_single_block` - Single line → single block ✓
9. `test_lines_sorted_by_column_then_baseline` - Sorting verification ✓
## Acceptance Criteria
- [PASS] 5 lines equal spacing/font: 1 block
- [PASS] 5 lines, 30pt gap, 5 more: 2 blocks
- [PASS] Heading 18pt above 12pt body: 2 blocks
- [PASS] Two-column: lines in col 0 separate from col 1
- [PASS] Indented first line (>9pt offset, 300pt column_width): NEW BLOCK starts
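
The numbers in these criteria follow from the trigger thresholds: a 300pt column gives an indent threshold of 0.03 × 300 = 9pt, and an 18pt heading differs from 12pt body text by 6pt, well over the 1pt limit. A small sketch of that arithmetic is below. The function and constant names are hypothetical, and the 14pt line height used for the gap case is an assumed value that this note does not state.

```rust
/// Trigger 1 threshold: 1.5 times the line height, in points.
fn gap_threshold(line_height: f32) -> f32 {
    1.5 * line_height
}

/// Trigger 2 threshold: 3% of the column width, in points.
fn indent_threshold(column_width: f32) -> f32 {
    0.03 * column_width
}

/// Trigger 3 threshold: font size difference in points.
const FONT_SIZE_THRESHOLD_PT: f32 = 1.0;

/// Checks the three numeric acceptance scenarios against their thresholds.
/// `assumed_line_height` is a stand-in value, not taken from the test suite.
fn acceptance_scenarios_break(assumed_line_height: f32) -> (bool, bool, bool) {
    let gap_breaks = 30.0 > gap_threshold(assumed_line_height);
    let heading_breaks = (18.0_f32 - 12.0).abs() > FONT_SIZE_THRESHOLD_PT;
    // An offset of 10pt stands in for the ">9pt offset" in the criterion.
    let indent_breaks = 10.0 > indent_threshold(300.0);
    (gap_breaks, heading_breaks, indent_breaks)
}
```

With the assumed 14pt line height the gap threshold is 21pt, so the 30pt gap breaks; the result would differ for line heights of 20pt or more.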
## Gates Passed
- [PASS] `cargo check --all-targets`
- [PASS] `cargo fmt`
- [PASS] `cargo test --package pdftract-core --lib layout::line` (21/21 tests passed)
## References
- Plan section: Phase 4.4 Heuristics (lines 1694-1699)
- Bead ID: pdftract-fy89c