pdftract/notes/pdftract-6bwq4.md
jedarden a14787794c feat(pdftract-6bwq4): implement baseline clustering algorithm
Implement cluster_spans_into_lines for Phase 4.2 line formation.
Groups spans into lines by baseline proximity using threshold 0.5 * median_font_size.

- Add HasFontSize trait for types with font_size
- Implement cluster_spans_into_lines function
  - Compute baseline for each span
  - Sort by baseline ASC
  - Sweep and cluster within threshold
  - Emit Line per cluster
  - Sort spans by x0 within each line
- Add finalize_line_cluster helper
- Export new items from layout module

Tests: All 11 acceptance criteria tests pass
- Spans baselines 100, 100.5, 105 with median 12: one line
- Spans baselines 100, 110 with median 12: two lines
- Superscript stays on same line as base text
- Empty input produces empty output
- Threshold is 0.5 * median_font_size (not hardcoded)

Closes: pdftract-6bwq4
2026-05-24 10:39:01 -04:00


# pdftract-6bwq4: Baseline clustering algorithm implementation
## Summary
Implemented the `cluster_spans_into_lines` function for Phase 4.2 line formation. It groups spans into lines by baseline proximity, using a threshold of `0.5 * median_font_size`.
## Changes Made
### crates/pdftract-core/src/layout/line.rs
- Added `HasFontSize` trait for types that have font_size
- Implemented `cluster_spans_into_lines<S>(spans: Vec<S>, median_font_size: f32) -> Vec<Line<S>>`
  - Computes baseline for each span using existing `compute_baseline` function
  - Sorts spans by baseline ASC
  - Sweeps through spans, clustering those within threshold (`0.5 * median_font_size`)
  - Emits one `Line` per cluster
  - Sorts spans by x0 within each line (left-to-right)
  - Computes line metadata: union bbox, average baseline, median font size
- Added `finalize_line_cluster` helper function
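
The sort-and-sweep steps listed for `cluster_spans_into_lines` can be sketched as follows. This is a minimal illustration, not the code in `line.rs`: the `Span` struct and `cluster_by_baseline` name are hypothetical, lines are returned as plain `Vec<Span>` rather than `Line<S>`, and the sketch assumes each span is compared against the previous span's baseline in sorted order, with a gap equal to the threshold still counting as the same line. The notes do not say which reference baseline or comparison the real function uses.

```rust
/// Hypothetical minimal span; the real span type is not shown in these notes.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub x0: f32,
    pub baseline: f32,
    pub font_size: f32,
}

/// Groups spans into lines by baseline proximity.
/// Assumption: a span joins the current cluster when its baseline is within
/// `threshold` of the previous span's baseline in sorted order.
pub fn cluster_by_baseline(mut spans: Vec<Span>, median_font_size: f32) -> Vec<Vec<Span>> {
    // The threshold is derived from the median font size, never hardcoded.
    let threshold = 0.5 * median_font_size;

    // Sort by baseline ascending; total_cmp gives a total order on f32.
    spans.sort_by(|a, b| a.baseline.total_cmp(&b.baseline));

    let mut lines: Vec<Vec<Span>> = Vec::new();
    let mut current: Vec<Span> = Vec::new();

    for span in spans {
        // Close the current cluster when the gap to the previous baseline
        // exceeds the threshold.
        let starts_new_line = match current.last() {
            Some(prev) => span.baseline - prev.baseline > threshold,
            None => false,
        };
        if starts_new_line {
            lines.push(std::mem::take(&mut current));
        }
        current.push(span);
    }
    if !current.is_empty() {
        lines.push(current);
    }

    // Within each line, order spans left to right.
    for line in &mut lines {
        line.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
    lines
}
```

Under these assumptions, baselines 100, 100.5 and 105 with a median font size of 12 form one cluster, while 100 and 110 split into two, matching the acceptance cases below. Comparing against the previous span rather than the cluster's first span allows gradual drift to chain spans together; whether that is desirable depends on the real implementation's choice.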
### crates/pdftract-core/src/layout/mod.rs
- Exported `HasFontSize` trait and `cluster_spans_into_lines` function
## Tests Added
All acceptance criteria tests pass:
1. `test_cluster_spans_baselines_100_100_5_105_median_12_one_line` - Spans baselines 100, 100.5, 105 with median 12 (threshold 6): all one line. PASS
2. `test_cluster_spans_baselines_100_110_median_12_two_lines` - Spans baselines 100, 110 with median 12 (threshold 6): two lines (delta 10 > 6). PASS
3. `test_cluster_spans_superscript_stays_on_same_line` - Superscript baseline 105, line baseline 100, font size 12: same line (delta 5 is within 0.5 * 12 = 6). PASS
4. `test_cluster_spans_empty_input_empty_output` - Empty input: empty output. PASS
5. `test_cluster_spans_threshold_is_0_5_times_median_font_size` - Invariant: threshold = 0.5 * median_font_size; it must NOT be hardcoded. PASS
6. `test_cluster_spans_sorted_by_x0_within_line` - Spans within a line sorted by x0. PASS
7. `test_cluster_spans_two_column_at_same_y_one_line` - Two-column at same y: cluster into one Line. PASS
8. `test_cluster_spans_union_bbox_computed_correctly` - Union bbox computed correctly. PASS
9. `test_cluster_spans_baseline_computed_as_average` - Baseline is average of member span baselines. PASS
10. `test_cluster_spans_median_font_size_computed` - Median font size computed from line spans. PASS
11. `test_cluster_spans_single_span_single_line` - Single span produces single line. PASS
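
Tests 8 to 10 cover the per-line metadata (union bbox, average baseline, median font size). A rough sketch of how such values could be computed is shown here. The `BoxSpan` and `LineMeta` types, their field names, and the `line_metadata` function are hypothetical and are not the `finalize_line_cluster` helper itself. The sketch also assumes that, for an even number of spans, the median is the mean of the two middle font sizes; the notes do not specify this.

```rust
/// Hypothetical span with a bounding box; field names are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct BoxSpan {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
    pub baseline: f32,
    pub font_size: f32,
}

/// Hypothetical container for the metadata of one line.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LineMeta {
    /// Union bounding box as (x0, y0, x1, y1).
    pub bbox: (f32, f32, f32, f32),
    /// Mean of the member span baselines.
    pub baseline: f32,
    /// Median of the member span font sizes.
    pub font_size: f32,
}

/// Computes line metadata for a non-empty cluster; returns None when empty.
pub fn line_metadata(spans: &[BoxSpan]) -> Option<LineMeta> {
    if spans.is_empty() {
        return None;
    }

    // Union bbox: smallest rectangle containing every span's bbox.
    let init = (
        f32::INFINITY,
        f32::INFINITY,
        f32::NEG_INFINITY,
        f32::NEG_INFINITY,
    );
    let bbox = spans.iter().fold(init, |(x0, y0, x1, y1), s| {
        (x0.min(s.x0), y0.min(s.y0), x1.max(s.x1), y1.max(s.y1))
    });

    // Average baseline over the member spans.
    let baseline = spans.iter().map(|s| s.baseline).sum::<f32>() / spans.len() as f32;

    // Median font size; assumption: mean of the two middle values when even.
    let mut sizes: Vec<f32> = spans.iter().map(|s| s.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    let mid = sizes.len() / 2;
    let font_size = if sizes.len() % 2 == 1 {
        sizes[mid]
    } else {
        (sizes[mid - 1] + sizes[mid]) / 2.0
    };

    Some(LineMeta { bbox, baseline, font_size })
}
```

Returning `Option` keeps the empty case explicit instead of producing an infinite bbox or dividing by zero; the real helper may handle this differently, for example by never being called on an empty cluster.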
## Verification
- `cargo test -p pdftract-core --lib layout::line`: 32 tests passed
- `cargo check -p pdftract-core --lib`: Compiles successfully
- `cargo fmt -p pdftract-core`: Code formatted
## References
- Plan: Phase 4.2 Algorithm step 2 (line 1667)
- Bead: pdftract-6bwq4