pdftract/notes/pdftract-6bwq4.md
jedarden a14787794c feat(pdftract-6bwq4): implement baseline clustering algorithm
Implement cluster_spans_into_lines for Phase 4.2 line formation.
Groups spans into lines by baseline proximity using threshold 0.5 * median_font_size.

- Add HasFontSize trait for types with font_size
- Implement cluster_spans_into_lines function
  - Compute baseline for each span
  - Sort by baseline ASC
  - Sweep and cluster within threshold
  - Emit Line per cluster
  - Sort spans by x0 within each line
- Add finalize_line_cluster helper
- Export new items from layout module

Tests: All 11 acceptance criteria tests pass
- Spans baselines 100, 100.5, 105 with median 12: one line
- Spans baselines 100, 110 with median 12: two lines
- Superscript stays on same line as base text
- Empty input produces empty output
- Threshold is 0.5 * median_font_size (not hardcoded)

Closes: pdftract-6bwq4
2026-05-24 10:39:01 -04:00


pdftract-6bwq4: Baseline clustering algorithm implementation

Summary

Implemented the cluster_spans_into_lines function for Phase 4.2 line formation. It groups spans into lines by baseline proximity, using a threshold of 0.5 * median_font_size (for example, 6 when the median font size is 12).

Changes Made

crates/pdftract-core/src/layout/line.rs

  • Added HasFontSize trait for types that have font_size
  • Implemented cluster_spans_into_lines<S>(spans: Vec<S>, median_font_size: f32) -> Vec<Line<S>>
    • Computes baseline for each span using existing compute_baseline function
    • Sorts spans by baseline ASC
    • Sweeps through spans, clustering those within threshold (0.5 * median_font_size)
    • Emits one Line per cluster
    • Sorts spans by x0 within each line (left-to-right)
    • Computes line metadata: union bbox, average baseline, median font size
  • Added finalize_line_cluster helper function
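
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in line.rs: `Span` and `cluster_sketch` are hypothetical stand-ins, the baseline is stored on the span rather than computed by compute_baseline, and each span is compared against the previous span in sorted order, which is an assumption because the note does not say what the threshold is measured against.

```rust
/// Hypothetical stand-in for the crate's span type (not the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub x0: f32,
    pub baseline: f32,
}

/// Sketch of the sweep: returns one Vec<Span> per line, each sorted by x0.
pub fn cluster_sketch(mut spans: Vec<Span>, median_font_size: f32) -> Vec<Vec<Span>> {
    // The threshold is derived from the median font size, never hardcoded.
    let threshold = 0.5 * median_font_size;

    // Sort by baseline ascending; total_cmp gives a total order over f32.
    spans.sort_by(|a, b| a.baseline.total_cmp(&b.baseline));

    let mut lines: Vec<Vec<Span>> = Vec::new();
    // Baseline of the previously placed span (assumed comparison point).
    let mut prev: Option<f32> = None;

    for span in spans {
        let starts_new_line = match prev {
            Some(p) => span.baseline - p > threshold,
            None => true,
        };
        prev = Some(span.baseline);
        if starts_new_line {
            lines.push(vec![span]);
        } else if let Some(line) = lines.last_mut() {
            // A previous span exists, so there is always a current line here.
            line.push(span);
        }
    }

    // Order the spans of each line left to right.
    for line in &mut lines {
        line.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
    lines
}
```

Comparing against the previous span lets a long run of slightly drifting baselines chain into one line. Comparing against the cluster's first or average baseline would be stricter. Which of these the real function uses is not stated here.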

crates/pdftract-core/src/layout/mod.rs

  • Exported HasFontSize trait and cluster_spans_into_lines function

Tests Added

All acceptance criteria tests pass:

  1. test_cluster_spans_baselines_100_100_5_105_median_12_one_line - Spans at baselines 100, 100.5, 105 with median 12 (threshold 6): all on one line. PASS
  2. test_cluster_spans_baselines_100_110_median_12_two_lines - Spans at baselines 100 and 110 with median 12 (threshold 6): two lines (delta 10 > 6). PASS
  3. test_cluster_spans_superscript_stays_on_same_line - Superscript at baseline 105, line baseline 100, font size 12: same line (delta 5 is within threshold 6). PASS
  4. test_cluster_spans_empty_input_empty_output - Empty input: empty output. PASS
  5. test_cluster_spans_threshold_is_0_5_times_median_font_size - Invariant: the threshold is computed as 0.5 * median_font_size and is not hardcoded. PASS
  6. test_cluster_spans_sorted_by_x0_within_line - Spans within a line sorted by x0. PASS
  7. test_cluster_spans_two_column_at_same_y_one_line - Spans from two columns at the same y cluster into one Line. PASS
  8. test_cluster_spans_union_bbox_computed_correctly - Union bbox computed correctly. PASS
  9. test_cluster_spans_baseline_computed_as_average - Baseline is average of member span baselines. PASS
  10. test_cluster_spans_median_font_size_computed - Median font size computed from line spans. PASS
  11. test_cluster_spans_single_span_single_line - Single span produces single line. PASS
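
Tests 8 to 10 check the per-line metadata. The sketch below shows one way those three values could be computed. It is not the body of finalize_line_cluster: `BBox`, `Span`, `LineMeta`, and `line_meta_sketch` are hypothetical, the `font_size` method on HasFontSize is an assumed signature, and averaging the two middle values for an even count is an assumed median convention.

```rust
/// Hypothetical bounding box; the field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Hypothetical span type for this sketch only.
#[derive(Debug, Clone)]
pub struct Span {
    pub bbox: BBox,
    pub baseline: f32,
    pub font_size: f32,
}

/// Trait named in the notes; this method signature is an assumption.
pub trait HasFontSize {
    fn font_size(&self) -> f32;
}

impl HasFontSize for Span {
    fn font_size(&self) -> f32 {
        self.font_size
    }
}

/// Hypothetical container for the three metadata values.
#[derive(Debug)]
pub struct LineMeta {
    pub bbox: BBox,
    pub baseline: f32,
    pub median_font_size: f32,
}

/// Returns None for an empty cluster, since no metadata is defined for it.
pub fn line_meta_sketch(spans: &[Span]) -> Option<LineMeta> {
    let first = spans.first()?;

    // Union bbox: smallest min corner and largest max corner over all spans.
    let bbox = spans.iter().skip(1).fold(first.bbox, |acc, s| BBox {
        x0: acc.x0.min(s.bbox.x0),
        y0: acc.y0.min(s.bbox.y0),
        x1: acc.x1.max(s.bbox.x1),
        y1: acc.y1.max(s.bbox.y1),
    });

    // Line baseline: arithmetic mean of the member span baselines.
    let baseline = spans.iter().map(|s| s.baseline).sum::<f32>() / spans.len() as f32;

    // Median font size: middle of the sorted sizes (mean of two middles if even).
    let mut sizes: Vec<f32> = spans.iter().map(|s| s.font_size()).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    let mid = sizes.len() / 2;
    let median_font_size = if sizes.len() % 2 == 0 {
        (sizes[mid - 1] + sizes[mid]) / 2.0
    } else {
        sizes[mid]
    };

    Some(LineMeta { bbox, baseline, median_font_size })
}
```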

Verification

  • cargo test -p pdftract-core --lib layout::line: 32 tests passed
  • cargo check -p pdftract-core --lib: Compiles successfully
  • cargo fmt -p pdftract-core: Code formatted

References

  • Plan: Phase 4.2 Algorithm step 2 (line 1667)
  • Bead: pdftract-6bwq4