Implement cluster_spans_into_lines for Phase 4.2 line formation.

Groups spans into lines by baseline proximity using threshold 0.5 * median_font_size.

- Add HasFontSize trait for types with font_size
- Implement cluster_spans_into_lines function
  - Compute baseline for each span
  - Sort by baseline ASC
  - Sweep and cluster within threshold
  - Emit Line per cluster
  - Sort spans by x0 within each line
- Add finalize_line_cluster helper
- Export new items from layout module

Tests: All 11 acceptance criteria tests pass
- Spans baselines 100, 100.5, 105 with median 12: one line
- Spans baselines 100, 110 with median 12: two lines
- Superscript stays on same line as base text
- Empty input produces empty output
- Threshold is 0.5 * median_font_size (not hardcoded)

Closes: pdftract-6bwq4
pdftract-6bwq4: Baseline clustering algorithm implementation
Summary
Implemented cluster_spans_into_lines function for Phase 4.2 line formation. The function groups spans into lines by baseline proximity using a threshold of 0.5 * median_font_size.
Changes Made
crates/pdftract-core/src/layout/line.rs
- Added `HasFontSize` trait for types that have font_size
- Implemented `cluster_spans_into_lines<S>(spans: Vec<S>, median_font_size: f32) -> Vec<Line<S>>`
  - Computes baseline for each span using existing `compute_baseline` function
  - Sorts spans by baseline ASC
  - Sweeps through spans, clustering those within threshold (0.5 * median_font_size)
  - Emits one `Line` per cluster
  - Sorts spans by x0 within each line (left-to-right)
  - Computes line metadata: union bbox, average baseline, median font size
- Added `finalize_line_cluster` helper function
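The steps listed above can be sketched roughly as follows. This is a simplified illustration and not the code in `line.rs`: the `Span` and `Line` types, the `finalize` helper, and the `cluster_sketch` name are hypothetical, the baseline is assumed to be precomputed on each span instead of coming from `compute_baseline`, the bbox is reduced to its horizontal extent, and the sweep is assumed to compare each span against the previous span in baseline order (the notes do not say whether the real comparison uses the previous span, the cluster's first span, or a running average).

```rust
/// Hypothetical span type for this sketch (not the real pdftract-core type).
#[derive(Debug, Clone, PartialEq)]
struct Span {
    x0: f32,
    x1: f32,
    baseline: f32, // assumed precomputed; the real code calls compute_baseline
    font_size: f32,
}

/// Hypothetical line type: member spans plus derived metadata.
#[derive(Debug)]
struct Line {
    spans: Vec<Span>,
    baseline: f32,         // average of member baselines
    median_font_size: f32, // median of member font sizes
    x0: f32,               // horizontal extent of the union bbox
    x1: f32,
}

/// Turn one non-empty cluster into a Line: sort left-to-right, then derive metadata.
fn finalize(mut spans: Vec<Span>) -> Line {
    spans.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    let baseline = spans.iter().map(|s| s.baseline).sum::<f32>() / spans.len() as f32;

    let mut sizes: Vec<f32> = spans.iter().map(|s| s.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    let mid = sizes.len() / 2;
    let median_font_size = if sizes.len() % 2 == 0 {
        (sizes[mid - 1] + sizes[mid]) / 2.0
    } else {
        sizes[mid]
    };

    let x0 = spans.iter().map(|s| s.x0).fold(f32::INFINITY, f32::min);
    let x1 = spans.iter().map(|s| s.x1).fold(f32::NEG_INFINITY, f32::max);
    Line { spans, baseline, median_font_size, x0, x1 }
}

/// Sort by baseline ascending, sweep, and start a new cluster whenever the gap
/// to the previous span exceeds 0.5 * median_font_size.
fn cluster_sketch(mut spans: Vec<Span>, median_font_size: f32) -> Vec<Line> {
    let threshold = 0.5 * median_font_size; // derived from the input, never hardcoded
    spans.sort_by(|a, b| a.baseline.total_cmp(&b.baseline));

    let mut lines = Vec::new();
    let mut current: Vec<Span> = Vec::new();
    for span in spans {
        // Assumption: the gap is measured against the previous span in the sweep.
        let starts_new_line = current
            .last()
            .map_or(false, |prev| span.baseline - prev.baseline > threshold);
        if starts_new_line {
            lines.push(finalize(std::mem::take(&mut current)));
        }
        current.push(span);
    }
    if !current.is_empty() {
        lines.push(finalize(current));
    }
    lines
}
```

With this sketch, spans at baselines 100, 100.5, and 105 with a median font size of 12 end up in one line, while spans at 100 and 110 end up in two, which matches the acceptance criteria described in these notes.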
crates/pdftract-core/src/layout/mod.rs
- Exported `HasFontSize` trait and `cluster_spans_into_lines` function
Tests Added
All acceptance criteria tests pass:
- `test_cluster_spans_baselines_100_100_5_105_median_12_one_line` - Spans baselines 100, 100.5, 105 with median 12 (threshold 6): all one line. PASS
- `test_cluster_spans_baselines_100_110_median_12_two_lines` - Same with 100, 110: 2 lines (delta 10 > 6). PASS
- `test_cluster_spans_superscript_stays_on_same_line` - Superscript at 105, line baseline 100, font 12: SAME line. PASS
- `test_cluster_spans_empty_input_empty_output` - Empty input: empty output. PASS
- `test_cluster_spans_threshold_is_0_5_times_median_font_size` - INV: threshold = 0.5 * median_font_size; do NOT hardcode. PASS
- `test_cluster_spans_sorted_by_x0_within_line` - Spans within a line sorted by x0. PASS
- `test_cluster_spans_two_column_at_same_y_one_line` - Two-column at same y: cluster into one Line. PASS
- `test_cluster_spans_union_bbox_computed_correctly` - Union bbox computed correctly. PASS
- `test_cluster_spans_baseline_computed_as_average` - Baseline is average of member span baselines. PASS
- `test_cluster_spans_median_font_size_computed` - Median font size computed from line spans. PASS
- `test_cluster_spans_single_span_single_line` - Single span produces single line. PASS
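To make the threshold invariant concrete, here is a small sketch of the arithmetic. The helper names `line_threshold` and `within_threshold` are hypothetical and do not come from the crate, and treating a gap exactly equal to the threshold as "same line" is an assumption; the notes only state the formula.

```rust
/// Hypothetical helper: clustering threshold derived from the median font size.
fn line_threshold(median_font_size: f32) -> f32 {
    0.5 * median_font_size
}

/// Hypothetical helper: true when two baselines are close enough to share a line.
/// Assumption: the comparison is inclusive (gap <= threshold).
fn within_threshold(a: f32, b: f32, median_font_size: f32) -> bool {
    (a - b).abs() <= line_threshold(median_font_size)
}
```

Under these assumptions, a baseline gap of 5 (100 vs. 105) is within the threshold of 6 at median font size 12, but the same gap exceeds the threshold of 4 at median font size 8. That dependence on the input is what a hardcoded threshold would fail to capture.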
Verification
- `cargo test -p pdftract-core --lib layout::line`: 32 tests passed
- `cargo check -p pdftract-core --lib`: Compiles successfully
- `cargo fmt -p pdftract-core`: Code formatted
References
- Plan: Phase 4.2 Algorithm step 2 (line 1667)
- Bead: pdftract-6bwq4