Implement confirm_columns function that partitions the page into candidate columns (regions between consecutive gaps + before-first + after-last), counts unique lines whose first span's x0 falls within each candidate's x-range, and promotes candidates with line_count >= 3 to confirmed columns.

Supporting code:
- ColumnGap struct with lo/hi bounds, width(), midpoint()
- detect_column_gaps function for zero-coverage region detection
- HasFirstSpan trait for first span bbox access
- CandidateColumn struct for tracking x_range and line_count

All 49 column tests pass, including all acceptance criteria.

Bead: pdftract-2rkc1
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-2rkc1: Column confirmation verification
Work completed
The confirm_columns function has been implemented in /home/coding/pdftract/crates/pdftract-core/src/layout/columns.rs (lines 252-332).
Implementation details
The implementation follows the algorithm specified in the plan:
- No gaps case (lines 257-274): The entire page is one candidate column. Counts lines whose first span's x0 falls within the page bounds and returns a single column if there are >= 3 lines.
- Candidate column construction (lines 276-308):
  - Before-first gap: (0, gap_0.lo)
  - Between consecutive gaps: (gap_i.hi + 1, gap_(i+1).lo)
  - After-last gap: (gap_last.hi + 1, page_width)
- Line counting (lines 310-321): For each line, gets the first span's bbox via the HasFirstSpan trait, checks whether x0 falls within a candidate column's range, and counts unique lines.
- Column promotion (lines 324-329): Filters candidates with line_count >= 3 to confirmed columns and reassigns indices left-to-right.
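The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in columns.rs: the field types (f32), the inclusive range check, the input shape (one Option<f32> per line holding the first span's x0) and the MIN_LINES constant name are all assumptions made for the example. The no-gaps case falls out of the same loop here instead of being a separate branch.

```rust
/// A zero-coverage gap on the x axis (assumed shape; the real ColumnGap may differ).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnGap {
    pub lo: f32,
    pub hi: f32,
}

/// A confirmed column: left-to-right index plus its x range.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub index: usize,
    pub x_range: (f32, f32),
}

/// Minimum number of lines a candidate needs to be promoted (hypothetical name).
const MIN_LINES: usize = 3;

/// `first_x0s` has one entry per line: the x0 of its first span,
/// or `None` for a line with no spans. Because each line contributes
/// at most one entry, every line is counted at most once.
/// `gaps` is assumed to be sorted left to right and non-overlapping.
pub fn confirm_columns(
    first_x0s: &[Option<f32>],
    gaps: &[ColumnGap],
    page_width: f32,
) -> Vec<Column> {
    // Candidate ranges: before-first, between consecutive gaps, after-last.
    // With no gaps this yields the single candidate (0, page_width).
    let mut candidates: Vec<(f32, f32)> = Vec::with_capacity(gaps.len() + 1);
    let mut start = 0.0_f32;
    for gap in gaps {
        candidates.push((start, gap.lo));
        start = gap.hi + 1.0;
    }
    candidates.push((start, page_width));

    candidates
        .into_iter()
        // Keep candidates whose range contains the first-span x0 of >= 3 lines.
        // Lines whose x0 lies inside a gap match no candidate and stay unassigned.
        .filter(|&(lo, hi)| {
            let line_count = first_x0s
                .iter()
                .flatten()
                .filter(|&&x0| x0 >= lo && x0 <= hi)
                .count();
            line_count >= MIN_LINES
        })
        // Indices are assigned after filtering, so they stay contiguous.
        .enumerate()
        .map(|(index, x_range)| Column { index, x_range })
        .collect()
}
```

With a single gap (250, 300) on a 612-wide page, 30 lines starting at x0 = 50 and 30 lines starting at x0 = 320 confirm both candidates. If the right-hand side has only 2 lines, only the left candidate is promoted and it receives index 0.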
Supporting code added
- ColumnGap struct (lines 89-116): Represents a gap in the x0 histogram with lo/hi bounds, width(), and midpoint() methods.
- detect_column_gaps function (lines 156-201): Finds zero-coverage regions >= 3% of page width.
- HasFirstSpan trait (lines 334-343): Trait for accessing first span's bbox.
- CandidateColumn struct (lines 203-213): Internal tracking of x_range and line_count.
- Column struct (lines 372-396): Confirmed column with index and x_range.
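As a rough illustration of the gap type and the zero-coverage scan, the sketch below sweeps over span x-intervals instead of a histogram. That substitution, the (x0, x1) tuple input, the f32 fields, the MIN_GAP_FRACTION name, and the choice to report only interior gaps (page margins are ignored) are all assumptions for the example, not details taken from columns.rs.

```rust
/// A zero-coverage region on the x axis (assumed field types).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnGap {
    pub lo: f32,
    pub hi: f32,
}

impl ColumnGap {
    /// Horizontal extent of the gap.
    pub fn width(&self) -> f32 {
        self.hi - self.lo
    }

    /// Centre of the gap.
    pub fn midpoint(&self) -> f32 {
        (self.lo + self.hi) / 2.0
    }
}

/// Gaps narrower than this fraction of the page width are ignored
/// (hypothetical constant name; the 3% value comes from the note above).
const MIN_GAP_FRACTION: f32 = 0.03;

/// `spans` holds the (x0, x1) extent of every span on the page.
/// Returns the uncovered regions between covered regions, left to right.
pub fn detect_column_gaps(spans: &[(f32, f32)], page_width: f32) -> Vec<ColumnGap> {
    let min_width = MIN_GAP_FRACTION * page_width;

    // Sort by left edge so a single sweep can track the covered extent.
    let mut sorted: Vec<(f32, f32)> = spans.to_vec();
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut gaps = Vec::new();
    // Right edge of everything covered so far; None until the first span.
    let mut covered_to: Option<f32> = None;
    for (x0, x1) in sorted {
        match covered_to {
            Some(end) => {
                // Nothing covers [end, x0]; keep it if it is wide enough.
                if x0 - end >= min_width {
                    gaps.push(ColumnGap { lo: end, hi: x0 });
                }
                covered_to = Some(end.max(x1));
            }
            None => covered_to = Some(x1),
        }
    }
    gaps
}
```

On a 612-wide page the threshold is 18.36, so spans covering 72 to 290 and 320 to 540 produce one gap from 290 to 320, while spans covering 72 to 290 and 300 to 540 produce none.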
Test results
All 49 column tests pass, including all acceptance criteria:
| Acceptance criteria | Test | Result |
|---|---|---|
| 2-column page with 30 lines each: both confirmed | test_confirm_columns_two_column_both_confirmed | PASS |
| 2-column page with 30 lines + 2 lines: only 30-line column confirmed | test_confirm_columns_two_column_one_confirmed | PASS |
| Single column: 1 candidate -> confirmed | test_confirm_columns_single_column_confirmed | PASS |
| Empty page: 0 confirmed | test_confirm_columns_empty_page | PASS |
Additional edge cases tested:
- Exactly 3 lines (boundary case): PASS
- Leading/trailing gaps: PASS
- Lines in gap unassigned: PASS
- Lines with no spans: PASS
- Three-column layouts: PASS
Invariants verified
- INV: 3-line minimum: The filter condition c.line_count >= 3 is enforced at line 326.
- Lines in gaps remain unassigned: Lines whose first span's x0 falls in a gap region are not counted for any candidate column.
- "First span" = leftmost post-sort: The HasFirstSpan trait provides the first (leftmost) span's bbox; within-line sorting is assumed to be done before calling confirm_columns.
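The last invariant depends on the caller sorting spans before column confirmation. The sketch below shows why, using a hypothetical BBox and Line type, a hypothetical trait method name (first_span_bbox), and a hypothetical sort_spans helper. Only the trait name HasFirstSpan comes from the note above; its real signature may differ.

```rust
/// Axis-aligned bounding box (hypothetical type for this example).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Access to the bbox of a line's first span.
pub trait HasFirstSpan {
    /// Returns `None` for a line with no spans (assumed method name).
    fn first_span_bbox(&self) -> Option<BBox>;
}

/// A text line represented only by its span boxes (hypothetical type).
pub struct Line {
    pub spans: Vec<BBox>,
}

impl Line {
    /// Order spans left to right so that "first" means "leftmost".
    pub fn sort_spans(&mut self) {
        self.spans.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
}

impl HasFirstSpan for Line {
    fn first_span_bbox(&self) -> Option<BBox> {
        // Returns whatever span is stored first; it is only the
        // leftmost span if sort_spans has already been called.
        self.spans.first().copied()
    }
}
```

If a line stores its spans in extraction order, the unsorted first span can sit in a different candidate column from the leftmost one, which would shift the line count to the wrong column.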
Command used
cargo nextest run -p pdftract-core 'columns::'
Result: 49 passed, 2382 skipped (0 failed)