pdftract/notes/pdftract-2rkc1.md
jedarden fda17d4d77 feat(pdftract-2rkc1): implement column confirmation with >= 3 line threshold
Implement confirm_columns function that partitions page into candidate
columns (regions between consecutive gaps + before-first + after-last),
counts unique lines whose first span's x0 falls within each candidate's
x-range, and promotes candidates with line_count >= 3 to confirmed columns.

Supporting code:
- ColumnGap struct with lo/hi bounds, width(), midpoint()
- detect_column_gaps function for zero-coverage region detection
- HasFirstSpan trait for first span bbox access
- CandidateColumn struct for tracking x_range and line_count

All 49 column tests pass, including all acceptance criteria.

Bead: pdftract-2rkc1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:09:01 -04:00


# pdftract-2rkc1: Column confirmation verification
## Work completed
The `confirm_columns` function has been implemented in `/home/coding/pdftract/crates/pdftract-core/src/layout/columns.rs` (lines 252-332).
## Implementation details
The implementation follows the algorithm specified in the plan:
1. **No gaps case** (lines 257-274): Entire page is one candidate column. Counts lines whose first span's x0 falls within page bounds. Returns single column if >= 3 lines.
2. **Candidate column construction** (lines 276-308):
   - Before-first gap: `(0, gap_0.lo)`
   - Between consecutive gaps: `(gap_i.hi + 1, gap_{i+1}.lo)`
   - After-last gap: `(gap_last.hi + 1, page_width)`
3. **Line counting** (lines 310-321): For each line, gets first span's bbox via `HasFirstSpan` trait, checks if x0 falls within candidate column range, counts unique lines.
4. **Column promotion** (lines 324-329): Filters candidates with `line_count >= 3` to confirmed columns, reassigns indices left-to-right.
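
The four steps above can be sketched as follows. This is a simplified illustration, not the code in `columns.rs`: the signature is hypothetical, gaps are passed as plain `(lo, hi)` tuples instead of `ColumnGap`, each line is reduced to the x0 of its first span (`None` for a line with no spans) instead of going through `HasFirstSpan`, and candidate ranges are assumed to be half-open `[lo, hi)`.

```rust
/// Minimum number of lines a candidate needs to be promoted.
const MIN_LINES: usize = 3;

/// Confirmed column; `index` is assigned left-to-right after filtering.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub index: usize,
    pub x_range: (f32, f32),
}

/// Hypothetical simplified signature (assumption, not the real API).
/// `first_x0[i]` is the x0 of line i's first span, or `None` if the line has no spans.
/// `gaps` are `(lo, hi)` pairs sorted left-to-right.
pub fn confirm_columns(
    first_x0: &[Option<f32>],
    gaps: &[(f32, f32)],
    page_width: f32,
) -> Vec<Column> {
    // Candidate construction: before-first, between consecutive gaps, after-last.
    // With no gaps this yields the single candidate (0, page_width).
    let mut ranges = Vec::with_capacity(gaps.len() + 1);
    let mut start = 0.0_f32;
    for &(lo, hi) in gaps {
        ranges.push((start, lo));
        start = hi + 1.0;
    }
    ranges.push((start, page_width));

    // Line counting: each line contributes at most once; a line whose
    // first-span x0 falls in a gap matches no candidate and stays unassigned.
    let mut counts = vec![0usize; ranges.len()];
    for x0 in first_x0.iter().flatten() {
        if let Some(i) = ranges.iter().position(|&(lo, hi)| *x0 >= lo && *x0 < hi) {
            counts[i] += 1;
        }
    }

    // Promotion: keep candidates with >= MIN_LINES lines, then reindex.
    ranges
        .into_iter()
        .zip(counts)
        .filter(|&(_, n)| n >= MIN_LINES)
        .enumerate()
        .map(|(index, (x_range, _))| Column { index, x_range })
        .collect()
}
```

Calling it with 30 lines at x0 = 72, 2 lines at x0 = 330, and one gap `(280, 320)` on a 612-wide page would confirm only the left column, matching the "30 lines + 2 lines" acceptance criterion.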
## Supporting code added
- `ColumnGap` struct (lines 89-116): Represents a gap in the x0 histogram with lo/hi bounds, width(), and midpoint() methods.
- `detect_column_gaps` function (lines 156-201): Finds zero-coverage regions whose width is >= 3% of the page width.
- `HasFirstSpan` trait (lines 334-343): Trait for accessing first span's bbox.
- `CandidateColumn` struct (lines 203-213): Internal tracking of x_range and line_count.
- `Column` struct (lines 372-396): Confirmed column with index and x_range.
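
For orientation, here is a minimal sketch of what `ColumnGap` and `detect_column_gaps` might look like. The field types, the definitions of `width()` and `midpoint()`, the histogram representation (a slice of per-bin coverage counts spanning the page evenly), and the function signature are all assumptions for illustration; only the names and the 3% threshold come from this note.

```rust
/// Minimum gap width as a fraction of page width (3%, per the note).
const MIN_GAP_FRACTION: f32 = 0.03;

/// A zero-coverage region with lo/hi bounds (field types are assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnGap {
    pub lo: f32,
    pub hi: f32,
}

impl ColumnGap {
    /// Horizontal extent of the gap (assumed to be `hi - lo`).
    pub fn width(&self) -> f32 {
        self.hi - self.lo
    }

    /// Centre of the gap.
    pub fn midpoint(&self) -> f32 {
        (self.lo + self.hi) / 2.0
    }
}

/// Hypothetical signature: `coverage[i]` is the number of spans covering bin i,
/// with the bins dividing `[0, page_width)` evenly.
pub fn detect_column_gaps(coverage: &[u32], page_width: f32) -> Vec<ColumnGap> {
    if coverage.is_empty() || page_width <= 0.0 {
        return Vec::new();
    }
    let bin_w = page_width / coverage.len() as f32;
    let min_width = MIN_GAP_FRACTION * page_width;

    let mut gaps = Vec::new();
    let mut run_start: Option<usize> = None;
    // A non-zero sentinel bin is chained on so a trailing zero run is closed.
    for (i, &c) in coverage.iter().chain(std::iter::once(&1u32)).enumerate() {
        match (c == 0, run_start) {
            // A zero bin opens a new run.
            (true, None) => run_start = Some(i),
            // A non-zero bin closes the current run; keep it if wide enough.
            (false, Some(s)) => {
                let gap = ColumnGap { lo: s as f32 * bin_w, hi: i as f32 * bin_w };
                if gap.width() >= min_width {
                    gaps.push(gap);
                }
                run_start = None;
            }
            _ => {}
        }
    }
    gaps
}
```

With 100 unit-wide bins on a 100-wide page, a zero run over bins 48..52 is kept (width 4) while a zero run over bins 10..12 is dropped (width 2, below the 3% threshold).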
## Test results
All 49 column tests pass, including all acceptance criteria:
| Acceptance criteria | Test | Result |
|-------------------|------|--------|
| 2-column page with 30 lines each: both confirmed | `test_confirm_columns_two_column_both_confirmed` | PASS |
| 2-column page with 30 lines + 2 lines: only 30-line column confirmed | `test_confirm_columns_two_column_one_confirmed` | PASS |
| Single column: 1 candidate -> confirmed | `test_confirm_columns_single_column_confirmed` | PASS |
| Empty page: 0 confirmed | `test_confirm_columns_empty_page` | PASS |

Additional edge cases tested:
- Exactly 3 lines (boundary case): PASS
- Leading/trailing gaps: PASS
- Lines in gap unassigned: PASS
- Lines with no spans: PASS
- Three-column layouts: PASS
## Invariants verified
- **INV: 3-line minimum**: The filter condition `c.line_count >= 3` is enforced at line 326.
- **Lines in gaps remain unassigned**: Lines whose first span's x0 falls in a gap region are not counted for any candidate column.
- **"First span" = leftmost post-sort**: The `HasFirstSpan` trait provides the first (leftmost) span's bbox; within-line sorting is assumed to be done before calling `confirm_columns`.
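
The "first span = leftmost post-sort" assumption can be illustrated with a small sketch. The `BBox` and `Line` types, the `first_span_bbox` method name, and the `sort_spans` helper are hypothetical stand-ins; the note only states that `HasFirstSpan` exposes the first span's bbox and that sorting happens before `confirm_columns` is called.

```rust
/// Hypothetical bounding box; only `x0` matters for column confirmation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Sketch of the trait: access to the first span's bbox, if any.
/// The method name is an assumption.
pub trait HasFirstSpan {
    fn first_span_bbox(&self) -> Option<BBox>;
}

/// Hypothetical line type holding one bbox per span.
pub struct Line {
    pub spans: Vec<BBox>,
}

impl Line {
    /// Within-line sort by x0; the caller is expected to run this
    /// before handing lines to column confirmation.
    pub fn sort_spans(&mut self) {
        self.spans.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
}

impl HasFirstSpan for Line {
    /// Returns the first span as stored; it is the leftmost span
    /// only if `sort_spans` has already been called.
    fn first_span_bbox(&self) -> Option<BBox> {
        self.spans.first().copied()
    }
}
```

If the spans are not sorted first, `first_span_bbox` returns whichever span happens to be stored first, which could place the line in the wrong candidate column. A line with no spans yields `None` and is counted nowhere.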
## Command used
```bash
cargo nextest run -p pdftract-core 'columns::'
```
Result: 49 passed, 2382 skipped (0 failed)