Blocker identified:
- Extraction pipeline (extract.rs) doesn't use Phase 4 layout pipeline
- Column detection functions never called in production
- SpanJson.column hardcoded to None (lines 1059, 1916)
- No end-to-end tests for acceptance criteria

Span struct HAS column field (line 179) but extraction doesn't use it. Coordinator CANNOT CLOSE - sub-phase not end-to-end functional.
Verification Note: pdftract-63ka2 (Updated 2026-05-28)
Bead
Phase 4.3: Column Detection (coordinator)
Current Status: BLOCKER - DO NOT CLOSE
Children Status
All 4 children are CLOSED:
- pdftract-56vwd - x0 histogram builder - CLOSED ✓
- pdftract-14w0w - Gap detection - CLOSED ✓
- pdftract-2rkc1 - Column confirmation - CLOSED ✓
- pdftract-64j83 - Column label assignment - CLOSED ✓
Implementation Status
Column detection functions are fully implemented in crates/pdftract-core/src/layout/columns.rs:
- build_x0_histogram() - 49 unit tests pass
- detect_column_gaps() - Part of the 49 tests
- confirm_columns() - Part of the 49 tests
- assign_columns_to_spans() - Part of the 49 tests
- assign_columns_to_lines() - Part of the 49 tests
Integration Status: BLOCKER (As of 2026-05-28)
Column detection is NOT integrated into the main extraction pipeline:
- Main Span struct HAS column field but it's never used
  - File: crates/pdftract-core/src/span/mod.rs:179
  - The Span struct DOES have a column: Option<u32> field (updated since initial note)
  - However, the extraction pipeline never assigns column values
- Extraction pipeline does not call column detection
  - File: crates/pdftract-core/src/extract.rs
  - Column detection functions are never invoked (grep found no matches)
  - SpanJson::column is hardcoded to None (lines 1059, 1916)
  - The extraction pipeline doesn't use cluster_spans_into_lines or column detection at all
- No end-to-end tests for column detection
  - No fixture tests for three-column papers
  - No fixture tests for full-width headings above two-column body
  - No fixture tests for single-column pages
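The hardcoded None in SpanJson::column is the narrowest part of the gap. A minimal sketch of what passing the value through could look like is shown here. The two structs are hypothetical stand-ins that carry only the fields needed for illustration; the real Span and SpanJson have other fields, and the real constructor may not be a From impl at all.

```rust
/// Hypothetical stand-in for the layout-level span (not the real struct).
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    /// Column label as computed by column detection; None if unassigned.
    pub column: Option<u32>,
}

/// Hypothetical stand-in for the serialized span (not the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct SpanJson {
    pub text: String,
    pub column: Option<u32>,
}

impl From<&Span> for SpanJson {
    fn from(span: &Span) -> Self {
        SpanJson {
            text: span.text.clone(),
            // Copy the computed label instead of writing a literal None.
            column: span.column,
        }
    }
}
```

With a conversion of this shape, a span labelled Some(1) by column detection serializes with column Some(1), and an unlabelled span still serializes with None.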
Acceptance Criteria
- [PASS] All 4 children closed
- [FAIL] Three-column academic paper: three distinct columns detected - NOT VERIFIED
- [FAIL] Full-width heading above two-column body: heading spans not assigned a column - NOT VERIFIED
- [FAIL] Single-column page: no false column splits - NOT VERIFIED
Blockers
The extraction pipeline (extract.rs) needs to be refactored to use the Phase 4 layout pipeline:
- Add Phase 4 pipeline integration
  - File: crates/pdftract-core/src/extract.rs
  - Currently the pipeline doesn't use line formation or column detection
  - Need to add: glyph → span → line → column detection → block formation
  - Current pipeline goes directly from glyphs to spans to blocks without line/column phases
- Implement column detection call chain
  - After line formation (Phase 4.2), call:
    - build_x0_histogram(spans, page_width)
    - detect_column_gaps(&hist, page_width)
    - confirm_columns(&gaps, page_width, &lines)
    - assign_columns_to_spans(spans, &columns)
    - assign_columns_to_lines(lines)
  - Pass the column value to SpanJson constructor
- Add end-to-end tests
  - Create fixture for three-column academic paper
  - Create fixture for two-column page with full-width heading
  - Create fixture for single-column page
  - Verify column detection produces correct labels
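To make the intended data flow concrete, here is a simplified, self-contained sketch of a histogram, gap, and label chain. It is not the code in layout/columns.rs. The names x0_histogram, column_starts, label_spans and detect_and_label are hypothetical, as are the Span fields, the 64-bin resolution and the 10% minimum gap. The confirmation step and line-level assignment are left out.

```rust
/// Hypothetical stand-in for the real Span; only what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub x0: f64,
    pub x1: f64,
    pub column: Option<u32>,
}

/// Count span left edges (x0) per fixed-width bin across the page.
pub fn x0_histogram(spans: &[Span], page_width: f64, bins: usize) -> Vec<u32> {
    assert!(bins > 0 && page_width > 0.0);
    let mut hist = vec![0u32; bins];
    for s in spans {
        let idx = ((s.x0 / page_width) * bins as f64).floor();
        // Clamp out-of-page coordinates into the first or last bin.
        let idx = (idx.max(0.0) as usize).min(bins - 1);
        hist[idx] += 1;
    }
    hist
}

/// Return the x position where each column after the first begins.
/// A new column starts at an occupied bin that follows a run of empty
/// bins at least `min_gap` wide.
pub fn column_starts(hist: &[u32], page_width: f64, min_gap: f64) -> Vec<f64> {
    let bin_w = page_width / hist.len() as f64;
    let mut starts = Vec::new();
    let mut last_occupied: Option<usize> = None;
    for (i, &count) in hist.iter().enumerate() {
        if count == 0 {
            continue;
        }
        if let Some(prev) = last_occupied {
            let empty_bins = i - prev - 1;
            if empty_bins as f64 * bin_w >= min_gap {
                starts.push(i as f64 * bin_w);
            }
        }
        last_occupied = Some(i);
    }
    starts
}

/// Label each span with the index of the column its x0 falls in.
/// A span that extends past the start of the next column (for example a
/// full-width heading) is left unassigned.
pub fn label_spans(spans: &mut [Span], starts: &[f64]) {
    for s in spans.iter_mut() {
        let idx = starts.iter().filter(|&&b| b <= s.x0).count();
        let crosses = starts.get(idx).map_or(false, |&next| s.x1 > next);
        s.column = if crosses { None } else { Some(idx as u32) };
    }
}

/// Run the whole chain for one page (assumed: 64 bins, 10% minimum gap).
pub fn detect_and_label(spans: &mut [Span], page_width: f64) {
    let hist = x0_histogram(spans, page_width, 64);
    let starts = column_starts(&hist, page_width, 0.10 * page_width);
    label_spans(spans, &starts);
}
```

In this sketch a single-column page yields no column starts, so every span is labelled Some(0). Whether the real pipeline uses Some(0) or None in that case is not stated in this note. The three acceptance scenarios map directly onto small synthetic inputs to detect_and_label, which may be a useful shape for the fixture tests as well.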
Recommendation
DO NOT CLOSE this coordinator bead. The sub-phase implementation is incomplete because:
- The extraction pipeline doesn't use the Phase 4 layout pipeline
- Column detection functions are never called in production
- No end-to-end verification of acceptance criteria
The child beads being closed only means the individual functions are implemented and unit-tested. The coordinator must ensure the sub-phase works end-to-end, which requires integration into the extraction pipeline.
Next Steps
This coordinator bead requires significant extraction pipeline refactoring:
- Integrate Phase 4.2 line formation into extract.rs
- Add Phase 4.3 column detection after line formation
- Update SpanJson to use computed column values
- Add fixture tests for acceptance criteria verification