pdftract/notes/pdftract-63ka2.md
jedarden 18af6bb01d docs(pdftract-63ka2): update verification note - extraction pipeline missing Phase 4 integration
Blocker identified:
- Extraction pipeline (extract.rs) doesn't use Phase 4 layout pipeline
- Column detection functions never called in production
- SpanJson.column hardcoded to None (lines 1059, 1916)
- No end-to-end tests for acceptance criteria

Span struct HAS column field (line 179) but extraction doesn't use it.
Coordinator CANNOT CLOSE - sub-phase not end-to-end functional.
2026-05-28 01:47:50 -04:00


# Verification Note: pdftract-63ka2 (Updated 2026-05-28)
## Bead
Phase 4.3: Column Detection (coordinator)
## Current Status: BLOCKER - DO NOT CLOSE
### Children Status
All 4 children are CLOSED:
- `pdftract-56vwd` - x0 histogram builder - CLOSED ✓
- `pdftract-14w0w` - Gap detection - CLOSED ✓
- `pdftract-2rkc1` - Column confirmation - CLOSED ✓
- `pdftract-64j83` - Column label assignment - CLOSED ✓
### Implementation Status
Column detection functions are fully implemented in `crates/pdftract-core/src/layout/columns.rs`, and the module's 49 unit tests pass. Those tests cover:
- `build_x0_histogram()`
- `detect_column_gaps()`
- `confirm_columns()`
- `assign_columns_to_spans()`
- `assign_columns_to_lines()`
### Integration Status: BLOCKER (As of 2026-05-28)
Column detection is NOT integrated into the main extraction pipeline:
1. **Main `Span` struct HAS column field but it's never used**
   - File: `crates/pdftract-core/src/span/mod.rs:179`
   - The `Span` struct DOES have `column: Option<u32>` field (updated since initial note)
   - However, the extraction pipeline never assigns column values
2. **Extraction pipeline does not call column detection**
   - File: `crates/pdftract-core/src/extract.rs`
   - Column detection functions are never invoked (grep found no matches)
   - `SpanJson::column` is hardcoded to `None` (lines 1059, 1916)
   - The extraction pipeline doesn't use `cluster_spans_into_lines` or column detection at all
3. **No end-to-end tests for column detection**
   - No fixture tests for three-column papers
   - No fixture tests for full-width headings above two-column body
   - No fixture tests for single-column pages
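
The gap described in items 1 and 2 can be illustrated with a minimal sketch. The two structs below are hypothetical, trimmed-down stand-ins (the real `Span` and `SpanJson` carry more fields, and the real construction sites in `extract.rs` may not use a `From` impl at all). The only point is that the JSON value would be copied from the span rather than written as a literal `None`.

```rust
// Hypothetical, trimmed-down stand-ins for the real types in pdftract-core.
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    pub column: Option<u32>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SpanJson {
    pub text: String,
    pub column: Option<u32>,
}

impl From<&Span> for SpanJson {
    fn from(span: &Span) -> Self {
        SpanJson {
            text: span.text.clone(),
            // Carry the detected label through instead of hardcoding `None`.
            column: span.column,
        }
    }
}
```

With this shape, a span that column detection never labelled still serializes with `column: None`, so the output stays unchanged for pages where detection is skipped.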
### Acceptance Criteria
- [PASS] All 4 children closed
- [FAIL] Three-column academic paper: three distinct columns detected - NOT VERIFIED
- [FAIL] Full-width heading above two-column body: heading spans not assigned a column - NOT VERIFIED
- [FAIL] Single-column page: no false column splits - NOT VERIFIED
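
One way to make the three unverified criteria checkable is to phrase them as predicates over the column labels a page produces. The sketch below is an assumption about how that could look: the helper names are hypothetical, and it works on plain `Option<u32>` labels rather than on the real `Span` type.

```rust
use std::collections::BTreeSet;

/// Count the distinct column labels among spans that were assigned one.
pub fn distinct_columns(labels: &[Option<u32>]) -> usize {
    labels.iter().flatten().collect::<BTreeSet<_>>().len()
}

/// Three-column paper: exactly three distinct labels appear.
pub fn is_three_column(labels: &[Option<u32>]) -> bool {
    distinct_columns(labels) == 3
}

/// Full-width heading above a two-column body: heading spans carry no label,
/// while the body spans use exactly two labels.
pub fn heading_is_unassigned(heading: &[Option<u32>], body: &[Option<u32>]) -> bool {
    heading.iter().all(|c| c.is_none()) && distinct_columns(body) == 2
}

/// Single-column page: at most one label appears, so no false split occurred.
pub fn has_no_false_splits(labels: &[Option<u32>]) -> bool {
    distinct_columns(labels) <= 1
}
```

Fixture tests could collect the `column` field of every extracted span and pass the resulting slice to these predicates.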
## Blockers
The extraction pipeline (`extract.rs`) needs to be refactored to use the Phase 4 layout pipeline:
1. **Add Phase 4 pipeline integration**
   - File: `crates/pdftract-core/src/extract.rs`
   - The current pipeline goes directly from glyphs to spans to blocks, with no line formation or column detection phase
   - Needed: glyph → span → line → column detection → block formation
2. **Implement column detection call chain**
   - After line formation (Phase 4.2), call:
     - `build_x0_histogram(spans, page_width)`
     - `detect_column_gaps(&hist, page_width)`
     - `confirm_columns(&gaps, page_width, &lines)`
     - `assign_columns_to_spans(spans, &columns)`
     - `assign_columns_to_lines(lines)`
   - Pass the column value to the `SpanJson` constructor
3. **Add end-to-end tests**
   - Create fixture for three-column academic paper
   - Create fixture for two-column page with full-width heading
   - Create fixture for single-column page
   - Verify column detection produces correct labels
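
The call chain in item 2 can be sketched as a single per-page function. Everything below is a simplified, hypothetical stand-in, not the code in `layout/columns.rs`: the heuristics (bin width, peak threshold, minimum column width) are assumptions, `confirm_columns` takes spans instead of lines, the `assign_columns_to_lines` step is omitted, and `detect_page_columns` is an invented name for the glue that `extract.rs` would need.

```rust
// Simplified, hypothetical stand-ins for the functions named in this note.
// The real signatures and heuristics in layout/columns.rs may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub x0: f32,
    pub x1: f32,
    pub column: Option<u32>,
}

const BIN_WIDTH: f32 = 10.0; // assumed histogram bin width, in page units
const MIN_PEAK: u32 = 2; // assumed minimum number of spans sharing a left edge

/// Count span left edges (x0) in fixed-width bins across the page.
pub fn build_x0_histogram(spans: &[Span], page_width: f32) -> Vec<u32> {
    let bins = ((page_width / BIN_WIDTH).ceil() as usize).max(1);
    let mut hist = vec![0u32; bins];
    for s in spans {
        let idx = ((s.x0.max(0.0) / BIN_WIDTH) as usize).min(bins - 1);
        hist[idx] += 1;
    }
    hist
}

/// Propose a boundary at each x0 peak that lies far enough right of the previous peak.
pub fn detect_column_gaps(hist: &[u32], page_width: f32) -> Vec<f32> {
    let min_sep = page_width * 0.2; // assumed minimum column width
    let mut last_peak: Option<f32> = None;
    let mut gaps = Vec::new();
    for (i, &count) in hist.iter().enumerate() {
        if count < MIN_PEAK {
            continue;
        }
        let x = i as f32 * BIN_WIDTH;
        let is_new_column = last_peak.map_or(false, |prev| x - prev >= min_sep);
        if is_new_column {
            gaps.push(x);
        }
        if last_peak.is_none() || is_new_column {
            last_peak = Some(x);
        }
    }
    gaps
}

/// Keep a boundary only if more spans sit wholly on each side than straddle it,
/// then turn the surviving boundaries into (left, right) column ranges.
pub fn confirm_columns(gaps: &[f32], page_width: f32, spans: &[Span]) -> Vec<(f32, f32)> {
    let mut edges: Vec<f32> = vec![0.0];
    for &b in gaps {
        let left = spans.iter().filter(|s| s.x1 <= b).count();
        let right = spans.iter().filter(|s| s.x0 >= b).count();
        let crossing = spans.iter().filter(|s| s.x0 < b && s.x1 > b).count();
        if left > crossing && right > crossing {
            edges.push(b);
        }
    }
    edges.push(page_width);
    edges.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Label each span with the column that fully contains it; spans that cross a
/// boundary (for example a full-width heading) stay `None`.
pub fn assign_columns_to_spans(spans: &mut [Span], columns: &[(f32, f32)]) {
    for s in spans.iter_mut() {
        let label = columns
            .iter()
            .position(|&(lo, hi)| s.x0 >= lo && s.x1 <= hi)
            .map(|i| i as u32);
        s.column = label;
    }
}

/// Hypothetical glue that extract.rs could run once per page after line formation.
pub fn detect_page_columns(spans: &mut [Span], page_width: f32) {
    let hist = build_x0_histogram(spans, page_width);
    let gaps = detect_column_gaps(&hist, page_width);
    let columns = confirm_columns(&gaps, page_width, spans);
    assign_columns_to_spans(spans, &columns);
}
```

Under these assumed heuristics, a full-width heading crosses the confirmed boundary and keeps `column: None`, and a page with a single left-edge peak yields one column range, so every span gets the same label. The fixtures in item 3 would check the same behaviour against the real functions.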
## Recommendation
**DO NOT CLOSE this coordinator bead.** The sub-phase implementation is incomplete because:
1. The extraction pipeline doesn't use the Phase 4 layout pipeline
2. Column detection functions are never called in production
3. No end-to-end verification of acceptance criteria
Closing the child beads only shows that the individual functions are implemented and unit-tested. The coordinator must ensure the sub-phase works end-to-end, which requires integration into the extraction pipeline.
## Next Steps
This coordinator bead requires significant extraction pipeline refactoring:
1. Integrate Phase 4.2 line formation into `extract.rs`
2. Add Phase 4.3 column detection after line formation
3. Update `SpanJson` to use computed column values
4. Add fixture tests for acceptance criteria verification