pdftract/notes/pdftract-63ka2.md
jedarden 18af6bb01d docs(pdftract-63ka2): update verification note - extraction pipeline missing Phase 4 integration
Blocker identified:
- Extraction pipeline (extract.rs) doesn't use Phase 4 layout pipeline
- Column detection functions never called in production
- SpanJson.column hardcoded to None (lines 1059, 1916)
- No end-to-end tests for acceptance criteria

Span struct HAS column field (line 179) but extraction doesn't use it.
Coordinator CANNOT CLOSE - sub-phase not end-to-end functional.
2026-05-28 01:47:50 -04:00


Verification Note: pdftract-63ka2 (Updated 2026-05-28)

Bead

Phase 4.3: Column Detection (coordinator)

Current Status: BLOCKER - DO NOT CLOSE

Children Status

All 4 children are CLOSED:

  • pdftract-56vwd - x0 histogram builder - CLOSED ✓
  • pdftract-14w0w - Gap detection - CLOSED ✓
  • pdftract-2rkc1 - Column confirmation - CLOSED ✓
  • pdftract-64j83 - Column label assignment - CLOSED ✓

Implementation Status

Column detection functions are fully implemented in crates/pdftract-core/src/layout/columns.rs:

  • build_x0_histogram() - covered by the module's 49 passing unit tests
  • detect_column_gaps() - covered by the same 49 tests
  • confirm_columns() - covered by the same 49 tests
  • assign_columns_to_spans() - covered by the same 49 tests
  • assign_columns_to_lines() - covered by the same 49 tests

Integration Status: BLOCKER (As of 2026-05-28)

Column detection is NOT integrated into the main extraction pipeline:

  1. Main Span struct has a column field, but it is never populated

    • File: crates/pdftract-core/src/span/mod.rs:179
    • The Span struct does have a column: Option<u32> field (updated since the initial note)
    • However, the extraction pipeline never assigns a value to it
  2. Extraction pipeline does not call column detection

    • File: crates/pdftract-core/src/extract.rs
    • Column detection functions are never invoked (grep found no matches)
    • SpanJson::column is hardcoded to None (lines 1059, 1916)
    • The extraction pipeline doesn't use cluster_spans_into_lines or column detection at all
  3. No end-to-end tests for column detection

    • No fixture tests for three-column papers
    • No fixture tests for full-width headings above two-column body
    • No fixture tests for single-column pages
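The hardcoded SpanJson::column noted in item 2 could, once the pipeline assigns labels, be replaced by a plain pass-through. The following is a minimal sketch only: both structs are trimmed-down, hypothetical stand-ins for the real types (which have more fields), and the use of a From impl is an assumption about how the conversion might be written, not a description of extract.rs.

```rust
/// Hypothetical, trimmed-down stand-in for the core Span type.
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    pub x0: f32,
    /// Column label; None means the span belongs to no column
    /// (for example, a full-width heading).
    pub column: Option<u32>,
}

/// Hypothetical, trimmed-down stand-in for the serialized SpanJson type.
#[derive(Debug, Clone, PartialEq)]
pub struct SpanJson {
    pub text: String,
    pub column: Option<u32>,
}

impl From<&Span> for SpanJson {
    fn from(span: &Span) -> Self {
        SpanJson {
            text: span.text.clone(),
            // Carry the computed label through instead of writing `column: None`.
            column: span.column,
        }
    }
}
```

With a conversion of this shape, a span labelled Some(1) serializes with column 1, and an unlabelled span still serializes with no column, so existing output for single-column documents would only change where detection actually assigns a label.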

Acceptance Criteria

  • [PASS] All 4 children closed
  • [FAIL] Three-column academic paper: three distinct columns detected - NOT VERIFIED
  • [FAIL] Full-width heading above two-column body: heading spans not assigned a column - NOT VERIFIED
  • [FAIL] Single-column page: no false column splits - NOT VERIFIED
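The three unverified criteria above could each be phrased as a small check over labelled spans. This is a sketch under assumptions: LabelledSpan, distinct_columns, and heading_unassigned are hypothetical names introduced here for illustration, and identifying a heading by its exact text is a simplification that a real fixture test might replace with something more robust.

```rust
use std::collections::BTreeSet;

/// Hypothetical minimal view of an extracted span: its text and column label.
pub struct LabelledSpan {
    pub text: String,
    pub column: Option<u32>,
}

/// Number of distinct column labels on a page; unlabelled spans are ignored.
pub fn distinct_columns(spans: &[LabelledSpan]) -> usize {
    spans
        .iter()
        .filter_map(|s| s.column)
        .collect::<BTreeSet<u32>>()
        .len()
}

/// True when every span whose text equals `heading` carries no column label.
pub fn heading_unassigned(spans: &[LabelledSpan], heading: &str) -> bool {
    spans
        .iter()
        .filter(|s| s.text == heading)
        .all(|s| s.column.is_none())
}
```

Under this framing, the three-column fixture would expect distinct_columns to return 3, the heading fixture would expect heading_unassigned to hold while distinct_columns returns 2, and the single-column fixture would expect distinct_columns to return at most 1.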

Blockers

The extraction pipeline (extract.rs) needs to be refactored to use the Phase 4 layout pipeline:

  1. Add Phase 4 pipeline integration

    • File: crates/pdftract-core/src/extract.rs
    • Current pipeline: glyph → span → block, with no line formation or column detection
    • Target pipeline: glyph → span → line → column detection → block formation
  2. Implement column detection call chain

    • After line formation (Phase 4.2), call:
      • build_x0_histogram(spans, page_width)
      • detect_column_gaps(&hist, page_width)
      • confirm_columns(&gaps, page_width, &lines)
      • assign_columns_to_spans(spans, &columns)
      • assign_columns_to_lines(lines)
    • Pass the computed column value to the SpanJson constructor
  3. Add end-to-end tests

    • Create fixture for three-column academic paper
    • Create fixture for two-column page with full-width heading
    • Create fixture for single-column page
    • Verify column detection produces correct labels
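The call chain in item 2 can be illustrated with a toy version of the column phase. Only the five function names and their order come from this note. The Span and Line types, the BIN and MIN_GAP thresholds, the gap representation, the function bodies, and the detect_columns wrapper are all simplified assumptions, and assign_columns_to_lines takes an extra spans argument here because the toy Line stores span indices rather than spans. The real implementations in layout/columns.rs will differ.

```rust
// Simplified, hypothetical stand-ins: only the function names and the call
// order come from the note. Types, thresholds, and bodies are assumptions.

#[derive(Debug, Clone)]
pub struct Span { pub x0: f32, pub x1: f32, pub column: Option<u32> }

#[derive(Debug, Clone)]
pub struct Line { pub x0: f32, pub x1: f32, pub span_ids: Vec<usize>, pub column: Option<u32> }

const BIN: f32 = 10.0; // histogram bin width in points (assumed)
const MIN_GAP: f32 = 30.0; // narrowest empty run treated as a gap (assumed)

/// Count how many spans start in each BIN-wide slice of the page.
pub fn build_x0_histogram(spans: &[Span], page_width: f32) -> Vec<u32> {
    let bins = ((page_width / BIN).ceil() as usize).max(1);
    let mut hist = vec![0u32; bins];
    for s in spans {
        let i = ((s.x0 / BIN) as usize).min(bins - 1);
        hist[i] += 1;
    }
    hist
}

/// Return (start, end) x-ranges of interior runs of empty bins at least MIN_GAP wide.
pub fn detect_column_gaps(hist: &[u32], _page_width: f32) -> Vec<(f32, f32)> {
    let mut gaps = Vec::new();
    let mut run_start: Option<usize> = None;
    let mut seen_content = false;
    for (i, &count) in hist.iter().enumerate() {
        if count > 0 {
            if let Some(start) = run_start.take() {
                let (lo, hi) = (start as f32 * BIN, i as f32 * BIN);
                if hi - lo >= MIN_GAP {
                    gaps.push((lo, hi));
                }
            }
            seen_content = true;
        } else if seen_content && run_start.is_none() {
            run_start = Some(i);
        }
    }
    gaps
}

/// Keep a gap's right edge as a column boundary only if at least two lines
/// end at or before it and at least two lines start at or after it.
pub fn confirm_columns(gaps: &[(f32, f32)], _page_width: f32, lines: &[Line]) -> Vec<f32> {
    gaps.iter()
        .map(|&(_, hi)| hi)
        .filter(|&hi| {
            let left = lines.iter().filter(|l| l.x1 <= hi).count();
            let right = lines.iter().filter(|l| l.x0 >= hi).count();
            left >= 2 && right >= 2
        })
        .collect()
}

/// Label each span with its column index; a span crossing a boundary gets None.
pub fn assign_columns_to_spans(spans: &mut [Span], columns: &[f32]) {
    for s in spans.iter_mut() {
        let (x0, x1) = (s.x0, s.x1);
        let start = columns.iter().filter(|&&b| b <= x0).count();
        let end = columns.iter().filter(|&&b| b < x1).count();
        s.column = if start == end { Some(start as u32) } else { None };
    }
}

/// A line takes the label shared by all of its spans, or None if they disagree.
pub fn assign_columns_to_lines(lines: &mut [Line], spans: &[Span]) {
    for line in lines.iter_mut() {
        let mut labels = line.span_ids.iter().map(|&i| spans[i].column);
        let first = labels.next().flatten();
        line.column = if labels.all(|c| c == first) { first } else { None };
    }
}

/// Run the column phase in the order the note lists, after line formation.
pub fn detect_columns(spans: &mut [Span], lines: &mut [Line], page_width: f32) {
    let hist = build_x0_histogram(spans, page_width);
    let gaps = detect_column_gaps(&hist, page_width);
    let columns = confirm_columns(&gaps, page_width, lines);
    assign_columns_to_spans(spans, &columns);
    assign_columns_to_lines(lines, spans);
}
```

In this toy, the confirmation step is what rejects gaps that merely separate words within one column: such a gap has no lines lying wholly to one side of it. A span that straddles a confirmed boundary, such as a full-width heading, is left with no column, which mirrors the second acceptance criterion.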

Recommendation

DO NOT CLOSE this coordinator bead. The sub-phase implementation is incomplete because:

  1. The extraction pipeline doesn't use the Phase 4 layout pipeline
  2. Column detection functions are never called in production
  3. No end-to-end verification of acceptance criteria

Closing the child beads only shows that the individual functions are implemented and unit-tested. The coordinator must confirm that the sub-phase works end-to-end, which requires integration into the extraction pipeline.

Next Steps

This coordinator bead requires significant extraction pipeline refactoring:

  1. Integrate Phase 4.2 line formation into extract.rs
  2. Add Phase 4.3 column detection after line formation
  3. Update SpanJson to use computed column values
  4. Add fixture tests for acceptance criteria verification