From 18af6bb01dd1006ad218cb4bfffca26fd88a8b4c Mon Sep 17 00:00:00 2001
From: jedarden
Date: Thu, 28 May 2026 01:47:17 -0400
Subject: [PATCH] docs(pdftract-63ka2): update verification note - extraction pipeline missing Phase 4 integration

Blocker identified:
- Extraction pipeline (extract.rs) doesn't use Phase 4 layout pipeline
- Column detection functions never called in production
- SpanJson.column hardcoded to None (lines 1059, 1916)
- No end-to-end tests for acceptance criteria

Span struct HAS column field (line 179) but extraction doesn't use it.
Coordinator CANNOT CLOSE - sub-phase not end-to-end functional.
---
 notes/pdftract-63ka2.md | 49 +++++++++++++++++++++++------------------
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/notes/pdftract-63ka2.md b/notes/pdftract-63ka2.md
index 4d9f554..b173073 100644
--- a/notes/pdftract-63ka2.md
+++ b/notes/pdftract-63ka2.md
@@ -1,9 +1,9 @@
-# Verification Note: pdftract-63ka2
+# Verification Note: pdftract-63ka2 (Updated 2026-05-28)
 
 ## Bead
 Phase 4.3: Column Detection (coordinator)
 
-## Current State
+## Current Status: BLOCKER - DO NOT CLOSE
 
 ### Children Status
 All 4 children are CLOSED:
@@ -20,20 +20,20 @@ Column detection functions are fully implemented in `crates/pdftract-core/src/la
 - `assign_columns_to_spans()` - Part of the 49 tests
 - `assign_columns_to_lines()` - Part of the 49 tests
 
-### Integration Status: BLOCKER
+### Integration Status: BLOCKER (As of 2026-05-28)
 
 Column detection is NOT integrated into the main extraction pipeline:
 
-1. **Main `Span` struct missing column field**
-   - File: `crates/pdftract-core/src/span/mod.rs`
-   - The `Span` struct does NOT have a `column: Option` field
-   - Child bead `pdftract-64j83` added the column field to `HybridHybridSpan` (hybrid.rs) instead
-   - `HybridHybridSpan` is used for hybrid pages (mixed vector/scanned content), not the main pipeline
+1. **Main `Span` struct HAS column field but it's never used**
+   - File: `crates/pdftract-core/src/span/mod.rs:179`
+   - The `Span` struct DOES have `column: Option` field (updated since initial note)
+   - However, the extraction pipeline never assigns column values
 
 2. **Extraction pipeline does not call column detection**
    - File: `crates/pdftract-core/src/extract.rs`
-   - Column detection functions are never invoked
+   - Column detection functions are never invoked (grep found no matches)
    - `SpanJson::column` is hardcoded to `None` (lines 1059, 1916)
+   - The extraction pipeline doesn't use `cluster_spans_into_lines` or column detection at all
 
 3. **No end-to-end tests for column detection**
    - No fixture tests for three-column papers
@@ -49,13 +49,16 @@ Column detection is NOT integrated into the main extraction pipeline:
 
 ## Blockers
 
-1. **Add `column: Option` field to main `Span` struct**
-   - File: `crates/pdftract-core/src/span/mod.rs`
-   - Update `Span::new()` to initialize the field
+The extraction pipeline (`extract.rs`) needs to be refactored to use the Phase 4 layout pipeline:
 
-2. **Integrate column detection into extraction pipeline**
+1. **Add Phase 4 pipeline integration**
    - File: `crates/pdftract-core/src/extract.rs`
-   - After line formation (Phase 4.2), call column detection:
+   - Currently the pipeline doesn't use line formation or column detection
+   - Need to add: glyph → span → line → column detection → block formation
+   - Current pipeline goes directly from glyphs to spans to blocks without line/column phases
+
+2. **Implement column detection call chain**
+   - After line formation (Phase 4.2), call:
      - `build_x0_histogram(spans, page_width)`
      - `detect_column_gaps(&hist, page_width)`
      - `confirm_columns(&gaps, page_width, &lines)`
@@ -71,15 +74,17 @@ Column detection is NOT integrated into the main extraction pipeline:
 
 ## Recommendation
 
-DO NOT CLOSE this coordinator bead. The sub-phase implementation is incomplete because:
-1. The main `Span` struct lacks the column field
-2. The extraction pipeline does not call column detection
+**DO NOT CLOSE this coordinator bead.** The sub-phase implementation is incomplete because:
+1. The extraction pipeline doesn't use the Phase 4 layout pipeline
+2. Column detection functions are never called in production
 3. No end-to-end verification of acceptance criteria
 
-The child beads being closed only means the individual functions are implemented. The coordinator must ensure the sub-phase works end-to-end, which requires integration into the extraction pipeline.
+The child beads being closed only means the individual functions are implemented and unit-tested. The coordinator must ensure the sub-phase works end-to-end, which requires integration into the extraction pipeline.
 
-## Files Requiring Changes
+## Next Steps
 
-1. `crates/pdftract-core/src/span/mod.rs` - Add `column: Option` to `Span`
-2. `crates/pdftract-core/src/extract.rs` - Integrate column detection pipeline
-3. `crates/pdftract-core/tests/` or `crates/pdftract-cli/tests/` - Add fixture tests
+This coordinator bead requires significant extraction pipeline refactoring:
+1. Integrate Phase 4.2 line formation into extract.rs
+2. Add Phase 4.3 column detection after line formation
+3. Update SpanJson to use computed column values
+4. Add fixture tests for acceptance criteria verification
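
Reviewer sketch (not part of the patch): the call chain listed under Blockers could be wired together roughly as follows. Only the four function names and their argument order come from the note. The `Span` and `Line` shapes, the `usize` inside `Option` (the note elides the type parameter), the constants, the function bodies, and the wrapper `detect_and_assign_columns` are all assumptions made for illustration and may not match `pdftract-core`.

```rust
// Sketch only: function names are taken from the note; every type, constant,
// signature detail, and body here is a hypothetical stand-in.

const BINS: usize = 64; // assumed number of histogram bins across the page
const MIN_GAP_BINS: usize = 5; // assumed minimum run of empty bins between columns
const MIN_COLUMN_FRACTION: f32 = 0.15; // assumed minimum column width / page width

#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub x0: f32,
    pub x1: f32,
    pub column: Option<usize>, // element type assumed; elided in the note
}

#[derive(Debug, Clone, PartialEq)]
pub struct Line {
    pub x0: f32,
    pub x1: f32,
}

/// Counts span left edges (x0) per bin across the page width.
pub fn build_x0_histogram(spans: &[Span], page_width: f32) -> Vec<usize> {
    let bin_width = page_width / BINS as f32;
    let mut hist = vec![0usize; BINS];
    for span in spans {
        let bin = ((span.x0 / bin_width) as usize).min(BINS - 1);
        hist[bin] += 1;
    }
    hist
}

/// Returns candidate boundaries: the left edge of each populated bin that
/// follows a run of at least MIN_GAP_BINS empty bins.
pub fn detect_column_gaps(hist: &[usize], page_width: f32) -> Vec<f32> {
    if hist.is_empty() {
        return Vec::new();
    }
    let bin_width = page_width / hist.len() as f32;
    let mut gaps = Vec::new();
    let mut seen_text = false;
    let mut empty_run = 0usize;
    for (i, &count) in hist.iter().enumerate() {
        if count == 0 {
            empty_run += 1;
            continue;
        }
        if seen_text && empty_run >= MIN_GAP_BINS {
            gaps.push(i as f32 * bin_width);
        }
        seen_text = true;
        empty_run = 0;
    }
    gaps
}

/// Keeps a candidate only if it leaves a wide enough column on each side and
/// fewer than half of the lines cross it. With no lines, nothing is confirmed.
pub fn confirm_columns(gaps: &[f32], page_width: f32, lines: &[Line]) -> Vec<f32> {
    let margin = page_width * MIN_COLUMN_FRACTION;
    gaps.iter()
        .copied()
        .filter(|&x| {
            let crossing = lines.iter().filter(|l| l.x0 < x && l.x1 > x).count();
            x >= margin && x <= page_width - margin && crossing * 2 < lines.len()
        })
        .collect()
}

/// Sets each span's column to the number of confirmed boundaries at or left of its x0.
pub fn assign_columns_to_spans(spans: &mut [Span], boundaries: &[f32]) {
    for span in spans.iter_mut() {
        let x0 = span.x0;
        span.column = Some(boundaries.iter().filter(|&&b| x0 >= b).count());
    }
}

/// Hypothetical wrapper that extract.rs could call after line formation.
/// Returns the number of columns found.
pub fn detect_and_assign_columns(spans: &mut [Span], lines: &[Line], page_width: f32) -> usize {
    let hist = build_x0_histogram(spans, page_width);
    let gaps = detect_column_gaps(&hist, page_width);
    let boundaries = confirm_columns(&gaps, page_width, lines);
    assign_columns_to_spans(spans, &boundaries);
    boundaries.len() + 1
}
```

Under these assumptions, `SpanJson::column` could then be populated from `span.column` instead of being hardcoded to `None`. The confirmation step is what lets a page of full-width lines fall back to a single column even when span left edges cluster in two places.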