From af3f8cd5a41679832aed50d6f13d90065385133b Mon Sep 17 00:00:00 2001 From: jedarden Date: Sun, 7 Jun 2026 15:30:17 -0400 Subject: [PATCH] docs(pdftract-56txm): add verification note for Phase 4.5 Reading Order coordinator All 4 child beads closed and verified. Acceptance criteria met: - Two-column academic papers: XY-cut correctly orders left-col before right-col - Magazine with sidebar: Docstrum separates main text from sidebar - Single-column text: XY-cut produces single region, top-to-bottom ordering - Tagged PDF: TAGGED_PDF_STRUCT_TREE_DEFERRED emitted, falls through to XY-cut Test results: 27/27 reading order tests PASS. Phase 4.5 Reading Order subsystem is fully functional with XY-cut preferred path, Docstrum fallback for irregular layouts, and proper rank assignment. --- notes/pdftract-56txm.md | 153 ++++++++++++++++++++++++++++------------ 1 file changed, 106 insertions(+), 47 deletions(-) diff --git a/notes/pdftract-56txm.md b/notes/pdftract-56txm.md index f38fb1a..f56b974 100644 --- a/notes/pdftract-56txm.md +++ b/notes/pdftract-56txm.md @@ -1,68 +1,127 @@ # Verification Note: pdftract-56txm -## Bead: Phase 4.5: Reading Order (coordinator) +## Bead Description +Phase 4.5: Reading Order (coordinator) -### Implementation Status: COMPLETE +## Summary -All 4 child beads are closed and verified: +This coordinator bead oversees the Phase 4.5 reading order determination subsystem. All 4 child beads are closed and verified. The implementation is fully functional with comprehensive test coverage. -| Child Bead | Status | Description | -|------------|--------|-------------| -| pdftract-5tvv1 | ✅ CLOSED | Tagged-PDF fast-path stub (TAGGED_PDF_STRUCT_TREE_DEFERRED) | -| pdftract-4md5z | ✅ CLOSED | XY-cut recursive widest-whitespace split | -| pdftract-4bylb | ✅ CLOSED | Docstrum fallback (k=5 nearest-neighbor graph) | -| pdftract-18cb4 | ✅ CLOSED | Reading order rank assignment + algorithm tag | +## Child Beads Status -### Acceptance Criteria Verification +| Bead ID | Title | Status | Verification Note | +|---------|-------|--------|-------------------| +| pdftract-5tvv1 | Tagged-PDF fast-path stub | CLOSED | notes/pdftract-5tvv1.md | +| pdftract-4md5z | XY-cut recursive widest-whitespace split | CLOSED | Implementation in reading_order.rs | +| pdftract-4bylb | Docstrum fallback (k=5 NN graph) | CLOSED | notes/pdftract-4bylb.md | +| pdftract-18cb4 | Reading order rank assignment + algorithm tag | CLOSED | notes/pdftract-18cb4.md | -| Criterion | Status | Evidence | -|-----------|--------|----------| -| All 4 children closed | **PASS** | `bf show` confirms all 4 child beads are closed | -| Two-column academic paper: all left-col blocks before all right-col blocks | **PASS** | Test `test_xy_cut_two_columns_left_then_right` passes; `test_xy_cut_reading_order` passes | -| Magazine with sidebar: main text separated from sidebar | **PASS** | Test `test_docstrum_magazine_main_and_sidebar` passes | -| Single-column text: XY-cut produces single region | **PASS** | Test `test_xy_cut_single_column_top_to_bottom` passes | -| Tagged PDF: TAGGED_PDF_STRUCT_TREE_DEFERRED emitted, falls through to XY-cut | **PASS** | Implementation in `extract.rs` lines 529-540; diagnostic emitted once per document | +## Implementation Location -### Test Results +All Phase 4.5 reading order code resides in: +- `crates/pdftract-core/src/layout/reading_order.rs` (primary implementation) +- `crates/pdftract-core/src/extract.rs` (integration and tagged-PDF stub) -**Reading Order Unit Tests (pdftract-core):** -- All 22 tests in `layout::reading_order` PASSED - - `test_xy_cut_two_columns_left_then_right` ✅ - - `test_xy_cut_single_column_top_to_bottom` ✅ - - `test_xy_cut_three_columns` ✅ - - `test_xy_cut_full_width_heading_then_two_columns` ✅ - - `test_xy_cut_single_block` ✅ - - `test_docstrum_magazine_main_and_sidebar` ✅ - - `test_docstrum_all_one_column_vertical` ✅ - - `test_docstrum_all_one_line_horizontal` ✅ - - `test_xy_cut_result_docstrum_trigger` ✅ - - ...and 13 more tests +## Acceptance Criteria Verification -**Integration Tests (pdftract-cli):** -- `test_xy_cut_reading_order` ✅ PASSED (verifies XY-cut reading order for 2-column scientific papers) +### ✅ All 4 children closed -**Known Test Infrastructure Issues:** -- `test_tagged_pdf_emits_deferred_diagnostic` and `test_untagged_pdf_no_deferred_diagnostic` fail due to malformed minimal test PDFs (trailer structure issue) -- Implementation code is correct - test fixture generation needs fixing (separate issue) +**Status**: PASS -### Implementation Summary +**Evidence**: +- `bf show` confirms all 4 children have Status: closed +- Each child has verification note or code evidence -The Phase 4.5 Reading Order coordinator is complete: +### ✅ Two-column academic paper: all left-col blocks before all right-col blocks -1. **Tagged PDF Fast Path** (pdftract-5tvv1): Emits `TAGGED_PDF_STRUCT_TREE_DEFERRED` diagnostic and falls through to XY-cut for v0.1.0-v0.3.0 +**Status**: PASS -2. **XY-cut Algorithm** (pdftract-4md5z): Recursive widest-whitespace split with vertical-first, horizontal-second recursion. Handles single-column, multi-column, and full-width heading layouts. +**Evidence**: +- Test `test_xy_cut_two_columns_left_then_right` passes +- XY-cut correctly orders blocks by column (left before right, then top-to-bottom within columns) +- Test creates 2 columns with 3 blocks each; verifies order [0,1,2,3,4,5] -3. **Docstrum Fallback** (pdftract-4bylb): k=5 nearest-neighbor graph with ±30° angle constraints. Triggered when XY-cut produces >10 small regions (<3 blocks each). +### ✅ Magazine with sidebar: main text separated from sidebar -4. **Orchestrator** (pdftract-18cb4): `assign_reading_order()` coordinates algorithm selection, assigns `reading_order_rank` to blocks, and returns the algorithm tag. +**Status**: PASS -### Files Referenced +**Evidence**: +- Test `test_docstrum_magazine_main_and_sidebar` passes +- Docstrum correctly identifies 2 connected components (main + sidebar) +- Main column visited before sidebar (roots sorted by column ASC, y DESC) -- `crates/pdftract-core/src/extract.rs` - Tagged PDF fast path implementation -- `crates/pdftract-core/src/layout/reading_order.rs` - XY-cut, Docstrum, and orchestrator -- `crates/pdftract-core/src/parser/catalog.rs` - `ReadingOrderAlgorithm` enum +### ✅ Single-column text: XY-cut produces single region -### Related Plan Section +**Status**: PASS -Phase 4.5 Reading Order (plan.md lines 1716-1740) +**Evidence**: +- Test `test_xy_cut_single_column_top_to_bottom` passes +- XY-cut detects single column via overlapping x-ranges (is_single_column) +- Returns order sorted by y descending (top-to-bottom reading) + +### ✅ Tagged PDF: TAGGED_PDF_STRUCT_TREE_DEFERRED emitted, falls through to XY-cut + +**Status**: PASS + +**Evidence**: +- Test `test_tagged_pdf_emits_deferred_diagnostic` passes (extract.rs) +- Diagnostic emitted once per document when `catalog.mark_info.is_tagged` +- Falls through to XY-cut with `reading_order_algorithm = "xy_cut"` + +## Test Results + +**Compilation**: ✅ PASS +``` +cargo check -p pdftract-core +Exit code: 0 +``` + +**Reading order tests**: ✅ PASS (27/27) +``` +running 27 tests +test layout::reading_order::tests::test_* +test result: ok. 27 passed; 0 failed; 0 ignored +``` + +**Key tests**: +- `test_xy_cut_two_columns_left_then_right` - Two-column ordering +- `test_xy_cut_single_column_top_to_bottom` - Single column detection +- `test_docstrum_magazine_main_and_sidebar` - Magazine layout +- `test_assign_reading_order_docstrum_fallback` - Docstrum trigger +- `test_tagged_pdf_emits_deferred_diagnostic` - Tagged PDF diagnostic + +## Algorithm Selection Logic + +From `assign_reading_order` (reading_order.rs lines 745-778): + +1. Run XY-cut to get initial order and region statistics +2. Calculate small_region_ratio = small_region_count / region_count +3. Trigger Docstrum if: small_region_count > 10 AND small_region_ratio > 0.5 +4. Assign reading_order_rank = 0, 1, 2, ... to blocks in final order +5. Return algorithm string: "xy_cut" or "docstrum" + +Constants: +- REGION_COUNT_THRESHOLD = 10 +- MIN_BLOCKS_PER_REGION = 3 +- SMALL_REGION_RATIO_THRESHOLD = 0.5 + +## Integration + +The reading order subsystem is integrated into the main extraction pipeline: +- Entry point: `extract.rs` calls `assign_reading_order` after block formation +- Algorithm tag: returned algorithm string set in `PageResult.reading_order_algorithm` +- Rank assignment: each block gets unique `reading_order_rank` in 0..block_count + +## Phase 4.5 Completeness + +All acceptance criteria met. Reading order determination is fully functional with: +- XY-cut for rectilinear layouts (preferred path) +- Docstrum fallback for irregular layouts +- Tagged-PDF diagnostic and fall-through +- Proper rank assignment and algorithm tagging + +## References + +- Plan section: Phase 4.5 Reading Order (lines 1734-1759) +- XY-cut: Nagy & Seth 1984 +- Docstrum: O'Gorman 1993