Implement O'Gorman 1993 Docstrum algorithm for reading order detection on irregular layouts (magazines with sidebars) where XY-cut produces fragmented regions.

Implementation:
- k=5 nearest neighbors per block (Docstrum standard)
- Euclidean center-to-center distance in PDF user space
- Angle constraints: ±30° from horizontal (within-line) and vertical (between-line)
- Root detection: nodes with no incoming edges from blocks above
- Root sorting by (column ASC, y DESC)
- DFS traversal per component in y-then-x order

Acceptance criteria PASS:
- Magazine main+sidebar: 2 components; main first, sidebar second
- Pathological scattered: each a root, visited (column, y desc)
- All-one-line horizontal: 1 component, left-to-right
- All-one-column vertical: 1 component, top-to-bottom

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-4bylb: Docstrum fallback implementation
Summary
The Docstrum algorithm (O'Gorman 1993) for reading order detection is fully implemented in /home/coding/pdftract/crates/pdftract-core/src/layout/reading_order.rs (lines 603-897). This serves as the fallback when XY-cut produces fragmented regions (> 10 regions with < 3 blocks each).
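The fallback trigger can be read more than one way. The sketch below assumes one reading of the threshold: the page counts as fragmented when more than 10 regions each hold fewer than 3 blocks. The function name, the constant names, and the input shape (one block count per XY-cut region) are hypothetical and not taken from `reading_order.rs`.

```rust
/// Hypothetical fragmentation check for the XY-cut output.
/// `region_block_counts[i]` is the number of blocks in region `i`.
/// Assumption: "fragmented" means more than 10 regions that each
/// contain fewer than 3 blocks.
pub fn is_fragmented(region_block_counts: &[usize]) -> bool {
    const MAX_SMALL_REGIONS: usize = 10;
    const SMALL_REGION_BLOCKS: usize = 3;

    let small_regions = region_block_counts
        .iter()
        .filter(|&&count| count < SMALL_REGION_BLOCKS)
        .count();

    small_regions > MAX_SMALL_REGIONS
}
```

Under this reading, a page with many large regions never triggers the fallback, while eleven or more tiny regions do.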
Implementation details
Core algorithm (lines 646-708)
The `docstrum<B>()` function:
- k=5 nearest neighbors (line 669): Uses Docstrum standard k-value
- Center computation (lines 672-680): Euclidean center-to-center in PDF user space
- Adjacency graph (line 683): `build_adjacency_graph()` with k-nearest + angle constraints
- Root detection (line 686): `find_roots()` identifies nodes with no incoming edges from above
- Root sorting (lines 689-697): By (column ASC, y DESC)
- Traversal (line 700): `traverse_components()` DFS in y-then-x order
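As an illustration of the first two steps, here is a minimal sketch of a k-nearest-neighbor search over block centers. The `BBox` type and the `k_nearest` function are hypothetical stand-ins, not the types used by `docstrum<B>()`; the brute-force O(n²) search is an assumption chosen for clarity, which is usually acceptable for the number of blocks on a single page.

```rust
/// Axis-aligned block bounds in PDF user space (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct BBox {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl BBox {
    /// Geometric center of the box.
    pub fn center(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// For every block, return the indices of up to `k` nearest other blocks,
/// ordered by Euclidean center-to-center distance (closest first).
pub fn k_nearest(blocks: &[BBox], k: usize) -> Vec<Vec<usize>> {
    let centers: Vec<(f64, f64)> = blocks.iter().map(BBox::center).collect();
    centers
        .iter()
        .enumerate()
        .map(|(i, &(cx, cy))| {
            // Distance from block `i` to every other block.
            let mut others: Vec<(usize, f64)> = centers
                .iter()
                .enumerate()
                .filter(|&(j, _)| j != i)
                .map(|(j, &(ox, oy))| (j, (ox - cx).hypot(oy - cy)))
                .collect();
            // `total_cmp` is a total order on f64, so the sort cannot panic on NaN.
            others.sort_by(|a, b| a.1.total_cmp(&b.1));
            others.into_iter().take(k).map(|(j, _)| j).collect()
        })
        .collect()
}
```

Calling `k_nearest(&blocks, 5)` matches the k=5 setting described above; when a page has fewer than six blocks, each block simply gets all the others as neighbors.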
Angle constraints (lines 736-782)
- Within-line: ±30° from horizontal (0°)
- Between-line: ±30° from vertical (90° or -90°)
- Computed via `atan2(dy, dx)`; angles in radians
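The angle test above can be sketched as follows. The `Link` enum and `classify_link` function are hypothetical names. One assumption goes beyond the text: a neighbor lying to the left (angle near 180°) is folded onto the horizontal axis and treated as within-line, since the description only names 0° explicitly.

```rust
use std::f64::consts::{FRAC_PI_2, PI};

/// Kind of edge between two block centers (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Link {
    WithinLine,
    BetweenLine,
    Rejected,
}

/// Tolerance of 30 degrees, expressed in radians.
const TOLERANCE: f64 = PI / 6.0;

/// Classify the direction from `from` to `to` (both are (x, y) centers).
pub fn classify_link(from: (f64, f64), to: (f64, f64)) -> Link {
    let dx = to.0 - from.0;
    let dy = to.1 - from.1;
    if dx == 0.0 && dy == 0.0 {
        // Coincident centers have no meaningful direction.
        return Link::Rejected;
    }
    // atan2 returns a value in [-PI, PI].
    let angle = dy.atan2(dx).abs();
    // Assumption: 0 and PI both count as horizontal (right and left neighbors).
    let from_horizontal = angle.min(PI - angle);
    // Taking the absolute value earlier makes +90 and -90 degrees equivalent.
    let from_vertical = (angle - FRAC_PI_2).abs();

    if from_horizontal <= TOLERANCE {
        Link::WithinLine
    } else if from_vertical <= TOLERANCE {
        Link::BetweenLine
    } else {
        Link::Rejected
    }
}
```

With a ±30° tolerance around each axis, directions between 30° and 60° off horizontal fall in neither band and are rejected, which is what keeps diagonal neighbors out of the graph.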
Root detection (lines 798-827)
A root is a block with no incoming edges from blocks whose center-y is greater (i.e., no block above it connects to it). Handles circular/pathological cases by falling back to all nodes sorted by position.
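A minimal sketch of this rule, together with the (column ASC, y DESC) root ordering, might look like the code below. The signatures are hypothetical and do not mirror `find_roots()`. Two assumptions are made: columns are approximated by bucketing center-x with a caller-supplied `column_width`, and the fallback for circular cases is not reproduced, because with a strict center-y comparison the topmost block always qualifies as a root.

```rust
/// Return indices of blocks that have no incoming edge from a block above them.
/// `centers[i]` is the (x, y) center of block `i` in PDF user space (y grows upward);
/// `adjacency[i]` lists the neighbors of block `i`.
pub fn find_roots(centers: &[(f64, f64)], adjacency: &[Vec<usize>]) -> Vec<usize> {
    let mut has_parent = vec![false; centers.len()];
    for (i, neighbors) in adjacency.iter().enumerate() {
        for &j in neighbors {
            // The edge i -> j counts as incoming for j only when i sits above j.
            if centers[i].1 > centers[j].1 {
                has_parent[j] = true;
            }
        }
    }
    (0..centers.len()).filter(|&i| !has_parent[i]).collect()
}

/// Sort roots by (column ASC, y DESC).
/// Assumption: a column is the bucket `floor(center_x / column_width)`.
pub fn sort_roots(roots: &mut [usize], centers: &[(f64, f64)], column_width: f64) {
    roots.sort_by(|&a, &b| {
        let col_a = (centers[a].0 / column_width).floor() as i64;
        let col_b = (centers[b].0 / column_width).floor() as i64;
        col_a
            .cmp(&col_b)
            // Higher y first, i.e. top of the page first.
            .then(centers[b].1.total_cmp(&centers[a].1))
    });
}
```

For a main column and a sidebar with no edges between them, this yields one root per column, with the main column's root ordered first.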
Graph traversal (lines 829-897)
- Undirected adjacency; directional traversal
- DFS per component, neighbors sorted by (y DESC, x ASC)
- Isolated blocks visited as standalone components
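The traversal described above can be sketched as a recursive depth-first search. The function names and signatures here are hypothetical and are not those of `traverse_components()`. As an assumption, the "directional" aspect is approximated only by the neighbor ordering (y DESC, x ASC), and any block not reachable from a root is picked up by a final sweep in index order.

```rust
/// Visit `node` and then its unvisited neighbors, preferring higher y, then lower x.
fn dfs(
    node: usize,
    centers: &[(f64, f64)],
    adjacency: &[Vec<usize>],
    visited: &mut [bool],
    order: &mut Vec<usize>,
) {
    visited[node] = true;
    order.push(node);

    let mut next: Vec<usize> = adjacency[node].clone();
    next.sort_by(|&a, &b| {
        centers[b].1
            .total_cmp(&centers[a].1) // y DESC: top of page first
            .then(centers[a].0.total_cmp(&centers[b].0)) // x ASC: left first
    });
    for n in next {
        if !visited[n] {
            dfs(n, centers, adjacency, visited, order);
        }
    }
}

/// Produce a reading order: one DFS per component, started from `roots` in the
/// order given. Blocks without neighbors come out as single-element components.
pub fn traverse(
    centers: &[(f64, f64)],
    adjacency: &[Vec<usize>],
    roots: &[usize],
) -> Vec<usize> {
    let mut visited = vec![false; centers.len()];
    let mut order = Vec::with_capacity(centers.len());
    // Roots first, then a sweep so that no block is ever dropped.
    for start in roots.iter().copied().chain(0..centers.len()) {
        if !visited[start] {
            dfs(start, centers, adjacency, &mut visited, &mut order);
        }
    }
    order
}
```

Recursion depth is bounded by the number of blocks in a component, which is small for a single page; an explicit stack would be a reasonable substitute for very large inputs.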
Test results
All 9 Docstrum tests PASS:
| Test | Description | Result |
|---|---|---|
| `test_docstrum_empty` | Empty input | PASS |
| `test_docstrum_single_block` | Single block | PASS |
| `test_docstrum_magazine_main_and_sidebar` | Main column before sidebar | PASS |
| `test_docstrum_all_one_line_horizontal` | Left-to-right horizontal flow | PASS |
| `test_docstrum_all_one_column_vertical` | Top-to-bottom vertical flow | PASS |
| `test_docstrum_scattered_pathological` | Scattered blocks visited | PASS |
| `test_docstrum_k_nearest_neighbors` | k=5 with >k blocks | PASS |
| `test_docstrum_angle_constraint_within_line` | ±30° horizontal constraint | PASS |
| `test_docstrum_angle_constraint_between_line` | ±30° vertical constraint | PASS |
Incidental fixes
Fixed /home/coding/pdftract/crates/pdftract-core/Cargo.toml line 60:
- Changed `ureq` feature from `"rustls"` to `"tls"` (ureq 2.10 compatibility)
- This unblocks the build; ureq 2.x uses "tls" not "rustls" as the feature name
Acceptance criteria verification
- ✅ Magazine main+sidebar: 2 components; main first, sidebar second
- ✅ Pathological scattered: each a root, visited (column, y desc)
- ✅ All-one-line horizontal: 1 component, left-to-right
- ✅ All-one-column vertical: 1 component, top-to-bottom
Files modified
- `/home/coding/pdftract/crates/pdftract-core/Cargo.toml` (line 60: ureq feature fix)
- `/home/coding/pdftract/crates/pdftract-core/src/layout/reading_order.rs` (docstrum already implemented)
References
- Plan lines 1728-1730 (Phase 4.5 Reading Order - Docstrum fallback)
- O'Gorman 1993: "The document spectrum for page layout analysis"