- Add borderless detection benchmark to table_detection.rs
- Verify < 10 ms performance requirement (achieved 1.56 ms for 5040 positions)
- Confirm all unit tests pass for borderless detection
- Borderless detection implementation already existed in detector.rs

Acceptance criteria:
- PASS: 3x3 borderless table detected via alignment heuristic
- PASS: paragraph rejected; one-row pseudo-table rejected
- PASS: vertical-gap test; 3-row 3-column borderless table accepted
- PASS: Public API TableDetector::detect_borderless() exists
- PASS: Performance < 10 ms on 5000-span page (measured 1.56 ms)

Co-Authored-By: Claude Code <noreply@anthropic.com>
Verification Note: pdftract-3nwz (Borderless Table Detection)
Summary
Borderless table detection uses an x0-aligned span heuristic. The implementation was already present in the codebase, so this task added a benchmark and verified that all tests pass.
Changes Made
- Added benchmark for borderless detection to verify performance
- Verified all acceptance criteria are met
Acceptance Criteria Status
PASS
- Critical test: 3x3 borderless table detected via alignment heuristic
  test_detect_borderless_3x3_table_accepted passes
- Unit test - paragraph rejected: Single-column text is rejected
  test_detect_borderless_paragraph_rejected passes
- Unit test - one-row pseudo-table rejected: Single row with multiple columns rejected
  test_detect_borderless_one_row_pseudo_table_rejected passes
- Unit test - 3-row 3-column borderless table accepted: Core table detection works
  test_detect_borderless_3x3_table_accepted passes
- Unit test - vertical-gap test: Two separate tables with >100 pt gap detected separately
  test_detect_borderless_vertical_gap_test passes
- Public API: TableDetector::detect_borderless(&PageContext) -> Vec<GridCandidate> exists
- Performance: 1.56 ms for 5040 text positions (well below the 10 ms requirement)
Implementation Details
The borderless detector in crates/pdftract-core/src/table/detector.rs:
- Collects text positions from content stream (Tm, Td, TD, T*, Tj, TJ, ', " operators)
- Groups by x0 positions within 2.0 pt tolerance using clustering
- Finds column candidates (3+ spans at same x0 on different y positions)
- Finds row candidates (y positions where >= 2 column candidates have spans)
- Validates: 3+ rows AND 3+ columns, contiguous y range, no gap > 100 pt
- Constructs GridCandidate with empty segments (no ruling lines)
- Rejects single-column paragraph reflow patterns
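The steps above can be illustrated with a minimal, self-contained sketch. It is not the code in detector.rs: TextPos, GridSketch, cluster_by_x0, detect_grids and y_key are hypothetical names, and bucketing baselines to 0.1 pt is an assumption made here so that floating-point y values compare equal. Only the thresholds (2.0 pt x0 tolerance, 100 pt maximum gap, 3+ rows and 3+ columns) come from this note.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for one text position (not the crate's real type).
#[derive(Debug, Clone, Copy)]
pub struct TextPos {
    pub x0: f64,
    pub y: f64,
}

/// Hypothetical result type: row baselines and column left edges of one grid.
#[derive(Debug, Clone, PartialEq)]
pub struct GridSketch {
    pub row_ys: Vec<f64>,
    pub col_xs: Vec<f64>,
}

const X_TOLERANCE: f64 = 2.0; // pt, from the note
const MAX_ROW_GAP: f64 = 100.0; // pt, from the note
const MIN_ROWS: usize = 3;
const MIN_COLS: usize = 3;

/// Buckets a baseline to 0.1 pt (assumption of this sketch).
fn y_key(y: f64) -> i64 {
    (y * 10.0).round() as i64
}

/// Sorts by x0 and starts a new cluster once a span lies more than
/// X_TOLERANCE to the right of the current cluster's first span.
pub fn cluster_by_x0(positions: &[TextPos]) -> Vec<Vec<TextPos>> {
    let mut sorted = positions.to_vec();
    sorted.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    let mut clusters: Vec<Vec<TextPos>> = Vec::new();
    for p in sorted {
        let start_new = match clusters.last() {
            Some(c) => p.x0 - c[0].x0 > X_TOLERANCE,
            None => true,
        };
        if start_new {
            clusters.push(vec![p]);
        } else {
            clusters.last_mut().unwrap().push(p);
        }
    }
    clusters
}

pub fn detect_grids(positions: &[TextPos]) -> Vec<GridSketch> {
    // Column candidates: clusters with spans on 3+ distinct baselines.
    let columns: Vec<(f64, BTreeSet<i64>)> = cluster_by_x0(positions)
        .into_iter()
        .map(|c| (c[0].x0, c.iter().map(|p| y_key(p.y)).collect::<BTreeSet<i64>>()))
        .filter(|(_, ys)| ys.len() >= MIN_ROWS)
        .collect();

    // Row candidates: baselines where at least 2 column candidates have a span.
    let mut hits: BTreeMap<i64, usize> = BTreeMap::new();
    for (_, ys) in &columns {
        for &y in ys {
            *hits.entry(y).or_insert(0) += 1;
        }
    }
    let rows: Vec<i64> = hits.into_iter().filter(|&(_, n)| n >= 2).map(|(y, _)| y).collect();

    // Split the rows into runs wherever the vertical gap exceeds MAX_ROW_GAP.
    let mut runs: Vec<Vec<i64>> = Vec::new();
    for y in rows {
        let start_new = match runs.last().and_then(|r| r.last()) {
            Some(&prev) => (y - prev) as f64 / 10.0 > MAX_ROW_GAP,
            None => true,
        };
        if start_new {
            runs.push(vec![y]);
        } else {
            runs.last_mut().unwrap().push(y);
        }
    }

    // Accept a run only if it has 3+ rows and 3+ columns populated within it.
    runs.into_iter()
        .filter_map(|run| {
            let col_xs: Vec<f64> = columns
                .iter()
                .filter(|(_, ys)| run.iter().filter(|&y| ys.contains(y)).count() >= MIN_ROWS)
                .map(|(x, _)| *x)
                .collect();
            if run.len() >= MIN_ROWS && col_xs.len() >= MIN_COLS {
                let row_ys = run.iter().map(|&y| y as f64 / 10.0).collect();
                Some(GridSketch { row_ys, col_xs })
            } else {
                None
            }
        })
        .collect()
}
```

In this sketch a single-column paragraph is rejected because no baseline is shared by two column candidates, and a one-row pseudo-table is rejected because no x0 cluster spans three baselines.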
Test Results
cargo test -p pdftract-core --lib table::detector::tests::test_detect_borderless
# running 6 tests
# test table::detector::tests::test_detect_borderless_empty_content ... ok
# test table::detector::tests::test_detect_borderless_no_text_block ... ok
# test table::detector::tests::test_detect_borderless_3x3_table_accepted ... ok
# test table::detector::tests::test_detect_borderless_one_row_pseudo_table_rejected ... ok
# test table::detector::tests::test_detect_borderless_paragraph_rejected ... ok
# test table::detector::tests::test_detect_borderless_vertical_gap_test ... ok
# test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured
Benchmark Results
borderless_detection/text_positions/5040
time: [1.5457 ms 1.5595 ms 1.5755 ms]
Performance target: < 10 ms on 5000-span page
Actual: ~1.56 ms (well within requirement)
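As a rough illustration of how a measurement like this can be turned into a pass/fail check, here is a std-only sketch. It is not the benchmark in table_detection.rs and does not reproduce the harness that printed the figures above. The helper names median_duration, within_budget and headroom are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the median wall-clock duration.
pub fn median_duration<F: FnMut()>(iters: usize, mut f: F) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let mut samples: Vec<Duration> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[samples.len() / 2]
}

/// True when the measured time is strictly under the budget.
pub fn within_budget(measured: Duration, budget: Duration) -> bool {
    measured < budget
}

/// How many times the measured duration fits into the budget.
pub fn headroom(measured: Duration, budget: Duration) -> f64 {
    budget.as_secs_f64() / measured.as_secs_f64()
}
```

With the figures reported here, headroom(Duration::from_micros(1560), Duration::from_millis(10)) comes out at roughly 6.4, meaning the measured time is about one sixth of the budget.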
Files Modified
- crates/pdftract-core/benches/table_detection.rs: Added borderless detection benchmark
Files Reviewed (no changes needed)
- crates/pdftract-core/src/table/detector.rs: Borderless detection already implemented
- crates/pdftract-core/src/table/mod.rs: Public API exported
- crates/pdftract-core/src/lib.rs: Re-exports for public API
Integration Notes
Per the task description, borderless detection should run only when line-based detection (7.2.1) returns no GridCandidate covering a region. This is a usage pattern for the caller and is not enforced within the detector itself. The detector provides both methods independently:
- TableDetector::detect_line_based() - for bordered tables
- TableDetector::detect_borderless() - for borderless tables
Callers can orchestrate the fallback logic as needed.
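One possible shape for that caller-side fallback is sketched below. It makes several assumptions: Rect is a hypothetical stand-in for whatever bounding region a GridCandidate carries (its fields are not shown in this note), merge_with_fallback is a hypothetical helper, and "covering a region" is approximated by rectangle overlap. The inputs would be the results of the two detector calls, whose exact arguments are left to the caller.

```rust
/// Hypothetical bounding box standing in for a candidate's region.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl Rect {
    /// True when the two rectangles share some interior area.
    pub fn overlaps(&self, other: &Rect) -> bool {
        self.x0 < other.x1 && other.x0 < self.x1 && self.y0 < other.y1 && other.y0 < self.y1
    }
}

/// Keeps every line-based candidate, then adds only those borderless
/// candidates that do not overlap any line-based one.
pub fn merge_with_fallback(line_based: Vec<Rect>, borderless: Vec<Rect>) -> Vec<Rect> {
    let mut out = line_based;
    let bordered = out.len();
    for cand in borderless {
        let covered = out[..bordered].iter().any(|b| b.overlaps(&cand));
        if !covered {
            out.push(cand);
        }
    }
    out
}
```

This variant runs both detectors and filters afterwards. A caller that wants to avoid the extra pass could skip the borderless call when the line-based results already cover the page.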