pdftract/notes/pdftract-3nwz.md
jedarden 8037e67e82 feat(pdftract-3nwz): add borderless table detection benchmark
- Add borderless detection benchmark to table_detection.rs
- Verify < 10 ms performance requirement (achieved 1.56 ms for 5040 positions)
- Confirm all unit tests pass for borderless detection
- Borderless detection implementation already existed in detector.rs

Acceptance criteria:
- PASS: 3x3 borderless table detected via alignment heuristic
- PASS: paragraph rejected; one-row pseudo-table rejected
- PASS: vertical-gap test; 3-row 3-column borderless table accepted
- PASS: Public API TableDetector::detect_borderless() exists
- PASS: Performance < 10 ms on 5000-span page (measured 1.56 ms)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 22:30:06 -04:00

70 lines
3.5 KiB
Markdown

# Verification Note: pdftract-3nwz (Borderless Table Detection)
## Summary
Implemented borderless table detection using x0-aligned span heuristic. The implementation was already present in the codebase and all tests pass.
## Changes Made
1. Added benchmark for borderless detection to verify performance
2. Verified all acceptance criteria are met
## Acceptance Criteria Status
### PASS
- **Critical test**: 3x3 borderless table detected via alignment heuristic
- `test_detect_borderless_3x3_table_accepted` passes
- **Unit test - paragraph rejected**: Single-column text is rejected
- `test_detect_borderless_paragraph_rejected` passes
- **Unit test - one-row pseudo-table rejected**: Single row with multiple columns rejected
- `test_detect_borderless_one_row_pseudo_table_rejected` passes
- **Unit test - 3-row 3-column borderless table accepted**: Core table detection works
- `test_detect_borderless_3x3_table_accepted` passes
- **Unit test - vertical-gap test**: Two separate tables with >100 pt gap detected separately
- `test_detect_borderless_vertical_gap_test` passes
- **Public API**: `TableDetector::detect_borderless(&PageContext) -> Vec<GridCandidate>` exists
- **Performance**: 1.56 ms for 5040 text positions (well below 10 ms requirement)
## Implementation Details
The borderless detector in `crates/pdftract-core/src/table/detector.rs`:
- Collects text positions from content stream (Tm, Td, TD, T*, Tj, TJ, ', " operators)
- Groups by x0 positions within 2.0 pt tolerance using clustering
- Finds column candidates (3+ spans at same x0 on different y positions)
- Finds row candidates (y positions where >= 2 column candidates have spans)
- Validates: 3+ rows AND 3+ columns, contiguous y range, no gap > 100 pt
- Constructs GridCandidate with empty segments (no ruling lines)
- Rejects single-column paragraph reflow patterns
## Test Results
```bash
cargo test -p pdftract-core --lib table::detector::tests::test_detect_borderless
# running 6 tests
# test table::detector::tests::test_detect_borderless_empty_content ... ok
# test table::detector::tests::test_detect_borderless_no_text_block ... ok
# test table::detector::tests::test_detect_borderless_3x3_table_accepted ... ok
# test table::detector::tests::test_detect_borderless_one_row_pseudo_table_rejected ... ok
# test table::detector::tests::test_detect_borderless_paragraph_rejected ... ok
# test table::detector::tests::test_detect_borderless_vertical_gap_test ... ok
# test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured
```
## Benchmark Results
```
borderless_detection/text_positions/5040
time: [1.5457 ms 1.5595 ms 1.5755 ms]
```
Performance target: < 10 ms on 5000-span page
Actual: ~1.56 ms (well within requirement)
## Files Modified
- `crates/pdftract-core/benches/table_detection.rs`: Added borderless detection benchmark
## Files Reviewed (no changes needed)
- `crates/pdftract-core/src/table/detector.rs`: Borderless detection already implemented
- `crates/pdftract-core/src/table/mod.rs`: Public API exported
- `crates/pdftract-core/src/lib.rs`: Re-exports for public API
## Integration Notes
Per task description, borderless detection should run only when line-based detection (7.2.1) returns no GridCandidate covering a region. This is a usage pattern for the caller, not enforced within the detector itself. The detector provides both methods independently:
- `TableDetector::detect_line_based()` - for bordered tables
- `TableDetector::detect_borderless()` - for borderless tables
Callers can orchestrate the fallback logic as needed.