feat(pdftract-3nwz): add borderless table detection benchmark

- Add borderless detection benchmark to table_detection.rs
- Verify < 10 ms performance requirement (achieved 1.56 ms for 5040 positions)
- Confirm all unit tests pass for borderless detection
- Borderless detection implementation already existed in detector.rs

Acceptance criteria:
- PASS: 3x3 borderless table detected via alignment heuristic
- PASS: paragraph rejected; one-row pseudo-table rejected
- PASS: vertical-gap test; 3-row 3-column borderless table accepted
- PASS: Public API TableDetector::detect_borderless() exists
- PASS: Performance < 10 ms on 5000-span page (measured 1.56 ms)

Co-Authored-By: Claude Code <noreply@anthropic.com>
jedarden 2026-05-23 22:29:53 -04:00
parent 8d1e411d7c
commit 8037e67e82
3 changed files with 129 additions and 4 deletions

@ -1 +1 @@
c251db8228b93881476bb9dcdeb2748fa9be1f23
0e466a5ceaaef3e5b3d0d650730bf6ce84c35982

@ -1,7 +1,7 @@
// Benchmark for table detection.
//
// Tests the performance of line-based table detection on pages with
// varying numbers of path segments.
// Tests the performance of line-based and borderless table detection
// on pages with varying numbers of path segments and text positions.
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
use pdftract_core::table::{TableDetector, PageContext};
@ -54,6 +54,35 @@ fn generate_grid_content(num_horiz: usize, num_vert: usize) -> Vec<u8> {
    content
}

/// Generate content with text positions for borderless tables.
/// Creates a grid-like pattern of text at aligned positions.
fn generate_borderless_content(num_rows: usize, num_cols: usize) -> Vec<u8> {
    let mut content = Vec::new();
    let y_start = 700.0;
    let y_end = 100.0;
    let x_start = 50.0;
    let x_spacing = 100.0;
    // Vertical distance between rows; 0.0 for a single row avoids a 0/0 (NaN) y.
    let y_step = if num_rows > 1 {
        (y_start - y_end) / (num_rows - 1) as f32
    } else {
        0.0
    };

    // Start text block
    content.extend(b"BT ");

    // Generate text positions in a grid pattern
    for row in 0..num_rows {
        let y = y_start - row as f32 * y_step;
        for col in 0..num_cols {
            let x = x_start + (col as f32 * x_spacing);
            // Set an absolute position with Tm (Td offsets are relative to the
            // start of the current line, so they would accumulate), then show text
            content.extend(format!("1 0 0 1 {} {} Tm (R{}C{}) Tj ", x, y, row, col).as_bytes());
        }
    }

    // End text block
    content.extend(b"ET");
    content
}

fn bench_table_detection(c: &mut Criterion) {
    let detector = TableDetector::new();
    let page = make_page();
@ -90,5 +119,31 @@ fn bench_table_detection(c: &mut Criterion) {
    group.finish();
}
criterion_group!(benches, bench_table_detection);
fn bench_borderless_detection(c: &mut Criterion) {
    let detector = TableDetector::new();
    let page = make_page();
    let mut group = c.benchmark_group("borderless_detection");

    // Test with increasing numbers of text positions (rows * cols)
    for (num_rows, num_cols) in [(3, 3), (5, 5), (10, 10), (20, 20), (50, 50), (70, 72)] {
        let total_positions = num_rows * num_cols;
        group.bench_with_input(
            BenchmarkId::new("text_positions", total_positions),
            &total_positions,
            |b, _| {
                let content = generate_borderless_content(num_rows, num_cols);
                let ctx = PageContext::new(&page, &content);
                b.iter(|| {
                    black_box(detector.detect_borderless(black_box(&ctx)))
                });
            },
        );
    }

    group.finish();
}

criterion_group!(benches, bench_table_detection, bench_borderless_detection);
criterion_main!(benches);

notes/pdftract-3nwz.md Normal file
@ -0,0 +1,70 @@
# Verification Note: pdftract-3nwz (Borderless Table Detection)
## Summary
Borderless table detection uses an x0-aligned span heuristic. The implementation was already present in the codebase; this task added a benchmark and verified that all existing tests pass.
## Changes Made
1. Added benchmark for borderless detection to verify performance
2. Verified all acceptance criteria are met
## Acceptance Criteria Status
### PASS
- **Critical test**: 3x3 borderless table detected via alignment heuristic
  - `test_detect_borderless_3x3_table_accepted` passes
- **Unit test - paragraph rejected**: Single-column text is rejected
  - `test_detect_borderless_paragraph_rejected` passes
- **Unit test - one-row pseudo-table rejected**: Single row with multiple columns rejected
  - `test_detect_borderless_one_row_pseudo_table_rejected` passes
- **Unit test - 3-row 3-column borderless table accepted**: Core table detection works
  - `test_detect_borderless_3x3_table_accepted` passes
- **Unit test - vertical-gap test**: Two separate tables with >100 pt gap detected separately
  - `test_detect_borderless_vertical_gap_test` passes
- **Public API**: `TableDetector::detect_borderless(&PageContext) -> Vec<GridCandidate>` exists
- **Performance**: 1.56 ms for 5040 text positions (well below 10 ms requirement)
## Implementation Details
The borderless detector in `crates/pdftract-core/src/table/detector.rs`:
- Collects text positions from content stream (Tm, Td, TD, T*, Tj, TJ, ', " operators)
- Groups by x0 positions within 2.0 pt tolerance using clustering
- Finds column candidates (3+ spans at same x0 on different y positions)
- Finds row candidates (y positions where >= 2 column candidates have spans)
- Validates: 3+ rows AND 3+ columns, contiguous y range, no gap > 100 pt
- Constructs GridCandidate with empty segments (no ruling lines)
- Rejects single-column paragraph reflow patterns
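
The steps above can be sketched in isolation. The following is a minimal illustration, not the code in `detector.rs`: the `Pos` type, the function names, and the `Y_TOL` value are hypothetical, and only the 2.0 pt x0 tolerance, the 3-row/3-column minimum, and the 100 pt gap limit come from this note. It also simply rejects a candidate with a large gap instead of splitting it into separate tables.

```rust
/// Hypothetical text position: `x0` is the left edge, `y` the baseline.
#[derive(Debug, Clone, Copy)]
struct Pos {
    x0: f32,
    y: f32,
}

const X_TOL: f32 = 2.0; // x0 clustering tolerance from the note
const Y_TOL: f32 = 2.0; // assumption: the note does not state a y tolerance
const MIN_ROWS: usize = 3;
const MIN_COLS: usize = 3;
const MAX_GAP: f32 = 100.0;

/// Group positions whose x0 lies within `tol` of the first member of a cluster.
fn cluster_x0(positions: &[Pos], tol: f32) -> Vec<Vec<Pos>> {
    let mut sorted = positions.to_vec();
    sorted.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    let mut clusters: Vec<Vec<Pos>> = Vec::new();
    for p in sorted {
        let starts_new = match clusters.last() {
            Some(c) => (p.x0 - c[0].x0).abs() > tol,
            None => true,
        };
        if starts_new {
            clusters.push(vec![p]);
        } else {
            clusters.last_mut().unwrap().push(p);
        }
    }
    clusters
}

/// Distinct y values of a cluster, top to bottom, merged within `tol`.
fn distinct_ys(ps: &[Pos], tol: f32) -> Vec<f32> {
    let mut ys: Vec<f32> = ps.iter().map(|p| p.y).collect();
    ys.sort_by(|a, b| b.total_cmp(a));
    ys.dedup_by(|a, b| (*a - *b).abs() <= tol);
    ys
}

/// True when the positions form at least a 3x3 grid with no row gap over 100 pt.
fn looks_like_table(positions: &[Pos]) -> bool {
    // Column candidates: x0 clusters with spans on 3+ different y positions.
    let columns: Vec<Vec<f32>> = cluster_x0(positions, X_TOL)
        .iter()
        .map(|c| distinct_ys(c, Y_TOL))
        .filter(|ys| ys.len() >= MIN_ROWS)
        .collect();
    if columns.len() < MIN_COLS {
        return false; // also rejects single-column paragraphs
    }

    // Row candidates: y positions shared by at least two column candidates.
    let mut all: Vec<f32> = columns.iter().flatten().copied().collect();
    all.sort_by(|a, b| b.total_cmp(a));
    all.dedup_by(|a, b| (*a - *b).abs() <= Y_TOL);
    let rows: Vec<f32> = all
        .into_iter()
        .filter(|&y| {
            columns
                .iter()
                .filter(|ys| ys.iter().any(|&cy| (cy - y).abs() <= Y_TOL))
                .count()
                >= 2
        })
        .collect();

    rows.len() >= MIN_ROWS && rows.windows(2).all(|w| w[0] - w[1] <= MAX_GAP)
}
```

Under these assumptions, nine positions at x0 = 50, 150, 250 and y = 700, 680, 660 are accepted, while a five-line paragraph at a single x0 or a single row of three cells is rejected, which mirrors the unit tests listed in this note.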
## Test Results
```bash
cargo test -p pdftract-core --lib table::detector::tests::test_detect_borderless
# running 6 tests
# test table::detector::tests::test_detect_borderless_empty_content ... ok
# test table::detector::tests::test_detect_borderless_no_text_block ... ok
# test table::detector::tests::test_detect_borderless_3x3_table_accepted ... ok
# test table::detector::tests::test_detect_borderless_one_row_pseudo_table_rejected ... ok
# test table::detector::tests::test_detect_borderless_paragraph_rejected ... ok
# test table::detector::tests::test_detect_borderless_vertical_gap_test ... ok
# test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured
```
## Benchmark Results
```
borderless_detection/text_positions/5040
time: [1.5457 ms 1.5595 ms 1.5755 ms]
```
- Performance target: < 10 ms on a 5000-span page
- Actual: ~1.56 ms for 5040 text positions (the 70 x 72 case), well within the requirement
## Files Modified
- `crates/pdftract-core/benches/table_detection.rs`: Added borderless detection benchmark
## Files Reviewed (no changes needed)
- `crates/pdftract-core/src/table/detector.rs`: Borderless detection already implemented
- `crates/pdftract-core/src/table/mod.rs`: Public API exported
- `crates/pdftract-core/src/lib.rs`: Re-exports for public API
## Integration Notes
Per task description, borderless detection should run only when line-based detection (7.2.1) returns no GridCandidate covering a region. This is a usage pattern for the caller, not enforced within the detector itself. The detector provides both methods independently:
- `TableDetector::detect_line_based()` - for bordered tables
- `TableDetector::detect_borderless()` - for borderless tables
Callers can orchestrate the fallback logic as needed.
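
One possible shape for that caller-side fallback is sketched below. It is an assumption-laden illustration: `Candidate`, its `bbox` field, `overlaps`, and `merge_with_fallback` are hypothetical stand-ins, and this note does not say whether `GridCandidate` exposes a bounding box. The idea is to keep every line-based result and accept a borderless result only where no line-based candidate already covers the region.

```rust
/// Hypothetical stand-in for a detected table region.
/// `bbox` is (x0, y0, x1, y1) in page units; the real GridCandidate may differ.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    bbox: (f32, f32, f32, f32),
    borderless: bool,
}

/// Axis-aligned rectangle overlap test.
fn overlaps(a: (f32, f32, f32, f32), b: (f32, f32, f32, f32)) -> bool {
    a.0 < b.2 && b.0 < a.2 && a.1 < b.3 && b.1 < a.3
}

/// Keep all line-based candidates; add a borderless candidate only when it
/// does not overlap any line-based one.
fn merge_with_fallback(line_based: Vec<Candidate>, borderless: Vec<Candidate>) -> Vec<Candidate> {
    let mut out = line_based;
    let n = out.len(); // compare only against the line-based prefix
    for cand in borderless {
        let covered = out[..n].iter().any(|lb| overlaps(lb.bbox, cand.bbox));
        if !covered {
            out.push(cand);
        }
    }
    out
}
```

A caller would pass the results of `detect_line_based()` and `detect_borderless()`, converted to whatever region type it uses, into a function of this kind. Running the borderless pass lazily, only for regions left uncovered, would avoid the extra work but needs region information the detector API described here does not document.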