pdftract/notes/pdftract-3nwz.md
jedarden 8037e67e82 feat(pdftract-3nwz): add borderless table detection benchmark
- Add borderless detection benchmark to table_detection.rs
- Verify < 10 ms performance requirement (achieved 1.56 ms for 5040 positions)
- Confirm all unit tests pass for borderless detection
- Borderless detection implementation already existed in detector.rs

Acceptance criteria:
- PASS: 3x3 borderless table detected via alignment heuristic
- PASS: paragraph rejected; one-row pseudo-table rejected
- PASS: vertical-gap test; 3-row 3-column borderless table accepted
- PASS: Public API TableDetector::detect_borderless() exists
- PASS: Performance < 10 ms on 5000-span page (measured 1.56 ms)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 22:30:06 -04:00


Verification Note: pdftract-3nwz (Borderless Table Detection)

Summary

Verified borderless table detection, which uses an x0-aligned span heuristic. The implementation was already present in the codebase, so this task only added a benchmark and confirmed that all tests pass.

Changes Made

  1. Added a benchmark for borderless detection to verify the < 10 ms performance requirement
  2. Verified all acceptance criteria are met

Acceptance Criteria Status

PASS

  • Critical test: 3x3 borderless table detected via alignment heuristic
    • test_detect_borderless_3x3_table_accepted passes
  • Unit test - paragraph rejected: Single-column text is rejected
    • test_detect_borderless_paragraph_rejected passes
  • Unit test - one-row pseudo-table rejected: Single row with multiple columns rejected
    • test_detect_borderless_one_row_pseudo_table_rejected passes
  • Unit test - 3-row 3-column borderless table accepted: Core table detection works
    • test_detect_borderless_3x3_table_accepted passes
  • Unit test - vertical-gap test: Two separate tables with >100 pt gap detected separately
    • test_detect_borderless_vertical_gap_test passes
  • Public API: TableDetector::detect_borderless(&PageContext) -> Vec<GridCandidate> exists
  • Performance: 1.56 ms for 5040 text positions (well below 10 ms requirement)

Implementation Details

The borderless detector in crates/pdftract-core/src/table/detector.rs:

  • Collects text positions from content stream (Tm, Td, TD, T*, Tj, TJ, ', " operators)
  • Groups by x0 positions within 2.0 pt tolerance using clustering
  • Finds column candidates (3+ spans at same x0 on different y positions)
  • Finds row candidates (y positions where >= 2 column candidates have spans)
  • Validates: 3+ rows AND 3+ columns, contiguous y range, no gap > 100 pt
  • Constructs GridCandidate with empty segments (no ruling lines)
  • Rejects single-column paragraph reflow patterns
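The steps above can be illustrated with a minimal, self-contained sketch. It is not the code in detector.rs: `Span`, `Grid`, `y_key`, and `detect_borderless_sketch` are hypothetical names, y positions are compared after rounding to 0.1 pt (an assumption, since the note does not state a y tolerance), and columns are re-checked inside each gap-separated run of rows (also an assumption). The 2.0 pt and 100 pt thresholds and the 3-row, 3-column minimum come from the list above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical span: only the left edge (x0) and the baseline (y) are used.
#[derive(Debug, Clone, Copy)]
struct Span {
    x0: f32,
    y: f32,
}

/// Hypothetical result: just the grid dimensions, not the real GridCandidate.
#[derive(Debug, PartialEq)]
struct Grid {
    rows: usize,
    cols: usize,
}

const X_TOLERANCE: f32 = 2.0; // pt, from the note
const MAX_ROW_GAP: f32 = 100.0; // pt, from the note
const MIN_COLUMN_SPANS: usize = 3; // spans per column candidate
const MIN_ROW_HITS: usize = 2; // column candidates per row candidate
const MIN_ROWS: usize = 3;
const MIN_COLS: usize = 3;

/// Assumption: two y values name the same row if they agree to 0.1 pt.
fn y_key(y: f32) -> i32 {
    (y * 10.0).round() as i32
}

fn detect_borderless_sketch(spans: &[Span]) -> Vec<Grid> {
    // 1. Cluster x0 values: a value starts a new cluster when it lies more
    //    than X_TOLERANCE to the right of the current cluster's first member.
    let mut xs: Vec<f32> = spans.iter().map(|s| s.x0).collect();
    xs.sort_by(f32::total_cmp);
    let mut anchors: Vec<f32> = Vec::new();
    for x in xs {
        if anchors.last().map_or(true, |a| x - a > X_TOLERANCE) {
            anchors.push(x);
        }
    }

    // 2. Column candidates: clusters with spans on 3+ distinct y positions.
    let columns: Vec<BTreeSet<i32>> = anchors
        .iter()
        .map(|&a| {
            spans
                .iter()
                .filter(|s| s.x0 >= a && s.x0 - a <= X_TOLERANCE)
                .map(|s| y_key(s.y))
                .collect::<BTreeSet<i32>>()
        })
        .filter(|ys| ys.len() >= MIN_COLUMN_SPANS)
        .collect();

    // 3. Row candidates: y positions hit by at least two column candidates.
    let mut hits: BTreeMap<i32, usize> = BTreeMap::new();
    for ys in &columns {
        for &y in ys {
            *hits.entry(y).or_insert(0) += 1;
        }
    }
    let rows: Vec<i32> = hits
        .into_iter()
        .filter(|&(_, n)| n >= MIN_ROW_HITS)
        .map(|(y, _)| y)
        .collect();

    // 4. Split rows wherever the vertical gap exceeds MAX_ROW_GAP, then
    //    validate each run: 3+ rows and 3+ columns with enough spans in it.
    let max_gap = y_key(MAX_ROW_GAP);
    let mut grids = Vec::new();
    let mut start = 0;
    for i in 1..=rows.len() {
        if i == rows.len() || rows[i] - rows[i - 1] > max_gap {
            let run = &rows[start..i];
            let cols = columns
                .iter()
                .filter(|ys| {
                    run.iter().filter(|&&y| ys.contains(&y)).count() >= MIN_COLUMN_SPANS
                })
                .count();
            if run.len() >= MIN_ROWS && cols >= MIN_COLS {
                grids.push(Grid { rows: run.len(), cols });
            }
            start = i;
        }
    }
    grids
}
```

Under these assumptions, a paragraph yields one x0 cluster and therefore no row candidates, and a one-row pseudo-table yields no column candidates, which mirrors the two rejection tests listed above.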

Test Results

cargo test -p pdftract-core --lib table::detector::tests::test_detect_borderless
# running 6 tests
# test table::detector::tests::test_detect_borderless_empty_content ... ok
# test table::detector::tests::test_detect_borderless_no_text_block ... ok
# test table::detector::tests::test_detect_borderless_3x3_table_accepted ... ok
# test table::detector::tests::test_detect_borderless_one_row_pseudo_table_rejected ... ok
# test table::detector::tests::test_detect_borderless_paragraph_rejected ... ok
# test table::detector::tests::test_detect_borderless_vertical_gap_test ... ok
# test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured

Benchmark Results

borderless_detection/text_positions/5040
                        time:   [1.5457 ms 1.5595 ms 1.5755 ms]

Performance target: < 10 ms on 5000-span page
Actual: ~1.56 ms (well within requirement)
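The benchmark source itself is not reproduced in this note. As a rough, std-only sketch of how an input with 5040 text positions could be generated and timed, the helpers below build a synthetic content stream from `Tm` and `Tj` operators and average a closure's run time. The names `synthetic_grid_stream` and `mean_time`, the 72 x 70 layout (which happens to give 5040 positions), and the font and spacing values are all assumptions; the real benchmark may construct its input differently and uses its own harness.

```rust
use std::time::{Duration, Instant};

/// Build a synthetic content stream with `rows * cols` positioned spans.
/// Each span sets an absolute position with `Tm` and shows text with `Tj`.
fn synthetic_grid_stream(rows: usize, cols: usize) -> String {
    let mut out = String::from("BT\n/F1 8 Tf\n");
    for r in 0..rows {
        for c in 0..cols {
            // 8 pt column pitch and 10 pt row pitch keep a 72 x 70 grid
            // inside a 612 x 792 pt page and well apart from a 2 pt tolerance.
            let x = 36.0 + c as f32 * 8.0;
            let y = 756.0 - r as f32 * 10.0;
            out.push_str(&format!("1 0 0 1 {:.1} {:.1} Tm (r{}c{}) Tj\n", x, y, r, c));
        }
    }
    out.push_str("ET\n");
    out
}

/// Mean wall-clock time of `f` over `iters` runs. `iters` must be non-zero.
/// This is a crude timer with no warm-up or outlier handling.
fn mean_time<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}
```

A caller would pass the detector invocation as the closure, for example `mean_time(100, || { /* run detection on the stream */ })`, and compare the result against the 10 ms target.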

Files Modified

  • crates/pdftract-core/benches/table_detection.rs: Added borderless detection benchmark

Files Reviewed (no changes needed)

  • crates/pdftract-core/src/table/detector.rs: Borderless detection already implemented
  • crates/pdftract-core/src/table/mod.rs: Public API exported
  • crates/pdftract-core/src/lib.rs: Re-exports for public API

Integration Notes

Per the task description, borderless detection should run only when line-based detection (7.2.1) returns no GridCandidate covering a region. This is a usage pattern for the caller and is not enforced within the detector itself. The detector exposes both methods independently:

  • TableDetector::detect_line_based() - for bordered tables
  • TableDetector::detect_borderless() - for borderless tables

Callers can orchestrate the fallback logic as needed.
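One way a caller might arrange that fallback is sketched below. It does not use the real pdftract types: `Candidate` is a hypothetical stand-in for GridCandidate, the two detectors are passed in as closures, and "covering a region" is approximated as any bounding-box overlap, which is an assumption rather than the project's definition.

```rust
/// Hypothetical stand-in for a detected grid. The real GridCandidate's
/// fields are not shown in this note.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    bbox: [f32; 4], // x0, y0, x1, y1 in points
    ruled: bool,    // true if it came from line-based detection
}

/// Axis-aligned overlap test. Boxes that only touch at an edge do not overlap.
fn overlaps(a: &[f32; 4], b: &[f32; 4]) -> bool {
    a[0] < b[2] && b[0] < a[2] && a[1] < b[3] && b[1] < a[3]
}

/// Run line-based detection first, then keep only those borderless
/// candidates whose region is not already covered by a line-based one.
fn detect_with_fallback(
    line_based: impl FnOnce() -> Vec<Candidate>,
    borderless: impl FnOnce() -> Vec<Candidate>,
) -> Vec<Candidate> {
    let mut out = line_based();
    let ruled_count = out.len();
    for cand in borderless() {
        let covered = out[..ruled_count]
            .iter()
            .any(|r| overlaps(&r.bbox, &cand.bbox));
        if !covered {
            out.push(cand);
        }
    }
    out
}
```

In real code the closures would wrap calls to `TableDetector::detect_line_based()` and `TableDetector::detect_borderless()` on the same page. Comparing only against the line-based results keeps two adjacent borderless tables from suppressing each other.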