- Fix table block bbox to use actual grid bbox instead of placeholder
- Add schema validation tests for tables array emission
- Verify two-page table detection integration

Files modified:
- crates/pdftract-core/src/extract.rs: Use grid bbox for table blocks
- crates/pdftract-core/src/schema/mod.rs: Add tests for tables array emission

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# pdftract-5mph: Table block + table JSON output schema integration
## Summary
Finalized the output shape for tables: each table is emitted twice, as a Block with kind `"table"` and as a Table object in `page.tables`, and two-page table detection is applied during extraction.
## Changes Made
### 1. Fixed Table Block Bbox (extract.rs)
- **Issue**: Table blocks were using placeholder bbox `[0.0, 0.0, 0.0, 0.0]` instead of the actual grid bbox
- **Fix**: Changed to use the grid's actual bbox from `table.grid.bbox`
- **File**: `crates/pdftract-core/src/extract.rs:1131-1153`
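
A minimal sketch of the fix, using simplified stand-in types. `Grid`, `TableWithGrid`, `Block` and the `table_block` helper are illustrative assumptions, not the actual definitions in `extract.rs`; only the `table.grid.bbox` access path comes from the notes above.

```rust
/// Bounding box as [x0, y0, x1, y1].
type BBox = [f64; 4];

/// Hypothetical stand-in for the detected grid.
struct Grid {
    bbox: BBox,
}

/// Hypothetical stand-in for a detected table that carries its grid.
struct TableWithGrid {
    grid: Grid,
}

/// Hypothetical stand-in for an output block.
#[derive(Debug, PartialEq)]
struct Block {
    kind: String,
    bbox: BBox,
    table_index: Option<usize>,
}

/// Build the block for the table at `table_index`, copying the grid bbox
/// instead of emitting a zeroed placeholder.
fn table_block(table: &TableWithGrid, table_index: usize) -> Block {
    Block {
        kind: "table".to_string(),
        bbox: table.grid.bbox,
        table_index: Some(table_index),
    }
}

fn main() {
    let table = TableWithGrid {
        grid: Grid { bbox: [72.0, 400.0, 540.0, 700.0] },
    };
    let block = table_block(&table, 0);
    // The block must carry the grid bbox, not the old placeholder.
    assert_ne!(block.bbox, [0.0, 0.0, 0.0, 0.0]);
    println!("{:?}", block);
}
```

A regression check along these lines (block bbox equals grid bbox) would catch a return of the placeholder.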
### 2. Added Schema Validation Tests (schema/mod.rs)
- **Test 1**: `test_tables_array_emitted_on_page_output` - Verifies the `tables` array is always emitted (even when empty)
- **Test 2**: `test_table_block_emission_shape` - Verifies table blocks have the correct shape, including `table_index`
- **File**: `crates/pdftract-core/src/schema/mod.rs:828-886`
### 3. Added serde_json import
- Added `use serde_json::json;` to support the `json!` macro in tests
- **File**: `crates/pdftract-core/src/schema/mod.rs:19-21`
## Implementation Verification
### PASS: Block Emission
- Block.kind = "table" ✓
- Block.table_index points to tables array ✓
- Block.bbox uses actual grid bbox ✓
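
The dual-emission invariant above can be checked mechanically. The sketch below is one possible check, not code from the crate: the `Block` type is a simplified stand-in, `inconsistent_table_blocks` is a hypothetical helper, and it assumes that non-table blocks carry no `table_index`.

```rust
/// Hypothetical, simplified block: only the fields relevant to the check.
struct Block {
    kind: String,
    table_index: Option<usize>,
}

/// Return the positions of blocks that break the invariant:
/// a "table" block must have a `table_index` inside the tables array,
/// and (by assumption) any other block must have none.
fn inconsistent_table_blocks(blocks: &[Block], table_count: usize) -> Vec<usize> {
    blocks
        .iter()
        .enumerate()
        .filter(|(_, b)| {
            let is_table = b.kind == "table";
            match b.table_index {
                Some(i) => !is_table || i >= table_count,
                None => is_table,
            }
        })
        .map(|(pos, _)| pos)
        .collect()
}

fn main() {
    let blocks = vec![
        Block { kind: "text".to_string(), table_index: None },
        Block { kind: "table".to_string(), table_index: Some(0) },
    ];
    // One table on the page, one table block pointing at it: consistent.
    assert!(inconsistent_table_blocks(&blocks, 1).is_empty());
    println!("all table blocks point into the tables array");
}
```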
### PASS: Table Object (in page.tables array)
- `id`: `"table_N"` format ✓
- `bbox`: `[x0, y0, x1, y1]` ✓
- `rows`: `Vec<RowJson>` ✓
- `header_rows`: `u32` ✓
- `detection_method`: `"line_based"` | `"borderless"` ✓
- `continued`: `bool` ✓
- `continued_from_prev`: `bool` ✓
- `page_index`: `usize` ✓
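
For orientation, the field list above roughly corresponds to a struct like the following. This is a sketch under assumptions, not the crate's definition: the serde derives are omitted to keep it dependency-free, `detection_method` is modelled as an enum although the real field may be a plain string, the `cells` field of `RowJson` is invented for illustration, and `table_id` is a hypothetical helper for the `"table_N"` format.

```rust
/// Detection strategy, rendered as the two strings listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DetectionMethod {
    LineBased,
    Borderless,
}

impl DetectionMethod {
    fn as_str(self) -> &'static str {
        match self {
            DetectionMethod::LineBased => "line_based",
            DetectionMethod::Borderless => "borderless",
        }
    }
}

/// Hypothetical row shape; the real `RowJson` fields are not shown in these notes.
#[derive(Debug, Clone, PartialEq)]
struct RowJson {
    cells: Vec<String>,
}

/// Table object as listed above (one entry of `page.tables`).
#[derive(Debug, Clone, PartialEq)]
struct TableJson {
    id: String,
    bbox: [f64; 4], // [x0, y0, x1, y1]
    rows: Vec<RowJson>,
    header_rows: u32,
    detection_method: DetectionMethod,
    continued: bool,
    continued_from_prev: bool,
    page_index: usize,
}

/// Hypothetical helper producing the "table_N" id format.
fn table_id(n: usize) -> String {
    format!("table_{}", n)
}

fn main() {
    let table = TableJson {
        id: table_id(0),
        bbox: [72.0, 400.0, 540.0, 700.0],
        rows: vec![RowJson { cells: vec!["Name".to_string(), "Qty".to_string()] }],
        header_rows: 1,
        detection_method: DetectionMethod::LineBased,
        continued: false,
        continued_from_prev: false,
        page_index: 0,
    };
    println!("{} via {}: {:?}", table.id, table.detection_method.as_str(), table);
}
```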
### PASS: Two-Page Table Detection
- `detect_two_page_tables` function in table/output.rs ✓
- Applied via `apply_two_page_table_detection` in extract.rs ✓
- Flags set when:
  - Table on page N ends within 50 pt of page bottom
  - Table on page N+1 starts within 50 pt of page top
  - Same column count and similar `col_xs` (RMSE < 5 pt)
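
The three conditions above can be sketched as follows. This is an illustration, not the code in `table/output.rs`: `PageTable`, `col_rmse`, `continues_across_pages` and `flag_pair` are hypothetical names. The sketch also assumes that all three conditions must hold together, that coordinates are PDF-style (origin at the bottom-left, y increasing upward), and that `continued` is set on the page-N table while `continued_from_prev` is set on the page-N+1 table.

```rust
const EDGE_TOLERANCE_PT: f64 = 50.0;
const COLUMN_RMSE_LIMIT_PT: f64 = 5.0;

/// Hypothetical, simplified view of a table on one page.
struct PageTable {
    bbox: [f64; 4],   // [x0, y0, x1, y1]
    col_xs: Vec<f64>, // column x positions
    continued: bool,
    continued_from_prev: bool,
}

/// Root-mean-square error between two column layouts.
/// Returns None when the column counts differ or there are no columns.
fn col_rmse(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let sum_sq: f64 = a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum();
    Some((sum_sq / a.len() as f64).sqrt())
}

/// True when `prev` (page N) plausibly continues as `next` (page N+1).
fn continues_across_pages(prev: &PageTable, next: &PageTable, next_page_height: f64) -> bool {
    // Assumption: y grows upward, so the page bottom is y = 0.
    let ends_near_bottom = prev.bbox[1] <= EDGE_TOLERANCE_PT;
    let starts_near_top = next_page_height - next.bbox[3] <= EDGE_TOLERANCE_PT;
    let cols_match = matches!(
        col_rmse(&prev.col_xs, &next.col_xs),
        Some(rmse) if rmse < COLUMN_RMSE_LIMIT_PT
    );
    ends_near_bottom && starts_near_top && cols_match
}

/// Set both flags when the pair qualifies.
fn flag_pair(prev: &mut PageTable, next: &mut PageTable, next_page_height: f64) {
    if continues_across_pages(prev, next, next_page_height) {
        prev.continued = true;
        next.continued_from_prev = true;
    }
}

fn main() {
    let mut prev = PageTable {
        bbox: [72.0, 40.0, 540.0, 300.0],
        col_xs: vec![72.0, 200.0, 540.0],
        continued: false,
        continued_from_prev: false,
    };
    let mut next = PageTable {
        bbox: [72.0, 500.0, 540.0, 760.0],
        col_xs: vec![73.0, 201.0, 539.0],
        continued: false,
        continued_from_prev: false,
    };
    flag_pair(&mut prev, &mut next, 792.0);
    println!("continued={} continued_from_prev={}", prev.continued, next.continued_from_prev);
}
```

If the extractor uses a top-left origin instead, the two edge tests swap roles; the column comparison is unaffected.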
### PASS: Schema Validation
- Schema JSON at docs/schema/v1.0/pdftract.schema.json already defines table structure ✓
- Round-trip test `test_v_1_0_table_schema_roundtrip` passing ✓
### PASS: Tables Array Emission
- PageResultInternal has `tables: Vec<TableWithGrid>` ✓
- PageResult has `tables: Vec<TableJson>` ✓
- JSON output includes tables array even when empty ✓
## Test Results
All tests passing:
- 25 schema tests (including 2 new tests)
- 112 table module tests
- `test_v_1_0_table_schema_roundtrip` - PASS ✓
- `test_detect_two_page_tables_basic` - PASS ✓
- `test_tables_array_emitted_on_page_output` - PASS ✓
- `test_table_block_emission_shape` - PASS ✓
## Acceptance Criteria
- [x] All other 7.2.x sub-tasks closed (assumed from context)
- [x] Critical test: table spanning two pages - detected and flagged
- [x] Schema test: tables array emitted on every page output (even when empty)
- [x] Round-trip test: synthetic table -> JSON -> schema validate
- [x] Both Block.kind = "table" AND page.tables[i] present
- [x] docs/schema/v1.0/pdftract.schema.json already updated (no changes needed)
## Notes
- The schema JSON file was already correctly defined - no changes needed
- The two-page table detection logic was already implemented in table/output.rs
- The main fix was correcting the table block bbox from placeholder to actual grid bbox
- Added tests to verify the schema stability requirements