- Fix table block bbox to use actual grid bbox instead of placeholder
- Add schema validation tests for tables array emission
- Verify two-page table detection integration

Files modified:
- crates/pdftract-core/src/extract.rs: Use grid bbox for table blocks
- crates/pdftract-core/src/schema/mod.rs: Add tests for tables array emission

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-5mph: Table block + table JSON output schema integration
Summary
Implemented the final output shape for tables with dual emission (Block + Table object) and two-page table detection.
Changes Made
1. Fixed Table Block Bbox (extract.rs)
- Issue: Table blocks were using the placeholder bbox `[0.0, 0.0, 0.0, 0.0]` instead of the actual grid bbox
- Fix: Changed to use the grid's actual bbox from `table.grid.bbox`
- File: `crates/pdftract-core/src/extract.rs:1131-1153`
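The shape of the fix can be illustrated with a minimal sketch. The `Grid` and `Block` types and the `table_block` helper below are hypothetical stand-ins, not the actual definitions in extract.rs; only the idea of copying the bbox from the grid instead of emitting zeros is taken from the notes above.

```rust
/// Hypothetical stand-in for the detected grid; the real type is assumed
/// to carry more than a bbox.
#[derive(Debug, Clone, PartialEq)]
struct Grid {
    /// [x0, y0, x1, y1] in page points.
    bbox: [f64; 4],
}

/// Hypothetical stand-in for an output block.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    kind: String,
    bbox: [f64; 4],
    table_index: Option<usize>,
}

/// Build the block for a detected table, taking the bbox from the grid
/// rather than emitting a zeroed placeholder.
fn table_block(grid: &Grid, table_index: usize) -> Block {
    Block {
        kind: "table".to_string(),
        bbox: grid.bbox, // previously a [0.0, 0.0, 0.0, 0.0] placeholder
        table_index: Some(table_index),
    }
}
```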
2. Added Schema Validation Tests (schema/mod.rs)
- Test 1: `test_tables_array_emitted_on_page_output` - Verifies tables array is always emitted (even when empty)
- Test 2: `test_table_block_emission_shape` - Verifies table blocks have correct shape with table_index
- File: `crates/pdftract-core/src/schema/mod.rs:828-886`
3. Added serde_json import
- Added `use serde_json::json;` to support the `json!` macro in tests
- File: `crates/pdftract-core/src/schema/mod.rs:19-21`
Implementation Verification
PASS: Block Emission
- Block.kind = "table" ✓
- Block.table_index points to tables array ✓
- Block.bbox uses actual grid bbox ✓
PASS: Table Object (in page.tables array)
- id: "table_N" format ✓
- bbox: [x0, y0, x1, y1] ✓
- rows: Vec ✓
- header_rows: u32 ✓
- detection_method: "line_based" | "borderless" ✓
- continued: bool ✓
- continued_from_prev: bool ✓
- page_index: usize ✓
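As a rough sketch of how the fields above fit together, the types below mirror the checklist using only the standard library. `TableObject`, `DetectionMethod`, `table_id`, and `resolve_table` are hypothetical names; the cell representation (`Vec<Vec<String>>`) and the reading of N in "table_N" as the table's position in the array are assumptions, not details confirmed by these notes.

```rust
/// Detection strategies named in the checklist above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DetectionMethod {
    LineBased,
    Borderless,
}

impl DetectionMethod {
    /// String form used in the JSON output.
    fn as_str(self) -> &'static str {
        match self {
            DetectionMethod::LineBased => "line_based",
            DetectionMethod::Borderless => "borderless",
        }
    }
}

/// Hypothetical mirror of one entry in `page.tables`.
#[derive(Debug, Clone, PartialEq)]
struct TableObject {
    id: String,
    bbox: [f64; 4],
    rows: Vec<Vec<String>>, // assumed cell representation
    header_rows: u32,
    detection_method: DetectionMethod,
    continued: bool,
    continued_from_prev: bool,
    page_index: usize,
}

/// Format an id as "table_N" (assumes N is the index in the tables array).
fn table_id(n: usize) -> String {
    format!("table_{}", n)
}

/// Follow a block's `table_index` into the page's tables array.
/// Returns `None` for non-table blocks or a dangling index.
fn resolve_table(tables: &[TableObject], table_index: Option<usize>) -> Option<&TableObject> {
    table_index.and_then(|i| tables.get(i))
}
```

A consumer that sees a block with kind "table" could call `resolve_table` with that block's `table_index` to reach the full table object, which is the point of emitting both.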
PASS: Two-Page Table Detection
- `detect_two_page_tables` function in table/output.rs ✓
- Applied via `apply_two_page_table_detection` in extract.rs ✓
- Flags set when:
  - Table on page N ends within 50 pt of page bottom
  - Table on page N+1 starts within 50 pt of page top
  - Same column count and similar col_xs (RMSE < 5 pt)
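The three conditions above can be sketched as a small predicate. This is not the code in table/output.rs: `TableGeom`, `col_rmse`, `is_continuation`, and `flag_two_page_table` are hypothetical names, the coordinate system is assumed to have its origin at the top-left, and the mapping of `continued` to the page-N table and `continued_from_prev` to the page-N+1 table is an assumption based on the field names.

```rust
/// Thresholds quoted in the notes above, in PDF points.
const EDGE_MARGIN_PT: f64 = 50.0;
const COL_RMSE_MAX_PT: f64 = 5.0;

/// Hypothetical summary of a table's geometry. Assumes a top-left origin,
/// so the page top is y = 0 and the page bottom is y = page_height.
#[derive(Debug, Clone)]
struct TableGeom {
    y_top: f64,
    y_bottom: f64,
    col_xs: Vec<f64>,
    continued: bool,
    continued_from_prev: bool,
}

/// Root-mean-square error between two column position lists.
/// Returns `None` when the column counts differ or there are no columns.
fn col_rmse(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let sum_sq: f64 = a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum();
    Some((sum_sq / a.len() as f64).sqrt())
}

/// True when `prev` (page N) and `next` (page N+1) satisfy all three conditions.
fn is_continuation(prev: &TableGeom, next: &TableGeom, page_height: f64) -> bool {
    let ends_near_bottom = page_height - prev.y_bottom <= EDGE_MARGIN_PT;
    let starts_near_top = next.y_top <= EDGE_MARGIN_PT;
    let cols_match =
        matches!(col_rmse(&prev.col_xs, &next.col_xs), Some(e) if e < COL_RMSE_MAX_PT);
    ends_near_bottom && starts_near_top && cols_match
}

/// Set the continuation flags on both tables when they form a two-page table.
fn flag_two_page_table(prev: &mut TableGeom, next: &mut TableGeom, page_height: f64) {
    if is_continuation(prev, next, page_height) {
        prev.continued = true;
        next.continued_from_prev = true;
    }
}
```

Returning `None` from `col_rmse` on a column-count mismatch folds the "same column count" check into the RMSE comparison, so a table with a different number of columns can never match.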
PASS: Schema Validation
- Schema JSON at docs/schema/v1.0/pdftract.schema.json already defines table structure ✓
- Round-trip test `test_v_1_0_table_schema_roundtrip` passing ✓
PASS: Tables Array Emission
- PageResultInternal has `tables: Vec<TableWithGrid>` ✓
- PageResult has `tables: Vec<TableJson>` ✓
- JSON output includes tables array even when empty ✓
Test Results
All tests passing:
- 25 schema tests (including 2 new tests)
- 112 table module tests
- `test_v_1_0_table_schema_roundtrip` - PASS ✓
- `test_detect_two_page_tables_basic` - PASS ✓
- `test_tables_array_emitted_on_page_output` - PASS ✓
- `test_table_block_emission_shape` - PASS ✓
Acceptance Criteria
- All other 7.2.x sub-tasks closed (assumed from context)
- Critical test: table spanning two pages - detected and flagged
- Schema test: tables array emitted on every page output (even when empty)
- Round-trip test: synthetic table -> JSON -> schema validate
- Both Block.kind = "table" AND page.tables[i] present
- docs/schema/v1.0/pdftract.schema.json already updated (no changes needed)
Notes
- The schema JSON file was already correctly defined - no changes needed
- The two-page table detection logic was already implemented in table/output.rs
- The main fix was correcting the table block bbox from placeholder to actual grid bbox
- Added tests to verify the schema stability requirements