pdftract/notes/pdftract-5mph.md
jedarden ba551b04d1 feat(pdftract-5mph): implement table block + table JSON output schema integration
- Fix table block bbox to use actual grid bbox instead of placeholder
- Add schema validation tests for tables array emission
- Verify two-page table detection integration

Files modified:
- crates/pdftract-core/src/extract.rs: Use grid bbox for table blocks
- crates/pdftract-core/src/schema/mod.rs: Add tests for tables array emission

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:49:01 -04:00

3.1 KiB

pdftract-5mph: Table block + table JSON output schema integration

Summary

Implemented the final output shape for tables with dual emission (Block + Table object) and two-page table detection.

Changes Made

1. Fixed Table Block Bbox (extract.rs)

  • Issue: Table blocks were using placeholder bbox [0.0, 0.0, 0.0, 0.0] instead of the actual grid bbox
  • Fix: Changed to use the grid's actual bbox from table.grid.bbox
  • File: crates/pdftract-core/src/extract.rs:1131-1153

2. Added Schema Validation Tests (schema/mod.rs)

  • Test 1: test_tables_array_emitted_on_page_output - Verifies tables array is always emitted (even when empty)
  • Test 2: test_table_block_emission_shape - Verifies table blocks have correct shape with table_index
  • File: crates/pdftract-core/src/schema/mod.rs:828-886

3. Added serde_json import

  • Added use serde_json::json; to support JSON macro in tests
  • File: crates/pdftract-core/src/schema/mod.rs:19-21

Implementation Verification

PASS: Block Emission

  • Block.kind = "table" ✓
  • Block.table_index points to tables array ✓
  • Block.bbox uses actual grid bbox ✓

PASS: Table Object (in page.tables array)

  • id: "table_N" format ✓
  • bbox: [x0, y0, x1, y1] ✓
  • rows: Vec ✓
  • header_rows: u32 ✓
  • detection_method: "line_based" | "borderless" ✓
  • continued: bool ✓
  • continued_from_prev: bool ✓
  • page_index: usize ✓

PASS: Two-Page Table Detection

  • detect_two_page_tables function in table/output.rs ✓
  • Applied via apply_two_page_table_detection in extract.rs ✓
  • Flags set when:
    • Table on page N ends within 50 pt of page bottom
    • Table on page N+1 starts within 50 pt of page top
    • Same column count and similar col_xs (RMSE < 5 pt)

PASS: Schema Validation

  • Schema JSON at docs/schema/v1.0/pdftract.schema.json already defines table structure ✓
  • Round-trip test test_v_1_0_table_schema_roundtrip passing ✓

PASS: Tables Array Emission

  • PageResultInternal has tables: Vec<TableWithGrid>
  • PageResult has tables: Vec<TableJson>
  • JSON output includes tables array even when empty ✓

Test Results

All tests passing:

  • 25 schema tests (including 2 new tests)
  • 112 table module tests
  • test_v_1_0_table_schema_roundtrip - PASS ✓
  • test_detect_two_page_tables_basic - PASS ✓
  • test_tables_array_emitted_on_page_output - PASS ✓
  • test_table_block_emission_shape - PASS ✓

Acceptance Criteria

  • All other 7.2.x sub-tasks closed (assumed from context)
  • Critical test: table spanning two pages - detected and flagged
  • Schema test: tables array emitted on every page output (even when empty)
  • Round-trip test: synthetic table -> JSON -> schema validate
  • Both Block.kind = "table" AND page.tables[i] present
  • docs/schema/v1.0/pdftract.schema.json already updated (no changes needed)

Notes

  • The schema JSON file was already correctly defined - no changes needed
  • The two-page table detection logic was already implemented in table/output.rs
  • The main fix was correcting the table block bbox from placeholder to actual grid bbox
  • Added tests to verify the schema stability requirements