Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.

Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with level, table, list, list_item, figure, caption, code, block_quote, toc, formula, reference, note, form_field_struct, inline, structural_container, artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths

Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H->level 1, H1..H6->level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph

Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line

Refs: Plan section 7.1 lines 2552-2553
pdftract-2ork: Element-type to block-kind mapping table
Summary
Implemented the StandardType -> BlockKind mapping that converts walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output. Includes heading-level extraction (H, H1..H6 -> heading with level) and a placeholder for Artifact suppression, which is completed in Phase 3.4.
Implementation
Files Modified/Created
- crates/pdftract-core/src/parser/struct_tree.rs (added 420+ lines)
- crates/pdftract-core/src/parser/mod.rs (updated exports)
Core Types Added
- BlockKind: Enum covering all output block kinds (paragraph, heading with level, table, list, list_item, figure, caption, code, block_quote, toc, formula, reference, note, form_field_struct, inline, structural_container, artifact, unknown)
- MappingResult: Result type for mapping operations containing block_kind, is_emitted flag, and optional diagnostic
- structure_type_to_block_kind(): Pure mapping function from StructureType to BlockKind
- map_element_to_block(): Primary mapping function taking StructElemNode and returning MappingResult
- is_artifact(): Placeholder for Artifact marked-content integration (Phase 3.4)
Key Features
- Complete type mapping:
  - Block-level elements (P, H1..H6, Table, L, LI, Lbl, LBody, Figure, Caption, Code, BlockQuote, TOC, TOCI, Formula, Reference, Note, Form) → emitted block kinds
  - Inline elements (Span, Quote, Link, Ruby, etc.) → Inline (not emitted as separate blocks)
  - Structural containers (Document, Part, Art, Sect, Div, NonStruct, Private, Index, TR, TH, TD, THead, TBody, TFoot) → StructuralContainer (descend without emitting)
  - Unknown types → Unknown (emits as paragraph with diagnostic)
- Heading level extraction:
  - H (no explicit level) → Heading{level: 1}
  - H1..H6 → Heading{level: 1..6}
  - No auto-increment for nested H elements (spec leaves this to producer)
- Artifact handling:
  - Placeholder is_artifact() function ready for Phase 3.4 marked-content integration
  - When integrated, will suppress both "Artifact" structure type and MCIDs inside Artifact marked-content sequences
- Diagnostic support:
  - Unknown types emit a diagnostic warning
  - MappingResult includes optional Diagnostic for downstream collection
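The mapping rules above can be sketched roughly as follows. This is a minimal illustration, not the code in struct_tree.rs: it covers only a subset of the block kinds, keys on a plain `&str` tag instead of the real StructureType, and uses a `String` in place of the real Diagnostic type. The names `kind_for_tag` and `map_tag` are hypothetical stand-ins for structure_type_to_block_kind() and map_element_to_block().

```rust
/// Illustrative subset of the block kinds listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Heading { level: u8 },
    Table,
    List,
    ListItem,
    Inline,
    StructuralContainer,
    Unknown,
}

impl BlockKind {
    /// Inline and container kinds are walked but never emitted as blocks.
    pub fn is_emitted(self) -> bool {
        !matches!(self, BlockKind::Inline | BlockKind::StructuralContainer)
    }
}

/// Hypothetical stand-in for the pure type mapping (assumption: string tags).
pub fn kind_for_tag(tag: &str) -> BlockKind {
    match tag {
        "P" => BlockKind::Paragraph,
        // A bare H carries no explicit level and defaults to 1.
        "H" => BlockKind::Heading { level: 1 },
        // The arm guarantees the second byte is an ASCII digit in 1..=6.
        "H1" | "H2" | "H3" | "H4" | "H5" | "H6" => BlockKind::Heading {
            level: tag.as_bytes()[1] - b'0',
        },
        "Table" => BlockKind::Table,
        "L" => BlockKind::List,
        "LI" => BlockKind::ListItem,
        "Span" | "Quote" | "Link" => BlockKind::Inline,
        "Document" | "Part" | "Sect" | "Div" | "TR" | "TH" | "TD" => {
            BlockKind::StructuralContainer
        }
        _ => BlockKind::Unknown,
    }
}

/// Bundles the kind with its emit flag and an optional diagnostic.
#[derive(Debug)]
pub struct MappingResult {
    pub block_kind: BlockKind,
    pub is_emitted: bool,
    /// Assumption: the real field holds a Diagnostic, not a String.
    pub diagnostic: Option<String>,
}

/// Hypothetical stand-in for the primary mapping API.
pub fn map_tag(tag: &str) -> MappingResult {
    let block_kind = kind_for_tag(tag);
    // Only the unknown-type fallback path produces a diagnostic.
    let diagnostic = (block_kind == BlockKind::Unknown)
        .then(|| format!("unknown structure type '{tag}', falling back to paragraph"));
    MappingResult {
        block_kind,
        is_emitted: block_kind.is_emitted(),
        diagnostic,
    }
}
```

In this sketch an unknown tag is still emitted (it falls back to a paragraph), while Span and TR are classified but not emitted, matching the rules in the list above.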
Verification
Unit Tests (32 new tests, all PASS)
test parser::struct_tree::tests::test_block_kind_paragraph ... ok
test parser::struct_tree::tests::test_block_kind_heading_h ... ok
test parser::struct_tree::tests::test_block_kind_heading_h1 ... ok
test parser::struct_tree::tests::test_block_kind_heading_h2 ... ok
test parser::struct_tree::tests::test_block_kind_heading_all_levels ... ok
test parser::struct_tree::tests::test_block_kind_table ... ok
test parser::struct_tree::tests::test_block_kind_list ... ok
test parser::struct_tree::tests::test_block_kind_list_item ... ok
test parser::struct_tree::tests::test_block_kind_list_label ... ok
test parser::struct_tree::tests::test_block_kind_list_body ... ok
test parser::struct_tree::tests::test_block_kind_figure ... ok
test parser::struct_tree::tests::test_block_kind_caption ... ok
test parser::struct_tree::tests::test_block_kind_code ... ok
test parser::struct_tree::tests::test_block_kind_block_quote ... ok
test parser::struct_tree::tests::test_block_kind_toc ... ok
test parser::struct_tree::tests::test_block_kind_formula ... ok
test parser::struct_tree::tests::test_block_kind_reference ... ok
test parser::struct_tree::tests::test_block_kind_note ... ok
test parser::struct_tree::tests::test_block_kind_form ... ok
test parser::struct_tree::tests::test_block_kind_inline_span ... ok
test parser::struct_tree::tests::test_block_kind_inline_quote ... ok
test parser::struct_tree::tests::test_block_kind_structural_container ... ok
test parser::struct_tree::tests::test_block_kind_unknown ... ok
test parser::struct_tree::tests::test_mapping_result_for_paragraph ... ok
test parser::struct_tree::tests::test_mapping_result_for_heading_with_level ... ok
test parser::struct_tree::tests::test_mapping_result_for_unknown_type ... ok
test parser::struct_tree::tests::test_mapping_result_for_inline_element ... ok
test parser::struct_tree::tests::test_mapping_result_for_structural_container ... ok
test parser::struct_tree::tests::test_list_nesting_mapping ... ok
test parser::struct_tree::tests::test_table_grouping_mapping ... ok
test parser::struct_tree::tests::test_span_passthrough ... ok
test parser::struct_tree::tests::test_heading_level_not_auto_incremented ... ok
Acceptance Criteria Status
- ✓ Every Standard structure type has a mapping decision (in-table, suppressed, or structural-container)
- ✓ Critical test: H1/H2 -> heading level 1/2
- ✓ Unit tests: list nesting (L, LI, Lbl, LBody all map correctly)
- ✓ Unit tests: table grouping (TR, TH, TD, THead, TBody, TFoot → StructuralContainer)
- ✓ Unit tests: span passthrough (Span, Quote → Inline, not emitted)
- ✓ Unknown-type fallback path emits a diagnostic line
Integration Notes
Public API
The following are now exported from pdftract_core::parser:
- BlockKind enum
- MappingResult struct
- structure_type_to_block_kind() function
- map_element_to_block() function
- is_artifact() function
Usage Example
    use pdftract_core::parser::{map_element_to_block, StructElemNode};

    // Map a structure element node to its block kind.
    // `node` (a StructElemNode) and `diagnostics` (the caller's collection)
    // are assumed to be in scope.
    let result = map_element_to_block(&node);
    if result.is_emitted {
        // Emit a block with kind = result.block_kind.as_str()
        if let Some(level) = result.block_kind.heading_level() {
            // Include level in heading block
        }
    }
    if let Some(diag) = result.diagnostic {
        diagnostics.push(diag);
    }
Future Work
- Phase 3.4 integration: Connect is_artifact() to marked-content tagger to suppress MCIDs inside Artifact marked-content sequences
- Phase 7.1 walker integration: Use map_element_to_block() in the depth-first walker to classify nodes for output
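As a rough picture of what the walker integration might look like, the sketch below classifies each node during a depth-first traversal and emits only the block-level ones. Everything here is hypothetical: `Node`, `Class`, `classify`, and `walk` are illustrative names, the tag set is a small subset, and the real walker (pdftract-1x2) operates on StructElemNode with RoleMap resolution.

```rust
/// Hypothetical classification: either emit a block of this kind or just descend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Emit(&'static str),
    Descend,
}

/// Hypothetical tree node; the real walker uses StructElemNode.
pub struct Node {
    pub tag: &'static str,
    pub children: Vec<Node>,
}

/// Small helper to build test trees.
pub fn node(tag: &'static str, children: Vec<Node>) -> Node {
    Node { tag, children }
}

/// Subset of the mapping: containers and inline elements descend,
/// anything unrecognized falls back to a paragraph.
pub fn classify(tag: &str) -> Class {
    match tag {
        "P" => Class::Emit("paragraph"),
        "H" | "H1" | "H2" | "H3" | "H4" | "H5" | "H6" => Class::Emit("heading"),
        "Table" => Class::Emit("table"),
        "L" => Class::Emit("list"),
        "LI" => Class::Emit("list_item"),
        "Document" | "Part" | "Sect" | "Div" | "TR" | "TD" | "Span" => Class::Descend,
        _ => Class::Emit("paragraph"),
    }
}

/// Pre-order depth-first walk: record the node's kind if it is emitted,
/// then always visit its children.
pub fn walk(node: &Node, out: &mut Vec<&'static str>) {
    if let Class::Emit(kind) = classify(node.tag) {
        out.push(kind);
    }
    for child in &node.children {
        walk(child, out);
    }
}
```

A tree of Document > [H1, Sect > [P > [Span], L > [LI]]] would yield heading, paragraph, list, list_item in that order, with Document, Sect, and Span contributing nothing of their own.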
Commit
- Commit: 3a2b9c8
- Message: feat(pdftract-2ork): implement element-type to block-kind mapping table
Retrospective
What worked
- Clean separation between BlockKind (internal enum) and output string representation via as_str()
- Comprehensive test coverage for all mapping paths (32 tests covering block-level, inline, structural container, and unknown types)
- MappingResult nicely bundles block kind with emit flag and diagnostic
What didn't
- Initial design didn't include an is_emitted() method on BlockKind, so the logic had to be duplicated in MappingResult. Added is_emitted() to BlockKind for a cleaner API.
Surprise
- PDF 1.7 has 40+ standard structure types, and the categorization (block-level vs inline vs structural container) isn't always obvious from the spec alone. Had to cross-reference multiple sources to get the mapping right.
Reusable pattern
- For enum-to-string mapping that needs to support fallback values, use an enum with a derived as_str() method that can return different values than the enum variant name (e.g., Unknown → "paragraph").
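A minimal sketch of this pattern, assuming a hand-written as_str() and a reduced set of variants (the real BlockKind has many more, and its exact signatures may differ):

```rust
/// Reduced, illustrative version of the enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Heading { level: u8 },
    Table,
    Unknown,
}

impl BlockKind {
    /// Output string for the block kind. Unknown deliberately reports
    /// "paragraph" so downstream consumers never see an "unknown" kind.
    pub fn as_str(self) -> &'static str {
        match self {
            BlockKind::Paragraph => "paragraph",
            BlockKind::Heading { .. } => "heading",
            BlockKind::Table => "table",
            BlockKind::Unknown => "paragraph",
        }
    }

    /// Heading level, if this kind is a heading.
    pub fn heading_level(self) -> Option<u8> {
        match self {
            BlockKind::Heading { level } => Some(level),
            _ => None,
        }
    }
}
```

Keeping Unknown as its own variant means the walker can still tell a fallback apart from a genuine paragraph (for example, to attach a diagnostic), while the serialized output stays within the known set of kinds.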
References
- Plan section 7.1 lines 2552-2553
- PDF 1.7 spec §14.8.4 (Standard Structure Types)
- pdftract-1x2 (StructTree depth-first walker with RoleMap resolution)