Add two comprehensive integration tests to validate the ParentTree resolver:
1. test_parent_tree_annotation_with_struct_parent:
- Creates a body paragraph StructElem
- Creates ParentTree with page array (MCID 0 -> body, MCID 1 -> orphan/null)
- Creates ParentTree with annotation entry (key 100 -> body)
- Verifies MCID resolution returns correct map and orphans
- Verifies annotation /StructParent resolution returns the body ref
- Verifies the referenced StructElem is in the tree
2. test_parent_tree_off_by_one_missing_entries:
- Creates ParentTree with sparse array (only 3 entries for potentially more MCIDs)
- Verifies non-null entries are correctly mapped
- Verifies null entries are recorded as orphans
- Documents that MCIDs beyond array length would be detected in Phase 7.1.4
Also export ParentTreeResolver and ParentTreeEntry from parser module
for use by the block builder in Phase 7.1.4.
All 67 struct_tree tests pass (18 ParentTree-specific tests).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix 8 tests that incorrectly passed ParentTree dict directly instead of
wrapping it in a StructTreeRoot-like structure with /ParentTree key
- Fix process_nums_array() to preserve null entries as ObjRef { object: 0 }
instead of filtering them out, ensuring orphan MCIDs are correctly reported
- Add verification note for ParentTree-based MCID-to-StructElem resolver
References: pdftract-57o4, plan 7.1 line 2550 (MCID-to-StructElem mapping)
Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting
walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.
Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with
level, table, list, list_item, figure, caption, code, block_quote, toc,
formula, reference, note, form_field_struct, inline, structural_container,
artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional
diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths
Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H->level 1, H1..H6->level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to
StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph
Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line
Refs: Plan section 7.1 lines 2552-2553
Implements the StructTree parser (Phase 7.1.1) with:
- Depth-first walker over /StructTreeRoot via /K array
- Support for all four /K entry types: StructElem, MCID, MCR, OBJR
- /RoleMap resolution with chain handling and cycle detection
- /Lang inheritance through the structure tree
- /ActualText inheritance (applies to all descendant content)
- Public API: StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid
Acceptance criteria:
- PASS: All four /K element kinds handled without crashing
- PASS: /RoleMap chains resolve to standard type or NonStruct
- PASS: /Lang and /ActualText inherit correctly down tree
- PASS: Unit tests for Word RoleMap (Heading1 -> H1)
- PASS: Unit tests for nested /Lang and /ActualText scope
- PASS: Public type StructElemNode documented in core crate
References:
- Plan section 7.1 StructTree Exploitation (lines 2547-2549, 2552-2553)
- PDF 1.7 spec 14.7.4 (Structure Tree) and 14.8.4 (Standard Structure Types)
Co-Authored-By: Claude Code <noreply@anthropic.com>