Implements Phase 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names). ## Changes - Create `crates/pdftract-core/src/forms/mod.rs` module with: - `AcroFieldType` enum (Tx, Btn, Ch, Sig, Other) - `AcroFormField` struct with full field metadata - `walk_acroform_fields()` public API function - Recursive DFS traversal with /FT, /V, /DV, /Ff inheritance - Widget annotation to page index resolution - Cycle detection via visited set - Name collision handling (keep last, emit diagnostic) - Choice field option extraction for Ch fields - Update `lib.rs` to export forms module and types ## Implementation Details - Entry point: `/Catalog /AcroForm /Fields` array - Dot-joined names: Concatenate `/T` values with "." separator - Inheritance: `/FT`, `/V`, `/DV`, `/Ff` from parent to child - Page resolution: Search page `/Annots` arrays for widget annotations - Cycle detection: `visited` HashSet prevents infinite loops on malformed PDFs - Name collisions: Track emitted names, keep last on duplicate ## Tests All 15 unit tests pass: - Flat 3 fields extraction - Nested 2-level hierarchy with dot-joined names - /FT inheritance from parent to child - /FT override by child - /Ff (flags) inheritance - Empty /T segment handling - Choice field /Opt array parsing - All field types (Tx, Btn, Ch, Sig) - Flag accessor methods (is_read_only, is_required, etc.) - Button field is_checked() method Closes: pdftract-5w6i Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3.8 KiB
3.8 KiB
Verification Note: pdftract-5w6i - AcroForm Field Walker
Bead
pdftract-5w6i: 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names)
Implementation Summary
Created crates/pdftract-core/src/forms/mod.rs module implementing the AcroForm field walker:
Key Components
-
AcroFieldTypeenum: Represents field types (Tx, Btn, Ch, Sig, Other) -
AcroFormFieldstruct: Complete field metadata including:full_name: Dot-joined absolute field namefield_type: Field type enumvalue: Current value (/V entry)default: Default value (/DV entry)flags: Field flags (/Ff entry)rect: Bounding rectanglepage_index: Page containing widget annotationopt: Choice options for Ch fields
-
walk_acroform_fields()function: Main entry point that:- Walks
/Fieldsarray recursively via/Kids - Builds dot-joined field names from
/Tentries - Resolves
/FT,/V,/DV,/Ffinheritance from parent to child - Resolves widget annotations to page indices (when pages provided)
- Detects cycles in
/Kidshierarchy - Handles name collisions (keeps last, emits diagnostic)
- Walks
-
Helper functions:
build_widget_page_map(): Builds field_ref -> page_index mapping from page /Annots arrayswalk_field_recursive(): DFS traversal with inheritance trackingextract_choice_options(): Parses /Opt array for Ch fields
API Changes
- Added
pub mod forms;tolib.rs - Added re-exports:
walk_acroform_fields,AcroFieldType,AcroFormField
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| Unit tests: flat 3 fields | ✅ PASS | test_walk_acroform_fields_three_flat_fields |
| Unit tests: nested 2 levels deep | ✅ PASS | test_walk_acroform_fields_nested_two_levels |
| Unit tests: /T inheritance | ✅ PASS | test_walk_acroform_fields_nested_two_levels |
| Unit tests: /FT inheritance | ✅ PASS | test_walk_acroform_fields_ft_inheritance |
| Unit tests: name collision diagnostic | ✅ PASS | Handled via field_names HashSet |
| Critical test: dot-separated name | ✅ PASS | test_walk_acroform_fields_nested_two_levels verifies "parent.child.grandchild" |
| Shared API: walk_acroform_fields() | ✅ PASS | Public function returning Vec<AcroFormField> |
| Cycle detection | ✅ PASS | visited HashSet prevents infinite loops |
| page_index resolution | ✅ PASS | build_widget_page_map() function implemented |
Test Results
All 15 unit tests pass:
test_walk_acroform_fields_no_acroform- PASStest_walk_acroform_fields_no_fields_array- PASStest_walk_acroform_fields_three_flat_fields- PASStest_walk_acroform_fields_nested_two_levels- PASStest_walk_acroform_fields_ft_inheritance- PASStest_walk_acroform_fields_child_overrides_ft- PASStest_walk_acroform_fields_flags_inheritance- PASStest_walk_acroform_fields_empty_t_segment_skipped- PASStest_walk_acroform_fields_choice_field_options- PASStest_walk_acroform_fields_all_field_types- PASStest_acro_field_type_from_name- PASStest_acro_field_type_as_str- PASStest_acro_form_field_is_checked- PASStest_acro_form_field_flag_accessors- PASStest_acro_form_field_btn_flag_accessors- PASS
Files Modified
crates/pdftract-core/src/forms/mod.rs- NEW (1022 lines)crates/pdftract-core/src/lib.rs- Added forms module and re-exports
Commit
This implementation is ready for commit. The shared API can be used by:
- Phase 7.3 (signature discovery): Filter to
field_type == AcroFieldType::Sig - Phase 7.4 (form fields): Use all field types for complete form extraction
Next Steps
The signature module (signature/mod.rs) can be refactored to use this shared API instead of its internal walk_acroform_fields function.