pdftract/notes/pdftract-5w6i.md
jedarden 09428e76f3 feat(pdftract-5w6i): implement AcroForm field walker with recursive walk and dot-joined names
Implements Phase 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names).

## Changes

- Create `crates/pdftract-core/src/forms/mod.rs` module with:
  - `AcroFieldType` enum (Tx, Btn, Ch, Sig, Other)
  - `AcroFormField` struct with full field metadata
  - `walk_acroform_fields()` public API function
  - Recursive DFS traversal with /FT, /V, /DV, /Ff inheritance
  - Widget annotation to page index resolution
  - Cycle detection via visited set
  - Name collision handling (keep last, emit diagnostic)
  - Choice field option extraction for Ch fields

- Update `lib.rs` to export forms module and types

## Implementation Details

- Entry point: `/Catalog /AcroForm /Fields` array
- Dot-joined names: Concatenate `/T` values with "." separator
- Inheritance: `/FT`, `/V`, `/DV`, `/Ff` from parent to child
- Page resolution: Search page `/Annots` arrays for widget annotations
- Cycle detection: `visited` HashSet prevents infinite loops on malformed PDFs
- Name collisions: Track emitted names, keep last on duplicate
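
The walk described above can be sketched roughly as follows. This is a minimal illustration and not the code in `forms/mod.rs`: `RawField`, `Found`, `walk`, and `walk_fields` are hypothetical stand-ins, object references are reduced to plain `u32` numbers, only `/FT` inheritance is shown, and a node without `/Kids` is assumed to be a terminal field.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a parsed field dictionary (not the crate's type).
struct RawField {
    t: Option<String>,  // partial name (/T)
    ft: Option<String>, // field type (/FT), inheritable
    kids: Vec<u32>,     // object numbers listed in /Kids
}

#[derive(Debug)]
struct Found {
    full_name: String,
    field_type: Option<String>,
}

fn walk(
    objects: &HashMap<u32, RawField>,
    id: u32,
    prefix: &str,
    inherited_ft: Option<&str>,
    visited: &mut HashSet<u32>,
    out: &mut Vec<Found>,
) {
    // `insert` returns false when the id was already seen: a cycle or a
    // repeated reference, so stop instead of recursing forever.
    if !visited.insert(id) {
        return;
    }
    let Some(field) = objects.get(&id) else { return };

    // A missing or empty /T contributes no segment to the dotted name.
    let name = match field.t.as_deref() {
        Some(t) if !t.is_empty() => {
            if prefix.is_empty() {
                t.to_string()
            } else {
                format!("{prefix}.{t}")
            }
        }
        _ => prefix.to_string(),
    };

    // The child's own /FT wins; otherwise fall back to the inherited one.
    let ft = field.ft.as_deref().or(inherited_ft);

    if field.kids.is_empty() {
        out.push(Found { full_name: name, field_type: ft.map(str::to_string) });
    } else {
        for &kid in &field.kids {
            walk(objects, kid, &name, ft, visited, out);
        }
    }
}

/// Walks every root listed in a (hypothetical) /Fields array.
fn walk_fields(objects: &HashMap<u32, RawField>, roots: &[u32]) -> Vec<Found> {
    let mut visited = HashSet::new();
    let mut out = Vec::new();
    for &root in roots {
        walk(objects, root, "", None, &mut visited, &mut out);
    }
    out
}
```

In this sketch `/V`, `/DV`, and `/Ff` would be threaded through the recursion the same way as `inherited_ft`.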

## Tests

All 15 unit tests pass, covering:
- Flat 3 fields extraction
- Nested 2-level hierarchy with dot-joined names
- /FT inheritance from parent to child
- /FT override by child
- /Ff (flags) inheritance
- Empty /T segment handling
- Choice field /Opt array parsing
- All field types (Tx, Btn, Ch, Sig)
- Flag accessor methods (is_read_only, is_required, etc.)
- Button field is_checked() method

Closes: pdftract-5w6i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 05:31:51 -04:00


# Verification Note: pdftract-5w6i - AcroForm Field Walker
## Bead
pdftract-5w6i: 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names)
## Implementation Summary
Created `crates/pdftract-core/src/forms/mod.rs` module implementing the AcroForm field walker:
### Key Components
1. **`AcroFieldType` enum**: Represents field types (Tx, Btn, Ch, Sig, Other)
2. **`AcroFormField` struct**: Complete field metadata including:
   - `full_name`: Dot-joined absolute field name
   - `field_type`: Field type enum
   - `value`: Current value (/V entry)
   - `default`: Default value (/DV entry)
   - `flags`: Field flags (/Ff entry)
   - `rect`: Bounding rectangle
   - `page_index`: Page containing widget annotation
   - `opt`: Choice options for Ch fields
3. **`walk_acroform_fields()` function**: Main entry point that:
   - Walks `/Fields` array recursively via `/Kids`
   - Builds dot-joined field names from `/T` entries
   - Resolves `/FT`, `/V`, `/DV`, `/Ff` inheritance from parent to child
   - Resolves widget annotations to page indices (when pages provided)
   - Detects cycles in `/Kids` hierarchy
   - Handles name collisions (keeps last, emits diagnostic)
4. **Helper functions**:
   - `build_widget_page_map()`: Builds field_ref -> page_index mapping from page /Annots arrays
   - `walk_field_recursive()`: DFS traversal with inheritance tracking
   - `extract_choice_options()`: Parses /Opt array for Ch fields
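
As an illustration of the field_ref -> page_index mapping, the sketch below builds such a map from per-page annotation lists. It is an assumption-laden stand-in, not the real `build_widget_page_map()`: the `ObjRef` alias, the `widget_page_map` name, the input shape (one `Vec` of references per page), and the "first page wins" rule are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical object reference: (object number, generation).
type ObjRef = (u32, u16);

/// `pages[i]` holds the references found in page i's /Annots array.
fn widget_page_map(pages: &[Vec<ObjRef>]) -> HashMap<ObjRef, usize> {
    let mut map = HashMap::new();
    for (page_index, annots) in pages.iter().enumerate() {
        for &annot in annots {
            // Assumption: if a reference shows up on several pages,
            // keep the first page it was seen on.
            map.entry(annot).or_insert(page_index);
        }
    }
    map
}
```

A walker could then look up each widget's reference in the map and leave `page_index` unset when no page lists it.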
### API Changes
- Added `pub mod forms;` to `lib.rs`
- Added re-exports: `walk_acroform_fields`, `AcroFieldType`, `AcroFormField`
## Acceptance Criteria Status
| Criterion | Status | Notes |
|-----------|--------|-------|
| Unit tests: flat 3 fields | ✅ PASS | `test_walk_acroform_fields_three_flat_fields` |
| Unit tests: nested 2 levels deep | ✅ PASS | `test_walk_acroform_fields_nested_two_levels` |
| Unit tests: /T inheritance | ✅ PASS | `test_walk_acroform_fields_nested_two_levels` |
| Unit tests: /FT inheritance | ✅ PASS | `test_walk_acroform_fields_ft_inheritance` |
| Unit tests: name collision diagnostic | ✅ PASS | Handled via `field_names` HashSet |
| Critical test: dot-separated name | ✅ PASS | `test_walk_acroform_fields_nested_two_levels` verifies "parent.child.grandchild" |
| Shared API: walk_acroform_fields() | ✅ PASS | Public function returning `Vec<AcroFormField>` |
| Cycle detection | ✅ PASS | `visited` HashSet prevents infinite loops |
| page_index resolution | ✅ PASS | `build_widget_page_map()` function implemented |
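
The "keep last, emit diagnostic" collision rule from the table can be sketched as below. This is only an illustration under assumptions: fields are reduced to hypothetical `(name, value)` string pairs, diagnostics are plain strings, and an index map is used here where the note mentions a `field_names` HashSet.

```rust
use std::collections::HashMap;

/// Returns the de-duplicated fields (last occurrence wins, original
/// position kept) together with one diagnostic per collision.
fn dedupe_keep_last(fields: Vec<(String, String)>) -> (Vec<(String, String)>, Vec<String>) {
    let mut index_by_name: HashMap<String, usize> = HashMap::new();
    let mut kept: Vec<(String, String)> = Vec::new();
    let mut diagnostics: Vec<String> = Vec::new();

    for (name, value) in fields {
        // Copy the index out so the map is not borrowed while we mutate.
        let existing = index_by_name.get(&name).copied();
        match existing {
            Some(i) => {
                diagnostics.push(format!("duplicate field name '{name}': keeping last"));
                kept[i] = (name, value); // overwrite the earlier entry
            }
            None => {
                index_by_name.insert(name.clone(), kept.len());
                kept.push((name, value));
            }
        }
    }
    (kept, diagnostics)
}
```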
## Test Results
All 15 unit tests pass:
- `test_walk_acroform_fields_no_acroform` - PASS
- `test_walk_acroform_fields_no_fields_array` - PASS
- `test_walk_acroform_fields_three_flat_fields` - PASS
- `test_walk_acroform_fields_nested_two_levels` - PASS
- `test_walk_acroform_fields_ft_inheritance` - PASS
- `test_walk_acroform_fields_child_overrides_ft` - PASS
- `test_walk_acroform_fields_flags_inheritance` - PASS
- `test_walk_acroform_fields_empty_t_segment_skipped` - PASS
- `test_walk_acroform_fields_choice_field_options` - PASS
- `test_walk_acroform_fields_all_field_types` - PASS
- `test_acro_field_type_from_name` - PASS
- `test_acro_field_type_as_str` - PASS
- `test_acro_form_field_is_checked` - PASS
- `test_acro_form_field_flag_accessors` - PASS
- `test_acro_form_field_btn_flag_accessors` - PASS
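
For context on what the flag accessor tests exercise, here is a rough sketch of bit tests over `/Ff`. The bit positions follow the PDF specification's field flag tables (ReadOnly is bit 1, Required bit 2, Radio bit 16, Pushbutton bit 17, counted from 1). The `FieldSketch` type, the constant names, and the `is_checked` rule (any value other than `Off`) are assumptions, not the crate's definitions.

```rust
// Bit positions in the PDF spec are 1-based, hence the shifts by (n - 1).
const FF_READ_ONLY: u32 = 1 << 0; // bit 1
const FF_REQUIRED: u32 = 1 << 1; // bit 2
const FF_RADIO: u32 = 1 << 15; // bit 16, button fields only
const FF_PUSHBUTTON: u32 = 1 << 16; // bit 17, button fields only

/// Hypothetical reduced field: only the parts the accessors need.
struct FieldSketch {
    flags: u32,
    value: Option<String>,
}

impl FieldSketch {
    fn is_read_only(&self) -> bool {
        (self.flags & FF_READ_ONLY) != 0
    }
    fn is_required(&self) -> bool {
        (self.flags & FF_REQUIRED) != 0
    }
    fn is_radio(&self) -> bool {
        (self.flags & FF_RADIO) != 0
    }
    fn is_pushbutton(&self) -> bool {
        (self.flags & FF_PUSHBUTTON) != 0
    }
    /// Assumption: a button counts as checked when /V is present
    /// and is anything other than the `Off` state name.
    fn is_checked(&self) -> bool {
        matches!(self.value.as_deref(), Some(v) if v != "Off")
    }
}
```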
## Files Modified
- `crates/pdftract-core/src/forms/mod.rs` - NEW (1022 lines)
- `crates/pdftract-core/src/lib.rs` - Added forms module and re-exports
## Commit
This implementation is ready for commit. The shared API can be used by:
- Phase 7.3 (signature discovery): Filter to `field_type == AcroFieldType::Sig`
- Phase 7.4 (form fields): Use all field types for complete form extraction
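
A consumer such as signature discovery might filter the walker's output along these lines. The types below are local stand-ins that reuse the names from this note so the snippet compiles on its own; the reduced `AcroFormField`, the `from_name` signature, and the `signature_fields` helper are assumptions, not the crate's actual API.

```rust
/// Local stand-in mirroring the variants listed in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AcroFieldType {
    Tx,
    Btn,
    Ch,
    Sig,
    Other,
}

impl AcroFieldType {
    /// Assumed mapping from the /FT name to a variant.
    fn from_name(name: &str) -> Self {
        match name {
            "Tx" => Self::Tx,
            "Btn" => Self::Btn,
            "Ch" => Self::Ch,
            "Sig" => Self::Sig,
            _ => Self::Other,
        }
    }
}

/// Reduced stand-in for the real struct: only two of its fields.
struct AcroFormField {
    full_name: String,
    field_type: AcroFieldType,
}

/// Keeps only signature fields, as Phase 7.3 would need.
fn signature_fields(fields: &[AcroFormField]) -> Vec<&AcroFormField> {
    fields
        .iter()
        .filter(|f| f.field_type == AcroFieldType::Sig)
        .collect()
}
```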
## Next Steps
The signature module (`signature/mod.rs`) can be refactored to use this shared API instead of its internal `walk_acroform_fields` function.