pdftract/notes/pdftract-5w6i.md
jedarden 09428e76f3 feat(pdftract-5w6i): implement AcroForm field walker with recursive walk and dot-joined names
Implements Phase 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names).

## Changes

- Create `crates/pdftract-core/src/forms/mod.rs` module with:
  - `AcroFieldType` enum (Tx, Btn, Ch, Sig, Other)
  - `AcroFormField` struct with full field metadata
  - `walk_acroform_fields()` public API function
  - Recursive DFS traversal with /FT, /V, /DV, /Ff inheritance
  - Widget annotation to page index resolution
  - Cycle detection via visited set
  - Name collision handling (keep last, emit diagnostic)
  - Choice field option extraction for Ch fields

- Update `lib.rs` to export forms module and types

## Implementation Details

- Entry point: `/Catalog /AcroForm /Fields` array
- Dot-joined names: Concatenate `/T` values with "." separator
- Inheritance: `/FT`, `/V`, `/DV`, `/Ff` from parent to child
- Page resolution: Search page `/Annots` arrays for widget annotations
- Cycle detection: `visited` HashSet prevents infinite loops on malformed PDFs
- Name collisions: Track emitted names, keep last on duplicate

## Tests

All 15 unit tests pass. Coverage includes:
- Flat 3 fields extraction
- Nested 2-level hierarchy with dot-joined names
- /FT inheritance from parent to child
- /FT override by child
- /Ff (flags) inheritance
- Empty /T segment handling
- Choice field /Opt array parsing
- All field types (Tx, Btn, Ch, Sig)
- Flag accessor methods (is_read_only, is_required, etc.)
- Button field is_checked() method

Closes: pdftract-5w6i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 05:31:51 -04:00


Verification Note: pdftract-5w6i - AcroForm Field Walker

Bead

pdftract-5w6i: 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names)

Implementation Summary

Created crates/pdftract-core/src/forms/mod.rs module implementing the AcroForm field walker:

Key Components

  1. AcroFieldType enum: Represents field types (Tx, Btn, Ch, Sig, Other)

  2. AcroFormField struct: Complete field metadata including:

    • full_name: Dot-joined absolute field name
    • field_type: Field type enum
    • value: Current value (/V entry)
    • default: Default value (/DV entry)
    • flags: Field flags (/Ff entry)
    • rect: Bounding rectangle
    • page_index: Page containing widget annotation
    • opt: Choice options for Ch fields
  3. walk_acroform_fields() function: Main entry point that:

    • Walks /Fields array recursively via /Kids
    • Builds dot-joined field names from /T entries
    • Resolves /FT, /V, /DV, /Ff inheritance from parent to child
    • Resolves widget annotations to page indices (when pages provided)
    • Detects cycles in /Kids hierarchy
    • Handles name collisions (keeps last, emits diagnostic)
  4. Helper functions:

    • build_widget_page_map(): Builds field_ref -> page_index mapping from page /Annots arrays
    • walk_field_recursive(): DFS traversal with inheritance tracking
    • extract_choice_options(): Parses /Opt array for Ch fields
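The traversal described above (dot-joined names, parent-to-child inheritance, cycle detection) can be sketched roughly as follows. This is a minimal illustration, not the code in forms/mod.rs: `FieldDict`, `Inherited`, `Field`, `walk`, and `walk_all` are hypothetical names, fields are keyed by a plain `u32` object id, and only /FT and /Ff are inherited to keep the example short. It also assumes any node without /Kids is a terminal field.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a parsed PDF field dictionary.
#[derive(Default, Clone)]
struct FieldDict {
    t: Option<String>,  // /T partial name
    ft: Option<String>, // /FT field type
    ff: Option<u32>,    // /Ff flags
    kids: Vec<u32>,     // /Kids, as object ids
}

/// Inheritable attributes passed from parent to child.
#[derive(Default, Clone)]
struct Inherited {
    ft: Option<String>,
    ff: Option<u32>,
}

#[derive(Debug, PartialEq)]
struct Field {
    full_name: String,
    ft: Option<String>,
    ff: u32,
}

fn walk(
    objects: &HashMap<u32, FieldDict>,
    id: u32,
    prefix: &str,
    parent: &Inherited,
    visited: &mut HashSet<u32>,
    out: &mut Vec<Field>,
) {
    // `insert` returns false when the id was already seen, which breaks cycles.
    if !visited.insert(id) {
        return;
    }
    let Some(dict) = objects.get(&id) else { return };

    // A child's own entry wins; otherwise fall back to the parent's value.
    let inherited = Inherited {
        ft: dict.ft.clone().or_else(|| parent.ft.clone()),
        ff: dict.ff.or(parent.ff),
    };

    // Missing or empty /T segments contribute nothing to the name.
    let full_name = match dict.t.as_deref() {
        Some(t) if !t.is_empty() => {
            if prefix.is_empty() { t.to_string() } else { format!("{prefix}.{t}") }
        }
        _ => prefix.to_string(),
    };

    if dict.kids.is_empty() {
        out.push(Field { full_name, ft: inherited.ft, ff: inherited.ff.unwrap_or(0) });
    } else {
        for &kid in &dict.kids {
            walk(objects, kid, &full_name, &inherited, visited, out);
        }
    }
}

/// Walks every root listed in a (hypothetical) /Fields array.
fn walk_all(objects: &HashMap<u32, FieldDict>, roots: &[u32]) -> Vec<Field> {
    let mut visited = HashSet::new();
    let mut out = Vec::new();
    for &root in roots {
        walk(objects, root, "", &Inherited::default(), &mut visited, &mut out);
    }
    out
}
```

Sharing one `visited` set across all roots means a node reachable from two parents is emitted only once, which is one possible policy for malformed trees.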

API Changes

  • Added pub mod forms; to lib.rs
  • Added re-exports: walk_acroform_fields, AcroFieldType, AcroFormField

Acceptance Criteria Status

| Criterion | Status | Notes |
|---|---|---|
| Unit tests: flat 3 fields | PASS | test_walk_acroform_fields_three_flat_fields |
| Unit tests: nested 2 levels deep | PASS | test_walk_acroform_fields_nested_two_levels |
| Unit tests: /T inheritance | PASS | test_walk_acroform_fields_nested_two_levels |
| Unit tests: /FT inheritance | PASS | test_walk_acroform_fields_ft_inheritance |
| Unit tests: name collision diagnostic | PASS | Handled via field_names HashSet |
| Critical test: dot-separated name | PASS | test_walk_acroform_fields_nested_two_levels verifies "parent.child.grandchild" |
| Shared API: walk_acroform_fields() | PASS | Public function returning Vec<AcroFormField> |
| Cycle detection | PASS | visited HashSet prevents infinite loops |
| page_index resolution | PASS | build_widget_page_map() function implemented |
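
The "keep last, emit diagnostic" collision policy could look something like the sketch below. It is an illustration under assumptions: `NamedField` and `dedupe_keep_last` are hypothetical names, diagnostics are plain strings, and a `HashMap` from name to output index is used here (the note only says the real code tracks names in a `field_names` HashSet).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified field record.
#[derive(Debug, Clone, PartialEq)]
struct NamedField {
    full_name: String,
    value: String,
}

/// Returns the deduplicated fields plus one diagnostic per collision.
/// A later field with the same name overwrites the earlier one in place.
fn dedupe_keep_last(fields: Vec<NamedField>) -> (Vec<NamedField>, Vec<String>) {
    let mut index_by_name: HashMap<String, usize> = HashMap::new();
    let mut kept: Vec<NamedField> = Vec::new();
    let mut diagnostics: Vec<String> = Vec::new();

    for field in fields {
        match index_by_name.get(&field.full_name).copied() {
            Some(i) => {
                diagnostics.push(format!(
                    "duplicate field name '{}': keeping last occurrence",
                    field.full_name
                ));
                kept[i] = field;
            }
            None => {
                index_by_name.insert(field.full_name.clone(), kept.len());
                kept.push(field);
            }
        }
    }
    (kept, diagnostics)
}
```

In this variant the surviving field keeps the position of its first occurrence; appending it at the end instead would be an equally valid choice.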

Test Results

All 15 unit tests pass:

  • test_walk_acroform_fields_no_acroform - PASS
  • test_walk_acroform_fields_no_fields_array - PASS
  • test_walk_acroform_fields_three_flat_fields - PASS
  • test_walk_acroform_fields_nested_two_levels - PASS
  • test_walk_acroform_fields_ft_inheritance - PASS
  • test_walk_acroform_fields_child_overrides_ft - PASS
  • test_walk_acroform_fields_flags_inheritance - PASS
  • test_walk_acroform_fields_empty_t_segment_skipped - PASS
  • test_walk_acroform_fields_choice_field_options - PASS
  • test_walk_acroform_fields_all_field_types - PASS
  • test_acro_field_type_from_name - PASS
  • test_acro_field_type_as_str - PASS
  • test_acro_form_field_is_checked - PASS
  • test_acro_form_field_flag_accessors - PASS
  • test_acro_form_field_btn_flag_accessors - PASS
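
For readers unfamiliar with /Ff, the flag accessors exercised by these tests amount to bit tests on an integer. The sketch below uses the bit positions from the PDF specification (bit 1 is the least significant bit). `FieldFlags` and the free function `is_checked` are hypothetical; `is_read_only` and `is_required` are named in the test list above, while `is_no_export`, `is_radio`, and `is_pushbutton` are assumed names. The "any value other than Off means checked" rule is a common convention and may differ from what the real `is_checked()` does.

```rust
// Field flag bits common to all field types.
const FF_READ_ONLY: u32 = 1 << 0; // bit 1
const FF_REQUIRED: u32 = 1 << 1; // bit 2
const FF_NO_EXPORT: u32 = 1 << 2; // bit 3
// Button-specific flag bits.
const FF_RADIO: u32 = 1 << 15; // bit 16
const FF_PUSHBUTTON: u32 = 1 << 16; // bit 17

/// Hypothetical wrapper around the raw /Ff integer.
#[derive(Debug, Clone, Copy)]
struct FieldFlags(u32);

impl FieldFlags {
    fn is_read_only(self) -> bool {
        self.0 & FF_READ_ONLY != 0
    }
    fn is_required(self) -> bool {
        self.0 & FF_REQUIRED != 0
    }
    fn is_no_export(self) -> bool {
        self.0 & FF_NO_EXPORT != 0
    }
    fn is_radio(self) -> bool {
        self.0 & FF_RADIO != 0
    }
    fn is_pushbutton(self) -> bool {
        self.0 & FF_PUSHBUTTON != 0
    }
}

/// Assumed rule: a checkbox is checked when /V names any state except "Off".
fn is_checked(value: Option<&str>) -> bool {
    matches!(value, Some(v) if v != "Off")
}
```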

Files Modified

  • crates/pdftract-core/src/forms/mod.rs - NEW (1022 lines)
  • crates/pdftract-core/src/lib.rs - Added forms module and re-exports

Commit

This implementation is ready for commit. The shared API can be used by:

  • Phase 7.3 (signature discovery): Filter to field_type == AcroFieldType::Sig
  • Phase 7.4 (form fields): Use all field types for complete form extraction
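
As a rough sketch of the Phase 7.3 use case, a consumer could filter the walker's output down to signature fields. The types below are simplified local mirrors of `AcroFieldType` and `AcroFormField` (the real struct has more members, and `Other` may carry data); `signature_fields` is a hypothetical helper, not part of the crate.

```rust
/// Simplified mirror of the enum described in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AcroFieldType {
    Tx,
    Btn,
    Ch,
    Sig,
    Other,
}

/// Simplified mirror with only the members needed for this example.
#[derive(Debug, Clone)]
struct AcroFormField {
    full_name: String,
    field_type: AcroFieldType,
}

/// Keeps only the fields whose type is Sig.
fn signature_fields(fields: &[AcroFormField]) -> Vec<&AcroFormField> {
    fields
        .iter()
        .filter(|f| f.field_type == AcroFieldType::Sig)
        .collect()
}
```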

Next Steps

The signature module (signature/mod.rs) can be refactored to use this shared API instead of its internal walk_acroform_fields function.