pdftract/notes/pdftract-2qum.md
jedarden a049924317 feat(pdftract-2qum): implement FormFieldValue enum and XFA-wins combiner
Implement Phase 7.4.4: AcroForm + XFA field combiner with XFA-wins
precedence. This enables pdftract to handle hybrid PDF forms that
contain both AcroForm and XFA representations.

- Add FormFieldValue enum with Text, Button, Choice, Signature variants
- Add ChoiceValue enum for single/multiple choice selections
- Implement combine() function that merges AcroForm and XFA fields
  with XFA values taking precedence on collision
- Implement XFA boolean string conversion ("true"/"false"/"1"/"0")
  to Button selected state
- Preserve AcroForm type hints when XFA provides the value
- Emit diagnostics for field name collisions
- Sort output alphabetically by field name

Closes: pdftract-2qum
2026-05-24 10:11:47 -04:00


# pdftract-2qum: AcroForm + XFA Combiner Implementation
**Bead:** pdftract-2qum
**Title:** 7.4.4: AcroForm + XFA combiner with XFA-wins precedence
**Status:** COMPLETE
**Date:** 2026-05-24
## Summary
Implemented Phase 7.4.4: AcroForm + XFA field combiner that merges form field values from both sources with XFA-wins precedence. This enables pdftract to handle hybrid PDF forms that contain both AcroForm and XFA representations.
## Implementation
### Files Created
- `crates/pdftract-core/src/forms/combiner.rs` (385 lines)
  - `FormFieldValue` enum with `Text`, `Button`, `Choice`, `Signature` variants
  - `ChoiceValue` enum for single/multiple choice selections
  - `combine()` function that merges AcroForm and XFA fields
  - `parse_xfa_boolean()` for XFA boolean string conversion
  - `merge_xfa_value_with_acro_type()` for type-preserving XFA value injection
  - `infer_xfa_field_type()` for XFA-only field type inference
### Files Modified
- `crates/pdftract-core/src/forms/mod.rs`
  - Added `pub mod combiner;` declaration
  - Re-exported `combine`, `ChoiceValue`, `FormFieldValue`
- `crates/pdftract-core/src/lib.rs`
  - Added re-exports: `combine`, `ChoiceValue`, `FormFieldValue`
## API Design
### `FormFieldValue` Enum
```rust
pub enum FormFieldValue {
    Text {
        value: Option<String>,
        default: Option<String>,
        multiline: bool,
        max_length: Option<u32>,
    },
    Button {
        selected: bool,
        default_selected: Option<bool>,
        is_radio: bool,
        is_pushbutton: bool,
    },
    Choice {
        value: ChoiceValue, // Single or Multiple
        default: Option<ChoiceValue>,
        options: Vec<(String, String)>,
        is_combo: bool,
        is_multi_select: bool,
    },
    Signature {
        signature_ref: Option<u32>,
    },
}
```
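The note does not show the `ChoiceValue` definition. The following is a minimal sketch of one plausible shape, based only on the "Single or Multiple" comment above and the comma-separated multi-select parsing mentioned in the unit tests. The variant names `Single` and `Multiple`, the helper `choice_from_xfa`, and the trimming of whitespace around items are all assumptions, not the actual code in `combiner.rs`.

```rust
/// Hypothetical shape for `ChoiceValue`; the real definition may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ChoiceValue {
    /// One selected option.
    Single(String),
    /// Zero or more selected options (multi-select list boxes).
    Multiple(Vec<String>),
}

/// Assumed helper (not named in the note): build a `ChoiceValue` from a raw
/// XFA string. Multi-select fields are split on commas; whitespace around
/// each item is trimmed and empty items are dropped.
pub fn choice_from_xfa(raw: &str, is_multi_select: bool) -> ChoiceValue {
    if is_multi_select {
        let items = raw
            .split(',')
            .map(str::trim)
            .filter(|item| !item.is_empty())
            .map(String::from)
            .collect();
        ChoiceValue::Multiple(items)
    } else {
        // Single-select fields keep the raw string, commas included.
        ChoiceValue::Single(raw.to_string())
    }
}
```

With this sketch, `choice_from_xfa("a, b,c", true)` yields three items, while the same string on a single-select field stays intact as one value.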
### `combine()` Function
```rust
pub fn combine(
    acro_fields: Vec<(String, FormFieldValue)>,
    xfa_fields: Vec<(String, String)>,
) -> (Vec<(String, FormFieldValue)>, Vec<Diagnostic>)
```
**Behavior:**
1. Insert AcroForm fields first
2. Insert XFA fields second (overwrites on collision, except for `Signature` fields, which keep the AcroForm value)
3. Track which fields came from both sources
4. Convert XFA boolean strings ("true"/"false"/"1"/"0") to `Button::selected` when the AcroForm type is `Button`
5. Preserve AcroForm type hints when XFA provides the value
6. Empty XFA values overwrite non-empty AcroForm values (XFA is canonical)
7. Emit diagnostic for each collision
8. Sort output alphabetically by full_name
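The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `combiner.rs`: `FieldValue` is a cut-down stand-in for `FormFieldValue`, the `Diagnostic` struct and its fields are hypothetical, and a `BTreeMap` is used here as one convenient way to get deterministic ordering (it sorts keys byte-wise, which matches alphabetical order for ASCII names).

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for `FormFieldValue` (two variants only).
#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Text { value: Option<String> },
    Button { selected: bool },
}

/// Hypothetical diagnostic shape; the real `Diagnostic` type is not shown in the note.
#[derive(Debug, Clone, PartialEq)]
pub struct Diagnostic {
    pub field: String,
    pub message: String,
}

pub fn combine_sketch(
    acro_fields: Vec<(String, FieldValue)>,
    xfa_fields: Vec<(String, String)>,
) -> (Vec<(String, FieldValue)>, Vec<Diagnostic>) {
    // Step 1: AcroForm fields go in first. BTreeMap keeps keys sorted (step 8).
    let mut merged: BTreeMap<String, FieldValue> = acro_fields.into_iter().collect();
    let mut diagnostics = Vec::new();

    // Step 2: XFA fields go in second and win on collision.
    for (name, xfa_value) in xfa_fields {
        let new_value = match merged.get(&name) {
            Some(existing) => {
                // Steps 3 and 7: record the collision.
                diagnostics.push(Diagnostic {
                    field: name.clone(),
                    message: format!(
                        "XFA value {:?} overrides AcroForm value {:?}",
                        xfa_value, existing
                    ),
                });
                // Steps 4 to 6: keep the AcroForm type, take the XFA value,
                // even when the XFA value is empty.
                match existing {
                    FieldValue::Button { .. } => FieldValue::Button {
                        selected: matches!(xfa_value.as_str(), "true" | "1"),
                    },
                    FieldValue::Text { .. } => FieldValue::Text { value: Some(xfa_value) },
                }
            }
            // XFA-only fields default to Text.
            None => FieldValue::Text { value: Some(xfa_value) },
        };
        merged.insert(name, new_value);
    }

    (merged.into_iter().collect(), diagnostics)
}
```

Whether the real implementation stores an empty XFA value as `Some("")` or as `None` is not stated in this note; the sketch uses `Some(String::new())`.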
## Acceptance Criteria Status
### Critical Test: Hybrid XFA+AcroForm - XFA values preferred
**PASS** - `test_combine_both_overlapping` verifies that XFA values overwrite AcroForm values on collision.
### Unit Tests
| Test | Status | Description |
|------|--------|-------------|
| `test_combine_no_overlap` | PASS | 3 AcroForm + 2 XFA, no overlap |
| `test_combine_both_overlapping` | PASS | 3 AcroForm + 2 XFA, overlapping on 2 fields |
| `test_xfa_boolean_to_checkbox` | PASS | XFA boolean string converts to Button selected state |
| `test_empty_xfa_wins_over_nonempty_acro` | PASS | Empty XFA value overwrites non-empty AcroForm value |
| `test_parse_xfa_boolean` | PASS | Boolean string parsing (true/false/1/0/yes/no) |
| `test_sort_order_deterministic` | PASS | Alphabetical sorting verified |
| `test_choice_value_single` | PASS | Single choice value merge |
| `test_choice_value_multi_select` | PASS | Multi-select comma-separated parsing |
### Diagnostics
**PASS** - Collisions emit `Diagnostic` with field name, AcroForm value, and XFA value.
### Public API
**PASS** - `combine(acro_fields, xfa_fields) -> (Vec<(String, FormFieldValue)>, Vec<Diagnostic>)` is public and re-exported from `forms/mod.rs` and `lib.rs`.
### Sort Order
**PASS** - Output is sorted alphabetically by full_name for deterministic ordering.
## Test Results
```bash
$ cargo test --lib forms
test result: ok. 26 passed; 0 failed; 0 ignored; 0 measured; 1504 filtered out
```
All 26 forms tests pass, including:
- 18 existing tests from `forms/mod.rs` (AcroForm field walking)
- 8 new tests from `forms/combiner.rs` (XFA combiner)
## Design Decisions
### 1. Type Preservation on Collision
When XFA overwrites an AcroForm value, we preserve the AcroForm's type metadata (multiline, max_length, is_radio, etc.) and inject only the XFA value string. This ensures that type information from the AcroForm dictionary is not lost when XFA provides the current value.
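A minimal sketch of what type-preserving injection could look like, assuming the `Text` and `Button` variants shown in the API section. The function name `merge_preserving_type` is hypothetical and chosen so it is not mistaken for the real `merge_xfa_value_with_acro_type()`, whose signature is not given in this note.

```rust
/// Two of the variants from the API section, reproduced so the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub enum FormFieldValue {
    Text {
        value: Option<String>,
        default: Option<String>,
        multiline: bool,
        max_length: Option<u32>,
    },
    Button {
        selected: bool,
        default_selected: Option<bool>,
        is_radio: bool,
        is_pushbutton: bool,
    },
}

/// Hypothetical helper: replace only the current value with the XFA string,
/// carrying every other AcroForm attribute through unchanged.
pub fn merge_preserving_type(acro: FormFieldValue, xfa_value: &str) -> FormFieldValue {
    match acro {
        FormFieldValue::Text { default, multiline, max_length, .. } => FormFieldValue::Text {
            value: Some(xfa_value.to_string()),
            default,
            multiline,
            max_length,
        },
        FormFieldValue::Button { default_selected, is_radio, is_pushbutton, .. } => {
            FormFieldValue::Button {
                // Assumption: "true" and "1" mean selected, anything else means unselected.
                selected: matches!(xfa_value, "true" | "1"),
                default_selected,
                is_radio,
                is_pushbutton,
            }
        }
    }
}
```

Destructuring with `..` drops only the old value, so adding a metadata field to a variant later would produce a compile error here instead of a silently lost attribute.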
### 2. Boolean String Conversion
XFA represents boolean values as strings ("true", "false", "1", "0"). We convert these to `Button::selected` when the AcroForm type is `Button`. For XFA-only fields, we default to `Text` to avoid misclassifying text fields that happen to contain boolean-like strings.
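For illustration, a parser for the four string forms named above might look like the sketch below. The name `parse_xfa_boolean_sketch`, the `Option<bool>` return type, and the whitespace trimming are assumptions; the real `parse_xfa_boolean()` may accept additional spellings or return a different type.

```rust
/// Hypothetical sketch: map the four XFA boolean spellings to a bool.
/// Returns `None` for anything else so the caller can treat it as plain text.
pub fn parse_xfa_boolean_sketch(raw: &str) -> Option<bool> {
    // Assumption: surrounding whitespace is not significant.
    match raw.trim() {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}
```

Returning `None` rather than `false` for unrecognized input keeps "not a boolean" distinct from "unselected", which matters when deciding whether a value can be applied to a `Button` at all.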
### 3. Empty XFA Values Win
Per PDF 1.7 spec and Adobe Reader convention, XFA is the canonical source for form values. Even when XFA provides an empty string, it overwrites a non-empty AcroForm value. This ensures that cleared fields in XFA are represented as empty in the output.
### 4. Signature Fields Cannot Be Overridden
Signature fields (/FT /Sig) contain cryptographic signature data that cannot be represented as a string. When XFA provides a value for a signature field, we keep the AcroForm value and emit a diagnostic explaining that signatures cannot be overridden by XFA.
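The signature rule can be sketched as a guard in the collision path. Everything here is illustrative: the enum is reduced to two variants, `apply_xfa_value` is a hypothetical name, and the diagnostic is modeled as an `Option<String>` because the real `Diagnostic` type is not shown in this note.

```rust
/// Reduced stand-in for `FormFieldValue` with only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
pub enum FormFieldValue {
    Text { value: Option<String> },
    Signature { signature_ref: Option<u32> },
}

/// Hypothetical collision handler: returns the value to keep plus an optional
/// diagnostic message when the XFA value was rejected.
pub fn apply_xfa_value(
    name: &str,
    acro: FormFieldValue,
    xfa_value: &str,
) -> (FormFieldValue, Option<String>) {
    match acro {
        // Signatures carry cryptographic data, so the AcroForm value is kept as-is.
        FormFieldValue::Signature { .. } => {
            let note = format!(
                "field '{}': signature fields cannot be overridden by XFA; keeping AcroForm value",
                name
            );
            (acro, Some(note))
        }
        // Every other type follows the normal XFA-wins rule.
        FormFieldValue::Text { .. } => (
            FormFieldValue::Text { value: Some(xfa_value.to_string()) },
            None,
        ),
    }
}
```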
## Integration Points
This combiner is designed to be used by:
- **Phase 7.4.5** (pdftract-5qca): form_fields JSON output + schema integration
- **Phase 7.3** (signature discovery): filters AcroForm fields to /FT /Sig type
The `combine()` function accepts:
- AcroForm fields: `Vec<(String, FormFieldValue)>` (from Phase 7.4.2, not yet implemented)
- XFA fields: `Vec<(String, String)>` (from Phase 7.4.3, already implemented as `extract_xfa_fields`)
**Note:** Phase 7.4.2 (type-specific AcroForm value extraction) is not yet implemented. Currently, `walk_acroform_fields` returns `Vec<AcroFormField>` with raw `PdfObject` values. A future bead will implement the conversion from `AcroFormField` to `FormFieldValue`.
## References
- Plan: lines 2622-2645 (Phase 7.4 AcroForm and XFA Field Extraction)
- Plan: line 2637 ("If both AcroForm and XFA are present, prefer XFA values")
- Plan: line 2645 ("Hybrid XFA+AcroForm: XFA values preferred")
- Bead pdftract-2qum description
## Commits
- `forms: implement FormFieldValue enum and combine() function for XFA-wins precedence`
## WARN Items
None. All acceptance criteria pass.