Commit graph

5 commits

Author SHA1 Message Date
jedarden
84b4448648 feat(pdftract-5qca): implement form_fields JSON output + schema integration
Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from
combiner into document-level /form_fields JSON output with tagged union schema.

- Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema
- Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none)
- Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction
- Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins precedence
- Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion
- Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field
- Add form_fields_to_markdown() to markdown module for Form Fields footer table

Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?,
required, read_only, multiline?, max_length?, options?, multi_select?, selected?,
state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice",
"signature". Value field varies by type (string|boolean|array|uint|null).
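
A minimal sketch of how a tagged "type" plus a type-dependent "value" could be
rendered, using only the standard library. FieldValue, type_tag(), value_json(),
json_string() and field_json() are hypothetical names for illustration, not the
crate's actual FormFieldJson types; mapping a signature to null is an assumption.

```rust
// Hypothetical stand-in for the schema types named in this commit.
#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Text(Option<String>),
    Button(bool),
    Choice(Vec<String>),
    Signature,
}

// Escape a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl FieldValue {
    // The tag emitted in the "type" field.
    fn type_tag(&self) -> &'static str {
        match self {
            FieldValue::Text(_) => "text",
            FieldValue::Button(_) => "button",
            FieldValue::Choice(_) => "choice",
            FieldValue::Signature => "signature",
        }
    }

    // The "value" field; its JSON type depends on the variant.
    fn value_json(&self) -> String {
        match self {
            FieldValue::Text(Some(s)) => json_string(s),
            FieldValue::Text(None) => "null".to_string(),
            FieldValue::Button(b) => b.to_string(),
            FieldValue::Choice(items) => {
                let parts: Vec<String> = items.iter().map(|s| json_string(s)).collect();
                format!("[{}]", parts.join(","))
            }
            // Assumption: a signature carries no scalar value here.
            FieldValue::Signature => "null".to_string(),
        }
    }
}

// Render one entry with only the name, type and value keys.
fn field_json(name: &str, value: &FieldValue) -> String {
    format!(
        "{{\"name\":{},\"type\":{},\"value\":{}}}",
        json_string(name),
        json_string(value.type_tag()),
        value.value_json()
    )
}
```

The sketch omits the optional keys (default, page_index, rect, and so on) to
keep the tag/value relationship visible.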

Closes: pdftract-5qca

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:36:03 -04:00
jedarden
77f7c6a1ed feat(pdftract-66pgk): implement AcroForm Btn value extraction
Add button field value extraction distinguishing pushbutton, checkbox,
and radio button types via /Ff flags. Extracts selected state and
appearance state name (/Yes, /Off, custom).

- New module: forms/value_button.rs with ButtonKind enum and ButtonValue
- Updated FormFieldValue::Button variant with kind and state_name fields
- 15 unit tests covering all button types and edge cases
- Fixed CCITTFaxDecoder test syntax blocking test execution
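
The /Ff classification described above can be sketched roughly as follows. The
bit positions are those given for button fields in the PDF specification (Radio
is bit 16, Pushbutton is bit 17, counted from 1). The ButtonKind variants and
the classify_button() and is_selected() helpers are assumptions for
illustration; the code in forms/value_button.rs may be shaped differently.

```rust
// Button field flags from the PDF specification (bit positions are 1-based).
const FF_RADIO: u32 = 1 << 15; // bit 16
const FF_PUSHBUTTON: u32 = 1 << 16; // bit 17

// Hypothetical variants; the commit only names the enum itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ButtonKind {
    Pushbutton,
    Radio,
    Checkbox,
}

// Pushbutton takes priority, then Radio; anything else is a checkbox.
fn classify_button(ff: u32) -> ButtonKind {
    if (ff & FF_PUSHBUTTON) != 0 {
        ButtonKind::Pushbutton
    } else if (ff & FF_RADIO) != 0 {
        ButtonKind::Radio
    } else {
        ButtonKind::Checkbox
    }
}

// Assumption: a checkbox or radio button counts as selected when its
// appearance state name is present and is anything other than "Off".
// Pushbuttons retain no value, so they are never selected.
fn is_selected(kind: ButtonKind, state_name: Option<&str>) -> bool {
    match kind {
        ButtonKind::Pushbutton => false,
        ButtonKind::Radio | ButtonKind::Checkbox => {
            matches!(state_name, Some(s) if s != "Off")
        }
    }
}
```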

Closes: pdftract-66pgk

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 11:33:23 -04:00
jedarden
a049924317 feat(pdftract-2qum): implement FormFieldValue enum and XFA-wins combiner
Implement Phase 7.4.4: AcroForm + XFA field combiner with XFA-wins
precedence. This enables pdftract to handle hybrid PDF forms that
contain both AcroForm and XFA representations.

- Add FormFieldValue enum with Text, Button, Choice, Signature variants
- Add ChoiceValue enum for single/multiple choice selections
- Implement combine() function that merges AcroForm and XFA fields
  with XFA values taking precedence on collision
- Implement XFA boolean string conversion ("true"/"false"/"1"/"0")
  to Button selected state
- Preserve AcroForm type hints when XFA provides the value
- Emit diagnostics for field name collisions
- Sort output alphabetically by field name
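
A rough sketch of the XFA-wins merge described above, using only the standard
library. The Value enum, xfa_bool() and this combine() signature are simplified
assumptions, not the crate's actual FormFieldValue API. Sorting relies on
BTreeMap key order, which is byte-wise rather than locale-aware.

```rust
use std::collections::BTreeMap;

// Simplified, hypothetical stand-in for FormFieldValue.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Text(String),
    Button { selected: bool },
}

// Map the XFA boolean strings listed above to a selected state.
fn xfa_bool(s: &str) -> Option<bool> {
    match s {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}

// Merge AcroForm and XFA fields; on a name collision the XFA value wins,
// but the AcroForm type hint decides how the raw XFA string is interpreted.
fn combine(
    acro: Vec<(String, Value)>,
    xfa: Vec<(String, String)>,
) -> (Vec<(String, Value)>, Vec<String>) {
    let mut merged: BTreeMap<String, Value> = acro.into_iter().collect();
    let mut diagnostics = Vec::new();
    for (name, raw) in xfa {
        let value = match merged.get(&name) {
            Some(existing) => {
                diagnostics.push(format!("field name collision: {}; using XFA value", name));
                match (existing, xfa_bool(&raw)) {
                    // AcroForm says Button and XFA holds a boolean string.
                    (Value::Button { .. }, Some(selected)) => Value::Button { selected },
                    _ => Value::Text(raw),
                }
            }
            // XFA-only field: no type hint available, keep it as text.
            None => Value::Text(raw),
        };
        merged.insert(name, value);
    }
    // BTreeMap iteration yields entries sorted by field name.
    (merged.into_iter().collect(), diagnostics)
}
```

For example, an AcroForm checkbox "b.check" with selected = false combined with
an XFA value "1" for the same name yields a Button with selected = true and one
collision diagnostic.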

Closes: pdftract-2qum
2026-05-24 10:11:47 -04:00
jedarden
4f1a3e84b7 feat(pdftract-28e9): implement XFA stream parser for Phase 7.4.3
Create forms/xfa.rs module with extract_xfa_fields(), which:
- Handles single-stream and array-stream XFA layouts
- Uses quick-xml for XML parsing with namespace support
- Extracts field values from XFA data model (xfa:datasets/xfa:data)
- Supports FlateDecode-compressed streams via Phase 1 decoder
- Returns Vec<XfaField> with dot-separated field names

Acceptance criteria:
- Critical test: XFA-only form field values extracted
- Unit tests: single stream, array stream, malformed XML, empty fields
- Public API: extract_xfa_fields(resolver, acroform_dict, source, opts)
- quick-xml feature flags: enabled via existing 'ocr' feature

All tests pass. Closes: pdftract-28e9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 07:20:15 -04:00
jedarden
09428e76f3 feat(pdftract-5w6i): implement AcroForm field walker with recursive walk and dot-joined names
Implements Phase 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names).

## Changes

- Create `crates/pdftract-core/src/forms/mod.rs` module with:
  - `AcroFieldType` enum (Tx, Btn, Ch, Sig, Other)
  - `AcroFormField` struct with full field metadata
  - `walk_acroform_fields()` public API function
  - Recursive DFS traversal with /FT, /V, /DV, /Ff inheritance
  - Widget annotation to page index resolution
  - Cycle detection via visited set
  - Name collision handling (keep last, emit diagnostic)
  - Choice field option extraction for Ch fields

- Update `lib.rs` to export forms module and types

## Implementation Details

- Entry point: `/Catalog /AcroForm /Fields` array
- Dot-joined names: Concatenate `/T` values with "." separator
- Inheritance: `/FT`, `/V`, `/DV`, `/Ff` from parent to child
- Page resolution: Search page `/Annots` arrays for widget annotations
- Cycle detection: `visited` HashSet prevents infinite loops on malformed PDFs
- Name collisions: Track emitted names, keep last on duplicate
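
The traversal above can be sketched as a small recursive walk over an in-memory
tree. Node, the u32 object ids and this walk() signature are hypothetical
stand-ins for the real PDF object resolver and `walk_acroform_fields()`; only
`/FT` inheritance is shown, and skipping an empty `/T` segment is an assumption
about how that case is handled.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for a field dictionary: /T, /FT and /Kids.
#[derive(Default, Clone)]
struct Node {
    t: Option<String>,
    ft: Option<String>,
    kids: Vec<u32>,
}

// Depth-first walk that emits (dot-joined name, effective /FT) for each leaf.
fn walk(
    id: u32,
    nodes: &HashMap<u32, Node>,
    prefix: &str,
    inherited_ft: Option<&str>,
    visited: &mut HashSet<u32>,
    out: &mut Vec<(String, Option<String>)>,
) {
    // Cycle detection: insert() returns false if the id was already seen.
    if !visited.insert(id) {
        return;
    }
    let node = match nodes.get(&id) {
        Some(n) => n,
        None => return,
    };
    // Join /T segments with "."; a missing or empty segment is skipped.
    let segment = node.t.as_deref().filter(|t| !t.is_empty());
    let name = match (segment, prefix.is_empty()) {
        (Some(t), true) => t.to_string(),
        (Some(t), false) => format!("{}.{}", prefix, t),
        (None, _) => prefix.to_string(),
    };
    // A child's own /FT overrides the value inherited from its parent.
    let ft = node.ft.as_deref().or(inherited_ft);
    if node.kids.is_empty() {
        out.push((name, ft.map(str::to_string)));
    } else {
        for &kid in &node.kids {
            walk(kid, nodes, &name, ft, visited, out);
        }
    }
}
```

Sharing a single `visited` set across the whole walk means a node reachable
through two parents is emitted only once, which also bounds the work on
malformed files.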

## Tests

All 15 unit tests pass, covering:
- Flat extraction of 3 fields
- Nested 2-level hierarchy with dot-joined names
- /FT inheritance from parent to child
- /FT override by child
- /Ff (flags) inheritance
- Empty /T segment handling
- Choice field /Opt array parsing
- All field types (Tx, Btn, Ch, Sig)
- Flag accessor methods (is_read_only, is_required, etc.)
- Button field is_checked() method

Closes: pdftract-5w6i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 05:31:51 -04:00