pdftract/docs/research/form-fields-and-annotations.md
jedarden a7673c906f Add 12 research documents covering full PDF extraction surface
2026-05-16 15:05:42 -04:00

# Form Fields and Annotations: AcroForm, XFA, and Annotation Text Extraction
## 1. AcroForm Overview
The document catalog (`/Type /Catalog`) may contain an `/AcroForm` dictionary. This dictionary is the root of all interactive form machinery in the document. Its primary entries are:
- **`Fields`** — an array of indirect references to field dictionaries that are direct children of the field hierarchy (the root fields). Fields not referenced here are reachable only through their parent's `Kids` array.
- **`DA`** — a document-level default appearance string, used when a field lacks its own `DA`.
- **`DR`** — a resource dictionary shared across all fields; typically contains the `/Font` sub-dictionary mapping font names used in appearance strings.
- **`NeedAppearances`** — a boolean flag; when `true`, viewers must regenerate appearance streams before rendering. A library performing text extraction should not depend on pre-generated appearances being present.
- **`XFA`** — present only in XFA documents; see section 5.
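The field hierarchy is a tree. Non-terminal (intermediate) nodes group related fields and act as inheritance sources: their `Kids` are themselves fields. Terminal fields are the leaves: their `Kids`, if any, are only widget annotations, and they are the nodes that resolve to a field type and hold a value. The `/FT` (field type) entry alone does not distinguish the two, because it is inheritable: a terminal field may omit `/FT` and take it from an ancestor, and a non-terminal node may carry `/FT` on behalf of its descendants. A practical test is whether a node's children carry a `/T` (partial field name) entry; children without `/T` are widgets.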
**Inherited attributes.** A terminal field that lacks `DA`, `FT`, `Ff` (field flags), or `DV` inherits those values by walking up the `Parent` chain until a value is found or the chain is exhausted. When extracting field data, the implementation must perform this walk for each attribute independently.
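The walk described above can be sketched as follows. This is a minimal illustration, not pdftract's implementation: `FieldNode` and `inherited` are hypothetical names, and the flat slice of nodes stands in for whatever object store resolves `Parent` references in a real parser.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory field node; a real parser would resolve
/// indirect references instead of using slice indices.
struct FieldNode {
    /// Index of the `Parent` node, if any.
    parent: Option<usize>,
    /// Entries present directly on this node (e.g. "FT", "DA", "Ff", "DV").
    attrs: HashMap<&'static str, String>,
}

/// Looks up one inheritable attribute by walking the `Parent` chain
/// until a value is found or the chain is exhausted.
fn inherited<'a>(nodes: &'a [FieldNode], start: usize, key: &str) -> Option<&'a str> {
    let mut current = Some(start);
    let mut hops = 0;
    while let Some(idx) = current {
        let node = nodes.get(idx)?;
        if let Some(value) = node.attrs.get(key) {
            return Some(value.as_str());
        }
        // Guard against `Parent` cycles in malformed files.
        hops += 1;
        if hops > nodes.len() {
            return None;
        }
        current = node.parent;
    }
    None
}
```

Each attribute is looked up with its own call, so a field can take `DA` from itself and `FT` from a grandparent, which is the independence the paragraph above asks for.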
## 2. Field Types
### 2.1 Text Fields (`/FT /Tx`)
A text field stores a user-entered string. Key entries:
- **`V`** — the current value, a string object (PDFDocEncoding or UTF-16BE with BOM).
- **`DV`** — the default value, same encoding rules as `V`.
- **`MaxLen`** — maximum number of characters permitted.
- **`DA`** — default appearance string: a content-stream fragment such as `/Helv 12 Tf 0 g`. The font name is resolved against the `DR` dictionary in `/AcroForm`.
- **`Ff` bit 13** (`Multiline`) — when set, the field accepts multiple lines of text. Extraction should preserve embedded newlines in `V`.
- **`Ff` bit 14** (`Password`) — the value should be treated as sensitive; some implementations may redact it.
### 2.2 Button Fields (`/FT /Btn`)
Three subtypes are distinguished by `Ff`:
- **Pushbutton** (`Ff` bit 17 set) — carries no persistent value; its purpose is to trigger actions. No text value to extract.
- **Checkbox** — `V` holds the current appearance state name (e.g., `/Yes` or `/Off`); `DV` holds the default. The `AS` entry in the widget annotation mirrors the checked state and is the authoritative indicator when rendering; extraction should prefer `V` on the field, cross-referencing `AS` to confirm.
- **Radio group** (`Ff` bit 16 set) — a field whose `Kids` are the individual radio button widgets. Each kid widget has an `AS` entry that names its "on" appearance state when that button is selected and is `/Off` otherwise. The parent's `V` holds the state name (export value) of the currently selected option. To find the selected button, match `V` against the `AS` values of all kids.
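Distinguishing the three subtypes comes down to testing bits of `Ff`. The sketch below assumes the flags have already been resolved (including inheritance) into a `u32`; `has_flag`, `ButtonKind`, and `classify_button` are illustrative names. Bit 17 (pushbutton) comes from the list above; bit 16 (radio) is taken from the PDF specification's button flag table.

```rust
/// Tests a flag using the 1-based bit positions of the PDF specification.
fn has_flag(ff: u32, bit: u32) -> bool {
    (ff & (1u32 << (bit - 1))) != 0
}

#[derive(Debug, PartialEq)]
enum ButtonKind {
    Pushbutton,
    Radio,
    Checkbox,
}

/// Classifies a `/FT /Btn` field from its (already inherited) `Ff` value.
fn classify_button(ff: u32) -> ButtonKind {
    if has_flag(ff, 17) {
        ButtonKind::Pushbutton
    } else if has_flag(ff, 16) {
        ButtonKind::Radio
    } else {
        // Neither bit set: the field is a checkbox.
        ButtonKind::Checkbox
    }
}
```

The same `has_flag` helper applies to the text and choice flags in this section (bits 13, 14, 18, and 22).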
### 2.3 Choice Fields (`/FT /Ch`)
- **`Opt`** — an array of options. Each element is either a string (the export value equals the display value) or a two-element array `[export_value, display_string]`.
- **`V`** — a string (single selection) or array of strings (multi-select when `Ff` bit 22 is set). Contains the export value of the selected option(s).
- **`TI`** — top index; the first visible option in a scrollable listbox.
- **`Ff` bit 18** — when set, the field is a combo box rather than a listbox.
To extract the display text for a selection, locate the entry in `Opt` whose export value matches `V`.
### 2.4 Signature Fields (`/FT /Sig`)
`V` is a signature dictionary, not a string. Text extraction is out of scope for signature fields; record the field name and type, but emit no text value.
## 3. Field Value Extraction
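**String decoding.** A string value in a field is encoded in either PDFDocEncoding or UTF-16BE. The BOM `\xFE\xFF` at the start of the byte sequence signals UTF-16BE; otherwise, assume PDFDocEncoding. PDFDocEncoding matches Latin-1 for most bytes, so a small lookup table is enough for the code points that differ, principally `0x18`–`0x1F` and `0x80`–`0xA0`; take the exact entries from the encoding table in the PDF specification rather than hard-coding a count.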
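A minimal sketch of this decoding step follows. The function name `decode_text_string` is hypothetical, and the two override tables were transcribed by hand from the PDFDocEncoding table, so they should be checked against the specification before use. Undefined bytes are mapped to U+FFFD here, which is a design choice rather than a requirement.

```rust
/// PDFDocEncoding overrides for bytes 0x18..=0x1F (assumed; verify against the spec).
const LOW: [char; 8] = [
    '\u{02D8}', '\u{02C7}', '\u{02C6}', '\u{02D9}',
    '\u{02DD}', '\u{02DB}', '\u{02DA}', '\u{02DC}',
];

/// PDFDocEncoding overrides for bytes 0x80..=0x9E (assumed; verify against the spec).
const HIGH: [char; 31] = [
    '\u{2022}', '\u{2020}', '\u{2021}', '\u{2026}', '\u{2014}', '\u{2013}', '\u{0192}', '\u{2044}',
    '\u{2039}', '\u{203A}', '\u{2212}', '\u{2030}', '\u{201E}', '\u{201C}', '\u{201D}', '\u{2018}',
    '\u{2019}', '\u{201A}', '\u{2122}', '\u{FB01}', '\u{FB02}', '\u{0141}', '\u{0152}', '\u{0160}',
    '\u{0178}', '\u{017D}', '\u{0131}', '\u{0142}', '\u{0153}', '\u{0161}', '\u{017E}',
];

/// Decodes a PDF text string: UTF-16BE if it starts with the BOM, else PDFDocEncoding.
fn decode_text_string(bytes: &[u8]) -> String {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        // Skip the BOM and combine big-endian byte pairs into UTF-16 code units.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        return String::from_utf16_lossy(&units);
    }
    bytes
        .iter()
        .map(|&b| match b {
            0x18..=0x1F => LOW[(b - 0x18) as usize],
            0x80..=0x9E => HIGH[(b - 0x80) as usize],
            0xA0 => '\u{20AC}',
            0x7F | 0x9F | 0xAD => '\u{FFFD}', // undefined in PDFDocEncoding
            _ => b as char,                   // identical to Latin-1
        })
        .collect()
}
```

A dangling odd byte at the end of a UTF-16BE string is silently dropped by `chunks_exact`; a production decoder might prefer to record a warning.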
**Text fields.** Read `V` directly. If `V` is null or absent, fall back to `DV`. If both are absent, emit an empty string.
**Checkboxes.** Read `V` from the field; the value is a name object. Any value other than `/Off` (the conventional unchecked state) indicates a checked state. Confirm against `AS` in the widget annotation.
**Radio buttons.** Read `V` from the parent field. Walk `Kids`; the selected kid is the one whose `AS` matches `V`. Emit the matching export value or display string from `Opt` if present.
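The checkbox and radio rules above reduce to a few comparisons on name values. In this sketch, names are represented as plain strings without the leading slash, and `checkbox_state` and `selected_kid` are hypothetical helpers.

```rust
/// Returns `(checked, consistent)` for a checkbox.
/// `checked` is derived from the field's `V`; `consistent` reports whether
/// the widget's `AS` (when present) agrees with it.
fn checkbox_state(v: Option<&str>, as_state: Option<&str>) -> (bool, bool) {
    let checked = matches!(v, Some(name) if name != "Off");
    let as_checked = matches!(as_state, Some(name) if name != "Off");
    let consistent = as_state.is_none() || checked == as_checked;
    (checked, consistent)
}

/// Finds the index of the radio kid whose `AS` equals the parent's `V`.
/// `kid_states` holds the `AS` value of each kid, in `Kids` order.
fn selected_kid(v: &str, kid_states: &[&str]) -> Option<usize> {
    if v == "Off" {
        return None; // nothing selected
    }
    kid_states.iter().position(|state| *state == v)
}
```

When `consistent` is `false`, an extractor could still report the `V`-based state and attach a warning, which keeps `V` as the preferred source.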
**Choice fields.** Read `V` (string or array). For each selected export value, find the corresponding display string in `Opt`. If `Opt` contains plain strings, export value equals display string.
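The `Opt` lookup can be modelled with a small enum covering both element shapes. The types and function names below (`OptEntry`, `display_for`, `display_all`) are illustrative; falling back to the raw export value when no `Opt` entry matches is an assumption of this sketch.

```rust
/// One element of the `Opt` array.
enum OptEntry {
    /// Plain string: export value and display string are the same.
    Plain(String),
    /// Two-element array `[export_value, display_string]`.
    Pair { export: String, display: String },
}

/// Returns the display string for one export value, if `Opt` lists it.
fn display_for<'a>(opt: &'a [OptEntry], export_value: &str) -> Option<&'a str> {
    opt.iter().find_map(|entry| match entry {
        OptEntry::Plain(s) if s.as_str() == export_value => Some(s.as_str()),
        OptEntry::Pair { export, display } if export.as_str() == export_value => {
            Some(display.as_str())
        }
        _ => None,
    })
}

/// Maps every selected export value (single or multi-select) to display text,
/// falling back to the export value itself when `Opt` has no match.
fn display_all(opt: &[OptEntry], selected: &[&str]) -> Vec<String> {
    selected
        .iter()
        .map(|v| display_for(opt, v).unwrap_or(v).to_string())
        .collect()
}
```

A single-selection `V` is handled by passing a one-element slice to `display_all`.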
## 4. Widget Annotations
Every terminal field is associated with one or more widget annotations. A field may merge with its single widget (the same dictionary object serves both roles) or the field may have a `Kids` array of separate widget dictionaries, each with `/Subtype /Widget`. Widgets carry:
- **`Rect`** — a four-element array `[x1 y1 x2 y2]` in default user space units, giving the bounding box of the field on the page. This is the `bbox` used in output.
- **`P`** — indirect reference to the page object on which the widget appears.
- **`AP`** — appearance dictionary with up to three sub-dictionaries: `N` (normal), `R` (rollover), `D` (down). Each entry is either a Form XObject stream or a sub-dictionary keyed by appearance state names (used for checkboxes and radio buttons).
**Extracting text from appearance streams.** When `V` is absent but the widget carries a pre-generated appearance (typically a document where `NeedAppearances` is `false` or missing), the `N` appearance can serve as a fallback source. It is a Form XObject whose content stream can be processed identically to a page content stream: extract the operands of the text-showing operators (`Tj`, `TJ`, `'`, `"`) and decode them using the font resources in the stream's own `/Resources` dictionary. This is the fallback path for fields whose value is encoded only in the rendered appearance.
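As a rough illustration of the fallback path, the sketch below scans a decoded content stream and collects the literal-string operands of the four text-showing operators. It is deliberately simplified: it ignores hex strings, octal escapes, comments, and inline images, and it treats string bytes as Latin-1 instead of decoding them through the font, so it is not a substitute for the main content-stream interpreter. `shown_strings` and `read_literal` are hypothetical names.

```rust
/// Reads a literal string starting just after its opening parenthesis.
/// Returns the text and the index just past the closing parenthesis.
fn read_literal(s: &[u8], mut i: usize) -> (String, usize) {
    let mut depth = 1;
    let mut text = String::new();
    while i < s.len() {
        match s[i] {
            b'\\' if i + 1 < s.len() => {
                i += 1;
                match s[i] {
                    b'n' => text.push('\n'),
                    b'r' => text.push('\r'),
                    b't' => text.push('\t'),
                    c => text.push(c as char), // covers \( \) \\ ; octal escapes not handled
                }
            }
            b'(' => {
                depth += 1;
                text.push('(');
            }
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return (text, i + 1);
                }
                text.push(')');
            }
            c => text.push(c as char),
        }
        i += 1;
    }
    (text, i) // unterminated string: return what was read
}

/// Collects the string operands of `Tj`, `TJ`, `'`, and `"`, one entry per operator.
fn shown_strings(stream: &[u8]) -> Vec<String> {
    let is_op_byte = |b: u8| b.is_ascii_alphabetic() || matches!(b, b'\'' | b'"' | b'*');
    let mut out = Vec::new();
    let mut pending: Vec<String> = Vec::new(); // strings seen since the last operator
    let mut i = 0;
    while i < stream.len() {
        let b = stream[i];
        if b == b'(' {
            let (text, next) = read_literal(stream, i + 1);
            pending.push(text);
            i = next;
        } else if b == b'/' {
            // Skip a name token such as /Helv so it is not mistaken for an operator.
            i += 1;
            while i < stream.len()
                && !stream[i].is_ascii_whitespace()
                && !b"()<>[]{}/%".contains(&stream[i])
            {
                i += 1;
            }
        } else if is_op_byte(b) {
            let start = i;
            while i < stream.len() && is_op_byte(stream[i]) {
                i += 1;
            }
            if matches!(&stream[start..i], b"Tj" | b"TJ" | b"'" | b"\"") {
                out.push(pending.concat());
            }
            pending.clear();
        } else {
            i += 1; // numbers, brackets, whitespace
        }
    }
    out
}
```

Kerning adjustments inside a `TJ` array are skipped, so `[(Wor) -20 (ld)] TJ` yields a single `World` entry.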
## 5. XFA Forms
The `/XFA` entry in `/AcroForm` contains the XFA form data. Its value is either a single stream (the complete XFA document as XML) or an array of alternating packet-name/stream pairs, where each packet name is a text string rather than a name object. Concatenating the streams in array order yields the complete XDP document:
```
[ (xdp:xdp) stream (template) stream (datasets) stream (config) stream ... (</xdp:xdp>) stream ]
```
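XFA versions range from 2.0 through 3.3. The version is typically visible in the namespace URI of the `template` packet's root element (a URI ending in a segment such as `xfa-template/3.3/`) rather than on the `xdp:xdp` wrapper element.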
**Relevant packets:**
- **`template`** — defines the form structure: field names, types, binding expressions, and layout. Field names follow XPath-like dot-notation (`form1.page1.subform1.field1`).
- **`datasets`** — contains the actual data bound to the template. The `xfa:data` element holds a tree of XML elements whose tag names and text content correspond to field values.
**Extraction algorithm for XFA.** Parse the `datasets` XML; walk the element tree depth-first. For each leaf text node, construct its full path by joining ancestor element names with `.`. Emit `(path, text_content)` pairs. For structured arrays, the XFA spec uses sibling elements with the same tag name; track occurrence indices.
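The depth-first walk can be sketched independently of any XML library. The example assumes the `datasets` packet has already been parsed into a simple element tree; `Elem`, `leaf`, `node`, and `flatten` are hypothetical names, and the `[n]` suffix for repeated siblings is one way to track occurrence indices.

```rust
use std::collections::HashMap;

/// Minimal element tree, assumed to be produced by an XML parser.
struct Elem {
    name: String,
    text: String,
    children: Vec<Elem>,
}

fn leaf(name: &str, text: &str) -> Elem {
    Elem { name: name.to_string(), text: text.to_string(), children: Vec::new() }
}

fn node(name: &str, children: Vec<Elem>) -> Elem {
    Elem { name: name.to_string(), text: String::new(), children }
}

/// Emits `(path, text)` for every leaf under `e`; `path` is the path of `e` itself.
fn flatten(e: &Elem, path: &str, out: &mut Vec<(String, String)>) {
    if e.children.is_empty() {
        out.push((path.to_string(), e.text.trim().to_string()));
        return;
    }
    // Count siblings per tag name so only repeated names get an index.
    let mut totals: HashMap<&str, usize> = HashMap::new();
    for c in &e.children {
        *totals.entry(c.name.as_str()).or_insert(0) += 1;
    }
    let mut seen: HashMap<&str, usize> = HashMap::new();
    for c in &e.children {
        let n = seen.entry(c.name.as_str()).or_insert(0);
        let child_path = if totals[c.name.as_str()] > 1 {
            format!("{}.{}[{}]", path, c.name, *n)
        } else {
            format!("{}.{}", path, c.name)
        };
        *n += 1;
        flatten(c, &child_path, out);
    }
}
```

Calling `flatten(&root, &root.name, &mut out)` on the first child of `xfa:data` produces paths in the same dot-notation the template uses for field names.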
**Hybrid XFA documents.** Some PDFs contain both `/XFA` and an AcroForm `Fields` array. The AcroForm fields serve as a compatibility layer for viewers that do not support XFA. When `/XFA` is present, prefer XFA data extraction; the AcroForm values may be stale or absent.
## 6. Annotation Types Relevant to Text Extraction
Annotations are listed in the `Annots` array of a page dictionary. Each annotation dictionary carries a required `/Subtype` that determines its semantics; `/Type /Annot` is optional, so an extractor should dispatch on `/Subtype` and not require `/Type`.
- **`Text`** (sticky note) — `Contents` holds the annotation text; `T` holds the author name; `RC` may hold rich text.
- **`FreeText`** — text rendered directly on the page surface. `Contents` is the plain text; `DA` and `DS` control styling; `RC` may carry formatted content.
- **Markup annotations** (`Highlight`, `Underline`, `Squiggly`, `StrikeOut`) — these reference existing page text via `QuadPoints`, an array of 8n numbers defining n quadrilaterals over the marked text. `Contents` carries the reviewer's comment.
- **`Link`** — `Contents` may hold descriptive text; the `A` entry holds an action dictionary (`/S /URI` with a `URI` string, or `/S /GoTo` with a destination).
- **`Stamp`** — `Contents` is the stamp text (e.g., "Approved").
- **`Popup`** — associated with a markup annotation via `Parent`; it displays the parent's comment rather than text of its own. Do not emit popups as separate annotations; capture their text through the parent.
## 7. Rich Text (`RC` and `RV` Entries)
Rich text strings are defined by PDF spec §12.7.3.4. Markup annotation dictionaries hold them in the `RC` entry; text field dictionaries hold the rich text value in `RV`, with `V` as the plain-text counterpart. The same extraction rules apply to both. The markup is an XHTML-like restricted subset: `<body>`, `<p>`, `<span>` elements with inline style attributes (`font-family`, `font-size`, `font-weight`, `font-style`, `color`, `text-decoration`).
**Plain-text extraction.** Parse the XML, discard all tags, and concatenate text node content. `<p>` boundaries map to newlines. `<br/>` within a paragraph maps to a newline.
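The plain-text rule can be approximated without a full XML parser. The sketch below is a tag stripper rather than a conforming parser: it assumes no `>` appears inside attribute values, handles only the five predefined XML entities, and trims trailing whitespace. `rich_text_to_plain` is a hypothetical name.

```rust
/// Strips tags from a rich text string, mapping `</p>` and `<br/>` to newlines.
fn rich_text_to_plain(rc: &str) -> String {
    let mut out = String::new();
    let mut rest = rc;
    loop {
        match rest.find('<') {
            None => {
                out.push_str(rest);
                break;
            }
            Some(lt) => {
                out.push_str(&rest[..lt]);
                match rest[lt..].find('>') {
                    None => break, // unterminated tag: drop the remainder
                    Some(gt) => {
                        let tag = rest[lt + 1..lt + gt].trim();
                        let closing = tag.starts_with('/');
                        // Element name: text up to the first whitespace or slash.
                        let name = tag
                            .trim_start_matches('/')
                            .split(|c: char| c.is_whitespace() || c == '/')
                            .next()
                            .unwrap_or("");
                        if (name == "p" && closing) || name == "br" {
                            out.push('\n');
                        }
                        rest = &rest[lt + gt + 1..];
                    }
                }
            }
        }
    }
    // Decode entities; `&amp;` goes last so "&amp;lt;" does not become "<".
    out.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
        .trim_end()
        .to_string()
}
```

A production implementation would more likely run a real XML parser and walk text nodes, which also handles numeric character references and CDATA sections.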
**Formatted extraction.** For callers that want span metadata, capture each `<span>` with its computed style. The style attribute follows a semicolon-separated CSS-like syntax; parse it into a key-value map. Relevant keys: `font-weight: bold`, `font-style: italic`, `color: #rrggbb` or `color: rgb(r,g,b)`.
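Parsing the style attribute into a key-value map is a matter of splitting on `;` and then on the first `:`. The helper name `parse_style` is illustrative, and lowercasing the keys is an assumption made here for easier lookup.

```rust
use std::collections::HashMap;

/// Parses a CSS-like `style` attribute value into a map of property to value.
fn parse_style(style: &str) -> HashMap<String, String> {
    style
        .split(';')
        .filter_map(|decl| {
            // Split on the first colon only, so values such as rgb(1,2,3) stay intact.
            let (key, value) = decl.split_once(':')?;
            let (key, value) = (key.trim(), value.trim());
            if key.is_empty() {
                None
            } else {
                Some((key.to_ascii_lowercase(), value.to_string()))
            }
        })
        .collect()
}
```

A caller can then test, for instance, whether `font-weight` maps to `bold` to mark a span as bold in the output.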
When both `RC` and `Contents` are present, `RC` is the richer source. When `RC` is absent, fall back to `Contents`.
## 8. Extracting Annotation Text
**Iteration.** For each page, read the `Annots` array. Each element is usually an indirect reference to an annotation dictionary, although a direct dictionary can also appear. Resolve each element and filter by `Subtype`.
**Fields to extract per annotation:**
| Entry | Meaning |
|---|---|
| `Contents` | Primary text content |
| `RC` | Rich text override (parse for plain text) |
| `T` | Author / title |
| `Subtype` | Annotation kind |
| `Rect` | Bounding box on the page |
| `QuadPoints` | Highlighted region (markup annotations only) |
**Spatial ordering.** To interleave annotation text with body text, compute the center of `Rect` (or the centroid of all `QuadPoints` quads) and sort annotations by their vertical position (descending `y`) then horizontal position (ascending `x`), matching the reading-order convention used for body text.
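The ordering rule can be sketched with plain coordinate tuples. The function names are hypothetical, and the quad centroid is approximated here as the mean of all quad vertices, which is an assumption of this sketch.

```rust
/// Center point of a `Rect` given as `[x1, y1, x2, y2]`.
fn rect_center(r: &[f32; 4]) -> (f32, f32) {
    ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)
}

/// Mean of all vertices across the quads, or `None` when there are no quads.
fn quads_centroid(quads: &[[f32; 8]]) -> Option<(f32, f32)> {
    if quads.is_empty() {
        return None;
    }
    let n = (quads.len() * 4) as f32;
    let (mut sx, mut sy) = (0.0f32, 0.0f32);
    for q in quads {
        for p in q.chunks_exact(2) {
            sx += p[0];
            sy += p[1];
        }
    }
    Some((sx / n, sy / n))
}

/// Sorts anchor points top-to-bottom (descending y), then left-to-right (ascending x).
fn sort_reading_order(centers: &mut [(f32, f32)]) {
    centers.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.total_cmp(&b.0)));
}
```

In practice the sort key would be stored next to each annotation. An exact comparison on `y` also means that two annotations on the same visual line but with slightly different centers are ordered by `y` first; bucketing `y` into line bands is one way to soften that.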
**Markup annotation text recovery.** For `Highlight`, `Underline`, `Squiggly`, and `StrikeOut`, the `QuadPoints` array identifies the page content already extracted by the main text extraction pipeline. A library can optionally resolve these quads against the extracted glyph positions to return the marked span as a first-class excerpt, in addition to the `Contents` comment.
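One simple way to resolve quads against glyph positions is a point-in-box test on glyph centers. The sketch below assumes the main pipeline exposes glyphs with a character and a center point (the `Glyph` struct is hypothetical) and approximates each quad by its axis-aligned bounding box, which is adequate for unrotated text only.

```rust
/// Hypothetical glyph record from the main text pipeline; `x`, `y` is the glyph center.
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
}

/// Axis-aligned bounds `(min_x, min_y, max_x, max_y)` of one quad of 8 numbers.
fn quad_bounds(q: &[f32; 8]) -> (f32, f32, f32, f32) {
    let xs = [q[0], q[2], q[4], q[6]];
    let ys = [q[1], q[3], q[5], q[7]];
    let min = |v: &[f32; 4]| v.iter().copied().fold(f32::INFINITY, f32::min);
    let max = |v: &[f32; 4]| v.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    (min(&xs), min(&ys), max(&xs), max(&ys))
}

/// Returns the glyphs (in their existing order) whose centers fall inside any quad.
fn marked_text(glyphs: &[Glyph], quads: &[[f32; 8]]) -> String {
    glyphs
        .iter()
        .filter(|g| {
            quads.iter().any(|q| {
                let (x0, y0, x1, y1) = quad_bounds(q);
                g.x >= x0 && g.x <= x1 && g.y >= y0 && g.y <= y1
            })
        })
        .map(|g| g.ch)
        .collect()
}
```

Because the bounding box does not depend on vertex order, the sketch tolerates producers that write quad corners in differing orders.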
## 9. Output Representation
**Form fields.** Emit a top-level `form_fields` array. Each entry is a struct:
```rust
pub struct FormField {
    pub name: String,                  // fully qualified field name (dot-joined)
    pub field_type: FieldType,         // Tx | Btn | Ch | Sig
    pub value: Option<String>,         // decoded current value
    pub default_value: Option<String>,
    pub page: Option<u32>,             // 0-indexed page number resolved from widget P entry
    pub bbox: Option<[f32; 4]>,        // [x1, y1, x2, y2] from widget Rect
}
```
**Annotations.** Emit an `annotations` array per page:
```rust
pub struct Annotation {
    pub kind: AnnotationKind,       // Text | FreeText | Highlight | ...
    pub contents: Option<String>,
    pub rich_text: Option<String>,  // raw RC XML if present
    pub author: Option<String>,
    pub bbox: [f32; 4],
    pub quad_points: Vec<[f32; 8]>, // populated for markup annotations
}
```
**Caller-controlled inclusion.** Expose boolean flags on the extraction configuration:
```rust
pub struct ExtractionOptions {
    pub extract_forms: bool,
    pub extract_annotations: bool,
    pub prefer_xfa: bool, // when XFA present, skip AcroForm field scan
}
```
When `extract_forms` is `false`, skip the AcroForm traversal entirely. When `extract_annotations` is `false`, skip the `Annots` array on each page. Both default to `true`. When `prefer_xfa` is `true` and `/XFA` is present, use XFA dataset extraction and suppress AcroForm field output to avoid duplicates.
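The flag interplay can be made concrete with a small decision function. The struct is repeated so the block compiles on its own; `FormSource` and `form_source` are hypothetical names. Two behaviours are assumptions rather than statements from this document: `prefer_xfa` defaults to `true`, and when `prefer_xfa` is `false` only the AcroForm fields are scanned.

```rust
pub struct ExtractionOptions {
    pub extract_forms: bool,
    pub extract_annotations: bool,
    pub prefer_xfa: bool,
}

impl Default for ExtractionOptions {
    fn default() -> Self {
        ExtractionOptions {
            extract_forms: true,
            extract_annotations: true,
            prefer_xfa: true, // assumed default; the text only fixes the first two
        }
    }
}

/// Which form data source the extractor should read.
#[derive(Debug, PartialEq)]
pub enum FormSource {
    Skip,
    AcroForm,
    Xfa,
}

/// Chooses the form source from the options and what the document contains.
pub fn form_source(opts: &ExtractionOptions, has_xfa: bool, has_fields: bool) -> FormSource {
    if !opts.extract_forms {
        FormSource::Skip // AcroForm traversal disabled entirely
    } else if opts.prefer_xfa && has_xfa {
        FormSource::Xfa // suppress AcroForm output to avoid duplicates
    } else if has_fields {
        FormSource::AcroForm
    } else {
        FormSource::Skip
    }
}
```

Returning a single source keeps the duplicate-suppression rule in one place, so the field scanner and the XFA walker do not each need to inspect the options.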