This commit creates user-facing documentation for each built-in profile:
- Profile YAML files defining match criteria, priority, and extracted fields
- Per-profile READMEs with match criteria summary, extracted fields table, known limitations, sample input pointers, and configuration tips
- xtask skeleton generator for automated README generation

Profiles documented:
- invoice: Commercial invoices with line items, vendor/customer, totals
- receipt: POS receipts with items, payment method
- contract: Legal contracts with parties, effective date, term, signatures
- scientific_paper: Academic papers with title, authors, abstract, DOI, references
- slide_deck: Presentation slides with title, presenter, date, slide titles
- form: Fillable forms (degenerate case: uses Phase 7.4 form_fields)
- bank_statement: Bank statements with account info, period, balances, transactions
- legal_filing: Court filings with case number, court, parties, filing date, docket
- book_chapter: Book chapters with title, chapter number, author, section headings

Closes: pdftract-4iier
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
FORM Profile
Fillable form with fields; uses line_dominant reading order and form_fields from Phase 7.4
Match Criteria Summary
Documents matching this profile typically contain:
- Strong text signals: Words like "form 1099", "application form", "questionnaire", "please fill out", "required fields"
- Structural signals: Form field layout (blanks, checkboxes, labeled input areas), blank lines with colons
- Page count: Usually 1-10 pages (forms are typically concise)
- Layout patterns: Labels followed by blanks/underlines, checkboxes, signature blocks, structured fields
The classifier looks for form-specific terminology combined with field layout patterns. Documents with "form" terminology AND blank fields match with highest confidence.
Note: This is a degenerate profile with no field extractors. It uses line_dominant reading order and surfaces all form_fields from Phase 7.4. The profile enables form-specific processing but does not extract named fields like other profiles.
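The matching rule described above (form terminology combined with blank-field layout) can be illustrated with a minimal sketch. This is not the classifier's actual implementation: the function name `form_signals`, the regular expressions, and the thresholds are all assumptions chosen for illustration. Only the phrase list comes from the text signals listed above.

```python
import re

# Phrases taken from the "strong text signals" list above.
TEXT_PATTERNS = [
    r"form 1099",
    r"application form",
    r"questionnaire",
    r"please fill out",
    r"required fields",
]

# Hypothetical structural pattern: a short label, a colon, then nothing
# or a run of underscores, e.g. "Name: ______" or "Date:".
BLANK_FIELD = re.compile(r"^[ \t]*[^:\n]{1,40}:[ \t]*_*[ \t]*$", re.MULTILINE)

# Hypothetical checkbox pattern: "[ ]", "[]", or the ballot-box glyph.
CHECKBOX = re.compile(r"\[\s?\]|☐")


def form_signals(text: str) -> dict:
    """Count text and structural signals and combine them with AND."""
    lowered = text.lower()
    text_hits = sum(1 for p in TEXT_PATTERNS if re.search(p, lowered))
    blank_hits = sum(1 for _ in BLANK_FIELD.finditer(text))
    checkbox_hits = sum(1 for _ in CHECKBOX.finditer(text))
    # Assumed thresholds: at least one phrase AND at least two layout cues.
    matches = text_hits >= 1 and (blank_hits + checkbox_hits) >= 2
    return {
        "text_hits": text_hits,
        "blank_hits": blank_hits,
        "checkbox_hits": checkbox_hits,
        "matches": matches,
    }
```

Requiring both kinds of signal keeps a document that merely mentions "form 1099" in running prose from matching, which mirrors the "terminology AND blank fields" condition in the text.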
Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| (none) | - | This profile has no field extractors | - | - |
Instead of named fields, this profile integrates with Phase 7.4's form_fields system, which extracts:
- Text input fields (labels + values)
- Checkbox/radio button states
- Signature blocks
- Date fields
- Multi-line text areas
See Phase 7.4 documentation for the form_fields schema.
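Because the form_fields schema lives in the Phase 7.4 documentation and is not reproduced here, the following is only a sketch of how a consumer might bucket entries by kind. The entry shape (a dict with `type`, `label`, and `value` keys) and the type names are assumptions for illustration, not the confirmed schema.

```python
from collections import defaultdict

# Assumed entry shape (NOT the confirmed Phase 7.4 schema):
#   {"type": "text" | "checkbox" | "signature" | "date" | "textarea",
#    "label": str, "value": object}


def group_by_type(form_fields: list) -> dict:
    """Group form field entries by their (assumed) "type" key."""
    groups = defaultdict(list)
    for field in form_fields:
        # Entries without a type are kept under "unknown" rather than dropped.
        groups[field.get("type", "unknown")].append(field)
    return dict(groups)
```

Grouping first makes it easier to apply kind-specific handling, for example treating checkbox states more cautiously than text values.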
Known Limitations
- No named extraction: Unlike other profiles, this does not return named fields; users must process form_fields output
- Handwritten forms: Handwritten responses may not be OCRed correctly
- Complex layouts: Forms with non-standard layouts (e.g., grids, nested sections) may confuse field detection
- Checkboxes and radio buttons: Checked/unchecked states may be unreliable, depending on how the PDF encodes them
- Multi-page forms: Fields spanning page boundaries may be split incorrectly
- Non-English forms: Forms in other languages may not match pattern lists
- Scanned forms: Poor scan quality can lead to missed fields or incorrect label-value pairing
- Dynamic forms: Forms with conditional fields (e.g., "if yes, go to section B") are not interpreted
Sample Input
Example fixtures demonstrating this profile are available in tests/fixtures/classifier/misc/.
Form fixtures are typically single-page documents with labeled fields and blanks for user input.
Configuration Tips
To override this profile for custom form formats:
pdftract profiles export form > my-form.yaml
# Edit my-form.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-form.yaml document.pdf
Common customizations:
- Add form-specific patterns to match.text_patterns for proprietary form types
- If you need named field extraction, copy this profile and add profile_fields entries
- For government forms (e.g., IRS, USCIS), create specific profiles with known field mappings
Integration with Phase 7.4: This profile sets form_fields_integration: true, which enables the form field extraction pipeline. The extracted form_fields array is included in the output JSON.
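Since this profile returns no named fields, a consumer who wants them has to derive them from the form_fields array in the output JSON. The sketch below shows one way to do that. It assumes the array sits under a top-level `form_fields` key and that each entry carries `label` and `value` keys. Both are assumptions, and `named_fields` is a hypothetical helper, not part of pdftract.

```python
import json


def named_fields(output_json: str) -> dict:
    """Build a label -> value mapping from the (assumed) form_fields array."""
    doc = json.loads(output_json)
    # Assumed location: a top-level "form_fields" key; empty list if absent.
    fields = doc.get("form_fields", [])
    named = {}
    for field in fields:
        raw = str(field.get("label", ""))
        # Normalize "Full Name:" to "full_name" so keys are stable.
        key = raw.strip().rstrip(":").strip().lower().replace(" ", "_")
        if key:
            # Keep the first occurrence if a label repeats.
            named.setdefault(key, field.get("value"))
    return named
```

A call such as `named_fields(open("out.json").read())` would then give a plain dictionary, where `out.json` stands for wherever the extraction output was saved.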
This README documents the built-in form profile. See docs/research/document-classification-and-zone-labeling.md for classifier theory and Phase 7.4 for form_fields schema.