
jedarden eec40dad15 docs(pdftract-4iier): complete per-profile README documentation Add comprehensive README files for all 9 built-in profiles (invoice, receipt, contract, scientific_paper, slide_deck, form, bank_statement, legal_filing, book_chapter). Each README includes: - Match Criteria Summary: prose description of what makes a document match - Extracted Fields table: field_name, type, description, example, source_hint - Known Limitations: bullet list of edge cases and failure modes - Sample Input Pointer: links to fixtures directory - Configuration Tips: how to override via --profile or export The xtask doc-profile skeleton generator was already implemented and was used to generate the initial skeleton, which was then enhanced with profile-specific human-authored content. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 00:35:35 -04:00
profile.yaml	docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles	2026-05-17 23:19:00 -04:00
README.md	docs(pdftract-4iier): complete per-profile README documentation	2026-05-18 00:35:35 -04:00

README.md

FORM Profile

A fillable form with fields. Uses line_dominant reading order and the form_fields extraction from Phase 7.4.

Match Criteria Summary

A document matches this profile when it exhibits the structure of a fillable form or questionnaire. The classifier looks for form-specific terminology such as "form" followed by an alphanumeric identifier, "application form", and "questionnaire", and for instructions such as "please fill out" or "required fields". Structurally, forms are recognized by their field layout (labels followed by blank spaces or boxes) and by the presence of colon-terminated field labels. Forms typically run from 1 to 10 pages and may include checkboxes, radio buttons, and lined or boxed areas for handwritten responses. This profile is a degenerate case: it defines no field extractors of its own and relies instead on the form_fields extraction from Phase 7.4.
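The match signals described above can be illustrated with a small Python sketch. This is not the pdftract classifier: the names TEXT_PATTERNS, LABEL_LINE, form_signals, and looks_like_form, the regular expressions, and the thresholds are all assumptions made for illustration. It counts how many of the listed phrases occur and how many lines look like colon-terminated labels followed by blank space.

```python
import re

# Hypothetical phrase patterns modeled on the terminology listed above.
TEXT_PATTERNS = [
    # "form" followed by an identifier that contains a digit, e.g. "Form W-9"
    re.compile(r"\bform\s+[A-Za-z]*-?\d+[A-Za-z0-9-]*\b", re.IGNORECASE),
    re.compile(r"\bapplication form\b", re.IGNORECASE),
    re.compile(r"\bquestionnaire\b", re.IGNORECASE),
    re.compile(r"\bplease fill out\b", re.IGNORECASE),
    re.compile(r"\brequired fields\b", re.IGNORECASE),
]

# A colon-terminated label followed only by blank space or underscores.
LABEL_LINE = re.compile(r"^\s*[A-Za-z][\w /()'-]{0,40}:\s*_*\s*$")


def form_signals(text: str) -> dict:
    """Count matching phrase patterns and empty colon-terminated label lines."""
    phrase_hits = sum(1 for p in TEXT_PATTERNS if p.search(text))
    label_lines = sum(1 for line in text.splitlines() if LABEL_LINE.match(line))
    return {"phrase_hits": phrase_hits, "label_lines": label_lines}


def looks_like_form(text: str, min_phrases: int = 1, min_labels: int = 3) -> bool:
    """Assumed rule: at least one phrase hit and a few label lines."""
    s = form_signals(text)
    return s["phrase_hits"] >= min_phrases and s["label_lines"] >= min_labels


sample = (
    "Form W-9\n"
    "Please fill out all required fields.\n"
    "Name: ______\n"
    "Address:\n"
    "Date of birth: ____\n"
)
signals = form_signals(sample)
is_form = looks_like_form(sample)
```

Requiring both a phrase hit and several label lines is a design choice of this sketch; the real weighting of text and structural criteria lives in profile.yaml.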

Extracted Fields

Field	Type	Description	Example Value	Source Hint
(none)	-	This profile has no field extractors	-	-

Note: This profile does not define extracted fields in profile_fields. Instead, it uses form_fields_integration: true to leverage the generic form field extraction from Phase 7.4. Field names and values are extracted dynamically based on the form's layout (label-value pairs, checkboxes, etc.).
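To make "extracted dynamically" concrete, here is a minimal Python sketch of layout-driven extraction over plain text lines. It is not the Phase 7.4 implementation, which is not shown in this README; the function name extract_form_fields, the PAIR and CHECKBOX patterns, and the bracket notation for checkboxes are assumptions. Field names come from whatever labels appear in the document rather than from a fixed schema.

```python
import re

# Hypothetical patterns: "Label: value" pairs and "[x] Label" checkboxes.
PAIR = re.compile(r"^\s*(?P<label>[A-Za-z][\w /()'-]{0,40}?)\s*:\s*(?P<value>.*?)\s*$")
CHECKBOX = re.compile(r"^\s*\[(?P<mark>[ xX])\]\s*(?P<label>.+?)\s*$")


def extract_form_fields(text: str) -> dict:
    """Build a field dict whose keys are discovered from the form itself."""
    fields = {}
    for line in text.splitlines():
        if (m := CHECKBOX.match(line)):
            # Checkbox state becomes a boolean value.
            fields[m["label"]] = m["mark"].lower() == "x"
        elif (m := PAIR.match(line)):
            # Blank fill-in areas (underscores or spaces) yield None.
            value = m["value"].strip("_ ")
            fields[m["label"]] = value or None
    return fields


sample = "Name: Jane Doe\nDate: ____\n[x] Subscribe\n[ ] Paper copy\nThank you"
fields = extract_form_fields(sample)
```

In this sketch an unfilled field maps to None, which mirrors the note that only labels and pre-filled values are captured.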

Known Limitations

Form field extraction depends on clear label-value relationships; extraction may fail on poorly aligned forms
Handwritten responses are not transcribed; only field labels and pre-filled values are captured
Fields in forms with complex layouts (nested sections, conditional fields) may be extracted incorrectly
Forms without colons or clear field delimiters may not be recognized as forms
Fields that continue across page boundaries in multi-page forms may be extracted incompletely
Checkboxes and radio buttons are detected, but their checked/unchecked state may not be read reliably
Individual cell values in forms that use tables or grids for data entry may not be extracted correctly
Non-English forms may not match because the match criteria use English-only text patterns

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/profiles/form/.

See the classifier corpus for representative documents.

Configuration Tips

To override this profile:

pdftract profiles export form > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf

For specific form types (e.g., tax forms, government applications), consider creating a dedicated profile with form-specific profile_fields instead of using this generic form profile. The form_fields integration can be combined with custom field extractors for hybrid approaches.
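The hybrid approach mentioned above can be pictured as a merge of two field sources. The Python sketch below is illustrative only: CUSTOM_EXTRACTORS, hybrid_fields, the field names tax_year and employer_id, and the rule that custom extractors override generic values are all assumptions, not pdftract behavior or profile.yaml syntax.

```python
import re

# Hypothetical form-specific extractors for one particular form type.
CUSTOM_EXTRACTORS = {
    "tax_year": re.compile(r"\btax year\s+(\d{4})\b", re.IGNORECASE),
    "employer_id": re.compile(r"\bEIN[:\s]+(\d{2}-\d{7})\b"),
}


def hybrid_fields(text: str, generic_fields: dict) -> dict:
    """Start from generic form fields, then add or override with custom ones."""
    merged = dict(generic_fields)  # copy so the caller's dict is untouched
    for name, pattern in CUSTOM_EXTRACTORS.items():
        m = pattern.search(text)
        if m:
            merged[name] = m.group(1)
    return merged


text = "Tax Year 2024\nEIN: 12-3456789\nName: Jane Doe"
generic = {"Name": "Jane Doe"}  # stand-in for generic form field output
result = hybrid_fields(text, generic)
```

Letting custom extractors take precedence keeps well-known fields stable while the generic extraction still picks up everything else on the page.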

This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.