This commit creates user-facing documentation for each built-in profile: - Profile YAML files defining match criteria, priority, and extracted fields - Per-profile READMEs with match criteria summary, extracted fields table, known limitations, sample input pointers, and configuration tips - xtask skeleton generator for automated README generation Profiles documented: - invoice: Commercial invoices with line items, vendor/customer, totals - receipt: POS receipts with items, payment method - contract: Legal contracts with parties, effective date, term, signatures - scientific_paper: Academic papers with title, authors, abstract, DOI, references - slide_deck: Presentation slides with title, presenter, date, slide titles - form: Fillable forms (degenerate case: uses Phase 7.4 form_fields) - bank_statement: Bank statements with account info, period, balances, transactions - legal_filing: Court filings with case number, court, parties, filing date, docket - book_chapter: Book chapters with title, chapter number, author, section headings Closes: pdftract-4iier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| profile.yaml | ||
| README.md | ||
INVOICE Profile
Commercial invoice with line items, vendor/customer, and totals
Match Criteria Summary
Documents matching this profile typically contain:
- Strong text signals: Words like "invoice", "bill to", "invoice #", "tax invoice", "due date", "purchase order"
- Structural signals: Presence of a line item table (detected as the largest table or in the bottom half of the first page)
- Page count: Usually 1-5 pages (invoices are rarely longer)
- Layout patterns: Vendor information at top, billing details, line items table, and totals at bottom
The classifier looks for invoice-specific terminology combined with tabular data structures. Documents with both "invoice" terminology AND monetary tables match with highest confidence.
Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| invoice_number | string | Unique invoice identifier | "INV-2024-0154" | Regex patterns: invoice\s*[#:]?\s*([A-Z0-9-]+) |
| vendor | string | Company issuing the invoice | "Acme Supplies Inc." | Regex patterns: vendor/supplier/company fields |
| customer | string | Company billed to | "Global Tech Corp." | Regex patterns: "bill to" section |
| invoice_date | date | Date invoice was issued | 2024-01-15 | Regex patterns: "invoice date" field |
| due_date | date | Payment deadline | 2024-02-14 | Regex patterns: "due date" or "payment due" fields |
| total | decimal | Total amount due | 1250.00 | Regex patterns: "total" or "amount due" fields |
| subtotal | decimal | Amount before tax | 1000.00 | Regex patterns: "subtotal" field |
| tax | decimal | Tax amount | 250.00 | Regex patterns: "tax", "vat", "gst" fields |
| line_items | array | Array of line item objects | [{description: "Widget", quantity: 10, unit_price: 100.00, amount: 1000.00}] |
Table extraction from largest table |
Known Limitations
- Multi-currency invoices: May extract the wrong total if currency symbols appear in multiple places; the profile matches the first currency symbol near "total"
- Complex line items: Line items spanning multiple rows (e.g., multi-line descriptions) may be split incorrectly; table extraction assumes single-row items
- Handwritten or scanned invoices: OCR errors can cause missed fields; the profile relies on clean text extraction
- Non-standard layouts: Invoices with line items on multiple pages may only extract items from the first page
- Multiple invoices in one PDF: Only the first invoice-like structure is extracted
- Discount handling: Discounts are not explicitly extracted; they may appear as negative line items or be missed entirely
- Invoice variations: Non-English invoices (e.g., "factura", "rechnung") may not match if the pattern list isn't localized
Sample Input
Example fixtures demonstrating this profile are available in tests/fixtures/classifier/invoice/.
The corpus includes 50 invoice documents covering various formats and layouts.
Configuration Tips
To override this profile for custom invoice formats:
pdftract profiles export invoice > my-invoice.yaml
# Edit my-invoice.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-invoice.yaml document.pdf
Common customizations:
- Add company-specific invoice number patterns to
invoice_number.extraction.patterns - Adjust
line_items.extraction.table_regionif invoices use non-standard table placement - Add localized patterns for non-English invoices
This README documents the built-in invoice profile. See docs/research/document-classification-and-zone-labeling.md for classifier theory.