Add comprehensive README files for all 9 built-in profiles (invoice, receipt, contract, scientific_paper, slide_deck, form, bank_statement, legal_filing, book_chapter). Each README includes:
- Match Criteria Summary: prose description of what makes a document match
- Extracted Fields table: field_name, type, description, example, source_hint
- Known Limitations: bullet list of edge cases and failure modes
- Sample Input Pointer: links to fixtures directory
- Configuration Tips: how to override via --profile or export
The xtask doc-profile skeleton generator was already implemented and was used to generate the initial skeleton, which was then enhanced with profile-specific human-authored content.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
INVOICE Profile
Commercial invoice with line items, vendor/customer, and totals
Match Criteria Summary
A document matches this profile when it exhibits the classic structure of a commercial invoice. The classifier looks for explicit invoice terminology such as "invoice", "tax invoice", or "bill to", often paired with vendor/supplier information and customer details. Key indicators include invoice numbers, payment terms, and line item tables. A table of line items with quantities, unit prices, and amounts is the most reliable structural signal. Page counts typically range from 1 to 5, with single-page invoices being the most common.
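The criteria above can be sketched as a small scoring heuristic. This is an illustration only, not the classifier's actual implementation: the function names, the weights, and the threshold are assumptions, and the real text patterns live in profile.yaml.

```python
import re

# Hypothetical patterns mirroring the terminology named above; the real
# list is defined in profile.yaml, not here.
TEXT_PATTERNS = [r"\btax invoice\b", r"\binvoice\b", r"\bbill to\b"]


def invoice_match_score(text: str, page_count: int, has_line_item_table: bool) -> int:
    """Return a toy score; weights are assumptions chosen for illustration."""
    lowered = text.lower()
    score = 0
    if any(re.search(pattern, lowered) for pattern in TEXT_PATTERNS):
        score += 1  # explicit invoice terminology
    if 1 <= page_count <= 5:
        score += 1  # typical invoice length
    if has_line_item_table:
        score += 2  # weighted highest: the most reliable structural signal
    return score


def looks_like_invoice(text: str, page_count: int, has_line_item_table: bool) -> bool:
    """Treat a score of 3 or more as a match (assumed threshold)."""
    return invoice_match_score(text, page_count, has_line_item_table) >= 3
```

With these assumed weights, terminology and page count alone are not enough for a match; a line item table has to be present as well.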
Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| invoice_number | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| vendor | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| customer | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| invoice_date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| due_date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| total | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| subtotal | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| tax | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| line_items | array | Extracted from the line item table rather than from free page text | [...] | table: largest_table_or_bottom_half |
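To make the "regex patterns" source hint concrete, the following sketch shows how a few of the scalar fields could be pulled from page text. The patterns and the extract_fields helper are hypothetical and are not taken from profile.yaml; real invoices will need broader patterns (other date formats, currency symbols, localized labels).

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical patterns for illustration; not the ones shipped in profile.yaml.
INVOICE_NUMBER_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
INVOICE_DATE_RE = re.compile(
    r"invoice\s+date\s*:?\s*(\d{4})-(\d{2})-(\d{2})", re.IGNORECASE
)
# Anchored to the start of a line so that "Subtotal" is not mistaken for "Total".
TOTAL_RE = re.compile(
    r"^\s*total\s*:?\s*\$?\s*([0-9][0-9,]*\.\d{2})\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_fields(text: str) -> dict:
    """Return whichever of invoice_number, invoice_date, total were found."""
    fields = {}
    match = INVOICE_NUMBER_RE.search(text)
    if match:
        fields["invoice_number"] = match.group(1)
    match = INVOICE_DATE_RE.search(text)
    if match:
        year, month, day = (int(part) for part in match.groups())
        fields["invoice_date"] = date(year, month, day)
    match = TOTAL_RE.search(text)
    if match:
        # Strip thousands separators before converting to a decimal.
        fields["total"] = Decimal(match.group(1).replace(",", ""))
    return fields
```

Fields that do not match are simply absent from the result, which mirrors how a missing pattern match leaves a field unextracted.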
Known Limitations
- Multi-currency invoices may extract the wrong total if currency symbol layout is unusual
- Line items with complex descriptions spanning multiple rows may be truncated or split incorrectly
- Invoices with nested line items (e.g., assemblies with components) may extract only top-level items
- Handwritten invoices or scans with poor OCR quality will have significantly reduced extraction accuracy
- Invoices where vendor/customer information is in logos rather than text may fail to extract those fields
- Credit notes (negative invoices) are not distinguished from regular invoices
- Invoices with multiple tax rates (e.g., different VAT rates) may capture only the aggregated tax total
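Some of these failure modes, such as a wrong total or a partially captured tax amount, can be flagged by the caller after extraction. The sketch below is one possible downstream check, not part of pdftract; the function name and the tolerance are assumptions.

```python
from decimal import Decimal


def totals_are_consistent(
    subtotal: Decimal,
    tax: Decimal,
    total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check that subtotal + tax equals total within a rounding tolerance."""
    return abs((subtotal + tax) - total) <= tolerance
```

A result of False does not say which field is wrong, only that the extracted subtotal, tax, and total disagree and the document deserves manual review.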
Sample Input
Example fixtures demonstrating this profile are available in tests/fixtures/profiles/invoice/.
See the classifier corpus for representative documents.
Configuration Tips
To override this profile:
pdftract profiles export invoice > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
For international invoices, you may want to add region-specific text patterns to the match.text_patterns list. For invoices with custom fields, add new entries to profile_fields with appropriate regex patterns.
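The two customizations just described can be pictured as edits to the parsed profile. The sketch below works on a plain dictionary standing in for the contents of my-profile.yaml. Only the names match.text_patterns and profile_fields come from this README; the shape of each profile_fields entry (a mapping with type and regex keys) and the helper name are assumptions, so check the exported file for the actual schema.

```python
import copy


def with_customizations(profile: dict, extra_text_patterns: list, extra_fields: dict) -> dict:
    """Return a copy of the profile with extra match patterns and fields added."""
    updated = copy.deepcopy(profile)  # leave the caller's profile untouched
    patterns = updated.setdefault("match", {}).setdefault("text_patterns", [])
    for pattern in extra_text_patterns:
        if pattern not in patterns:  # avoid duplicate entries
            patterns.append(pattern)
    # Assumed layout: profile_fields maps a field name to its definition.
    updated.setdefault("profile_fields", {}).update(extra_fields)
    return updated


# Hypothetical example: German and French invoice terms plus a PO number field.
base_profile = {"match": {"text_patterns": ["invoice"]}, "profile_fields": {}}
custom_profile = with_customizations(
    base_profile,
    ["rechnung", "facture"],
    {"po_number": {"type": "string", "regex": r"PO\s*#?\s*(\d+)"}},
)
```

In practice the same changes would be made by hand in the exported YAML file before passing it to --profile.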
This README was generated from profile.yaml; the Match Criteria Summary and Known Limitations sections carry profile-specific guidance added by hand.