pdftract/profiles/builtin/invoice/README.md
jedarden eec40dad15 docs(pdftract-4iier): complete per-profile README documentation
Add comprehensive README files for all 9 built-in profiles (invoice,
receipt, contract, scientific_paper, slide_deck, form, bank_statement,
legal_filing, book_chapter). Each README includes:
- Match Criteria Summary: prose description of what makes a document match
- Extracted Fields table: field_name, type, description, example, source_hint
- Known Limitations: bullet list of edge cases and failure modes
- Sample Input Pointer: links to fixtures directory
- Configuration Tips: how to override via --profile or export

The xtask doc-profile skeleton generator was already implemented
and was used to generate the initial skeleton, which was then enhanced
with profile-specific human-authored content.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 00:35:35 -04:00


# INVOICE Profile
Commercial invoice with line items, vendor/customer, and totals
## Match Criteria Summary
A document matches this profile when it has the structure of a commercial invoice. The classifier looks for explicit invoice terminology such as "invoice", "tax invoice", or "bill to", usually alongside vendor (supplier) and customer details. Other indicators are an invoice number and payment terms. The most reliable structural signal is a line item table with quantity, unit price, and amount columns. Matching documents typically run 1 to 5 pages, and single-page invoices are the most common.
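
As a rough illustration of how these signals could be combined, the following Python sketch collects the signals a document exhibits. It is not pdftract's classifier: the function name `invoice_signals`, the patterns, and the `has_line_item_table` input are assumptions made for this example, and the real match criteria live in `profile.yaml`.

```python
import re

# Hypothetical patterns modelled on the terminology named above; the real
# profile.yaml may use different ones.
TERMINOLOGY = [r"\btax invoice\b", r"\binvoice\b", r"\bbill to\b"]
INVOICE_NUMBER = r"\binvoice\s*(?:no\.?|number|#)"


def invoice_signals(text: str, page_count: int, has_line_item_table: bool) -> list[str]:
    """Return the names of the invoice signals present in a document."""
    signals = []
    if any(re.search(p, text, re.IGNORECASE) for p in TERMINOLOGY):
        signals.append("terminology")
    if re.search(INVOICE_NUMBER, text, re.IGNORECASE):
        signals.append("invoice_number")
    # Table detection is assumed to happen elsewhere and is passed in as a flag.
    if has_line_item_table:
        signals.append("line_item_table")
    if 1 <= page_count <= 5:
        signals.append("page_count")
    return signals
```

A caller could then require, for example, both `terminology` and `line_item_table` before treating the document as an invoice. That threshold is a design choice for the sketch, not something this profile documents.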
## Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|-------|------|-------------|---------------|-------------|
| invoice_number | string | Invoice identifier, matched in page text | "INV-2024-0042" | regex patterns |
| vendor | string | Name of the vendor (supplier) that issued the invoice, matched in page text | "Acme Supplies Ltd." | regex patterns |
| customer | string | Name of the customer being billed, matched in page text | "Example Retail Inc." | regex patterns |
| invoice_date | date | Date the invoice was issued, matched in page text | 2024-01-15 | regex patterns |
| due_date | date | Date payment is due, matched in page text | 2024-02-14 | regex patterns |
| total | decimal | Invoice total, matched in page text | 123.45 | regex patterns |
| subtotal | decimal | Amount before tax, matched in page text | 114.31 | regex patterns |
| tax | decimal | Tax amount, matched in page text | 9.14 | regex patterns |
| line_items | array | Line item rows, taken from a table on the page rather than from regex matches | [...] | table: largest_table_or_bottom_half |
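
To make the "regex patterns" source hint concrete, here is a minimal Python sketch of pattern-based extraction for the scalar fields. The patterns in `FIELD_PATTERNS` and the function `extract_fields` are hypothetical; they are not copied from `profile.yaml`, and pdftract's own regex dialect may differ from Python's.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical patterns, one capture group per field.
FIELD_PATTERNS = {
    "invoice_number": r"invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)",
    "invoice_date": r"invoice\s*date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "due_date": r"due\s*date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "subtotal": r"\bsubtotal\s*:?\s*\$?\s*([\d,]+\.\d{2})",
    "tax": r"\btax\s*:?\s*\$?\s*([\d,]+\.\d{2})",
    # \b keeps this pattern from matching the tail of "Subtotal".
    "total": r"\btotal\s*:?\s*\$?\s*([\d,]+\.\d{2})",
}
AMOUNT_FIELDS = {"subtotal", "tax", "total"}


def extract_fields(text: str) -> dict:
    """Apply each pattern to the page text and convert the captured value."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match is None:
            fields[name] = None  # field not found on this page
            continue
        raw = match.group(1)
        if name.endswith("_date"):
            fields[name] = date.fromisoformat(raw)
        elif name in AMOUNT_FIELDS:
            # Drop thousands separators before parsing the amount.
            fields[name] = Decimal(raw.replace(",", ""))
        else:
            fields[name] = raw
    return fields
```

The sketch only handles ISO dates and dot-decimal amounts. Real invoices use many more formats, which is one reason the actual patterns belong in the profile rather than in code.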
## Known Limitations
- Multi-currency invoices may extract the wrong total if currency symbol layout is unusual
- Line items with complex descriptions spanning multiple rows may be truncated or split incorrectly
- Invoices with nested line items (e.g., assemblies with components) may extract only top-level items
- Handwritten invoices or scans with poor OCR quality will have significantly reduced extraction accuracy
- Invoices where vendor/customer information is in logos rather than text may fail to extract those fields
- Credit notes (negative invoices) are not distinguished from regular invoices
- Invoices with multiple tax rates (e.g., different VAT rates) may capture only the aggregated tax total
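
Some of these limitations can be caught downstream with a simple consistency check on the extracted amounts. The Python sketch below is one possible approach, not a pdftract feature: `review_flags`, its flag names, and the one-cent tolerance are all assumptions made for illustration.

```python
from decimal import Decimal


def review_flags(subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return reasons an extracted invoice may need manual review."""
    flags = []
    if any(value is None for value in (subtotal, tax, total)):
        flags.append("missing_amount")
    elif abs(subtotal + tax - total) > tolerance:
        # A mismatch can indicate a wrong total, e.g. on a multi-currency layout.
        flags.append("totals_mismatch")
    if total is not None and total < 0:
        # The profile does not distinguish credit notes, so flag negative totals.
        flags.append("possible_credit_note")
    return flags
```

An empty list means the amounts are internally consistent. It does not prove they were extracted correctly.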
## Sample Input
Example fixtures demonstrating this profile are available in `tests/fixtures/profiles/invoice/`.
*See the classifier corpus for representative documents.*
## Configuration Tips
To override this profile, export it, edit the copy, and pass the edited file to `extract`:
```bash
pdftract profiles export invoice > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```
For international invoices, add region-specific text patterns to the `match.text_patterns` list. For invoices with custom fields, add new entries to `profile_fields`, each with a regex pattern that captures the field's value.
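
Before adding region-specific patterns to `match.text_patterns`, it can help to try them against text from your own documents. The following Python sketch does that for three candidate patterns built on the German, French, and Spanish words for "invoice". The dictionary `CANDIDATE_PATTERNS` and the function `matching_regions` are hypothetical, and Python's regex dialect may not match the one pdftract uses, so treat a pass here as a first check only.

```python
import re

# Candidate patterns to evaluate; these are examples, not shipped defaults.
CANDIDATE_PATTERNS = {
    "de": r"\bRechnung(?:snummer)?\b",
    "fr": r"\bfacture\b",
    "es": r"\bfactura\b",
}


def matching_regions(text: str) -> list[str]:
    """Return the region codes whose candidate pattern matches the text."""
    return sorted(
        region
        for region, pattern in CANDIDATE_PATTERNS.items()
        if re.search(pattern, text, re.IGNORECASE)
    )
```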
---
*This README was auto-generated from `profile.yaml`. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.*