- Add profile source modules: apply_profile, extraction, extraction_loader, field_extractor, match_eval - Add profiles CLI subcommand (profiles_cmd.rs) - Update all 9 built-in profile YAMLs (invoice, receipt, contract, scientific_paper, slide_deck, form, bank_statement, legal_filing, book_chapter) - Add 50 invoice fixture PDFs - Add 2 receipt fixture PDFs Part of: pdftract-3a310 (Phase 7.10 coordinator) |
||
|---|---|---|
| .. | ||
| profile.yaml | ||
| README.md | ||
RECEIPT Profile
Point-of-sale or purchase receipt with items, payment method
Match Criteria Summary
A document matches this profile when it displays the typical characteristics of a point-of-sale receipt. The classifier identifies receipt-specific terminology like "store receipt", "total sold", "change due", and payment method indicators. Structurally, receipts are recognized by their narrow aspect ratio (often mimicking thermal printer paper), columnar layout with monetary values, and compact single-page format. The presence of monetary columns aligned to the right side of the document is a strong structural signal. Receipts are almost always single-page documents with a vertical orientation.
Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| merchant | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| total | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| tax | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| items | array | Extracted from page text using pattern matching | [...] | columns: monetary_columns |
| payment_method | string | Extracted from page text using pattern matching | "example value" | regex patterns |
Known Limitations
- Very long receipts (e.g., from home improvement stores) may fold across multiple scan pages, breaking extraction
- Receipts with faint thermal print or low-resolution scans may have poor OCR quality
- Handwritten receipts (e.g., from contractors) may not match the profile due to lack of columnar structure
- Receipts in right-to-left languages (Arabic, Hebrew) may fail monetary column detection
- Multi-store returns or exchange receipts with complex itemization may extract items incorrectly
- Receipts with multiple transactions on one document (e.g., daily register tape) are not handled
- Tip lines on restaurant receipts may be confused with subtotal/total fields
Sample Input
Example fixtures demonstrating this profile are available in tests/fixtures/profiles/receipt/.
See the classifier corpus for representative documents.
Configuration Tips
To override this profile:
pdftract profiles export receipt > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
For receipts from specific merchants with custom layouts, consider adding merchant-specific patterns to the match.text_patterns list. For receipts with unique item formats, customize the items field's extraction schema.
This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.