
RECEIPT Profile

Point-of-sale or purchase receipt with line items and a payment method.

Match Criteria Summary

A document matches this profile when it displays the typical characteristics of a point-of-sale receipt. The classifier identifies receipt-specific terminology like "store receipt", "total sold", "change due", and payment method indicators. Structurally, receipts are recognized by their narrow aspect ratio (often mimicking thermal printer paper), columnar layout with monetary values, and compact single-page format. The presence of monetary columns aligned to the right side of the document is a strong structural signal. Receipts are almost always single-page documents with a vertical orientation.
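The signals described above can be combined into a single score. The following is a minimal sketch of that idea, not pdftract's actual classifier: the `PageInfo` type, the `receipt_score` function, the weights, the 0.5 aspect-ratio cutoff, and the payment-term pattern are all assumptions made for illustration. Only the three quoted phrases come from the summary above, and the right-aligned monetary column check is left out for brevity.

```python
import re
from dataclasses import dataclass

# The first three phrases are quoted in the match criteria summary.
# The payment-term pattern is a hypothetical stand-in for "payment method indicators".
TEXT_PATTERNS = [
    r"store receipt",
    r"total sold",
    r"change due",
    r"\b(?:cash|visa|mastercard|debit|credit)\b",
]


@dataclass
class PageInfo:
    """Hypothetical container for the facts the sketch needs about a document."""
    text: str
    width: float      # first-page width, in points
    height: float     # first-page height, in points
    page_count: int


def receipt_score(page: PageInfo) -> float:
    """Return a score in [0, 1]; the weights are arbitrary illustration values."""
    text = page.text.lower()
    hits = sum(1 for pattern in TEXT_PATTERNS if re.search(pattern, text))
    text_score = min(hits / 2, 1.0)  # two or more phrase hits saturate the text signal
    narrow = 1.0 if page.width / page.height < 0.5 else 0.0  # tall, narrow page
    single = 1.0 if page.page_count == 1 else 0.0            # single-page document
    return 0.6 * text_score + 0.25 * narrow + 0.15 * single


receipt = PageInfo("STORE RECEIPT\nMilk 2.99\nCHANGE DUE 0.01\nVISA", 226, 800, 1)
report = PageInfo("Quarterly report", 612, 792, 12)
print(receipt_score(receipt), receipt_score(report))
```

A narrow single-page document with several receipt phrases scores near 1, while a multi-page letter-size document with no receipt phrases scores 0.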

Extracted Fields

| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| merchant | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| total | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| tax | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| items | array | Extracted from page text using pattern matching | [...] | columns: monetary_columns |
| payment_method | string | Extracted from page text using pattern matching | "example value" | regex patterns |
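To make "extracted from page text using pattern matching" concrete, here is a rough sketch of regex-based extraction for the `date`, `total`, and `tax` fields. It is an illustration only: the regular expressions, the `extract_fields` function, and the sample receipt text are assumptions, not the patterns shipped in profile.yaml.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional

# Hypothetical patterns. An amount is digits with two decimals, optionally prefixed by "$".
AMOUNT = r"\$?\s*(\d+(?:,\d{3})*\.\d{2})"
# Anchoring at line start keeps "Subtotal" from matching the total pattern.
TOTAL_RE = re.compile(r"^\s*total\b[^\d$\n]*" + AMOUNT, re.IGNORECASE | re.MULTILINE)
TAX_RE = re.compile(r"^\s*(?:sales\s+)?tax\b[^\d$\n]*" + AMOUNT, re.IGNORECASE | re.MULTILINE)
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")  # ISO dates only, for brevity


def _amount(pattern: re.Pattern, text: str) -> Optional[Decimal]:
    """Return the first amount matched by pattern, or None."""
    match = pattern.search(text)
    return Decimal(match.group(1).replace(",", "")) if match else None


def _date(text: str) -> Optional[date]:
    """Return the first valid ISO date in text, or None."""
    for match in DATE_RE.finditer(text):
        try:
            return date(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # e.g. 2024-13-45 looks like a date but is not one
    return None


def extract_fields(text: str) -> dict:
    return {
        "date": _date(text),
        "total": _amount(TOTAL_RE, text),
        "tax": _amount(TAX_RE, text),
    }


sample = """ACME MARKET
2024-01-15
Milk          2.99
Subtotal    114.31
Tax           9.14
TOTAL       123.45
VISA ****1234
"""
fields = extract_fields(sample)
print(fields)
```

Amounts are parsed as `Decimal` rather than `float` so that monetary values keep their exact two-decimal representation.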

Known Limitations

  • Very long receipts (e.g., from home improvement stores) may fold across multiple scan pages, breaking extraction
  • Receipts with faint thermal print or low-resolution scans may have poor OCR quality
  • Handwritten receipts (e.g., from contractors) may not match the profile due to lack of columnar structure
  • Receipts in right-to-left languages (Arabic, Hebrew) may fail monetary column detection
  • Multi-store returns or exchange receipts with complex itemization may extract items incorrectly
  • Receipts with multiple transactions on one document (e.g., daily register tape) are not handled
  • Tip lines on restaurant receipts may be confused with subtotal/total fields

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/profiles/receipt/.

See the classifier corpus for representative documents.

Configuration Tips

To override this profile:

pdftract profiles export receipt > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf

For receipts from specific merchants with custom layouts, consider adding merchant-specific patterns to the match.text_patterns list. For receipts with unique item formats, customize the items field's extraction schema.


This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.