pdftract/profiles/builtin/receipt
jedarden eec40dad15 docs(pdftract-4iier): complete per-profile README documentation
Add comprehensive README files for all 9 built-in profiles (invoice,
receipt, contract, scientific_paper, slide_deck, form, bank_statement,
legal_filing, book_chapter). Each README includes:
- Match Criteria Summary: prose description of what makes a document match
- Extracted Fields table: field_name, type, description, example, source_hint
- Known Limitations: bullet list of edge cases and failure modes
- Sample Input Pointer: links to fixtures directory
- Configuration Tips: how to override via --profile or export

The xtask doc-profile skeleton generator was already implemented
and was used to generate the initial skeleton, which was then enhanced
with profile-specific human-authored content.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 00:35:35 -04:00
profile.yaml docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles 2026-05-17 23:19:00 -04:00
README.md docs(pdftract-4iier): complete per-profile README documentation 2026-05-18 00:35:35 -04:00

RECEIPT Profile

Point-of-sale or purchase receipt with line items and a payment method

Match Criteria Summary

A document matches this profile when it shows the typical characteristics of a point-of-sale receipt. The classifier looks for receipt-specific terminology such as "store receipt", "total sold", and "change due", along with payment method indicators. Structurally, receipts are recognized by a narrow aspect ratio (often that of thermal printer paper), a columnar layout with monetary values, and a compact, vertically oriented format that is almost always a single page. Monetary columns aligned to the right side of the page are a strong structural signal.
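The signals described above can be sketched as a few independent checks. This is a minimal illustration, not the classifier's implementation: the function name `receipt_signals`, the term list, and both thresholds (a height-to-width ratio of 2.0 and a 30% share of lines ending in an amount) are assumptions made for the example. The actual criteria live in profile.yaml.

```python
import re

# Hypothetical signal phrases, taken from the prose above; the shipped list is in profile.yaml.
RECEIPT_TERMS = ("store receipt", "total sold", "change due")

# A line that ends in a monetary amount, e.g. "Coffee   3.50".
MONEY_AT_EOL = re.compile(r"\d+[.,]\d{2}\s*$")


def receipt_signals(text, page_width, page_height, page_count):
    """Return the individual structural and textual signals for one document."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    lowered = text.lower()
    money_lines = sum(1 for ln in lines if MONEY_AT_EOL.search(ln))
    return {
        "has_receipt_terms": any(term in lowered for term in RECEIPT_TERMS),
        # Assumed threshold: thermal paper is much taller than it is wide.
        "narrow_page": page_height / page_width >= 2.0,
        "single_page": page_count == 1,
        # Assumed threshold: a sizeable share of lines end in an amount.
        "right_aligned_money": bool(lines) and money_lines / len(lines) >= 0.3,
    }
```

Returning the signals separately, rather than a single score, makes it easier to see which criterion a borderline document fails.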

Extracted Fields

| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| merchant | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| total | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| tax | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| items | array | Extracted from page text using pattern matching | [...] | columns: monetary_columns |
| payment_method | string | Extracted from page text using pattern matching | "example value" | regex patterns |
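To make "pattern matching" concrete, the sketch below pulls `date`, `total`, and `tax` out of plain page text with regular expressions. The patterns and the `extract_fields` helper are hypothetical and written only for this illustration. The patterns pdftract actually uses are defined in profile.yaml and may differ.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical patterns for illustration only.
# Anchoring "total" at the start of a line keeps "SUBTOTAL" from matching.
TOTAL_RE = re.compile(r"^\s*total\b[^\n]*?(\d+\.\d{2})\s*$", re.I | re.M)
TAX_RE = re.compile(r"^\s*(?:sales\s+)?tax\b[^\n]*?(\d+\.\d{2})\s*$", re.I | re.M)
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_fields(text):
    """Return whichever of date, total, and tax can be found in the text."""
    fields = {}
    m = TOTAL_RE.search(text)
    if m:
        fields["total"] = Decimal(m.group(1))
    m = TAX_RE.search(text)
    if m:
        fields["tax"] = Decimal(m.group(1))
    m = DATE_RE.search(text)
    if m:
        try:
            fields["date"] = date(*(int(part) for part in m.groups()))
        except ValueError:
            pass  # digits looked like a date but were not a valid calendar day
    return fields
```

Amounts are parsed as `Decimal` rather than `float` so that values such as 0.10 keep their exact two-digit representation.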

Known Limitations

  • Very long receipts (e.g., from home improvement stores) may be split across multiple scanned pages, which breaks extraction
  • Receipts with faint thermal print or low-resolution scans may have poor OCR quality
  • Handwritten receipts (e.g., from contractors) may not match the profile due to lack of columnar structure
  • Receipts in right-to-left languages (Arabic, Hebrew) may fail monetary column detection
  • Multi-store returns or exchange receipts with complex itemization may extract items incorrectly
  • Receipts with multiple transactions on one document (e.g., daily register tape) are not handled
  • Tip lines on restaurant receipts may be confused with subtotal/total fields

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/profiles/receipt/.

See the classifier corpus for representative documents.

Configuration Tips

To override this profile:

pdftract profiles export receipt > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf

For receipts from specific merchants with custom layouts, consider adding merchant-specific patterns to the match.text_patterns list. For receipts with unique item formats, customize the items field's extraction schema.
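Before adding a merchant-specific pattern, it can help to check it against text from a few sample receipts. The sketch below assumes that entries in match.text_patterns are regular expressions matched case-insensitively, which this README does not confirm. The `pattern_hits` helper and the "ACME MART" merchant are made up for the example.

```python
import re


def pattern_hits(patterns, sample_texts):
    """Count, for each candidate pattern, how many sample texts it matches."""
    hits = {}
    for pattern in patterns:
        # Assumption: patterns are regexes and matching ignores case.
        compiled = re.compile(pattern, re.IGNORECASE)
        hits[pattern] = sum(1 for text in sample_texts if compiled.search(text))
    return hits


# Hypothetical merchant text, for illustration only.
samples = [
    "ACME MART #42\nTOTAL 5.75\nThank you for shopping at Acme",
    "ACME  MART\nCHANGE DUE 4.25",
    "Corner Bakery\nTOTAL 3.10",
]
candidates = [r"acme\s+mart", r"thank you for shopping at acme"]
print(pattern_hits(candidates, samples))
```

A pattern that matches every sample from the target merchant and none from other merchants is a reasonable candidate for the list.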


This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.