pdftract/profiles/builtin/bank_statement
jedarden 8b5dd4febb docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles
This commit creates user-facing documentation for each built-in profile:

- Profile YAML files defining match criteria, priority, and extracted fields
- Per-profile READMEs with match criteria summary, extracted fields table,
  known limitations, sample input pointers, and configuration tips
- xtask skeleton generator for automated README generation

Profiles documented:
- invoice: Commercial invoices with line items, vendor/customer, totals
- receipt: POS receipts with items, payment method
- contract: Legal contracts with parties, effective date, term, signatures
- scientific_paper: Academic papers with title, authors, abstract, DOI, references
- slide_deck: Presentation slides with title, presenter, date, slide titles
- form: Fillable forms (degenerate case: uses Phase 7.4 form_fields)
- bank_statement: Bank statements with account info, period, balances, transactions
- legal_filing: Court filings with case number, court, parties, filing date, docket
- book_chapter: Book chapters with title, chapter number, author, section headings

Closes: pdftract-4iier
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:19:00 -04:00

BANK_STATEMENT Profile

Bank statement with account info, period, balances, transactions

Match Criteria Summary

Documents matching this profile typically contain:

  • Strong text signals: Phrases such as "statement of account", "bank statement", "account statement", "transaction history"
  • Structural signals: Monetary columnar layout, date columns, tabular transaction data
  • Page count: Usually 1-10 pages (monthly statements vary by account activity)
  • Layout patterns: Account info at top, statement period, opening balance, transaction list table, closing balance

The classifier looks for bank statement terminology combined with monetary tabular data. Documents that contain both "statement" terminology and date/amount columns match with the highest confidence.
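The combination of text and structural signals described above can be illustrated with a toy scorer. This is a minimal sketch, not pdftract's classifier: the regular expressions, the `min_rows` threshold, and the confidence values are all assumptions chosen for illustration. Only the four phrases come from this README.

```python
import re

# Phrases listed in this README as strong text signals.
TEXT_SIGNALS = (
    "statement of account",
    "bank statement",
    "account statement",
    "transaction history",
)

# Hypothetical patterns for spotting a date and a monetary amount on a line.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?\b")
AMOUNT_RE = re.compile(r"-?\$?\d{1,3}(?:,\d{3})*\.\d{2}\b")


def has_text_signal(text: str) -> bool:
    """True if any of the listed phrases appears, ignoring case."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in TEXT_SIGNALS)


def count_transaction_like_lines(text: str) -> int:
    """Count lines that contain both a date and a monetary amount."""
    return sum(
        1
        for line in text.splitlines()
        if DATE_RE.search(line) and AMOUNT_RE.search(line)
    )


def score(text: str, min_rows: int = 3) -> float:
    """Toy confidence value; the numbers are arbitrary, not pdftract's."""
    text_hit = has_text_signal(text)
    table_hit = count_transaction_like_lines(text) >= min_rows
    if text_hit and table_hit:
        return 0.9  # both kinds of signal present
    if text_hit or table_hit:
        return 0.5  # only one kind of signal present
    return 0.0


sample = (
    "Bank Statement\n"
    "2024-01-15 GROCERY STORE -85.32 1664.68\n"
    "2024-01-16 PAYROLL 500.00 2164.68\n"
    "2024-01-20 RENT -400.00 1764.68\n"
)
```

Here `score(sample)` falls in the top bucket because the text contains "bank statement" and three lines carry both a date and an amount. A document with only one kind of signal lands in the middle bucket.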

Extracted Fields

| Field | Type | Description | Example Value | Source Hint |
|---|---|---|---|---|
| account_number | string | Account identifier | "****1234" or "8374921" | "account" or "acct" field, often masked |
| statement_period | string | Date range covered | "January 1 - January 31, 2024" | "statement period" field |
| opening_balance | decimal | Balance at period start | 1500.00 | "opening balance" or "beginning balance" |
| closing_balance | decimal | Balance at period end | 1750.00 | "closing balance", "ending balance", or "current balance" |
| transactions | array | Transaction records | [{date: "2024-01-15", description: "GROCERY STORE", amount: -85.32, balance: 1664.68}] | Largest table or central body |
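Because each transaction record carries both an `amount` and a running `balance`, extracted output can be sanity-checked after the fact. The sketch below is a hypothetical post-processing helper, not part of pdftract. It assumes amounts are signed, rows are in chronological order, and values are passed as strings so that `Decimal` arithmetic stays exact. The sample rows are invented, apart from the GROCERY STORE row and the two balances taken from the table above.

```python
from decimal import Decimal


def check_running_balance(opening_balance, transactions):
    """Return indices of rows whose balance != previous balance + amount."""
    mismatches = []
    previous = Decimal(opening_balance)
    for index, tx in enumerate(transactions):
        expected = previous + Decimal(tx["amount"])
        actual = Decimal(tx["balance"])
        if expected != actual:
            mismatches.append(index)
        # Continue from the stated balance so one bad row is reported once
        # where possible, rather than shifting every later expectation.
        previous = actual
    return mismatches


# Illustrative data: opening balance 1500.00, closing balance 1750.00.
sample_transactions = [
    {"date": "2024-01-05", "description": "DEPOSIT", "amount": "250.00", "balance": "1750.00"},
    {"date": "2024-01-15", "description": "GROCERY STORE", "amount": "-85.32", "balance": "1664.68"},
    {"date": "2024-01-28", "description": "REFUND", "amount": "85.32", "balance": "1750.00"},
]

mismatches = check_running_balance("1500.00", sample_transactions)
closing_matches = Decimal(sample_transactions[-1]["balance"]) == Decimal("1750.00")
```

An empty `mismatches` list together with `closing_matches` being true suggests the table was parsed row by row without dropped or merged lines. A non-empty list points at the rows worth inspecting, which is most useful for scanned statements.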

Known Limitations

  • Multi-currency accounts: Transactions in multiple currencies may be extracted incorrectly
  • Continued statements: Statements split across multiple PDF files (e.g., ones ending with a "continued" notice) may not be linked correctly
  • Scanned statements: Poor OCR quality can lead to transaction parsing errors, especially with dense tables
  • Non-standard layouts: Statements with unusual layouts (e.g., credit card statements with tiered rates) may not be extracted correctly
  • Pending transactions: Pending transactions and holds may be excluded or formatted differently
  • Transaction descriptions: Very long descriptions may be truncated or wrapped incorrectly
  • Multiple accounts: Statements covering multiple accounts (e.g., combined statements) may not separate accounts correctly
  • Non-English statements: Statements in other languages may not match pattern lists
  • Check images: Statements with embedded check images may have OCR artifacts

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/classifier/misc/.

Bank statement fixtures are typically multi-page documents with transaction tables.

Configuration Tips

To override this profile for custom bank statement formats:

```shell
pdftract profiles export bank_statement > my-statement.yaml
# Edit my-statement.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-statement.yaml document.pdf
```

Common customizations:

  • Add bank-specific account number patterns to account_number.extraction.patterns
  • Adjust transactions.extraction.table_region for non-standard table placement
  • For credit card statements, add fields like minimum_payment, payment_due_date
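Before adding a bank-specific pattern to `account_number.extraction.patterns`, it can help to try it against a few representative header lines. The sketch below assumes the patterns are regular expressions, which this README does not confirm, and that Python's `re` syntax is close enough to whatever engine pdftract uses. The two candidate patterns and the `first_match` helper are hypothetical.

```python
import re

# Hypothetical candidate patterns: one for masked numbers, one for plain digits.
CANDIDATE_PATTERNS = [
    r"(?i)\bacc(?:oun)?t\s*(?:number|no\.?|#)?\s*[:\-]?\s*(\*{4}\d{4})",
    r"(?i)\bacc(?:oun)?t\s*(?:number|no\.?|#)?\s*[:\-]?\s*(\d{7,12})\b",
]


def first_match(text, patterns=CANDIDATE_PATTERNS):
    """Return the captured account number from the first matching pattern."""
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return found.group(1)
    return None


# Header lines of the kind a statement might contain (invented samples).
samples = [
    "Account Number: ****1234",
    "Acct # 8374921",
    "Invoice total 12.00",
]
results = [first_match(line) for line in samples]
```

Keeping a small list of sample lines per bank alongside the exported YAML makes it easier to notice when a new pattern shadows an existing one, since `first_match` returns the result of the earliest pattern that matches.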

This README documents the built-in bank_statement profile. See docs/research/document-classification-and-zone-labeling.md for classifier theory.