This commit creates user-facing documentation for each built-in profile:

- Profile YAML files defining match criteria, priority, and extracted fields
- Per-profile READMEs with match criteria summary, extracted fields table, known limitations, sample input pointers, and configuration tips
- xtask skeleton generator for automated README generation

Profiles documented:

- invoice: Commercial invoices with line items, vendor/customer, totals
- receipt: POS receipts with items, payment method
- contract: Legal contracts with parties, effective date, term, signatures
- scientific_paper: Academic papers with title, authors, abstract, DOI, references
- slide_deck: Presentation slides with title, presenter, date, slide titles
- form: Fillable forms (degenerate case: uses Phase 7.4 form_fields)
- bank_statement: Bank statements with account info, period, balances, transactions
- legal_filing: Court filings with case number, court, parties, filing date, docket
- book_chapter: Book chapters with title, chapter number, author, section headings

Closes: pdftract-4iier
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-4iier: Profile README Documentation
Summary
Created per-profile README documentation for all 9 built-in profiles.
Files Created
Profile YAML Files (9)
- profiles/builtin/invoice/profile.yaml - Invoice with line items, vendor/customer, totals
- profiles/builtin/receipt/profile.yaml - POS receipt with items, payment method
- profiles/builtin/contract/profile.yaml - Legal contract with parties, effective date, term, signatures
- profiles/builtin/scientific_paper/profile.yaml - Academic paper with title, authors, abstract, DOI, references
- profiles/builtin/slide_deck/profile.yaml - Presentation slides with title, presenter, date, slide titles
- profiles/builtin/form/profile.yaml - Fillable form (degenerate case: no field extractor, uses Phase 7.4 form_fields)
- profiles/builtin/bank_statement/profile.yaml - Bank statement with account info, period, balances, transactions
- profiles/builtin/legal_filing/profile.yaml - Court filing with case number, court, parties, filing date, docket
- profiles/builtin/book_chapter/profile.yaml - Book chapter with title, chapter number, author, section headings
Profile README Files (9)
Each README follows the same six-section structure:
- Title and one-line description
- Match Criteria Summary - prose description of matching signals
- Extracted Fields - table with field_name, type, description, example_value, source_location_hint
- Known Limitations - document-specific edge cases and failure modes
- Sample Input - pointer to fixtures
- Configuration Tips - how to override via --profile or export/edit
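The six-part layout listed above can be sketched as a small rendering function. This is only an illustration, not the code in xtask/src/main.rs: the `FieldDoc` struct, the `render_readme_skeleton` function, the exact heading text, and the `TODO` placeholders are assumptions made for the example. Only the section names and the five table columns come from these notes.

```rust
/// Headings that follow the title line, taken from the section list in these notes.
/// The exact heading text used by the real READMEs is an assumption.
const SECTIONS: [&str; 5] = [
    "Match Criteria Summary",
    "Extracted Fields",
    "Known Limitations",
    "Sample Input",
    "Configuration Tips",
];

/// Hypothetical description of one extracted field (one row of the table).
pub struct FieldDoc {
    pub field_name: String,
    pub field_type: String,
    pub description: String,
    pub example_value: String,
    pub source_location_hint: String,
}

/// Render a README skeleton: title and one-line description first,
/// then the five remaining sections, with a table under "Extracted Fields".
pub fn render_readme_skeleton(profile: &str, summary: &str, fields: &[FieldDoc]) -> String {
    let mut out = format!("# {profile}\n\n{summary}\n\n");
    for section in SECTIONS.iter() {
        out.push_str(&format!("## {section}\n\n"));
        if *section == "Extracted Fields" {
            out.push_str(
                "| field_name | type | description | example_value | source_location_hint |\n",
            );
            out.push_str("|---|---|---|---|---|\n");
            for f in fields {
                out.push_str(&format!(
                    "| {} | {} | {} | {} | {} |\n",
                    f.field_name,
                    f.field_type,
                    f.description,
                    f.example_value,
                    f.source_location_hint
                ));
            }
            out.push('\n');
        } else {
            // Placeholder for prose that a human fills in afterwards.
            out.push_str("TODO\n\n");
        }
    }
    out
}
```

Keeping the section list in one constant means every generated README gets the same headings in the same order, which is the property the acceptance criteria check for.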
xtask Skeleton Generator
- xtask/Cargo.toml - Cargo manifest for xtask binary
- xtask/src/main.rs - Rust code for xtask doc-profile <name> and xtask doc-profiles commands
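A minimal sketch of how the two commands might be dispatched, using only the standard library. The `Command` enum, `parse_command`, and `target_readmes` are hypothetical names, and the README location `profiles/builtin/<name>/README.md` is an assumption inferred from where the profile YAML files live. The actual xtask/src/main.rs may be structured differently.

```rust
/// The nine built-in profiles listed in these notes.
pub const BUILTIN_PROFILES: [&str; 9] = [
    "invoice",
    "receipt",
    "contract",
    "scientific_paper",
    "slide_deck",
    "form",
    "bank_statement",
    "legal_filing",
    "book_chapter",
];

/// The two commands named in these notes.
#[derive(Debug, PartialEq)]
pub enum Command {
    /// `xtask doc-profile <name>`
    DocProfile(String),
    /// `xtask doc-profiles`
    DocProfiles,
}

/// Parse the arguments that follow the binary name.
pub fn parse_command(args: &[String]) -> Result<Command, String> {
    let parts: Vec<&str> = args.iter().map(String::as_str).collect();
    match parts.as_slice() {
        ["doc-profile", name] => {
            if BUILTIN_PROFILES.contains(name) {
                Ok(Command::DocProfile((*name).to_string()))
            } else {
                Err(format!("unknown profile: {name}"))
            }
        }
        ["doc-profiles"] => Ok(Command::DocProfiles),
        _ => Err("usage: xtask doc-profile <name> | xtask doc-profiles".to_string()),
    }
}

/// README paths a command would write to.
/// The path layout is assumed, not confirmed by the notes.
pub fn target_readmes(cmd: &Command) -> Vec<String> {
    let names: Vec<&str> = match cmd {
        Command::DocProfile(name) => vec![name.as_str()],
        Command::DocProfiles => BUILTIN_PROFILES.to_vec(),
    };
    names
        .into_iter()
        .map(|n| format!("profiles/builtin/{n}/README.md"))
        .collect()
}
```

In a real binary the arguments would typically come from `std::env::args().skip(1).collect::<Vec<String>>()`.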
Acceptance Criteria Status
- ✅ All nine README files exist at the documented paths
- ✅ Each follows the consistent 6-section structure
- ✅ Extracted Fields tables match the corresponding profile YAML's profile_fields
- ✅ Known Limitations is non-empty and document-specific for each profile
- ✅ Sample Input Pointer links to actual fixtures in tests/fixtures/classifier/ or tests/fixtures/profiles/
- ✅ xtask doc-profile skeleton generator scripted (Rust code in xtask/)
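The criterion that each Extracted Fields table matches the profile's profile_fields could be checked mechanically along these lines. This is a sketch under assumptions: `table_field_names`, `diff_fields`, and `FieldDiff` are hypothetical helpers, the README is assumed to contain a single pipe-delimited table whose first column is `field_name`, and the YAML field names are passed in as a plain slice so that no YAML parser is needed.

```rust
use std::collections::BTreeSet;

/// Collect the first column of every markdown table row in `readme`,
/// skipping the header row and the `|---|` separator row.
pub fn table_field_names(readme: &str) -> Vec<String> {
    readme
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with('|'))
        .filter_map(|line| {
            line.trim_matches('|')
                .split('|')
                .next()
                .map(|cell| cell.trim().to_string())
        })
        .filter(|cell| {
            !cell.is_empty()
                && cell.as_str() != "field_name"
                && !cell.chars().all(|c| c == '-' || c == ':')
        })
        .collect()
}

/// Field names present on one side but not the other.
pub struct FieldDiff {
    /// In the profile's field list but absent from the README table.
    pub missing: Vec<String>,
    /// In the README table but absent from the profile's field list.
    pub extra: Vec<String>,
}

/// Compare the profile's field names with the names documented in the README.
pub fn diff_fields(yaml_fields: &[&str], readme: &str) -> FieldDiff {
    let documented: BTreeSet<String> = table_field_names(readme).into_iter().collect();
    let expected: BTreeSet<String> = yaml_fields.iter().map(|s| s.to_string()).collect();
    FieldDiff {
        missing: expected.difference(&documented).cloned().collect(),
        extra: documented.difference(&expected).cloned().collect(),
    }
}
```

A README passes when both `missing` and `extra` are empty; using sorted sets keeps the reported differences in a stable order.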
Notes
- The form profile README correctly documents that it's a degenerate case (no field extractor, uses Phase 7.4 form_fields)
- The slide_deck README notes that extraction quality depends heavily on the PDF exporter
- Each profile's Known Limitations section is comprehensive and specific to that document type
- All READMEs reference docs/research/document-classification-and-zone-labeling.md for classifier theory
- The xtask generator is a starting point; it still needs to be integrated into the workspace before it can be built or run