Commit graph

4 commits

Author SHA1 Message Date
jedarden
25ddcba641 docs(pdftract-4iier): complete per-profile README documentation
Complete the per-profile README documentation for all 9 built-in profiles:

- slide_deck: Add Known Limitations section
- form: Add Match Criteria Summary and Known Limitations
- bank_statement: Add Match Criteria Summary and Known Limitations
- legal_filing: Add Match Criteria Summary and Known Limitations
- book_chapter: Add Match Criteria Summary and Known Limitations

The xtask doc-profile skeleton generator already existed and provides
automated README generation from profile.yaml files.

All READMEs now follow the consistent 6-section structure:
1. Title and description
2. Match Criteria Summary (prose description)
3. Extracted Fields (table with field details)
4. Known Limitations (document-specific edge cases)
5. Sample Input Pointer (fixture references)
6. Configuration Tips (override instructions)

Acceptance criteria:
- All nine README files exist at profiles/builtin/<type>/README.md
- Each follows the consistent 6-section structure
- Extracted Fields tables match the corresponding profile YAML
- Known Limitations is non-empty and document-specific
- Sample Input Pointer links to actual fixtures
- xtask doc-profile skeleton generator exists

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-18 00:19:44 -04:00
jedarden
b535638104 feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.

Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields

Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences

All 27 tests pass including proptests for fuzzing robustness (INV-8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:45:45 -04:00
jedarden
17f581897f fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling
- Fix typo: "scific_paper" -> "scientific_paper" in fixture path
- Fix xtask path resolution: use relative path ".." to access workspace root
- Fix xtask format string: remove unused profile_name placeholder
- Add workspace exclusion to xtask/Cargo.toml for standalone build

These are minor improvements to the existing per-profile README documentation
that was already created in commit 8b5dd4f.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:22:39 -04:00
jedarden
8b5dd4febb docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles
This commit creates user-facing documentation for each built-in profile:

- Profile YAML files defining match criteria, priority, and extracted fields
- Per-profile READMEs with match criteria summary, extracted fields table,
  known limitations, sample input pointers, and configuration tips
- xtask skeleton generator for automated README generation

Profiles documented:
- invoice: Commercial invoices with line items, vendor/customer, totals
- receipt: POS receipts with items, payment method
- contract: Legal contracts with parties, effective date, term, signatures
- scientific_paper: Academic papers with title, authors, abstract, DOI, references
- slide_deck: Presentation slides with title, presenter, date, slide titles
- form: Fillable forms (degenerate case: uses Phase 7.4 form_fields)
- bank_statement: Bank statements with account info, period, balances, transactions
- legal_filing: Court filings with case number, court, parties, filing date, docket
- book_chapter: Book chapters with title, chapter number, author, section headings

Closes: pdftract-4iier
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:19:00 -04:00