Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright
notices. Configure all workspace and non-workspace crates to declare the
license. Wire license files into Python wheels and Docker images.
Files added:
- LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero"
- LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org)
Files modified:
- Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>"
- crates/pdftract-py/pyproject.toml: Added license-files to maturin config
- crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true
- xtask/Cargo.toml: Added license = "MIT OR Apache-2.0"
- fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0"
- Cargo-dist.toml: Created to include license files in binary archives
- notes/pdftract-aawrz.md: Verification note
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Complete the per-profile README documentation for all 9 built-in profiles:
- slide_deck: Add Known Limitations section
- form: Add Match Criteria Summary and Known Limitations
- bank_statement: Add Match Criteria Summary and Known Limitations
- legal_filing: Add Match Criteria Summary and Known Limitations
- book_chapter: Add Match Criteria Summary and Known Limitations
The xtask doc-profile skeleton generator already existed and provides
automated README generation from profile.yaml files.
All READMEs now follow the consistent 6-section structure:
1. Title and description
2. Match Criteria Summary (prose description)
3. Extracted Fields (table with field details)
4. Known Limitations (document-specific edge cases)
5. Sample Input Pointer (fixture references)
6. Configuration Tips (override instructions)
Acceptance criteria:
- All nine README files exist at profiles/builtin/<type>/README.md
- Each follows the consistent 6-section structure
- Extracted Fields tables match the corresponding profile YAML
- Known Limitations is non-empty and document-specific
- Sample Input Pointer links to actual fixtures
- xtask doc-profile skeleton generator exists
Co-Authored-By: Claude Code <noreply@anthropic.com>
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.
Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields
Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences
All 27 tests pass including proptests for fuzzing robustness (INV-8).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix typo: "scific_paper" -> "scientific_paper" in fixture path
- Fix xtask path resolution: use relative path ".." to access workspace root
- Fix xtask format string: remove unused profile_name placeholder
- Add workspace exclusion to xtask/Cargo.toml for standalone build
These are minor improvements to the existing per-profile README documentation
that was already created in commit 8b5dd4f.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>