Commit graph

4 commits

Author SHA1 Message Date
jedarden
c621947686 feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement
Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF
extraction, analogous to cargo-bloat for binary size.

Changes:
- CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB)
- CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB)
- CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow
- xtask: Implement memory-ceiling command with peak RSS sampling
- Add perf fixtures (100-page, 10k-page) for memory testing
- Add run-fuzz-with-limits.sh for local fuzz testing with memory caps
- Register perf fixtures in PROVENANCE.md

Memory budgets enforced:
- Buffered 100-page PDF: < 512 MB
- Streaming mode: < 256 MB (constant in page count)
- Adversarial fixtures: < 1 GB hard ceiling

Closes bf-1g1fd

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:22:55 -04:00
jedarden
58a177d3b4 docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files
Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright
notices. Configure all workspace and non-workspace crates to declare the
license. Wire license files into Python wheels and Docker images.

Files added:
- LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero"
- LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org)

Files modified:
- Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>"
- crates/pdftract-py/pyproject.toml: Added license-files to maturin config
- crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true
- xtask/Cargo.toml: Added license = "MIT OR Apache-2.0"
- fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0"
- Cargo-dist.toml: Created to include license files in binary archives
- notes/pdftract-aawrz.md: Verification note

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:36:28 -04:00
jedarden
17f581897f fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling
- Fix typo: "scific_paper" -> "scientific_paper" in fixture path
- Fix xtask path resolution: use relative path ".." to access workspace root
- Fix xtask format string: remove unused profile_name placeholder
- Add workspace exclusion to xtask/Cargo.toml for standalone build

These are minor improvements to the existing per-profile README documentation
that was already created in commit 8b5dd4f.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:22:39 -04:00
jedarden
8b5dd4febb docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles
This commit creates user-facing documentation for each built-in profile:

- Profile YAML files defining match criteria, priority, and extracted fields
- Per-profile READMEs with match criteria summary, extracted fields table,
  known limitations, sample input pointers, and configuration tips
- xtask skeleton generator for automated README generation

Profiles documented:
- invoice: Commercial invoices with line items, vendor/customer, totals
- receipt: POS receipts with items, payment method
- contract: Legal contracts with parties, effective date, term, signatures
- scientific_paper: Academic papers with title, authors, abstract, DOI, references
- slide_deck: Presentation slides with title, presenter, date, slide titles
- form: Fillable forms (degenerate case: uses Phase 7.4 form_fields)
- bank_statement: Bank statements with account info, period, balances, transactions
- legal_filing: Court filings with case number, court, parties, filing date, docket
- book_chapter: Book chapters with title, chapter number, author, section headings

Closes: pdftract-4iier
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:19:00 -04:00