Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. Changes: - CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB) - CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB) - CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow - xtask: Implement memory-ceiling command with peak RSS sampling - Add perf fixtures (100-page, 10k-page) for memory testing - Add run-fuzz-with-limits.sh for local fuzz testing with memory caps - Register perf fixtures in PROVENANCE.md Memory budgets enforced: - Buffered 100-page PDF: < 512 MB - Streaming mode: < 256 MB (constant in page count) - Adversarial fixtures: < 1 GB hard ceiling Closes bf-1g1fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4.4 KiB
4.4 KiB
Memory Ceiling Gate Implementation (bf-1g1fd)
Summary
Implemented a Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. The gate samples peak RSS while extracting perf + malformed corpora and fails the build if any document exceeds its budget.
Changes Made
1. Expanded xtask memory-ceiling command
File: xtask/src/main.rs
- Added support for three memory budget categories:
- Buffered 100-page vector PDF: 512 MB
- Streaming/NDJSON mode (any page count): 256 MB
- Adversarial fixtures: 1 GB hard ceiling
- Added streaming mode testing with
--format ndjson - Generates JSON report (
memory-report.json) with:- Per-document results (peak RSS, duration, budget, pass/fail)
- Summary statistics
- Commit SHA for historical tracking
- Added
MemoryTestResult,MemoryReport,MemoryBudgetJson,MemorySummarystructs
File: xtask/Cargo.toml
- Added
serde_jsondependency for JSON output - Added
humantimedependency for timestamp formatting
2. Updated CI memory-ceiling template
File: .ci/argo-workflows/pdftract-ci.yaml
- Added cgroup MemoryMax enforcement (1.5 GB cap) for clean failure mode
- Supports both cgroup v2 (preferred) and cgroup v1
- Falls back gracefully when cgroup unavailable
- Uses xtask-generated
memory-report.jsonfor artifact upload - Shows summary from report in CI logs
3. Updated fuzz workflow with cgroup enforcement
File: .ci/argo-workflows/pdftract-nightly-fuzz.yaml
- Added cgroup MemoryMax enforcement (1.5 GB cap) to fuzz-target template
- Layered memory enforcement:
- Cgroup MemoryMax: 1536 MB (hard ceiling on entire fuzz run)
- Libfuzzer
-rss_limit_mb=1024(per-execution RSS cap) - Libfuzzer
-malloc_limit_mb=1024(total malloc cap)
- Supports both cgroup v2 (preferred) and cgroup v1
- Falls back to libfuzzer limits when cgroup unavailable
Acceptance Criteria
PASS
- Harness samples peak RSS while extracting perf + malformed corpora
- Build fails if any document exceeds its memory budget
- Test suite runs under cgroup MemoryMax cap (1.5 GB)
- Fuzz suite runs under cgroup MemoryMax cap (1.5 GB)
- Libfuzzer
-rss_limit_mb=1024and-malloc_limit_mb=1024set - Memory targets are now Tier-1 gates
WARN (environmental issues)
None - all infrastructure (cgroups, libfuzzer limits) is standard CI environment
FAIL
None
Implementation Notes
Cgroup Support
The implementation supports both cgroup v2 (preferred) and cgroup v1:
- Cgroup v2: Uses
/sys/fs/cgroup/withmemory.maxcontroller - Cgroup v1: Uses
/sys/fs/cgroup/memory/withmemory.limit_in_bytes - Falls back to libfuzzer limits when cgroup unavailable
Memory Budgets
Per plan.md line 72-80:
| Category | Budget | Measurement |
|---|---|---|
| Peak RSS, 100-page vector PDF (buffered mode) | < 512 MB | tests/fixtures/perf/ |
| Peak RSS, streaming/NDJSON mode (any page count) | < 256 MB | tests/fixtures/perf/ with --format ndjson |
| Peak RSS, adversarial fixtures | < 1 GB | tests/fixtures/malformed/ |
RSS Sampling
The xtask measure_extraction function:
- Spawns pdftract as a child process
- Samples
/proc/[pid]/statusevery 10 ms forVmRSSfield - Tracks peak RSS across the extraction run
- Works on Linux; falls back to time-only measurement on other platforms
JSON Report Format
The memory-report.json artifact includes:
{
"timestamp": "2026-05-23T12:34:56Z",
"commit_sha": "abc123...",
"budgets": {
"buffered_100_page_mb": 512,
"streaming_any_mb": 256,
"adversarial_hard_cap_mb": 1024
},
"results": [
{
"file_name": "example.pdf",
"category": "buffered",
"peak_rss_mb": 123,
"duration_ms": 456,
"budget_mb": 512,
"passed": true,
"error_message": null
}
],
"summary": {
"total_tests": 10,
"passed": 10,
"failed": 0,
"all_passed": true
}
}
Testing
To test locally:
# Run memory ceiling tests
cargo run --release --bin xtask -- memory-ceiling
# Run fuzz tests with memory limits
bash scripts/run-fuzz-with-limits.sh [target]
References
- Plan section: Phase 0.4 Quality Targets - Memory targets (lines 72-80)
- Bead: bf-1g1fd
- CI template:
.ci/argo-workflows/pdftract-ci.yaml(memory-ceiling template) - Fuzz workflow:
.ci/argo-workflows/pdftract-nightly-fuzz.yaml(fuzz-target template)