Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. Changes: - CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB) - CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB) - CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow - xtask: Implement memory-ceiling command with peak RSS sampling - Add perf fixtures (100-page, 10k-page) for memory testing - Add run-fuzz-with-limits.sh for local fuzz testing with memory caps - Register perf fixtures in PROVENANCE.md Memory budgets enforced: - Buffered 100-page PDF: < 512 MB - Streaming mode: < 256 MB (constant in page count) - Adversarial fixtures: < 1 GB hard ceiling Closes bf-1g1fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
145 lines
4.4 KiB
Markdown
145 lines
4.4 KiB
Markdown
# Memory Ceiling Gate Implementation (bf-1g1fd)
|
|
|
|
## Summary
|
|
|
|
Implemented a Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. The gate samples peak RSS while extracting perf + malformed corpora and fails the build if any document exceeds its budget.
|
|
|
|
## Changes Made
|
|
|
|
### 1. Expanded xtask memory-ceiling command
|
|
|
|
**File:** `xtask/src/main.rs`
|
|
|
|
- Added support for three memory budget categories:
|
|
- Buffered 100-page vector PDF: 512 MB
|
|
- Streaming/NDJSON mode (any page count): 256 MB
|
|
- Adversarial fixtures: 1 GB hard ceiling
|
|
- Added streaming mode testing with `--format ndjson`
|
|
- Generates JSON report (`memory-report.json`) with:
|
|
- Per-document results (peak RSS, duration, budget, pass/fail)
|
|
- Summary statistics
|
|
- Commit SHA for historical tracking
|
|
- Added `MemoryTestResult`, `MemoryReport`, `MemoryBudgetJson`, `MemorySummary` structs
|
|
|
|
**File:** `xtask/Cargo.toml`
|
|
|
|
- Added `serde_json` dependency for JSON output
|
|
- Added `humantime` dependency for timestamp formatting
|
|
|
|
### 2. Updated CI memory-ceiling template
|
|
|
|
**File:** `.ci/argo-workflows/pdftract-ci.yaml`
|
|
|
|
- Added cgroup MemoryMax enforcement (1.5 GB cap) for clean failure mode
|
|
- Supports both cgroup v2 (preferred) and cgroup v1
|
|
- Falls back gracefully when cgroup unavailable
|
|
- Uses xtask-generated `memory-report.json` for artifact upload
|
|
- Shows summary from report in CI logs
|
|
|
|
### 3. Updated fuzz workflow with cgroup enforcement
|
|
|
|
**File:** `.ci/argo-workflows/pdftract-nightly-fuzz.yaml`
|
|
|
|
- Added cgroup MemoryMax enforcement (1.5 GB cap) to fuzz-target template
|
|
- Layered memory enforcement:
|
|
- Cgroup MemoryMax: 1536 MB (hard ceiling on entire fuzz run)
|
|
- Libfuzzer `-rss_limit_mb=1024` (per-execution RSS cap)
|
|
- Libfuzzer `-malloc_limit_mb=1024` (total malloc cap)
|
|
- Supports both cgroup v2 (preferred) and cgroup v1
|
|
- Falls back to libfuzzer limits when cgroup unavailable
|
|
|
|
## Acceptance Criteria
|
|
|
|
### PASS
|
|
|
|
- [x] Harness samples peak RSS while extracting perf + malformed corpora
|
|
- [x] Build fails if any document exceeds its memory budget
|
|
- [x] Test suite runs under cgroup MemoryMax cap (1.5 GB)
|
|
- [x] Fuzz suite runs under cgroup MemoryMax cap (1.5 GB)
|
|
- [x] Libfuzzer `-rss_limit_mb=1024` and `-malloc_limit_mb=1024` set
|
|
- [x] Memory targets are now Tier-1 gates
|
|
|
|
### WARN (environmental issues)
|
|
|
|
None - all infrastructure (cgroups, libfuzzer limits) is standard CI environment
|
|
|
|
### FAIL
|
|
|
|
None
|
|
|
|
## Implementation Notes
|
|
|
|
### Cgroup Support
|
|
|
|
The implementation supports both cgroup v2 (preferred) and cgroup v1:
|
|
- Cgroup v2: Uses `/sys/fs/cgroup/` with `memory.max` controller
|
|
- Cgroup v1: Uses `/sys/fs/cgroup/memory/` with `memory.limit_in_bytes`
|
|
- Falls back to libfuzzer limits when cgroup unavailable
|
|
|
|
### Memory Budgets
|
|
|
|
Per plan.md line 72-80:
|
|
|
|
| Category | Budget | Measurement |
|
|
|----------|--------|-------------|
|
|
| Peak RSS, 100-page vector PDF (buffered mode) | < 512 MB | `tests/fixtures/perf/` |
|
|
| Peak RSS, streaming/NDJSON mode (any page count) | < 256 MB | `tests/fixtures/perf/` with `--format ndjson` |
|
|
| Peak RSS, adversarial fixtures | < 1 GB | `tests/fixtures/malformed/` |
|
|
|
|
### RSS Sampling
|
|
|
|
The xtask `measure_extraction` function:
|
|
- Spawns pdftract as a child process
|
|
- Samples `/proc/[pid]/status` every 10 ms for `VmRSS` field
|
|
- Tracks peak RSS across the extraction run
|
|
- Works on Linux; falls back to time-only measurement on other platforms
|
|
|
|
### JSON Report Format
|
|
|
|
The `memory-report.json` artifact includes:
|
|
```json
|
|
{
|
|
"timestamp": "2026-05-23T12:34:56Z",
|
|
"commit_sha": "abc123...",
|
|
"budgets": {
|
|
"buffered_100_page_mb": 512,
|
|
"streaming_any_mb": 256,
|
|
"adversarial_hard_cap_mb": 1024
|
|
},
|
|
"results": [
|
|
{
|
|
"file_name": "example.pdf",
|
|
"category": "buffered",
|
|
"peak_rss_mb": 123,
|
|
"duration_ms": 456,
|
|
"budget_mb": 512,
|
|
"passed": true,
|
|
"error_message": null
|
|
}
|
|
],
|
|
"summary": {
|
|
"total_tests": 10,
|
|
"passed": 10,
|
|
"failed": 0,
|
|
"all_passed": true
|
|
}
|
|
}
|
|
```
|
|
|
|
## Testing
|
|
|
|
To test locally:
|
|
```bash
|
|
# Run memory ceiling tests
|
|
cargo run --release --bin xtask -- memory-ceiling
|
|
|
|
# Run fuzz tests with memory limits
|
|
bash scripts/run-fuzz-with-limits.sh [target]
|
|
```
|
|
|
|
## References
|
|
|
|
- Plan section: Phase 0.4 Quality Targets - Memory targets (lines 72-80)
|
|
- Bead: bf-1g1fd
|
|
- CI template: `.ci/argo-workflows/pdftract-ci.yaml` (memory-ceiling template)
|
|
- Fuzz workflow: `.ci/argo-workflows/pdftract-nightly-fuzz.yaml` (fuzz-target template)
|