pdftract/notes/bf-1g1fd.md
jedarden c621947686 feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement
Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF
extraction, analogous to cargo-bloat for binary size.

Changes:
- CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB)
- CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB)
- CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow
- xtask: Implement memory-ceiling command with peak RSS sampling
- Add perf fixtures (100-page, 10k-page) for memory testing
- Add run-fuzz-with-limits.sh for local fuzz testing with memory caps
- Register perf fixtures in PROVENANCE.md

Memory budgets enforced:
- Buffered 100-page PDF: < 512 MB
- Streaming mode: < 256 MB (constant in page count)
- Adversarial fixtures: < 1 GB hard ceiling

Closes bf-1g1fd

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:22:55 -04:00


# Memory Ceiling Gate Implementation (bf-1g1fd)
## Summary
Implemented a Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. The gate samples peak RSS while extracting the perf and malformed corpora, and fails the build if any document exceeds its budget.
## Changes Made
### 1. Expanded xtask memory-ceiling command
**File:** `xtask/src/main.rs`
- Added support for three memory budget categories:
  - Buffered 100-page vector PDF: 512 MB
  - Streaming/NDJSON mode (any page count): 256 MB
  - Adversarial fixtures: 1 GB hard ceiling
- Added streaming mode testing with `--format ndjson`
- Generates a JSON report (`memory-report.json`) with:
  - Per-document results (peak RSS, duration, budget, pass/fail)
  - Summary statistics
  - Commit SHA for historical tracking
- Added `MemoryTestResult`, `MemoryReport`, `MemoryBudgetJson`, `MemorySummary` structs
**File:** `xtask/Cargo.toml`
- Added `serde_json` dependency for JSON output
- Added `humantime` dependency for timestamp formatting
### 2. Updated CI memory-ceiling template
**File:** `.ci/argo-workflows/pdftract-ci.yaml`
- Added cgroup MemoryMax enforcement (1.5 GB cap) for a clean failure mode
- Supports both cgroup v2 (preferred) and cgroup v1
- Falls back gracefully when cgroups are unavailable
- Uses xtask-generated `memory-report.json` for artifact upload
- Shows summary from report in CI logs
### 3. Updated fuzz workflow with cgroup enforcement
**File:** `.ci/argo-workflows/pdftract-nightly-fuzz.yaml`
- Added cgroup MemoryMax enforcement (1.5 GB cap) to fuzz-target template
- Layered memory enforcement:
  - Cgroup MemoryMax: 1536 MB (hard ceiling on the entire fuzz run)
  - libFuzzer `-rss_limit_mb=1024` (RSS cap for the fuzz process; an input that pushes RSS past it is reported as a failure)
  - libFuzzer `-malloc_limit_mb=1024` (cap on the size of any single allocation, not on total allocated memory)
- Supports both cgroup v2 (preferred) and cgroup v1
- Falls back to the libFuzzer limits when cgroups are unavailable
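The layering above can be sketched in a few lines of Rust. This is an illustrative sketch, not the workflow's actual code: the constant and function names are hypothetical, and it assumes "MB" means MiB (which 1.5 GB = 1536 MB suggests), since cgroup limit files take a byte count.

```rust
// Limits taken from the list above. Names are hypothetical, values are in MiB.
const CGROUP_MEMORY_MAX_MB: u64 = 1536;
const LIBFUZZER_RSS_LIMIT_MB: u64 = 1024;
const LIBFUZZER_MALLOC_LIMIT_MB: u64 = 1024;

/// Convert MiB to the byte count written into a cgroup limit file.
fn mb_to_bytes(mb: u64) -> u64 {
    mb * 1024 * 1024
}

/// Build the libFuzzer flags that form the inner enforcement layers.
fn libfuzzer_flags(rss_mb: u64, malloc_mb: u64) -> Vec<String> {
    vec![
        format!("-rss_limit_mb={}", rss_mb),
        format!("-malloc_limit_mb={}", malloc_mb),
    ]
}

/// The layering is only useful if the inner libFuzzer limit trips
/// before the outer cgroup ceiling does.
fn layering_is_sound(cgroup_mb: u64, rss_mb: u64) -> bool {
    rss_mb < cgroup_mb
}
```

Keeping the libFuzzer limit below the cgroup cap means an over-budget input is normally reported by libFuzzer itself, with the cgroup acting as a backstop for the whole run.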
## Acceptance Criteria
### PASS
- [x] Harness samples peak RSS while extracting perf + malformed corpora
- [x] Build fails if any document exceeds its memory budget
- [x] Test suite runs under cgroup MemoryMax cap (1.5 GB)
- [x] Fuzz suite runs under cgroup MemoryMax cap (1.5 GB)
- [x] libFuzzer `-rss_limit_mb=1024` and `-malloc_limit_mb=1024` are set
- [x] Memory targets are now Tier-1 gates
### WARN (environmental issues)
None. All infrastructure (cgroups, libFuzzer limits) is standard in the CI environment.
### FAIL
None
## Implementation Notes
### Cgroup Support
The implementation supports both cgroup v2 (preferred) and cgroup v1:
- Cgroup v2: Uses the `memory.max` interface file under `/sys/fs/cgroup/`
- Cgroup v1: Uses the `memory.limit_in_bytes` file under `/sys/fs/cgroup/memory/`
- Falls back to the libFuzzer limits when cgroups are unavailable
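The v2/v1/fallback decision can be sketched as follows. The CI templates do this inside the workflow definition, so the Rust below only illustrates the logic. The `CgroupLimitFile` type and `find_limit_file` function are hypothetical names, and the check for `cgroup.controllers` is an assumed way of detecting the unified (v2) hierarchy.

```rust
use std::path::{Path, PathBuf};

/// Which limit file, if any, is available for enforcing the memory cap.
#[derive(Debug, PartialEq)]
enum CgroupLimitFile {
    V2(PathBuf),
    V1(PathBuf),
    Unavailable,
}

/// Locate the memory limit file under `root`, preferring cgroup v2.
/// `root` is normally `/sys/fs/cgroup`; it is a parameter so the logic
/// can be exercised against a scratch directory.
fn find_limit_file(root: &Path) -> CgroupLimitFile {
    // Assumption: a v2 mount exposes `cgroup.controllers` at its root.
    if root.join("cgroup.controllers").is_file() {
        return CgroupLimitFile::V2(root.join("memory.max"));
    }
    // A v1 mount keeps the memory controller in its own subdirectory.
    let v1 = root.join("memory").join("memory.limit_in_bytes");
    if v1.is_file() {
        return CgroupLimitFile::V1(v1);
    }
    // Neither layout found: the caller falls back to other limits.
    CgroupLimitFile::Unavailable
}
```

A caller would write the byte limit into the returned path, or skip cgroup enforcement entirely on `Unavailable`.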
### Memory Budgets
Per plan.md, lines 72-80:
| Category | Budget | Measured on |
|----------|--------|-------------|
| Peak RSS, 100-page vector PDF (buffered mode) | < 512 MB | `tests/fixtures/perf/` |
| Peak RSS, streaming/NDJSON mode (any page count) | < 256 MB | `tests/fixtures/perf/` with `--format ndjson` |
| Peak RSS, adversarial fixtures | < 1 GB | `tests/fixtures/malformed/` |
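As a rough illustration of how the table maps to a pass/fail decision, the sketch below encodes the three budgets and applies the strict "less than" comparison. The `Category` enum and both function names are hypothetical and need not match the types in `xtask/src/main.rs`.

```rust
/// The three budget categories from the table above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Category {
    Buffered,
    Streaming,
    Adversarial,
}

/// Budget in MB for each category, as listed in the table.
fn budget_mb(category: Category) -> u64 {
    match category {
        Category::Buffered => 512,
        Category::Streaming => 256,
        Category::Adversarial => 1024,
    }
}

/// A document passes only if its peak RSS is strictly below the budget.
fn within_budget(category: Category, peak_rss_mb: u64) -> bool {
    peak_rss_mb < budget_mb(category)
}
```

Under this reading, a streaming run that peaks at exactly 256 MB fails, because the budgets are stated as strict upper bounds.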
### RSS Sampling
The xtask `measure_extraction` function:
- Spawns pdftract as a child process
- Samples `/proc/[pid]/status` every 10 ms and reads the `VmRSS` field
- Tracks peak RSS across the extraction run
- Works on Linux; falls back to time-only measurement on other platforms
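A minimal sketch of this sampling loop is shown below. It is not the actual `measure_extraction` code: the helper names `parse_vm_rss_kb` and `sample_peak_rss_kb` are hypothetical, and the sketch assumes the `VmRSS:` line carries a value in kB, as it does on Linux.

```rust
use std::fs;
use std::process::Child;
use std::thread;
use std::time::Duration;

/// Extract the `VmRSS` value (in kB) from the text of `/proc/<pid>/status`.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kb| kb.parse().ok())
}

/// Poll the child's status file until it exits and return the peak RSS seen.
fn sample_peak_rss_kb(child: &mut Child, interval: Duration) -> std::io::Result<u64> {
    let path = format!("/proc/{}/status", child.id());
    let mut peak: u64 = 0;
    loop {
        // The file disappears once the child is reaped, so read errors are ignored.
        if let Ok(status) = fs::read_to_string(&path) {
            if let Some(kb) = parse_vm_rss_kb(&status) {
                peak = peak.max(kb);
            }
        }
        if child.try_wait()?.is_some() {
            return Ok(peak);
        }
        thread::sleep(interval);
    }
}
```

Because this is polling, a spike shorter than the sampling interval can be missed. The same status file also exposes `VmHWM`, the kernel's own high-water mark for resident memory, which could serve as a cross-check.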
### JSON Report Format
The `memory-report.json` artifact includes:
```json
{
  "timestamp": "2026-05-23T12:34:56Z",
  "commit_sha": "abc123...",
  "budgets": {
    "buffered_100_page_mb": 512,
    "streaming_any_mb": 256,
    "adversarial_hard_cap_mb": 1024
  },
  "results": [
    {
      "file_name": "example.pdf",
      "category": "buffered",
      "peak_rss_mb": 123,
      "duration_ms": 456,
      "budget_mb": 512,
      "passed": true,
      "error_message": null
    }
  ],
  "summary": {
    "total_tests": 10,
    "passed": 10,
    "failed": 0,
    "all_passed": true
  }
}
```
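The `summary` object can be derived from the `results` array. The sketch below shows one way to do that. The struct names match those listed for `xtask/src/main.rs`, but their fields here are inferred from the JSON sample and trimmed for brevity, and `summarize` is a hypothetical helper.

```rust
/// One entry of the `results` array (fields inferred from the JSON sample).
#[derive(Debug)]
struct MemoryTestResult {
    file_name: String,
    peak_rss_mb: u64,
    budget_mb: u64,
    passed: bool,
}

/// The `summary` object of the report.
#[derive(Debug, PartialEq)]
struct MemorySummary {
    total_tests: usize,
    passed: usize,
    failed: usize,
    all_passed: bool,
}

/// Aggregate per-document results into the summary block.
fn summarize(results: &[MemoryTestResult]) -> MemorySummary {
    let total_tests = results.len();
    let passed = results.iter().filter(|r| r.passed).count();
    MemorySummary {
        total_tests,
        passed,
        failed: total_tests - passed,
        all_passed: passed == total_tests,
    }
}
```

A gate built on this would exit with a non-zero status whenever `all_passed` is false, which is what makes the build fail.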
## Testing
To test locally:
```bash
# Run memory ceiling tests
cargo run --release --bin xtask -- memory-ceiling
# Run fuzz tests with memory limits
bash scripts/run-fuzz-with-limits.sh [target]
```
## References
- Plan section: Phase 0.4 Quality Targets - Memory targets (lines 72-80)
- Bead: bf-1g1fd
- CI template: `.ci/argo-workflows/pdftract-ci.yaml` (memory-ceiling template)
- Fuzz workflow: `.ci/argo-workflows/pdftract-nightly-fuzz.yaml` (fuzz-target template)