pdftract/notes/bf-1g1fd.md
jedarden c621947686 feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement
Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF
extraction, analogous to cargo-bloat for binary size.

Changes:
- CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB)
- CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB)
- CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow
- xtask: Implement memory-ceiling command with peak RSS sampling
- Add perf fixtures (100-page, 10k-page) for memory testing
- Add run-fuzz-with-limits.sh for local fuzz testing with memory caps
- Register perf fixtures in PROVENANCE.md

Memory budgets enforced:
- Buffered 100-page PDF: < 512 MB
- Streaming mode: < 256 MB (constant in page count)
- Adversarial fixtures: < 1 GB hard ceiling

Closes bf-1g1fd

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:22:55 -04:00

4.4 KiB

Memory Ceiling Gate Implementation (bf-1g1fd)

Summary

Implemented a Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. The gate samples peak RSS while extracting perf + malformed corpora and fails the build if any document exceeds its budget.

Changes Made

1. Expanded xtask memory-ceiling command

File: xtask/src/main.rs

  • Added support for three memory budget categories:
    • Buffered 100-page vector PDF: 512 MB
    • Streaming/NDJSON mode (any page count): 256 MB
    • Adversarial fixtures: 1 GB hard ceiling
  • Added streaming mode testing with --format ndjson
  • Generates JSON report (memory-report.json) with:
    • Per-document results (peak RSS, duration, budget, pass/fail)
    • Summary statistics
    • Commit SHA for historical tracking
  • Added MemoryTestResult, MemoryReport, MemoryBudgetJson, MemorySummary structs

File: xtask/Cargo.toml

  • Added serde_json dependency for JSON output
  • Added humantime dependency for timestamp formatting

2. Updated CI memory-ceiling template

File: .ci/argo-workflows/pdftract-ci.yaml

  • Added cgroup MemoryMax enforcement (1.5 GB cap) for clean failure mode
  • Supports both cgroup v2 (preferred) and cgroup v1
  • Falls back gracefully when cgroup unavailable
  • Uses xtask-generated memory-report.json for artifact upload
  • Shows summary from report in CI logs

3. Updated fuzz workflow with cgroup enforcement

File: .ci/argo-workflows/pdftract-nightly-fuzz.yaml

  • Added cgroup MemoryMax enforcement (1.5 GB cap) to fuzz-target template
  • Layered memory enforcement:
    • Cgroup MemoryMax: 1536 MB (hard ceiling on entire fuzz run)
    • Libfuzzer -rss_limit_mb=1024 (per-execution RSS cap)
    • Libfuzzer -malloc_limit_mb=1024 (total malloc cap)
  • Supports both cgroup v2 (preferred) and cgroup v1
  • Falls back to libfuzzer limits when cgroup unavailable

Acceptance Criteria

PASS

  • Harness samples peak RSS while extracting perf + malformed corpora
  • Build fails if any document exceeds its memory budget
  • Test suite runs under cgroup MemoryMax cap (1.5 GB)
  • Fuzz suite runs under cgroup MemoryMax cap (1.5 GB)
  • Libfuzzer -rss_limit_mb=1024 and -malloc_limit_mb=1024 set
  • Memory targets are now Tier-1 gates

WARN (environmental issues)

None - all infrastructure (cgroups, libfuzzer limits) is standard CI environment

FAIL

None

Implementation Notes

Cgroup Support

The implementation supports both cgroup v2 (preferred) and cgroup v1:

  • Cgroup v2: Uses /sys/fs/cgroup/ with memory.max controller
  • Cgroup v1: Uses /sys/fs/cgroup/memory/ with memory.limit_in_bytes
  • Falls back to libfuzzer limits when cgroup unavailable

Memory Budgets

Per plan.md line 72-80:

Category Budget Measurement
Peak RSS, 100-page vector PDF (buffered mode) < 512 MB tests/fixtures/perf/
Peak RSS, streaming/NDJSON mode (any page count) < 256 MB tests/fixtures/perf/ with --format ndjson
Peak RSS, adversarial fixtures < 1 GB tests/fixtures/malformed/

RSS Sampling

The xtask measure_extraction function:

  • Spawns pdftract as a child process
  • Samples /proc/[pid]/status every 10 ms for VmRSS field
  • Tracks peak RSS across the extraction run
  • Works on Linux; falls back to time-only measurement on other platforms

JSON Report Format

The memory-report.json artifact includes:

{
  "timestamp": "2026-05-23T12:34:56Z",
  "commit_sha": "abc123...",
  "budgets": {
    "buffered_100_page_mb": 512,
    "streaming_any_mb": 256,
    "adversarial_hard_cap_mb": 1024
  },
  "results": [
    {
      "file_name": "example.pdf",
      "category": "buffered",
      "peak_rss_mb": 123,
      "duration_ms": 456,
      "budget_mb": 512,
      "passed": true,
      "error_message": null
    }
  ],
  "summary": {
    "total_tests": 10,
    "passed": 10,
    "failed": 0,
    "all_passed": true
  }
}

Testing

To test locally:

# Run memory ceiling tests
cargo run --release --bin xtask -- memory-ceiling

# Run fuzz tests with memory limits
bash scripts/run-fuzz-with-limits.sh [target]

References

  • Plan section: Phase 0.4 Quality Targets - Memory targets (lines 72-80)
  • Bead: bf-1g1fd
  • CI template: .ci/argo-workflows/pdftract-ci.yaml (memory-ceiling template)
  • Fuzz workflow: .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz-target template)