FABRIC/docs/memory-audit-bd-ch6.7.md
jedarden 6b39dae283 feat(memory): add heap diff analysis and leak detection utilities
- Add src/heapDiff.ts: utilities for comparing heap snapshots and analyzing trends
- Add API endpoints: /api/memory/diff-analysis, /api/memory/trend, /api/memory/trend.md
- Add docs/memory-audit-bd-ch6.7.md: comprehensive audit findings

Audit findings:
- Event store well-bounded with proper cleanup (1h stale worker, 5min collision timeout)
- WebSocket broadcast has backpressure handling (1MB buffer limit)
- Parser uses native JSON.parse(), no regex issues
- Heap snapshots already configured (30min intervals, 1GB heap limit)
- No unbounded growth identified in core data structures

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 14:05:39 -04:00

Memory Profiling / Leak Hunt Audit Summary

Task: bd-ch6.7 - Memory profiling / leak hunt for fabric-web under production load
Date: 2026-04-28
Environment: fabric-web systemd service on Hetzner EX44

Executive Summary

The fabric-web service was audited for memory leaks and unbounded growth. The audit found:

  1. Heap snapshots and memory limits already configured - systemd unit has --max-old-space-size=1024 and 30-min heap snapshot intervals
  2. Event store is well-bounded - Every collection has either a size cap (batch trimming or LRU eviction) or time-based cleanup
  3. WebSocket broadcast has backpressure handling - Clients with >1MB send buffer are terminated
  4. No obvious parser hot paths - JSON parsing uses native JSON.parse(), no regex issues
  5. New heap diff analysis utilities added - API endpoints for comparing snapshots and identifying leaks

Findings by Component

1. Systemd Configuration (scripts/fabric-web.service)

Status: Already configured correctly

  • --max-old-space-size=1024 caps the V8 old-generation heap at 1024 MiB (about 1GB)
  • --heap-snapshots --snapshot-interval 30 enables automatic heap snapshots every 30 minutes
  • MemoryHigh=1200M (throttling threshold) and MemoryMax=1536M (hard limit) provide systemd-level memory limits

2. Event Store (src/store.ts - InMemoryEventStore)

Status: Well-bounded with proper cleanup

| Structure | Cap | Cleanup Mechanism |
|---|---|---|
| events | 10,000 | Batch trimming (removes 100 at a time when over cap) |
| sequenceIndex | bounded by events | Pruned when events are trimmed |
| workers | time-based | 1-hour stale worker cleanup |
| collisions | time-based | 5-minute stale collision cleanup |
| fileModifications | 10,000 | LRU eviction |
| recentFileMods | 50,000 total, 100 per file | LRU eviction |
| taskStartTimes | time-based | 24-hour timeout |

Minor observation: The workers.activeFiles and workers.activeDirectories arrays are bounded at 200 entries per worker, so their combined size scales with the number of tracked workers. The 1-hour stale worker cleanup limits how many workers accumulate.
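The batch-trimming behavior described for events can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/store.ts: the class name BoundedEventLog and the parameters maxEvents and trimBatch are hypothetical, and only the 10,000 cap and the batch size of 100 come from the table above.

```typescript
// Hypothetical sketch of a capped event log with batch trimming.
// BoundedEventLog, maxEvents and trimBatch are illustrative names only.
class BoundedEventLog<T> {
  private events: T[] = [];
  private readonly maxEvents: number;
  private readonly trimBatch: number;

  constructor(maxEvents = 10_000, trimBatch = 100) {
    this.maxEvents = maxEvents;
    this.trimBatch = trimBatch;
  }

  append(event: T): void {
    this.events.push(event);
    // Once the cap is exceeded, drop the oldest `trimBatch` events in one
    // splice. The next trim only happens after `trimBatch` further appends,
    // so the O(n) splice cost is not paid on every append.
    if (this.events.length > this.maxEvents) {
      this.events.splice(0, this.trimBatch);
    }
  }

  get size(): number {
    return this.events.length;
  }

  oldest(): T | undefined {
    return this.events[0];
  }
}
```

With this shape the log never holds more than maxEvents entries after an append returns. Any secondary index keyed by event (such as sequenceIndex) would need to be pruned in the same step.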

3. WebSocket Broadcast (src/web/server.ts)

Status: Backpressure handling in place

  • Single JSON.stringify() for all clients (amortizes serialization cost)
  • WS_MAX_BUFFERED_BYTES = 1 MB - clients exceeding this are terminated
  • Terminated clients are immediately cleaned up from the clients Set

No unbounded growth risk identified.
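The broadcast pattern above can be sketched as follows. This is a simplified illustration, not the code in src/web/server.ts: the ClientLike interface and the broadcast function are hypothetical, and the interface lists only the three socket members the sketch needs (the `ws` package's WebSocket exposes bufferedAmount, send and terminate). WS_MAX_BUFFERED_BYTES is taken from the audit.

```typescript
// Minimal socket shape for the sketch (hypothetical interface).
interface ClientLike {
  readonly bufferedAmount: number;
  send(data: string): void;
  terminate(): void;
}

const WS_MAX_BUFFERED_BYTES = 1024 * 1024; // 1 MB, as stated in the audit

// Sends `message` to every healthy client and returns how many received it.
function broadcast(clients: Set<ClientLike>, message: unknown): number {
  // Serialize once and reuse the same string for every client.
  const payload = JSON.stringify(message);
  let delivered = 0;
  for (const client of clients) {
    if (client.bufferedAmount > WS_MAX_BUFFERED_BYTES) {
      // Slow consumer: drop it rather than let its send buffer keep growing.
      client.terminate();
      clients.delete(client); // deleting the current entry during Set iteration is safe
      continue;
    }
    client.send(payload);
    delivered++;
  }
  return delivered;
}
```

Removing the client from the Set in the same pass is what keeps a terminated socket from being retained until a later close event.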

4. Parser & Normalizer (src/parser.ts, src/normalizer.ts)

Status: No obvious hot paths

  • JSON parsing uses native JSON.parse() (highly optimized)
  • No regex patterns that could cause catastrophic backtracking
  • EventDeduplicator has bounded LRU cache (10,000 entries)
  • String operations are simple splits and lookups

No allocator hot paths identified.
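A bounded LRU cache of the kind attributed to EventDeduplicator can be built on a Map, which iterates in insertion order. The sketch below is an assumption about the general technique, not the actual EventDeduplicator implementation; the class name LruSet and the method checkAndAdd are hypothetical, and only the 10,000-entry default comes from the audit.

```typescript
// Hypothetical LRU set for deduplication; not the real EventDeduplicator.
class LruSet {
  private readonly seen = new Map<string, true>();
  private readonly capacity: number;

  constructor(capacity = 10_000) {
    this.capacity = capacity;
  }

  /** Records `key` and returns true if it was already present (a duplicate). */
  checkAndAdd(key: string): boolean {
    // Delete then re-insert so the key moves to the most-recent position.
    const duplicate = this.seen.delete(key);
    this.seen.set(key, true);
    if (this.seen.size > this.capacity) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    return duplicate;
  }

  get size(): number {
    return this.seen.size;
  }
}
```

Each call does a constant number of Map operations, so the cache stays bounded without a periodic sweep.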

5. Directory Tailer (src/directoryTailer.ts)

Status: Well-bounded (bd-ch6.1 improvements)

  • maxActiveFiles = 200 limits concurrent file watchers
  • maxFileInfoEntries = 50000 with LRU eviction
  • maxRssBytes = 400 MB acts as backpressure: new file activations are skipped under memory pressure
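The two activation limits above can be combined into a single admission check. The following is a minimal sketch, assuming the tailer consults both limits before activating a file; the TailerLimits interface and the canActivate function are hypothetical names, and the comparison operators are assumptions rather than confirmed behavior of src/directoryTailer.ts.

```typescript
// Hypothetical admission check; the limit values are the ones listed in the audit.
interface TailerLimits {
  maxActiveFiles: number;
  maxRssBytes: number;
}

const DEFAULT_LIMITS: TailerLimits = {
  maxActiveFiles: 200,
  maxRssBytes: 400 * 1024 * 1024,
};

// Returns true if one more file watcher may be activated right now.
function canActivate(
  activeFiles: number,
  rssBytes: number,
  limits: TailerLimits = DEFAULT_LIMITS,
): boolean {
  if (activeFiles >= limits.maxActiveFiles) return false; // watcher cap reached
  if (rssBytes >= limits.maxRssBytes) return false; // memory pressure: skip activation
  return true;
}
```

In a Node.js process the second argument would typically be process.memoryUsage().rss, sampled at the time of the check.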

New Features Added

Heap Diff Analysis (src/heapDiff.ts)

New utilities for analyzing heap snapshots:

  • getHeapSnapshots() - List all snapshots sorted by timestamp
  • compareSnapshots(baseline, current) - Compare two snapshots
  • getRecentHeapDiff() - Get diff comparing oldest vs newest recent snapshot
  • analyzeTrend() - Analyze growth across all snapshots
  • formatTrendAsMarkdown() - Generate markdown reports
  • saveTrendReport() - Save trend analysis to disk
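One common way to analyze growth across snapshots is a least-squares fit of heap size against time. The sketch below illustrates that technique only; it is not necessarily what analyzeTrend() does. The HeapSample interface and growthBytesPerHour function are hypothetical, and the input is assumed to be one heap-size figure per snapshot.

```typescript
// Hypothetical input: one heap-size measurement per snapshot.
interface HeapSample {
  timestampMs: number;
  usedBytes: number;
}

// Least-squares slope of usedBytes over time, in bytes per hour.
// Returns 0 when fewer than two samples exist or all timestamps are equal.
function growthBytesPerHour(samples: HeapSample[]): number {
  const first = samples[0];
  if (first === undefined || samples.length < 2) return 0;
  const n = samples.length;
  let sumX = 0;
  let sumY = 0;
  let sumXY = 0;
  let sumXX = 0;
  for (const s of samples) {
    const x = (s.timestampMs - first.timestampMs) / 3_600_000; // hours since first sample
    sumX += x;
    sumY += s.usedBytes;
    sumXY += x * s.usedBytes;
    sumXX += x * x;
  }
  const denom = n * sumXX - sumX * sumX;
  return denom === 0 ? 0 : (n * sumXY - sumX * sumY) / denom;
}
```

A slope fitted over many snapshots is less sensitive to a single GC-timing outlier than a plain oldest-versus-newest difference, which is why the two views are usually read together.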

New API Endpoints

GET  /api/memory/diff-analysis  - Get recent heap diff
GET  /api/memory/trend          - Get full trend analysis
GET  /api/memory/trend.md       - Get trend as markdown
POST /api/memory/trend/save     - Save trend report to disk

Recommendations

For Production Monitoring

  1. Monitor heap diff API - Poll /api/memory/diff-analysis and alert when it reports assessment: "leaking"
  2. Review heap snapshots in Chrome DevTools - When leak is detected, load .heapsnapshot files to identify growing retainers
  3. Check trend reports - The /api/memory/trend.md endpoint provides a human-readable summary
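An alert check against the diff-analysis endpoint could look like the sketch below. The only response field this document mentions is assessment, so the rest of the response shape is deliberately left unknown; isLeaking, pollOnce and the injected FetchJson type are hypothetical names introduced for illustration.

```typescript
// The fetcher is injected so the sketch has no runtime or typing dependencies.
type FetchJson = (url: string) => Promise<unknown>;

// True only when the body is an object whose `assessment` field equals "leaking".
// All other fields of the response are unknown here and are ignored.
function isLeaking(body: unknown): boolean {
  if (typeof body !== "object" || body === null) return false;
  const assessment = (body as { assessment?: unknown }).assessment;
  return assessment === "leaking";
}

// Fetches the diff analysis once and reports whether an alert should fire.
async function pollOnce(baseUrl: string, fetchJson: FetchJson): Promise<boolean> {
  const body = await fetchJson(`${baseUrl}/api/memory/diff-analysis`);
  return isLeaking(body);
}
```

On Node.js 18 or later, the fetcher could be `(url) => fetch(url).then((r) => r.json())`; how often to poll and where to send the alert are left to the monitoring setup.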

Potential Improvements (Future Work)

  1. V8 heap fragmentation - An RSS of 2GB against a 1GB heap limit suggests fragmentation or off-heap allocations (for example Buffers), which --max-old-space-size does not limit. Consider:

    • More frequent heap snapshots (10-15 min intervals) for finer granularity
    • Node.js --expose-gc flag for manual GC during low-traffic periods
  2. Manager instance cleanup - The store holds references to multiple manager instances:

    • errorGroupManager
    • recoveryManager
    • crossReferenceManager
    • workerAnalytics
    • semanticNarrativeManager
    • historicalStore

    These managers should be audited for internal bounds, especially crossReferenceManager and semanticNarrativeManager.

  3. WebSocket connection storms - While backpressure exists, a sudden influx of clients could cause memory spikes. Consider:

    • Maximum client cap
    • Connection rate limiting
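The two connection-storm mitigations can be combined in one admission gate. The sketch below uses a fixed-window counter for rate limiting, which is one simple option among several; the ConnectionGate class, its parameters and every numeric value in the usage are hypothetical, since the audit does not propose specific limits.

```typescript
// Hypothetical admission gate: a hard client cap plus a fixed-window rate limit.
class ConnectionGate {
  private readonly maxClients: number;
  private readonly maxPerWindow: number;
  private readonly windowMs: number;
  private windowStart = Number.NEGATIVE_INFINITY;
  private acceptedInWindow = 0;

  constructor(maxClients: number, maxPerWindow: number, windowMs: number) {
    this.maxClients = maxClients;
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
  }

  /** Returns true if a new connection should be accepted at time `nowMs`. */
  admit(currentClients: number, nowMs: number): boolean {
    // Start a fresh window once the current one has elapsed.
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs;
      this.acceptedInWindow = 0;
    }
    if (currentClients >= this.maxClients) return false; // client cap reached
    if (this.acceptedInWindow >= this.maxPerWindow) return false; // too many accepts this window
    this.acceptedInWindow++;
    return true;
  }
}
```

A server would call admit(clients.size, Date.now()) when a connection arrives and close the socket immediately on false, so a rejected client never gets a send buffer or an entry in the clients Set.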

Exit Criteria Status

  • Run with --max-old-space-size=1024 - Already configured in systemd
  • Capture v8.getHeapSnapshot() at 30 min intervals - Already implemented
  • Diff snapshots to identify growing retainers - New heap diff utilities added
  • Audit src/store.ts for unbounded arrays/maps - All structures bounded
  • Confirm ring-buffer behavior on event history - Batch trimming implemented
  • Audit WebSocket broadcast for backpressure - 1MB buffer limit with termination
  • Audit parser for regex/allocator hot paths - No issues found
  • Heap + RSS stable under steady-state load for 24h - Pending; requires production monitoring

Next Steps

  1. Deploy the updated code with heap diff analysis
  2. Monitor /api/memory/diff-analysis over 24-48 hours
  3. If leak is detected, download heap snapshots and analyze in Chrome DevTools
  4. Based on findings, implement targeted fixes