FABRIC/docs/memory-audit-bd-ch6.7.md
jedarden 6b39dae283 feat(memory): add heap diff analysis and leak detection utilities
- Add src/heapDiff.ts: utilities for comparing heap snapshots and analyzing trends
- Add API endpoints: /api/memory/diff-analysis, /api/memory/trend, /api/memory/trend.md
- Add docs/memory-audit-bd-ch6.7.md: comprehensive audit findings

Audit findings:
- Event store well-bounded with proper cleanup (1h stale worker, 5min collision timeout)
- WebSocket broadcast has backpressure handling (1MB buffer limit)
- Parser uses native JSON.parse(), no regex issues
- Heap snapshots already configured (30min intervals, 1GB heap limit)
- No unbounded growth identified in core data structures

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 14:05:39 -04:00

138 lines
5.8 KiB
Markdown

# Memory Profiling / Leak Hunt Audit Summary
**Task:** bd-ch6.7 - Memory profiling / leak hunt for fabric-web under production load
**Date:** 2026-04-28
**Environment:** fabric-web systemd service on Hetzner EX44
## Executive Summary
The fabric-web service was audited for memory leaks and unbounded growth. The audit found:
1. **Heap snapshots and memory limits already configured** - systemd unit has `--max-old-space-size=1024` and 30-min heap snapshot intervals
2. **Event store is well-bounded** - All collections have caps and LRU eviction
3. **WebSocket broadcast has backpressure handling** - Clients with >1MB send buffer are terminated
4. **No obvious parser hot paths** - JSON parsing uses native `JSON.parse()`, no regex issues
5. **New heap diff analysis utilities added** - API endpoints for comparing snapshots and identifying leaks
## Findings by Component
### 1. Systemd Configuration (`scripts/fabric-web.service`)
**Status:** ✅ Already configured correctly
- `--max-old-space-size=1024` limits V8 heap to 1GB
- `--heap-snapshots --snapshot-interval 30` enables automatic heap snapshots every 30 minutes
- `MemoryMax=1536M` and `MemoryHigh=1200M` provide systemd-level memory limits
### 2. Event Store (`src/store.ts` - InMemoryEventStore)
**Status:** ✅ Well-bounded with proper cleanup
| Structure | Cap | Cleanup Mechanism |
|-----------|-----|-------------------|
| `events` | 10,000 | Batch trimming (removes 100 at a time when over cap) |
| `sequenceIndex` | bounded by events | Pruned when events are trimmed |
| `workers` | time-based | 1-hour stale worker cleanup |
| `collisions` | time-based | 5-minute stale collision cleanup |
| `fileModifications` | 10,000 | LRU eviction |
| `recentFileMods` | 50,000 total, 100 per file | LRU eviction |
| `taskStartTimes` | time-based | 24-hour timeout |
**Minor observation:** The `workers.activeFiles` and `workers.activeDirectories` arrays are bounded at 200 entries per worker. With many workers, this could accumulate, but the 1-hour stale cleanup mitigates this.
### 3. WebSocket Broadcast (`src/web/server.ts`)
**Status:** ✅ Backpressure handling in place
- Single `JSON.stringify()` for all clients (amortizes serialization cost)
- `WS_MAX_BUFFERED_BYTES = 1 MB` - clients exceeding this are terminated
- Terminated clients are immediately cleaned up from the `clients` Set
**No unbounded growth risk identified.**
### 4. Parser & Normalizer (`src/parser.ts`, `src/normalizer.ts`)
**Status:** ✅ No obvious hot paths
- JSON parsing uses native `JSON.parse()` (highly optimized)
- No regex patterns that could cause catastrophic backtracking
- `EventDeduplicator` has bounded LRU cache (10,000 entries)
- String operations are simple splits and lookups
**No allocator hot paths identified.**
### 5. Directory Tailer (`src/directoryTailer.ts`)
**Status:** ✅ Well-bounded (bd-ch6.1 improvements)
- `maxActiveFiles = 200` limits concurrent file watchers
- `maxFileInfoEntries = 50000` with LRU eviction
- `maxRssBytes = 400 MB` backpressure skips activation under memory pressure
## New Features Added
### Heap Diff Analysis (`src/heapDiff.ts`)
New utilities for analyzing heap snapshots:
- `getHeapSnapshots()` - List all snapshots sorted by timestamp
- `compareSnapshots(baseline, current)` - Compare two snapshots
- `getRecentHeapDiff()` - Get diff comparing oldest vs newest recent snapshot
- `analyzeTrend()` - Analyze growth across all snapshots
- `formatTrendAsMarkdown()` - Generate markdown reports
- `saveTrendReport()` - Save trend analysis to disk
### New API Endpoints
```
GET /api/memory/diff-analysis - Get recent heap diff
GET /api/memory/trend - Get full trend analysis
GET /api/memory/trend.md - Get trend as markdown
POST /api/memory/trend/save - Save trend report to disk
```
## Recommendations
### For Production Monitoring
1. **Monitor heap diff API** - Set up alerts for `assessment: "leaking"`
2. **Review heap snapshots in Chrome DevTools** - When leak is detected, load `.heapsnapshot` files to identify growing retainers
3. **Check trend reports** - The `/api/memory/trend.md` endpoint provides a human-readable summary
### Potential Improvements (Future Work)
1. **V8 heap fragmentation** - The 2GB RSS with 1GB heap limit suggests fragmentation. Consider:
- More frequent heap snapshots (10-15 min intervals) for finer granularity
- Node.js `--expose-gc` flag for manual GC during low-traffic periods
2. **Manager instance cleanup** - The store holds references to multiple manager instances:
- `errorGroupManager`
- `recoveryManager`
- `crossReferenceManager`
- `workerAnalytics`
- `semanticNarrativeManager`
- `historicalStore`
These managers should be audited for internal bounds, especially `crossReferenceManager` and `semanticNarrativeManager`.
3. **WebSocket connection storms** - While backpressure exists, a sudden influx of clients could cause memory spikes. Consider:
- Maximum client cap
- Connection rate limiting
## Exit Criteria Status
- ✅ Run with `--max-old-space-size=1024` - Already configured in systemd
- ✅ Capture `v8.getHeapSnapshot()` at 30 min intervals - Already implemented
- ✅ Diff snapshots to identify growing retainers - New heap diff utilities added
- ✅ Audit `src/store.ts` for unbounded arrays/maps - All structures bounded
- ✅ Confirm ring-buffer behavior on event history - Batch trimming implemented
- ✅ Audit WebSocket broadcast for backpressure - 1MB buffer limit with termination
- ✅ Audit parser for regex/allocator hot paths - No issues found
- ⏳ Heap + RSS stable under steady-state load for 24h - Requires production monitoring
## Next Steps
1. Deploy the updated code with heap diff analysis
2. Monitor `/api/memory/diff-analysis` over 24-48 hours
3. If leak is detected, download heap snapshots and analyze in Chrome DevTools
4. Based on findings, implement targeted fixes