- Add src/heapDiff.ts: utilities for comparing heap snapshots and analyzing trends - Add API endpoints: /api/memory/diff-analysis, /api/memory/trend, /api/memory/trend.md - Add docs/memory-audit-bd-ch6.7.md: comprehensive audit findings Audit findings: - Event store well-bounded with proper cleanup (1h stale worker, 5min collision timeout) - WebSocket broadcast has backpressure handling (1MB buffer limit) - Parser uses native JSON.parse(), no regex issues - Heap snapshots already configured (30min intervals, 1GB heap limit) - No unbounded growth identified in core data structures Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
138 lines
5.8 KiB
Markdown
138 lines
5.8 KiB
Markdown
# Memory Profiling / Leak Hunt Audit Summary
|
|
|
|
**Task:** bd-ch6.7 - Memory profiling / leak hunt for fabric-web under production load
|
|
**Date:** 2026-04-28
|
|
**Environment:** fabric-web systemd service on Hetzner EX44
|
|
|
|
## Executive Summary
|
|
|
|
The fabric-web service was audited for memory leaks and unbounded growth. The audit found:
|
|
|
|
1. **Heap snapshots and memory limits already configured** - systemd unit has `--max-old-space-size=1024` and 30-min heap snapshot intervals
|
|
2. **Event store is well-bounded** - All collections have caps and LRU eviction
|
|
3. **WebSocket broadcast has backpressure handling** - Clients with >1MB send buffer are terminated
|
|
4. **No obvious parser hot paths** - JSON parsing uses native `JSON.parse()`, no regex issues
|
|
5. **New heap diff analysis utilities added** - API endpoints for comparing snapshots and identifying leaks
|
|
|
|
## Findings by Component
|
|
|
|
### 1. Systemd Configuration (`scripts/fabric-web.service`)
|
|
|
|
**Status:** ✅ Already configured correctly
|
|
|
|
- `--max-old-space-size=1024` limits V8 heap to 1GB
|
|
- `--heap-snapshots --snapshot-interval 30` enables automatic heap snapshots every 30 minutes
|
|
- `MemoryMax=1536M` and `MemoryHigh=1200M` provide systemd-level memory limits
|
|
|
|
### 2. Event Store (`src/store.ts` - InMemoryEventStore)
|
|
|
|
**Status:** ✅ Well-bounded with proper cleanup
|
|
|
|
| Structure | Cap | Cleanup Mechanism |
|
|
|-----------|-----|-------------------|
|
|
| `events` | 10,000 | Batch trimming (removes 100 at a time when over cap) |
|
|
| `sequenceIndex` | bounded by events | Pruned when events are trimmed |
|
|
| `workers` | time-based | 1-hour stale worker cleanup |
|
|
| `collisions` | time-based | 5-minute stale collision cleanup |
|
|
| `fileModifications` | 10,000 | LRU eviction |
|
|
| `recentFileMods` | 50,000 total, 100 per file | LRU eviction |
|
|
| `taskStartTimes` | time-based | 24-hour timeout |
|
|
|
|
**Minor observation:** The `workers.activeFiles` and `workers.activeDirectories` arrays are bounded at 200 entries per worker. With many workers, this could accumulate, but the 1-hour stale cleanup mitigates this.
|
|
|
|
### 3. WebSocket Broadcast (`src/web/server.ts`)
|
|
|
|
**Status:** ✅ Backpressure handling in place
|
|
|
|
- Single `JSON.stringify()` for all clients (amortizes serialization cost)
|
|
- `WS_MAX_BUFFERED_BYTES = 1 MB` - clients exceeding this are terminated
|
|
- Terminated clients are immediately cleaned up from the `clients` Set
|
|
|
|
**No unbounded growth risk identified.**
|
|
|
|
### 4. Parser & Normalizer (`src/parser.ts`, `src/normalizer.ts`)
|
|
|
|
**Status:** ✅ No obvious hot paths
|
|
|
|
- JSON parsing uses native `JSON.parse()` (highly optimized)
|
|
- No regex patterns that could cause catastrophic backtracking
|
|
- `EventDeduplicator` has bounded LRU cache (10,000 entries)
|
|
- String operations are simple splits and lookups
|
|
|
|
**No allocator hot paths identified.**
|
|
|
|
### 5. Directory Tailer (`src/directoryTailer.ts`)
|
|
|
|
**Status:** ✅ Well-bounded (bd-ch6.1 improvements)
|
|
|
|
- `maxActiveFiles = 200` limits concurrent file watchers
|
|
- `maxFileInfoEntries = 50000` with LRU eviction
|
|
- `maxRssBytes = 400 MB` backpressure skips activation under memory pressure
|
|
|
|
## New Features Added
|
|
|
|
### Heap Diff Analysis (`src/heapDiff.ts`)
|
|
|
|
New utilities for analyzing heap snapshots:
|
|
|
|
- `getHeapSnapshots()` - List all snapshots sorted by timestamp
|
|
- `compareSnapshots(baseline, current)` - Compare two snapshots
|
|
- `getRecentHeapDiff()` - Get diff comparing oldest vs newest recent snapshot
|
|
- `analyzeTrend()` - Analyze growth across all snapshots
|
|
- `formatTrendAsMarkdown()` - Generate markdown reports
|
|
- `saveTrendReport()` - Save trend analysis to disk
|
|
|
|
### New API Endpoints
|
|
|
|
```
|
|
GET /api/memory/diff-analysis - Get recent heap diff
|
|
GET /api/memory/trend - Get full trend analysis
|
|
GET /api/memory/trend.md - Get trend as markdown
|
|
POST /api/memory/trend/save - Save trend report to disk
|
|
```
|
|
|
|
## Recommendations
|
|
|
|
### For Production Monitoring
|
|
|
|
1. **Monitor heap diff API** - Set up alerts for `assessment: "leaking"`
|
|
2. **Review heap snapshots in Chrome DevTools** - When leak is detected, load `.heapsnapshot` files to identify growing retainers
|
|
3. **Check trend reports** - The `/api/memory/trend.md` endpoint provides a human-readable summary
|
|
|
|
### Potential Improvements (Future Work)
|
|
|
|
1. **V8 heap fragmentation** - The 2GB RSS with 1GB heap limit suggests fragmentation. Consider:
|
|
- More frequent heap snapshots (10-15 min intervals) for finer granularity
|
|
- Node.js `--expose-gc` flag for manual GC during low-traffic periods
|
|
|
|
2. **Manager instance cleanup** - The store holds references to multiple manager instances:
|
|
- `errorGroupManager`
|
|
- `recoveryManager`
|
|
- `crossReferenceManager`
|
|
- `workerAnalytics`
|
|
- `semanticNarrativeManager`
|
|
- `historicalStore`
|
|
|
|
These managers should be audited for internal bounds, especially `crossReferenceManager` and `semanticNarrativeManager`.
|
|
|
|
3. **WebSocket connection storms** - While backpressure exists, a sudden influx of clients could cause memory spikes. Consider:
|
|
- Maximum client cap
|
|
- Connection rate limiting
|
|
|
|
## Exit Criteria Status
|
|
|
|
- ✅ Run with `--max-old-space-size=1024` - Already configured in systemd
|
|
- ✅ Capture `v8.getHeapSnapshot()` at 30 min intervals - Already implemented
|
|
- ✅ Diff snapshots to identify growing retainers - New heap diff utilities added
|
|
- ✅ Audit `src/store.ts` for unbounded arrays/maps - All structures bounded
|
|
- ✅ Confirm ring-buffer behavior on event history - Batch trimming implemented
|
|
- ✅ Audit WebSocket broadcast for backpressure - 1MB buffer limit with termination
|
|
- ✅ Audit parser for regex/allocator hot paths - No issues found
|
|
- ⏳ Heap + RSS stable under steady-state load for 24h - Requires production monitoring
|
|
|
|
## Next Steps
|
|
|
|
1. Deploy the updated code with heap diff analysis
|
|
2. Monitor `/api/memory/diff-analysis` over 24-48 hours
|
|
3. If leak is detected, download heap snapshots and analyze in Chrome DevTools
|
|
4. Based on findings, implement targeted fixes
|