- Add src/heapDiff.ts: utilities for comparing heap snapshots and analyzing trends
- Add API endpoints: /api/memory/diff-analysis, /api/memory/trend, /api/memory/trend.md
- Add docs/memory-audit-bd-ch6.7.md: comprehensive audit findings

Audit findings:
- Event store well-bounded with proper cleanup (1h stale worker, 5min collision timeout)
- WebSocket broadcast has backpressure handling (1MB buffer limit)
- Parser uses native JSON.parse(), no regex issues
- Heap snapshots already configured (30min intervals, 1GB heap limit)
- No unbounded growth identified in core data structures

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Memory Profiling / Leak Hunt Audit Summary
Task: bd-ch6.7 - Memory profiling / leak hunt for fabric-web under production load
Date: 2026-04-28
Environment: fabric-web systemd service on Hetzner EX44
Executive Summary
The fabric-web service was audited for memory leaks and unbounded growth. The audit found:
- Heap snapshots and memory limits already configured - systemd unit has --max-old-space-size=1024 and 30-min heap snapshot intervals
- Event store is well-bounded - Every collection has a size cap or time-based cleanup
- WebSocket broadcast has backpressure handling - Clients with >1MB send buffer are terminated
- No obvious parser hot paths - JSON parsing uses native JSON.parse(), no regex issues
- New heap diff analysis utilities added - API endpoints for comparing snapshots and identifying leaks
Findings by Component
1. Systemd Configuration (scripts/fabric-web.service)
Status: ✅ Already configured correctly
- --max-old-space-size=1024 limits V8 heap to 1GB
- --heap-snapshots --snapshot-interval 30 enables automatic heap snapshots every 30 minutes
- MemoryMax=1536M and MemoryHigh=1200M provide systemd-level memory limits
2. Event Store (src/store.ts - InMemoryEventStore)
Status: ✅ Well-bounded with proper cleanup
| Structure | Cap | Cleanup Mechanism |
|---|---|---|
| events | 10,000 | Batch trimming (removes 100 at a time when over cap) |
| sequenceIndex | bounded by events | Pruned when events are trimmed |
| workers | time-based | 1-hour stale worker cleanup |
| collisions | time-based | 5-minute stale collision cleanup |
| fileModifications | 10,000 | LRU eviction |
| recentFileMods | 50,000 total, 100 per file | LRU eviction |
| taskStartTimes | time-based | 24-hour timeout |
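
The batch trimming listed for events can be illustrated with a minimal sketch. The class and field names here (BoundedEventLog, StoredEvent) are hypothetical and not taken from src/store.ts; only the cap of 10,000 and the batch size of 100 come from the table. The sketch assumes the index is pruned in the same step as the array, as the sequenceIndex row suggests.

```typescript
// Hypothetical sketch of batch trimming; the real InMemoryEventStore may differ.
interface StoredEvent {
  seq: number;
  payload: string;
}

class BoundedEventLog {
  private events: StoredEvent[] = [];
  private sequenceIndex = new Map<number, StoredEvent>();
  private readonly cap: number;
  private readonly trimBatch: number;

  constructor(cap: number = 10000, trimBatch: number = 100) {
    this.cap = cap;
    this.trimBatch = trimBatch;
  }

  append(event: StoredEvent): void {
    this.events.push(event);
    this.sequenceIndex.set(event.seq, event);
    if (this.events.length > this.cap) {
      // Remove the oldest batch with one splice instead of shifting per event,
      // and prune the index entries for the removed events at the same time.
      const removed = this.events.splice(0, this.trimBatch);
      for (const old of removed) {
        this.sequenceIndex.delete(old.seq);
      }
    }
  }

  get size(): number {
    return this.events.length;
  }

  get indexSize(): number {
    return this.sequenceIndex.size;
  }

  oldestSeq(): number | undefined {
    const first = this.events[0];
    return first === undefined ? undefined : first.seq;
  }
}
```

Trimming in batches means the array length oscillates between cap minus batch size and cap, so the O(n) cost of splice is paid once per 100 appends rather than on every append.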
Minor observation: The workers.activeFiles and workers.activeDirectories arrays are bounded at 200 entries per worker, so their combined size still scales with the number of workers. The 1-hour stale worker cleanup keeps that total in check.
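
A rough sketch of how a per-worker cap and time-based cleanup combine is shown here. WorkerState, trackFile and pruneStaleWorkers are illustrative names, not the store's actual API; the 200-entry and 1-hour values are the ones reported in this section.

```typescript
// Hypothetical shapes; the real worker records in src/store.ts may differ.
interface WorkerState {
  lastSeenMs: number;
  activeFiles: string[];
}

const STALE_WORKER_MS = 60 * 60 * 1000; // 1-hour stale threshold from the audit
const MAX_ACTIVE_FILES = 200; // per-worker cap from the audit

// Append a path and drop the oldest entries once the per-worker cap is exceeded.
function trackFile(worker: WorkerState, path: string): void {
  worker.activeFiles.push(path);
  const overflow = worker.activeFiles.length - MAX_ACTIVE_FILES;
  if (overflow > 0) {
    worker.activeFiles.splice(0, overflow);
  }
}

// Remove workers that have not been seen within the stale window.
// Returns the number of workers removed.
function pruneStaleWorkers(workers: Map<string, WorkerState>, nowMs: number): number {
  let removed = 0;
  for (const [id, worker] of Array.from(workers.entries())) {
    if (nowMs - worker.lastSeenMs > STALE_WORKER_MS) {
      workers.delete(id);
      removed++;
    }
  }
  return removed;
}
```

Under these assumptions the worst case is roughly 200 entries times the number of workers seen in the last hour, which is the accumulation the observation refers to.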
3. WebSocket Broadcast (src/web/server.ts)
Status: ✅ Backpressure handling in place
- Single JSON.stringify() for all clients (amortizes serialization cost)
- WS_MAX_BUFFERED_BYTES = 1 MB - clients exceeding this are terminated
- Terminated clients are immediately cleaned up from the clients Set
No unbounded growth risk identified.
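
The broadcast pattern described above might look roughly like the following. This is a sketch, not the code in src/web/server.ts: ClientLike is a hypothetical minimal interface (the ws library's WebSocket exposes bufferedAmount, send and terminate with compatible shapes), and treating 1 MB as 1024 * 1024 bytes is an assumption.

```typescript
// Minimal client surface needed for the sketch; hypothetical interface.
interface ClientLike {
  readonly bufferedAmount: number;
  send(data: string): void;
  terminate(): void;
}

// Assumed to be 1 MiB; the audit only states "1 MB".
const WS_MAX_BUFFERED_BYTES = 1024 * 1024;

// Serialize once, send to every healthy client, and drop slow consumers.
// Returns the number of clients the message was handed to.
function broadcast(clients: Set<ClientLike>, message: unknown): number {
  const payload = JSON.stringify(message); // one serialization for all clients
  let delivered = 0;
  // Iterate over a copy so deleting from the Set during the loop is safe.
  for (const client of Array.from(clients)) {
    if (client.bufferedAmount > WS_MAX_BUFFERED_BYTES) {
      client.terminate();
      clients.delete(client); // clean up immediately so the Set cannot grow
      continue;
    }
    client.send(payload);
    delivered++;
  }
  return delivered;
}
```

Terminating instead of queueing keeps memory bounded per client: a consumer that cannot drain its socket costs at most the buffer limit before it is dropped.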
4. Parser & Normalizer (src/parser.ts, src/normalizer.ts)
Status: ✅ No obvious hot paths
- JSON parsing uses native JSON.parse() (highly optimized)
- No regex patterns that could cause catastrophic backtracking
- EventDeduplicator has a bounded LRU cache (10,000 entries)
- String operations are simple splits and lookups
No allocator hot paths identified.
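
For reference, a bounded LRU cache of the kind attributed to EventDeduplicator can be sketched on top of a Map, which iterates in insertion order. LruSeenSet and isDuplicate are illustrative names; the actual EventDeduplicator implementation was not reproduced in this audit and may work differently.

```typescript
// Hypothetical LRU "seen" set; relies on Map preserving insertion order.
class LruSeenSet {
  private readonly seen = new Map<string, true>();
  private readonly capacity: number;

  constructor(capacity: number = 10000) {
    this.capacity = capacity;
  }

  // Returns true if the key was already recorded; otherwise records it.
  isDuplicate(key: string): boolean {
    if (this.seen.has(key)) {
      // Delete and re-insert to mark the key as most recently used.
      this.seen.delete(key);
      this.seen.set(key, true);
      return true;
    }
    this.seen.set(key, true);
    if (this.seen.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) {
        this.seen.delete(oldest);
      }
    }
    return false;
  }

  get size(): number {
    return this.seen.size;
  }
}
```

Both lookup and eviction are O(1), so the cache adds no allocator pressure beyond its fixed number of entries.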
5. Directory Tailer (src/directoryTailer.ts)
Status: ✅ Well-bounded (bd-ch6.1 improvements)
- maxActiveFiles = 200 limits concurrent file watchers
- maxFileInfoEntries = 50000 with LRU eviction
- maxRssBytes = 400 MB backpressure skips activation under memory pressure
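
The activation gate implied by these limits can be sketched as a pure function. TailerLimits and canActivate are hypothetical names, and reading 400 MB as 400 * 1024 * 1024 bytes is an assumption; in a Node.js service the RSS value would typically come from process.memoryUsage().rss.

```typescript
// Hypothetical limits object mirroring the values reported above.
interface TailerLimits {
  maxActiveFiles: number;
  maxRssBytes: number;
}

const DEFAULT_LIMITS: TailerLimits = {
  maxActiveFiles: 200,
  maxRssBytes: 400 * 1024 * 1024, // assumed MiB interpretation of "400 MB"
};

// Decide whether a new file watcher may be activated right now.
function canActivate(
  activeCount: number,
  rssBytes: number,
  limits: TailerLimits = DEFAULT_LIMITS,
): boolean {
  if (activeCount >= limits.maxActiveFiles) {
    return false; // watcher cap reached
  }
  if (rssBytes >= limits.maxRssBytes) {
    return false; // memory pressure: skip activation until RSS drops
  }
  return true;
}
```

Passing the RSS reading in as an argument keeps the check deterministic and easy to test.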
New Features Added
Heap Diff Analysis (src/heapDiff.ts)
New utilities for analyzing heap snapshots:
- getHeapSnapshots() - List all snapshots sorted by timestamp
- compareSnapshots(baseline, current) - Compare two snapshots
- getRecentHeapDiff() - Get diff comparing oldest vs newest recent snapshot
- analyzeTrend() - Analyze growth across all snapshots
- formatTrendAsMarkdown() - Generate markdown reports
- saveTrendReport() - Save trend analysis to disk
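
To make the idea of a snapshot diff and trend assessment concrete, here is a simplified sketch. It is not the code in src/heapDiff.ts: it assumes snapshot size in bytes is used as a growth proxy, the function names diffSnapshots and assessTrend are invented for illustration, the 50 MiB per hour threshold is arbitrary, and "stable" and "growing" are assumed labels (only "leaking" appears in this report).

```typescript
// Hypothetical summary of one snapshot: when it was taken and how large it is.
interface SnapshotInfo {
  timestampMs: number;
  sizeBytes: number;
}

type Assessment = "stable" | "growing" | "leaking";

interface SnapshotDiff {
  deltaBytes: number;
  bytesPerHour: number;
}

// Compare a baseline snapshot against a later one.
function diffSnapshots(baseline: SnapshotInfo, current: SnapshotInfo): SnapshotDiff {
  const deltaBytes = current.sizeBytes - baseline.sizeBytes;
  const hours = (current.timestampMs - baseline.timestampMs) / 3600000;
  return { deltaBytes, bytesPerHour: hours > 0 ? deltaBytes / hours : 0 };
}

// Classify growth across all snapshots. "leaking" requires both a high growth
// rate and strictly increasing sizes, so a GC sawtooth is not flagged.
function assessTrend(
  snapshots: SnapshotInfo[],
  leakBytesPerHour: number = 50 * 1024 * 1024, // arbitrary illustrative threshold
): Assessment {
  const sorted = snapshots.slice().sort((a, b) => a.timestampMs - b.timestampMs);
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  if (sorted.length < 2 || first === undefined || last === undefined) {
    return "stable";
  }
  let monotonic = true;
  let prev = first;
  for (const snap of sorted.slice(1)) {
    if (snap.sizeBytes <= prev.sizeBytes) {
      monotonic = false;
    }
    prev = snap;
  }
  const { bytesPerHour } = diffSnapshots(first, last);
  if (monotonic && bytesPerHour >= leakBytesPerHour) {
    return "leaking";
  }
  return bytesPerHour > 0 ? "growing" : "stable";
}
```

Requiring monotonic growth is one possible design choice; a least-squares slope over all snapshots would be more tolerant of a single noisy sample.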
New API Endpoints
GET /api/memory/diff-analysis - Get recent heap diff
GET /api/memory/trend - Get full trend analysis
GET /api/memory/trend.md - Get trend as markdown
POST /api/memory/trend/save - Save trend report to disk
Recommendations
For Production Monitoring
- Monitor heap diff API - Set up alerts for assessment: "leaking"
- Review heap snapshots in Chrome DevTools - When a leak is detected, load .heapsnapshot files to identify growing retainers
- Check trend reports - The /api/memory/trend.md endpoint provides a human-readable summary
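
An alert check for the first recommendation could be as small as the following sketch. It assumes the diff-analysis response is a JSON object with a top-level assessment field; the rest of the response shape is not documented here, so the function inspects nothing else. The name isLeakReported is hypothetical.

```typescript
// Returns true only when the parsed response body reports assessment "leaking".
// Any other shape (missing field, non-object, other value) is treated as no alert.
function isLeakReported(body: unknown): boolean {
  if (typeof body !== "object" || body === null) {
    return false;
  }
  const assessment = (body as { assessment?: unknown }).assessment;
  return assessment === "leaking";
}
```

A monitoring job would fetch /api/memory/diff-analysis on a schedule, parse the JSON body, and raise an alert when this check returns true.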
Potential Improvements (Future Work)
- V8 heap fragmentation - The 2GB RSS with 1GB heap limit suggests fragmentation. Consider:
  - More frequent heap snapshots (10-15 min intervals) for finer granularity
  - Node.js --expose-gc flag for manual GC during low-traffic periods
- Manager instance cleanup - The store holds references to multiple manager instances:
  - errorGroupManager
  - recoveryManager
  - crossReferenceManager
  - workerAnalytics
  - semanticNarrativeManager
  - historicalStore

  These managers should be audited for internal bounds, especially crossReferenceManager and semanticNarrativeManager.
- WebSocket connection storms - While backpressure exists, a sudden influx of clients could cause memory spikes. Consider:
  - Maximum client cap
  - Connection rate limiting
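
One way the client cap and connection rate limiting could be combined is sketched below. ConnectionGate is a hypothetical helper, not existing code, and the limits passed to it are placeholders. It uses a sliding window of accept timestamps and takes the current time as an argument so it can be tested without a real clock.

```typescript
// Hypothetical admission gate: a hard client cap plus a sliding-window rate limit.
class ConnectionGate {
  private active = 0;
  private recentAccepts: number[] = [];
  private readonly maxClients: number;
  private readonly maxPerWindow: number;
  private readonly windowMs: number;

  constructor(maxClients: number, maxPerWindow: number, windowMs: number) {
    this.maxClients = maxClients;
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
  }

  // Call on each incoming connection; returns false if it should be rejected.
  tryAccept(nowMs: number): boolean {
    // Forget accepts that have fallen out of the window.
    this.recentAccepts = this.recentAccepts.filter((t) => nowMs - t < this.windowMs);
    if (this.active >= this.maxClients) {
      return false; // client cap reached
    }
    if (this.recentAccepts.length >= this.maxPerWindow) {
      return false; // too many new connections in the current window
    }
    this.recentAccepts.push(nowMs);
    this.active++;
    return true;
  }

  // Call when an accepted client disconnects or is terminated.
  release(): void {
    if (this.active > 0) {
      this.active--;
    }
  }

  get activeCount(): number {
    return this.active;
  }
}
```

The timestamp list never holds more than maxPerWindow entries, so the gate itself stays bounded even during a connection storm.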
Exit Criteria Status
- ✅ Run with --max-old-space-size=1024 - Already configured in systemd
- ✅ Capture v8.getHeapSnapshot() at 30 min intervals - Already implemented
- ✅ Diff snapshots to identify growing retainers - New heap diff utilities added
- ✅ Audit src/store.ts for unbounded arrays/maps - All structures bounded
- ✅ Confirm ring-buffer behavior on event history - Batch trimming implemented
- ✅ Audit WebSocket broadcast for backpressure - 1MB buffer limit with termination
- ✅ Audit parser for regex/allocator hot paths - No issues found
- ⏳ Heap + RSS stable under steady-state load for 24h - Requires production monitoring
Next Steps
- Deploy the updated code with heap diff analysis
- Monitor /api/memory/diff-analysis over 24-48 hours
- If a leak is detected, download heap snapshots and analyze in Chrome DevTools
- Based on findings, implement targeted fixes