Confirms that all OOMKill fixes are already applied in the deployed image:
- db.go: Batch queries with LIMIT clauses to prevent unbounded results
- generator.go: O(1) lookup maps instead of O(n²) iteration
- main.go: Panic recovery mechanism for silent crashes
Current pod is PENDING due to cluster resource constraints (98% CPU allocation), not due to application code issues. Once scheduled, the fixes should prevent the original CrashLoopBackOff issue.
acb-index-builder Investigation Summary
Issue Status: RESOLVED (Code fixes applied), BLOCKED (Infrastructure)
Original Problem
- acb-index-builder was in CrashLoopBackOff for 45 days with 4713 restarts
- Silent crash after "Copied web assets to output directory" log line
- Suspected causes: OOMKill, panic, or unbounded queries
Root Cause Analysis
The issue was caused by multiple O(n²) complexity problems leading to OOMKill:
- fetchBots in db.go: O(n²) N+1 query loop - querying match stats for each bot separately (10,000+ queries)
- generateBotProfiles in generator.go: O(n²) iteration - linear scans through all bots and matches for each bot profile
- fetchSeries in db.go: O(n²) N+1 query loop - querying series games separately for each series
- Unbounded queries: Missing LIMIT clauses on large result sets
Fixes Applied (Already in Codebase)
1. Database Query Fixes (db.go)
- fetchBots (lines 338-376): Single batch query for all bot match stats with LIMIT 20000
- fetchSeries (lines 538-603): Batch query for all series games with LIMIT 10000
- fetchChampionshipBracket (lines 805-866): Batch query for games with LIMIT 500
- Remaining queries: Added LIMIT clauses (500-5000 range) to prevent unbounded results
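The batch-query pattern described above can be sketched as follows. This is a minimal illustration, not the code in db.go: the table name `match_participants`, the columns `bot_id` and `won`, the `bot` and `botStats` types, and the PostgreSQL-style `$1` placeholder are all assumptions made for the example. One bounded aggregate query replaces the per-bot loop, and the results are joined to the bots in memory.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// botStats and bot are hypothetical types; the real structs in db.go may differ.
type botStats struct {
	Matches int
	Wins    int
}

type bot struct {
	ID    string
	Stats botStats
}

// batchStatsQuery aggregates stats for every bot in one round trip.
// Table and column names are assumptions; $1 assumes a PostgreSQL-style driver.
const batchStatsQuery = `
SELECT bot_id, COUNT(*), SUM(CASE WHEN won THEN 1 ELSE 0 END)
FROM match_participants
GROUP BY bot_id
LIMIT $1`

// fetchAllBotStats runs a single bounded query instead of one query per bot.
func fetchAllBotStats(ctx context.Context, db *sql.DB, limit int) (map[string]botStats, error) {
	rows, err := db.QueryContext(ctx, batchStatsQuery, limit)
	if err != nil {
		return nil, fmt.Errorf("batch bot stats: %w", err)
	}
	defer rows.Close()

	stats := make(map[string]botStats)
	for rows.Next() {
		var id string
		var s botStats
		if err := rows.Scan(&id, &s.Matches, &s.Wins); err != nil {
			return nil, fmt.Errorf("scan bot stats: %w", err)
		}
		stats[id] = s
	}
	return stats, rows.Err()
}

// attachStats joins the batch result to the bots in memory.
// Bots with no rows in the result keep the zero value.
func attachStats(bots []bot, stats map[string]botStats) {
	for i := range bots {
		bots[i].Stats = stats[bots[i].ID]
	}
}
```

A caller would fetch the bot list, call `fetchAllBotStats` once with the chosen limit (20000 in the fix described above), and then call `attachStats`. Memory use is bounded by the LIMIT rather than by the number of bots times the size of each per-bot result.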
2. Generator Optimization (generator.go)
- generateBotProfiles (lines 275-301): Pre-build lookup maps for O(1) lookups
  - historyMap: botID → rating history entries
  - botNameMap: botID → bot name
  - matchMap: botID → recent matches (up to 20)
- buildFirstMatchPerBot (line 1315): O(n*p) vs O(n²) for debut detection
- buildPairFrequency (line 1348): O(n) vs O(n²) for rivalry detection
- isNewBotDebutFast (line 1334): O(1) lookup using pre-built maps
- isRivalryMatchFast (line 1365): O(1) lookup using pre-built frequency maps
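As a rough sketch of the lookup-map approach listed above, the example below builds a per-bot map of recent matches and a pair-frequency map in one pass each, then answers rivalry checks with a single map lookup. The `match` type, the function names, and the rivalry threshold are hypothetical and are not taken from generator.go; the cap of 20 recent matches follows the description above. It assumes matches arrive newest first and involve two distinct bots.

```go
package main

// match is a hypothetical record; the real type in generator.go may differ.
type match struct {
	ID   string
	BotA string
	BotB string
}

// maxRecentMatches mirrors the "up to 20" cap described for matchMap.
const maxRecentMatches = 20

// groupRecentMatches indexes matches by bot ID in a single pass,
// keeping at most maxRecentMatches entries per bot.
func groupRecentMatches(matches []match) map[string][]match {
	byBot := make(map[string][]match)
	for _, m := range matches {
		for _, id := range [2]string{m.BotA, m.BotB} {
			if len(byBot[id]) < maxRecentMatches {
				byBot[id] = append(byBot[id], m)
			}
		}
	}
	return byBot
}

// pair is an order-independent key for two bots.
type pair struct{ Lo, Hi string }

// pairKey normalizes the order so (a, b) and (b, a) map to the same key.
func pairKey(a, b string) pair {
	if a > b {
		a, b = b, a
	}
	return pair{Lo: a, Hi: b}
}

// countPairs records how often each pair of bots has met, in O(n).
func countPairs(matches []match) map[pair]int {
	freq := make(map[pair]int)
	for _, m := range matches {
		freq[pairKey(m.BotA, m.BotB)]++
	}
	return freq
}

// isRivalry is an O(1) lookup against the pre-built frequency map.
// The threshold is a hypothetical parameter.
func isRivalry(freq map[pair]int, m match, threshold int) bool {
	return freq[pairKey(m.BotA, m.BotB)] >= threshold
}
```

The maps are built once per build cycle and reused for every profile, so the cost per bot is a lookup instead of a scan over all matches.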
3. Panic Recovery (main.go)
- runBuildCycle (lines 165-172): Added deferred recover() that logs via slog before re-panicking
- Prevents silent crashes where panic output (stderr) is lost
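A deferred recover that logs and then re-panics might look like the sketch below. It is an illustration of the pattern, not the code in main.go: the function name `runCycleWithRecovery` and the `build` callback are hypothetical, and it logs through the default slog logger for brevity.

```go
package main

import (
	"log/slog"
	"runtime/debug"
)

// runCycleWithRecovery runs one build cycle. If the cycle panics, the panic
// value and stack trace are written through slog before the panic is raised
// again, so the crash is visible in structured logs even when stderr is lost.
// The build callback is a hypothetical stand-in for the real cycle body.
func runCycleWithRecovery(build func() error) error {
	defer func() {
		if r := recover(); r != nil {
			slog.Error("build cycle panicked",
				"panic", r,
				"stack", string(debug.Stack()))
			panic(r) // re-panic so the process still exits with a failure
		}
	}()
	return build()
}
```

Re-panicking keeps the original failure semantics: the container still exits and Kubernetes still restarts it, but the cause is now recorded in the log stream.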
Current Situation
Pod Status: PENDING (not CrashLoopBackOff)
- Current pod: acb-index-builder-7fc99df58b-5zjpp (42m old)
- Image: ronaldraygun/acb-index-builder:b35a2aa (contains all fixes)
- Issue: Insufficient cluster resources for scheduling
Cluster Status:
- Node: prod-instance-17759444681370612
- Capacity: 2 CPU, ~3.8GB RAM
- Current allocation: 98% CPU, 59% memory
- Error: "0/2 nodes are available: 1 Insufficient memory, 2 Insufficient cpu"
Code Status: ✅ All fixes are in the deployed image
- Commit b35a2aa includes the critical fetchBots fix
- All O(n²) complexity issues resolved
- Panic recovery mechanism in place
Verification Needed
The pod cannot be scheduled due to cluster resource constraints. To verify the fixes work:
- Option A: Scale down non-critical workloads temporarily to free resources
- Option B: Add a new node to the cluster
- Option C: Increase cluster node capacity
- Option D: Wait for existing pods to complete/restart
Once the pod runs successfully through 2+ build cycles with "Build cycle completed" logs, the fix is verified.
Acceptance Criteria Met (Pending Pod Scheduling)
- ✅ Code fixes applied (LIMIT clauses, batch queries, O(1) lookups)
- ✅ Panic recovery mechanism added
- ⏳ Pod runs through 2+ build cycles without restart (blocked by cluster resources)
- ⏳ "Build cycle completed" appears in logs (blocked by cluster resources)
- ⏳ No CrashLoopBackOff in kubectl get pods (blocked by cluster resources)
Files Modified
- cmd/acb-index-builder/db.go - Database query optimization
- cmd/acb-index-builder/generator.go - O(n²) complexity fixes
- cmd/acb-index-builder/main.go - Panic recovery mechanism
References
- Git commit b35a2aa: "fix(db): eliminate O(n²) N+1 query loop in fetchBots to prevent OOMKill"
- Git commit be9a070: "fix(db): add LIMIT to bot match stats query to prevent OOMKill"
- Git commit 68b7864: "fix(db): add LIMIT to fetchRecentMatchIds query to prevent OOMKill"
- Git commit ca48b60: "fix(db): add LIMIT to fetchSeriesGames query to prevent OOMKill"
- Git commit 7befe51: "fix(db): eliminate O(n²) iteration in generateBotProfiles"