Fix acb-index-builder CrashLoopBackOff (Bead bf-2ws)
Problem
acb-index-builder (iad-acb cluster) was in CrashLoopBackOff for 45 days with 4713 restarts. The pod crashed silently after the log line:
{"msg":"Copied web assets to output directory","source":"/app/web/dist"}
No error was logged, indicating either:
- A panic in fetchAllData or generateAllIndexes (unlikely - defer recover() logs panics)
- OOMKill (most likely - kernel SIGKILL terminates before any Go code can log)
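The distinction between the two cases can be sketched in Go. This is a minimal illustration, not the code in acb-index-builder: the helper name runGuarded and the log format are assumptions. A deferred recover() runs while the stack unwinds after a panic, so it can log and convert the panic to an error. A SIGKILL from the kernel OOM killer cannot be caught, so no deferred function runs and nothing is logged.

```go
package main

import (
	"fmt"
	"log"
)

// runGuarded is a hypothetical helper: it runs fn and turns any panic into a
// logged error. It only covers Go panics. A kernel SIGKILL (OOMKill) ends the
// process immediately, so the deferred function below never executes.
func runGuarded(step string, fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic in %s: %v", step, r)
			err = fmt.Errorf("panic in %s: %v", step, r)
		}
	}()
	fn()
	return nil
}
```

With a guard like this around fetchAllData or generateAllIndexes, a panic would leave a log line behind, which is why a silent exit points toward OOMKill instead.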
Root Cause
Investigation of cmd/acb-index-builder/db.go revealed several N+1 query patterns (one extra query per series) and oversized LIMITs causing excessive memory growth:
- fetchSeries (line 531): For each of up to 1000 series, called fetchSeriesGames, making up to 1000 separate queries
- fetchChampionshipBracket (line 736): For each of up to 500 series, called fetchSeriesGames, making up to 500 separate queries
- fetchSeasonSnapshots: LIMIT 10000 was excessive
- fetchLineage: LIMIT 10000 was excessive
The crash occurred in fetchAllData() which runs immediately after copyWebAssets().
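The difference between the per-series pattern and a batched fetch can be sketched with an in-memory stand-in for the database. Everything here is hypothetical (the Game type, countingStore, and both fetch helpers); the real types and queries in db.go are not shown in this report. The sketch only counts how many queries each approach issues.

```go
package main

// Game is a hypothetical stand-in for a row returned by fetchSeriesGames.
type Game struct {
	SeriesID string
	GameID   string
}

// countingStore is a hypothetical in-memory "database" that counts queries.
type countingStore struct {
	games   []Game
	queries int
}

// gamesForSeries models one query per series (the N+1 shape).
func (s *countingStore) gamesForSeries(seriesID string) []Game {
	s.queries++
	var out []Game
	for _, g := range s.games {
		if g.SeriesID == seriesID {
			out = append(out, g)
		}
	}
	return out
}

// gamesForAllSeries models a single batched query (WHERE series_id IN (...)).
func (s *countingStore) gamesForAllSeries(seriesIDs []string) map[string][]Game {
	s.queries++
	wanted := make(map[string]bool, len(seriesIDs))
	for _, id := range seriesIDs {
		wanted[id] = true
	}
	out := make(map[string][]Game, len(seriesIDs))
	for _, g := range s.games {
		if wanted[g.SeriesID] {
			out[g.SeriesID] = append(out[g.SeriesID], g)
		}
	}
	return out
}

// fetchPerSeries issues one query for every series ID.
func fetchPerSeries(s *countingStore, ids []string) map[string][]Game {
	out := make(map[string][]Game, len(ids))
	for _, id := range ids {
		out[id] = s.gamesForSeries(id)
	}
	return out
}

// fetchBatched issues exactly one query regardless of how many IDs are given.
func fetchBatched(s *countingStore, ids []string) map[string][]Game {
	return s.gamesForAllSeries(ids)
}
```

For 1000 series, fetchPerSeries would increment the query counter 1000 times, while fetchBatched increments it once and groups the rows in memory.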
Fix Applied
Modified cmd/acb-index-builder/db.go:
- fetchSeries: Batched games queries (1000 queries → 1 query with WHERE IN clause, LIMIT 10000)
- fetchChampionshipBracket: Batched games queries (500 queries → 1 query with WHERE IN clause, LIMIT 64)
- fetchSeasonSnapshots: Reduced LIMIT from 10000 to 500
- fetchLineage: Reduced LIMIT from 10000 to 1000
- Added strings import for strings.Join in batch queries
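A batched query of this kind might be assembled roughly as follows. This is a sketch under stated assumptions, not the actual change in db.go: the function name buildSeriesGamesQuery, the table and column names (series_games, series_id, game_id), and the $n placeholder style are all hypothetical. It shows where strings.Join comes in: joining one placeholder per series ID into the IN clause while the IDs themselves travel as query arguments.

```go
package main

import (
	"strconv"
	"strings"
)

// buildSeriesGamesQuery returns a single SQL statement covering all given
// series IDs, plus the arguments to pass alongside it.
// Table, column, and placeholder conventions are assumptions for illustration.
func buildSeriesGamesQuery(seriesIDs []string, limit int) (string, []any) {
	if len(seriesIDs) == 0 {
		// Nothing to fetch; callers should skip the query entirely.
		return "", nil
	}
	placeholders := make([]string, len(seriesIDs))
	args := make([]any, len(seriesIDs))
	for i, id := range seriesIDs {
		placeholders[i] = "$" + strconv.Itoa(i+1) // $1, $2, ...
		args[i] = id
	}
	query := "SELECT series_id, game_id FROM series_games WHERE series_id IN (" +
		strings.Join(placeholders, ", ") +
		") ORDER BY series_id LIMIT " + strconv.Itoa(limit)
	return query, args
}
```

Using placeholders keeps the series IDs out of the SQL text itself, and the explicit LIMIT bounds how many rows a single batched call can pull into memory.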
Deployment
- Committed and pushed changes to ai-code-battle repo
- Submitted container-build workflow to iad-ci cluster to rebuild acb-index-builder container
- Workflow name: acb-index-builder-build-v25d2 (status: Running)
Expected Outcome
Once the new image is deployed to iad-acb cluster:
- acb-index-builder should complete build cycles without OOMKill
- "Build cycle completed" log line should appear
- Pod should stop restarting and exit CrashLoopBackOff
Verification Needed
After deployment, monitor:
kubectl --server=http://traefik-iad-acb:8001 logs -n ai-code-battle acb-index-builder-XXX -f
Look for:
- "Build cycle completed" log line
- No crashes after "Copied web assets to output directory"
- Stable pod state (not CrashLoopBackOff)