notes(bf-2ws): document acb-index-builder OOMKill fix and investigation
- Identified root cause: pod was running 45-day-old image without LIMIT fixes
- Found recent commits (79ca6c0, cdf133d, 4554bed) that added LIMIT clauses
- Triggered acb-build workflow to deploy fixes
- Workflow acb-build-manual-nv552 now building
- Waiting for deployment to verify CrashLoopBackOff is resolved
parent 4111970996
commit be7588434d
1 changed file with 73 additions and 0 deletions
notes/bf-2ws.md
@@ -0,0 +1,73 @@
# Fix acb-index-builder OOMKill (bf-2ws)

## Problem
The acb-index-builder pod in the iad-acb cluster had been in CrashLoopBackOff for 45 days, with 4,713 restarts. It crashed silently after copying web assets; the last log line was always:
```
{"msg":"Copied web assets to output directory","source":"/app/web/dist"}
```

## Root Cause
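**OOMKilled (Exit Code 137)** - The container exceeded its memory limit, so the kernel OOM killer terminated the process with SIGKILL (exit code 137 = 128 + signal 9). This explains the silent crash: SIGKILL cannot be caught, so the process never gets a chance to log an error.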
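
The exit-code arithmetic can be checked from a shell. The sketch below is illustrative only: `decode_exit_code` is a hypothetical helper name, and it relies on the common convention that an exit code above 128 means the process was killed by signal `code - 128`.

```bash
# Decode an exit code into the signal that ended the process, if any.
# Convention: codes above 128 mean "killed by signal (code - 128)".
decode_exit_code() {
  local code="$1" sig name
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # `kill -l N` prints the name of signal N without the SIG prefix.
    name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
    echo "signal ${sig} (SIG${name})"
  else
    echo "exit status ${code} (not a signal)"
  fi
}
```

Calling `decode_exit_code 137` prints `signal 9 (SIGKILL)`, which matches the `OOMKilled` reason reported for the pod.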
### Investigation
1. Checked pod status: `kubectl describe pod` showed `Last State: Terminated, Reason: OOMKilled, Exit Code: 137`
2. Checked running version: `ronaldraygun/acb-index-builder:42e9561` from 2026-05-10
3. Checked current repo: 410 commits ahead of the running version
4. Found recent commits (79ca6c0, cdf133d, 4554bed, 404825b) that added LIMIT clauses to prevent OOMKill
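
Steps 2 and 3 can be scripted roughly as follows. This is a sketch under two assumptions: that the image tag is the short git SHA the image was built from (as `42e9561` appears to be), and that it is run inside a checkout of the repo. The helper names `image_tag` and `commits_behind` are hypothetical.

```bash
# Extract the tag from an image reference of the form repo/name:tag.
# Assumes the reference carries a tag; a bare registry:port/name would not work.
image_tag() {
  local image="$1"
  echo "${image##*:}"   # drop everything up to and including the last colon
}

# Count commits reachable from HEAD but not from the deployed commit.
# Must be run inside a clone that contains the deployed commit.
commits_behind() {
  local deployed_sha="$1"
  git rev-list --count "${deployed_sha}..HEAD"
}
```

For example, `commits_behind "$(image_tag ronaldraygun/acb-index-builder:42e9561)"` reports how far the running image lags the current checkout.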
### The Issue
The running pod's image was **45 days out of date** and lacked the LIMIT fixes that had since been added to the codebase to keep unbounded queries from consuming excessive memory.

## Fix Applied
The codebase already contained all the necessary fixes from recent commits:
- **commit 79ca6c0** (2026-06-25): Added LIMIT 10000 to `fetchSeasonSnapshots`, LIMIT 500 to `fetchChampionshipBracket`
- **commit cdf133d**: Added LIMIT 1000 to pair frequency query
- **commit 4554bed**: Added LIMITs to `fetchEvolutionMeta` and `fetchLineage`
- **commit 404825b**: Added LIMIT clauses to other unbounded queries

These queries ran in loops, once per season or match, with no bound on result size, which pushed the pod past its 512Mi memory limit.
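
One rough way to look for any remaining unbounded queries is to scan source files for `SELECT` statements that have no `LIMIT` on the same line. The sketch below is a heuristic only: it assumes each query fits on a single line, so multi-line queries will be misreported. The function name `find_unbounded_selects` is hypothetical and not part of the codebase.

```bash
# Print file:line for every line that mentions SELECT but not LIMIT.
# Heuristic: matches whole words, case-insensitively, one line at a time.
find_unbounded_selects() {
  awk '
    { line = tolower($0) }
    line ~ /(^|[^a-z0-9_])select([^a-z0-9_]|$)/ &&
    line !~ /(^|[^a-z0-9_])limit([^a-z0-9_]|$)/ {
      print FILENAME ":" FNR ": " $0
    }
  ' "$@"
}
```

Pass it the files to inspect, for example `find_unbounded_selects path/to/queries.go`, and review each hit by hand, since a query may be bounded in some other way.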
## Resolution
Triggered a rebuild using the `acb-build` workflow template to deploy the current code with all OOMKill fixes:

```bash
kubectl --kubeconfig=/home/coding/.kube/iad-ci.kubeconfig create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: acb-build-manual-
  namespace: argo-workflows
spec:
  workflowTemplateRef:
    name: acb-build
EOF
```

Workflow: `acb-build-manual-nv552`

### What the workflow does:
1. Resolves the current git SHA
2. Runs tests
3. Builds all ACB containers including **acb-index-builder** with the LIMIT fixes
4. Updates the declarative-config repo with the new image tags
5. ArgoCD then auto-syncs the deployment to the iad-acb cluster
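
While the workflow runs, its progress can be polled instead of checked by hand. The following is a minimal sketch, assuming `kubectl` is pointed at the CI cluster (for example via `KUBECONFIG`) and that the Workflow resource exposes its state in `.status.phase`, as Argo Workflows does. The function name `wait_for_workflow` and the default 30-second interval are illustrative choices.

```bash
# Poll an Argo Workflow until it reaches a terminal phase.
# Returns 0 on Succeeded, 1 on Failed or Error; keeps polling otherwise.
wait_for_workflow() {
  local name="$1" namespace="${2:-argo-workflows}" interval="${3:-30}"
  local phase
  while true; do
    phase=$(kubectl get workflow "$name" -n "$namespace" \
      -o jsonpath='{.status.phase}')
    case "$phase" in
      Succeeded)
        echo "workflow ${name} succeeded"
        return 0 ;;
      Failed|Error)
        echo "workflow ${name} ended in phase ${phase}" >&2
        return 1 ;;
      *)
        # Pending, Running, or not yet reported: wait and retry.
        sleep "$interval" ;;
    esac
  done
}
```

For this incident the call would be `wait_for_workflow acb-build-manual-nv552`. The loop has no overall timeout, so interrupt it manually if the workflow stalls.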
## Expected Result
Once the build completes (~15-20 minutes):
1. The new acb-index-builder image with the LIMIT fixes will be deployed
2. The pod should run through complete build cycles without OOMKill
3. "Build cycle completed" should appear in the logs
4. CrashLoopBackOff should be resolved

## Lessons Learned
- **Always check image age vs recent commits** - The running version was 410 commits behind
- **OOMKill is silent** - No error logs because SIGKILL cannot be caught
- **Recent commits had the fix** - The problem was already solved in code, just not deployed
- **Auto-deploy pipeline appears broken** - The pod went 45 days without an update; the CI/CD cause still needs investigating

## Verification Steps
After the workflow completes:
1. Check pod image: `kubectl get pod -n ai-code-battle acb-index-builder-... -o jsonpath='{.spec.containers[0].image}'`
2. Watch pod logs: `kubectl logs -n ai-code-battle acb-index-builder-... -f`
3. Verify "Build cycle completed" appears in logs
4. Verify no CrashLoopBackOff: `kubectl get pods -n ai-code-battle`
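
Steps 3 and 4 can be turned into pass/fail checks by piping the `kubectl` output through small filters. This is a sketch only: the helper names are hypothetical, and the pod name must be supplied by the reader (here via a `POD` variable) because it is elided in these notes.

```bash
# Succeeds if the log text on stdin contains the completion message.
build_cycle_completed() {
  grep -q 'Build cycle completed'
}

# Succeeds if no pod in the listing on stdin is in CrashLoopBackOff.
no_crashloop() {
  ! grep -q 'CrashLoopBackOff'
}

# Example usage (POD is a placeholder for the real pod name):
#   kubectl logs -n ai-code-battle "$POD" | build_cycle_completed && echo "build OK"
#   kubectl get pods -n ai-code-battle | no_crashloop && echo "no crash loops"
```

Both helpers only read stdin, so they can also be run against saved log files when the cluster is not reachable.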