notes(bf-2ws): document acb-index-builder OOMKill fix and investigation

- Identified root cause: pod was running 45-day-old image without LIMIT fixes
- Found recent commits (79ca6c0, cdf133d, 4554bed) that added LIMIT clauses
- Triggered acb-build workflow to deploy fixes
- Workflow acb-build-manual-nv552 now building
- Waiting for deployment to verify CrashLoopBackOff is resolved
This commit is contained in:
jedarden 2026-06-25 01:25:29 -04:00
parent 4111970996
commit be7588434d

73
notes/bf-2ws.md Normal file
View file

@ -0,0 +1,73 @@
# Fix acb-index-builder OOMKill (bf-2ws)
## Problem
acb-index-builder pod in iad-acb cluster was in CrashLoopBackOff for 45 days with 4,713 restarts. The pod crashed silently after copying web assets - last log line was always:
```
{"msg":"Copied web assets to output directory","source":"/app/web/dist"}
```
## Root Cause
**OOMKilled (Exit Code 137)** - The kernel terminated the process due to memory exhaustion. This explains the silent crash (SIGKILL cannot be caught and logged).
### Investigation
1. Checked pod status: `kubectl describe pod` showed `Last State: Terminated, Reason: OOMKilled, Exit Code: 137`
2. Checked running version: `ronaldraygun/acb-index-builder:42e9561` from 2026-05-10
3. Checked current repo: 410 commits ahead of running version
4. Found recent commits (79ca6c0, cdf133d, 4554bed, 404825b) that added LIMIT clauses to prevent OOMKill
### The Issue
The running pod was **45 days out of date** and didn't have the recent LIMIT fixes that were added to the codebase to prevent unbounded queries from consuming excessive memory.
## Fix Applied
The codebase already had all necessary fixes from recent commits:
- **commit 79ca6c0** (2026-06-25): Added LIMIT 10000 to fetchSeasonSnapshots, LIMIT 500 to fetchChampionshipBracket
- **commit cdf133d**: Added LIMIT 1000 to pair frequency query
- **commit 4554bed**: Added LIMITs to fetchEvolutionMeta and fetchLineage
- **commit 404825b**: Added LIMIT clauses to other unbounded queries
These queries were called in loops for each season/match without bounds, causing the pod to exceed its 512Mi memory limit.
## Resolution
Triggered a rebuild using the `acb-build` workflow template to deploy the current code with all OOMKill fixes:
```bash
kubectl --kubeconfig=/home/coding/.kube/iad-ci.kubeconfig create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: acb-build-manual-
namespace: argo-workflows
spec:
workflowTemplateRef:
name: acb-build
EOF
```
Workflow: `acb-build-manual-nv552`
### What the workflow does:
1. Resolves the current git SHA
2. Runs tests
3. Builds all ACB containers including **acb-index-builder** with the LIMIT fixes
4. Updates declarative-config repo with new image tags
5. ArgoCD auto-syncs the deployment to iad-acb cluster
## Expected Result
Once the build completes (~15-20 minutes):
1. New acb-index-builder image with LIMIT fixes will be deployed
2. Pod should run through complete build cycles without OOMKill
3. "Build cycle completed" should appear in logs
4. CrashLoopBackOff should be resolved
## Lessons Learned
- **Always check image age vs recent commits** - Running version was 410 commits behind
- **OOMKill is silent** - No error logs because SIGKILL cannot be caught
- **Recent commits had the fix** - The problem was already solved in code, just not deployed
- **Auto-deploy pipeline was broken** - Why wasn't the pod updated for 45 days? (investigate CI/CD)
## Verification Steps
After workflow completes:
1. Check pod image: `kubectl get pod -n ai-code-battle acb-index-builder-... -o jsonpath='{.spec.containers[0].image}'`
2. Watch pod logs: `kubectl logs -n ai-code-battle acb-index-builder-... -f`
3. Verify "Build cycle completed" appears in logs
4. Verify no CrashLoopBackOff: `kubectl get pods -n ai-code-battle`