ai-code-battle/notes/bf-2ws.md
jedarden be7588434d notes(bf-2ws): document acb-index-builder OOMKill fix and investigation
- Identified root cause: pod was running 45-day-old image without LIMIT fixes
- Found recent commits (79ca6c0, cdf133d, 4554bed) that added LIMIT clauses
- Triggered acb-build workflow to deploy fixes
- Workflow acb-build-manual-nv552 now building
- Waiting for deployment to verify CrashLoopBackOff is resolved
2026-06-25 01:29:26 -04:00

3.2 KiB

Fix acb-index-builder OOMKill (bf-2ws)

Problem

acb-index-builder pod in iad-acb cluster was in CrashLoopBackOff for 45 days with 4,713 restarts. The pod crashed silently after copying web assets - last log line was always:

{"msg":"Copied web assets to output directory","source":"/app/web/dist"}

Root Cause

OOMKilled (Exit Code 137) - The kernel terminated the process due to memory exhaustion. This explains the silent crash (SIGKILL cannot be caught and logged).

Investigation

  1. Checked pod status: kubectl describe pod showed Last State: Terminated, Reason: OOMKilled, Exit Code: 137
  2. Checked running version: ronaldraygun/acb-index-builder:42e9561 from 2026-05-10
  3. Checked current repo: 410 commits ahead of running version
  4. Found recent commits (79ca6c0, cdf133d, 4554bed, 404825b) that added LIMIT clauses to prevent OOMKill

The Issue

The running pod was 45 days out of date and didn't have the recent LIMIT fixes that were added to the codebase to prevent unbounded queries from consuming excessive memory.

Fix Applied

The codebase already had all necessary fixes from recent commits:

  • commit 79ca6c0 (2026-06-25): Added LIMIT 10000 to fetchSeasonSnapshots, LIMIT 500 to fetchChampionshipBracket
  • commit cdf133d: Added LIMIT 1000 to pair frequency query
  • commit 4554bed: Added LIMITs to fetchEvolutionMeta and fetchLineage
  • commit 404825b: Added LIMIT clauses to other unbounded queries

These queries were called in loops for each season/match without bounds, causing the pod to exceed its 512Mi memory limit.

Resolution

Triggered a rebuild using the acb-build workflow template to deploy the current code with all OOMKill fixes:

kubectl --kubeconfig=/home/coding/.kube/iad-ci.kubeconfig create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: acb-build-manual-
  namespace: argo-workflows
spec:
  workflowTemplateRef:
    name: acb-build
EOF

Workflow: acb-build-manual-nv552

What the workflow does:

  1. Resolves the current git SHA
  2. Runs tests
  3. Builds all ACB containers including acb-index-builder with the LIMIT fixes
  4. Updates declarative-config repo with new image tags
  5. ArgoCD auto-syncs the deployment to iad-acb cluster

Expected Result

Once the build completes (~15-20 minutes):

  1. New acb-index-builder image with LIMIT fixes will be deployed
  2. Pod should run through complete build cycles without OOMKill
  3. "Build cycle completed" should appear in logs
  4. CrashLoopBackOff should be resolved

Lessons Learned

  • Always check image age vs recent commits - Running version was 410 commits behind
  • OOMKill is silent - No error logs because SIGKILL cannot be caught
  • Recent commits had the fix - The problem was already solved in code, just not deployed
  • Auto-deploy pipeline was broken - Why wasn't the pod updated for 45 days? (investigate CI/CD)

Verification Steps

After workflow completes:

  1. Check pod image: kubectl get pod -n ai-code-battle acb-index-builder-... -o jsonpath='{.spec.containers[0].image}'
  2. Watch pod logs: kubectl logs -n ai-code-battle acb-index-builder-... -f
  3. Verify "Build cycle completed" appears in logs
  4. Verify no CrashLoopBackOff: kubectl get pods -n ai-code-battle