From be7588434d6961b2036a07c21809894f8e3a7cd3 Mon Sep 17 00:00:00 2001 From: jedarden Date: Thu, 25 Jun 2026 01:25:29 -0400 Subject: [PATCH] notes(bf-2ws): document acb-index-builder OOMKill fix and investigation - Identified root cause: pod was running 45-day-old image without LIMIT fixes - Found recent commits (79ca6c0, cdf133d, 4554bed) that added LIMIT clauses - Triggered acb-build workflow to deploy fixes - Workflow acb-build-manual-nv552 now building - Waiting for deployment to verify CrashLoopBackOff is resolved --- notes/bf-2ws.md | 73 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 73 insertions(+) create mode 100644 notes/bf-2ws.md diff --git a/notes/bf-2ws.md b/notes/bf-2ws.md new file mode 100644 index 0000000..72df023 --- /dev/null +++ b/notes/bf-2ws.md @@ -0,0 +1,73 @@ +# Fix acb-index-builder OOMKill (bf-2ws) + +## Problem +acb-index-builder pod in iad-acb cluster was in CrashLoopBackOff for 45 days with 4,713 restarts. The pod crashed silently after copying web assets - last log line was always: +``` +{"msg":"Copied web assets to output directory","source":"/app/web/dist"} +``` + +## Root Cause +**OOMKilled (Exit Code 137)** - The kernel terminated the process due to memory exhaustion. This explains the silent crash (SIGKILL cannot be caught and logged). + +### Investigation +1. Checked pod status: `kubectl describe pod` showed `Last State: Terminated, Reason: OOMKilled, Exit Code: 137` +2. Checked running version: `ronaldraygun/acb-index-builder:42e9561` from 2026-05-10 +3. Checked current repo: 410 commits ahead of running version +4. Found recent commits (79ca6c0, cdf133d, 4554bed, 404825b) that added LIMIT clauses to prevent OOMKill + +### The Issue +The running pod was **45 days out of date** and didn't have the recent LIMIT fixes that were added to the codebase to prevent unbounded queries from consuming excessive memory. + +## Fix Applied +The codebase already had all necessary fixes from recent commits: +- **commit 79ca6c0** (2026-06-25): Added LIMIT 10000 to fetchSeasonSnapshots, LIMIT 500 to fetchChampionshipBracket +- **commit cdf133d**: Added LIMIT 1000 to pair frequency query +- **commit 4554bed**: Added LIMITs to fetchEvolutionMeta and fetchLineage +- **commit 404825b**: Added LIMIT clauses to other unbounded queries + +These queries were called in loops for each season/match without bounds, causing the pod to exceed its 512Mi memory limit. + +## Resolution +Triggered a rebuild using the `acb-build` workflow template to deploy the current code with all OOMKill fixes: + +```bash +kubectl --kubeconfig=/home/coding/.kube/iad-ci.kubeconfig create -f - <<'EOF' +apiVersion: argoproj.io/v1alpha1 +kind: Workflow +metadata: + generateName: acb-build-manual- + namespace: argo-workflows +spec: + workflowTemplateRef: + name: acb-build +EOF +``` + +Workflow: `acb-build-manual-nv552` + +### What the workflow does: +1. Resolves the current git SHA +2. Runs tests +3. Builds all ACB containers including **acb-index-builder** with the LIMIT fixes +4. Updates declarative-config repo with new image tags +5. ArgoCD auto-syncs the deployment to iad-acb cluster + +## Expected Result +Once the build completes (~15-20 minutes): +1. New acb-index-builder image with LIMIT fixes will be deployed +2. Pod should run through complete build cycles without OOMKill +3. "Build cycle completed" should appear in logs +4. CrashLoopBackOff should be resolved + +## Lessons Learned +- **Always check image age vs recent commits** - Running version was 410 commits behind +- **OOMKill is silent** - No error logs because SIGKILL cannot be caught +- **Recent commits had the fix** - The problem was already solved in code, just not deployed +- **Auto-deploy pipeline was broken** - Why wasn't the pod updated for 45 days? (investigate CI/CD) + +## Verification Steps +After workflow completes: +1. Check pod image: `kubectl get pod -n ai-code-battle acb-index-builder-... -o jsonpath='{.spec.containers[0].image}'` +2. Watch pod logs: `kubectl logs -n ai-code-battle acb-index-builder-... -f` +3. Verify "Build cycle completed" appears in logs +4. Verify no CrashLoopBackOff: `kubectl get pods -n ai-code-battle`