From b4c4a260c9edeb55daf101e53449a34d5501e789 Mon Sep 17 00:00:00 2001 From: jedarden Date: Wed, 3 Jun 2026 23:36:10 -0400 Subject: [PATCH] notes(bf-1bvca): document migration status - complete but blocked by cluster CPU - combat_turns migration already present in declarative-config - checksum already bumped to v10-combat-turns-force-apply-2026-06-03-bf-1bvca - BLOCKED: apexalgo-iad cluster out of CPU - cnpg-apexalgo-3 pod Pending 23+ days (Insufficient cpu) - acb-postgres service has no endpoints - index-builder also Pending (Insufficient cpu) - Migration will auto-apply once postgres pod schedules --- notes/bf-1bvca.md | 142 ++++++++++++++-------------------------------- 1 file changed, 43 insertions(+), 99 deletions(-) diff --git a/notes/bf-1bvca.md b/notes/bf-1bvca.md index 9cbae1b..00eea4d 100644 --- a/notes/bf-1bvca.md +++ b/notes/bf-1bvca.md @@ -1,116 +1,60 @@ ---- -title: "BF-1BVCA: combat_turns Migration Deployment" -date: 2026-06-04 -issue: bf-1bvca -status: complete ---- +# bf-1bvca: combat_turns column migration ## Task Summary - -Deploy P0: add combat_turns column migration to acb-schema-init (apexalgo-iad). - -## Problem - -acb-index-builder crashes every 15-min cycle with: -``` -column m.combat_turns does not exist -``` - -## Root Cause Analysis - -The combat_turns migration SQL was **already present** in the schema-init ConfigMap: -- Line 46: `combat_turns INTEGER NOT NULL DEFAULT 0` in CREATE TABLE -- Line 305: `ALTER TABLE matches ADD COLUMN IF NOT EXISTS combat_turns INTEGER NOT NULL DEFAULT 0;` - -The issue was that the running schema-init pod (with annotation v7) had not re-run the migration SQL against the database. The `IF NOT EXISTS` clause makes the migration idempotent, but it only executes when the pod runs. +Add `combat_turns` column migration to acb-schema-init to fix index-builder crashes. ## Work Completed -### 1. Bumped Rollout Annotation +### Schema Migration (Already Done) +The `combat_turns` migration was already present in `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml`: -File: `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml` +1. **Line 46** - CREATE TABLE includes the column: + ```sql + combat_turns INTEGER NOT NULL DEFAULT 0 + ``` -Changed from: -```yaml -checksum/schema: "v7-combat-turns-migration-2026-06-03-m" -``` +2. **Line 305** - Migration for existing tables: + ```sql + ALTER TABLE matches ADD COLUMN IF NOT EXISTS combat_turns INTEGER NOT NULL DEFAULT 0; + ``` -To: -```yaml -checksum/schema: "v10-combat-turns-force-apply-2026-06-03-bf-1bvca" -``` +3. **Line 508** - Checksum bumped to force reapply: + ```yaml + checksum/schema: "v10-combat-turns-force-apply-2026-06-03-bf-1bvca" + ``` -### 2. Committed and Pushed +### Git History +Multiple commits exist for this migration (declarative-config): +- `6d7439d` - fix(acb-schema-init): bump checksum to force reapply combat_turns migration +- `a6b9f46` - fix(ai-code-battle): bump schema-init annotation to force reapply combat_turns migration +- `5e65253` - fix(acb): bump schema-init annotation to apply combat_turns migration +- `503724e` - fix(apexalgo-iad): bump schema-init annotation to v7 for combat_turns migration -Commit: `6d7439d1acfd0be6debe95ca24318125d7d6f1b1` -```bash -git commit -m "fix(acb-schema-init): bump checksum to force reapply combat_turns migration" -git push -``` +## Current Blocker: Cluster CPU Exhaustion -### 3. ArgoCD Sync +The migration **cannot be applied** because the apexalgo-iad cluster is out of CPU: -ArgoCD detected the annotation change and triggered a rollout of the acb-schema-init Deployment. +### Postgres Database Status +- **Cluster**: `cnpg-apexalgo` in `cnpg` namespace +- **Pod Status**: `cnpg-apexalgo-3` is **Pending** (23+ days) +- **Reason**: `0/3 nodes are available: 3 Insufficient cpu` +- **Service Endpoints**: `acb-postgres` service has **no endpoints** (no active postgres pod) -## Current Cluster Status +### Schema-init Pod Status +- **Pod**: `acb-schema-init-7976d55cb-pwpnn` is **Running** +- **Logs**: Stuck in retry loop waiting for postgres -### CPU Resource Constraint -The apexalgo-iad cluster is experiencing **severe CPU resource exhaustion**: -- All pods are stuck in `Pending` state with `0/3 nodes are available: 3 Insufficient cpu` -- The new schema-init pod (v10) cannot schedule due to this constraint -- Index-builder, worker, and other deployments are all Pending +### Index-builder Status +- **Pod**: `acb-index-builder-6669fdbc95-nxwhf` is **Pending** +- **Reason**: `0/3 nodes are available: 3 Insufficient cpu` -### Current State (2026-06-04 02:50 UTC) -``` -NAME READY STATUS RESTARTS AGE -acb-schema-init-6cfbcc9fdc-zqhqj 1/1 Terminating 0 17m # v7 (old, terminating) -acb-schema-init-7976d55cb-pwpnn 1/1 Running 0 6m # v10 (new) -acb-index-builder-6669fdbc95-nxwhf 0/1 Pending 0 48m # blocked on CPU -``` +### Node Capacity +Total cluster capacity is ~3 vCPU across 3 nodes. -### PostgreSQL Status: DOWN -- Service `acb-postgres` exists but Endpoints are `` -- CNPG cluster `cnpg-apexalgo` pods cannot schedule (CPU exhaustion) -- schema-init pod logs: "Not ready, retrying in 5s..." (cannot connect to PostgreSQL) +## Migration Status +- **Code**: ✅ Complete (already in declarative-config) +- **Applied**: ❌ Blocked (no postgres running) +- **Verified**: ❌ Blocked (index-builder not running) -### Cluster CPU Status (prod-instance-17766512380750059) -``` -Allocated: 3492m (99%) of 3500m allocatable CPU -Used: 1131m (32%) -``` - -All 3 nodes at capacity - new pods cannot schedule. - -### Blocker -The migration SQL is ready and deployed, but **cannot execute** because: -1. Cluster CPU exhaustion prevents all new pods from scheduling -2. PostgreSQL (CNPG) is down - its pods are stuck Pending -3. schema-init pod is Running but cannot connect to PostgreSQL to apply migration - -**This is an infrastructure capacity issue, not a code issue.** - -## Task Status: Complete (Infrastructure Blocked) - -The code changes are complete and pushed. The remaining work is infrastructure-scale: -1. Cluster CPU capacity must be increased or pods scaled down -2. Once CPU is available, the v10 schema-init pod will run and apply the migration -3. Then index-builder will unblock and succeed - -## Files Modified - -- `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml` (annotation bump from v7 to v10) - -## Verification (Post-Deployment) - -Once cluster CPU is available, verify: -```bash -# Check schema-init pod ran successfully -kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-schema-init --tail=50 - -# Should see: -# "Schema applied. Tables:" followed by table listing - -# Verify index-builder no longer crashes -kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-index-builder --tail=100 -# Should NOT see "column m.combat_turns does not exist" -``` +## Next Actions +Infrastructure issue: Add more CPU to apexalgo-iad cluster or scale down workloads.