notes(bf-1bvca): document migration status - complete but blocked by cluster CPU

- combat_turns migration already present in declarative-config
- checksum already bumped to v10-combat-turns-force-apply-2026-06-03-bf-1bvca
- BLOCKED: apexalgo-iad cluster out of CPU
  - cnpg-apexalgo-3 pod Pending 23+ days (Insufficient cpu)
  - acb-postgres service has no endpoints
  - index-builder also Pending (Insufficient cpu)
- Migration will auto-apply once postgres pod schedules
jedarden 2026-06-03 23:36:10 -04:00
parent 0db5d3b3a8
commit b4c4a260c9

@ -1,116 +1,60 @@
---
title: "BF-1BVCA: combat_turns Migration Deployment"
date: 2026-06-04
issue: bf-1bvca
status: complete
---
# bf-1bvca: combat_turns column migration
## Task Summary
Deploy P0: add combat_turns column migration to acb-schema-init (apexalgo-iad).
## Problem
acb-index-builder crashes every 15-min cycle with:
```
column m.combat_turns does not exist
```
## Root Cause Analysis
The combat_turns migration SQL was **already present** in the schema-init ConfigMap:
- Line 46: `combat_turns INTEGER NOT NULL DEFAULT 0` in CREATE TABLE
- Line 305: `ALTER TABLE matches ADD COLUMN IF NOT EXISTS combat_turns INTEGER NOT NULL DEFAULT 0;`
The issue was that the running schema-init pod (with annotation v7) had not re-run the migration SQL against the database. The `IF NOT EXISTS` clause makes the migration idempotent, but it only executes when the pod runs.
Add `combat_turns` column migration to acb-schema-init to fix index-builder crashes.
## Work Completed
### 1. Bumped Rollout Annotation
### Schema Migration (Already Done)
The `combat_turns` migration was already present in `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml`:
File: `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml`
1. **Line 46** - CREATE TABLE includes the column:
```sql
combat_turns INTEGER NOT NULL DEFAULT 0
```
Changed from:
```yaml
checksum/schema: "v7-combat-turns-migration-2026-06-03-m"
```
2. **Line 305** - Migration for existing tables:
```sql
ALTER TABLE matches ADD COLUMN IF NOT EXISTS combat_turns INTEGER NOT NULL DEFAULT 0;
```
To:
```yaml
checksum/schema: "v10-combat-turns-force-apply-2026-06-03-bf-1bvca"
```
3. **Line 508** - Checksum bumped to force reapply:
```yaml
checksum/schema: "v10-combat-turns-force-apply-2026-06-03-bf-1bvca"
```
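Why re-running the `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statement is safe can be illustrated with a minimal sketch. This is an illustration only: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and because SQLite has no `ADD COLUMN IF NOT EXISTS`, the helper `add_column_if_missing` (a hypothetical name) emulates it with an explicit column check. The `id` column is also assumed for the example.

```python
import sqlite3


def add_column_if_missing(conn, table, column, ddl):
    """Emulate ADD COLUMN IF NOT EXISTS; return True only if the column was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # already migrated, nothing to do
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")
    return True


# Stand-in for a matches table created before the migration existed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (id INTEGER PRIMARY KEY)")

# First run adds the column; a second run is a no-op instead of an error.
first = add_column_if_missing(conn, "matches", "combat_turns", "INTEGER NOT NULL DEFAULT 0")
second = add_column_if_missing(conn, "matches", "combat_turns", "INTEGER NOT NULL DEFAULT 0")
print(first, second)
```

The same property is what makes forcing a schema-init rerun via the checksum bump harmless: a database that already has the column is left unchanged.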
### 2. Committed and Pushed
### Git History
Multiple commits exist for this migration (declarative-config):
- `6d7439d` - fix(acb-schema-init): bump checksum to force reapply combat_turns migration
- `a6b9f46` - fix(ai-code-battle): bump schema-init annotation to force reapply combat_turns migration
- `5e65253` - fix(acb): bump schema-init annotation to apply combat_turns migration
- `503724e` - fix(apexalgo-iad): bump schema-init annotation to v7 for combat_turns migration
Commit: `6d7439d1acfd0be6debe95ca24318125d7d6f1b1`
```bash
git commit -m "fix(acb-schema-init): bump checksum to force reapply combat_turns migration"
git push
```
## Current Blocker: Cluster CPU Exhaustion
### 3. ArgoCD Sync
The migration **cannot be applied** because the apexalgo-iad cluster is out of CPU:
ArgoCD detected the annotation change and triggered a rollout of the acb-schema-init Deployment.
### Postgres Database Status
- **Cluster**: `cnpg-apexalgo` in `cnpg` namespace
- **Pod Status**: `cnpg-apexalgo-3` is **Pending** (23+ days)
- **Reason**: `0/3 nodes are available: 3 Insufficient cpu`
- **Service Endpoints**: `acb-postgres` service has **no endpoints** (no active postgres pod)
## Current Cluster Status
### Schema-init Pod Status
- **Pod**: `acb-schema-init-7976d55cb-pwpnn` is **Running**
- **Logs**: Stuck in retry loop waiting for postgres
### CPU Resource Constraint
The apexalgo-iad cluster is experiencing **severe CPU resource exhaustion**:
- All pods are stuck in `Pending` state with `0/3 nodes are available: 3 Insufficient cpu`
- The new schema-init pod (v10) cannot schedule due to this constraint
- Index-builder, worker, and other deployments are all Pending
### Index-builder Status
- **Pod**: `acb-index-builder-6669fdbc95-nxwhf` is **Pending**
- **Reason**: `0/3 nodes are available: 3 Insufficient cpu`
### Current State (2026-06-04 02:50 UTC)
```
NAME                                 READY   STATUS        RESTARTS   AGE
acb-schema-init-6cfbcc9fdc-zqhqj     1/1     Terminating   0          17m   # v7 (old, terminating)
acb-schema-init-7976d55cb-pwpnn      1/1     Running       0          6m    # v10 (new)
acb-index-builder-6669fdbc95-nxwhf   0/1     Pending       0          48m   # blocked on CPU
```
### Node Capacity
Total cluster capacity is ~3 vCPU across 3 nodes.
### PostgreSQL Status: DOWN
- Service `acb-postgres` exists but Endpoints are `<none>`
- CNPG cluster `cnpg-apexalgo` pods cannot schedule (CPU exhaustion)
- schema-init pod logs: "Not ready, retrying in 5s..." (cannot connect to PostgreSQL)
## Migration Status
- **Code**: ✅ Complete (already in declarative-config)
- **Applied**: ❌ Blocked (no postgres running)
- **Verified**: ❌ Blocked (index-builder not running)
### Cluster CPU Status (prod-instance-17766512380750059)
```
Allocated: 3492m (99%) of 3500m allocatable CPU
Used: 1131m (32%)
```
All 3 nodes at capacity - new pods cannot schedule.
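The gap between 99% allocated and 32% used matters because the Kubernetes scheduler places pods based on CPU requests, not on measured usage. A minimal sketch of the headroom arithmetic, using the figures above, follows. The `100m` pod request is a hypothetical value chosen for illustration, and the check treats the figures as a single pool, whereas the real scheduler evaluates each node separately, so the actual headroom per node can only be smaller.

```python
def parse_millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity such as '3492m' or '2' to millicores."""
    value = value.strip()
    if value.endswith("m"):
        return int(value[:-1])
    # Whole or fractional cores, e.g. '2' or '0.5'.
    return int(round(float(value) * 1000))


def cpu_headroom(allocatable: str, requested: str) -> int:
    """Millicores still available for new pod requests."""
    return parse_millicores(allocatable) - parse_millicores(requested)


def fits(allocatable: str, requested: str, pod_request: str) -> bool:
    """True if a pod with the given CPU request fits in the remaining headroom."""
    return parse_millicores(pod_request) <= cpu_headroom(allocatable, requested)


# Figures from the status output above.
headroom = cpu_headroom("3500m", "3492m")
print(f"headroom: {headroom}m")

# Hypothetical 100m request: rejected even though measured usage is only 1131m.
print(fits("3500m", "3492m", "100m"))
```

Under these assumptions only 8m remains requestable, so any pod with a larger CPU request stays Pending regardless of how idle the nodes are.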
### Blocker
The migration SQL is ready and deployed, but **cannot execute** because:
1. Cluster CPU exhaustion prevents all new pods from scheduling
2. PostgreSQL (CNPG) is down - its pods are stuck Pending
3. schema-init pod is Running but cannot connect to PostgreSQL to apply migration
**This is an infrastructure capacity issue, not a code issue.**
## Task Status: Complete (Infrastructure Blocked)
The code changes are complete and pushed. The remaining work is on the infrastructure side:
1. Cluster CPU capacity must be increased or pods scaled down
2. Once CPU is available, the v10 schema-init pod will run and apply the migration
3. Then index-builder will unblock and succeed
## Files Modified
- `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml` (annotation bump from v7 to v10)
## Verification (Post-Deployment)
Once cluster CPU is available, verify:
```bash
# Check schema-init pod ran successfully
kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-schema-init --tail=50
# Should see:
# "Schema applied. Tables:" followed by table listing
# Verify index-builder no longer crashes
kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-index-builder --tail=100
# Should NOT see "column m.combat_turns does not exist"
```
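The two log checks above could be automated with a small sketch like the following. It only inspects log text that has already been captured (for example, the output of the `kubectl logs` commands above saved to a string); the function names are hypothetical, and the marker strings are the ones quoted in these notes.

```python
# Marker strings taken from the notes above.
SCHEMA_APPLIED_MARKER = "Schema applied. Tables:"
MISSING_COLUMN_ERROR = "column m.combat_turns does not exist"
POSTGRES_RETRY_MARKER = "Not ready, retrying in 5s..."


def schema_init_succeeded(log_text: str) -> bool:
    """True if the schema-init log shows the schema was applied."""
    return SCHEMA_APPLIED_MARKER in log_text


def schema_init_waiting(log_text: str) -> bool:
    """True if schema-init is still retrying and has not applied the schema yet."""
    return POSTGRES_RETRY_MARKER in log_text and not schema_init_succeeded(log_text)


def index_builder_healthy(log_text: str) -> bool:
    """True if the index-builder log no longer reports the missing column."""
    return MISSING_COLUMN_ERROR not in log_text


# Example with log excerpts shaped like the ones described above.
blocked_log = "Not ready, retrying in 5s...\nNot ready, retrying in 5s...\n"
applied_log = "Not ready, retrying in 5s...\nSchema applied. Tables:\n matches\n"
print(schema_init_waiting(blocked_log), schema_init_succeeded(applied_log))
```

An absent error string is only meaningful if index-builder has actually completed a cycle, so the check should run on logs collected after the pod has been Running for at least one 15-minute cycle.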
## Next Actions
Infrastructure issue: add CPU capacity to the apexalgo-iad cluster or scale down existing workloads.