notes(bf-1bvca): document migration status - complete but blocked by cluster CPU
- combat_turns migration already present in declarative-config
- checksum already bumped to v10-combat-turns-force-apply-2026-06-03-bf-1bvca
- BLOCKED: apexalgo-iad cluster out of CPU
- cnpg-apexalgo-3 pod Pending 23+ days (Insufficient cpu)
- acb-postgres service has no endpoints
- index-builder also Pending (Insufficient cpu)
- Migration will auto-apply once postgres pod schedules
parent 0db5d3b3a8
commit b4c4a260c9
1 changed file with 43 additions and 99 deletions
@@ -1,116 +1,60 @@
---
title: "BF-1BVCA: combat_turns Migration Deployment"
date: 2026-06-04
issue: bf-1bvca
status: complete
---
# bf-1bvca: combat_turns column migration

## Task Summary

Deploy P0: add combat_turns column migration to acb-schema-init (apexalgo-iad).

## Problem

acb-index-builder crashes every 15-min cycle with:
```
column m.combat_turns does not exist
```
## Root Cause Analysis

The combat_turns migration SQL was **already present** in the schema-init ConfigMap:
- Line 46: `combat_turns INTEGER NOT NULL DEFAULT 0` in CREATE TABLE
- Line 305: `ALTER TABLE matches ADD COLUMN IF NOT EXISTS combat_turns INTEGER NOT NULL DEFAULT 0;`

The issue was that the running schema-init pod (with annotation v7) had not re-run the migration SQL against the database. The `IF NOT EXISTS` clause makes the migration idempotent, but it only executes when the pod runs.
Add `combat_turns` column migration to acb-schema-init to fix index-builder crashes.
## Work Completed

### 1. Bumped Rollout Annotation
### Schema Migration (Already Done)
The `combat_turns` migration was already present in `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml`:

File: `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml`
1. **Line 46** - CREATE TABLE includes the column:
```sql
combat_turns INTEGER NOT NULL DEFAULT 0
```

Changed from:
```yaml
checksum/schema: "v7-combat-turns-migration-2026-06-03-m"
```
2. **Line 305** - Migration for existing tables:
```sql
ALTER TABLE matches ADD COLUMN IF NOT EXISTS combat_turns INTEGER NOT NULL DEFAULT 0;
```

To:
```yaml
checksum/schema: "v10-combat-turns-force-apply-2026-06-03-bf-1bvca"
```
3. **Line 508** - Checksum bumped to force reapply:
```yaml
checksum/schema: "v10-combat-turns-force-apply-2026-06-03-bf-1bvca"
```
### 2. Committed and Pushed
### Git History
Multiple commits exist for this migration (declarative-config):
- `6d7439d` - fix(acb-schema-init): bump checksum to force reapply combat_turns migration
- `a6b9f46` - fix(ai-code-battle): bump schema-init annotation to force reapply combat_turns migration
- `5e65253` - fix(acb): bump schema-init annotation to apply combat_turns migration
- `503724e` - fix(apexalgo-iad): bump schema-init annotation to v7 for combat_turns migration

Commit: `6d7439d1acfd0be6debe95ca24318125d7d6f1b1`
```bash
git commit -m "fix(acb-schema-init): bump checksum to force reapply combat_turns migration"
git push
```
## Current Blocker: Cluster CPU Exhaustion

### 3. ArgoCD Sync
The migration **cannot be applied** because the apexalgo-iad cluster is out of CPU:

ArgoCD detected the annotation change and triggered a rollout of the acb-schema-init Deployment.
### Postgres Database Status
- **Cluster**: `cnpg-apexalgo` in `cnpg` namespace
- **Pod Status**: `cnpg-apexalgo-3` is **Pending** (23+ days)
- **Reason**: `0/3 nodes are available: 3 Insufficient cpu`
- **Service Endpoints**: `acb-postgres` service has **no endpoints** (no active postgres pod)
## Current Cluster Status
### Schema-init Pod Status
- **Pod**: `acb-schema-init-7976d55cb-pwpnn` is **Running**
- **Logs**: Stuck in retry loop waiting for postgres

### CPU Resource Constraint
The apexalgo-iad cluster is experiencing **severe CPU resource exhaustion**:
- All pods are stuck in `Pending` state with `0/3 nodes are available: 3 Insufficient cpu`
- The new schema-init pod (v10) cannot schedule due to this constraint
- Index-builder, worker, and other deployments are all Pending
### Index-builder Status
- **Pod**: `acb-index-builder-6669fdbc95-nxwhf` is **Pending**
- **Reason**: `0/3 nodes are available: 3 Insufficient cpu`

### Current State (2026-06-04 02:50 UTC)
```
NAME                                 READY   STATUS        RESTARTS   AGE
acb-schema-init-6cfbcc9fdc-zqhqj     1/1     Terminating   0          17m   # v7 (old, terminating)
acb-schema-init-7976d55cb-pwpnn      1/1     Running       0          6m    # v10 (new)
acb-index-builder-6669fdbc95-nxwhf   0/1     Pending       0          48m   # blocked on CPU
```
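
A listing like the one above can be filtered mechanically to find the pods that are blocked on scheduling. This is a minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...); `pending_pods` is a hypothetical helper name, not something from the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# prints the NAME of every pod whose STATUS column is "Pending".
# Assumes the default column order, where STATUS is the third field.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Demo against the listing captured above (no cluster access needed).
pending_pods <<'EOF'
NAME                                 READY   STATUS        RESTARTS   AGE
acb-schema-init-6cfbcc9fdc-zqhqj     1/1     Terminating   0          17m
acb-schema-init-7976d55cb-pwpnn      1/1     Running       0          6m
acb-index-builder-6669fdbc95-nxwhf   0/1     Pending       0          48m
EOF
# prints: acb-index-builder-6669fdbc95-nxwhf
```

Against a live cluster the same filter could be fed from `kubectl get pods -n ai-code-battle | pending_pods`, assuming the pods live in the `ai-code-battle` namespace used by the verification commands.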
### Node Capacity
Total cluster capacity is ~3 vCPU across 3 nodes.

### PostgreSQL Status: DOWN
- Service `acb-postgres` exists but Endpoints are `<none>`
- CNPG cluster `cnpg-apexalgo` pods cannot schedule (CPU exhaustion)
- schema-init pod logs: "Not ready, retrying in 5s..." (cannot connect to PostgreSQL)
## Migration Status
- **Code**: ✅ Complete (already in declarative-config)
- **Applied**: ❌ Blocked (no postgres running)
- **Verified**: ❌ Blocked (index-builder not running)

### Cluster CPU Status (prod-instance-17766512380750059)
```
Allocated: 3492m (99%) of 3500m allocatable CPU
Used: 1131m (32%)
```

All 3 nodes at capacity - new pods cannot schedule.
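
The figures above explain why pods stay Pending while only about a third of the CPU is actually in use: the Kubernetes scheduler places pods by their CPU requests, not by live usage. The arithmetic can be sketched as follows, assuming "Allocated" is the sum of CPU requests and that the percentages are truncated rather than rounded; `cpu_pct` and `cpu_headroom_m` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Integer percentage of a millicore figure against allocatable millicores.
# Truncates toward zero (assumption: this is how 99% and 32% were derived).
cpu_pct() {
  local part_m=$1 allocatable_m=$2
  echo $(( part_m * 100 / allocatable_m ))
}

# Millicores still available for new pod CPU requests.
cpu_headroom_m() {
  local requested_m=$1 allocatable_m=$2
  echo $(( allocatable_m - requested_m ))
}

echo "requested: $(cpu_pct 3492 3500)%"          # requested: 99%
echo "used:      $(cpu_pct 1131 3500)%"          # used:      32%
echo "headroom:  $(cpu_headroom_m 3492 3500)m"   # headroom:  8m
```

With only 8m of unrequested CPU left, any pod whose CPU request exceeds 8m is reported as `Insufficient cpu`, regardless of how idle the node is.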

### Blocker
The migration SQL is ready and deployed, but **cannot execute** because:
1. Cluster CPU exhaustion prevents all new pods from scheduling
2. PostgreSQL (CNPG) is down - its pods are stuck Pending
3. schema-init pod is Running but cannot connect to PostgreSQL to apply migration

**This is an infrastructure capacity issue, not a code issue.**

## Task Status: Complete (Infrastructure Blocked)

The code changes are complete and pushed. The remaining work is infrastructure-scale:
1. Cluster CPU capacity must be increased or pods scaled down
2. Once CPU is available, the v10 schema-init pod will run and apply the migration
3. Then index-builder will unblock and succeed

## Files Modified

- `declarative-config/k8s/apexalgo-iad/ai-code-battle/acb-schema-init.yml` (annotation bump from v7 to v10)

## Verification (Post-Deployment)

Once cluster CPU is available, verify:
```bash
# Check schema-init pod ran successfully
kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-schema-init --tail=50

# Should see:
# "Schema applied. Tables:" followed by table listing

# Verify index-builder no longer crashes
kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-index-builder --tail=100
# Should NOT see "column m.combat_turns does not exist"
```
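
The two manual log checks above could be turned into pass/fail checks. This is a rough sketch only: `schema_applied` and `logs_free_of_combat_turns_error` are hypothetical helper names, and the matched strings are the ones quoted in this note, so they would need adjusting if the real log wording differs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the log text on stdin contains
# the schema-init success line quoted in the verification steps.
schema_applied() {
  grep -qF 'Schema applied. Tables:'
}

# Hypothetical helper: fails if the log text on stdin still contains the
# missing-column error from the Problem section.
logs_free_of_combat_turns_error() {
  if grep -qF 'column m.combat_turns does not exist'; then
    echo "FAIL: combat_turns column still missing" >&2
    return 1
  fi
  echo "OK: no combat_turns error in logs"
}

# Demo on literal text (no cluster access needed).
printf 'Schema applied. Tables:\n matches\n' | schema_applied && echo "schema-init: applied"
printf 'index build finished\n' | logs_free_of_combat_turns_error
```

In practice each helper would read from the corresponding `kubectl ... logs` command shown above, for example by piping the index-builder logs into `logs_free_of_combat_turns_error`.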
## Next Actions
Infrastructure issue: Add more CPU to apexalgo-iad cluster or scale down workloads.