# Bug Fix bf-5nap: Match Creation Stopped - Investigation Report

## Summary

Matches stopped being created after 2026-05-09T13:29:34Z (1000 matches total, May 8-9). The iad-acb kubeconfig on ex44 has expired credentials, preventing access to the production cluster.

## Problem Analysis

### Timeline

- **Last successful match**: 2026-05-09T13:29:34Z
- **Total matches created**: 1000 (May 8-9)
- **Current date**: 2026-05-13
- **Duration of outage**: ~4 days

### Root Cause (Suspected)

The iad-acb Kubernetes cluster kubeconfig on ex44 has expired credentials. The server is asking for client credentials, indicating the authentication token has expired.

**Note**: This is a different issue from the previous R2 credentials corruption (documented in IAD-ACB-R2-CREDENTIALS-FIX.md and IAD-ACB-OPENBAO-FIX.md).

## Cluster Architecture

### iad-acb Cluster Components

1. **acb-matchmaker** (Deployment, 1 replica)
   - Computes pairings
   - Enqueues job IDs into Valkey
   - Health-checks bots
   - Reaps stale jobs
   - Image: `ronaldraygun/acb-matchmaker@sha256:1a322b94e32e6cd843abe3c2beb1478f2c4893ce5d963a8d2eeff92cfe7c0e06`

2. **acb-worker** (Deployment, 2 replicas)
   - BRPOPs jobs from Valkey
   - Runs matches
   - Uploads replays to B2 (armor)
   - Writes results and Glicko-2 ratings to PostgreSQL
   - Image: `ronaldraygun/acb-worker@sha256:edd9616aaefb684a59779ea4b46b2bfe72679eecf6867e1be658273648e86bbe`

### Dependencies

- PostgreSQL: `acb-postgres:5432`
- Valkey: `valkey:6379`
- Armor (B2): `armor:9000`

## Diagnostic Steps Required

### Step 1: Renew iad-acb Token from Rackspace Spot UI

The kubeconfig token needs to be renewed from the Rackspace Spot dashboard:

1. Log in to Rackspace Spot dashboard
2. Navigate to Kubernetes clusters
3. Locate the iad-acb cluster
4. Verify the cluster still exists (may have been terminated)
5. Generate/download new kubeconfig
6. Update `/home/coding/.kube/iad-acb.kubeconfig` on ex44

### Step 2: Verify Cluster Access

Once the kubeconfig is updated:

```bash
# On ex44 server
export KUBECONFIG=/home/coding/.kube/iad-acb.kubeconfig

# Test cluster access
kubectl cluster-info
kubectl get nodes

# Check namespace
kubectl get namespace ai-code-battle
```

### Step 3: Check Matchmaker Pod Status

```bash
# Check matchmaker deployment
kubectl get deployment acb-matchmaker -n ai-code-battle

# Check matchmaker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker

# Check matchmaker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker --tail=100

# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker
```

**Expected findings:**

- Pod may be in CrashLoopBackOff or Error state
- Logs may show authentication errors or database connection issues
- Pod may be stuck trying to connect to PostgreSQL or Valkey

### Step 4: Check Worker Pod Status

```bash
# Check worker deployment
kubectl get deployment acb-worker -n ai-code-battle

# Check worker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-worker

# Check worker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker --tail=100

# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-worker
```

**Expected findings:**

- Workers may be idle (no jobs from matchmaker)
- May show R2/armor connection issues
- May show database connection errors

### Step 5: Check Dependencies

```bash
# Check PostgreSQL
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-postgres
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-postgres --tail=50

# Check Valkey
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=valkey
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=valkey --tail=50

# Check Armor (B2 gateway)
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=armor
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=armor --tail=50
```

### Step 6: Check Database State

```bash
# Access PostgreSQL
kubectl exec -it -n ai-code-battle deployment/acb-postgres -- psql -U postgres -d ai_code_battle

# In psql, check:
-- Last match created
SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 5;

-- Check for failed jobs
SELECT * FROM jobs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 10;

-- Check for stuck jobs
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10;

-- Check bot health
SELECT * FROM bots ORDER BY last_health_check DESC;
```

### Step 7: Restart Services (If Needed)

```bash
# Restart matchmaker
kubectl rollout restart deployment/acb-matchmaker -n ai-code-battle

# Restart workers
kubectl rollout restart deployment/acb-worker -n ai-code-battle

# Watch rollout status
kubectl rollout status deployment/acb-matchmaker -n ai-code-battle
kubectl rollout status deployment/acb-worker -n ai-code-battle
```

### Step 8: Verify Match Creation Resumes

```bash
# Watch matchmaker logs for activity
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker -f

# In PostgreSQL, verify new matches are being created
# Run every 30 seconds:
SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 1;

# Check worker activity
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker -f
```

## Potential Issues

### Issue 1: Cluster Terminated

**Symptoms**: `kubectl cluster-info` fails with connection refused

**Resolution**: Cluster may have been terminated in Rackspace Spot. Need to recreate cluster and restore from backups.
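
A terminated cluster and an expired kubeconfig token fail differently: the first refuses the connection, while the second reaches the API server and is asked for client credentials. The sketch below is one way to tell them apart from the error text. The function name `classify_kubectl_error` is hypothetical, and the matched substrings are assumptions about typical client error messages, so adjust them to what `kubectl` actually prints on ex44.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map kubectl error text to a likely failure class.
# The matched substrings are assumptions about typical client-side messages.
classify_kubectl_error() {
  local message=$1
  case "$message" in
    *"provide credentials"*|*"Unauthorized"*)
      echo "expired-credentials" ;;   # API server reachable, token rejected
    *"connection refused"*|*"no such host"*|*"i/o timeout"*)
      echo "cluster-unreachable" ;;   # endpoint gone or not resolvable
    *)
      echo "unknown" ;;
  esac
}

# Example wiring (commented out so the snippet has no side effects):
# if ! output=$(kubectl cluster-info 2>&1); then
#   classify_kubectl_error "$output"
# fi
```

An `expired-credentials` result points back to Step 1 (renew the kubeconfig), while `cluster-unreachable` suggests checking in the Rackspace Spot dashboard whether the cluster still exists.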
### Issue 2: Pod Image Pull Errors

**Symptoms**: Pods stuck in `ImagePullBackOff` state

**Resolution**: Check Docker Hub credentials, verify image tags exist, update `imagePullSecrets`

### Issue 3: Database Connection Failures

**Symptoms**: Logs show "connection refused" to PostgreSQL

**Resolution**: Check PostgreSQL pod is running, verify credentials in `acb-postgres-credentials` secret

### Issue 4: Valkey Connection Failures

**Symptoms**: Matchmaker can't enqueue jobs

**Resolution**: Check Valkey pod is running, verify network policies allow traffic

### Issue 5: R2/Armor Connection Failures

**Symptoms**: Workers can't upload replays

**Resolution**: Check R2 credentials (see IAD-ACB-R2-CREDENTIALS-FIX.md), verify armor pod is running

## Known Issues from Prior Incidents

1. **R2 Credentials Corruption** (IAD-ACB-R2-CREDENTIALS-FIX.md)
   - OpenBao secret at `secret/rs-manager/ai-code-battle/r2` has corrupted values
   - Endpoint and secret-key values are swapped
   - Fix: Run `/home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh`

2. **Orphaned openbao Namespace** (IAD-ACB-OPENBAO-FIX.md)
   - Status: RESOLVED
   - Was causing DNS conflicts for ESO
   - Namespace has been deleted

## Verification Checklist

After fixing the issue, verify:

- [ ] iad-acb cluster is accessible via kubectl
- [ ] Matchmaker pod is running and healthy
- [ ] Worker pods are running and healthy
- [ ] PostgreSQL is accepting connections
- [ ] Valkey is accepting connections
- [ ] Armor (B2 gateway) is accessible
- [ ] New matches are being created in the database
- [ ] Workers are processing matches and uploading replays
- [ ] No errors in matchmaker or worker logs
- [ ] Index builder can successfully run and upload to R2

## Monitoring Setup

To prevent future outages, consider:

1. **Set up alerts** for:
   - Matchmaker pod down
   - Worker pods down
   - No matches created in 1 hour
   - Failed jobs exceeding threshold

2. **Regular health checks**:
   - `kubectl get pods -n ai-code-battle`
   - Monitor database for stuck jobs
   - Check R2 upload success rate

3. **Token renewal reminders**:
   - Rackspace Spot kubeconfig tokens expire
   - Set calendar reminder for renewal 30 days before expiration

## Files Modified

- Created: `/home/coding/ai-code-battle/notes/bf-5nap.md` (this file)

## Next Steps

1. Access Rackspace Spot UI and renew iad-acb kubeconfig token
2. Update kubeconfig on ex44 at `/home/coding/.kube/iad-acb.kubeconfig`
3. Follow diagnostic steps above to identify why match creation stopped
4. Restart services as needed
5. Verify match creation resumes
6. Close bead with retrospective
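
As a possible follow-up to the "No matches created in 1 hour" alert listed under Monitoring Setup, here is a minimal sketch of the staleness check itself. The function names `match_creation_stalled` and `report_match_freshness` are hypothetical, and the commented-out query assumes `matches.created_at` is a timestamp column, as the queries in Step 6 suggest. How the alert is delivered is left open.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the newest match is older than
# the allowed age. Arguments: last match time and current time as Unix epoch
# seconds, then the maximum allowed age in seconds.
match_creation_stalled() {
  local last_match_epoch=$1 now_epoch=$2 max_age_seconds=$3
  (( now_epoch - last_match_epoch > max_age_seconds ))
}

# Hypothetical wrapper: prints one status line; the threshold defaults to 1 hour.
report_match_freshness() {
  local last_match_epoch=$1 now_epoch=$2 max_age_seconds=${3:-3600}
  if match_creation_stalled "$last_match_epoch" "$now_epoch" "$max_age_seconds"; then
    echo "ALERT: no match created in the last ${max_age_seconds}s"
  else
    echo "OK: match creation is current"
  fi
}

# Example wiring (commented out; assumes created_at is a timestamp column):
# last=$(kubectl exec -n ai-code-battle deployment/acb-postgres -- \
#   psql -U postgres -d ai_code_battle -At \
#   -c "SELECT COALESCE(EXTRACT(EPOCH FROM max(created_at))::bigint, 0) FROM matches")
# report_match_freshness "$last" "$(date +%s)"
```

Keeping the comparison in a pure function means the threshold logic can be checked without cluster access, which matters here because the kubeconfig itself is one of the things that can expire.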