Bug fix bf-5nap: Document match creation outage investigation

Matches stopped being created after 2026-05-09. The iad-acb kubeconfig on
ex44 has expired credentials, preventing cluster access for diagnosis.

Created comprehensive diagnostic documentation covering:
- Cluster architecture and components (matchmaker, workers)
- Step-by-step diagnostic procedures for kubectl access
- Pod status checks and log analysis commands
- Database verification queries
- Service restart procedures
- Known issues from prior incidents (R2 credentials corruption)

Next steps:
1. Renew iad-acb token from Rackspace Spot UI
2. Update kubeconfig on ex44
3. Execute diagnostic commands to identify root cause
4. Restart services as needed
5. Verify match creation resumes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-13 17:52:47 -04:00
parent b27272de5a
commit af52f05594

notes/bf-5nap.md

@ -0,0 +1,256 @@
# Bug Fix bf-5nap: Match Creation Stopped - Investigation Report
## Summary
Matches stopped being created after 2026-05-09T13:29:34Z (1000 matches total, May 8-9). The iad-acb kubeconfig on ex44 has expired credentials, preventing access to the production cluster.
## Problem Analysis
### Timeline
- **Last successful match**: 2026-05-09T13:29:34Z
- **Total matches created**: 1000 (May 8-9)
- **Current date**: 2026-05-13
- **Duration of outage**: ~4 days
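
The outage duration can be recomputed from the two timestamps rather than estimated. The sketch below assumes GNU `date` (for `-d`) and uses the commit time of this report (2026-05-13 17:52:47 -04:00, i.e. 21:52:47Z) as the end point; the `outage_hours` helper name is hypothetical.

```bash
# Hypothetical helper: whole hours elapsed between two ISO 8601 UTC timestamps.
# Assumes GNU date, which accepts a timestamp string via -d.
outage_hours() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 3600 ))
}

# From the last successful match to the time this report was committed.
outage_hours 2026-05-09T13:29:34Z 2026-05-13T21:52:47Z
```

This prints `104`, which is about 4.3 days and consistent with the "~4 days" figure above.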
### Root Cause (Suspected)
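Not yet determined. The iad-acb Kubernetes cluster kubeconfig on ex44 has expired credentials: the server is asking for client credentials, which indicates the authentication token has expired. This blocks diagnosis, but an expired kubeconfig on ex44 would not by itself be expected to stop workloads already running in the cluster, so the cause of the outage still has to be confirmed once access is restored.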
**Note**: This is a different issue from the previous R2 credentials corruption (documented in IAD-ACB-R2-CREDENTIALS-FIX.md and IAD-ACB-OPENBAO-FIX.md).
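
If the kubeconfig authenticates with a JWT bearer token (an assumption; the file may use a different credential type), the expiry can be read from the token's `exp` claim without contacting the cluster. A minimal sketch follows; `jwt_exp` is a hypothetical helper, and the usage comment assumes the token sits in the first `users` entry of the kubeconfig.

```bash
# Hypothetical helper: print the "exp" claim (Unix seconds) of a JWT.
jwt_exp() {
  local payload pad
  # The payload is the second dot-separated field, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url omits.
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do payload="$payload="; pad=$((pad - 1)); done
  printf '%s' "$payload" | base64 -d 2>/dev/null \
    | grep -o '"exp":[[:space:]]*[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Usage on ex44 (assumption: token stored under users[0].user.token):
#   token=$(kubectl config view --raw -o jsonpath='{.users[0].user.token}' \
#             --kubeconfig /home/coding/.kube/iad-acb.kubeconfig)
#   date -u -d "@$(jwt_exp "$token")"
```

If the function prints nothing, the credential is probably not a JWT and the expiry has to be checked in the Rackspace Spot dashboard instead.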
## Cluster Architecture
### iad-acb Cluster Components
1. **acb-matchmaker** (Deployment, 1 replica)
   - Computes pairings
   - Enqueues job IDs into Valkey
   - Health-checks bots
   - Reaps stale jobs
   - Image: `ronaldraygun/acb-matchmaker@sha256:1a322b94e32e6cd843abe3c2beb1478f2c4893ce5d963a8d2eeff92cfe7c0e06`
2. **acb-worker** (Deployment, 2 replicas)
   - BRPOPs jobs from Valkey
   - Runs matches
   - Uploads replays to B2 (armor)
   - Writes results and Glicko-2 ratings to PostgreSQL
   - Image: `ronaldraygun/acb-worker@sha256:edd9616aaefb684a59779ea4b46b2bfe72679eecf6867e1be658273648e86bbe`
### Dependencies
- PostgreSQL: `acb-postgres:5432`
- Valkey: `valkey:6379`
- Armor (B2): `armor:9000`
## Diagnostic Steps Required
### Step 1: Renew iad-acb Token from Rackspace Spot UI
The kubeconfig token needs to be renewed from the Rackspace Spot dashboard:
1. Log in to Rackspace Spot dashboard
2. Navigate to Kubernetes clusters
3. Locate the iad-acb cluster
4. Verify the cluster still exists (may have been terminated)
5. Generate/download new kubeconfig
6. Update `/home/coding/.kube/iad-acb.kubeconfig` on ex44
### Step 2: Verify Cluster Access
Once the kubeconfig is updated:
```bash
# On ex44 server
export KUBECONFIG=/home/coding/.kube/iad-acb.kubeconfig
# Test cluster access
kubectl cluster-info
kubectl get nodes
# Check namespace
kubectl get namespace ai-code-battle
```
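
When `kubectl cluster-info` fails, the error text usually separates an authentication problem from an unreachable API server. The sketch below maps common client error strings to a coarse label. `classify_kubectl_error` and its labels are hypothetical, and the matched phrases are ones kubectl and the underlying network stack typically print, not an exhaustive list.

```bash
# Hypothetical helper: classify a kubectl error message passed as $1.
classify_kubectl_error() {
  case "$1" in
    # Expired or missing credentials
    *Unauthorized*|*"provide credentials"*) echo "auth-expired" ;;
    # API server not reachable (cluster may be gone or the endpoint changed)
    *"connection refused"*|*"no such host"*|*"i/o timeout"*) echo "unreachable" ;;
    *) echo "unknown" ;;
  esac
}

# Usage: capture stderr and classify it only when the command fails.
#   err=$(kubectl cluster-info 2>&1 >/dev/null) || classify_kubectl_error "$err"
```

An `auth-expired` result points back to Step 1, while `unreachable` suggests checking whether the cluster still exists.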
### Step 3: Check Matchmaker Pod Status
```bash
# Check matchmaker deployment
kubectl get deployment acb-matchmaker -n ai-code-battle
# Check matchmaker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker
# Check matchmaker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker --tail=100
# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker
```
**Expected findings:**
- Pod may be in CrashLoopBackOff or Error state
- Logs may show authentication errors or database connection issues
- Pod may be stuck trying to connect to PostgreSQL or Valkey
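
To spot the states listed above quickly, the pod listing can be filtered down to pods that are not fully healthy. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper that relies on the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE).

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print pods that are not Running or whose READY count is below the total.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, "restarts=" $4
  }'
}

# Usage:
#   kubectl get pods -n ai-code-battle --no-headers | unhealthy_pods
```

Empty output means every pod in the namespace is Running and fully ready; it does not prove the matchmaker is actually enqueuing jobs, so the log checks are still needed.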
### Step 4: Check Worker Pod Status
```bash
# Check worker deployment
kubectl get deployment acb-worker -n ai-code-battle
# Check worker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-worker
# Check worker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker --tail=100
# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-worker
```
**Expected findings:**
- Workers may be idle (no jobs from matchmaker)
- May show R2/armor connection issues
- May show database connection errors
### Step 5: Check Dependencies
```bash
# Check PostgreSQL
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-postgres
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-postgres --tail=50
# Check Valkey
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=valkey
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=valkey --tail=50
# Check Armor (B2 gateway)
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=armor
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=armor --tail=50
```
### Step 6: Check Database State
```bash
# Run the checks non-interactively through psql (a heredoc needs -i, not -it)
kubectl exec -i -n ai-code-battle deployment/acb-postgres -- psql -U postgres -d ai_code_battle <<'SQL'
-- Last match created
SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 5;
-- Check for failed jobs
SELECT * FROM jobs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 10;
-- Check for stuck jobs
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10;
-- Check bot health
SELECT * FROM bots ORDER BY last_health_check DESC;
SQL
```
### Step 7: Restart Services (If Needed)
```bash
# Restart matchmaker
kubectl rollout restart deployment/acb-matchmaker -n ai-code-battle
# Restart workers
kubectl rollout restart deployment/acb-worker -n ai-code-battle
# Watch rollout status
kubectl rollout status deployment/acb-matchmaker -n ai-code-battle
kubectl rollout status deployment/acb-worker -n ai-code-battle
```
### Step 8: Verify Match Creation Resumes
```bash
# Watch matchmaker logs for activity (-f blocks, so run each follow in its own terminal)
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker -f
# Verify new matches are being created by polling PostgreSQL every 30 seconds
watch -n 30 'kubectl exec -n ai-code-battle deployment/acb-postgres -- psql -U postgres -d ai_code_battle -c "SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 1;"'
# Check worker activity
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker -f
```
## Potential Issues
### Issue 1: Cluster Terminated
**Symptoms**: `kubectl cluster-info` fails with connection refused, a timeout, or a DNS lookup error even after the kubeconfig is renewed
**Resolution**: Cluster may have been terminated in Rackspace Spot. Need to recreate cluster and restore from backups.
### Issue 2: Pod Image Pull Errors
**Symptoms**: Pods stuck in `ImagePullBackOff` state
**Resolution**: Check Docker Hub credentials, verify the pinned image digests still exist in the registry, update `imagePullSecrets`
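
Both images listed under Cluster Architecture are pinned by digest rather than by tag, so a useful first check is whether the deployments still reference a well-formed digest. A minimal sketch, with `is_digest_pinned` as a hypothetical helper:

```bash
# Hypothetical helper: succeed if an image reference has the form name@sha256:<64 hex>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
}

# Usage: print the image a deployment currently runs, then validate it.
#   image=$(kubectl get deployment acb-matchmaker -n ai-code-battle \
#             -o jsonpath='{.spec.template.spec.containers[0].image}')
#   is_digest_pinned "$image" && echo "pinned: $image" || echo "not pinned: $image"
```

This only validates the reference format; whether the digest can still be pulled is shown by the pod events in `kubectl describe pod`.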
### Issue 3: Database Connection Failures
**Symptoms**: Logs show "connection refused" to PostgreSQL
**Resolution**: Check PostgreSQL pod is running, verify credentials in `acb-postgres-credentials` secret
### Issue 4: Valkey Connection Failures
**Symptoms**: Matchmaker can't enqueue jobs
**Resolution**: Check Valkey pod is running, verify network policies allow traffic
### Issue 5: R2/Armor Connection Failures
**Symptoms**: Workers can't upload replays
**Resolution**: Check R2 credentials (see IAD-ACB-R2-CREDENTIALS-FIX.md), verify armor pod is running
## Known Issues from Prior Incidents
1. **R2 Credentials Corruption** (IAD-ACB-R2-CREDENTIALS-FIX.md)
   - OpenBao secret at `secret/rs-manager/ai-code-battle/r2` has corrupted values
   - Endpoint and secret-key values are swapped
   - Fix: Run `/home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh`
2. **Orphaned openbao Namespace** (IAD-ACB-OPENBAO-FIX.md)
   - Status: RESOLVED
   - Was causing DNS conflicts for ESO
   - Namespace has been deleted
## Verification Checklist
After fixing the issue, verify:
- [ ] iad-acb cluster is accessible via kubectl
- [ ] Matchmaker pod is running and healthy
- [ ] Worker pods are running and healthy
- [ ] PostgreSQL is accepting connections
- [ ] Valkey is accepting connections
- [ ] Armor (B2 gateway) is accessible
- [ ] New matches are being created in the database
- [ ] Workers are processing matches and uploading replays
- [ ] No errors in matchmaker or worker logs
- [ ] Index builder can successfully run and upload to R2
## Monitoring Setup
To prevent future outages, consider:
1. **Set up alerts** for:
   - Matchmaker pod down
   - Worker pods down
   - No matches created in 1 hour
   - Failed jobs exceeding threshold
2. **Regular health checks**:
   - `kubectl get pods -n ai-code-battle`
   - Monitor database for stuck jobs
   - Check R2 upload success rate
3. **Token renewal reminders**:
   - Rackspace Spot kubeconfig tokens expire
   - Set calendar reminder for renewal 30 days before expiration
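
The "no matches created in 1 hour" alert can be prototyped as a small check that a cron job or monitoring agent runs. This is a sketch under stated assumptions: `match_gap_exceeded` is a hypothetical helper, timestamps are passed as Unix seconds, and how the alert is delivered is left open.

```bash
# Hypothetical check: succeed (exit 0) when the newest match is older than the threshold.
# Args: last match time (Unix seconds), threshold (seconds), optional current time.
match_gap_exceeded() {
  local last threshold now
  last=$1
  threshold=$2
  now=${3:-$(date +%s)}
  [ $(( now - last )) -gt "$threshold" ]
}

# Usage (reuses the matches table and created_at column queried in Step 6):
#   last=$(kubectl exec -n ai-code-battle deployment/acb-postgres -- \
#            psql -U postgres -d ai_code_battle -At \
#            -c "SELECT extract(epoch FROM max(created_at))::bigint FROM matches;")
#   match_gap_exceeded "$last" 3600 && echo "ALERT: no matches created in the last hour"
```

Because the check needs cluster access, it would have failed together with the expired kubeconfig; running it from inside the cluster, or alerting when the check itself errors, avoids that blind spot.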
## Files Modified
- Created: `/home/coding/ai-code-battle/notes/bf-5nap.md` (this file)
## Next Steps
1. Access Rackspace Spot UI and renew iad-acb kubeconfig token
2. Update kubeconfig on ex44 at `/home/coding/.kube/iad-acb.kubeconfig`
3. Follow diagnostic steps above to identify why match creation stopped
4. Restart services as needed
5. Verify match creation resumes
6. Close bead with retrospective