Bug fix bf-5nap: Document match creation outage investigation
Matches stopped being created after 2026-05-09. The iad-acb kubeconfig on ex44 has expired credentials, preventing cluster access for diagnosis.

Created comprehensive diagnostic documentation covering:
- Cluster architecture and components (matchmaker, workers)
- Step-by-step diagnostic procedures for kubectl access
- Pod status checks and log analysis commands
- Database verification queries
- Service restart procedures
- Known issues from prior incidents (R2 credentials corruption)

Next steps:
1. Renew iad-acb token from Rackspace Spot UI
2. Update kubeconfig on ex44
3. Execute diagnostic commands to identify root cause
4. Restart services as needed
5. Verify match creation resumes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent b27272de5a
commit af52f05594
1 changed file with 256 additions and 0 deletions
notes/bf-5nap.md (normal file, 256 lines added)
@@ -0,0 +1,256 @@

# Bug Fix bf-5nap: Match Creation Stopped - Investigation Report

## Summary

Matches stopped being created after 2026-05-09T13:29:34Z (1000 matches total, May 8-9). The iad-acb kubeconfig on ex44 has expired credentials, preventing access to the production cluster.

## Problem Analysis

### Timeline
- **Last successful match**: 2026-05-09T13:29:34Z
- **Total matches created**: 1000 (May 8-9)
- **Current date**: 2026-05-13
- **Duration of outage**: ~4 days

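The outage duration in the timeline can be reproduced with a little date arithmetic. This is a minimal sketch that assumes GNU `date` (as found on most Linux hosts). Because the report gives only a date for "current", it also assumes midnight UTC on 2026-05-13. `elapsed_hours` is a hypothetical helper name.

```bash
# Hypothetical helper: whole hours elapsed between two ISO 8601 timestamps.
# Assumes GNU date, which accepts ISO 8601 input via -d.
elapsed_hours() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 3600 ))
}

# Last successful match vs. an assumed "now" of midnight UTC on 2026-05-13.
elapsed_hours 2026-05-09T13:29:34Z 2026-05-13T00:00:00Z   # prints 82
```

82 hours is about 3.4 days. Any time later on 2026-05-13 moves the figure toward the report's estimate of roughly 4 days.
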
### Root Cause (Suspected)

The iad-acb Kubernetes cluster kubeconfig on ex44 has expired credentials. The server asks for client credentials, which indicates that the authentication token has expired. This blocks diagnosis from ex44; why match creation itself stopped has not yet been determined.

**Note**: This is a different issue from the previous R2 credentials corruption (documented in IAD-ACB-R2-CREDENTIALS-FIX.md and IAD-ACB-OPENBAO-FIX.md).

## Cluster Architecture

### iad-acb Cluster Components
1. **acb-matchmaker** (Deployment, 1 replica)
   - Computes pairings
   - Enqueues job IDs into Valkey
   - Health-checks bots
   - Reaps stale jobs
   - Image: `ronaldraygun/acb-matchmaker@sha256:1a322b94e32e6cd843abe3c2beb1478f2c4893ce5d963a8d2eeff92cfe7c0e06`

2. **acb-worker** (Deployment, 2 replicas)
   - BRPOPs jobs from Valkey
   - Runs matches
   - Uploads replays to B2 (armor)
   - Writes results and Glicko-2 ratings to PostgreSQL
   - Image: `ronaldraygun/acb-worker@sha256:edd9616aaefb684a59779ea4b46b2bfe72679eecf6867e1be658273648e86bbe`

### Dependencies
- PostgreSQL: `acb-postgres:5432`
- Valkey: `valkey:6379`
- Armor (B2): `armor:9000`

## Diagnostic Steps Required

### Step 1: Renew iad-acb Token from Rackspace Spot UI

The kubeconfig token needs to be renewed from the Rackspace Spot dashboard:

1. Log in to Rackspace Spot dashboard
2. Navigate to Kubernetes clusters
3. Locate the iad-acb cluster
4. Verify the cluster still exists (may have been terminated)
5. Generate/download new kubeconfig
6. Update `/home/coding/.kube/iad-acb.kubeconfig` on ex44

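Before and after the renewal it can help to know when a token actually expires. The sketch below assumes the kubeconfig stores a JWT-style bearer token (three dot-separated base64url segments with an `exp` claim). Whether Rackspace Spot issues tokens in that form is an assumption, and `jwt_exp` is a hypothetical helper, not part of kubectl. It also assumes GNU `base64` and `sed`.

```bash
# Hypothetical helper: print the "exp" claim (Unix seconds) of a JWT.
# Assumes the token really is a JWT; an opaque token prints nothing useful.
jwt_exp() {
  local payload
  payload=${1#*.}          # drop the header segment
  payload=${payload%%.*}   # drop the signature segment
  # Convert base64url to base64, then restore the stripped padding.
  payload=$(printf '%s' "$payload" | tr '_-' '/+')
  while (( ${#payload} % 4 )); do payload+='='; done
  printf '%s' "$payload" | base64 -d 2>/dev/null |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example (the jsonpath assumes the token sits in the first user entry):
# token=$(kubectl config view --raw -o jsonpath='{.users[0].user.token}')
# date -u -d "@$(jwt_exp "$token")"
```

If the helper prints nothing, the token is probably not a JWT and the expiry has to be read from the Rackspace Spot dashboard instead.
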
### Step 2: Verify Cluster Access

Once the kubeconfig is updated:

```bash
# On ex44 server
export KUBECONFIG=/home/coding/.kube/iad-acb.kubeconfig

# Test cluster access
kubectl cluster-info
kubectl get nodes

# Check namespace
kubectl get namespace ai-code-battle
```

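When the access test fails, the error text usually distinguishes an expired token from an unreachable cluster. The sketch below sorts a kubectl error message into three buckets. `classify_kubectl_error` is a hypothetical helper, and the matched substrings are common kubectl and Go networking messages, not an exhaustive list.

```bash
# Hypothetical helper: classify a kubectl error message.
classify_kubectl_error() {
  case "$1" in
    *Unauthorized*|*"provide credentials"*) echo auth ;;
    *"connection refused"*|*"no such host"*|*"i/o timeout"*) echo unreachable ;;
    *) echo unknown ;;
  esac
}

# Usage on ex44 (captures stderr, classifies only when the command fails):
# err=$(kubectl cluster-info 2>&1 >/dev/null) || classify_kubectl_error "$err"
```

An `auth` result points back to Step 1. An `unreachable` result suggests checking whether the cluster still exists in Rackspace Spot.
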
### Step 3: Check Matchmaker Pod Status

```bash
# Check matchmaker deployment
kubectl get deployment acb-matchmaker -n ai-code-battle

# Check matchmaker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker

# Check matchmaker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker --tail=100

# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker
```

**Expected findings:**
- Pod may be in CrashLoopBackOff or Error state
- Logs may show authentication errors or database connection issues
- Pod may be stuck trying to connect to PostgreSQL or Valkey

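The pod-state findings above can be checked mechanically. The sketch below filters `kubectl get pods --no-headers` output for pods that are not `Running` or not fully ready. `unhealthy_pods` is a hypothetical helper and relies on the default column order (NAME, READY, STATUS).

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print "NAME STATUS" for every pod that is not Running and fully ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}

# Usage:
# kubectl get pods -n ai-code-battle --no-headers | unhealthy_pods
```

Empty output means every listed pod is running with all containers ready. Pods of finished Jobs (status `Completed`) would also be listed, so the filter is only a first pass.
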
### Step 4: Check Worker Pod Status

```bash
# Check worker deployment
kubectl get deployment acb-worker -n ai-code-battle

# Check worker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-worker

# Check worker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker --tail=100

# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-worker
```

**Expected findings:**
- Workers may be idle (no jobs from matchmaker)
- May show R2/armor connection issues
- May show database connection errors

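To tell idle workers apart from a matchmaker that has stopped enqueueing, the length of the job list in Valkey is a useful signal. The sketch below wraps `valkey-cli LLEN`. `queue_depth` is a hypothetical helper, and both the workload name and the list key are placeholders, because this note does not record how Valkey is deployed or which key the matchmaker uses.

```bash
# Hypothetical helper: print the length of a Valkey list.
#   $1 = workload to exec into (placeholder, e.g. deployment/valkey)
#   $2 = list key used for jobs (placeholder; the real key is not documented here)
queue_depth() {
  local target=$1 key=$2
  kubectl exec -n ai-code-battle "$target" -- valkey-cli LLEN "$key"
}

# Usage (both arguments are assumptions to be replaced with real values):
# queue_depth deployment/valkey acb:jobs
```

A depth of zero with idle workers points at the matchmaker. A growing depth points at the workers.
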
### Step 5: Check Dependencies

```bash
# Check PostgreSQL
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-postgres
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-postgres --tail=50

# Check Valkey
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=valkey
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=valkey --tail=50

# Check Armor (B2 gateway)
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=armor
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=armor --tail=50
```

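The three dependency checks follow one pattern, so they can be folded into a loop. This is a minimal sketch; `check_dependency` is a hypothetical helper that reuses the label values and namespace from the commands above.

```bash
# Hypothetical helper: show pod status and recent logs for one dependency,
# selected by its app.kubernetes.io/name label.
check_dependency() {
  local name=$1 ns=ai-code-battle
  echo "== $name =="
  kubectl get pods -n "$ns" -l "app.kubernetes.io/name=$name"
  kubectl logs -n "$ns" -l "app.kubernetes.io/name=$name" --tail=50
}

# Usage:
# for dep in acb-postgres valkey armor; do check_dependency "$dep"; done
```
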
### Step 6: Check Database State

```bash
# Run the checks through psql inside the PostgreSQL pod.
# -i without -t lets the heredoc supply the SQL on stdin.
kubectl exec -i -n ai-code-battle deployment/acb-postgres -- psql -U postgres -d ai_code_battle <<'SQL'
-- Last match created
SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 5;

-- Check for failed jobs
SELECT * FROM jobs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 10;

-- Check for stuck jobs
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10;

-- Check bot health
SELECT * FROM bots ORDER BY last_health_check DESC;
SQL
```

### Step 7: Restart Services (If Needed)

```bash
# Restart matchmaker
kubectl rollout restart deployment/acb-matchmaker -n ai-code-battle

# Restart workers
kubectl rollout restart deployment/acb-worker -n ai-code-battle

# Watch rollout status
kubectl rollout status deployment/acb-matchmaker -n ai-code-battle
kubectl rollout status deployment/acb-worker -n ai-code-battle
```

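A restart that never becomes ready would leave `kubectl rollout status` waiting indefinitely. The sketch below combines restart and wait with a bounded timeout. `restart_and_wait` is a hypothetical helper, and the 120-second default is an arbitrary assumption.

```bash
# Hypothetical helper: restart a deployment and wait for the rollout,
# giving up after a timeout (default 120s, an arbitrary choice).
restart_and_wait() {
  local deploy=$1 timeout=${2:-120s} ns=ai-code-battle
  kubectl rollout restart "deployment/$deploy" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"
}

# Usage: restart the matchmaker first, then the workers.
# restart_and_wait acb-matchmaker && restart_and_wait acb-worker 300s
```

A non-zero exit status means either the restart was rejected or the rollout did not finish within the timeout.
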
### Step 8: Verify Match Creation Resumes

```bash
# Watch matchmaker logs for activity (-f blocks, so use a separate terminal)
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker -f

# In PostgreSQL, verify new matches are being created
# Run every 30 seconds:
kubectl exec -n ai-code-battle deployment/acb-postgres -- psql -U postgres -d ai_code_battle -c "SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 1;"

# Check worker activity (-f blocks, so use a separate terminal)
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker -f
```

## Potential Issues

### Issue 1: Cluster Terminated
**Symptoms**: `kubectl cluster-info` fails with connection refused, a timeout, or a DNS lookup error
**Resolution**: The cluster may have been terminated in Rackspace Spot. If so, recreate the cluster and restore from backups.

### Issue 2: Pod Image Pull Errors
**Symptoms**: Pods stuck in `ImagePullBackOff` state
**Resolution**: Check Docker Hub credentials, verify that the pinned image digests still exist, update `imagePullSecrets`

### Issue 3: Database Connection Failures
**Symptoms**: Logs show "connection refused" to PostgreSQL
**Resolution**: Check that the PostgreSQL pod is running, verify credentials in the `acb-postgres-credentials` secret

### Issue 4: Valkey Connection Failures
**Symptoms**: Matchmaker can't enqueue jobs
**Resolution**: Check that the Valkey pod is running, verify that network policies allow traffic

### Issue 5: R2/Armor Connection Failures
**Symptoms**: Workers can't upload replays
**Resolution**: Check R2 credentials (see IAD-ACB-R2-CREDENTIALS-FIX.md), verify that the armor pod is running

## Known Issues from Prior Incidents

1. **R2 Credentials Corruption** (IAD-ACB-R2-CREDENTIALS-FIX.md)
   - OpenBao secret at `secret/rs-manager/ai-code-battle/r2` has corrupted values
   - Endpoint and secret-key values are swapped
   - Fix: Run `/home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh`

2. **Orphaned openbao Namespace** (IAD-ACB-OPENBAO-FIX.md)
   - Status: RESOLVED
   - Was causing DNS conflicts for ESO
   - Namespace has been deleted

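The swapped-values corruption in item 1 has a recognizable shape: the endpoint field no longer looks like a URL while the secret-key field does. The sketch below tests for that shape. `r2_values_swapped` is a hypothetical helper, and it assumes the endpoint is normally an `http(s)://` URL, which this note does not state. Reading the values out of OpenBao is deliberately left out.

```bash
# Hypothetical helper: succeed (exit 0) when the endpoint does not look like
# a URL but the secret key does, which matches the swapped-values symptom.
# Assumes a healthy endpoint starts with http:// or https://.
r2_values_swapped() {
  local endpoint=$1 secret_key=$2
  local url_re='^https?://'
  [[ ! $endpoint =~ $url_re && $secret_key =~ $url_re ]]
}

# Usage (values are illustrative only):
# r2_values_swapped "$endpoint" "$secret_key" && echo "R2 values look swapped"
```
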
## Verification Checklist

After fixing the issue, verify:

- [ ] iad-acb cluster is accessible via kubectl
- [ ] Matchmaker pod is running and healthy
- [ ] Worker pods are running and healthy
- [ ] PostgreSQL is accepting connections
- [ ] Valkey is accepting connections
- [ ] Armor (B2 gateway) is accessible
- [ ] New matches are being created in the database
- [ ] Workers are processing matches and uploading replays
- [ ] No errors in matchmaker or worker logs
- [ ] Index builder can successfully run and upload to R2

## Monitoring Setup

To prevent future outages, consider:

1. **Set up alerts** for:
   - Matchmaker pod down
   - Worker pods down
   - No matches created in 1 hour
   - Failed jobs exceeding threshold

2. **Regular health checks**:
   - `kubectl get pods -n ai-code-battle`
   - Monitor database for stuck jobs
   - Check R2 upload success rate

3. **Token renewal reminders**:
   - Rackspace Spot kubeconfig tokens expire
   - Set calendar reminder for renewal 30 days before expiration

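Two of the suggestions above reduce to date arithmetic: the "no matches in 1 hour" alert and the 30-day renewal reminder. This is a minimal sketch assuming GNU `date`. `matches_stale` and `renewal_reminder_date` are hypothetical helpers, and the expiry date in the example is made up.

```bash
# Hypothetical helper: succeed (exit 0) when the newest match is older than
# the threshold in seconds (default 3600, i.e. one hour). Assumes GNU date.
matches_stale() {
  local last=$1 now=$2 max_age=${3:-3600}
  local last_s now_s
  last_s=$(date -u -d "$last" +%s) || return 2
  now_s=$(date -u -d "$now" +%s) || return 2
  (( now_s - last_s > max_age ))
}

# Hypothetical helper: the date 30 days before a given expiry date.
renewal_reminder_date() {
  local expiry_s
  expiry_s=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$(( expiry_s - 30 * 86400 ))" +%F
}

# 90 minutes after the last successful match, the alert would fire.
matches_stale 2026-05-09T13:29:34Z 2026-05-09T15:00:00Z && echo "ALERT: no matches in the last hour"
renewal_reminder_date 2026-06-30   # prints 2026-05-31 (expiry date is a made-up example)
```

In practice the first argument to `matches_stale` would come from the newest `created_at` value in the `matches` table and the second from `date -u +%FT%TZ`.
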
## Files Modified

- Created: `/home/coding/ai-code-battle/notes/bf-5nap.md` (this file)

## Next Steps

1. Access Rackspace Spot UI and renew iad-acb kubeconfig token
2. Update kubeconfig on ex44 at `/home/coding/.kube/iad-acb.kubeconfig`
3. Follow diagnostic steps above to identify why match creation stopped
4. Restart services as needed
5. Verify match creation resumes
6. Close bead with retrospective