Matches stopped being created after 2026-05-09. The iad-acb kubeconfig on ex44 has expired credentials, preventing cluster access for diagnosis.
Created comprehensive diagnostic documentation covering:
- Cluster architecture and components (matchmaker, workers)
- Step-by-step diagnostic procedures for kubectl access
- Pod status checks and log analysis commands
- Database verification queries
- Service restart procedures
- Known issues from prior incidents (R2 credentials corruption)
Next steps:
1. Renew iad-acb token from Rackspace Spot UI
2. Update kubeconfig on ex44
3. Execute diagnostic commands to identify root cause
4. Restart services as needed
5. Verify match creation resumes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug Fix bf-5nap: Match Creation Stopped - Investigation Report
Summary
Matches stopped being created after 2026-05-09T13:29:34Z (1000 matches total, May 8-9). The iad-acb kubeconfig on ex44 has expired credentials, preventing access to the production cluster.
Problem Analysis
Timeline
- Last successful match: 2026-05-09T13:29:34Z
- Total matches created: 1000 (May 8-9)
- Current date: 2026-05-13
- Duration of outage: ~4 days
Root Cause (Suspected)
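The iad-acb Kubernetes cluster kubeconfig on ex44 has expired credentials: kubectl reports that the server has asked for the client to provide credentials, which indicates the authentication token has expired. This only explains why diagnosis is blocked. Why match creation stopped inside the cluster is still undetermined until access is restored.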
Note: This is a different issue from the previous R2 credentials corruption (documented in IAD-ACB-R2-CREDENTIALS-FIX.md and IAD-ACB-OPENBAO-FIX.md).
Cluster Architecture
iad-acb Cluster Components
- acb-matchmaker (Deployment, 1 replica)
  - Computes pairings
  - Enqueues job IDs into Valkey
  - Health-checks bots
  - Reaps stale jobs
  - Image: ronaldraygun/acb-matchmaker@sha256:1a322b94e32e6cd843abe3c2beb1478f2c4893ce5d963a8d2eeff92cfe7c0e06
- acb-worker (Deployment, 2 replicas)
  - BRPOPs jobs from Valkey
  - Runs matches
  - Uploads replays to B2 (armor)
  - Writes results and Glicko-2 ratings to PostgreSQL
  - Image: ronaldraygun/acb-worker@sha256:edd9616aaefb684a59779ea4b46b2bfe72679eecf6867e1be658273648e86bbe
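Because the matchmaker enqueues job IDs and the workers BRPOP them, the length of the Valkey list hints at which side has stalled. The following is a minimal sketch, not the project's tooling. The queue key `acb:jobs` is hypothetical (the real key is not recorded in these notes), and it assumes Valkey runs as a Deployment named `valkey` whose image ships `valkey-cli`.

```shell
# Hypothetical queue key; the real key name is not given in these notes.
QUEUE_KEY="${QUEUE_KEY:-acb:jobs}"
NAMESPACE="${NAMESPACE:-ai-code-battle}"

# Print the number of queued job IDs (LLEN on the list the workers BRPOP from).
# Assumes a Deployment named "valkey" whose image includes valkey-cli.
queue_depth() {
  kubectl exec -n "$NAMESPACE" deployment/valkey -- valkey-cli LLEN "$QUEUE_KEY"
}

# Turn a queue depth into a hint about which side of the pipeline to inspect.
interpret_depth() {
  depth="$1"
  if [ "$depth" -eq 0 ]; then
    echo "queue empty: matchmaker is not enqueuing, or workers are keeping up"
  else
    echo "queue has $depth jobs waiting: if this does not fall, workers are not draining it"
  fi
}
```

Usage, once cluster access works: `interpret_depth "$(queue_depth)"`. Sampling it twice a minute apart shows whether the depth is moving.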
Dependencies
- PostgreSQL: acb-postgres:5432
- Valkey: valkey:6379
- Armor (B2): armor:9000
Diagnostic Steps Required
Step 1: Renew iad-acb Token from Rackspace Spot UI
The kubeconfig token needs to be renewed from the Rackspace Spot dashboard:
- Log in to Rackspace Spot dashboard
- Navigate to Kubernetes clusters
- Locate the iad-acb cluster
- Verify the cluster still exists (may have been terminated)
- Generate/download new kubeconfig
- Update /home/coding/.kube/iad-acb.kubeconfig on ex44
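After replacing the kubeconfig it is worth recording when the new token expires. The sketch below assumes the kubeconfig stores a bearer token in the first user entry and that this token is a JWT with an `exp` claim. Neither assumption is confirmed by these notes, and the helper names are illustrative.

```shell
# Decode one base64url segment (JWT segments omit '=' padding).
b64url_decode() {
  s=$(printf '%s' "$1" | tr '_-' '/+')
  case $(( ${#s} % 4 )) in
    2) s="${s}==" ;;
    3) s="${s}=" ;;
  esac
  printf '%s' "$s" | base64 -d
}

# Print the "exp" claim (Unix seconds) of a JWT, or nothing if it has none.
jwt_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2)
  b64url_decode "$payload" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Whole days from now (arg 2) until expiry (arg 1), both in Unix seconds.
# A result at or below zero means expired or expiring within a day.
days_until() {
  echo $(( ($1 - $2) / 86400 ))
}

# Example (assumes the token sits in the first user entry of the kubeconfig):
# token=$(kubectl config view --raw \
#   --kubeconfig /home/coding/.kube/iad-acb.kubeconfig \
#   -o jsonpath='{.users[0].user.token}')
# days_until "$(jwt_exp "$token")" "$(date +%s)"
```

If the token turns out not to be a JWT, `jwt_exp` prints nothing and the expiry date has to be read from the Rackspace Spot UI instead.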
Step 2: Verify Cluster Access
Once the kubeconfig is updated:
# On ex44 server
export KUBECONFIG=/home/coding/.kube/iad-acb.kubeconfig
# Test cluster access
kubectl cluster-info
kubectl get nodes
# Check namespace
kubectl get namespace ai-code-battle
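If access still fails, the error text separates an expired token from an unreachable or terminated cluster. A small sketch that classifies kubectl's stderr follows. The matched strings are common kubectl and Go network error fragments, not an exhaustive list, and the function names are illustrative.

```shell
# Classify the stderr text of a failed kubectl call.
classify_kubectl_error() {
  case "$1" in
    *Unauthorized*|*"provide credentials"*|*"must be logged in"*)
      echo "auth: credentials rejected, renew the kubeconfig" ;;
    *"connection refused"*|*"no such host"*|*"i/o timeout"*)
      echo "network: API server unreachable, cluster may be gone" ;;
    *)
      echo "unknown" ;;
  esac
}

# Run a cheap API call and report what kind of failure, if any, occurred.
check_access() {
  if err=$(kubectl get nodes 2>&1 >/dev/null); then
    echo "ok"
  else
    classify_kubectl_error "$err"
  fi
}
```

Run `check_access` with `KUBECONFIG` exported as above. An `auth` result points back to Step 1, a `network` result to Issue 1 under Potential Issues.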
Step 3: Check Matchmaker Pod Status
# Check matchmaker deployment
kubectl get deployment acb-matchmaker -n ai-code-battle
# Check matchmaker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker
# Check matchmaker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker --tail=100
# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker
Expected findings:
- Pod may be in CrashLoopBackOff or Error state
- Logs may show authentication errors or database connection issues
- Pod may be stuck trying to connect to PostgreSQL or Valkey
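To spot the states listed above without reading the full table, the pod listing can be filtered. This is a sketch that assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Read `kubectl get pods --no-headers` output on stdin and print
# "NAME STATUS RESTARTS" for every pod that is not Running with all
# containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $4
  }'
}
```

Usage: `kubectl get pods -n ai-code-battle --no-headers | unhealthy_pods`. Empty output means every pod in the namespace is Running and ready.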
Step 4: Check Worker Pod Status
# Check worker deployment
kubectl get deployment acb-worker -n ai-code-battle
# Check worker pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-worker
# Check worker logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker --tail=100
# Check for crash loops
kubectl describe pod -n ai-code-battle -l app.kubernetes.io/name=acb-worker
Expected findings:
- Workers may be idle (no jobs from matchmaker)
- May show R2/armor connection issues
- May show database connection errors
Step 5: Check Dependencies
# Check PostgreSQL
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-postgres
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-postgres --tail=50
# Check Valkey
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=valkey
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=valkey --tail=50
# Check Armor (B2 gateway)
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=armor
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=armor --tail=50
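The same two commands repeat for each dependency, so a loop keeps the namespace and label selector consistent. A minimal sketch, assuming every component carries the `app.kubernetes.io/name` label used above:

```shell
NAMESPACE="${NAMESPACE:-ai-code-battle}"

# Show pod status and the last 50 log lines for each named component.
check_components() {
  for app in "$@"; do
    echo "== $app =="
    kubectl get pods -n "$NAMESPACE" -l "app.kubernetes.io/name=$app"
    kubectl logs -n "$NAMESPACE" -l "app.kubernetes.io/name=$app" --tail=50
  done
}
```

Usage: `check_components acb-postgres valkey armor`.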
Step 6: Check Database State
# Access PostgreSQL
kubectl exec -it -n ai-code-battle deployment/acb-postgres -- psql -U postgres -d ai_code_battle
# In psql, check:
-- Last match created
SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 5;
-- Check for failed jobs
SELECT * FROM jobs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 10;
-- Check for stuck jobs
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10;
-- Check bot health
SELECT * FROM bots ORDER BY last_health_check DESC;
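For a quick overview without an interactive session, the job states can be counted in one call. The sketch below reuses the database name, user, and `jobs.status` column from the queries above. `run_sql` and `job_counts` are illustrative helper names.

```shell
NAMESPACE="${NAMESPACE:-ai-code-battle}"

# Run one SQL statement non-interactively.
# -A: unaligned output, -t: rows only (no header or footer).
run_sql() {
  kubectl exec -n "$NAMESPACE" deployment/acb-postgres -- \
    psql -U postgres -d ai_code_battle -At -c "$1"
}

# Print one "status|count" line per job status.
job_counts() {
  run_sql "SELECT status, count(*) FROM jobs GROUP BY status ORDER BY status;"
}
```

A large and growing `pending` count suggests the workers are not consuming jobs, and no recent rows at all suggests the matchmaker is not producing them.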
Step 7: Restart Services (If Needed)
# Restart matchmaker
kubectl rollout restart deployment/acb-matchmaker -n ai-code-battle
# Restart workers
kubectl rollout restart deployment/acb-worker -n ai-code-battle
# Watch rollout status
kubectl rollout status deployment/acb-matchmaker -n ai-code-battle
kubectl rollout status deployment/acb-worker -n ai-code-battle
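`kubectl rollout status` waits until the rollout finishes, which can hang on a pod that never becomes ready. A sketch that bounds the wait with `--timeout` and stops at the first failure follows. The 120 second limit is an arbitrary choice.

```shell
NAMESPACE="${NAMESPACE:-ai-code-battle}"

# Restart each deployment and wait for its rollout; stop at the first failure.
restart_and_wait() {
  for deploy in "$@"; do
    kubectl rollout restart "deployment/$deploy" -n "$NAMESPACE" || return 1
    kubectl rollout status "deployment/$deploy" -n "$NAMESPACE" --timeout=120s || {
      echo "rollout of $deploy did not finish within 120s" >&2
      return 1
    }
  done
}
```

Usage: `restart_and_wait acb-matchmaker acb-worker`. A non-zero exit status means the pods need inspection with the commands from Steps 3 and 4 before going further.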
Step 8: Verify Match Creation Resumes
# Watch matchmaker logs for activity
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-matchmaker -f
# In PostgreSQL, verify new matches are being created
# Run every 30 seconds:
SELECT id, created_at FROM matches ORDER BY created_at DESC LIMIT 1;
# Check worker activity
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-worker -f
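The manual 30-second check can be scripted as a bounded poll. This is a sketch built on the query above. The function names are illustrative, and it assumes `psql` is reachable through `deployment/acb-postgres` as in Step 6.

```shell
NAMESPACE="${NAMESPACE:-ai-code-battle}"

# Print the id of the most recent match, non-interactively.
latest_match_id() {
  kubectl exec -n "$NAMESPACE" deployment/acb-postgres -- \
    psql -U postgres -d ai_code_battle -At \
    -c "SELECT id FROM matches ORDER BY created_at DESC LIMIT 1;"
}

# Poll until the latest match id changes.
# Args: number of polls (default 10), seconds between polls (default 30).
wait_for_new_match() {
  attempts="${1:-10}"; interval="${2:-30}"
  before=$(latest_match_id)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    sleep "$interval"
    now=$(latest_match_id)
    if [ "$now" != "$before" ]; then
      echo "new match created: $now"
      return 0
    fi
    i=$((i + 1))
  done
  echo "no new match after $attempts polls" >&2
  return 1
}
```

With the defaults, `wait_for_new_match` gives up after about five minutes.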
Potential Issues
Issue 1: Cluster Terminated
Symptoms: kubectl cluster-info fails with connection refused
Resolution: Cluster may have been terminated in Rackspace Spot. Need to recreate cluster and restore from backups.
Issue 2: Pod Image Pull Errors
Symptoms: Pods stuck in ImagePullBackOff state
Resolution: Check Docker Hub credentials, verify image tags exist, update imagePullSecrets
Issue 3: Database Connection Failures
Symptoms: Logs show "connection refused" to PostgreSQL
Resolution: Check PostgreSQL pod is running, verify credentials in acb-postgres-credentials secret
Issue 4: Valkey Connection Failures
Symptoms: Matchmaker can't enqueue jobs
Resolution: Check Valkey pod is running, verify network policies allow traffic
Issue 5: R2/Armor Connection Failures
Symptoms: Workers can't upload replays
Resolution: Check R2 credentials (see IAD-ACB-R2-CREDENTIALS-FIX.md), verify armor pod is running
Known Issues from Prior Incidents
- R2 Credentials Corruption (IAD-ACB-R2-CREDENTIALS-FIX.md)
  - OpenBao secret at secret/rs-manager/ai-code-battle/r2 has corrupted values
  - Endpoint and secret-key values are swapped
  - Fix: Run /home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh
- Orphaned openbao Namespace (IAD-ACB-OPENBAO-FIX.md)
  - Status: RESOLVED
  - Was causing DNS conflicts for ESO
  - Namespace has been deleted
Verification Checklist
After fixing the issue, verify:
- iad-acb cluster is accessible via kubectl
- Matchmaker pod is running and healthy
- Worker pods are running and healthy
- PostgreSQL is accepting connections
- Valkey is accepting connections
- Armor (B2 gateway) is accessible
- New matches are being created in the database
- Workers are processing matches and uploading replays
- No errors in matchmaker or worker logs
- Index builder can successfully run and upload to R2
Monitoring Setup
To prevent future outages, consider:
- Set up alerts for:
  - Matchmaker pod down
  - Worker pods down
  - No matches created in 1 hour
  - Failed jobs exceeding threshold
- Regular health checks:
  - kubectl get pods -n ai-code-battle
  - Monitor database for stuck jobs
  - Check R2 upload success rate
- Token renewal reminders:
  - Rackspace Spot kubeconfig tokens expire
  - Set calendar reminder for renewal 30 days before expiration
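The "no matches created in 1 hour" alert reduces to comparing two timestamps. A minimal sketch of that check, assuming the last match time is fetched as Unix seconds (for example with `SELECT extract(epoch FROM max(created_at))::bigint FROM matches;`). The function name and the way the alert is delivered are left open.

```shell
# Succeed (exit 0) when the newest match is older than the threshold.
# Args: last match time and current time as Unix seconds,
#       threshold in seconds (default 3600, i.e. 1 hour).
matches_stale() {
  last="$1"; now="$2"; threshold="${3:-3600}"
  [ $((now - last)) -gt "$threshold" ]
}

# Example wiring (LAST_MATCH_EPOCH is a placeholder for the query result):
# if matches_stale "$LAST_MATCH_EPOCH" "$(date +%s)"; then
#   echo "ALERT: no matches created in the last hour"
# fi
```

Run from cron on a host outside the cluster, a check like this would also fire when the cluster itself becomes unreachable, provided a failed query is treated as stale.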
Files Modified
- Created: /home/coding/ai-code-battle/notes/bf-5nap.md (this file)
Next Steps
- Access Rackspace Spot UI and renew iad-acb kubeconfig token
- Update kubeconfig on ex44 at /home/coding/.kube/iad-acb.kubeconfig
- Follow diagnostic steps above to identify why match creation stopped
- Restart services as needed
- Verify match creation resumes
- Close bead with retrospective