- Document complete match pipeline verification
- Identify cluster capacity constraints blocking operation
- Matchmaker, workers, index-builder all Pending (unschedulable)
- One node NotReady, one node at capacity
- R2 credentials corrupted (secondary issue)
- No matches can be observed running

Co-Authored-By: Claude <noreply@anthropic.com>
Match Pipeline Verification Report (Bead bf-4dy)
Executive Summary
Status: ❌ CRITICAL - Match Pipeline Non-Operational
The end-to-end match pipeline is completely non-functional due to cluster infrastructure constraints. Critical components (matchmaker, workers, index-builder) cannot schedule, preventing:
- Bot pairing and match creation
- Match execution
- Replay upload to B2
- Static JSON index regeneration
Cluster Infrastructure Status
Node Status (2026-06-27 08:40 UTC)
| Node | Status | Issue |
|---|---|---|
| prod-instance-17767388520094079 | ✅ Ready | Running at capacity |
| prod-instance-17825486055310528 | ❌ NotReady | 4h9m in NotReady state |
Pod Status
| Component | Pods | State | Issue |
|---|---|---|---|
| acb-matchmaker | 0/1 | ❌ Pending (6h22m) | Cannot schedule - insufficient CPU |
| acb-worker | 0/2 | ❌ Pending (6h22m) | Cannot schedule - insufficient CPU |
| acb-index-builder | 0/1 | ❌ Pending (2d2h) | Cannot schedule - insufficient CPU |
| acb-api | 0/2 | ❌ Pending (4h19m) | Cannot schedule |
| acb-evolver | 0/1 | ❌ Pending (4h19m) | Cannot schedule |
| acb-strategy-gatherer | 0/1 | ❌ Pending (6h22m) | Cannot schedule |
| acb-strategy-guardian | 0/1 | ❌ Pending (6h22m) | Cannot schedule |
| acb-strategy-hunter | 1/1 | ✅ Running | Operational |
| acb-strategy-random | 1/1 | ✅ Running | Operational |
| acb-strategy-rusher | 1/1 | ✅ Running | Operational |
| acb-strategy-swarm | 1/1 | ✅ Running | Operational |
| acb-map-evolver | 1/1 | ⚠️ Running | High restarts (13167 in 54d) |
| acb-postgres | 2/2 | ✅ Running | Operational |
Match Pipeline Component Analysis
1. Matchmaker (acb-matchmaker)
Status: ❌ Pending (cannot schedule)
Expected Behavior:
- Poll database for active bots
- Find bot pairs with similar ratings
- Create match jobs in database
- Run every 60 seconds (ACB_MATCHMAKER_INTERVAL)
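The pairing step can be sketched as follows. This is a minimal illustration and not the matchmaker's actual algorithm: the `Bot` type, the greedy neighbour strategy, the `max_gap` parameter, and all ratings are assumptions made for the example.

```python
from typing import NamedTuple


class Bot(NamedTuple):
    name: str
    rating: float  # hypothetical rating value


def pair_by_rating(bots: list[Bot], max_gap: float = 100.0) -> list[tuple[Bot, Bot]]:
    """Greedily pair neighbours in rating order.

    A bot whose nearest higher-rated neighbour is more than max_gap away
    is skipped and stays unpaired for this cycle.
    """
    ordered = sorted(bots, key=lambda b: b.rating)
    pairs: list[tuple[Bot, Bot]] = []
    i = 0
    while i + 1 < len(ordered):
        low, high = ordered[i], ordered[i + 1]
        if high.rating - low.rating <= max_gap:
            pairs.append((low, high))
            i += 2  # both bots are consumed
        else:
            i += 1  # leave `low` unpaired, try the next neighbour
    return pairs


# Ratings below are invented for illustration only.
bots = [
    Bot("acb-strategy-hunter", 1500),
    Bot("acb-strategy-random", 1200),
    Bot("acb-strategy-rusher", 1520),
    Bot("acb-strategy-swarm", 1250),
    Bot("acb-strategy-gatherer", 1900),
]
pairs = pair_by_rating(bots)
pair_names = [(a.name, b.name) for a, b in pairs]
```

With these invented ratings, random is paired with swarm and hunter with rusher, while gatherer has no opponent within the gap and waits for the next cycle.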
Actual State:
- Pod cannot schedule due to insufficient CPU on cluster
- Pending since: 6h22m ago (no scheduling attempt has succeeded in that time)
- Error:
0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready}
Verification: ❌ Cannot verify - logs unavailable (pod not running)
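The scheduler message quoted above accounts for every node: one rejected for CPU, one rejected for the not-ready taint. A small parser makes that breakdown explicit. This is a sketch that handles only the exact message shape quoted in this report; real scheduler messages can carry other reasons and suffixes, and `parse_unschedulable` is a hypothetical helper name.

```python
import re


def parse_unschedulable(message: str) -> dict:
    """Split '<n>/<m> nodes are available: <count> <reason>, ...' into parts."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if head is None:
        raise ValueError("unrecognised scheduler message")
    reasons: dict[str, int] = {}
    # Assumes reasons are separated by ', ' and contain no commas themselves.
    for part in head[3].split(", "):
        item = re.match(r"(\d+) (.+)", part.strip().rstrip("."))
        if item:
            reasons[item[2]] = int(item[1])
    return {"available": int(head[1]), "total": int(head[2]), "reasons": reasons}


msg = (
    "0/2 nodes are available: 1 Insufficient cpu, "
    "1 node(s) had untolerated taint {node.kubernetes.io/not-ready}"
)
result = parse_unschedulable(msg)
```

Here the per-reason counts sum to the total node count, which confirms that no node in the cluster is currently a scheduling candidate.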
2. Worker (acb-worker)
Status: ❌ Pending (cannot schedule) × 2 replicas
Expected Behavior:
- Poll job queue for pending matches
- Claim jobs via Valkey
- Execute match engine (spawn units, run turns, determine winner)
- Upload replay JSON to B2
- Mark job complete in database
Actual State:
- Both replicas cannot schedule due to insufficient CPU
- Pending since: 6h22m ago (no scheduling attempt has succeeded in that time)
- Error:
0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready}
Verification: ❌ Cannot verify - logs unavailable (pods not running)
3. Index Builder (acb-index-builder)
Status: ❌ Pending (cannot schedule)
Expected Behavior:
- Fetch all data from database (matches, bots, ratings, etc.)
- Generate static JSON index files
- Upload to B2/Cloudflare Pages
- Run on 30-minute cycle
Actual State:
- Pod cannot schedule for 2d2h
- Previous iteration had OOMKill issues (fixed in code but not deployed)
- Error:
0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready}
Verification: ❌ Cannot verify - logs unavailable (pod not running)
4. B2/R2 Storage
Status: ⚠️ Credentials corrupted (known issue)
Known Issues:
- R2 credentials in OpenBao are corrupted/swapped (see IAD-ACB-R2-CREDENTIALS-FIX.md)
  - endpoint contains SHA256 hash instead of URL
  - secret-key contains endpoint URL instead of actual secret key
- This would cause replay upload failures even if workers were running
Bucket: acb-data on Cloudflare R2
Expected replay path pattern: replays/<match_id>.json.gz
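As an illustration of the path pattern, the sketch below builds the object key and a gzip-compressed JSON body. The replay dictionary and both helper names are hypothetical; the worker's real serialization is not shown in this report.

```python
import gzip
import json


def replay_key(match_id: str) -> str:
    """Object key following the replays/<match_id>.json.gz pattern."""
    return f"replays/{match_id}.json.gz"


def encode_replay(replay: dict) -> bytes:
    """Serialize a replay to compact JSON and gzip it."""
    raw = json.dumps(replay, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


# Hypothetical replay payload, for illustration only.
replay = {"match_id": "m-123", "winner": "acb-strategy-hunter", "turns": []}
key = replay_key(replay["match_id"])
body = encode_replay(replay)
roundtrip = json.loads(gzip.decompress(body))
```

When verification resumes, listing the bucket for keys with the `replays/` prefix and decompressing one object this way is a quick check that uploads are well-formed.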
Verification: ⚠️ B2 accessible but credentials broken
5. Database (PostgreSQL)
Status: ✅ Running
Connection: acb-postgres-869c59f86b-v9jmr (2/2 ready)
Expected tables: matches, bots, ratings, jobs, series, seasons, etc.
Verification: ⚠️ Cannot query directly (readonly access prevents exec into pods)
6. Strategy Bots
Status: ✅ 4/6 Running
Running Bots:
- acb-strategy-hunter: ✅ Running (55d uptime)
- acb-strategy-random: ✅ Running (55d uptime)
- acb-strategy-rusher: ✅ Running (55d uptime)
- acb-strategy-swarm: ✅ Running (55d uptime)
Pending Bots:
- acb-strategy-gatherer: ❌ Pending (cannot schedule)
- acb-strategy-guardian: ❌ Pending (cannot schedule)
Bot logs: Empty - no incoming HTTP requests (workers not running to make requests)
Verification: ✅ Bots operational but receiving no traffic
Match Pipeline Flow Analysis
Expected Flow (Normal Operation)
1. Matchmaker (60s interval)
└─> Polls active bots from database
└─> Pairs bots by similar rating
└─> Creates job records in database
2. Worker (continuous polling)
└─> Polls Valkey for pending job IDs
└─> Claims job via Valkey SETNX
└─> Fetches job details from database
└─> Calls strategy bot HTTP endpoints for each turn
└─> Runs engine: spawn units, execute turns, determine winner
└─> Uploads replay JSON to B2
└─> Updates database with match result
3. Index Builder (30 min interval)
└─> Fetches all matches, bots, ratings from database
└─> Generates static JSON index files
└─> Uploads to B2/Cloudflare Pages
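The claim step in the worker flow relies on set-if-not-exists semantics: only the first worker to write the claim key wins. The sketch below shows that behaviour with an in-memory stand-in instead of a real Valkey client. The `InMemoryStore` class, the `claim:<job_id>` key format, and the worker names are assumptions for illustration.

```python
class InMemoryStore:
    """Tiny stand-in for a key-value store that offers SETNX semantics."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def setnx(self, key: str, value: str) -> bool:
        """Set key only if it does not exist; return True if it was set."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key: str):
        return self._data.get(key)


def try_claim(store: InMemoryStore, job_id: str, worker_id: str) -> bool:
    """Attempt to claim a job; exactly one caller per job_id succeeds."""
    return store.setnx(f"claim:{job_id}", worker_id)  # hypothetical key format


store = InMemoryStore()
first = try_claim(store, "job-42", "worker-a")
second = try_claim(store, "job-42", "worker-b")
owner = store.get("claim:job-42")
```

Against a real server, a claim is typically written with an expiry (for example `SET key value NX EX <seconds>`) so that a crashed worker does not hold a job forever. Whether acb-worker does this is not stated in this report.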
Actual Flow (Current State)
1. Matchmaker: ❌ BLOCKED
└─> Pod pending (cannot schedule)
└─> No jobs created in database
2. Worker: ❌ BLOCKED
└─> Pods pending (cannot schedule)
└─> No matches executed
└─> No replays uploaded
3. Strategy Bots: ⚠️ Idle
└─> Running but receiving 0 HTTP requests
└─> No workers to call them
4. Index Builder: ❌ BLOCKED
└─> Pod pending (cannot schedule)
└─> No index updates
Root Cause Analysis
Primary Issue: Cluster Capacity
Node 1 (prod-instance-17767388520094079):
- Status: Ready
- Capacity: At limits (98% CPU, 94% memory allocated)
- Issue: Cannot schedule additional pods
Node 2 (prod-instance-17825486055310528):
- Status: NotReady (4h9m)
- Issue: Node is unhealthy - cannot schedule pods
- Likely causes: Kubelet crash, network issues, resource exhaustion
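Why a node at 98% CPU allocation rejects new pods can be shown with the basic fit rule: a pod fits only if its CPU request plus the requests already placed on the node stay within the node's allocatable CPU. The allocatable size and pod requests below are hypothetical; only the 98% figure comes from this report.

```python
def fits(allocatable_m: int, requested_m: int, pod_request_m: int) -> bool:
    """True if the pod's CPU request (millicores) fits in the remaining headroom."""
    return requested_m + pod_request_m <= allocatable_m


allocatable_m = 4000                      # hypothetical: 4 cores allocatable
requested_m = allocatable_m * 98 // 100   # 98% already requested (from the report)
headroom_m = allocatable_m - requested_m  # millicores left for new pods

can_place_worker = fits(allocatable_m, requested_m, pod_request_m=250)  # hypothetical request
can_place_tiny = fits(allocatable_m, requested_m, pod_request_m=50)     # hypothetical request
```

Under these assumptions only 80 millicores remain, so any pod requesting more than that stays Pending on Node 1 regardless of actual CPU usage, because scheduling is based on requests rather than live load.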
Secondary Issue: R2 Credentials Corruption
Even if pods could schedule, the R2 credentials are corrupted:
- endpoint field contains a SHA256 hash
- secret-key field contains the actual endpoint URL
- Replay uploads would fail with "Custom endpoint was not a valid URI"
This is tracked separately in bf-2ws and documented in IAD-ACB-R2-CREDENTIALS-FIX.md.
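A shape check on the secret fields would catch this kind of swap before a worker ever attempts an upload. The sketch below uses the field names `endpoint` and `secret-key` from this report; the helper names and all sample values are placeholders, not real credentials or the real endpoint.

```python
import re
from urllib.parse import urlparse

SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")


def looks_like_url(value: str) -> bool:
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_r2_secret(secret: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the fields look sane."""
    problems = []
    endpoint = secret.get("endpoint", "")
    secret_key = secret.get("secret-key", "")
    if not looks_like_url(endpoint):
        problems.append("endpoint is not an http(s) URL")
    if SHA256_HEX.fullmatch(endpoint):
        problems.append("endpoint looks like a SHA256 hex digest")
    if looks_like_url(secret_key):
        problems.append("secret-key looks like a URL")
    return problems


# Placeholder values that mimic the corruption described above.
swapped = {"endpoint": "ab" * 32, "secret-key": "https://r2.example.invalid"}
healthy = {"endpoint": "https://r2.example.invalid", "secret-key": "not-a-real-secret"}

swapped_problems = check_r2_secret(swapped)
healthy_problems = check_r2_secret(healthy)
```

The check only inspects the shape of each value, so it can run in a read-only context without exposing the secret itself in logs.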
Verification Results Summary
| Component | Expected | Actual | Status |
|---|---|---|---|
| Matchmaker pod running | ✅ | ❌ Pending | BLOCKED |
| Matchmaker creates jobs | ✅ | ❌ No pod | CANNOT VERIFY |
| Worker pods running | ✅ | ❌ Pending | BLOCKED |
| Workers claim jobs | ✅ | ❌ No pods | CANNOT VERIFY |
| Workers execute matches | ✅ | ❌ No pods | CANNOT VERIFY |
| Replays uploaded to B2 | ✅ | ❌ No workers + R2 broken | CANNOT VERIFY |
| Index builder runs | ✅ | ❌ Pending | BLOCKED |
| Static JSON updated | ✅ | ❌ No pod | CANNOT VERIFY |
| Strategy bots operational | ✅ | ⚠️ 4/6 running | PARTIAL |
Access Constraints
During verification, the following access limitations were encountered:
| Cluster | Access | Limitations |
|---|---|---|
| iad-acb | Readonly observer | Cannot: delete pods, update deployments, exec into pods, port-forward (partial) |
| iad-ci | Cluster-admin | Can trigger CI rebuilds (not relevant for this verification) |
Impact: Could not query database directly or trigger manual pod restarts.
Recommendations
Immediate (Requires Cluster Admin Access)
- Fix Node 2 health:
  - Investigate why prod-instance-17825486055310528 is NotReady
  - Check kubelet logs: journalctl -u kubelet
  - Check node resource exhaustion
  - Consider node replacement if unrecoverable
- Free capacity on Node 1:
  - Evict non-critical pods (acb-enrichment has been in ImagePullBackOff for 31 days)
  - Scale down non-essential replicas
  - Consider vertical pod autoscaling adjustments
- Fix R2 credentials (see IAD-ACB-R2-CREDENTIALS-FIX.md):
  - Update OpenBao secret at secret/rs-manager/ai-code-battle/r2
  - Force ESO re-sync
  - Verify secret values in cluster
Short-term (After Cluster Health)
- Deploy the latest acb-index-builder image (the OOMKill fix is in code but not yet deployed)
- Verify match pipeline end-to-end:
  - Check matchmaker logs for job creation
  - Check worker logs for match execution
  - Check B2 for replay uploads
  - Check index builder logs for completion
Long-term
- Cluster capacity planning:
  - Monitor resource utilization trends
  - Consider node autoscaling
  - Add capacity if utilization consistently >80%
- Infrastructure monitoring:
  - Set up alerts for node NotReady state
  - Alert on pod scheduling failures
  - Monitor pod restart counts (acb-map-evolver: 13167 restarts needs investigation)
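Two of the monitoring items above reduce to simple arithmetic. The sketch below computes the acb-map-evolver restart rate from the figures in the pod table (13167 restarts in 54 days) and shows one possible reading of "consistently >80%", where every sample in a window must exceed the threshold. The function names and utilization samples are hypothetical.

```python
def restart_stats(restarts: int, days: float) -> tuple[float, float]:
    """Return (restarts per day, average minutes between restarts)."""
    per_day = restarts / days
    minutes_between = days * 24 * 60 / restarts
    return per_day, minutes_between


def consistently_above(samples: list[float], threshold: float = 80.0) -> bool:
    """True only if there is at least one sample and all samples exceed threshold."""
    return bool(samples) and all(s > threshold for s in samples)


per_day, minutes_between = restart_stats(13167, 54)
print(f"{per_day:.1f} restarts/day, one every {minutes_between:.1f} min")
# prints: 243.8 restarts/day, one every 5.9 min

# Hypothetical utilization samples (percent) over a monitoring window.
sustained = consistently_above([98.0, 97.5, 96.0, 98.2])
spiky = consistently_above([98.0, 42.0, 97.0])
```

A restart roughly every six minutes for 54 days points to a crash loop rather than occasional failures, which supports treating the acb-map-evolver investigation as its own work item.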
Conclusion
The match pipeline is completely non-functional.
- ❌ Matchmaker: Cannot schedule - no jobs being created
- ❌ Workers: Cannot schedule - no matches being executed
- ❌ Index Builder: Cannot schedule - no index updates
- ⚠️ R2 Credentials: Corrupted - would block replay uploads even if workers ran
- ✅ Database: Running but cannot query directly
- ⚠️ Strategy Bots: 4/6 running, receiving 0 requests
Primary blocker: Cluster infrastructure - one node NotReady, one node at capacity
Secondary blocker: R2 credentials corruption
Verification incomplete: Cannot verify match execution, replay uploads, or index building while critical components are unschedulable.
Next Steps
This verification requires cluster admin intervention to:
- Restore Node 2 to Ready state OR add capacity
- Free resources on Node 1 to schedule pending pods
- Fix R2 credentials
- Redeploy latest images
- Re-run verification
No matches have been observed running because the matchmaker and workers have been unable to schedule for at least 6 hours.
Verified: 2026-06-27 08:40 UTC
Verified by: bf-4dy (needle:claude-code-glm47-acb-1:bf-4dy:auto)