docs(bf-4dy): add match pipeline verification report

- Document complete match pipeline verification
- Identify cluster capacity constraints blocking operation
- Matchmaker, workers, index-builder all Pending (unschedulable)
- One node NotReady, one node at capacity
- R2 credentials corrupted (secondary issue)
- No matches can be observed running

Co-Authored-By: Claude <noreply@anthropic.com>
jedarden 2026-06-27 08:40:42 -04:00
parent c7cd5ecf73
commit d40afad625


@@ -0,0 +1,330 @@
# Match Pipeline Verification Report (Bead bf-4dy)
## Executive Summary
**Status: ❌ CRITICAL - Match Pipeline Non-Operational**
The end-to-end match pipeline is **completely non-functional** because the cluster has no schedulable capacity: one node is NotReady and the other is at capacity. The critical components (matchmaker, workers, index-builder) cannot be scheduled, which blocks:
- Bot pairing and match creation
- Match execution
- Replay upload to B2
- Static JSON index regeneration
## Cluster Infrastructure Status
### Node Status (2026-06-27 08:40 UTC)
| Node | Status | Issue |
|------|--------|-------|
| prod-instance-17767388520094079 | ✅ Ready | Running at capacity |
| prod-instance-17825486055310528 | ❌ NotReady | 4h9m in NotReady state |
### Pod Status
| Component | Pods | State | Issue |
|-----------|------|-------|-------|
| **acb-matchmaker** | 0/1 | ❌ Pending (6h22m) | Cannot schedule - insufficient CPU |
| **acb-worker** | 0/2 | ❌ Pending (6h22m) | Cannot schedule - insufficient CPU |
| **acb-index-builder** | 0/1 | ❌ Pending (2d2h) | Cannot schedule - insufficient CPU |
| acb-api | 0/2 | ❌ Pending (4h19m) | Cannot schedule |
| acb-evolver | 0/1 | ❌ Pending (4h19m) | Cannot schedule |
| acb-strategy-gatherer | 0/1 | ❌ Pending (6h22m) | Cannot schedule |
| acb-strategy-guardian | 0/1 | ❌ Pending (6h22m) | Cannot schedule |
| **acb-strategy-hunter** | 1/1 | ✅ Running | Operational |
| **acb-strategy-random** | 1/1 | ✅ Running | Operational |
| **acb-strategy-rusher** | 1/1 | ✅ Running | Operational |
| **acb-strategy-swarm** | 1/1 | ✅ Running | Operational |
| acb-map-evolver | 1/1 | ⚠️ Running | High restarts (13167 in 54d) |
| acb-postgres | 2/2 | ✅ Running | Operational |
## Match Pipeline Component Analysis
### 1. Matchmaker (`acb-matchmaker`)
**Status:** ❌ Pending (cannot schedule)
**Expected Behavior:**
- Poll database for active bots
- Find bot pairs with similar ratings
- Create match jobs in database
- Run every 60 seconds (ACB_MATCHMAKER_INTERVAL)
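The pairing step listed above can be sketched as follows. This is a minimal illustration, not the matchmaker's actual algorithm: it assumes "similar ratings" means adjacent ratings after sorting, and the function name `pair_by_rating` and the rating values are made up for the example.

```python
from typing import List, Tuple


def pair_by_rating(bots: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Pair bots whose ratings are adjacent after sorting (assumed strategy)."""
    ordered = sorted(bots, key=lambda bot: bot[1])
    pairs = []
    # Walk the sorted list two at a time; an odd bot out stays unpaired.
    for i in range(0, len(ordered) - 1, 2):
        pairs.append((ordered[i][0], ordered[i + 1][0]))
    return pairs


# Hypothetical ratings, used only to show the shape of the result.
bots = [("hunter", 1510.0), ("random", 1200.0), ("rusher", 1495.0), ("swarm", 1230.0)]
print(pair_by_rating(bots))
```

With an odd number of active bots, the highest-rated one is left unpaired until the next cycle. A real matchmaker would likely also avoid repeating the same pairing every interval.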
**Actual State:**
- Pod cannot schedule due to insufficient CPU on cluster
- Pod has been Pending for 6h22m; no scheduling attempt has succeeded in that time
- Error: `0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready}`
**Verification:** ❌ Cannot verify - logs unavailable (pod not running)
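The scheduler error quoted above packs a per-node breakdown into one string. A small parser makes it easy to confirm that every node is accounted for. This is a sketch that handles only the message shape shown in this report. Real scheduler messages can carry extra trailing text (for example preemption notes) that it does not handle, and the function name is hypothetical.

```python
import re
from typing import Dict, Tuple


def parse_scheduling_failure(message: str) -> Tuple[int, int, Dict[str, int]]:
    """Split '<avail>/<total> nodes are available: <n> reason, ...' into parts."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if head is None:
        raise ValueError("unrecognised scheduler message")
    available, total = int(head.group(1)), int(head.group(2))
    reasons: Dict[str, int] = {}
    # Each comma-separated clause starts with the number of nodes it applies to.
    for clause in head.group(3).rstrip(".").split(", "):
        match = re.match(r"(\d+) (.+)", clause.strip())
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return available, total, reasons


msg = (
    "0/2 nodes are available: 1 Insufficient cpu, "
    "1 node(s) had untolerated taint {node.kubernetes.io/not-ready}"
)
available, total, reasons = parse_scheduling_failure(msg)
print(available, total, reasons)
```

Here the reason counts sum to the node total: one node is rejected for CPU and the other for the not-ready taint. That matches the two-node picture in the Node Status table.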
---
### 2. Worker (`acb-worker`)
**Status:** ❌ Pending (cannot schedule) × 2 replicas
**Expected Behavior:**
- Poll job queue for pending matches
- Claim jobs via Valkey
- Execute match engine (spawn units, run turns, determine winner)
- Upload replay JSON to B2
- Mark job complete in database
**Actual State:**
- Both replicas cannot schedule due to insufficient CPU
- Both replicas have been Pending for 6h22m; no scheduling attempt has succeeded in that time
- Error: `0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready}`
**Verification:** ❌ Cannot verify - logs unavailable (pods not running)
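The job-claim step is the part of the worker that must be race-free when two replicas poll at once. The sketch below shows the set-if-absent idea under stated assumptions. It uses `SET` with the `NX` and `EX` options instead of a bare `SETNX`, so that a crashed worker's claim eventually expires; that is a variation, not something this report confirms. The key layout `job:<id>:claim`, the TTL, and the `InMemoryKV` class are hypothetical. `InMemoryKV` only stands in for a real client so the example runs without a server.

```python
import time
from typing import Dict, Optional, Tuple


class InMemoryKV:
    """Hypothetical stand-in for a Valkey client; mimics SET with NX and EX only."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None) -> bool:
        now = time.monotonic()
        current = self._data.get(key)
        # Drop the entry if its expiry has passed.
        if current is not None and current[1] is not None and current[1] <= now:
            del self._data[key]
            current = None
        if nx and current is not None:
            return False  # Key already held: the claim is refused.
        self._data[key] = (value, now + ex if ex is not None else None)
        return True


def try_claim(kv, job_id: str, worker_id: str, ttl_seconds: int = 300) -> bool:
    """Return True only for the first worker to claim the job."""
    return bool(kv.set(f"job:{job_id}:claim", worker_id, nx=True, ex=ttl_seconds))


kv = InMemoryKV()
first = try_claim(kv, "42", "worker-a")
second = try_claim(kv, "42", "worker-b")
print(first, second)
```

Wrapping the result in `bool()` is deliberate: some real clients return `None` rather than `False` when the `NX` condition is not met.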
---
### 3. Index Builder (`acb-index-builder`)
**Status:** ❌ Pending (cannot schedule)
**Expected Behavior:**
- Fetch all data from database (matches, bots, ratings, etc.)
- Generate static JSON index files
- Upload to B2/Cloudflare Pages
- Run on 30-minute cycle
**Actual State:**
- Pod has been unschedulable for 2d2h
- Previous iteration had OOMKill issues (fixed in code but not deployed)
- Error: `0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready}`
**Verification:** ❌ Cannot verify - logs unavailable (pod not running)
---
### 4. B2/R2 Storage
**Status:** ⚠️ Credentials corrupted (known issue)
**Known Issues:**
- R2 credentials in OpenBao are corrupted/swapped (see IAD-ACB-R2-CREDENTIALS-FIX.md)
- `endpoint` contains SHA256 hash instead of URL
- `secret-key` contains endpoint URL instead of actual secret key
- This would cause replay upload failures even if workers were running
**Bucket:** `acb-data` on Cloudflare R2
**Expected replay path pattern:** `replays/<match_id>.json.gz`
**Verification:** ⚠️ B2 accessible but credentials broken
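For reference, the replay path pattern above implies a gzip-compressed JSON object keyed by match ID. The sketch below builds such a key and round-trips a payload. The function names and the replay fields are illustrative assumptions, not the worker's actual code or schema.

```python
import gzip
import json


def replay_object_key(match_id: str) -> str:
    """Build the object key following the replays/<match_id>.json.gz pattern."""
    return f"replays/{match_id}.json.gz"


def encode_replay(replay: dict) -> bytes:
    """Serialise a replay to compact JSON and gzip it."""
    return gzip.compress(json.dumps(replay, separators=(",", ":")).encode("utf-8"))


def decode_replay(blob: bytes) -> dict:
    """Inverse of encode_replay, useful when spot-checking an uploaded object."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical replay content; the real schema is not described in this report.
sample = {"winner": "hunter", "turns": 3}
blob = encode_replay(sample)
print(replay_object_key("m123"), len(blob), decode_replay(blob) == sample)
```

Once workers are running again, decoding a downloaded object this way is a quick check that an upload is intact.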
---
### 5. Database (PostgreSQL)
**Status:** ✅ Running
**Connection:** `acb-postgres-869c59f86b-v9jmr` (2/2 ready)
**Expected tables:** matches, bots, ratings, jobs, series, seasons, etc.
**Verification:** ⚠️ Cannot query directly (read-only access prevents exec into pods)
---
### 6. Strategy Bots
**Status:** ✅ 4/6 Running
**Running Bots:**
- acb-strategy-hunter: ✅ Running (55d uptime)
- acb-strategy-random: ✅ Running (55d uptime)
- acb-strategy-rusher: ✅ Running (55d uptime)
- acb-strategy-swarm: ✅ Running (55d uptime)
**Pending Bots:**
- acb-strategy-gatherer: ❌ Pending (cannot schedule)
- acb-strategy-guardian: ❌ Pending (cannot schedule)
**Bot logs:** Empty - no incoming HTTP requests (workers not running to make requests)
**Verification:** ✅ Bots operational but receiving no traffic
---
## Match Pipeline Flow Analysis
### Expected Flow (Normal Operation)
```
1. Matchmaker (60s interval)
└─> Polls active bots from database
└─> Pairs bots by similar rating
└─> Creates job records in database
2. Worker (continuous polling)
└─> Polls Valkey for pending job IDs
└─> Claims job via Valkey SETNX
└─> Fetches job details from database
└─> Calls strategy bot HTTP endpoints for each turn
└─> Runs engine: spawn units, execute turns, determine winner
└─> Uploads replay JSON to B2
└─> Updates database with match result
3. Index Builder (30 min interval)
└─> Fetches all matches, bots, ratings from database
└─> Generates static JSON index files
└─> Uploads to B2/Cloudflare Pages
```
### Actual Flow (Current State)
```
1. Matchmaker: ❌ BLOCKED
└─> Pod pending (cannot schedule)
└─> No jobs created in database
2. Worker: ❌ BLOCKED
└─> Pods pending (cannot schedule)
└─> No matches executed
└─> No replays uploaded
3. Strategy Bots: ⚠️ Idle
└─> Running but receiving 0 HTTP requests
└─> No workers to call them
4. Index Builder: ❌ BLOCKED
└─> Pod pending (cannot schedule)
└─> No index updates
```
---
## Root Cause Analysis
### Primary Issue: Cluster Capacity
**Node 1 (prod-instance-17767388520094079):**
- Status: Ready
- Capacity: At limits (98% CPU, 94% memory allocated)
- Issue: Cannot schedule additional pods
**Node 2 (prod-instance-17825486055310528):**
- Status: NotReady (4h9m)
- Issue: Node is unhealthy - cannot schedule pods
- Possible causes (not yet investigated): kubelet crash, network issues, or resource exhaustion
### Secondary Issue: R2 Credentials Corruption
Even if pods could schedule, the R2 credentials are corrupted:
- `endpoint` field contains a SHA256 hash
- `secret-key` field contains the actual endpoint URL
- Replay uploads would fail with "Custom endpoint was not a valid URI"
This is tracked separately in bf-2ws and documented in IAD-ACB-R2-CREDENTIALS-FIX.md.
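A cheap sanity check on the secret fields would catch this kind of swap before a worker tries to upload. The sketch below is an assumption-laden illustration. The field names `endpoint` and `secret-key` come from this report, but the function names, the exact checks, and the example values are hypothetical.

```python
import re
from typing import List
from urllib.parse import urlparse

# A SHA256 digest rendered as hex is exactly 64 hex characters.
_SHA256_HEX = re.compile(r"^[0-9a-fA-F]{64}$")


def looks_like_url(value: str) -> bool:
    """True if the value parses as an http(s) URL with a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_r2_secret(secret: dict) -> List[str]:
    """Return a list of problems found in the secret's fields (empty if none)."""
    problems = []
    endpoint = secret.get("endpoint", "")
    secret_key = secret.get("secret-key", "")
    if not looks_like_url(endpoint):
        detail = "a 64-char hex string" if _SHA256_HEX.match(endpoint) else "not an http(s) URL"
        problems.append(f"endpoint is {detail}")
    if looks_like_url(secret_key):
        problems.append("secret-key looks like a URL")
    return problems


# Placeholder values reproducing the swap described above; not real credentials.
swapped = {"endpoint": "a" * 64, "secret-key": "https://example.invalid"}
print(check_r2_secret(swapped))
```

Running a check like this after the ESO re-sync would confirm the fix without printing any secret values.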
---
## Verification Results Summary
| Component | Expected | Actual | Status |
|-----------|----------|--------|--------|
| Matchmaker pod running | ✅ | ❌ Pending | BLOCKED |
| Matchmaker creates jobs | ✅ | ❌ No pod | CANNOT VERIFY |
| Worker pods running | ✅ | ❌ Pending | BLOCKED |
| Workers claim jobs | ✅ | ❌ No pods | CANNOT VERIFY |
| Workers execute matches | ✅ | ❌ No pods | CANNOT VERIFY |
| Replays uploaded to B2 | ✅ | ❌ No workers + R2 broken | CANNOT VERIFY |
| Index builder runs | ✅ | ❌ Pending | BLOCKED |
| Static JSON updated | ✅ | ❌ No pod | CANNOT VERIFY |
| Strategy bots operational | ✅ | ⚠️ 4/6 running | PARTIAL |
---
## Access Constraints
During verification, the following access limitations were encountered:
| Cluster | Access | Limitations |
|---------|--------|-------------|
| iad-acb | Read-only observer | Cannot: delete pods, update deployments, exec into pods, port-forward (partial) |
| iad-ci | Cluster-admin | Can trigger CI rebuilds (not relevant for this verification) |
**Impact:** Could not query database directly or trigger manual pod restarts.
---
## Recommendations
### Immediate (Requires Cluster Admin Access)
1. **Fix Node 2 health:**
- Investigate why `prod-instance-17825486055310528` is NotReady
- Check kubelet logs: `journalctl -u kubelet`
- Check node resource exhaustion
- Consider node replacement if unrecoverable
2. **Free capacity on Node 1:**
- Evict non-critical pods (acb-enrichment has been in ImagePullBackOff for 31 days)
- Scale down non-essential replicas
- Consider vertical pod autoscaling adjustments
3. **Fix R2 credentials (see IAD-ACB-R2-CREDENTIALS-FIX.md):**
- Update OpenBao secret at `secret/rs-manager/ai-code-battle/r2`
- Force ESO re-sync
- Verify secret values in cluster
### Short-term (After Cluster Health)
1. **Deploy latest acb-index-builder image:**
- Current deployed image: b35a2aa (first OOM fix only)
- Latest code: 05512a5 (all OOM fixes)
- Trigger CI rebuild on iad-ci
2. **Verify match pipeline end-to-end:**
- Check matchmaker logs for job creation
- Check worker logs for match execution
- Check B2 for replay uploads
- Check index builder logs for completion
### Long-term
1. **Cluster capacity planning:**
- Monitor resource utilization trends
- Consider node autoscaling
- Add capacity if utilization consistently >80%
2. **Infrastructure monitoring:**
- Set up alerts for node NotReady state
- Alert on pod scheduling failures
- Monitor pod restart counts (acb-map-evolver: 13167 restarts needs investigation)
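To put the acb-map-evolver restart count in perspective, the arithmetic below converts the figures from the Pod Status table (13167 restarts over 54 days) into a rate. The helper name is hypothetical, and the calculation assumes restarts are spread evenly, which the report does not establish.

```python
from typing import Tuple


def restart_stats(restarts: int, days: float) -> Tuple[float, float]:
    """Return (restarts per day, average minutes between restarts)."""
    per_day = restarts / days
    minutes_between = days * 24 * 60 / restarts
    return per_day, minutes_between


per_day, minutes_between = restart_stats(13167, 54)
print(f"{per_day:.1f} restarts/day, one every {minutes_between:.1f} min on average")
```

This works out to about 243.8 restarts per day, or one roughly every 5.9 minutes. That is consistent with a tight crash loop and not occasional failures, so an alert threshold on restart rate would have fired long ago.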
---
## Conclusion
**The match pipeline is completely non-functional.**
- ❌ **Matchmaker**: Cannot schedule - no jobs being created
- ❌ **Workers**: Cannot schedule - no matches being executed
- ❌ **Index Builder**: Cannot schedule - no index updates
- ⚠️ **R2 Credentials**: Corrupted - would block replay uploads even if workers ran
- ✅ **Database**: Running, but could not be queried directly (read-only access)
- ⚠️ **Strategy Bots**: 4/6 running, receiving 0 requests
**Primary blocker:** Cluster infrastructure - one node NotReady, one node at capacity
**Secondary blocker:** R2 credentials corruption
**Verification incomplete:** Cannot verify match execution, replay uploads, or index building while critical components are unschedulable.
---
## Next Steps
This verification requires cluster admin intervention to:
1. Restore Node 2 to Ready state OR add capacity
2. Free resources on Node 1 to schedule pending pods
3. Fix R2 credentials
4. Redeploy latest images
5. Re-run verification
**No matches have been observed running** because the matchmaker and workers have been unable to schedule for at least 6 hours.
---
**Verified:** 2026-06-27 08:40 UTC
**Verified by:** bf-4dy (needle:claude-code-glm47-acb-1:bf-4dy:auto)