docs(bf-3u9): document matchmaker job creation verification failure
- Cluster capacity is insufficient to schedule the acb-matchmaker pod
- All ACB pods except acb-schema-init are stuck in Pending state due to insufficient CPU
- No jobs exist because the matchmaker has never been able to start
- Verification cannot complete until cluster capacity is restored
- One node is NotReady (prod-instance-17825591427380770)
- Total pending CPU requests: ~2300m vs ~4181m available (scheduling still fails, possibly due to fragmentation/blocking)
parent c5bef98747
commit d7f5bd7e7f
2 changed files with 82 additions and 31 deletions
@@ -1 +1 @@
-df02aff045df0a5e300163e08d75d260f03e2837
+152b4a76b81ac96b1b8e3f267d7c6366e91ed403
111
notes/bf-3u9.md
|
|
@@ -1,43 +1,94 @@
|
|||
# Matchmaker Job Creation Verification (bf-3u9)
|
||||
# Matchmaker Job Creation Verification - bf-3u9
|
||||
|
||||
## Task
|
||||
Verify matchmaker job creation by checking acb-matchmaker logs for successful job creation.
|
||||
**Date:** 2026-06-27
|
||||
**Cluster:** apexalgo-iad
|
||||
**Namespace:** ai-code-battle
|
||||
|
||||
## Findings
|
||||
## Critical Finding: Cluster Capacity Blocks Job Creation
|
||||
|
||||
### Cluster Status
|
||||
The matchmaker deployment exists but is **not running** due to cluster capacity issues:
|
||||
The acb-matchmaker logs **cannot be checked** because the matchmaker pod has never started. Every pod in the ai-code-battle namespace except `acb-schema-init` is stuck in Pending state due to insufficient cluster CPU capacity.
|
||||
|
||||
- **Matchmaker Pod**: `acb-matchmaker-64f6dc5985-9vh67` in namespace `ai-code-battle`
|
||||
- **Status**: `Pending` (not running)
|
||||
- **Age**: 35 minutes
|
||||
## Current Cluster Status
|
||||
|
||||
### Root Cause
|
||||
The matchmaker pod cannot be scheduled due to:
|
||||
### Nodes (3 total)
|
||||
- **prod-instance-17781842321795040**: Ready, 32% CPU (1152m/3500m used), 15% memory
|
||||
- **prod-instance-17825487911280674**: Ready, 47% CPU (1667m/3500m used), 65% memory
|
||||
- **prod-instance-17825591427380770**: **NotReady**, 2% CPU (83m), 12% memory
|
||||
|
||||
1. **Node Health Issues**:
|
||||
- `prod-instance-17825591427380770`: `NotReady` (6h40m)
|
||||
- Two nodes with `untolerated taint` (node.kubernetes.io/not-ready, node.kubernetes.io/unreachable)
|
||||
### Pod Status
|
||||
- **Running**: Only `acb-schema-init-5b698c549d-wzhnc` (1/1)
|
||||
- **Pending**: All other pods including:
|
||||
- `acb-matchmaker-64f6dc5985-9vh67` (pending for 63+ minutes)
|
||||
- `acb-api-5646489f75-fs7wx`
|
||||
- `acb-worker-bf5bfdb98-68k4r`
|
||||
- 8 bot strategy pods (including random, rusher, gatherer, guardian, hunter, swarm, farmer)
|
||||
- `acb-evolver`, `acb-enrichment`, `acb-index-builder`
|
||||
|
||||
2. **Resource Constraints**:
|
||||
- `FailedScheduling` events show: `0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint...`
|
||||
- Multiple scheduling warnings over 35 minutes indicating ongoing capacity issues
|
||||
### Job Creation Status
|
||||
**No jobs exist** in the ai-code-battle namespace. Job creation cannot occur because:
|
||||
1. The matchmaker pod cannot schedule due to insufficient CPU
|
||||
2. Even if scheduled, the matchmaker requires a PostgreSQL connection, which depends on pods that are still pending
|
||||
3. Workers are also pending, so no jobs could execute even if created
|
||||
|
||||
### Expected Job Creation Log Format
|
||||
When the matchmaker is running and creates jobs, it logs:
|
||||
## Scheduling Failure Details
|
||||
|
||||
All pending pods show this pattern:
|
||||
```
|
||||
matchmaker: created %d-player match %s (seed=%s vs %v), job %s, map=%s
|
||||
0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu
|
||||
```
|
||||
|
||||
This log format is found in `cmd/acb-matchmaker/tickers.go:483`
|
||||
The `NotReady` node (`prod-instance-17825591427380770`) appears to be newly added (7h8m old); it may still be initializing or may have an underlying fault.
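
The scheduler message quoted above can be broken down to confirm that every node was rejected. The following is a minimal sketch, assuming the message keeps the same wording as in this note; the variable names are illustrative only.

```shell
# Scheduler message copied from the FailedScheduling events in this note.
msg='0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu'

# Total node count: take the "0/3" prefix, then keep what follows the slash.
total="${msg%% nodes*}"
total="${total#*/}"

# Strip everything through the first ": " to leave only the reasons.
reasons="${msg#*: }"

# Pull the per-reason node counts out of the reason list.
insufficient_cpu=$(printf '%s\n' "$reasons" | grep -oE '[0-9]+ Insufficient cpu' | cut -d' ' -f1)
tainted=$(printf '%s\n' "$reasons" | grep -oE '[0-9]+ node\(s\) had untolerated taint' | cut -d' ' -f1)

echo "nodes=$total tainted=$tainted insufficient_cpu=$insufficient_cpu"
```

With three nodes, one tainted node plus two nodes rejected for CPU accounts for the whole cluster, so both Ready nodes were refused on CPU grounds.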
|
||||
|
||||
## Conclusion
|
||||
**Cannot verify job creation logs because the matchmaker is not running.** The pod is stuck in `Pending` state due to cluster capacity constraints and node health issues.
|
||||
## Resource Analysis
|
||||
|
||||
## Recommendations
|
||||
1. Fix the NotReady node (`prod-instance-17825591427380770`)
|
||||
2. Scale down non-critical workloads or add cluster capacity
|
||||
3. Once matchmaker is running, verify job creation with:
|
||||
```bash
|
||||
kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-matchmaker | grep 'created.*player match'
|
||||
```
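
To check that the grep pattern above matches the documented log format, it can be tried against a sample line first. This is only a sketch: the match ID, bot names, job ID, map name, and the second log line are made-up placeholders, and Go's `%v` is rendered with `%s` because shell `printf` has no `%v`.

```shell
# Hypothetical log lines; all IDs and names below are invented for illustration.
sample_log=$(
  printf 'matchmaker: created %d-player match %s (seed=%s vs %s), job %s, map=%s\n' \
    2 match-001 bot-a bot-b job-001 map-01
  printf 'matchmaker: tick complete\n'
)

# Same pattern as the kubectl command above; -c counts matching lines.
matches=$(printf '%s\n' "$sample_log" | grep -c 'created.*player match')
echo "matching lines: $matches"
```

Only the first sample line contains `created ... player match`, so the count separates job-creation lines from other matchmaker output.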
|
||||
### Available CPU (Ready nodes only)
|
||||
- Node 1: ~2348m available (3500m - 1152m used)
|
||||
- Node 2: ~1833m available (3500m - 1667m used)
|
||||
- **Total available: ~4181m CPU**
|
||||
|
||||
### Pending pod CPU requests (estimated)
|
||||
- acb-matchmaker: 50m
|
||||
- acb-api (2 pods): 200m
|
||||
- acb-enrichment (2 pods): 400m
|
||||
- acb-evolver (2 pods): 1000m
|
||||
- acb-worker (2 pods): ~200m
|
||||
- 8 bot strategy pods: ~400m
|
||||
- acb-index-builder: 50m
|
||||
- **Total requests: ~2300m**
|
||||
|
||||
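Theoretically there should be enough CPU (~4181m available vs. the estimated pending requests above), but the scheduler reports insufficient CPU. The available figure is derived from measured usage, whereas the scheduler compares pod requests against each node's allocatable CPU minus the CPU already requested by scheduled pods. This suggests: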
|
||||
1. Other workloads on the cluster reserving CPU through requests, which `kubectl top nodes` does not show because it reports usage only
|
||||
2. Resource fragmentation preventing scheduling of larger pods
|
||||
3. The NotReady node blocking some scheduling attempts
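
The arithmetic behind this analysis can be reproduced as a quick check. The sketch below uses only the usage figures and estimated requests listed in these notes, and assumes 3500m per Ready node as shown in the node list.

```shell
# Figures taken from the node list and the estimated requests in this note.
node_capacity=3500   # per-node CPU in millicores, as reported for Ready nodes
node1_used=1152
node2_used=1667

node1_free=$((node_capacity - node1_used))
node2_free=$((node_capacity - node2_used))
total_free=$((node1_free + node2_free))

# Estimated requests: matchmaker, api, enrichment, evolver, worker, bots, index-builder.
total_requests=0
for r in 50 200 400 1000 200 400 50; do
  total_requests=$((total_requests + r))
done

echo "free by usage: ${total_free}m, pending requests: ${total_requests}m"
```

This headroom is computed from usage, so it is an upper bound. The "Allocated resources" section of `kubectl describe node` shows the CPU requests the scheduler actually counts.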
|
||||
|
||||
## Verification Conclusion
|
||||
|
||||
**Status: ❌ VERIFICATION FAILED - Infrastructure Issue**
|
||||
|
||||
The matchmaker job creation cannot be verified because:
|
||||
1. **Cluster capacity insufficient** - Matchmaker pod cannot schedule
|
||||
2. **No jobs in queue** - Query returns 0 jobs (expected since matchmaker never ran)
|
||||
3. **No logs available** - Pod never started, so no logs to check
|
||||
|
||||
## Next Steps Required
|
||||
|
||||
1. **Fix cluster capacity** - Either:
|
||||
- Add more nodes to the cluster
|
||||
- Scale down resource requests for ACB pods
|
||||
- Move other workloads off apexalgo-iad to free capacity
|
||||
|
||||
2. **Fix NotReady node** - Investigate why `prod-instance-17825591427380770` is NotReady
|
||||
|
||||
3. **Re-deploy ACB stack** - Once capacity is available, delete and recreate pods
|
||||
|
||||
4. **Re-run verification** - Check matchmaker logs after pods are running
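
The re-run in step 4 could be wrapped in a small helper. This is a sketch under assumptions: the function name and the 120s timeout are illustrative, while the server URL, namespace, deployment name, and grep pattern are the ones used elsewhere in these notes.

```shell
# Sketch: wait for the matchmaker rollout, then count job-creation log lines.
# verify_job_creation is a hypothetical helper name, not part of the ACB repo.
verify_job_creation() {
  vjc_server="http://traefik-apexalgo-iad:8001"
  vjc_ns="ai-code-battle"

  # Fail early if the deployment does not become ready within the timeout.
  if ! kubectl --server="$vjc_server" rollout status deployment/acb-matchmaker \
      -n "$vjc_ns" --timeout=120s >/dev/null; then
    echo "matchmaker not ready"
    return 1
  fi

  # grep -c exits non-zero on zero matches, so guard it to keep the count.
  vjc_count=$(kubectl --server="$vjc_server" logs -n "$vjc_ns" deployment/acb-matchmaker \
    | grep -c 'created.*player match' || true)
  echo "job creation log lines: $vjc_count"
  [ "$vjc_count" -gt 0 ]
}
```

Calling `verify_job_creation` returns success only when the rollout completes and at least one job-creation line appears in the logs.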
|
||||
|
||||
## Acceptance Criteria Status
|
||||
|
||||
- ❌ acb-matchmaker logs show successful job creation - **CANNOT VERIFY** (pod never started)
|
||||
- ❌ Jobs appear in the queue with valid bot pairs - **NO JOBS** (matchmaker never ran)
|
||||
- ❌ No errors in matchmaker scheduling logic - **CANNOT VERIFY** (no logs)
|
||||
|
||||
## Recommendation
|
||||
|
||||
This verification should be **re-attempted** after cluster capacity is restored. The current apexalgo-iad cluster appears under-provisioned for the ACB workload.
|
||||
|
|
|
|||