docs(bf-3u9): document matchmaker job creation verification - cluster capacity blocks operation
This commit is contained in:
parent
986455b606
commit
182e19eb7c
1 changed files with 43 additions and 0 deletions
43
notes/bf-3u9.md
Normal file
43
notes/bf-3u9.md
Normal file
|
|
@ -0,0 +1,43 @@
|
|||
# Matchmaker Job Creation Verification (bf-3u9)
|
||||
|
||||
## Task
|
||||
Verify matchmaker job creation by checking acb-matchmaker logs for successful job creation.
|
||||
|
||||
## Findings
|
||||
|
||||
### Cluster Status
|
||||
The matchmaker deployment exists but is **not running** due to cluster capacity issues:
|
||||
|
||||
- **Matchmaker Pod**: `acb-matchmaker-64f6dc5985-9vh67` in namespace `ai-code-battle`
|
||||
- **Status**: `Pending` (not running)
|
||||
- **Age**: 35 minutes
|
||||
|
||||
### Root Cause
|
||||
The matchmaker pod cannot be scheduled due to:
|
||||
|
||||
1. **Node Health Issues**:
|
||||
- `prod-instance-17825591427380770`: `NotReady` (6h40m)
|
||||
- Two nodes with `untolerated taint` (node.kubernetes.io/not-ready, node.kubernetes.io/unreachable)
|
||||
|
||||
2. **Resource Constraints**:
|
||||
- `FailedScheduling` events show: `0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint...`
|
||||
- Multiple scheduling warnings over 35 minutes indicating ongoing capacity issues
|
||||
|
||||
### Expected Job Creation Log Format
|
||||
When the matchmaker is running and creates jobs, it logs:
|
||||
```
|
||||
matchmaker: created %d-player match %s (seed=%s vs %v), job %s, map=%s
|
||||
```
|
||||
|
||||
This log format is found in `cmd/acb-matchmaker/tickers.go:483`
|
||||
|
||||
## Conclusion
|
||||
**Cannot verify job creation logs because the matchmaker is not running.** The pod is stuck in `Pending` state due to cluster capacity constraints and node health issues.
|
||||
|
||||
## Recommendations
|
||||
1. Fix the NotReady node (`prod-instance-17825591427380770`)
|
||||
2. Scale down non-critical workloads or add cluster capacity
|
||||
3. Once matchmaker is running, verify job creation with:
|
||||
```bash
|
||||
kubectl --server=http://traefik-apexalgo-iad:8001 logs -n ai-code-battle deployment/acb-matchmaker | grep 'created.*player match'
|
||||
```
|
||||
Loading…
Add table
Reference in a new issue