ai-code-battle/notes/bf-4dy.md
jedarden e82b62d2de docs(bf-4dy): document cluster capacity issue blocking match pipeline
- acb-matchmaker and acb-worker pods cannot schedule due to CPU exhaustion
- iad-acb cluster at 99% CPU allocation (1497m/1500m) on only ready node
- Second node NotReady for 7+ hours
- Match pipeline non-functional: no job creation or worker execution possible
- Documented resolution steps and recommended actions

Co-Authored-By: Claude <noreply@anthropic.com>
Bead-Id: bf-4dy
2026-06-27 12:48:51 -04:00

# Match Pipeline Verification - bf-4dy
## Date: 2026-06-27
## Finding: Match Pipeline Cannot Function Due to Cluster Capacity Issue
### Cluster Status
**Cluster**: iad-acb (Rackspace Spot HCP)
**Nodes**: 2
- `prod-instance-17767388520094079`: Ready, 67d old
- `prod-instance-17825486055310528`: **NotReady**, 7h44m old (new node, unhealthy)
### Resource Allocation
Ready node (`prod-instance-17767388520094079`):
- **CPU allocated**: 1497m of 1500m (~99% of capacity)
- **CPU available**: ~3m (less than the smallest pending request, 50m)
- **Memory allocated**: 1654Mi (63% of capacity)
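
The headroom figure follows directly from the two CPU numbers recorded for the ready node. A minimal sketch of that arithmetic, assuming the 1500m figure from the commit message is the node's allocatable CPU (`parse_millicores` is a helper written for this note, not part of any Kubernetes library):

```python
def parse_millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity such as '1497m' or '1.5' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)

# Figures from this note: 1497m requested out of 1500m on the ready node.
allocatable = parse_millicores("1500m")
requested = parse_millicores("1497m")

headroom = allocatable - requested
utilization = requested / allocatable

print(f"headroom: {headroom}m, requests at {utilization:.1%} of allocatable")
# prints "headroom: 3m, requests at 99.8% of allocatable"
```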
### Critical Pods - All Pending
The following pods cannot schedule due to insufficient CPU:
| Pod | CPU Request | Status |
|-----|-------------|--------|
| acb-matchmaker | 50m | Pending |
| acb-worker (x2) | 100m each | Pending |
| acb-api (x2) | unknown | Pending |
| acb-strategy-gatherer | unknown | Pending |
| acb-strategy-guardian | unknown | Pending |
| acb-evolver | unknown | Pending |
| acb-enrichment | unknown | Pending |
| acb-index-builder | unknown | Pending |

**Additional CPU needed**: 250m minimum (50m matchmaker + 2 × 100m workers), not counting the pending pods whose requests are unknown
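
The minimum above can be checked against the ready node's headroom. The sketch below uses only the known requests from the table and the ~3m of free CPU reported under Resource Allocation. The tuple layout is an illustrative choice, and pods with unknown requests are left out, so the real shortfall is larger.

```python
# (pod, CPU request in millicores, replicas) for pending pods with known requests.
pending = [
    ("acb-matchmaker", 50, 1),
    ("acb-worker", 100, 2),
]

headroom_m = 3  # free CPU on the ready node, from this note

needed_m = sum(cpu * replicas for _, cpu, replicas in pending)
shortfall_m = max(0, needed_m - headroom_m)

# A pod can schedule only if a single replica's request fits in the headroom.
schedulable = [name for name, cpu, _ in pending if cpu <= headroom_m]

print(f"needed: {needed_m}m, shortfall: {shortfall_m}m, schedulable: {schedulable}")
# prints "needed: 250m, shortfall: 247m, schedulable: []"
```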
### Running Pods (Ready Node)
| Pod | CPU Request | Status |
|-----|-------------|--------|
| acb-postgres | 50m | Running |
| acb-valkey | 25m | Running |
| acb-map-evolver | 100m | Running (liveness probe failing) |
| acb-strategy-hunter | 100m | Running |
| acb-strategy-random | 50m | Running |
| acb-strategy-rusher | 100m | Running |
| acb-strategy-swarm | 100m | Running |
| acb-schema-init | 10m | Running |
| armor | 25m | Running |

**Total running**: 560m CPU across the pods listed above
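
Summing the table is a useful cross-check against the 1497m allocated on the node. The sketch below shows the listed pods account for only part of that figure. This note does not record what holds the remainder, so the code reports the gap without attributing it.

```python
# CPU requests in millicores, copied from the Running Pods table.
running = {
    "acb-postgres": 50,
    "acb-valkey": 25,
    "acb-map-evolver": 100,
    "acb-strategy-hunter": 100,
    "acb-strategy-random": 50,
    "acb-strategy-rusher": 100,
    "acb-strategy-swarm": 100,
    "acb-schema-init": 10,
    "armor": 25,
}

allocated_m = 1497  # node-level figure from Resource Allocation

listed_m = sum(running.values())
unlisted_m = allocated_m - listed_m  # requests held by pods not in the table

print(f"listed: {listed_m}m, not listed: {unlisted_m}m")
# prints "listed: 560m, not listed: 937m"
```

Identifying the pods behind that gap would show whether right-sizing can free the 250m the matchmaker and workers need.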
### Match Pipeline Status
**Cannot verify** - The match pipeline is non-functional:
1. **acb-matchmaker** is not running - cannot create match jobs
2. **acb-worker** replicas are not running - cannot claim/execute jobs
3. **Valkey logs** show minimal activity (1 change/hour), which is consistent with no jobs being queued
4. **No replay uploads** possible - workers not running
### Additional Issues
- **acb-map-evolver** is restarting frequently (13,202 restarts over 54 days) with liveness probe failures
- **TLS certificate issue** for acb-api-cert (ACME challenge failing)
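
To put the acb-map-evolver restart count in perspective, the figures above can be converted to a rate. This is plain arithmetic on the two numbers in this note. It assumes restarts were spread evenly over the 54 days, which the note does not confirm.

```python
restarts = 13_202
days = 54

per_day = restarts / days
minutes_between = days * 24 * 60 / restarts

print(f"{per_day:.0f} restarts/day, one every {minutes_between:.1f} minutes on average")
# prints "244 restarts/day, one every 5.9 minutes on average"
```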
### Resolution Required
The match pipeline cannot be verified until the following are resolved:
1. **Fix the NotReady node** - Investigate why `prod-instance-17825486055310528` has been NotReady for 7+ hours
2. **Increase cluster capacity** - Add more nodes or resize existing nodes to accommodate pending pods
3. **Investigate acb-map-evolver** - Fix the liveness probe failures causing frequent restarts
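
Node readiness for step 1 can be re-checked programmatically. The sketch below reads the JSON shape produced by `kubectl get nodes -o json` and returns nodes whose `Ready` condition is not `"True"`. The `sample` document is hand-written and trimmed for illustration, not captured output, and the `"Unknown"` status for the second node is an assumption.

```python
import json

def not_ready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is missing or not 'True'."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names

# Illustrative sample only; real output carries many more fields.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "prod-instance-17767388520094079"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "prod-instance-17825486055310528"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
  ]
}
""")

print(not_ready_nodes(sample))
```

Against the live cluster, the same function could be fed the parsed output of `kubectl get nodes -o json`.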
### Recommended Actions
1. Check node logs/events for the NotReady node to diagnose the issue
2. Consider scaling up the cluster or right-sizing existing workloads
3. Once capacity is available, verify matchmaker creates jobs and workers process them
4. Check B2 bucket for replay uploads after workers are running
5. Verify index builder rebuilds static JSON after new replays
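
Sizing the extra capacity in action 2 depends on the CPU requests that the pending-pods table lists as unknown. The sketch below reads them from the JSON shape of `kubectl get pods -o json`. The pod names in `sample` are hypothetical. The missing request on one of them only exercises the `None` path and says nothing about the real acb-api spec.

```python
import json

def parse_millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity such as '50m' or '0.1' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)

def pending_cpu_requests(pod_list: dict) -> dict:
    """Map each Pending pod to its summed container CPU request in millicores.

    A pod maps to None when any container declares no CPU request.
    Init containers are ignored in this sketch.
    """
    result = {}
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        total = 0
        for container in pod["spec"]["containers"]:
            cpu = container.get("resources", {}).get("requests", {}).get("cpu")
            if cpu is None:
                total = None
                break
            total += parse_millicores(cpu)
        result[pod["metadata"]["name"]] = total
    return result

# Hand-written sample with hypothetical pod names, for illustration only.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "acb-matchmaker-example"},
     "spec": {"containers": [{"resources": {"requests": {"cpu": "50m"}}}]},
     "status": {"phase": "Pending"}},
    {"metadata": {"name": "acb-api-example"},
     "spec": {"containers": [{"resources": {}}]},
     "status": {"phase": "Pending"}},
    {"metadata": {"name": "acb-postgres-example"},
     "spec": {"containers": [{"resources": {"requests": {"cpu": "50m"}}}]},
     "status": {"phase": "Running"}}
  ]
}
""")

print(pending_cpu_requests(sample))
```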
**Conclusion**: The match pipeline cannot be verified because critical components (matchmaker, workers) are unable to schedule due to cluster CPU exhaustion.