- acb-matchmaker and acb-worker pods cannot schedule due to CPU exhaustion
- iad-acb cluster at 99% CPU allocation (1497m/1500m) on only ready node
- Second node NotReady for 7+ hours
- Match pipeline non-functional: no job creation or worker execution possible
- Documented resolution steps and recommended actions

Co-Authored-By: Claude <noreply@anthropic.com>
Bead-Id: bf-4dy
Match Pipeline Verification - bf-4dy
Date: 2026-06-27
Finding: Match Pipeline Cannot Function Due to Cluster Capacity Issue
Cluster Status
Cluster: iad-acb (Rackspace Spot HCP)
Nodes: 2
- prod-instance-17767388520094079: Ready, 67d old
- prod-instance-17825486055310528: NotReady, 7h44m old (new node, unhealthy)
Resource Allocation
Ready node (prod-instance-17767388520094079):
- CPU allocated: 1497m (~99% of capacity)
- CPU available: ~3m (insufficient for new pods)
- Memory allocated: 1654Mi (63% of capacity)
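The headroom figures above follow from simple arithmetic. A minimal sketch, assuming the 1500m allocatable CPU implied by the 1497m/1500m figure quoted at the top of this report; the function and constant names are illustrative only:

```python
# Figures from the report. 1500m allocatable is taken from the "1497m/1500m" summary.
ALLOCATABLE_CPU_M = 1500
ALLOCATED_CPU_M = 1497


def cpu_headroom(allocatable_m, allocated_m):
    """Return (free millicores, percent of allocatable CPU already requested)."""
    free_m = allocatable_m - allocated_m
    pct = 100.0 * allocated_m / allocatable_m
    return free_m, pct


free_m, pct = cpu_headroom(ALLOCATABLE_CPU_M, ALLOCATED_CPU_M)
print(f"free: {free_m}m, allocated: {pct:.1f}%")  # free: 3m, allocated: 99.8%
```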
Critical Pods - All Pending
The following pods cannot schedule due to insufficient CPU:
| Pod | CPU Request | Status |
|---|---|---|
| acb-matchmaker | 50m | Pending |
| acb-worker (x2) | 100m each | Pending |
| acb-api (x2) | unknown | Pending |
| acb-strategy-gatherer | unknown | Pending |
| acb-strategy-guardian | unknown | Pending |
| acb-evolver | unknown | Pending |
| acb-enrichment | unknown | Pending |
| acb-index-builder | unknown | Pending |
Additional CPU needed: ~250m minimum (matchmaker + 2 workers)
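The 250m minimum can be checked against the free CPU on the Ready node. A small sketch using only the requests listed in the table; the two worker keys are hypothetical labels for the two acb-worker replicas, and the pods with unknown requests are left out:

```python
# CPU requests in millicores for the pods the match pipeline needs (from the table).
# "acb-worker-1" and "acb-worker-2" are placeholder names for the two replicas.
PIPELINE_REQUESTS_M = {"acb-matchmaker": 50, "acb-worker-1": 100, "acb-worker-2": 100}
FREE_CPU_M = 3  # free CPU on the only Ready node


def cpu_shortfall(requests_m, free_m):
    """Millicores still missing after using all free CPU (0 if everything fits)."""
    return max(0, sum(requests_m.values()) - free_m)


needed_m = sum(PIPELINE_REQUESTS_M.values())
shortfall_m = cpu_shortfall(PIPELINE_REQUESTS_M, FREE_CPU_M)
print(f"needed: {needed_m}m, shortfall: {shortfall_m}m")  # needed: 250m, shortfall: 247m
```

Even the smallest of these requests (50m for acb-matchmaker) exceeds the 3m that is free, so none of the three pods can be placed on the Ready node as it stands.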
Running Pods (Ready Node)
| Pod | CPU Request | Status |
|---|---|---|
| acb-postgres | 50m | Running |
| acb-valkey | 25m | Running |
| acb-map-evolver | 100m | Running (liveness probe failing) |
| acb-strategy-hunter | 100m | Running |
| acb-strategy-random | 50m | Running |
| acb-strategy-rusher | 100m | Running |
| acb-strategy-swarm | 100m | Running |
| acb-schema-init | 10m | Running |
| armor | 25m | Running |
Total CPU requested by the running pods listed above: 560m
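The total can be cross-checked by summing the requests in the table. The sketch below also computes the gap between that sum and the 1497m allocated on the node; the report does not attribute that difference to specific pods, so it is shown only as a number to follow up on:

```python
# CPU requests in millicores, copied from the running-pods table.
RUNNING_REQUESTS_M = {
    "acb-postgres": 50,
    "acb-valkey": 25,
    "acb-map-evolver": 100,
    "acb-strategy-hunter": 100,
    "acb-strategy-random": 50,
    "acb-strategy-rusher": 100,
    "acb-strategy-swarm": 100,
    "acb-schema-init": 10,
    "armor": 25,
}
NODE_ALLOCATED_M = 1497  # total CPU allocated on the Ready node

listed_m = sum(RUNNING_REQUESTS_M.values())
unlisted_m = NODE_ALLOCATED_M - listed_m
print(f"listed: {listed_m}m, not covered by the table: {unlisted_m}m")
# listed: 560m, not covered by the table: 937m
```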
Match Pipeline Status
Cannot verify - The match pipeline is non-functional:
- acb-matchmaker is not running - cannot create match jobs
- acb-worker replicas are not running - cannot claim/execute jobs
- Valkey logs show minimal activity (1 change/hour) - no job queue activity
- No replay uploads possible - workers not running
Additional Issues
- acb-map-evolver is restarting frequently (13,202 restarts over 54 days) with liveness probe failures
- TLS certificate issue for acb-api-cert (ACME challenge failing)
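To put the acb-map-evolver restart count in perspective, it can be converted to an average rate. This is a rough sketch that assumes the restarts were spread evenly over the 54 days, which the report does not state:

```python
RESTARTS = 13202  # restart count reported for acb-map-evolver
DAYS = 54         # observation window from the report


def mean_minutes_between_restarts(restarts, days):
    """Average gap between restarts in minutes, assuming an even spread."""
    return days * 24 * 60 / restarts


per_day = RESTARTS / DAYS
gap_min = mean_minutes_between_restarts(RESTARTS, DAYS)
print(f"about {per_day:.0f} restarts/day, one every {gap_min:.1f} min")
# about 244 restarts/day, one every 5.9 min
```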
Resolution Required
The match pipeline cannot be verified until the following are resolved:
- Fix the NotReady node - Investigate why prod-instance-17825486055310528 has been NotReady for 7+ hours
- Increase cluster capacity - Add more nodes or resize existing nodes to accommodate pending pods
- Investigate acb-map-evolver - Fix the liveness probe failures causing frequent restarts
Recommended Actions
- Check node logs/events for the NotReady node to diagnose the issue
- Consider scaling up the cluster or right-sizing existing workloads
- Once capacity is available, verify matchmaker creates jobs and workers process them
- Check the B2 bucket for replay uploads after workers are running
- Verify that the index builder rebuilds static JSON after new replays
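To support the right-sizing step above, the CPU requests marked "unknown" for Pending pods could be read from a pod listing in JSON form. The sketch below parses a Kubernetes PodList structure and sums CPU requests per Pending pod. The sample data and pod names are made up for illustration, containers without a CPU request are counted as 0, and only the plain-number and "m" (millicore) quantity forms are handled:

```python
def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as "100m", "0.5" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def pending_cpu_requests(pod_list):
    """Map each Pending pod name to its total CPU request in millicores."""
    result = {}
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        total_m = 0
        for container in pod.get("spec", {}).get("containers", []):
            cpu = container.get("resources", {}).get("requests", {}).get("cpu")
            if cpu is not None:
                total_m += parse_cpu_millicores(cpu)
        result[pod["metadata"]["name"]] = total_m
    return result


# Hypothetical sample in the shape of a PodList; not real cluster output.
SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {"name": "acb-matchmaker-example"},
            "status": {"phase": "Pending"},
            "spec": {"containers": [{"resources": {"requests": {"cpu": "50m"}}}]},
        },
        {
            "metadata": {"name": "acb-postgres-example"},
            "status": {"phase": "Running"},
            "spec": {"containers": [{"resources": {"requests": {"cpu": "50m"}}}]},
        },
    ]
}

print(pending_cpu_requests(SAMPLE_POD_LIST))  # {'acb-matchmaker-example': 50}
```

With cluster access, the same function could be given the parsed output of `kubectl get pods -o json` (for example via `json.load`); the namespace to query is not stated in this report and would need to be supplied.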
Conclusion: The match pipeline cannot be verified because critical components (matchmaker, workers) are unable to schedule due to cluster CPU exhaustion.