jedarden e82b62d2de docs(bf-4dy): document cluster capacity issue blocking match pipeline

- acb-matchmaker and acb-worker pods cannot schedule due to CPU exhaustion
- iad-acb cluster at 99% CPU allocation (1497m/1500m) on only ready node
- Second node NotReady for 7+ hours
- Match pipeline non-functional: no job creation or worker execution possible
- Documented resolution steps and recommended actions

Co-Authored-By: Claude <noreply@anthropic.com>
Bead-Id: bf-4dy

2026-06-27 12:48:51 -04:00


Match Pipeline Verification - bf-4dy

Date: 2026-06-27

Finding: Match Pipeline Cannot Function Due to Cluster Capacity Issue

Cluster Status

Cluster: iad-acb (Rackspace Spot HCP)
Nodes: 2

prod-instance-17767388520094079: Ready, 67d old
prod-instance-17825486055310528: NotReady, 7h44m old (new node, unhealthy)

Resource Allocation

Ready node (prod-instance-17767388520094079):

CPU allocated: 1497m of 1500m (99.8% of capacity)
CPU available: 3m (insufficient for any pending pod)
Memory allocated: 1654Mi (63% of capacity)
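
The CPU figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not tooling from this repository: the helper names `parse_cpu_millicores` and `cpu_headroom` are hypothetical, and the parser handles only the two common CPU quantity forms (millicores such as `1497m`, or whole/fractional cores such as `1.5`).

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes-style CPU quantity ('1497m' or '1.5') to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    # A plain number is expressed in cores; 1 core = 1000 millicores.
    return round(float(quantity) * 1000)


def cpu_headroom(allocated: str, allocatable: str) -> tuple[int, float]:
    """Return (free millicores, percent allocated) for one node."""
    alloc_m = parse_cpu_millicores(allocated)
    total_m = parse_cpu_millicores(allocatable)
    return total_m - alloc_m, 100.0 * alloc_m / total_m


# Values reported for the Ready node: 1497m allocated out of 1500m.
free_m, pct = cpu_headroom("1497m", "1500m")
print(f"{free_m}m free, {pct:.1f}% allocated")  # 3m free, 99.8% allocated
```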

Critical Pods - All Pending

The following pods cannot be scheduled due to insufficient CPU:

Pod	CPU Request	Status
acb-matchmaker	50m	Pending
acb-worker (x2)	100m each	Pending
acb-api (x2)	unknown	Pending
acb-strategy-gatherer	unknown	Pending
acb-strategy-guardian	unknown	Pending
acb-evolver	unknown	Pending
acb-enrichment	unknown	Pending
acb-index-builder	unknown	Pending

Additional CPU needed: 250m minimum (matchmaker at 50m + 2 workers at 100m each). The pods with unknown requests would add to this figure.
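
As a rough check of the shortfall, the known requests can be summed and compared with the free CPU on the Ready node. This is a sketch under stated assumptions: only the pods with a known request are included, and the names `PENDING`, `cpu_needed`, and `cpu_shortfall` are illustrative rather than part of any existing script.

```python
# CPU requests in millicores, taken from the Pending table.
# Pods whose request is listed as "unknown" are deliberately left out.
PENDING = [
    # (name, request per replica in millicores, replicas)
    ("acb-matchmaker", 50, 1),
    ("acb-worker", 100, 2),
]
FREE_CPU_M = 3  # free CPU reported on the Ready node


def cpu_needed(pods):
    """Total CPU request across all replicas, in millicores."""
    return sum(request * replicas for _, request, replicas in pods)


def cpu_shortfall(pods, free_m):
    """How many millicores are missing before every listed pod fits."""
    return max(0, cpu_needed(pods) - free_m)


print(cpu_needed(PENDING))                 # 250
print(cpu_shortfall(PENDING, FREE_CPU_M))  # 247
```

Because the remaining pending pods have unknown requests, 247m is a lower bound on the capacity that has to be freed or added.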

Running Pods (Ready Node)

Pod	CPU Request	Status
acb-postgres	50m	Running
acb-valkey	25m	Running
acb-map-evolver	100m	Running (liveness probe failing)
acb-strategy-hunter	100m	Running
acb-strategy-random	50m	Running
acb-strategy-rusher	100m	Running
acb-strategy-swarm	100m	Running
acb-schema-init	10m	Running
armor	25m	Running

Total running: 560m CPU (sum of the requests listed above)
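
The listed requests account for only part of the 1497m allocated on the node. The sketch below recomputes the total and the remainder; attributing that remainder to pods not shown in the table (for example system components or other namespaces) is an assumption, since the report does not list them.

```python
# CPU requests in millicores for the pods listed as Running on the Ready node.
RUNNING_REQUESTS_M = {
    "acb-postgres": 50,
    "acb-valkey": 25,
    "acb-map-evolver": 100,
    "acb-strategy-hunter": 100,
    "acb-strategy-random": 50,
    "acb-strategy-rusher": 100,
    "acb-strategy-swarm": 100,
    "acb-schema-init": 10,
    "armor": 25,
}
NODE_ALLOCATED_M = 1497  # CPU allocated on the Ready node, per the report

listed_total_m = sum(RUNNING_REQUESTS_M.values())
unlisted_m = NODE_ALLOCATED_M - listed_total_m

print(listed_total_m)  # 560
print(unlisted_m)      # 937 (requested by pods not in the table, by assumption)
```

If the remainder does belong to unlisted pods, they hold more CPU than all listed workloads combined, so they are a natural place to look when right-sizing.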

Match Pipeline Status

Cannot verify - The match pipeline is non-functional:

acb-matchmaker is not running - cannot create match jobs
acb-worker replicas are not running - cannot claim/execute jobs
Valkey logs show minimal activity (1 change/hour) - no job queue activity
No replay uploads possible - workers not running

Additional Issues

acb-map-evolver is restarting frequently (13,202 restarts over 54 days) with liveness probe failures
TLS certificate issue for acb-api-cert (ACME challenge failing)
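
To put the acb-map-evolver restart count in perspective, it can be converted to an average rate. This is plain arithmetic on the two figures reported above; it assumes restarts were spread evenly over the period, which is unlikely to hold exactly.

```python
RESTARTS = 13_202  # restart count reported for acb-map-evolver
DAYS = 54          # observation period reported alongside it

restarts_per_day = RESTARTS / DAYS
minutes_between_restarts = DAYS * 24 * 60 / RESTARTS

print(f"{restarts_per_day:.0f} restarts/day, "
      f"one every {minutes_between_restarts:.1f} min on average")
# 244 restarts/day, one every 5.9 min on average
```

An interval of roughly six minutes would be consistent with a container that fails shortly after each start and then waits out the kubelet's restart back-off, which is capped at five minutes. That reading is a hypothesis to check against the pod's events, not a confirmed diagnosis.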

Resolution Required

The match pipeline cannot be verified until the following are addressed:

Fix the NotReady node - Investigate why prod-instance-17825486055310528 has been NotReady for 7+ hours
Increase cluster capacity - Add more nodes or resize existing nodes to accommodate pending pods
Investigate acb-map-evolver - Fix the liveness probe failures causing frequent restarts
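
A first step for the node investigation is to read the `Ready` condition that each node reports. The sketch below works on the parsed output of `kubectl get nodes -o json`; the function name `not_ready_nodes` is hypothetical, and the sample data (node names, reason, message) is invented for illustration and is not taken from the iad-acb cluster.

```python
import json


def not_ready_nodes(node_list):
    """Return (name, status, reason) for every node whose Ready condition is not 'True'."""
    found = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            # No Ready condition at all: treat the node as not ready.
            found.append((name, "Missing", ""))
        elif ready.get("status") != "True":
            found.append((name, ready.get("status", "Unknown"), ready.get("reason", "")))
    return found


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
SAMPLE_NODES = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [
       {"type": "Ready", "status": "True", "reason": "KubeletReady"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [
       {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown",
        "message": "Kubelet stopped posting node status."}]}}
  ]
}
""")

print(not_ready_nodes(SAMPLE_NODES))
# [('node-b', 'Unknown', 'NodeStatusUnknown')]
```

In practice the JSON would come from the live cluster, and the `reason` and `message` fields of the failing condition usually indicate where to look next (kubelet, network, or the node's own provisioning).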

Recommended Actions

Check node logs/events for the NotReady node to diagnose the issue
Consider scaling up the cluster or right-sizing existing workloads
Once capacity is available, verify matchmaker creates jobs and workers process them
Check B2 bucket for replay uploads after workers are running
Verify index builder rebuilds static JSON after new replays
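
Before checking job creation, it helps to confirm that no critical pod is still blocked by the scheduler. The following sketch filters the parsed output of `kubectl get pods -o json` for Pending pods whose `PodScheduled` condition is `False` with reason `Unschedulable`. The function name `unschedulable_pods` and the sample data are hypothetical; the pod names and scheduler message are made up for illustration, and the namespace to query is left to the operator.

```python
def unschedulable_pods(pod_list):
    """Map pod name -> scheduler message for Pending pods marked Unschedulable."""
    blocked = {}
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if (cond.get("type") == "PodScheduled"
                    and cond.get("status") == "False"
                    and cond.get("reason") == "Unschedulable"):
                blocked[pod["metadata"]["name"]] = cond.get("message", "")
    return blocked


# Hypothetical sample in the shape of `kubectl get pods -o json`.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "example-matchmaker-0"},
         "status": {"phase": "Pending", "conditions": [
             {"type": "PodScheduled", "status": "False",
              "reason": "Unschedulable",
              "message": "0/2 nodes are available: 1 Insufficient cpu."}]}},
        {"metadata": {"name": "example-postgres-0"},
         "status": {"phase": "Running", "conditions": [
             {"type": "PodScheduled", "status": "True"}]}},
    ]
}

for name, message in unschedulable_pods(SAMPLE_PODS).items():
    print(f"{name}: {message}")
# example-matchmaker-0: 0/2 nodes are available: 1 Insufficient cpu.
```

An empty result for the matchmaker and worker pods would be the signal to move on to the job, replay, and index checks listed above.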

Conclusion: The match pipeline cannot be verified because critical components (matchmaker, workers) are unable to schedule due to cluster CPU exhaustion.
