jedarden d7f5bd7e7f docs(bf-3u9): document matchmaker job creation verification failure

- Cluster capacity insufficient to schedule acb-matchmaker pod
- All ACB pods stuck in Pending state due to insufficient CPU
- No jobs exist because matchmaker has never been able to start
- Verification cannot complete until cluster capacity is restored
- One node NotReady (prod-instance-17825591427380770)
- Total pending CPU requests: ~2250m vs ~4181m available (but fragmentation/blocking)

2026-06-27 14:40:24 -04:00


Matchmaker Job Creation Verification - bf-3u9

Date: 2026-06-27
Cluster: apexalgo-iad
Namespace: ai-code-battle

Critical Finding: Cluster Capacity Blocks Job Creation

The acb-matchmaker logs cannot be checked because the matchmaker pod has never started. Every pod in the ai-code-battle namespace except acb-schema-init is stuck in the Pending state because the scheduler cannot find a node with enough unreserved CPU.

Current Cluster Status

Nodes (3 total)

prod-instance-17781842321795040: Ready, 32% CPU (1152m/3500m used), 15% memory
prod-instance-17825487911280674: Ready, 47% CPU (1667m/3500m used), 65% memory
prod-instance-17825591427380770: NotReady, 2% CPU (83m), 12% memory

Pod Status

Running: Only acb-schema-init-5b698c549d-wzhnc (1/1)
Pending: All other pods including:
- acb-matchmaker-64f6dc5985-9vh67 (pending for 63+ minutes)
- acb-api-5646489f75-fs7wx
- acb-worker-bf5bfdb98-68k4r
- 8 bot strategy pods (including random, rusher, gatherer, guardian, hunter, swarm, farmer)
- acb-evolver, acb-enrichment, acb-index-builder
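
The Running/Pending split above can be tallied mechanically rather than by eye. The following is a minimal sketch, assuming the pod list is exported with `kubectl get pods -n ai-code-battle -o json`; the `pods_by_phase` helper and the trimmed `SAMPLE` document are hypothetical and hand-written for illustration (only the pod names come from this report), not captured cluster output.

```python
import json
from collections import defaultdict


def pods_by_phase(pod_list_json: str) -> dict[str, list[str]]:
    """Group pod names by status.phase from a `kubectl get pods -o json` document."""
    doc = json.loads(pod_list_json)
    grouped = defaultdict(list)
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        # Pods with no reported phase are bucketed as "Unknown".
        phase = item.get("status", {}).get("phase", "Unknown")
        grouped[phase].append(name)
    return dict(grouped)


# Hand-written sample in the shape of the kubectl JSON output (illustrative only).
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "acb-schema-init-5b698c549d-wzhnc"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "acb-matchmaker-64f6dc5985-9vh67"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "acb-api-5646489f75-fs7wx"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "acb-worker-bf5bfdb98-68k4r"},
         "status": {"phase": "Pending"}},
    ]
})

summary = pods_by_phase(SAMPLE)
for phase, names in sorted(summary.items()):
    print(f"{phase}: {len(names)} pod(s)")
```

Re-running the same tally after capacity changes gives a quick before/after comparison for the re-verification.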

Job Creation Status

No jobs exist in the ai-code-battle namespace. Job creation cannot occur because:

The matchmaker pod cannot schedule due to insufficient CPU
Even if it were scheduled, the matchmaker requires a PostgreSQL connection, which depends on pods that are also Pending
Workers are also pending, so no jobs could execute even if created

Scheduling Failure Details

All pending pods show this pattern:

0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu

The NotReady node (prod-instance-17825591427380770) appears to be newly added (7h8m old); it may still be initializing, or it may have a fault that needs investigation.
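
The scheduler message above already accounts for every node, which is easier to see once it is split into per-reason counts. The following is a rough sketch of such a parser; `parse_unschedulable` is a hypothetical helper written for this report, and it assumes the simple comma-separated layout shown above (real messages can carry extra suffixes that this sketch does not handle).

```python
import re

# Message text copied from the scheduling failure shown above.
MESSAGE = (
    "0/3 nodes are available: 1 node(s) had untolerated taint "
    "{node.kubernetes.io/not-ready: }, 2 Insufficient cpu"
)

HEADER = re.compile(r"^(\d+)/(\d+) nodes are available: (.+)$")
REASON = re.compile(r"^(\d+) (.+)$")


def parse_unschedulable(message: str) -> tuple[int, int, dict[str, int]]:
    """Return (feasible nodes, total nodes, {reason: node count})."""
    match = HEADER.match(message.strip())
    if match is None:
        raise ValueError("not a recognised scheduling failure message")
    feasible, total = int(match.group(1)), int(match.group(2))
    reasons: dict[str, int] = {}
    # Assumes reasons are separated by ", " and contain no such separator themselves.
    for part in match.group(3).rstrip(".").split(", "):
        reason = REASON.match(part)
        if reason:
            reasons[reason.group(2)] = int(reason.group(1))
    return feasible, total, reasons


feasible, total, reasons = parse_unschedulable(MESSAGE)
for text, count in reasons.items():
    print(f"{count} node(s): {text}")
print(f"accounted for {sum(reasons.values())} of {total} nodes")
```

Read this way, the one tainted node is simply excluded, and both Ready nodes were rejected specifically for CPU.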

Resource Analysis

Available CPU (Ready nodes only)

Node 1: ~2348m available (3500m - 1152m used)
Node 2: ~1833m available (3500m - 1667m used)
Total available: ~4181m CPU

Pending pod CPU requests (estimated)

acb-matchmaker: 50m
acb-api (2 pods): 200m
acb-enrichment (2 pods): 400m
acb-evolver (2 pods): 1000m
acb-worker (2 pods): ~200m
8 bot strategy pods: ~400m
acb-index-builder: 50m
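Total requests: ~2300m

On these figures there should be enough CPU (~4181m available vs ~2300m requested), yet the scheduler reports insufficient CPU. The two numbers measure different things: the "available" figures above are derived from live usage, whereas the scheduler compares each pod's CPU request against the node's allocatable CPU minus the requests already committed to pods on that node. Possible contributors:

CPU requests already reserved by other workloads on the Ready nodes, which kubectl top nodes does not show because it reports usage rather than requests
Resource fragmentation: each pod must fit on a single node, so larger pods may not schedule even when the cluster-wide total looks sufficient
The NotReady node, which is tainted and therefore excluded from scheduling entirely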
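
To check whether fragmentation alone could explain the failure, the estimates above can be packed into the usage-derived headroom of the two Ready nodes. This is a simplified sketch, not the scheduler's algorithm: it uses first-fit-decreasing packing, assumes each group's CPU total is split evenly across its pods, and uses the hypothetical labels `node-1` and `node-2` for the two Ready nodes.

```python
# (group, pod count, estimated total CPU request in millicores), from the list above.
PENDING_GROUPS = [
    ("acb-matchmaker", 1, 50),
    ("acb-api", 2, 200),
    ("acb-enrichment", 2, 400),
    ("acb-evolver", 2, 1000),
    ("acb-worker", 2, 200),
    ("bot-strategy", 8, 400),
    ("acb-index-builder", 1, 50),
]

# Headroom derived from usage (3500m minus used), as in the report.
# ASSUMPTION: this is NOT what the scheduler sees; it works from requests.
HEADROOM = {"node-1": 3500 - 1152, "node-2": 3500 - 1667}


def per_pod_requests(groups):
    """Expand groups into one (name, millicores) entry per pod (even split assumed)."""
    pods = []
    for name, count, total in groups:
        pods.extend([(name, total // count)] * count)
    return pods


def first_fit_decreasing(pods, headroom):
    """Place the largest pods first on the first node with room; return leftovers."""
    free = dict(headroom)
    unplaced = []
    for name, cpu in sorted(pods, key=lambda pod: pod[1], reverse=True):
        for node in free:
            if cpu <= free[node]:
                free[node] -= cpu
                break
        else:
            unplaced.append(name)
    return free, unplaced


pods = per_pod_requests(PENDING_GROUPS)
free, unplaced = first_fit_decreasing(pods, HEADROOM)
print("total requested:", sum(cpu for _, cpu in pods), "m")
print("remaining headroom:", free)
print("unplaced pods:", unplaced)
```

Under these assumptions every pending pod fits, and the listed estimates sum to 2300m. That suggests fragmentation of the usage-derived headroom is not sufficient on its own to explain the failure, and that the requests already committed on each Ready node are the figure to inspect next.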

Verification Conclusion

Status: ❌ VERIFICATION FAILED - Infrastructure Issue

The matchmaker job creation cannot be verified because:

Cluster capacity insufficient - Matchmaker pod cannot schedule
No jobs in queue - Query returns 0 jobs (expected since matchmaker never ran)
No logs available - Pod never started, so no logs to check

Next Steps Required

1. Fix cluster capacity - Either:
   - Add more nodes to the cluster
   - Scale down resource requests for ACB pods
   - Move other workloads off apexalgo-iad to free capacity
2. Fix NotReady node - Investigate why prod-instance-17825591427380770 is NotReady
3. Re-deploy ACB stack - Once capacity is available, delete and recreate pods
4. Re-run verification - Check matchmaker logs after pods are running

Acceptance Criteria Status

❌ acb-matchmaker logs show successful job creation - CANNOT VERIFY (pod never started)
❌ Jobs appear in the queue with valid bot pairs - NO JOBS (matchmaker never ran)
❌ No errors in matchmaker scheduling logic - CANNOT VERIFY (no logs)

Recommendation

This verification should be re-attempted after cluster capacity is restored. The current apexalgo-iad cluster appears under-provisioned for the ACB workload.
