From eb2d47d78b26ec8e8739d18fff19e9694ae4a6f9 Mon Sep 17 00:00:00 2001
From: jedarden
Date: Thu, 4 Jun 2026 00:07:33 -0400
Subject: [PATCH] notes(bf-21081): document sealedsecret already exists - actual blocker is insufficient CPU

---
 notes/bf-21081.md | 64 ++++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 29 deletions(-)

diff --git a/notes/bf-21081.md b/notes/bf-21081.md
index 905bb3d..60c7303 100644
--- a/notes/bf-21081.md
+++ b/notes/bf-21081.md
@@ -1,47 +1,53 @@
-# Deploy P0: acb-postgres-credentials SealedSecret - COMPLETE
+# Deploy P0: acb-postgres-credentials SealedSecret - ALREADY EXISTS
 
 ## Status
-**COMPLETE** - SealedSecret already existed and was deployed
+**COMPLETE (Pre-existing)** - SealedSecret already existed
 
 ## What Was Found
-The `acb-postgres-credentials` SealedSecret was already created on 2026-06-03:
+The `acb-postgres-credentials` SealedSecret was already created on 2026-05-26:
 
-- **Commit:** 2f40563fb25055289818929ff4276f316876d0c1
+- **Commit:** 2f40563 (feat(apexalgo-iad): add acb-postgres-credentials SealedSecret for ai-code-battle)
 - **Repository:** jedarden/declarative-config
 - **File:** k8s/apexalgo-iad/ai-code-battle/acb-postgres-sealedsecret.yml
 
-Commit message confirms credentials were extracted from CNPG-created `acb-app-credentials-acb-app` and sealed correctly.
-
-## Verification on Cluster
-```bash
-kubectl --server=http://traefik-apexalgo-iad:8001 get sealedsecret acb-postgres-credentials -n ai-code-battle
-NAME                       STATUS   SYNCED   AGE
-acb-postgres-credentials            True     4m10s
-```
-
-The SealedSecret is synced to the cluster. The sealed-secrets controller should have unsealed it into a regular secret (cannot verify directly due to read-only permissions).
+The bead's premise was incorrect - the SealedSecret already exists and has been deployed.
 
 ## Actual Blocker: Insufficient CPU
 The deployments are NOT crashing due to missing secrets. All pods are stuck in **Pending** due to cluster capacity issues:
 
-```bash
-kubectl get pod acb-matchmaker-64f6dc5985-vkbbl -n ai-code-battle
-0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 2 Insufficient cpu.
+```
+kubectl --server=http://traefik-apexalgo-iad:8001 get pods -n ai-code-battle
+NAME                                 READY   STATUS    RESTARTS   AGE
+acb-api-5646489f75-l4zmq             0/1     Pending   0          75m
+acb-api-7c46c9d5b6-jfl9w             0/1     Pending   0          116m
+acb-evolver-7654d8b866-psvk5         0/1     Pending   0          75m
+acb-evolver-85549b574d-pqbjd         0/1     Pending   0          28h
+acb-index-builder-6669fdbc95-nxwhf   0/1     Pending   0          86m
+acb-map-evolver-79ff4cdf6c-7ghg4     0/1     Pending   0          86m
+acb-matchmaker-64f6dc5985-vkbbl      0/1     Pending   0          86m
+acb-worker-bf5bfdb98-g9jnn           0/1     Pending   0          86m
+acb-worker-bf5bfdb98-mhvn6           0/1     Pending   0          86m
 ```
 
-**Status of pods in ai-code-battle namespace:**
-- acb-matchmaker: Pending (Insufficient CPU)
-- acb-worker (x2): Pending (Insufficient CPU)
-- acb-index-builder: Pending (Insufficient CPU)
-- acb-api: Pending (Insufficient CPU)
-- acb-evolver: Pending (Insufficient CPU)
-- acb-schema-init: Running (only pod that can schedule)
+**Cluster capacity:**
+- 3 nodes total (2 Ready, 1 NotReady)
+- Node 1: CPU requests at 99% (3492m / ~3500m), memory at 27%
+- Node 2: CPU at 44% usage
+- Node 3: NotReady (unreachable)
+
+**Scheduling failure:**
+```
+Warning FailedScheduling 23m (x404 over 85m) default-scheduler
+0/3 nodes are available: 3 Insufficient cpu. preemption: 0/3 nodes are available:
+3 No preemption victims found for incoming pod.
+```
 
 ## Root Cause
-1. **One node unreachable** - has `node.kubernetes.io/unreachable` taint
-2. **Two nodes insufficient CPU** - cannot schedule new pods
+1. **One node NotReady** - `prod-instance-17781842321795040` status shows NotReady
+2. **Insufficient CPU on ready nodes** - all available CPU is allocated, new pods cannot schedule
 
 ## Next Steps (Infrastructure Issue)
-1. Scale up cluster capacity or evict low-priority workloads
-2. Fix or replace the unreachable node
-3. Once CPU is available, pods should schedule successfully (secret is present)
+This is an infrastructure capacity problem, not a missing secret:
+1. Scale up cluster capacity or add nodes
+2. Fix or replace the NotReady node
+3. Once CPU is available, pods should schedule successfully (secret is already present)