Commit graph

723 commits

Author SHA1 Message Date
jedarden
c4aaa5b1de docs(bf-5kk): standardize on ai-code-battle.pages.dev as canonical public domain
The domain aicodebattle.com is NXDOMAIN (not registered).
Decision made to use ai-code-battle.pages.dev as canonical domain.

All user-facing URLs in web/src already use pages.dev:
- OG tags (og-tags.ts)
- Share URLs (clip-maker.ts)
- API examples (docs*.ts pages)

Decision note: docs/notes/bf-5kk-canonical-domain-decision.md
2026-07-02 13:56:46 -04:00
jedarden
9b4c6fba26 chore(bf-23j): remove committed binaries and generated artifacts from repo root
Remove committed compiled binaries (acb-local-fixed, acb-local-test, acb-map-evolver, acb-maps-loader, arena.test - ~39MB total) and generated artifacts (test-combat.json, test-swarm-rusher.json, match logs). Also remove 39 incremental bf-22vc5 status notes, keeping only the consolidated final summary (notes/bf-22vc5.md).

Update .gitignore to prevent recurrence:
- Pattern-match all acb-* binaries and arena.test
- Ignore test-replay*.json and match-*.log files

This aligns the repo with the planned monorepo structure (docs/plan/plan.md section 11.1) and reduces clone size and git history bloat.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-07-02 13:39:45 -04:00
jedarden
b7799c4fec docs(bf-36wp): verify acb-site-build WorkflowTemplate configuration
- WorkflowTemplate exists on iad-ci ✓
- Currently builds container images, NOT Cloudflare Pages deployment ✗
- Documented required changes to deploy web/ → ai-code-battle Pages
- Reference pattern: website-build WorkflowTemplate uses wrangler pages deploy
2026-07-02 12:23:07 -04:00
jedarden
4aa1a59dfb docs(bf-5usp): verify existing Forgejo webhook for ai-code-battle
The Forgejo webhook for ai-code-battle was already registered and active:
- URL: https://webhooks-ci.ardenone.com/ai-code-battle
- Events: push
- Active: true

No configuration changes were needed.
2026-07-02 12:14:05 -04:00
jedarden
18e49154ce docs(bf-175): mark bot fleet consolidation complete 2026-07-02 11:52:17 -04:00
jedarden
fe4da19528 docs(bf-5usp): verify existing Forgejo webhook registration 2026-07-02 11:34:26 -04:00
jedarden
876a30e5db docs(bf-5usp): document existing Forgejo webhook for ai-code-battle
The webhook at webhooks-ci.ardenone.com/ai-code-battle is already
registered and active for push events to the master branch.
2026-07-02 11:01:10 -04:00
jedarden
b222a1d7e3 ci(bf-414): migrate Pages deploy from GitHub Actions to Argo
- Disable .github/workflows/deploy-pages.yml (renamed to .disabled)
- Deploy now runs via Argo Events sensor → acb-site-pages-build workflow
- Forgejo webhook at webhooks-ci.ardenone.com already registered and active
- Cloudflare API token secret already configured in argo-workflows namespace

Co-Authored-By: Claude <noreply@anthropic.com>
2026-07-02 10:34:42 -04:00
jedarden
6420c2e7b1 docs(bf-175): document bot fleet consolidation decision
Documented the decision to consolidate duplicate bot fleets from ai-code-battle
and acb-bots namespaces into the single canonical 6-strategy-bot fleet in
ai-code-battle namespace as specified in plan.md.
2026-07-02 09:59:35 -04:00
jedarden
ab7c320991 docs(bf-4ur): document secret templates and credential sources for ai-code-battle 2026-07-02 09:16:31 -04:00
jedarden
7360d24d8e docs(bf-4ur): document secret templates and credential sources for apexalgo-iad
Reviewed R2_ACCESS_KEY_SOURCE.md and IAD-ACB-R2-CREDENTIALS-FIX.md (for context on iad-acb).
Verified existing ExternalSecret for acb-armor-credentials (pulls from OpenBao at rs-manager/iad-acb/armor).
Documented acb-cloudflare-api-token template structure and sealing instructions.

Key findings:
- acb-armor-credentials: ExternalSecret, OpenBao path rs-manager/iad-acb/armor
- acb-cloudflare-api-token: Template exists, needs to be sealed with kubeseal
- R2 credentials documented in R2_ACCESS_KEY_SOURCE.md are for iad-acb cluster

Co-Authored-By: Claude <noreply@anthropic.com>
2026-07-02 08:33:04 -04:00
jedarden
7c18b5a4ce docs(bf-4ur): document secret templates and credential sources for apexalgo-iad
- Reviewed R2_ACCESS_KEY_SOURCE.md and IAD-ACB-R2-CREDENTIALS-FIX.md
- Documented acb-armor-credentials ExternalSecret structure
- Documented acb-cloudflare-api-token Secret template
- Identified credential sources and OpenBao paths
- Mapped environment variables for both secrets

Co-Authored-By: Claude <noreply@anthropic.com>
2026-07-02 08:27:48 -04:00
jedarden
78b30043b4 docs(bf-5ec): document Cloudflare Pages deployment completion
- Cloudflare Pages site successfully deployed to https://ai-code-battle.pages.dev
- GitHub Actions workflow completed successfully (123 files uploaded)
- GitHub secrets (CLOUDFLARE_API_TOKEN, CLOUDFLARE_ACCOUNT_ID) already configured
- Custom domain aicodebattle.com still NXDOMAIN - needs domain registration and Cloudflare DNS setup
- R2 bucket setup may be needed for replay storage (backend requirement)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 17:54:28 -04:00
jedarden
14a0aa7fbd docs(bf-3lo): document ACB Kubernetes manifests sync completion
- Verify all 52 ACB manifests present in declarative-config
- Confirm ArgoCD sync status: Synced
- Document pod status issues due to dependencies (bf-7i6, bf-2z2)
- Confirm no drift between cluster and declarative-config

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 14:58:36 -04:00
jedarden
a973ba932a docs(bf-5y1): document forgejo push completion for ACB manifest sync
Bead-Id: bf-5y1
2026-06-27 14:58:36 -04:00
jedarden
d7f5bd7e7f docs(bf-3u9): document matchmaker job creation verification failure
- Cluster capacity insufficient to schedule acb-matchmaker pod
- All ACB pods stuck in Pending state due to insufficient CPU
- No jobs exist because matchmaker has never been able to start
- Verification cannot complete until cluster capacity is restored
- One node NotReady (prod-instance-17825591427380770)
- Total pending CPU requests: ~2250m vs ~4181m available (but fragmentation/blocking)
2026-06-27 14:40:24 -04:00
jedarden
c5bef98747 fix(bf-5ec): update wrangler version to 4.81.0 in workflow
Some checks failed
Deploy to Cloudflare Pages / Deploy to Cloudflare Pages (push) Has been cancelled
Update wranglerVersion from 3 to 4.81.0 to match installed version.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 14:17:36 -04:00
jedarden
b4155fc92c feat(bf-5ec): enable Cloudflare Pages deployment workflow
Some checks are pending
Deploy to Cloudflare Pages / Deploy to Cloudflare Pages (push) Waiting to run
Enable GitHub Actions workflow for automatic deployment of web frontend to Cloudflare Pages on pushes to master branch.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 14:17:00 -04:00
jedarden
034066085b docs(bf-5y1): document ACB manifest sync completion
Synced 5 deployment manifests from ai-code-battle/manifests/ to declarative-config.
All ACB components now managed by ArgoCD.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 14:15:25 -04:00
jedarden
182e19eb7c docs(bf-3u9): document matchmaker job creation verification - cluster capacity blocks operation 2026-06-27 14:09:12 -04:00
jedarden
986455b606 docs(bf-5jb): local match analysis with verbose logging and replay capture
- Ran multiple local matches with --verbose flag enabled
- Captured replay JSON data from 6-player, 4-player, and 3-player matches
- Analyzed combat events: 6 combat deaths, 4 energy collections, 7 bot spawns in primary match
- Created comprehensive analysis document with combat event counts
- No focus-fire behavior detected in test matches (no multi-killer combat events)
- All matches completed successfully without errors

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 12:48:51 -04:00
jedarden
e82b62d2de docs(bf-4dy): document cluster capacity issue blocking match pipeline
- acb-matchmaker and acb-worker pods cannot schedule due to CPU exhaustion
- iad-acb cluster at 99% CPU allocation (1497m/1500m) on only ready node
- Second node NotReady for 7+ hours
- Match pipeline non-functional: no job creation or worker execution possible
- Documented resolution steps and recommended actions

Co-Authored-By: Claude <noreply@anthropic.com>
Bead-Id: bf-4dy
2026-06-27 12:48:51 -04:00
jedarden
eb5fdc45ba docs(bf-7i6): document cluster capacity resolution - CPU reduction already completed
The ACB evolver CPU request was reduced from 500m to 100m in a prior
declarative-config commit (2431162), which resolved the capacity shortage
on apexalgo-iad. Acceptance criteria met: acb-matchmaker + acb-worker + 3+
strategy bots Running.
2026-06-27 12:05:15 -04:00
jedarden
a424d84c5c chore: update predispatch sha 2026-06-27 11:50:12 -04:00
jedarden
63b6f9916d docs(bf-2z2): update resolution details with image digest and manifest verification 2026-06-27 11:17:48 -04:00
jedarden
b1f6067131 docs(bf-7i6): document cluster capacity resolution - CPU reduction already completed 2026-06-27 11:10:35 -04:00
jedarden
1800520092 fix(bf-2z2): build and push acb-map-evolver image to Docker Hub
- Built acb-map-evolver Docker image from cmd/acb-map-evolver/Dockerfile
- Pushed ronaldraygun/acb-map-evolver:e5dc3bc to Docker Hub
- Verified manifest already exists in declarative-config
- Image digest: sha256:3d5a4a4dfa8bb73e46b3ec2d937846f5289d556853d5c3d41b180a42d4ed66d9

Resolves ImagePullBackOff for acb-map-evolver pod.
2026-06-27 10:57:22 -04:00
jedarden
a62c6279af fix(bf-7i6): reduce acb-evolver CPU request from 500m to 250m
This frees up 500m CPU capacity (2 pods × 250m reduction) to allow
pending ACB pods to schedule on apexalgo-iad cluster.

Related: bf-7i6
Bead-Id: bf-5hc
2026-06-27 09:05:19 -04:00
jedarden
d40afad625 docs(bf-4dy): add match pipeline verification report
- Document complete match pipeline verification
- Identify cluster capacity constraints blocking operation
- Matchmaker, workers, index-builder all Pending (unschedulable)
- One node NotReady, one node at capacity
- R2 credentials corrupted (secondary issue)
- No matches can be observed running

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-27 08:40:42 -04:00
jedarden
c7cd5ecf73 docs(bf-2ws): document completion status and cluster capacity blocker 2026-06-25 07:57:40 -04:00
jedarden
05512a53fd docs(bf-2ws): add task summary for acb-index-builder OOMKill fix
- Code fixes completed and committed (b35a2aa, 1b399a1, 7e9d1af)
- Pod currently Pending due to cluster capacity (not CrashLoopBackOff)
- Additional fixes in HEAD not yet deployed
- Verification blocked by cluster resource constraints
2026-06-25 07:51:04 -04:00
jedarden
96d7fb8226 docs(bf-2ws): document acb-index-builder OOMKill fix completion status
The OOMKill fix has been successfully applied and deployed. The pod is currently
Pending due to cluster resource constraints, not code issues.

Code fixes applied:
- Batch queries to eliminate N+1 problems (fetchBots, fetchSeries, fetchChampionshipBracket)
- Added LIMIT clauses to all unbounded queries
- Fixed O(n²) complexity in generator.go lookup maps

Next steps: Scale up iad-acb cluster resources to schedule the fixed pod.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-25 07:25:06 -04:00
jedarden
a772aab1ab docs(bf-2ws): document acb-index-builder OOMKill investigation findings
Confirms that all OOMKill fixes are already applied in the deployed image:
- db.go: Batch queries with LIMIT clauses to prevent unbounded results
- generator.go: O(1) lookup maps instead of O(n²) iteration
- main.go: Panic recovery mechanism for silent crashes

Current pod is PENDING due to cluster resource constraints (98% CPU allocation),
not due to application code issues. Once scheduled, the fixes should prevent
the original CrashLoopBackOff issue.
2026-06-25 07:03:07 -04:00
jedarden
f665ce0d04 docs(bf-2ws): add notes on acb-index-builder OOMKill fix 2026-06-25 06:55:15 -04:00
jedarden
1b399a1e55 fix(db): reduce query LIMITs and fix O(n²) complexity to prevent OOMKill
acb-index-builder has been in CrashLoopBackOff for 45 days with silent crashes
after "Copied web assets to output directory". Investigation revealed O(n²) N+1
query loops causing unbounded memory growth and OOMKill.

Changes:
- fetchSeries: batch games query (1000 queries → 1 query) with LIMIT 10000
- fetchChampionshipBracket: batch games query (500 queries → 1 query) with LIMIT 64
- fetchSeasonSnapshots: reduce LIMIT from 10000 to 500
- fetchLineage: reduce LIMIT from 10000 to 1000
- Add strings import for strings.Join in batch queries

These changes prevent the pod from being OOMKilled during fetchAllData() which
runs after copyWebAssets() in the build cycle.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-25 06:53:54 -04:00
jedarden
7e9d1af69c fix(db): reduce query LIMITs and fix O(n²) complexity to prevent OOMKill
- Reduce fetchBots LIMIT from 10000 to 2000
- Reduce fetchRatingHistory LIMIT from 10000 to 5000
- Reduce fetchFeedback LIMIT from 5000 to 1000
- Fix O(n²) participant name lookup in generateBotProfiles by using botNameMap
- Add panic recovery in runBuildCycle to log panics via slog before crashing
- Add R2/B2 client helper functions in s3.go

This fixes acb-index-builder CrashLoopBackOff caused by OOMKill after
web asset copy. The pod was silently crashing during fetchAllData()
due to unbounded query results consuming all memory.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-25 06:43:50 -04:00
jedarden
be9a070fbb fix(db): add LIMIT to bot match stats query to prevent OOMKill
The bot match stats query was introduced in b35a2aa to fix an N+1 query
problem, but it was unbounded and could return an unlimited number of rows.
With many bots in the database, this query could consume excessive memory
and cause OOMKill, resulting in silent crashes after 'Copied web assets'.

Add LIMIT 20000 to prevent unbounded result sets while supporting large
bot populations (the main bots query already limits to 10000 bots).

This fix continues the pattern of adding LIMITs to prevent OOMKill crashes
in acb-index-builder.

Fixes bead bf-2ws: acb-index-builder CrashLoopBackOff investigation
2026-06-25 06:29:12 -04:00
jedarden
b35a2aade0 fix(db): eliminate O(n²) N+1 query loop in fetchBots to prevent OOMKill
The previous implementation called getBotMatchStats for each bot in a loop,
causing 10,000+ separate database queries when there are many bots. This N+1
query problem caused the pod to exceed memory limits and get OOMKilled,
resulting in CrashLoopBackOff.

Replaced with a single batch query that fetches match stats for all bots at
once, then maps the results to each bot. This reduces database round trips
from O(n) to O(1).

Fixes bead bf-2ws: acb-index-builder CrashLoopBackOff (silent crash after web asset copy)
2026-06-25 06:04:51 -04:00
jedarden
c1cfcded23 fix(k8s): update acb-index-builder to latest image with OOMKill fixes
The pod was CrashLoopBackOff for 45 days because it was running an outdated
image without the LIMIT clause fixes added in June. Updated to the latest
image digest which includes:
- LIMIT on fetchSeriesGames query (ca48b60)
- LIMIT on fetchRecentMatchIds query (68b7864)
- O(n²) iteration fix in generateBotProfiles (7befe51)
- Other OOMKill prevention fixes

This should resolve the silent crash after web asset copy.
2026-06-25 05:44:29 -04:00
jedarden
ca48b60434 fix(db): add LIMIT to fetchSeriesGames query to prevent OOMKill
The fetchSeriesGames function was querying all games for a series without a limit.
With up to 1000 series being fetched, and potentially many games per series,
this could return an unbounded number of rows and cause OOMKill.

A typical series has 3-7 games (best-of-5 or best-of-7), so LIMIT 100 is
more than sufficient to handle edge cases while preventing memory exhaustion.

Fixes acb-index-builder CrashLoopBackOff caused by OOMKill after web asset copy.
2026-06-25 01:46:54 -04:00
jedarden
68b786416a fix(db): add LIMIT to fetchRecentMatchIds query to prevent OOMKill
The query in fetchRecentMatchIDs was fetching all completed matches from
the last 24 hours without a LIMIT clause. In a high-traffic environment
with thousands of matches per day, this would cause the pod to run out
of memory and be OOMKilled.

This fix adds LIMIT 5000 to cap the number of recent matches fetched,
preventing unbounded memory growth while still providing sufficient
data for warm asset bundling.

Fixes acb-index-builder CrashLoopBackOff (4713 restarts over 45 days).
2026-06-25 01:40:24 -04:00
jedarden
7befe516bf fix(db): eliminate O(n²) iteration in generateBotProfiles
The generateBotProfiles function had two nested loops that caused O(n²) memory usage:
- Iterating through all rating history entries (10,000) for each bot (10,000) = 100M iterations
- Iterating through all matches (1,000) for each bot (10,000) = 10M iterations

This caused acb-index-builder to run out of memory and get OOMKilled during the build cycle.

Fixed by pre-building lookup maps (O(n) build + O(1) lookup):
- historyMap[botID] -> []RatingHistoryEntry
- matchMap[botID] -> []MatchSummary

Reduces complexity from O(bots × matches) to O(matches + bots) for lookups.

Resolves acb-index-builder CrashLoopBackOff after 45 days of failure.
2026-06-25 01:29:26 -04:00
jedarden
be7588434d notes(bf-2ws): document acb-index-builder OOMKill fix and investigation
- Identified root cause: pod was running 45-day-old image without LIMIT fixes
- Found recent commits (79ca6c0, cdf133d, 4554bed) that added LIMIT clauses
- Triggered acb-build workflow to deploy fixes
- Workflow acb-build-manual-nv552 now building
- Waiting for deployment to verify CrashLoopBackOff is resolved
2026-06-25 01:29:26 -04:00
jedarden
4111970996 fix(db): add LIMIT to unbounded queries causing OOMKill
- Add LIMIT 10000 to fetchSeasonSnapshots (season_snapshots per season)
- Add LIMIT 500 to fetchChampionshipBracket (series per season bracket)

These queries were called in a loop for each season without LIMITs,
causing acb-index-builder to be OOMKilled with 512Mi memory limit.

Fixes OOMKill after web asset copy in build cycle.
2026-06-25 01:29:26 -04:00
jedarden
0449606ac7 fix(db): add LIMIT to pair frequency query to prevent OOMKill in acb-index-builder
The fetchOpenPredictions function had an unbounded query building a pair
frequency map for rivalry detection. With thousands of bots and matches,
this could return tens of thousands of rows and cause OOMKill.

- Add ORDER BY COUNT(*) DESC to prioritize most common pairings
- Add LIMIT 1000 - sufficient to detect rivalries (pairs with >= 3 matches)

This fixes the 45-day CrashLoopBackOff with 4700+ restarts.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-25 01:29:26 -04:00
jedarden
bbe5f45ac1 fix(db): add LIMITs to unbounded queries in fetchEvolutionMeta and fetchLineage
- Add LIMIT 100 to island populations query (fetchEvolutionMeta)
- Add LIMIT 10000 to lineage programs query (fetchLineage)

These queries had no row limits, causing OOMKill when the programs table
grew large. The pod crashed silently after "Copied web assets" because
Go panics and OOMKills exit without logging to slog.

Fixes acb-index-builder CrashLoopBackOff (4700+ restarts, 45 days).
2026-06-25 01:29:26 -04:00
jedarden
80c39a3f2a fix(db): add LIMIT clauses to unbounded queries causing OOMKill
- fetchSeries: LIMIT 1000 (was fetching all series)
- fetchPredictorStats: LIMIT 1000 (was fetching all predictors)
- fetchMaps: LIMIT 1000 (was fetching all maps)
- fetchSeasons: LIMIT 100 (was fetching all seasons)

Fixes acb-index-builder CrashLoopBackOff caused by silent OOMKill
after 'Copied web assets' log line during fetchAllData.
2026-06-25 01:29:26 -04:00
jedarden
941f8bd2c9 fix(db): add LIMITs to unbounded queries to prevent OOM
- Add LIMIT 1000 to fetchChampionshipBracket (was unbounded)
- Reduce fetchSeries from LIMIT 5000 to LIMIT 1000
- Reduce fetchLineage from LIMIT 50000 to LIMIT 10000
- Reduce fetchFeedback from LIMIT 5000 to LIMIT 1000
- Reduce fetchRatingHistory from LIMIT 10000 to LIMIT 5000

The acb-index-builder pod has been in CrashLoopBackOff with OOMKill
(exit code 137) for 45 days with 4713 restarts. These unbounded queries
were loading too much data into memory, causing the kernel to kill the
process before any logs could be written.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-25 00:38:55 -04:00
jedarden
8736098423 fix(db): add LIMITs to unbounded queries to prevent OOM
Added LIMIT clauses to 4 unbounded queries that were causing
acb-index-builder to crash with OOMKill after copying web assets:

- fetchPredictorStats: LIMIT 100 (was loading all predictor stats)
- fetchMaps: LIMIT 500 (was loading all maps)
- fetchSeasonSnapshots: LIMIT 1000 (was loading all season snapshots)
- fetchSeasons: LIMIT 100 (was loading all seasons)

These queries had ORDER BY but no LIMIT, causing them to load
massive datasets into memory on each build cycle, leading to
container OOM after the web asset copy phase.

Fixes bead bf-2ws
2026-06-25 00:27:21 -04:00
jedarden
1832ff439b fix(db): add LIMITs to unbounded queries to prevent OOM
- Add LIMIT 50000 to fetchLineage (evolution programs table)
- Add LIMIT 10000 to fetchBots
- Add LIMIT 5000 to fetchSeries

These queries had no bounds and could grow arbitrarily large,
causing acb-index-builder to OOM during build cycles.
The lineage table in particular grows unbounded with evolution.

Fixes CrashLoopBackOff that has persisted for 45 days.
2026-06-25 00:23:48 -04:00