jedarden 05512a53fd docs(bf-2ws): add task summary for acb-index-builder OOMKill fix
- Code fixes completed and committed (b35a2aa, 1b399a1, 7e9d1af)
- Pod currently Pending due to cluster capacity (not CrashLoopBackOff)
- Additional fixes in HEAD not yet deployed
- Verification blocked by cluster resource constraints
2026-06-25 07:51:04 -04:00


acb-index-builder OOMKill Fix Task Summary

Task

Fix acb-index-builder CrashLoopBackOff (silent crash after web asset copy)

Root Cause Identified

OOMKill caused by N+1 query problems and unbounded database queries:

  1. fetchBots N+1 query loop: 10,000+ separate database calls for bot match stats
  2. fetchSeries N+1 query loop: 1000+ separate queries for series games
  3. fetchChampionshipBracket N+1 query loop: 500+ separate queries for championship games
  4. Unbounded queries: several queries had no LIMIT clause
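To show why the query count scales with the data, here is a minimal sketch that counts round trips only. `countingStore`, `nPlusOne` and `batched` are hypothetical stand-ins, not code from acb-index-builder, and no real database is involved. With 10,000 bots the per-bot loop issues 10,001 queries, while the batched version issues 2.

```go
package main

import "fmt"

// countingStore is a hypothetical stand-in for the database.
// It only counts how many queries (round trips) are issued.
type countingStore struct {
	queries int
}

// listBots simulates the initial "fetch all bots" query.
func (s *countingStore) listBots(n int) []int {
	s.queries++
	ids := make([]int, n)
	for i := range ids {
		ids[i] = i + 1
	}
	return ids
}

// statsForBot simulates one stats query per bot (the inner query of N+1).
func (s *countingStore) statsForBot(botID int) int {
	s.queries++
	return botID % 7 // placeholder stat value
}

// statsForBots simulates one batched query covering every bot ID.
func (s *countingStore) statsForBots(botIDs []int) map[int]int {
	s.queries++
	out := make(map[int]int, len(botIDs))
	for _, id := range botIDs {
		out[id] = id % 7
	}
	return out
}

// nPlusOne issues 1 query for the list plus 1 query per bot.
func nPlusOne(s *countingStore, n int) {
	for _, id := range s.listBots(n) {
		_ = s.statsForBot(id)
	}
}

// batched issues 1 query for the list plus 1 query for all stats.
func batched(s *countingStore, n int) {
	_ = s.statsForBots(s.listBots(n))
}

func main() {
	a, b := &countingStore{}, &countingStore{}
	nPlusOne(a, 10000)
	batched(b, 10000)
	fmt.Println(a.queries, b.queries) // 10001 2
}
```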

Fixes Applied (committed to codebase)

Commit b35a2aa (DEPLOYED)

  • Fixed N+1 query loop in fetchBots
  • Single batch query for bot match stats
  • Added LIMIT 20000
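A batch fix of this kind usually pairs one aggregated query with an in-memory lookup map, so that per-bot stats are attached without further queries. The sketch below assumes a hypothetical schema (`match_participants`, `bot_id`, `won`) and a hypothetical `botStats` row type; the real query in fetchBots may differ. Only the grouping step runs here, since there is no database.

```go
package main

import "fmt"

// Hypothetical batch query: table and column names are illustrative only.
const botStatsQuery = `
SELECT bot_id, COUNT(*) AS matches, SUM(CASE WHEN won THEN 1 ELSE 0 END) AS wins
FROM match_participants
GROUP BY bot_id
LIMIT 20000`

// botStats is one row returned by botStatsQuery (hypothetical shape).
type botStats struct {
	BotID   string
	Matches int
	Wins    int
}

// indexByBot turns the batch result into a map keyed by bot ID,
// replacing one query per bot with one map lookup per bot.
func indexByBot(rows []botStats) map[string]botStats {
	m := make(map[string]botStats, len(rows))
	for _, r := range rows {
		m[r.BotID] = r
	}
	return m
}

func main() {
	// In a real builder these rows would be scanned from botStatsQuery.
	rows := []botStats{{"alpha", 12, 7}, {"beta", 9, 2}}
	stats := indexByBot(rows)
	for _, id := range []string{"alpha", "beta", "gamma"} {
		s, ok := stats[id] // bots with no matches are absent (zero value)
		fmt.Println(id, s.Matches, s.Wins, ok)
	}
}
```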

Commits 1b399a1, 7e9d1af (code fixed, NOT deployed)

  • Fixed N+1 query loops in fetchSeries and fetchChampionshipBracket
  • Batch queries replacing per-item loops
  • Added or reduced LIMIT clauses on these queries:
    • fetchRatingHistory: LIMIT 5000
    • fetchSeries: LIMIT 1000
    • fetchSeasons: LIMIT 100
    • fetchPredictions: LIMIT 1000
    • fetchMaps: LIMIT 1000
    • series games batch: LIMIT 10000
    • championship games batch: LIMIT 500
    • pair frequency: LIMIT 1000
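A LIMIT bounds memory but can drop rows without any signal. One option, sketched here as an assumption rather than what the builder does, is to request one row more than the cap and log when the cap is hit. `queryLimits`, `limitClause` and `capRows` are hypothetical names; the map simply mirrors the values listed above, and the `seasons` query is illustrative. LIMIT without ORDER BY returns an unspecified subset, so the sketch includes an ORDER BY.

```go
package main

import "fmt"

// queryLimits mirrors the caps listed above; keys are descriptive labels only.
var queryLimits = map[string]int{
	"fetchRatingHistory":       5000,
	"fetchSeries":              1000,
	"fetchSeasons":             100,
	"fetchPredictions":         1000,
	"fetchMaps":                1000,
	"series games batch":       10000,
	"championship games batch": 500,
	"pair frequency":           1000,
}

// limitClause asks for one row more than the cap so truncation is detectable.
func limitClause(limit int) string {
	return fmt.Sprintf(" LIMIT %d", limit+1)
}

// capRows trims rows to limit and reports whether the cap was exceeded.
func capRows[T any](rows []T, limit int) ([]T, bool) {
	if len(rows) > limit {
		return rows[:limit], true
	}
	return rows, false
}

func main() {
	limit := queryLimits["fetchSeasons"]
	fmt.Println("SELECT id FROM seasons ORDER BY id" + limitClause(limit))
	rows := make([]int, limit+1) // pretend the database returned limit+1 rows
	kept, truncated := capRows(rows, limit)
	fmt.Println(len(kept), truncated) // 100 true
}
```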

main.go panic recovery (lines 165-172)

  • A deferred recover() catches panics and logs them via slog
  • Prevents silent crashes in which the panic output on stderr is lost

Current Status

Deployment State

  • Deployed image: ronaldraygun/acb-index-builder:b35a2aa
  • Code HEAD: 96d7fb8 (includes ALL fixes)
  • Gap: fixes committed after b35a2aa (including 1b399a1 and 7e9d1af) are in HEAD but not yet deployed

Cluster Status

  • Pod: acb-index-builder-7fc99df58b-5zjpp
  • Status: Pending (not CrashLoopBackOff)
  • Reason: Cluster overcommitted (94% memory, 98% CPU)
  • Blocker: with read-only cluster access, resources cannot be freed and a new image cannot be deployed

Acceptance Criteria Status

| Criterion | Status |
|---|---|
| acb-index-builder runs through 2+ build cycles | Blocked (cluster capacity) |
| "Build cycle completed" in logs | Blocked (pod Pending) |
| No CrashLoopBackOff | Not applicable (pod Pending) |

Conclusion

  • Code fixes: Complete and committed
  • Deployment: Partial (only first fix deployed)
  • Verification: Blocked (cluster capacity constraints)

The root cause has been identified and fixed in the codebase. Full deployment and verification require:

  1. Building a new image from HEAD (96d7fb8)
  2. Freeing cluster resources or scaling the cluster
  3. Deploying the image and monitoring the pod for 2+ build cycles