From decae849c7c2adb096f6e4ea8a6c3a80cec1bcbb Mon Sep 17 00:00:00 2001 From: jedarden Date: Mon, 23 Mar 2026 21:46:23 -0400 Subject: [PATCH] Add research on LLM-driven bot evolution systems Covers FunSearch, AlphaEvolve (+ open-source clones OpenEvolve, ShinkaEvolve), ELM/OpenELM, AlphaCode, Voyager, LLM-PSRO, CATArena, and sandboxing options. Recommends FunSearch island model + LLM-PSRO Nash equilibrium selection for the evolution pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/research/llm-bot-evolution.md | 296 +++++++++++++++++++++++++++++ 1 file changed, 296 insertions(+) create mode 100644 docs/research/llm-bot-evolution.md diff --git a/docs/research/llm-bot-evolution.md b/docs/research/llm-bot-evolution.md new file mode 100644 index 0000000..ab1d658 --- /dev/null +++ b/docs/research/llm-bot-evolution.md @@ -0,0 +1,296 @@ +# LLM-Driven Bot Evolution — Research + +## Overview + +Survey of real systems where LLMs iteratively generate, evaluate, and evolve +code. Focus on approaches applicable to evolving game bot strategies. + +--- + +## 1. FunSearch (DeepMind, 2023) + +**Paper:** [Nature](https://www.nature.com/articles/s41586-023-06924-6) +**Code:** [github.com/google-deepmind/funsearch](https://github.com/google-deepmind/funsearch) + +The cleanest template for "evolve code via LLM." + +**Evolutionary loop:** + +1. User provides an `evaluate` function (fitness scorer) and a trivial seed + implementation of the function to evolve. +2. **Programs Database** stores all scored programs using an **island model** — + multiple independent populations evolving in parallel to maintain diversity. + Within each island, programs are clustered by score signature. Sampling + favors high-scoring clusters; within a cluster, favors shorter programs. +3. **Samplers** pull 2–3 high-scoring programs from the database, combine into + a prompt, and query a pre-trained LLM to generate a new candidate function. +4. **Evaluators** execute the candidate against test inputs and score it. If + correct, it enters the database. +5. Repeat. System runs 15 samplers and 150 CPU evaluators in parallel. + +**Results:** Discovered new constructions for the cap set problem (open math) +and beat human heuristics for online bin-packing. + +**Relevance:** The evaluator maps directly to "run the bot in the game engine, +score the match." Island model prevents convergence to one strategy. + +--- + +## 2. AlphaEvolve (DeepMind, 2025) — FunSearch's Successor + +**Paper:** [arxiv.org/abs/2506.13131](https://arxiv.org/abs/2506.13131) + +Evolves entire codebases (not just single functions). Uses an **ensemble** of +Gemini Flash (exploration/breadth) and Gemini Pro (exploitation/depth). + +**Results:** Recovered 0.7% of Google's worldwide compute by optimizing data +center scheduling. Found new matrix multiplication algorithms beating 1969 +Strassen result. + +**Open-source implementations:** + +| Project | URL | Notes | +|---------|-----|-------| +| **OpenEvolve** | [github.com/algorithmicsuperintelligence/openevolve](https://github.com/algorithmicsuperintelligence/openevolve) | `pip install openevolve`. Full pipeline: prompt sampler, LLM ensemble, evaluator pool, program database with MAP-Elites + island model | +| **ShinkaEvolve** (Sakana AI) | [github.com/SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) | Apache 2.0. Adds novelty rejection-sampling, bandit-based LLM selection. Won ICFP 2025 Programming Contest. Found SOTA for circle packing in ~150 evaluations | +| **OpenAlpha_Evolve** | [github.com/shyamsaktawat/OpenAlpha_Evolve](https://github.com/shyamsaktawat/OpenAlpha_Evolve) | Community implementation | + +--- + +## 3. ELM / OpenELM (Lehman et al., OpenAI, 2022) + +**Paper:** [arxiv.org/pdf/2206.08896](https://arxiv.org/pdf/2206.08896) +**Code:** [github.com/CarperAI/OpenELM](https://github.com/CarperAI/OpenELM) + +**Key insight:** Treats the LLM as a **diff/mutation operator**, not a +from-scratch generator. Uses commit-message-style prompts so the model +understands what kind of change is being requested. + +**Architecture:** + +1. **LLM mutation operator** — generates code diffs, not complete programs +2. **MAP-Elites outer loop** — maintains a grid of niches spanning user-defined + behavior dimensions. Each niche holds the best-performing individual. New + candidates replace niche inhabitants only if they score higher. +3. **LLM fine-tuning on successful mutations** — model updated based on which + mutations worked, closing the loop. + +**Results:** Generated hundreds of thousands of functional Python programs +producing working robots in the Sodarace domain — a domain the LLM had never +seen in training. + +**Relevance:** MAP-Elites ensures diversity of strategies (not just one dominant +approach). The diff-based mutation is practical for evolving bot code — smaller +changes, faster iteration. + +--- + +## 4. AlphaCode / AlphaCode 2 (DeepMind) + +**AlphaCode:** [science.org/doi/10.1126/science.abq1158](https://www.science.org/doi/10.1126/science.abq1158) +**AlphaCode 2:** [Technical Report](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf) + +Not evolutionary — purely generative with massive oversampling and filtering: + +1. Generate ~1 million candidate solutions per problem +2. Filter by executing against example test cases (eliminates ~99%) +3. Cluster remaining by behavioral similarity (run on synthetic inputs, group + programs producing identical outputs) +4. Select one representative per cluster, rank by scoring model, submit top 10 + +**AlphaCode 2:** 85th percentile on Codeforces (vs ~50th for v1). Uses Gemini +Pro with dedicated scoring/reranking model. + +**Relevance:** The "generate many, filter by execution, cluster by behavior" +pattern is useful for the initial seeding phase — generate many candidate bots, +test them all, keep the diverse winners. + +--- + +## 5. Voyager (NVIDIA / MineDojo, 2023) + +**Paper:** [arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291) +**Code:** [github.com/MineDojo/Voyager](https://github.com/MineDojo/Voyager) + +LLM agent that writes its own code in Minecraft. + +**Three components:** + +1. **Automatic Curriculum** — generates increasingly difficult objectives +2. **Skill Library** — persistent, growing collection of verified code snippets. + New tasks retrieve relevant skills by embedding similarity. Successful new + skills get added. +3. **Iterative Prompting with Self-Verification:** + - GPT-4 generates code for a task + - Code executes in environment + - Environment feedback + errors collected + - Self-verification module (also GPT-4) checks completion + - If not complete, feedback fed back and code refined + - Only verified-successful skills enter the library + +**Results:** 3.3x more unique items, 15.3x faster tech-tree progression vs +prior SOTA. Skills transfer to new worlds. + +**Relevance:** The skill library pattern — decomposing strategy into reusable +verified components — could apply to bot strategies (e.g., verified pathfinding, +verified formation combat, verified scouting behaviors composed into full bots). + +--- + +## 6. Game-Bot-Specific Systems + +### LLM-PSRO (IJCAI 2025) — Most Directly Relevant + +**Paper:** [ijcai.org/proceedings/2025/1249](https://www.ijcai.org/proceedings/2025/1249) + +The most directly applicable system. Uses Policy Space Response Oracle: + +1. Start with a population of bots (hand-written or LLM-generated) +2. Run round-robin tournaments → build payoff matrix +3. Compute **Nash equilibrium** mixture over the current population +4. Prompt the LLM to generate a new bot that **beats the Nash mixture**, + providing the losing bot's code and match results as context +5. Add new bot to population +6. Repeat + +**Why Nash matters:** You can only add bots that improve the population's +game-theoretic profile. This is mathematically principled regression prevention +— the new bot must beat the optimal mixed strategy, not just one opponent. + +### CATArena (2025) + +**Paper:** [arxiv.org/abs/2510.26852](https://arxiv.org/abs/2510.26852) +**Code:** [github.com/AGI-Eval-Official/CATArena](https://github.com/AGI-Eval-Official/CATArena) + +LLM code agents play Gomoku, Texas Hold'em, Chess, Bridge. Agents refine +through **self-reflection** (analyzing own losses) and **peer-learning** +(reading opponent code). + +**Key finding:** Evolutionary potential doesn't correlate with initial +proficiency — some weaker initial agents evolve faster. + +### AlphaCodium (CodiumAI / Qodo) + +**Paper:** [arxiv.org/abs/2401.08500](https://arxiv.org/abs/2401.08500) +**Code:** [github.com/Codium-ai/AlphaCodium](https://github.com/Codium-ai/AlphaCodium) + +Two-phase iterative flow: pre-processing (self-reflection, test reasoning) +then generate-execute-refine against tests. Boosted GPT-4 from 19% to 44% +on CodeContests with only 15–20 LLM calls per solution. + +### STOP — Self-Taught Optimizer (Microsoft Research, COLM 2024) + +**Paper:** [arxiv.org/abs/2310.02304](https://arxiv.org/abs/2310.02304) +**Code:** [github.com/microsoft/stop](https://github.com/microsoft/stop) + +A seed "improver" program that calls GPT-4 to improve code, then is run on +itself to improve the improver. The self-improved improver discovers strategies +like beam search, genetic algorithms, and simulated annealing — on its own. + +--- + +## 7. Evaluation / Selection Mechanisms + +| System | Selection Mechanism | +|--------|---------------------| +| **FunSearch / AlphaEvolve** | Island model with score-based cluster sampling. Higher scores sampled more. Islands prevent premature convergence. | +| **OpenELM** | MAP-Elites quality-diversity grid. New candidate replaces niche inhabitant only if it scores higher. Maintains diversity across behavior dimensions. | +| **AlphaCode** | Generate millions, filter by execution, cluster by behavior, rank by scoring model. Pure oversampling, no evolution. | +| **LLM-PSRO** | Nash equilibrium over population. New bots must beat the Nash mixture. Theoretically grounded regression prevention. | +| **CATArena** | Dual-metric: static proficiency vs evolutionary potential. Global win rate preferred over Elo for stability. | +| **ShinkaEvolve** | Parent sampling balancing exploration/exploitation + novelty rejection-sampling (rejects candidates too similar to existing population). | + +--- + +## 8. Code Sandboxing for LLM-Generated Code + +| Solution | Isolation Level | Overhead | Used By | +|----------|----------------|----------|---------| +| **Firecracker MicroVMs** | Strongest (own kernel) | <200ms boot, <5 MiB/VM | AWS Lambda, E2B ([e2b.dev](https://e2b.dev/)), ~50% Fortune 500 | +| **gVisor** | Strong (userspace kernel) | Low | GKE Sandbox, [k8s agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) | +| **nsjail** | Moderate (namespaces + seccomp) | Minimal | FunSearch evaluators (150 nodes) | +| **WASM** | Moderate (no fs/network) | Near-native | Constrained execution environments | + +**Recommendation for bot evolution:** nsjail for high-throughput evaluation +(you control the game engine); Firecracker/E2B if executing fully arbitrary +LLM code with network/filesystem access. + +**Reference:** [github.com/restyler/awesome-sandbox](https://github.com/restyler/awesome-sandbox) + +--- + +## 9. Recommended Architecture for AI Code Battle + +Based on the systems above, the **FunSearch/AlphaEvolve island model + +LLM-PSRO game-theoretic selection** combination is the best fit: + +``` +┌─────────────────────────────────────────────────────┐ +│ Programs Database │ +│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ +│ │ Island 1 │ │ Island 2 │ │ Island 3 │ │ Island 4 │ │ +│ │ (Python) │ │ (Go) │ │ (Rust) │ │ (mixed) │ │ +│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ +└───────────────────────┬─────────────────────────────┘ + │ + sample 2-3 parents + match replays + │ + ┌─────────▼──────────┐ + │ Prompt Builder │ + │ • Parent code │ + │ • Recent losses │ + │ • Replay analysis │ + │ • "Beat this meta" │ + └─────────┬──────────┘ + │ + ┌─────────▼──────────┐ + │ LLM Ensemble │ + │ • Fast model │ + │ (exploration) │ + │ • Strong model │ + │ (exploitation) │ + └─────────┬──────────┘ + │ + ┌─────────▼──────────┐ + │ Build & Validate │ + │ • Compile/lint │ + │ • Schema test │ + │ • Sandbox execute │ + └─────────┬──────────┘ + │ + ┌─────────▼──────────┐ + │ Tournament Gate │ + │ • Play vs current │ + │ population │ + │ • Must beat Nash │ + │ mixture (PSRO) │ + └─────────┬──────────┘ + │ + promote if better + │ + ┌─────────▼──────────┐ + │ Deploy as │ + │ Container │ + │ • Build image │ + │ • Register bot │ + │ • Enter ladder │ + └────────────────────┘ +``` + +### References + +- [FunSearch — GitHub](https://github.com/google-deepmind/funsearch) +- [OpenEvolve — GitHub](https://github.com/algorithmicsuperintelligence/openevolve) +- [ShinkaEvolve — GitHub](https://github.com/SakanaAI/ShinkaEvolve) +- [OpenELM — GitHub](https://github.com/CarperAI/OpenELM) +- [AlphaCode Dataset — GitHub](https://github.com/google-deepmind/code_contests) +- [Voyager — GitHub](https://github.com/MineDojo/Voyager) +- [LLM-PSRO — IJCAI 2025](https://www.ijcai.org/proceedings/2025/1249) +- [CATArena — GitHub](https://github.com/AGI-Eval-Official/CATArena) +- [AlphaCodium — GitHub](https://github.com/Codium-ai/AlphaCodium) +- [STOP — GitHub](https://github.com/microsoft/stop) +- [Awesome LLM Game Agent Papers](https://github.com/git-disl/awesome-LLM-game-agent-papers) +- [Awesome Self-Evolving Agents](https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents) +- [nsjail — GitHub](https://github.com/google/nsjail) +- [E2B Sandbox](https://e2b.dev/) +- [Awesome Sandbox](https://github.com/restyler/awesome-sandbox)