# LLM-Driven Bot Evolution — Research ## Overview Survey of real systems where LLMs iteratively generate, evaluate, and evolve code. Focus on approaches applicable to evolving game bot strategies. --- ## 1. FunSearch (DeepMind, 2023) **Paper:** [Nature](https://www.nature.com/articles/s41586-023-06924-6) **Code:** [github.com/google-deepmind/funsearch](https://github.com/google-deepmind/funsearch) The cleanest template for "evolve code via LLM." **Evolutionary loop:** 1. User provides an `evaluate` function (fitness scorer) and a trivial seed implementation of the function to evolve. 2. **Programs Database** stores all scored programs using an **island model** — multiple independent populations evolving in parallel to maintain diversity. Within each island, programs are clustered by score signature. Sampling favors high-scoring clusters; within a cluster, favors shorter programs. 3. **Samplers** pull 2–3 high-scoring programs from the database, combine into a prompt, and query a pre-trained LLM to generate a new candidate function. 4. **Evaluators** execute the candidate against test inputs and score it. If correct, it enters the database. 5. Repeat. System runs 15 samplers and 150 CPU evaluators in parallel. **Results:** Discovered new constructions for the cap set problem (open math) and beat human heuristics for online bin-packing. **Relevance:** The evaluator maps directly to "run the bot in the game engine, score the match." Island model prevents convergence to one strategy. --- ## 2. AlphaEvolve (DeepMind, 2025) — FunSearch's Successor **Paper:** [arxiv.org/abs/2506.13131](https://arxiv.org/abs/2506.13131) Evolves entire codebases (not just single functions). Uses an **ensemble** of Gemini Flash (exploration/breadth) and Gemini Pro (exploitation/depth). **Results:** Recovered 0.7% of Google's worldwide compute by optimizing data center scheduling. Found new matrix multiplication algorithms beating 1969 Strassen result. **Open-source implementations:** | Project | URL | Notes | |---------|-----|-------| | **OpenEvolve** | [github.com/algorithmicsuperintelligence/openevolve](https://github.com/algorithmicsuperintelligence/openevolve) | `pip install openevolve`. Full pipeline: prompt sampler, LLM ensemble, evaluator pool, program database with MAP-Elites + island model | | **ShinkaEvolve** (Sakana AI) | [github.com/SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) | Apache 2.0. Adds novelty rejection-sampling, bandit-based LLM selection. Won ICFP 2025 Programming Contest. Found SOTA for circle packing in ~150 evaluations | | **OpenAlpha_Evolve** | [github.com/shyamsaktawat/OpenAlpha_Evolve](https://github.com/shyamsaktawat/OpenAlpha_Evolve) | Community implementation | --- ## 3. ELM / OpenELM (Lehman et al., OpenAI, 2022) **Paper:** [arxiv.org/pdf/2206.08896](https://arxiv.org/pdf/2206.08896) **Code:** [github.com/CarperAI/OpenELM](https://github.com/CarperAI/OpenELM) **Key insight:** Treats the LLM as a **diff/mutation operator**, not a from-scratch generator. Uses commit-message-style prompts so the model understands what kind of change is being requested. **Architecture:** 1. **LLM mutation operator** — generates code diffs, not complete programs 2. **MAP-Elites outer loop** — maintains a grid of niches spanning user-defined behavior dimensions. Each niche holds the best-performing individual. New candidates replace niche inhabitants only if they score higher. 3. **LLM fine-tuning on successful mutations** — model updated based on which mutations worked, closing the loop. **Results:** Generated hundreds of thousands of functional Python programs producing working robots in the Sodarace domain — a domain the LLM had never seen in training. **Relevance:** MAP-Elites ensures diversity of strategies (not just one dominant approach). The diff-based mutation is practical for evolving bot code — smaller changes, faster iteration. --- ## 4. AlphaCode / AlphaCode 2 (DeepMind) **AlphaCode:** [science.org/doi/10.1126/science.abq1158](https://www.science.org/doi/10.1126/science.abq1158) **AlphaCode 2:** [Technical Report](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf) Not evolutionary — purely generative with massive oversampling and filtering: 1. Generate ~1 million candidate solutions per problem 2. Filter by executing against example test cases (eliminates ~99%) 3. Cluster remaining by behavioral similarity (run on synthetic inputs, group programs producing identical outputs) 4. Select one representative per cluster, rank by scoring model, submit top 10 **AlphaCode 2:** 85th percentile on Codeforces (vs ~50th for v1). Uses Gemini Pro with dedicated scoring/reranking model. **Relevance:** The "generate many, filter by execution, cluster by behavior" pattern is useful for the initial seeding phase — generate many candidate bots, test them all, keep the diverse winners. --- ## 5. Voyager (NVIDIA / MineDojo, 2023) **Paper:** [arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291) **Code:** [github.com/MineDojo/Voyager](https://github.com/MineDojo/Voyager) LLM agent that writes its own code in Minecraft. **Three components:** 1. **Automatic Curriculum** — generates increasingly difficult objectives 2. **Skill Library** — persistent, growing collection of verified code snippets. New tasks retrieve relevant skills by embedding similarity. Successful new skills get added. 3. **Iterative Prompting with Self-Verification:** - GPT-4 generates code for a task - Code executes in environment - Environment feedback + errors collected - Self-verification module (also GPT-4) checks completion - If not complete, feedback fed back and code refined - Only verified-successful skills enter the library **Results:** 3.3x more unique items, 15.3x faster tech-tree progression vs prior SOTA. Skills transfer to new worlds. **Relevance:** The skill library pattern — decomposing strategy into reusable verified components — could apply to bot strategies (e.g., verified pathfinding, verified formation combat, verified scouting behaviors composed into full bots). --- ## 6. Game-Bot-Specific Systems ### LLM-PSRO (IJCAI 2025) — Most Directly Relevant **Paper:** [ijcai.org/proceedings/2025/1249](https://www.ijcai.org/proceedings/2025/1249) The most directly applicable system. Uses Policy Space Response Oracle: 1. Start with a population of bots (hand-written or LLM-generated) 2. Run round-robin tournaments → build payoff matrix 3. Compute **Nash equilibrium** mixture over the current population 4. Prompt the LLM to generate a new bot that **beats the Nash mixture**, providing the losing bot's code and match results as context 5. Add new bot to population 6. Repeat **Why Nash matters:** You can only add bots that improve the population's game-theoretic profile. This is mathematically principled regression prevention — the new bot must beat the optimal mixed strategy, not just one opponent. ### CATArena (2025) **Paper:** [arxiv.org/abs/2510.26852](https://arxiv.org/abs/2510.26852) **Code:** [github.com/AGI-Eval-Official/CATArena](https://github.com/AGI-Eval-Official/CATArena) LLM code agents play Gomoku, Texas Hold'em, Chess, Bridge. Agents refine through **self-reflection** (analyzing own losses) and **peer-learning** (reading opponent code). **Key finding:** Evolutionary potential doesn't correlate with initial proficiency — some weaker initial agents evolve faster. ### AlphaCodium (CodiumAI / Qodo) **Paper:** [arxiv.org/abs/2401.08500](https://arxiv.org/abs/2401.08500) **Code:** [github.com/Codium-ai/AlphaCodium](https://github.com/Codium-ai/AlphaCodium) Two-phase iterative flow: pre-processing (self-reflection, test reasoning) then generate-execute-refine against tests. Boosted GPT-4 from 19% to 44% on CodeContests with only 15–20 LLM calls per solution. ### STOP — Self-Taught Optimizer (Microsoft Research, COLM 2024) **Paper:** [arxiv.org/abs/2310.02304](https://arxiv.org/abs/2310.02304) **Code:** [github.com/microsoft/stop](https://github.com/microsoft/stop) A seed "improver" program that calls GPT-4 to improve code, then is run on itself to improve the improver. The self-improved improver discovers strategies like beam search, genetic algorithms, and simulated annealing — on its own. --- ## 7. Evaluation / Selection Mechanisms | System | Selection Mechanism | |--------|---------------------| | **FunSearch / AlphaEvolve** | Island model with score-based cluster sampling. Higher scores sampled more. Islands prevent premature convergence. | | **OpenELM** | MAP-Elites quality-diversity grid. New candidate replaces niche inhabitant only if it scores higher. Maintains diversity across behavior dimensions. | | **AlphaCode** | Generate millions, filter by execution, cluster by behavior, rank by scoring model. Pure oversampling, no evolution. | | **LLM-PSRO** | Nash equilibrium over population. New bots must beat the Nash mixture. Theoretically grounded regression prevention. | | **CATArena** | Dual-metric: static proficiency vs evolutionary potential. Global win rate preferred over Elo for stability. | | **ShinkaEvolve** | Parent sampling balancing exploration/exploitation + novelty rejection-sampling (rejects candidates too similar to existing population). | --- ## 8. Code Sandboxing for LLM-Generated Code | Solution | Isolation Level | Overhead | Used By | |----------|----------------|----------|---------| | **Firecracker MicroVMs** | Strongest (own kernel) | <200ms boot, <5 MiB/VM | AWS Lambda, E2B ([e2b.dev](https://e2b.dev/)), ~50% Fortune 500 | | **gVisor** | Strong (userspace kernel) | Low | GKE Sandbox, [k8s agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) | | **nsjail** | Moderate (namespaces + seccomp) | Minimal | FunSearch evaluators (150 nodes) | | **WASM** | Moderate (no fs/network) | Near-native | Constrained execution environments | **Recommendation for bot evolution:** nsjail for high-throughput evaluation (you control the game engine); Firecracker/E2B if executing fully arbitrary LLM code with network/filesystem access. **Reference:** [github.com/restyler/awesome-sandbox](https://github.com/restyler/awesome-sandbox) --- ## 9. Recommended Architecture for AI Code Battle Based on the systems above, the **FunSearch/AlphaEvolve island model + LLM-PSRO game-theoretic selection** combination is the best fit: ``` ┌─────────────────────────────────────────────────────┐ │ Programs Database │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │ Island 1 │ │ Island 2 │ │ Island 3 │ │ Island 4 │ │ │ │ (Python) │ │ (Go) │ │ (Rust) │ │ (mixed) │ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ └───────────────────────┬─────────────────────────────┘ │ sample 2-3 parents + match replays │ ┌─────────▼──────────┐ │ Prompt Builder │ │ • Parent code │ │ • Recent losses │ │ • Replay analysis │ │ • "Beat this meta" │ └─────────┬──────────┘ │ ┌─────────▼──────────┐ │ LLM Ensemble │ │ • Fast model │ │ (exploration) │ │ • Strong model │ │ (exploitation) │ └─────────┬──────────┘ │ ┌─────────▼──────────┐ │ Build & Validate │ │ • Compile/lint │ │ • Schema test │ │ • Sandbox execute │ └─────────┬──────────┘ │ ┌─────────▼──────────┐ │ Tournament Gate │ │ • Play vs current │ │ population │ │ • Must beat Nash │ │ mixture (PSRO) │ └─────────┬──────────┘ │ promote if better │ ┌─────────▼──────────┐ │ Deploy as │ │ Container │ │ • Build image │ │ • Register bot │ │ • Enter ladder │ └────────────────────┘ ``` ### References - [FunSearch — GitHub](https://github.com/google-deepmind/funsearch) - [OpenEvolve — GitHub](https://github.com/algorithmicsuperintelligence/openevolve) - [ShinkaEvolve — GitHub](https://github.com/SakanaAI/ShinkaEvolve) - [OpenELM — GitHub](https://github.com/CarperAI/OpenELM) - [AlphaCode Dataset — GitHub](https://github.com/google-deepmind/code_contests) - [Voyager — GitHub](https://github.com/MineDojo/Voyager) - [LLM-PSRO — IJCAI 2025](https://www.ijcai.org/proceedings/2025/1249) - [CATArena — GitHub](https://github.com/AGI-Eval-Official/CATArena) - [AlphaCodium — GitHub](https://github.com/Codium-ai/AlphaCodium) - [STOP — GitHub](https://github.com/microsoft/stop) - [Awesome LLM Game Agent Papers](https://github.com/git-disl/awesome-LLM-game-agent-papers) - [Awesome Self-Evolving Agents](https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents) - [nsjail — GitHub](https://github.com/google/nsjail) - [E2B Sandbox](https://e2b.dev/) - [Awesome Sandbox](https://github.com/restyler/awesome-sandbox)