Add research on LLM-driven bot evolution systems

Covers FunSearch, AlphaEvolve (+ open-source clones OpenEvolve, ShinkaEvolve), ELM/OpenELM, AlphaCode, Voyager, LLM-PSRO, CATArena, and sandboxing options. Recommends FunSearch island model + LLM-PSRO Nash equilibrium selection for the evolution pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:46:23 -04:00 · 2026-03-23 21:46:23 -04:00 · decae849c7
commit decae849c7
parent d7cf4625e2
1 changed files with 296 additions and 0 deletions
--- a/docs/research/llm-bot-evolution.md
+++ b/docs/research/llm-bot-evolution.md
@ -0,0 +1,296 @@
+# LLM-Driven Bot Evolution — Research
+
+## Overview
+
+Survey of real systems where LLMs iteratively generate, evaluate, and evolve
+code. Focus on approaches applicable to evolving game bot strategies.
+
+---
+
+## 1. FunSearch (DeepMind, 2023)
+
+**Paper:** [Nature](https://www.nature.com/articles/s41586-023-06924-6)
+**Code:** [github.com/google-deepmind/funsearch](https://github.com/google-deepmind/funsearch)
+
+The cleanest template for "evolve code via LLM."
+
+**Evolutionary loop:**
+
+1. User provides an `evaluate` function (fitness scorer) and a trivial seed
+   implementation of the function to evolve.
+2. **Programs Database** stores all scored programs using an **island model** —
+   multiple independent populations evolving in parallel to maintain diversity.
+   Within each island, programs are clustered by score signature. Sampling
+   favors high-scoring clusters; within a cluster, favors shorter programs.
+3. **Samplers** pull 2–3 high-scoring programs from the database, combine into
+   a prompt, and query a pre-trained LLM to generate a new candidate function.
+4. **Evaluators** execute the candidate against test inputs and score it. If
+   correct, it enters the database.
+5. Repeat. System runs 15 samplers and 150 CPU evaluators in parallel.
+
+**Results:** Discovered new constructions for the cap set problem (open math)
+and beat human heuristics for online bin-packing.
+
+**Relevance:** The evaluator maps directly to "run the bot in the game engine,
+score the match." Island model prevents convergence to one strategy.
+
+---
+
+## 2. AlphaEvolve (DeepMind, 2025) — FunSearch's Successor
+
+**Paper:** [arxiv.org/abs/2506.13131](https://arxiv.org/abs/2506.13131)
+
+Evolves entire codebases (not just single functions). Uses an **ensemble** of
+Gemini Flash (exploration/breadth) and Gemini Pro (exploitation/depth).
+
+**Results:** Recovered 0.7% of Google's worldwide compute by optimizing data
+center scheduling. Found new matrix multiplication algorithms beating 1969
+Strassen result.
+
+**Open-source implementations:**
+
+| Project | URL | Notes |
+|---------|-----|-------|
+| **OpenEvolve** | [github.com/algorithmicsuperintelligence/openevolve](https://github.com/algorithmicsuperintelligence/openevolve) | `pip install openevolve`. Full pipeline: prompt sampler, LLM ensemble, evaluator pool, program database with MAP-Elites + island model |
+| **ShinkaEvolve** (Sakana AI) | [github.com/SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) | Apache 2.0. Adds novelty rejection-sampling, bandit-based LLM selection. Won ICFP 2025 Programming Contest. Found SOTA for circle packing in ~150 evaluations |
+| **OpenAlpha_Evolve** | [github.com/shyamsaktawat/OpenAlpha_Evolve](https://github.com/shyamsaktawat/OpenAlpha_Evolve) | Community implementation |
+
+---
+
+## 3. ELM / OpenELM (Lehman et al., OpenAI, 2022)
+
+**Paper:** [arxiv.org/pdf/2206.08896](https://arxiv.org/pdf/2206.08896)
+**Code:** [github.com/CarperAI/OpenELM](https://github.com/CarperAI/OpenELM)
+
+**Key insight:** Treats the LLM as a **diff/mutation operator**, not a
+from-scratch generator. Uses commit-message-style prompts so the model
+understands what kind of change is being requested.
+
+**Architecture:**
+
+1. **LLM mutation operator** — generates code diffs, not complete programs
+2. **MAP-Elites outer loop** — maintains a grid of niches spanning user-defined
+   behavior dimensions. Each niche holds the best-performing individual. New
+   candidates replace niche inhabitants only if they score higher.
+3. **LLM fine-tuning on successful mutations** — model updated based on which
+   mutations worked, closing the loop.
+
+**Results:** Generated hundreds of thousands of functional Python programs
+producing working robots in the Sodarace domain — a domain the LLM had never
+seen in training.
+
+**Relevance:** MAP-Elites ensures diversity of strategies (not just one dominant
+approach). The diff-based mutation is practical for evolving bot code — smaller
+changes, faster iteration.
+
+---
+
+## 4. AlphaCode / AlphaCode 2 (DeepMind)
+
+**AlphaCode:** [science.org/doi/10.1126/science.abq1158](https://www.science.org/doi/10.1126/science.abq1158)
+**AlphaCode 2:** [Technical Report](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf)
+
+Not evolutionary — purely generative with massive oversampling and filtering:
+
+1. Generate ~1 million candidate solutions per problem
+2. Filter by executing against example test cases (eliminates ~99%)
+3. Cluster remaining by behavioral similarity (run on synthetic inputs, group
+   programs producing identical outputs)
+4. Select one representative per cluster, rank by scoring model, submit top 10
+
+**AlphaCode 2:** 85th percentile on Codeforces (vs ~50th for v1). Uses Gemini
+Pro with dedicated scoring/reranking model.
+
+**Relevance:** The "generate many, filter by execution, cluster by behavior"
+pattern is useful for the initial seeding phase — generate many candidate bots,
+test them all, keep the diverse winners.
+
+---
+
+## 5. Voyager (NVIDIA / MineDojo, 2023)
+
+**Paper:** [arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291)
+**Code:** [github.com/MineDojo/Voyager](https://github.com/MineDojo/Voyager)
+
+LLM agent that writes its own code in Minecraft.
+
+**Three components:**
+
+1. **Automatic Curriculum** — generates increasingly difficult objectives
+2. **Skill Library** — persistent, growing collection of verified code snippets.
+   New tasks retrieve relevant skills by embedding similarity. Successful new
+   skills get added.
+3. **Iterative Prompting with Self-Verification:**
+   - GPT-4 generates code for a task
+   - Code executes in environment
+   - Environment feedback + errors collected
+   - Self-verification module (also GPT-4) checks completion
+   - If not complete, feedback fed back and code refined
+   - Only verified-successful skills enter the library
+
+**Results:** 3.3x more unique items, 15.3x faster tech-tree progression vs
+prior SOTA. Skills transfer to new worlds.
+
+**Relevance:** The skill library pattern — decomposing strategy into reusable
+verified components — could apply to bot strategies (e.g., verified pathfinding,
+verified formation combat, verified scouting behaviors composed into full bots).
+
+---
+
+## 6. Game-Bot-Specific Systems
+
+### LLM-PSRO (IJCAI 2025) — Most Directly Relevant
+
+**Paper:** [ijcai.org/proceedings/2025/1249](https://www.ijcai.org/proceedings/2025/1249)
+
+The most directly applicable system. Uses Policy Space Response Oracle:
+
+1. Start with a population of bots (hand-written or LLM-generated)
+2. Run round-robin tournaments → build payoff matrix
+3. Compute **Nash equilibrium** mixture over the current population
+4. Prompt the LLM to generate a new bot that **beats the Nash mixture**,
+   providing the losing bot's code and match results as context
+5. Add new bot to population
+6. Repeat
+
+**Why Nash matters:** You can only add bots that improve the population's
+game-theoretic profile. This is mathematically principled regression prevention
+— the new bot must beat the optimal mixed strategy, not just one opponent.
+
+### CATArena (2025)
+
+**Paper:** [arxiv.org/abs/2510.26852](https://arxiv.org/abs/2510.26852)
+**Code:** [github.com/AGI-Eval-Official/CATArena](https://github.com/AGI-Eval-Official/CATArena)
+
+LLM code agents play Gomoku, Texas Hold'em, Chess, Bridge. Agents refine
+through **self-reflection** (analyzing own losses) and **peer-learning**
+(reading opponent code).
+
+**Key finding:** Evolutionary potential doesn't correlate with initial
+proficiency — some weaker initial agents evolve faster.
+
+### AlphaCodium (CodiumAI / Qodo)
+
+**Paper:** [arxiv.org/abs/2401.08500](https://arxiv.org/abs/2401.08500)
+**Code:** [github.com/Codium-ai/AlphaCodium](https://github.com/Codium-ai/AlphaCodium)
+
+Two-phase iterative flow: pre-processing (self-reflection, test reasoning)
+then generate-execute-refine against tests. Boosted GPT-4 from 19% to 44%
+on CodeContests with only 15–20 LLM calls per solution.
+
+### STOP — Self-Taught Optimizer (Microsoft Research, COLM 2024)
+
+**Paper:** [arxiv.org/abs/2310.02304](https://arxiv.org/abs/2310.02304)
+**Code:** [github.com/microsoft/stop](https://github.com/microsoft/stop)
+
+A seed "improver" program that calls GPT-4 to improve code, then is run on
+itself to improve the improver. The self-improved improver discovers strategies
+like beam search, genetic algorithms, and simulated annealing — on its own.
+
+---
+
+## 7. Evaluation / Selection Mechanisms
+
+| System | Selection Mechanism |
+|--------|---------------------|
+| **FunSearch / AlphaEvolve** | Island model with score-based cluster sampling. Higher scores sampled more. Islands prevent premature convergence. |
+| **OpenELM** | MAP-Elites quality-diversity grid. New candidate replaces niche inhabitant only if it scores higher. Maintains diversity across behavior dimensions. |
+| **AlphaCode** | Generate millions, filter by execution, cluster by behavior, rank by scoring model. Pure oversampling, no evolution. |
+| **LLM-PSRO** | Nash equilibrium over population. New bots must beat the Nash mixture. Theoretically grounded regression prevention. |
+| **CATArena** | Dual-metric: static proficiency vs evolutionary potential. Global win rate preferred over Elo for stability. |
+| **ShinkaEvolve** | Parent sampling balancing exploration/exploitation + novelty rejection-sampling (rejects candidates too similar to existing population). |
+
+---
+
+## 8. Code Sandboxing for LLM-Generated Code
+
+| Solution | Isolation Level | Overhead | Used By |
+|----------|----------------|----------|---------|
+| **Firecracker MicroVMs** | Strongest (own kernel) | <200ms boot, <5 MiB/VM | AWS Lambda, E2B ([e2b.dev](https://e2b.dev/)), ~50% Fortune 500 |
+| **gVisor** | Strong (userspace kernel) | Low | GKE Sandbox, [k8s agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) |
+| **nsjail** | Moderate (namespaces + seccomp) | Minimal | FunSearch evaluators (150 nodes) |
+| **WASM** | Moderate (no fs/network) | Near-native | Constrained execution environments |
+
+**Recommendation for bot evolution:** nsjail for high-throughput evaluation
+(you control the game engine); Firecracker/E2B if executing fully arbitrary
+LLM code with network/filesystem access.
+
+**Reference:** [github.com/restyler/awesome-sandbox](https://github.com/restyler/awesome-sandbox)
+
+---
+
+## 9. Recommended Architecture for AI Code Battle
+
+Based on the systems above, the **FunSearch/AlphaEvolve island model +
+LLM-PSRO game-theoretic selection** combination is the best fit:
+
+```
+┌─────────────────────────────────────────────────────┐
+│                   Programs Database                  │
+│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐  │
+│  │ Island 1 │ │ Island 2 │ │ Island 3 │ │ Island 4 │  │
+│  │ (Python) │ │ (Go)     │ │ (Rust)   │ │ (mixed)  │  │
+│  └─────────┘ └─────────┘ └─────────┘ └─────────┘  │
+└───────────────────────┬─────────────────────────────┘
+                        │
+          sample 2-3 parents + match replays
+                        │
+              ┌─────────▼──────────┐
+              │   Prompt Builder    │
+              │  • Parent code      │
+              │  • Recent losses    │
+              │  • Replay analysis  │
+              │  • "Beat this meta" │
+              └─────────┬──────────┘
+                        │
+              ┌─────────▼──────────┐
+              │   LLM Ensemble      │
+              │  • Fast model       │
+              │    (exploration)    │
+              │  • Strong model     │
+              │    (exploitation)   │
+              └─────────┬──────────┘
+                        │
+              ┌─────────▼──────────┐
+              │   Build & Validate  │
+              │  • Compile/lint     │
+              │  • Schema test      │
+              │  • Sandbox execute  │
+              └─────────┬──────────┘
+                        │
+              ┌─────────▼──────────┐
+              │   Tournament Gate   │
+              │  • Play vs current  │
+              │    population       │
+              │  • Must beat Nash   │
+              │    mixture (PSRO)   │
+              └─────────┬──────────┘
+                        │
+               promote if better
+                        │
+              ┌─────────▼──────────┐
+              │   Deploy as         │
+              │   Container         │
+              │  • Build image      │
+              │  • Register bot     │
+              │  • Enter ladder     │
+              └────────────────────┘
+```
+
+### References
+
+- [FunSearch — GitHub](https://github.com/google-deepmind/funsearch)
+- [OpenEvolve — GitHub](https://github.com/algorithmicsuperintelligence/openevolve)
+- [ShinkaEvolve — GitHub](https://github.com/SakanaAI/ShinkaEvolve)
+- [OpenELM — GitHub](https://github.com/CarperAI/OpenELM)
+- [AlphaCode Dataset — GitHub](https://github.com/google-deepmind/code_contests)
+- [Voyager — GitHub](https://github.com/MineDojo/Voyager)
+- [LLM-PSRO — IJCAI 2025](https://www.ijcai.org/proceedings/2025/1249)
+- [CATArena — GitHub](https://github.com/AGI-Eval-Official/CATArena)
+- [AlphaCodium — GitHub](https://github.com/Codium-ai/AlphaCodium)
+- [STOP — GitHub](https://github.com/microsoft/stop)
+- [Awesome LLM Game Agent Papers](https://github.com/git-disl/awesome-LLM-game-agent-papers)
+- [Awesome Self-Evolving Agents](https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents)
+- [nsjail — GitHub](https://github.com/google/nsjail)
+- [E2B Sandbox](https://e2b.dev/)
+- [Awesome Sandbox](https://github.com/restyler/awesome-sandbox)