Covers FunSearch, AlphaEvolve (+ open-source clones OpenEvolve, ShinkaEvolve), ELM/OpenELM, AlphaCode, Voyager, LLM-PSRO, CATArena, and sandboxing options. Recommends FunSearch island model + LLM-PSRO Nash equilibrium selection for the evolution pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLM-Driven Bot Evolution — Research
Overview
Survey of real systems where LLMs iteratively generate, evaluate, and evolve code. Focus on approaches applicable to evolving game bot strategies.
1. FunSearch (DeepMind, 2023)
Paper: Nature Code: github.com/google-deepmind/funsearch
The cleanest template for "evolve code via LLM."
Evolutionary loop:
- User provides an `evaluate` function (fitness scorer) and a trivial seed implementation of the function to evolve.
- Programs Database stores all scored programs using an island model: multiple independent populations evolve in parallel to maintain diversity. Within each island, programs are clustered by score signature. Sampling favors high-scoring clusters; within a cluster, it favors shorter programs.
- Samplers pull 2–3 high-scoring programs from the database, combine into a prompt, and query a pre-trained LLM to generate a new candidate function.
- Evaluators execute the candidate against test inputs and score it. If correct, it enters the database.
- Repeat. System runs 15 samplers and 150 CPU evaluators in parallel.
Results: Discovered new constructions for the cap set problem (open math) and beat human heuristics for online bin-packing.
Relevance: The evaluator maps directly to "run the bot in the game engine, score the match." Island model prevents convergence to one strategy.
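The programs database described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DeepMind's implementation: the class names `Island` and `ProgramsDatabase`, the softmax temperature, and the `1 / (1 + length)` weighting are all hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict


class Island:
    """One population; programs are grouped by their score signature."""

    def __init__(self):
        # signature (tuple of per-test scores) -> list of program sources
        self.clusters = defaultdict(list)

    def add(self, source, signature):
        self.clusters[tuple(signature)].append(source)

    def best_score(self):
        return max((sum(sig) for sig in self.clusters), default=float("-inf"))

    def sample(self, rng, temperature=1.0):
        """Pick a cluster (softmax over total score), then a program in it."""
        sigs = list(self.clusters)
        top = max(sum(sig) for sig in sigs)
        # Subtracting `top` keeps the exponent <= 0 and avoids overflow.
        weights = [math.exp((sum(sig) - top) / temperature) for sig in sigs]
        cluster = self.clusters[rng.choices(sigs, weights=weights)[0]]
        # Assumption: shorter sources get proportionally more weight.
        length_weights = [1.0 / (1 + len(src)) for src in cluster]
        return rng.choices(cluster, weights=length_weights)[0]


class ProgramsDatabase:
    """Several islands that evolve independently from the same seed."""

    def __init__(self, num_islands, seed_source, seed_signature, rng=None):
        self.rng = rng or random.Random(0)
        self.islands = [Island() for _ in range(num_islands)]
        for island in self.islands:
            island.add(seed_source, seed_signature)

    def sample_parents(self, k=2):
        """Return (island_id, k parent sources), all from one island."""
        island_id = self.rng.randrange(len(self.islands))
        island = self.islands[island_id]
        return island_id, [island.sample(self.rng) for _ in range(k)]

    def register(self, island_id, source, signature):
        """Store a scored child on the island its parents came from."""
        self.islands[island_id].add(source, signature)
```

For bot evolution, the signature could be the tuple of match scores against a fixed set of reference opponents, so that bots with the same win/loss profile land in the same cluster.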
2. AlphaEvolve (DeepMind, 2025) — FunSearch's Successor
Paper: arxiv.org/abs/2506.13131
Evolves entire codebases (not just single functions). Uses an ensemble of Gemini Flash (exploration/breadth) and Gemini Pro (exploitation/depth).
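Results: A scheduling heuristic it discovered for Google's data centers recovers on average 0.7% of Google's worldwide compute resources. It also found a procedure for multiplying 4×4 complex-valued matrices with 48 scalar multiplications, improving on the 49 obtained from Strassen's 1969 algorithm.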
Open-source implementations:
| Project | URL | Notes |
|---|---|---|
| OpenEvolve | github.com/algorithmicsuperintelligence/openevolve | pip install openevolve. Full pipeline: prompt sampler, LLM ensemble, evaluator pool, program database with MAP-Elites + island model |
| ShinkaEvolve (Sakana AI) | github.com/SakanaAI/ShinkaEvolve | Apache 2.0. Adds novelty rejection-sampling, bandit-based LLM selection. Used by the team that won the ICFP 2025 Programming Contest. Found SOTA for circle packing in ~150 evaluations |
| OpenAlpha_Evolve | github.com/shyamsaktawat/OpenAlpha_Evolve | Community implementation |
3. ELM / OpenELM (Lehman et al., OpenAI, 2022)
Paper: arxiv.org/pdf/2206.08896 Code: github.com/CarperAI/OpenELM
Key insight: Treats the LLM as a diff/mutation operator, not a from-scratch generator. Uses commit-message-style prompts so the model understands what kind of change is being requested.
Architecture:
- LLM mutation operator — generates code diffs, not complete programs
- MAP-Elites outer loop — maintains a grid of niches spanning user-defined behavior dimensions. Each niche holds the best-performing individual. New candidates replace niche inhabitants only if they score higher.
- LLM fine-tuning on successful mutations — model updated based on which mutations worked, closing the loop.
Results: Generated hundreds of thousands of functional Python programs producing working robots in the Sodarace domain — a domain the LLM had never seen in training.
Relevance: MAP-Elites ensures diversity of strategies (not just one dominant approach). The diff-based mutation is practical for evolving bot code — smaller changes, faster iteration.
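The MAP-Elites outer loop described above is short enough to sketch directly. In ELM the `mutate` step would be an LLM producing a diff; here it is a hypothetical numeric mutation on a two-parameter "bot" so the example runs on its own. The function names and the toy fitness and behavior descriptors are assumptions made for illustration.

```python
import random


def map_elites(seed, mutate, evaluate, describe, iterations, rng):
    """Return {niche: (fitness, individual)} after `iterations` mutations."""
    archive = {describe(seed): (evaluate(seed), seed)}
    for _ in range(iterations):
        # Pick a random current elite as the parent and mutate it.
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        niche, fitness = describe(child), evaluate(child)
        # Replace the niche's inhabitant only if the child scores higher.
        if niche not in archive or fitness > archive[niche][0]:
            archive[niche] = (fitness, child)
    return archive


# Toy stand-ins: an individual is (aggression, economy), each in 0..9.
def toy_mutate(params, rng):
    i = rng.randrange(len(params))
    bumped = min(9, max(0, params[i] + rng.choice((-1, 1))))
    return params[:i] + (bumped,) + params[i + 1:]


def toy_evaluate(params):
    aggression, economy = params
    return -((aggression - 7) ** 2) - (economy - 2) ** 2


def toy_describe(params):
    # Two behavior dimensions, each split into a low and a high bucket.
    return tuple(p // 5 for p in params)


archive = map_elites((0, 0), toy_mutate, toy_evaluate, toy_describe,
                     iterations=500, rng=random.Random(0))
```

The archive keeps one elite per niche, so a strong aggressive bot never evicts the best economy-focused bot; they compete only within their own cell.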
4. AlphaCode / AlphaCode 2 (DeepMind)
AlphaCode: science.org/doi/10.1126/science.abq1158 AlphaCode 2: Technical Report
Not evolutionary — purely generative with massive oversampling and filtering:
- Generate ~1 million candidate solutions per problem
- Filter by executing against example test cases (eliminates ~99%)
- Cluster remaining by behavioral similarity (run on synthetic inputs, group programs producing identical outputs)
- Select one representative from each of the largest clusters and submit up to 10 (AlphaCode 2 uses a scoring model to pick each cluster's representative)
AlphaCode 2: 85th percentile on Codeforces (vs ~50th for v1). Uses Gemini Pro with dedicated scoring/reranking model.
Relevance: The "generate many, filter by execution, cluster by behavior" pattern is useful for the initial seeding phase — generate many candidate bots, test them all, keep the diverse winners.
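The filter-then-cluster pattern can be sketched as follows. This is a simplified illustration, not AlphaCode's code: candidates are plain Python callables, the function name `filter_and_cluster` is hypothetical, and behavior is compared via `repr` of the outputs.

```python
from collections import defaultdict


def filter_and_cluster(candidates, examples, probe_inputs):
    """Drop candidates that fail the examples, then group by probe outputs.

    candidates: list of callables taking one input.
    examples: list of (input, expected_output) pairs.
    probe_inputs: extra inputs used only to compare behavior.
    Returns a list of clusters, largest first.
    """
    def safe_call(fn, x):
        try:
            return repr(fn(x))
        except Exception as exc:
            return f"error:{type(exc).__name__}"

    survivors = [
        c for c in candidates
        if all(safe_call(c, x) == repr(y) for x, y in examples)
    ]
    clusters = defaultdict(list)
    for c in survivors:
        # Candidates with identical outputs on every probe share a cluster.
        signature = tuple(safe_call(c, x) for x in probe_inputs)
        clusters[signature].append(c)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy task: absolute value. Two correct variants, one that only looks
# correct on the examples, and one that fails the examples outright.
candidates = [
    abs,
    lambda x: x if x >= 0 else -x,
    lambda x: x,
    lambda x: x * x,
]
clusters = filter_and_cluster(candidates, examples=[(2, 2), (0, 0)],
                              probe_inputs=[-3, 5])
```

For bot seeding, the inputs would be game states and the outputs the chosen actions, so bots that act identically on a set of probe states are treated as one strategy.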
5. Voyager (NVIDIA / MineDojo, 2023)
Paper: arxiv.org/abs/2305.16291 Code: github.com/MineDojo/Voyager
LLM agent that writes its own code in Minecraft.
Three components:
- Automatic Curriculum — generates increasingly difficult objectives
- Skill Library — persistent, growing collection of verified code snippets. New tasks retrieve relevant skills by embedding similarity. Successful new skills get added.
- Iterative Prompting with Self-Verification:
- GPT-4 generates code for a task
- Code executes in environment
- Environment feedback + errors collected
- Self-verification module (also GPT-4) checks completion
- If not complete, feedback fed back and code refined
- Only verified-successful skills enter the library
Results: 3.3x more unique items, 15.3x faster tech-tree progression vs prior SOTA. Skills transfer to new worlds.
Relevance: The skill library pattern — decomposing strategy into reusable verified components — could apply to bot strategies (e.g., verified pathfinding, verified formation combat, verified scouting behaviors composed into full bots).
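The generate, execute, verify, refine loop above can be sketched with the LLM and environment injected as callables. Everything named here (`acquire_skill`, the stub functions, the dict-based library) is a hypothetical stand-in; Voyager itself retrieves skills by embedding similarity, which this sketch omits.

```python
def acquire_skill(task, generate, execute, verify, library, max_rounds=4):
    """Try to produce verified code for `task`; store it on success."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(task, feedback, library)
        result = execute(code)             # environment output or error text
        ok, critique = verify(task, result)
        if ok:
            library[task] = code           # only verified skills are stored
            return code
        # Feed the failure back into the next generation round.
        feedback = f"result={result!r}; critique={critique}"
    return None


# Stubs standing in for the LLM, the environment, and the verifier.
def stub_generate(task, feedback, library):
    # First draft is wrong; the retry (when feedback is present) is right.
    return "result = 2 + 2" if feedback else "result = 2 + 3"


def stub_execute(code):
    scope = {}
    try:
        exec(code, {}, scope)              # a real system would sandbox this
    except Exception as exc:
        return f"error: {exc}"
    return scope.get("result")


def stub_verify(task, result):
    return result == 4, "expected 4"


library = {}
skill = acquire_skill("add two and two", stub_generate, stub_execute,
                      stub_verify, library)
```

The key property is that the library only ever grows with code that passed verification, so later tasks can build on it without re-checking.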
6. Game-Bot-Specific Systems
LLM-PSRO (IJCAI 2025) — Most Directly Relevant
Paper: ijcai.org/proceedings/2025/1249
The most directly applicable system. It follows the Policy-Space Response Oracles (PSRO) loop:
- Start with a population of bots (hand-written or LLM-generated)
- Run round-robin tournaments → build payoff matrix
- Compute Nash equilibrium mixture over the current population
- Prompt the LLM to generate a new bot that beats the Nash mixture, providing the losing bot's code and match results as context
- Add new bot to population
- Repeat
Why Nash matters: A bot is added only if it improves the population's game-theoretic profile. This gives principled regression prevention: the new bot must beat the equilibrium mixed strategy over the current population, not just one opponent.
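The Nash step and the admission check can be sketched without external libraries. This is an illustration under assumptions, not the LLM-PSRO implementation: it treats the tournament as a symmetric zero-sum game with an antisymmetric payoff matrix, and approximates the equilibrium with fictitious play rather than an exact linear-program solver. The function names and the `margin` parameter are hypothetical.

```python
def nash_mixture(payoff, iterations=20000):
    """Approximate a symmetric Nash mixture by fictitious play.

    payoff[i][j] is bot i's payoff against bot j, with
    payoff[i][j] == -payoff[j][i] (for example, win rate minus loss rate).
    """
    n = len(payoff)
    counts = [1] * n
    for _ in range(iterations):
        # Payoff of each pure strategy against the empirical mixture so far.
        values = [sum(payoff[i][j] * counts[j] for j in range(n))
                  for i in range(n)]
        # Play the best response and record it.
        counts[max(range(n), key=values.__getitem__)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def beats_mixture(candidate_payoffs, mixture, margin=0.0):
    """candidate_payoffs[j] is the candidate's mean payoff against bot j."""
    expected = sum(p * w for p, w in zip(candidate_payoffs, mixture))
    return expected > margin


# Rock-paper-scissors style population: no bot dominates.
cyclic = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]
mixture = nash_mixture(cyclic)
```

On a cyclic population like the one above, the mixture spreads weight across all three bots, so a candidate that merely counters one of them does not pass the gate. Fictitious play converges slowly; a production pipeline would more likely solve the equilibrium exactly.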
CATArena (2025)
Paper: arxiv.org/abs/2510.26852 Code: github.com/AGI-Eval-Official/CATArena
LLM code agents play Gomoku, Texas Hold'em, Chess, Bridge. Agents refine through self-reflection (analyzing own losses) and peer-learning (reading opponent code).
Key finding: Evolutionary potential doesn't correlate with initial proficiency — some weaker initial agents evolve faster.
AlphaCodium (CodiumAI / Qodo)
Paper: arxiv.org/abs/2401.08500 Code: github.com/Codium-ai/AlphaCodium
Two-phase iterative flow: pre-processing (self-reflection, test reasoning) then generate-execute-refine against tests. Boosted GPT-4 from 19% to 44% on CodeContests with only 15–20 LLM calls per solution.
STOP — Self-Taught Optimizer (Microsoft Research, COLM 2024)
Paper: arxiv.org/abs/2310.02304 Code: github.com/microsoft/stop
A seed "improver" program that calls GPT-4 to improve code, then is run on itself to improve the improver. The self-improved improver discovers strategies like beam search, genetic algorithms, and simulated annealing — on its own.
7. Evaluation / Selection Mechanisms
| System | Selection Mechanism |
|---|---|
| FunSearch / AlphaEvolve | Island model with score-based cluster sampling. Higher scores sampled more. Islands prevent premature convergence. |
| OpenELM | MAP-Elites quality-diversity grid. New candidate replaces niche inhabitant only if it scores higher. Maintains diversity across behavior dimensions. |
| AlphaCode | Generate millions, filter by execution, cluster by behavior, rank by scoring model. Pure oversampling, no evolution. |
| LLM-PSRO | Nash equilibrium over population. New bots must beat the Nash mixture. Theoretically grounded regression prevention. |
| CATArena | Dual-metric: static proficiency vs evolutionary potential. Global win rate preferred over Elo for stability. |
| ShinkaEvolve | Parent sampling balancing exploration/exploitation + novelty rejection-sampling (rejects candidates too similar to existing population). |
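As one concrete example from the table, novelty rejection-sampling can be sketched with the standard library. This is a stand-in, not ShinkaEvolve's method: it compares raw source text with `difflib`, and the `is_novel` name and the 0.9 threshold are assumptions chosen for illustration.

```python
import difflib


def is_novel(candidate, population, max_similarity=0.9):
    """Return False if the candidate is too similar to any existing program."""
    return all(
        difflib.SequenceMatcher(None, candidate, existing).ratio()
        < max_similarity
        for existing in population
    )


population = ["def act(state):\n    return 'attack'"]
duplicate = "def act(state):\n    return 'attack'"
different = (
    "import heapq\n\n"
    "class Planner:\n"
    "    def plan(self, grid):\n"
    "        return heapq.nsmallest(3, grid)"
)
```

Rejecting near-duplicates before evaluation saves tournament time, since a bot that differs from its parent by a renamed variable would otherwise consume a full round of matches.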
8. Code Sandboxing for LLM-Generated Code
| Solution | Isolation Level | Overhead | Used By |
|---|---|---|---|
| Firecracker MicroVMs | Strongest (own kernel) | <200ms boot, <5 MiB/VM | AWS Lambda, E2B (e2b.dev), ~50% Fortune 500 |
| gVisor | Strong (userspace kernel) | Low | GKE Sandbox, k8s agent-sandbox |
| nsjail | Moderate (namespaces + seccomp) | Minimal | FunSearch evaluators (150 nodes) |
| WASM | Moderate (no fs/network by default) | Near-native | Constrained execution environments |
Recommendation for bot evolution: nsjail for high-throughput evaluation (you control the game engine); Firecracker/E2B if executing fully arbitrary LLM code with network/filesystem access.
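For local experiments before any of the sandboxes above is set up, a child process with a CPU limit and a wall-clock timeout gives a rough harness. This sketch is POSIX-only and is much weaker than nsjail or Firecracker: it provides no filesystem or network isolation. The function name `run_untrusted` and the default limits are assumptions.

```python
import os
import resource
import subprocess
import sys
import tempfile


def run_untrusted(source, wall_seconds=2.0, cpu_seconds=1):
    """Run Python source in a child process with CPU and wall-clock limits."""
    def limit_cpu():
        # Runs in the child just before exec: cap CPU time in seconds.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(source)
    try:
        done = subprocess.run(
            [sys.executable, "-I", handle.name],   # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=wall_seconds,
            preexec_fn=limit_cpu,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": ""}
    finally:
        os.unlink(handle.name)
    status = "ok" if done.returncode == 0 else "error"
    return {"status": status, "stdout": done.stdout}
```

A busy loop is stopped by the CPU limit and a sleeping process by the wall-clock timeout; both are reported as failures so the evaluator can score the candidate as invalid.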
Reference: github.com/restyler/awesome-sandbox
9. Recommended Architecture for AI Code Battle
Based on the systems above, the best fit combines the FunSearch/AlphaEvolve island model with LLM-PSRO game-theoretic selection:
┌─────────────────────────────────────────────────────┐
│                  Programs Database                  │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Island 1 │ │ Island 2 │ │ Island 3 │ │ Island 4 │ │
│ │ (Python) │ │ (Go)     │ │ (Rust)   │ │ (mixed)  │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└───────────────────────┬─────────────────────────────┘
                        │
       sample 2-3 parents + match replays
                        │
              ┌─────────▼──────────┐
              │ Prompt Builder     │
              │ • Parent code      │
              │ • Recent losses    │
              │ • Replay analysis  │
              │ • "Beat this meta" │
              └─────────┬──────────┘
                        │
              ┌─────────▼──────────┐
              │ LLM Ensemble       │
              │ • Fast model       │
              │   (exploration)    │
              │ • Strong model     │
              │   (exploitation)   │
              └─────────┬──────────┘
                        │
              ┌─────────▼──────────┐
              │ Build & Validate   │
              │ • Compile/lint     │
              │ • Schema test      │
              │ • Sandbox execute  │
              └─────────┬──────────┘
                        │
              ┌─────────▼──────────┐
              │ Tournament Gate    │
              │ • Play vs current  │
              │   population       │
              │ • Must beat Nash   │
              │   mixture (PSRO)   │
              └─────────┬──────────┘
                        │
                promote if better
                        │
              ┌─────────▼──────────┐
              │ Deploy as          │
              │ Container          │
              │ • Build image      │
              │ • Register bot     │
              │ • Enter ladder     │
              └────────────────────┘
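The stages in the diagram can be wired together as one function with each stage injected as a callable. This is only a skeleton under assumptions: the name `evolve_step`, the stage signatures, and the returned status strings are hypothetical, and real stages would call the database, the LLM ensemble, the sandbox, and the tournament runner.

```python
def evolve_step(sample_parents, build_prompt, generate, validate, gate, promote):
    """Push one candidate through the pipeline; return where it stopped."""
    parents = sample_parents()                   # Programs Database
    candidate = generate(build_prompt(parents))  # Prompt Builder + LLM Ensemble
    if not validate(candidate):                  # Build & Validate
        return "rejected:validate"
    if not gate(candidate):                      # Tournament Gate
        return "rejected:gate"
    promote(candidate)                           # Deploy as Container
    return "promoted"


# Stub wiring to show the control flow; every stage here is a placeholder.
ladder = []
outcome = evolve_step(
    sample_parents=lambda: ["bot_a source", "bot_b source"],
    build_prompt=lambda parents: "Improve on:\n" + "\n".join(parents),
    generate=lambda prompt: "bot_c source",
    validate=lambda candidate: candidate.endswith("source"),
    gate=lambda candidate: True,
    promote=ladder.append,
)
```

Keeping the stages as injected callables lets each one be swapped independently, for example replacing a win-rate gate with a Nash-mixture gate without touching the rest of the loop.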