Add research on LLM-driven bot evolution systems

Covers FunSearch, AlphaEvolve (+ open-source clones OpenEvolve,
ShinkaEvolve), ELM/OpenELM, AlphaCode, Voyager, LLM-PSRO, CATArena,
and sandboxing options. Recommends FunSearch island model + LLM-PSRO
Nash equilibrium selection for the evolution pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jedarden 2026-03-23 21:46:23 -04:00
parent d7cf4625e2
commit decae849c7


@ -0,0 +1,296 @@
# LLM-Driven Bot Evolution — Research
## Overview
Survey of real systems where LLMs iteratively generate, evaluate, and evolve
code. Focus on approaches applicable to evolving game bot strategies.
---
## 1. FunSearch (DeepMind, 2023)
**Paper:** [Nature](https://www.nature.com/articles/s41586-023-06924-6)
**Code:** [github.com/google-deepmind/funsearch](https://github.com/google-deepmind/funsearch)
The cleanest template for "evolve code via LLM."
**Evolutionary loop:**
1. User provides an `evaluate` function (fitness scorer) and a trivial seed
implementation of the function to evolve.
2. **Programs Database** stores all scored programs using an **island model**:
multiple independent populations evolve in parallel to maintain diversity.
Within each island, programs are clustered by score signature. Sampling
favors high-scoring clusters; within a cluster, favors shorter programs.
3. **Samplers** pull 2-3 high-scoring programs from the database, combine them into
a prompt, and query a pre-trained LLM to generate a new candidate function.
4. **Evaluators** execute the candidate against test inputs and score it. If
correct, it enters the database.
5. Repeat. System runs 15 samplers and 150 CPU evaluators in parallel.
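
The database and sampling steps above can be sketched as follows. This is a minimal illustration, not FunSearch's actual implementation: the class names, the linear score weighting, and the inverse-length weighting are all assumptions chosen for brevity.

```python
import random
from collections import defaultdict


class Island:
    """One independent population; programs are grouped by score signature."""

    def __init__(self):
        # signature (tuple of per-test scores) -> list of program sources
        self.clusters = defaultdict(list)

    def add(self, source, signature):
        self.clusters[signature].append(source)

    def sample(self, rng):
        # Favor high-scoring clusters (assumed linear weighting on total score).
        signatures = list(self.clusters)
        totals = [sum(sig) for sig in signatures]
        low = min(totals)
        weights = [t - low + 1.0 for t in totals]
        chosen = rng.choices(signatures, weights=weights, k=1)[0]
        # Within the chosen cluster, favor shorter programs.
        programs = self.clusters[chosen]
        length_weights = [1.0 / (len(p) + 1) for p in programs]
        return rng.choices(programs, weights=length_weights, k=1)[0]


class ProgramsDatabase:
    """Hypothetical island-model store: every island starts from the same seed."""

    def __init__(self, num_islands, seed_source, seed_signature):
        self.islands = [Island() for _ in range(num_islands)]
        for island in self.islands:
            island.add(seed_source, seed_signature)

    def sample_parents(self, rng, k=2):
        # Parents always come from a single island so islands stay independent.
        island_id = rng.randrange(len(self.islands))
        island = self.islands[island_id]
        return island_id, [island.sample(rng) for _ in range(k)]

    def register(self, island_id, source, signature):
        # A scored child goes back to the island its parents came from.
        self.islands[island_id].add(source, signature)
```

A sampler would call `sample_parents`, build a prompt from the returned sources, and pass the evaluated child back through `register`.
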
**Results:** Discovered new constructions for the cap set problem (open math)
and beat human heuristics for online bin-packing.
**Relevance:** The evaluator maps directly to "run the bot in the game engine,
score the match." Island model prevents convergence to one strategy.
---
## 2. AlphaEvolve (DeepMind, 2025) — FunSearch's Successor
**Paper:** [arxiv.org/abs/2506.13131](https://arxiv.org/abs/2506.13131)
Evolves entire codebases (not just single functions). Uses an **ensemble** of
Gemini Flash (exploration/breadth) and Gemini Pro (exploitation/depth).
**Results:** Recovered 0.7% of Google's worldwide compute by optimizing data
center scheduling. Found new matrix multiplication algorithms that improve on
Strassen's 1969 result (4×4 complex-valued matrices in 48 scalar multiplications).
**Open-source implementations:**
| Project | URL | Notes |
|---------|-----|-------|
| **OpenEvolve** | [github.com/algorithmicsuperintelligence/openevolve](https://github.com/algorithmicsuperintelligence/openevolve) | `pip install openevolve`. Full pipeline: prompt sampler, LLM ensemble, evaluator pool, program database with MAP-Elites + island model |
| **ShinkaEvolve** (Sakana AI) | [github.com/SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) | Apache 2.0. Adds novelty rejection-sampling, bandit-based LLM selection. Won ICFP 2025 Programming Contest. Found SOTA for circle packing in ~150 evaluations |
| **OpenAlpha_Evolve** | [github.com/shyamsaktawat/OpenAlpha_Evolve](https://github.com/shyamsaktawat/OpenAlpha_Evolve) | Community implementation |
---
## 3. ELM / OpenELM (Lehman et al., OpenAI, 2022)
**Paper:** [arxiv.org/pdf/2206.08896](https://arxiv.org/pdf/2206.08896)
**Code:** [github.com/CarperAI/OpenELM](https://github.com/CarperAI/OpenELM)
**Key insight:** Treats the LLM as a **diff/mutation operator**, not a
from-scratch generator. Uses commit-message-style prompts so the model
understands what kind of change is being requested.
**Architecture:**
1. **LLM mutation operator** — generates code diffs, not complete programs
2. **MAP-Elites outer loop** — maintains a grid of niches spanning user-defined
behavior dimensions. Each niche holds the best-performing individual. New
candidates replace niche inhabitants only if they score higher.
3. **LLM fine-tuning on successful mutations** — model updated based on which
mutations worked, closing the loop.
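
The MAP-Elites replacement rule in step 2 can be sketched as below. This is an illustrative archive only: the class name, the assumption that behavior descriptors are normalized to [0, 1], and the uniform binning are not taken from OpenELM.

```python
class MapElitesArchive:
    """Grid of niches; each niche keeps only its best-scoring program."""

    def __init__(self, bins_per_dim):
        self.bins_per_dim = bins_per_dim
        self.grid = {}  # niche (tuple of bin indices) -> (fitness, program)

    def niche_of(self, behavior):
        # Assumes each behavior value is already normalized to [0, 1].
        last = self.bins_per_dim - 1
        return tuple(min(int(b * self.bins_per_dim), last) for b in behavior)

    def try_insert(self, program, fitness, behavior):
        """Insert only if the niche is empty or the candidate scores higher."""
        niche = self.niche_of(behavior)
        incumbent = self.grid.get(niche)
        if incumbent is None or fitness > incumbent[0]:
            self.grid[niche] = (fitness, program)
            return True
        return False
```

For bots, the behavior dimensions might be something like aggression and unit diversity, so that a strong rusher and a strong defender occupy different niches instead of competing for one slot.
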
**Results:** Generated hundreds of thousands of functional Python programs
producing working robots in the Sodarace domain — a domain the LLM had never
seen in training.
**Relevance:** MAP-Elites ensures diversity of strategies (not just one dominant
approach). The diff-based mutation is practical for evolving bot code — smaller
changes, faster iteration.
---
## 4. AlphaCode / AlphaCode 2 (DeepMind)
**AlphaCode:** [science.org/doi/10.1126/science.abq1158](https://www.science.org/doi/10.1126/science.abq1158)
**AlphaCode 2:** [Technical Report](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf)
Not evolutionary — purely generative with massive oversampling and filtering:
1. Generate ~1 million candidate solutions per problem
2. Filter by executing against example test cases (eliminates ~99%)
3. Cluster remaining by behavioral similarity (run on synthetic inputs, group
programs producing identical outputs)
4. Select one representative per cluster, rank by scoring model, submit top 10
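
A small sketch of steps 2-4, assuming candidates are plain Python callables. The function name is hypothetical, and ranking clusters by size is a simple stand-in for the scoring model mentioned above.

```python
from collections import defaultdict


def filter_and_cluster(candidates, example_tests, probe_inputs):
    """Return one representative per behavioral cluster, largest cluster first.

    candidates:    list of single-argument callables
    example_tests: list of (input, expected_output) pairs used as the filter
    probe_inputs:  extra inputs used only to fingerprint behavior
    """

    def safe_call(fn, x):
        # Tag the result so a crash never compares equal to a real output.
        try:
            return ("ok", fn(x))
        except Exception as exc:
            return ("error", type(exc).__name__)

    # Step 2: keep only candidates that pass every example test.
    survivors = [
        c for c in candidates
        if all(safe_call(c, x) == ("ok", y) for x, y in example_tests)
    ]

    # Step 3: group survivors whose outputs on the probe inputs are identical.
    clusters = defaultdict(list)
    for c in survivors:
        signature = tuple(repr(safe_call(c, x)) for x in probe_inputs)
        clusters[signature].append(c)

    # Step 4: one representative per cluster (cluster size as a stand-in rank).
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked]
```
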
**AlphaCode 2:** 85th percentile on Codeforces (vs ~50th for v1). Uses Gemini
Pro with dedicated scoring/reranking model.
**Relevance:** The "generate many, filter by execution, cluster by behavior"
pattern is useful for the initial seeding phase — generate many candidate bots,
test them all, keep the diverse winners.
---
## 5. Voyager (NVIDIA / MineDojo, 2023)
**Paper:** [arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291)
**Code:** [github.com/MineDojo/Voyager](https://github.com/MineDojo/Voyager)
LLM agent that writes its own code in Minecraft.
**Three components:**
1. **Automatic Curriculum** — generates increasingly difficult objectives
2. **Skill Library** — persistent, growing collection of verified code snippets.
New tasks retrieve relevant skills by embedding similarity. Successful new
skills get added.
3. **Iterative Prompting with Self-Verification:**
- GPT-4 generates code for a task
- Code executes in environment
- Environment feedback + errors collected
- Self-verification module (also GPT-4) checks completion
- If not complete, feedback fed back and code refined
- Only verified-successful skills enter the library
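
The skill library's add-and-retrieve behavior can be sketched as follows. Voyager retrieves by embedding similarity; the bag-of-words `embed` function here is a toy stand-in so the example runs without a model, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a learned text embedding: word counts.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    def __init__(self):
        self.skills = []  # list of (description, embedding, code)

    def add_if_verified(self, description, code, verified):
        """Only skills that passed verification enter the library."""
        if not verified:
            return False
        self.skills.append((description, embed(description), code))
        return True

    def retrieve(self, task, k=3):
        """Return the k skills whose descriptions are most similar to the task."""
        query = embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(query, s[1]), reverse=True)
        return [(desc, code) for desc, _, code in ranked[:k]]
```
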
**Results:** 3.3x more unique items, 15.3x faster tech-tree progression vs
prior SOTA. Skills transfer to new worlds.
**Relevance:** The skill library pattern — decomposing strategy into reusable
verified components — could apply to bot strategies (e.g., verified pathfinding,
verified formation combat, verified scouting behaviors composed into full bots).
---
## 6. Game-Bot-Specific Systems
### LLM-PSRO (IJCAI 2025) — Most Directly Relevant
**Paper:** [ijcai.org/proceedings/2025/1249](https://www.ijcai.org/proceedings/2025/1249)
The most directly applicable system. Uses Policy-Space Response Oracles (PSRO):
1. Start with a population of bots (hand-written or LLM-generated)
2. Run round-robin tournaments → build payoff matrix
3. Compute **Nash equilibrium** mixture over the current population
4. Prompt the LLM to generate a new bot that **beats the Nash mixture**,
providing the losing bot's code and match results as context
5. Add new bot to population
6. Repeat
**Why Nash matters:** You can only add bots that improve the population's
game-theoretic profile. This is mathematically principled regression prevention
— the new bot must beat the optimal mixed strategy, not just one opponent.
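
A minimal sketch of steps 3 and 4, assuming a symmetric zero-sum game where `payoff[i][j]` is bot i's average score against bot j and `payoff[j][i] == -payoff[i][j]`. Fictitious play is used here as a simple approximate solver; it is an assumption of this sketch and not necessarily what the paper uses, and both function names are hypothetical.

```python
def approximate_nash(payoff, iterations=5000):
    """Approximate the Nash mixture of a symmetric zero-sum game by fictitious play."""
    n = len(payoff)
    counts = [0] * n
    counts[0] = 1  # arbitrary starting strategy
    for _ in range(iterations):
        total = sum(counts)
        mixture = [c / total for c in counts]
        # Expected payoff of each pure strategy against the empirical mixture.
        values = [sum(payoff[i][j] * mixture[j] for j in range(n)) for i in range(n)]
        # Play the best response and record it.
        best = max(range(n), key=values.__getitem__)
        counts[best] += 1
    total = sum(counts)
    return [c / total for c in counts]


def beats_nash_mixture(candidate_payoffs, mixture, margin=0.0):
    """candidate_payoffs[j] is the candidate's average score against bot j."""
    expected = sum(p * w for p, w in zip(candidate_payoffs, mixture))
    return expected > margin
```

With a rock-paper-scissors payoff matrix the mixture approaches one third each, so a candidate is admitted only if its weighted score against all three is positive. A nonzero `margin` can absorb match noise.
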
### CATArena (2025)
**Paper:** [arxiv.org/abs/2510.26852](https://arxiv.org/abs/2510.26852)
**Code:** [github.com/AGI-Eval-Official/CATArena](https://github.com/AGI-Eval-Official/CATArena)
LLM code agents play Gomoku, Texas Hold'em, Chess, Bridge. Agents refine
through **self-reflection** (analyzing own losses) and **peer-learning**
(reading opponent code).
**Key finding:** Evolutionary potential doesn't correlate with initial
proficiency — some weaker initial agents evolve faster.
### AlphaCodium (CodiumAI / Qodo)
**Paper:** [arxiv.org/abs/2401.08500](https://arxiv.org/abs/2401.08500)
**Code:** [github.com/Codium-ai/AlphaCodium](https://github.com/Codium-ai/AlphaCodium)
Two-phase iterative flow: pre-processing (self-reflection, test reasoning)
then generate-execute-refine against tests. Boosted GPT-4 from 19% to 44%
(pass@5) on CodeContests with only 15-20 LLM calls per solution.
### STOP — Self-Taught Optimizer (Microsoft Research, COLM 2024)
**Paper:** [arxiv.org/abs/2310.02304](https://arxiv.org/abs/2310.02304)
**Code:** [github.com/microsoft/stop](https://github.com/microsoft/stop)
A seed "improver" program that calls GPT-4 to improve code, then is run on
itself to improve the improver. The self-improved improver discovers strategies
like beam search, genetic algorithms, and simulated annealing — on its own.
---
## 7. Evaluation / Selection Mechanisms
| System | Selection Mechanism |
|--------|---------------------|
| **FunSearch / AlphaEvolve** | Island model with score-based cluster sampling. Higher scores sampled more. Islands prevent premature convergence. |
| **OpenELM** | MAP-Elites quality-diversity grid. New candidate replaces niche inhabitant only if it scores higher. Maintains diversity across behavior dimensions. |
| **AlphaCode** | Generate millions, filter by execution, cluster by behavior, rank by scoring model. Pure oversampling, no evolution. |
| **LLM-PSRO** | Nash equilibrium over population. New bots must beat the Nash mixture. Theoretically grounded regression prevention. |
| **CATArena** | Dual-metric: static proficiency vs evolutionary potential. Global win rate preferred over Elo for stability. |
| **ShinkaEvolve** | Parent sampling balancing exploration/exploitation + novelty rejection-sampling (rejects candidates too similar to existing population). |
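
The novelty rejection idea in the last row can be illustrated with a text-similarity check. This sketch uses `difflib` from the standard library as a cheap stand-in for whatever similarity measure a real system uses; the function name and threshold are assumptions.

```python
import difflib


def is_novel(candidate, population, max_similarity=0.95):
    """Reject a candidate whose source is too similar to any existing program."""
    for existing in population:
        ratio = difflib.SequenceMatcher(None, candidate, existing).ratio()
        if ratio > max_similarity:
            return False
    return True
```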
---
## 8. Code Sandboxing for LLM-Generated Code
| Solution | Isolation Level | Overhead | Used By |
|----------|----------------|----------|---------|
| **Firecracker MicroVMs** | Strongest (own kernel) | <200ms boot, <5 MiB/VM | AWS Lambda, E2B ([e2b.dev](https://e2b.dev/)), ~50% Fortune 500 |
| **gVisor** | Strong (userspace kernel) | Low | GKE Sandbox, [k8s agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) |
| **nsjail** | Moderate (namespaces + seccomp) | Minimal | FunSearch evaluators (150 nodes) |
| **WASM** | Moderate (no fs/network) | Near-native | Constrained execution environments |
**Recommendation for bot evolution:** nsjail for high-throughput evaluation
(you control the game engine); Firecracker/E2B if executing fully arbitrary
LLM code with network/filesystem access.
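
For local experiments, a far weaker baseline than any row in the table is a subprocess with a wall-clock timeout and a CPU limit. The sketch below is Unix-only, uses only the standard library, and is not a substitute for nsjail, gVisor, or Firecracker: it restricts neither filesystem nor network access. The function name and result shape are hypothetical.

```python
import resource
import subprocess
import sys


def run_untrusted(source, timeout_s=3.0, cpu_seconds=1):
    """Run Python source in a child process with basic resource limits."""

    def limit_resources():
        # Runs in the child just before exec: cap CPU time (soft limit only).
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        soft = cpu_seconds if hard == resource.RLIM_INFINITY else min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    try:
        completed = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
            preexec_fn=limit_resources,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "timeout", "stdout": ""}
    return {
        "ok": completed.returncode == 0,
        "reason": "exit",
        "stdout": completed.stdout,
    }
```
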
**Reference:** [github.com/restyler/awesome-sandbox](https://github.com/restyler/awesome-sandbox)
---
## 9. Recommended Architecture for AI Code Battle
Based on the systems above, the **FunSearch/AlphaEvolve island model +
LLM-PSRO game-theoretic selection** combination is the best fit:
```
┌─────────────────────────────────────────────────────┐
│                  Programs Database                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │
│  │ Island 1 │ │ Island 2 │ │ Island 3 │ │ Island 4 │  │
│  │ (Python) │ │ (Go)     │ │ (Rust)   │ │ (mixed)  │  │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │
└──────────────────────────┬──────────────────────────┘
                           │ sample 2-3 parents + match replays
                 ┌─────────▼──────────┐
                 │   Prompt Builder   │
                 │ • Parent code      │
                 │ • Recent losses    │
                 │ • Replay analysis  │
                 │ • "Beat this meta" │
                 └─────────┬──────────┘
                 ┌─────────▼──────────┐
                 │    LLM Ensemble    │
                 │ • Fast model       │
                 │   (exploration)    │
                 │ • Strong model     │
                 │   (exploitation)   │
                 └─────────┬──────────┘
                 ┌─────────▼──────────┐
                 │  Build & Validate  │
                 │ • Compile/lint     │
                 │ • Schema test      │
                 │ • Sandbox execute  │
                 └─────────┬──────────┘
                 ┌─────────▼──────────┐
                 │  Tournament Gate   │
                 │ • Play vs current  │
                 │   population       │
                 │ • Must beat Nash   │
                 │   mixture (PSRO)   │
                 └─────────┬──────────┘
                           │ promote if better
                 ┌─────────▼──────────┐
                 │     Deploy as      │
                 │     Container      │
                 │ • Build image      │
                 │ • Register bot     │
                 │ • Enter ladder     │
                 └────────────────────┘
```
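
The stage order in the diagram can be expressed as one evolution step over injected callables. Every name below is hypothetical; the stages are placeholders for the real sampler, LLM client, build system, and tournament runner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PipelineStages:
    """Hypothetical hooks, one per box in the diagram."""
    sample_parents: Callable[[], List[str]]     # Programs Database
    build_prompt: Callable[[List[str]], str]    # Prompt Builder
    generate: Callable[[str], str]              # LLM Ensemble
    validate: Callable[[str], bool]             # Build & Validate
    passes_gate: Callable[[str], bool]          # Tournament Gate
    promote: Callable[[str], None]              # Deploy as Container


def evolution_step(stages: PipelineStages) -> Optional[str]:
    """Run one candidate through the pipeline; return it only if promoted."""
    parents = stages.sample_parents()
    prompt = stages.build_prompt(parents)
    candidate = stages.generate(prompt)
    if not stages.validate(candidate):
        return None  # failed compile, schema, or sandbox checks
    if not stages.passes_gate(candidate):
        return None  # did not beat the current population
    stages.promote(candidate)
    return candidate
```

Keeping the stages as plain callables makes it easy to swap in stubs for testing before any LLM or game engine is wired up.
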
### References
- [FunSearch — GitHub](https://github.com/google-deepmind/funsearch)
- [OpenEvolve — GitHub](https://github.com/algorithmicsuperintelligence/openevolve)
- [ShinkaEvolve — GitHub](https://github.com/SakanaAI/ShinkaEvolve)
- [OpenELM — GitHub](https://github.com/CarperAI/OpenELM)
- [AlphaCode Dataset — GitHub](https://github.com/google-deepmind/code_contests)
- [Voyager — GitHub](https://github.com/MineDojo/Voyager)
- [LLM-PSRO — IJCAI 2025](https://www.ijcai.org/proceedings/2025/1249)
- [CATArena — GitHub](https://github.com/AGI-Eval-Official/CATArena)
- [AlphaCodium — GitHub](https://github.com/Codium-ai/AlphaCodium)
- [STOP — GitHub](https://github.com/microsoft/stop)
- [Awesome LLM Game Agent Papers](https://github.com/git-disl/awesome-LLM-game-agent-papers)
- [Awesome Self-Evolving Agents](https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents)
- [nsjail — GitHub](https://github.com/google/nsjail)
- [E2B Sandbox](https://e2b.dev/)
- [Awesome Sandbox](https://github.com/restyler/awesome-sandbox)