ai-code-battle/docs/research/llm-bot-evolution.md
jedarden decae849c7 Add research on LLM-driven bot evolution systems
Covers FunSearch, AlphaEvolve (+ open-source clones OpenEvolve,
ShinkaEvolve), ELM/OpenELM, AlphaCode, Voyager, LLM-PSRO, CATArena,
and sandboxing options. Recommends FunSearch island model + LLM-PSRO
Nash equilibrium selection for the evolution pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:46:23 -04:00

# LLM-Driven Bot Evolution — Research
## Overview
Survey of real systems where LLMs iteratively generate, evaluate, and evolve
code. Focus on approaches applicable to evolving game bot strategies.
---
## 1. FunSearch (DeepMind, 2023)
**Paper:** [Nature](https://www.nature.com/articles/s41586-023-06924-6)
**Code:** [github.com/google-deepmind/funsearch](https://github.com/google-deepmind/funsearch)
The cleanest template for "evolve code via LLM."
**Evolutionary loop:**
1. User provides an `evaluate` function (fitness scorer) and a trivial seed
implementation of the function to evolve.
2. **Programs Database** stores all scored programs using an **island model**:
multiple independent populations evolving in parallel to maintain diversity.
Within each island, programs are clustered by score signature. Sampling
favors high-scoring clusters; within a cluster, favors shorter programs.
3. **Samplers** pull 2-3 high-scoring programs from the database, combine them into
a prompt, and query a pre-trained LLM to generate a new candidate function.
4. **Evaluators** execute the candidate against test inputs and score it. If
correct, it enters the database.
5. Repeat. System runs 15 samplers and 150 CPU evaluators in parallel.
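
The database side of this loop can be sketched in a few lines. This is a minimal illustration, not FunSearch's actual code: the class and method names are hypothetical, a single scalar score stands in for the score signature, and the softmax and inverse-length weights are one plausible way to favor high-scoring clusters and shorter programs.

```python
import math
import random
from collections import defaultdict


class Island:
    """One independent population, clustered by score (hypothetical sketch)."""

    def __init__(self, temperature=1.0):
        self.clusters = defaultdict(list)  # score -> program sources
        self.temperature = temperature

    def add(self, source, score):
        self.clusters[score].append(source)

    def sample(self, rng):
        """Softmax over cluster scores, then prefer shorter programs."""
        scores = list(self.clusters)
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp((s - top) / self.temperature) for s in scores]
        chosen_score = rng.choices(scores, weights=weights)[0]
        cluster = self.clusters[chosen_score]
        # Assumed weighting: shorter source text gets a larger weight.
        length_weights = [1.0 / (1 + len(src)) for src in cluster]
        return rng.choices(cluster, weights=length_weights)[0]


class ProgramsDatabase:
    """Several islands; parents for one prompt always come from one island."""

    def __init__(self, num_islands=4, seed=0):
        self.rng = random.Random(seed)
        self.islands = [Island() for _ in range(num_islands)]

    def register(self, island_id, source, score):
        self.islands[island_id].add(source, score)

    def sample_parents(self, k=2):
        """Pick one populated island, then draw k parents (with replacement)."""
        populated = [i for i, isl in enumerate(self.islands) if isl.clusters]
        island_id = self.rng.choice(populated)
        island = self.islands[island_id]
        return island_id, [island.sample(self.rng) for _ in range(k)]
```

Because parents are drawn with replacement, a small island can return the same program twice; a fuller version would deduplicate and periodically reset weak islands.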
**Results:** Discovered new constructions for the cap set problem (an open
problem in mathematics) and beat established human-designed heuristics for
online bin-packing.
**Relevance:** The evaluator maps directly to "run the bot in the game engine,
score the match." Island model prevents convergence to one strategy.
---
## 2. AlphaEvolve (DeepMind, 2025) — FunSearch's Successor
**Paper:** [arxiv.org/abs/2506.13131](https://arxiv.org/abs/2506.13131)
Evolves entire codebases (not just single functions). Uses an **ensemble** of
Gemini Flash (exploration/breadth) and Gemini Pro (exploitation/depth).
**Results:** Recovered 0.7% of Google's worldwide compute by optimizing data
center scheduling. Found new matrix multiplication algorithms that improve on
Strassen's 1969 result.
**Open-source implementations:**
| Project | URL | Notes |
|---------|-----|-------|
| **OpenEvolve** | [github.com/algorithmicsuperintelligence/openevolve](https://github.com/algorithmicsuperintelligence/openevolve) | `pip install openevolve`. Full pipeline: prompt sampler, LLM ensemble, evaluator pool, program database with MAP-Elites + island model |
| **ShinkaEvolve** (Sakana AI) | [github.com/SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) | Apache 2.0. Adds novelty rejection-sampling, bandit-based LLM selection. Won ICFP 2025 Programming Contest. Found SOTA for circle packing in ~150 evaluations |
| **OpenAlpha_Evolve** | [github.com/shyamsaktawat/OpenAlpha_Evolve](https://github.com/shyamsaktawat/OpenAlpha_Evolve) | Community implementation |
---
## 3. ELM / OpenELM (Lehman et al., OpenAI, 2022)
**Paper:** [arxiv.org/pdf/2206.08896](https://arxiv.org/pdf/2206.08896)
**Code:** [github.com/CarperAI/OpenELM](https://github.com/CarperAI/OpenELM)
**Key insight:** Treats the LLM as a **diff/mutation operator**, not a
from-scratch generator. Uses commit-message-style prompts so the model
understands what kind of change is being requested.
**Architecture:**
1. **LLM mutation operator** — generates code diffs, not complete programs
2. **MAP-Elites outer loop** — maintains a grid of niches spanning user-defined
behavior dimensions. Each niche holds the best-performing individual. New
candidates replace niche inhabitants only if they score higher.
3. **LLM fine-tuning on successful mutations** — model updated based on which
mutations worked, closing the loop.
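
The MAP-Elites replacement rule in step 2 is small enough to show directly. The sketch below is illustrative only: `MapElitesGrid` and its methods are hypothetical names, and it assumes behavior descriptors are already normalized to the range [0, 1] (for a bot, these might be something like aggression and expansion rate).

```python
from typing import Dict, Tuple

Niche = Tuple[int, ...]


class MapElitesGrid:
    """Grid of niches; each niche keeps only its best-scoring individual."""

    def __init__(self, bins_per_dim: int = 5):
        self.bins_per_dim = bins_per_dim
        self.elites: Dict[Niche, Tuple[float, str]] = {}

    def niche_of(self, behavior: Tuple[float, ...]) -> Niche:
        """Discretize behavior descriptors in [0, 1] into grid coordinates."""
        last = self.bins_per_dim - 1
        return tuple(
            min(last, max(0, int(b * self.bins_per_dim))) for b in behavior
        )

    def try_insert(self, source: str, fitness: float,
                   behavior: Tuple[float, ...]) -> bool:
        """Insert if the niche is empty or the candidate scores strictly higher."""
        niche = self.niche_of(behavior)
        incumbent = self.elites.get(niche)
        if incumbent is None or fitness > incumbent[0]:
            self.elites[niche] = (fitness, source)
            return True
        return False
```

A weak candidate still survives if it lands in an empty niche, which is what keeps unusual strategies in the population.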
**Results:** Generated hundreds of thousands of functional Python programs
producing working robots in the Sodarace domain — a domain the LLM had never
seen in training.
**Relevance:** MAP-Elites ensures diversity of strategies (not just one dominant
approach). The diff-based mutation is practical for evolving bot code — smaller
changes, faster iteration.
---
## 4. AlphaCode / AlphaCode 2 (DeepMind)
**AlphaCode:** [science.org/doi/10.1126/science.abq1158](https://www.science.org/doi/10.1126/science.abq1158)
**AlphaCode 2:** [Technical Report](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf)
Not evolutionary — purely generative with massive oversampling and filtering:
1. Generate ~1 million candidate solutions per problem
2. Filter by executing against example test cases (eliminates ~99%)
3. Cluster remaining by behavioral similarity (run on synthetic inputs, group
programs producing identical outputs)
4. Select one representative per cluster, rank by scoring model, submit top 10
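
Steps 2 and 3 can be illustrated with a toy version. This is a sketch under assumptions: candidates are plain Python callables rather than generated source, `filter_and_cluster` is a hypothetical name, and ordering clusters by size is a simple stand-in for the learned scoring model.

```python
from collections import defaultdict


def _safe_call(fn, x):
    """Run a candidate; any exception becomes a sentinel output."""
    try:
        return fn(x)
    except Exception:
        return "<error>"


def filter_and_cluster(candidates, examples, probe_inputs):
    """Filter by example tests, then group survivors by behavior.

    candidates:   dict of name -> callable
    examples:     list of (input, expected_output) pairs
    probe_inputs: extra inputs used only to fingerprint behavior
    Returns a list of clusters (lists of names), largest cluster first.
    """
    survivors = {
        name: fn
        for name, fn in candidates.items()
        if all(_safe_call(fn, x) == want for x, want in examples)
    }
    clusters = defaultdict(list)
    for name, fn in survivors.items():
        signature = tuple(repr(_safe_call(fn, x)) for x in probe_inputs)
        clusters[signature].append(name)
    return sorted(clusters.values(), key=len, reverse=True)


candidates = {
    "sq_a": lambda x: x * x,
    "sq_b": lambda x: x ** 2,
    "odd": lambda x: x * x if x >= 0 else -x * x,  # differs on negatives
    "wrong": lambda x: x + 1,                      # fails the examples
}
clusters = filter_and_cluster(candidates, [(2, 4), (3, 9)], [-2, 0, 5])
```

Here `wrong` is filtered out, `sq_a` and `sq_b` share a cluster, and `odd` forms its own cluster because it disagrees on the negative probe input.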
**AlphaCode 2:** 85th percentile on Codeforces (vs ~50th for v1). Uses Gemini
Pro with dedicated scoring/reranking model.
**Relevance:** The "generate many, filter by execution, cluster by behavior"
pattern is useful for the initial seeding phase — generate many candidate bots,
test them all, keep the diverse winners.
---
## 5. Voyager (NVIDIA / MineDojo, 2023)
**Paper:** [arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291)
**Code:** [github.com/MineDojo/Voyager](https://github.com/MineDojo/Voyager)
LLM agent that writes its own code in Minecraft.
**Three components:**
1. **Automatic Curriculum** — generates increasingly difficult objectives
2. **Skill Library** — persistent, growing collection of verified code snippets.
New tasks retrieve relevant skills by embedding similarity. Successful new
skills get added.
3. **Iterative Prompting with Self-Verification:**
- GPT-4 generates code for a task
- Code executes in environment
- Environment feedback + errors collected
- Self-verification module (also GPT-4) checks completion
- If not complete, feedback fed back and code refined
- Only verified-successful skills enter the library
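
The generate, execute, verify loop in component 3 can be sketched with the LLM and environment injected as callables. Everything here is a hypothetical stand-in: `learn_skill` and the `fake_*` stubs are illustrative names, the library is a plain dict, and retrieval by embedding similarity is omitted.

```python
def learn_skill(task, generate, execute, verify, library, max_rounds=4):
    """Generate, run, and verify code for a task; store it only on success.

    generate(task, feedback, library) -> code   (stands in for the LLM call)
    execute(code) -> result                     (stands in for the environment)
    verify(task, result) -> (ok, critique)      (stands in for self-verification)
    """
    feedback = ""
    for _ in range(max_rounds):
        code = generate(task, feedback, library)
        result = execute(code)
        ok, critique = verify(task, result)
        if ok:
            library[task] = code  # only verified skills enter the library
            return code
        feedback = f"result: {result}; critique: {critique}"
    return None


# Toy stubs: the first attempt fails, the retry (which sees feedback) succeeds.
def fake_generate(task, feedback, library):
    return "craft_planks()" if feedback else "craft_table()"


def fake_execute(code):
    return "ok" if code == "craft_planks()" else "missing planks"


def fake_verify(task, result):
    ok = result == "ok"
    return ok, "" if ok else "gather prerequisites first"


skills = {}
learned = learn_skill("make planks", fake_generate, fake_execute,
                      fake_verify, skills)
```

The same shape fits bot evolution: `execute` becomes a match in the game engine and `verify` becomes a check on the match result.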
**Results:** 3.3x more unique items, 15.3x faster tech-tree progression vs
prior SOTA. Skills transfer to new worlds.
**Relevance:** The skill library pattern — decomposing strategy into reusable
verified components — could apply to bot strategies (e.g., verified pathfinding,
verified formation combat, verified scouting behaviors composed into full bots).
---
## 6. Game-Bot-Specific Systems
### LLM-PSRO (IJCAI 2025) — Most Directly Relevant
**Paper:** [ijcai.org/proceedings/2025/1249](https://www.ijcai.org/proceedings/2025/1249)
The most directly applicable system. Builds on Policy-Space Response Oracles (PSRO):
1. Start with a population of bots (hand-written or LLM-generated)
2. Run round-robin tournaments → build payoff matrix
3. Compute **Nash equilibrium** mixture over the current population
4. Prompt the LLM to generate a new bot that **beats the Nash mixture**,
providing the losing bot's code and match results as context
5. Add new bot to population
6. Repeat
**Why Nash matters:** You can only add bots that improve the population's
game-theoretic profile. This is mathematically principled regression prevention
— the new bot must beat the optimal mixed strategy, not just one opponent.
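
A rough sketch of the Nash gate follows. It assumes a symmetric zero-sum payoff matrix (`payoff[i][j]` is bot i's payoff against bot j, with `payoff[j][i] == -payoff[i][j]`) and approximates the equilibrium with fictitious play, which is a swapped-in solver chosen for brevity, not necessarily what the paper uses. Function names are hypothetical.

```python
def nash_mixture(payoff, iterations=20000):
    """Approximate a Nash mixture of a symmetric zero-sum game.

    Uses self-play fictitious play: repeatedly play the best response to
    the empirical history, then return the empirical frequencies.
    """
    n = len(payoff)
    counts = [0] * n
    counts[0] = 1  # arbitrary first move
    # values[i] = total payoff of pure strategy i against the history so far
    values = [payoff[i][0] for i in range(n)]
    for _ in range(iterations):
        best = max(range(n), key=values.__getitem__)
        counts[best] += 1
        for i in range(n):
            values[i] += payoff[i][best]
    total = sum(counts)
    return [c / total for c in counts]


def expected_vs_mixture(candidate_payoffs, mixture):
    """Expected payoff of a candidate against the population mixture."""
    return sum(p * w for p, w in zip(candidate_payoffs, mixture))


def passes_gate(candidate_payoffs, mixture, margin=0.0):
    """Admit the candidate only if it beats the mixture by more than margin."""
    return expected_vs_mixture(candidate_payoffs, mixture) > margin


# Rock-paper-scissors as a three-bot population: rows beat or lose to columns.
rps = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
mixture = nash_mixture(rps)  # close to uniform
```

In practice the candidate's payoffs come from matches against every bot in the population, and a positive `margin` guards against noise in those match results.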
### CATArena (2025)
**Paper:** [arxiv.org/abs/2510.26852](https://arxiv.org/abs/2510.26852)
**Code:** [github.com/AGI-Eval-Official/CATArena](https://github.com/AGI-Eval-Official/CATArena)
LLM code agents play Gomoku, Texas Hold'em, Chess, Bridge. Agents refine
through **self-reflection** (analyzing own losses) and **peer-learning**
(reading opponent code).
**Key finding:** Evolutionary potential doesn't correlate with initial
proficiency — some weaker initial agents evolve faster.
### AlphaCodium (CodiumAI / Qodo)
**Paper:** [arxiv.org/abs/2401.08500](https://arxiv.org/abs/2401.08500)
**Code:** [github.com/Codium-ai/AlphaCodium](https://github.com/Codium-ai/AlphaCodium)
Two-phase iterative flow: pre-processing (self-reflection, test reasoning)
then generate-execute-refine against tests. Boosted GPT-4 from 19% to 44%
(pass@5) on CodeContests with only 15-20 LLM calls per solution.
### STOP — Self-Taught Optimizer (Microsoft Research, COLM 2024)
**Paper:** [arxiv.org/abs/2310.02304](https://arxiv.org/abs/2310.02304)
**Code:** [github.com/microsoft/stop](https://github.com/microsoft/stop)
A seed "improver" program that calls GPT-4 to improve code, then is run on
itself to improve the improver. The self-improved improver discovers strategies
like beam search, genetic algorithms, and simulated annealing — on its own.
---
## 7. Evaluation / Selection Mechanisms
| System | Selection Mechanism |
|--------|---------------------|
| **FunSearch / AlphaEvolve** | Island model with score-based cluster sampling. Higher scores sampled more. Islands prevent premature convergence. |
| **OpenELM** | MAP-Elites quality-diversity grid. New candidate replaces niche inhabitant only if it scores higher. Maintains diversity across behavior dimensions. |
| **AlphaCode** | Generate millions, filter by execution, cluster by behavior, rank by scoring model. Pure oversampling, no evolution. |
| **LLM-PSRO** | Nash equilibrium over population. New bots must beat the Nash mixture. Theoretically grounded regression prevention. |
| **CATArena** | Dual-metric: static proficiency vs evolutionary potential. Global win rate preferred over Elo for stability. |
| **ShinkaEvolve** | Parent sampling balancing exploration/exploitation + novelty rejection-sampling (rejects candidates too similar to existing population). |
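
The novelty rejection idea in the last row can be illustrated with a cheap text-similarity check. This is only a sketch: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, the `is_novel` name and the 0.9 threshold are assumptions, and a real system would likely compare embeddings instead.

```python
from difflib import SequenceMatcher


def is_novel(candidate, population, max_similarity=0.9):
    """Reject a candidate whose source is too close to any existing program."""
    return all(
        SequenceMatcher(None, candidate, existing).ratio() < max_similarity
        for existing in population
    )


population = ["def decide(state):\n    return 'attack'\n"]
duplicate = "def decide(state):\n    return 'attack'\n"
different = "import random\n\ndef decide(state):\n    return random.choice(state.moves)\n"

keep_duplicate = is_novel(duplicate, population)
keep_different = is_novel(different, population)
```

Rejecting near-duplicates before evaluation saves match time, since a copy of an existing bot adds nothing to the payoff matrix.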
---
## 8. Code Sandboxing for LLM-Generated Code
| Solution | Isolation Level | Overhead | Used By |
|----------|----------------|----------|---------|
| **Firecracker MicroVMs** | Strongest (own kernel) | <200ms boot, <5 MiB/VM | AWS Lambda, E2B ([e2b.dev](https://e2b.dev/)), ~50% Fortune 500 |
| **gVisor** | Strong (userspace kernel) | Low | GKE Sandbox, [k8s agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) |
| **nsjail** | Moderate (namespaces + seccomp) | Minimal | FunSearch evaluators (150 nodes) |
| **WASM** | Moderate (no fs/network) | Near-native | Constrained execution environments |
**Recommendation for bot evolution:** nsjail for high-throughput evaluation
(you control the game engine); Firecracker/E2B if executing fully arbitrary
LLM code with network/filesystem access.
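
Whatever sandbox is chosen, the evaluator needs a harness that runs a candidate with a time limit and reports a status. The sketch below shows only that harness shape using the standard library. It provides no real isolation: in a deployment the command would be wrapped by nsjail or run inside a microVM. The `run_candidate` name and the result fields are hypothetical.

```python
import subprocess
import sys


def run_candidate(source, stdin_text="", timeout_s=5.0):
    """Run candidate Python source in a child process with a wall-clock limit.

    NOTE: this is not a sandbox. It only enforces a timeout and uses
    Python's isolated mode (-I) to ignore user site-packages and env vars.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout,
            "returncode": proc.returncode}
```

Treating timeouts and crashes as ordinary results, rather than exceptions, lets the evolution loop score a broken bot as a loss and move on.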
**Reference:** [github.com/restyler/awesome-sandbox](https://github.com/restyler/awesome-sandbox)
---
## 9. Recommended Architecture for AI Code Battle
Based on the systems above, the **FunSearch/AlphaEvolve island model +
LLM-PSRO game-theoretic selection** combination is the best fit:
```
┌──────────────────────────────────────────────────────────┐
│                    Programs Database                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Island 1 │  │ Island 2 │  │ Island 3 │  │ Island 4 │  │
│  │ (Python) │  │   (Go)   │  │  (Rust)  │  │ (mixed)  │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
└─────────────────────────────┬────────────────────────────┘
                              │ sample 2-3 parents + match replays
                    ┌─────────▼──────────┐
                    │   Prompt Builder   │
                    │ • Parent code      │
                    │ • Recent losses    │
                    │ • Replay analysis  │
                    │ • "Beat this meta" │
                    └─────────┬──────────┘
                    ┌─────────▼──────────┐
                    │    LLM Ensemble    │
                    │ • Fast model       │
                    │   (exploration)    │
                    │ • Strong model     │
                    │   (exploitation)   │
                    └─────────┬──────────┘
                    ┌─────────▼──────────┐
                    │  Build & Validate  │
                    │ • Compile/lint     │
                    │ • Schema test      │
                    │ • Sandbox execute  │
                    └─────────┬──────────┘
                    ┌─────────▼──────────┐
                    │  Tournament Gate   │
                    │ • Play vs current  │
                    │   population       │
                    │ • Must beat Nash   │
                    │   mixture (PSRO)   │
                    └─────────┬──────────┘
                              │ promote if better
                    ┌─────────▼──────────┐
                    │     Deploy as      │
                    │     Container      │
                    │ • Build image      │
                    │ • Register bot     │
                    │ • Enter ladder     │
                    └────────────────────┘
```
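
As one concrete piece of the Build & Validate stage, the sketch below checks a Python bot before it is ever executed. It is illustrative only: the required entry point `decide(state)` is a hypothetical bot interface, not something defined by this project, and the check covers syntax and shape only.

```python
import ast

REQUIRED_FUNCTION = "decide"  # hypothetical bot entry point


def validate_bot_source(source):
    """Return a list of problems; an empty list means the source passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    functions = {
        node.name: node
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }
    entry = functions.get(REQUIRED_FUNCTION)
    if entry is None:
        problems.append(f"missing top-level function {REQUIRED_FUNCTION}()")
    elif len(entry.args.args) != 1:
        problems.append(
            f"{REQUIRED_FUNCTION}() must take exactly one positional parameter"
        )
    return problems
```

Because this only parses the source, it is safe to run outside the sandbox, and its error messages can be fed back into the next prompt.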
### References
- [FunSearch — GitHub](https://github.com/google-deepmind/funsearch)
- [OpenEvolve — GitHub](https://github.com/algorithmicsuperintelligence/openevolve)
- [ShinkaEvolve — GitHub](https://github.com/SakanaAI/ShinkaEvolve)
- [OpenELM — GitHub](https://github.com/CarperAI/OpenELM)
- [AlphaCode Dataset — GitHub](https://github.com/google-deepmind/code_contests)
- [Voyager — GitHub](https://github.com/MineDojo/Voyager)
- [LLM-PSRO — IJCAI 2025](https://www.ijcai.org/proceedings/2025/1249)
- [CATArena — GitHub](https://github.com/AGI-Eval-Official/CATArena)
- [AlphaCodium — GitHub](https://github.com/Codium-ai/AlphaCodium)
- [STOP — GitHub](https://github.com/microsoft/stop)
- [Awesome LLM Game Agent Papers](https://github.com/git-disl/awesome-LLM-game-agent-papers)
- [Awesome Self-Evolving Agents](https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents)
- [nsjail — GitHub](https://github.com/google/nsjail)
- [E2B Sandbox](https://e2b.dev/)
- [Awesome Sandbox](https://github.com/restyler/awesome-sandbox)