Move web tier to Cloudflare free plan, compute to Rackspace Spot

Architecture split: Cloudflare Pages (static site), Worker (API + cron
scheduling), D1 (SQLite database), and R2 (replays + JSON indexes, zero
egress), all within the free tier with 95%+ headroom on every request and
operation quota. Rackspace Spot handles match workers, bot containers, and
the evolution pipeline, all stateless and interruptible.

Includes D1 schema, Worker cron design (matchmaker, indexer, health checker,
reaper), R2 bucket layout, free tier usage math, and graceful degradation
model. Drops infrastructure cost from ~$65-110/mo to ~$35-70/mo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 22:16:56 -04:00 · e41597f65b
commit e41597f65b
parent 512dfc201d
1 changed file with 431 additions and 335 deletions
--- a/docs/plan/plan.md
+++ b/docs/plan/plan.md
@@ -16,71 +16,72 @@ implementations for the HTTP protocol.

 ## 2. System Architecture

-The platform is designed around a **static-first** principle: the website is a
-static site that loads JSON data files directly. There is no application server
-rendering pages. All dynamic state (leaderboards, match history, bot profiles)
-is pre-computed as JSON files by backend processes and served as static assets
-alongside the HTML/JS/CSS.
+The platform is split across two tiers:
+
+1. **Cloudflare (free tier)** — all web-facing infrastructure: static site,
+   API endpoints, database, file storage, and scheduling logic
+2. **Rackspace Spot** — all compute: match execution, bot hosting, evolution
+   pipeline
+
+This split maps cleanly to each provider's strength. Cloudflare excels at
+serving content globally with zero egress cost. Rackspace Spot provides cheap
+interruptible compute for the CPU-intensive match simulation.

 ```
-┌─────────────────────────────────────────────────────────────────────┐
-│                    Static Website (Nginx)                            │
-│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────────┐ │
-│  │  Leaderboard  │  │ Match History │  │    Replay Viewer (Canvas) │ │
-│  │  (loads JSON) │  │  (loads JSON)│  │    (loads replay JSON)    │ │
-│  └──────────────┘  └──────────────┘  └───────────────────────────┘ │
-└──────────────────────────────┬──────────────────────────────────────┘
-                               │ fetches static JSON files
-                    ┌──────────▼──────────┐
-                    │  Data Directory      │
-                    │  (filesystem, served │
-                    │   by Nginx)          │
-                    │  ┌────────────────┐  │
-                    │  │ /replays/*.json│  │
-                    │  │ /data/         │  │
-                    │  │   leaderboard  │  │
-                    │  │   matches      │  │
-                    │  │   bots         │  │
-                    │  │ /maps/*.json   │  │
-                    │  └────────────────┘  │
-                    └──────────┬───────────┘
-                               │ writes JSON to disk
-              ┌────────────────┼────────────────┐
-              │                │                │
-     ┌────────▼───────┐ ┌─────▼──────┐ ┌───────▼────────┐
-     │  Match Workers  │ │ Scheduler  │ │  Registration   │
-     │  (run matches,  │ │ (create    │ │  API (minimal,  │
-     │   POST results  │ │  jobs,     │ │  bot signup +   │
-     │   to scheduler) │ │  rebuild   │ │  health check)  │
-     │                 │ │  indexes)  │ │                 │
-     └────────┬───────┘ └────────────┘ └────────────────┘
-              │ HTTP (per-turn requests)
-              │
-     ┌────────▼──────────────────────────────┐
-     │  Bot Endpoints                         │
-     │  ┌────────────┐ ┌──────────┐ ┌──────┐ │
-     │  │ Participant │ │ Built-in │ │ EVO  │ │
-     │  │ Bots       │ │ Strategy │ │ Bots │ │
-     │  │ (external) │ │ Bots     │ │      │ │
-     │  └────────────┘ └──────────┘ └──────┘ │
-     └───────────────────────────────────────┘
+┌─────────────────────── Cloudflare (free tier) ───────────────────────┐
+│                                                                       │
+│  ┌─────────────┐   ┌──────────────────┐   ┌───────────────────────┐  │
+│  │  Pages       │   │  Worker (acb-api) │   │  R2 Bucket            │  │
+│  │  static site │   │  registration,    │   │  replays/*.json.gz    │  │
+│  │  HTML/JS/CSS │   │  job coordination,│   │  data/leaderboard.json│  │
+│  │              │   │  cron triggers    │   │  data/bots/*.json     │  │
+│  └──────┬──────┘   └────────┬─────────┘   │  data/matches/*.json  │  │
+│         │                   │              │  maps/*.json          │  │
+│         │ fetches JSON      │ reads/writes └───────────┬───────────┘  │
+│         └───────────────────┼─────────────────────────►│              │
+│                             │                                         │
+│                    ┌────────▼────────┐                                │
+│                    │  D1 Database     │                                │
+│                    │  bots, matches,  │                                │
+│                    │  jobs, ratings   │                                │
+│                    └─────────────────┘                                │
+└──────────────────────────────┬───────────────────────────────────────┘
+                               │ HTTPS (job coordination + result submission)
+                               │
+┌──────────────────────── Rackspace Spot ──────────────────────────────┐
+│                                                                       │
+│  ┌──────────────────┐    ┌──────────────────────────────────────────┐ │
+│  │  Match Workers    │    │  Bot Containers                          │ │
+│  │  (claim jobs,     │───►│  ┌──────────┐ ┌──────────┐ ┌──────────┐│ │
+│  │   run simulation, │HTTP│  │ Strategy  │ │ Evolved  │ │ External ││ │
+│  │   upload replay   │    │  │ Bots (×6) │ │ Bots     │ │ Bots     ││ │
+│  │   to R2, POST     │    │  └──────────┘ └──────────┘ └──────────┘│ │
+│  │   result to API)  │    └──────────────────────────────────────────┘ │
+│  └──────────────────┘                                                 │
+│                                                                       │
+│  ┌──────────────────┐                                                 │
+│  │  Evolver          │                                                │
+│  │  (LLM pipeline,  │                                                 │
+│  │   sandbox, eval)  │                                                │
+│  └──────────────────┘                                                 │
+└──────────────────────────────────────────────────────────────────────┘
 ```

 ### Component Summary

-| Component | Role | Scaling Model |
-|-----------|------|---------------|
-| Static Website | HTML/JS/CSS SPA served by Nginx — fetches JSON data files for all dynamic content | Single Nginx on stable instance |
-| Data Directory | All platform data as static JSON files on disk — replays, leaderboard, match index, bot profiles, maps — served by Nginx as a second virtual host | Persistent volume on stable instance |
-| Match Worker | Runs game simulation, POSTs replay + result JSON back to scheduler | Stateless containers on Rackspace Spot |
-| Scheduler | Creates match jobs, receives results from workers, writes JSON to data directory, rebuilds indexes | Single process, always-on |
-| Registration API | Minimal HTTP endpoint for bot signup, health check, secret generation | Single lightweight process, always-on |
-| Strategy Bots | Built-in HTTP bots (one container each) | Always-on, lightweight |
+| Component | Where | Role |
+|-----------|-------|------|
+| **Pages** | Cloudflare | Static site — HTML/JS/CSS SPA, fetches JSON from R2 |
+| **Worker** | Cloudflare | API endpoints (registration, job coordination) + cron triggers (matchmaking, index rebuilds, health checks) |
+| **D1** | Cloudflare | SQLite database — bot registry, match queue, ratings, results |
+| **R2** | Cloudflare | Object storage — replay files, pre-built JSON indexes (leaderboard, bot profiles, match lists), maps |
+| **Match Workers** | Rackspace Spot | Stateless match execution — claim job from Worker API, run simulation, upload replay to R2, POST result |
+| **Bot Containers** | Rackspace Spot | Strategy bots (×6) + evolved bots (0–50) — HTTP servers called by workers during matches |
+| **Evolver** | Rackspace Spot | Evolution pipeline — LLM generation, sandbox validation, evaluation matches |

-**What's intentionally absent:** no PostgreSQL, no Redis, no Minio, no object
-store, no read replicas, no PgBouncer, no WebSocket server. The data layer is
-flat JSON files on the filesystem, served directly by Nginx. The scheduler
-writes files to disk and workers submit results over HTTP.
+**What's intentionally absent:** no PostgreSQL, no Redis, no always-on VPS for
+web infrastructure, no Nginx, no reverse proxy. Cloudflare handles TLS, CDN,
+DNS, storage, and compute-at-edge for the entire web-facing tier at zero cost.

 ---

@@ -864,10 +865,10 @@ encoding — only recording events that changed from the previous turn.

 ### 7.2 Storage

-Replays are plain JSON files on disk, served directly to the browser by
-Nginx as static assets. No API intermediary.
+Replays are stored in **Cloudflare R2** and served to the browser via R2's
+custom domain with zero egress cost. No API intermediary for reads.

-**Data directory layout** (`/var/acb/data/`, served by Nginx):
+**R2 bucket layout** (public-read via custom domain):
 ```
 replays/{match_id}.json.gz           # individual replay files
 maps/{map_id}.json                   # map definitions
@@ -881,33 +882,25 @@ data/evolution/meta.json             # current meta/Nash snapshot
 ```

 **How data flows:**
-1. Match worker completes a match → POSTs replay JSON + result JSON to the
-   scheduler's ingest endpoint over Tailscale
-2. Scheduler writes the files to disk (data directory)
-3. Scheduler rebuilds `leaderboard.json`, updates `bots/{bot_id}.json`,
-   appends to `matches/index.json`
-4. Nginx serves the data directory — the static site fetches files directly
+1. Match worker completes a match → uploads `replay.json.gz` directly to R2
+   via S3-compatible API (worker has a scoped R2 API token)
+2. Worker POSTs small result metadata to the Cloudflare Worker API endpoint
+3. Worker API writes match result to D1
+4. Index rebuilder cron (every 2 min) reads new results from D1, rebuilds
+   `leaderboard.json`, `bots/*.json`, `matches/index.json`, writes to R2
+5. Static site (Pages) fetches these JSON files from R2's custom domain
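
The flow above keeps the large replay file out of the Worker; only a small metadata record crosses the API. A minimal sketch of that record and the replay key, assuming field names that mirror the `matches` table in §8.3. The plan does not specify the payload shape, so `MatchResult`, `replay_key`, and `buildResultRequest` are hypothetical names:

```typescript
// Hypothetical result payload; field names mirror the `matches` table columns.
interface MatchResult {
  match_id: string;
  winner: number | null; // player slot, or null when there is no winner (assumption)
  condition: string;     // how the match ended
  turn_count: number;
  scores: number[];      // one entry per player slot
}

// R2 object key for a replay, following the layout `replays/{match_id}.json.gz`.
function replayKey(matchId: string): string {
  return `replays/${matchId}.json.gz`;
}

// Describe the request a match worker would send to POST /api/jobs/{id}/result.
// The replay itself is uploaded to R2 separately; only its key is included here.
function buildResultRequest(jobId: string, result: MatchResult) {
  return {
    method: "POST",
    path: `/api/jobs/${jobId}/result`,
    body: JSON.stringify({ ...result, replay_key: replayKey(result.match_id) }),
  };
}
```

Sending only the key keeps the Worker request small no matter how large the replay is.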

 **Retention:**
 - Indefinite for top-100 matches per month
-- Older replays pruned after 90 days (metadata kept)
+- Older replays pruned after 90 days (metadata in D1 kept)
 - Index files are append-with-rotation: `index.json` holds the last 1000;
  older pages at `index-{page}.json`
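
The append-with-rotation rule can be sketched as a pure function that splits a newest-first list into index files. This is an illustration only: the plan does not say how pages are numbered, so the 1-based numbering that counts backwards in time, and the `rotateIndex` name, are assumptions.

```typescript
// Split match entries (newest first) into index files of at most `pageSize` entries.
// Assumption: `index.json` holds the newest entries and `index-1.json`,
// `index-2.json`, ... hold progressively older ones.
function rotateIndex<T>(entries: T[], pageSize = 1000): Map<string, T[]> {
  const files = new Map<string, T[]>();
  files.set("data/matches/index.json", entries.slice(0, pageSize));
  for (let start = pageSize, page = 1; start < entries.length; start += pageSize, page++) {
    files.set(`data/matches/index-${page}.json`, entries.slice(start, start + pageSize));
  }
  return files;
}
```

With this numbering every page shifts as new matches arrive, so each rebuild rewrites all pages. Numbering pages forward from the oldest match would keep old pages immutable, at the cost of the client needing to know the highest page number.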

-**Nginx config** (simplified):
-```nginx
-server {
-    server_name data.aicodebattle.com;
-    root /var/acb/data;
-    autoindex off;
-
-    location / {
-        add_header Access-Control-Allow-Origin *;
-        add_header Cache-Control "public, max-age=60";
-        gzip_static on;  # serve .json.gz files directly
-    }
-}
-```
+**R2 free tier usage at this scale:**
+- Writes (Class A): ~43K/month (replays + index rebuilds) vs 1M limit
+- Reads (Class B): ~30K/month (page views loading JSON) vs 10M limit
+- Storage: ~3–5 GB after 90 days (well under 10 GB limit)
+- Egress: always free, unlimited

 ### 7.3 Browser Replay Viewer

@@ -915,14 +908,14 @@ The replay viewer is a client-side TypeScript application rendered on
 HTML5 Canvas.

 **Rendering pipeline:**
-1. Fetch `replay.json.gz` directly from the data directory (Nginx serves it;
-   browser handles gzip decompression via `Accept-Encoding`)
+1. Fetch `replay.json.gz` from R2 custom domain (zero egress cost; browser
+   decompresses transparently if served with `Content-Encoding: gzip`)
 2. Parse and index: build per-turn game state by replaying events from turn 0
 3. Render the current turn to canvas
 4. User controls advance/rewind the turn index

-No API calls involved — the viewer is a pure static page loading a static
-JSON file.
+No Worker invocations — the viewer is a static Pages page loading a file
+directly from R2.

 **Visual design:**

@@ -957,102 +950,194 @@ replay viewer is the landing page for any match. No login required to watch.

 ## 8. Web Platform

-The website is a **static site** — HTML, JS, CSS, and nothing else. Every
-dynamic-looking page (leaderboard, match history, bot profiles) works by
-fetching pre-built JSON files from the data directory and rendering them
-client-side. There is no server-side rendering, no session management, no
-database queries at page-load time.
+The web-facing platform runs entirely on Cloudflare's free tier: **Pages**
+for the static site, a **Worker** for the API and scheduling logic, **D1**
+for the database, and **R2** for file storage.

-### 8.1 Static Site Structure
+### 8.1 Cloudflare Pages (Static Site)
+
+The website is a static SPA deployed to Cloudflare Pages. Every page that
+shows dynamic content fetches pre-built JSON files from R2 and renders
+client-side.

 ```
 /                          → Landing page, featured replays, leaderboard summary
-/leaderboard               → Full leaderboard (fetches /data/leaderboard.json)
-/matches                   → Match history (fetches /data/matches/index.json)
-/replay/{match_id}         → Replay viewer (fetches /replays/{match_id}.json.gz)
-/bot/{bot_id}              → Bot profile (fetches /data/bots/{bot_id}.json)
-/evolution                 → Evolution dashboard (fetches /data/evolution/*.json)
-/register                  → Bot registration form (submits to Registration API)
+/leaderboard               → Full leaderboard (fetches leaderboard.json from R2)
+/matches                   → Match history (fetches matches/index.json from R2)
+/replay/{match_id}         → Replay viewer (fetches replay .json.gz from R2)
+/bot/{bot_id}              → Bot profile (fetches bots/{bot_id}.json from R2)
+/evolution                 → Evolution dashboard (fetches evolution/*.json from R2)
+/register                  → Bot registration form (submits to Worker API)
 /docs                      → Protocol spec, starter kit links, getting started
 ```

-**Build:** the static site is built once (e.g., Vite + vanilla TS, or a
-lightweight framework) and deployed as files to CDN or an Nginx container.
-No build-time data fetching — all data is loaded at runtime from the
-data directory served by Nginx.
+**Build:** Vite + TypeScript, deployed via `wrangler pages deploy` or git
+integration. 500 builds/month on the free tier (ample for daily deploys).
+No build-time data fetching — all data loaded at runtime.

-**Data loading pattern:** every page that shows dynamic data does:
+**Data loading pattern:**
 ```js
-const data = await fetch('https://aicodebattle.com/data/leaderboard.json')
+const R2_BASE = 'https://data.aicodebattle.com'
+const data = await fetch(`${R2_BASE}/data/leaderboard.json`)
 const leaderboard = await data.json()
 // render client-side
 ```

-Stale data is acceptable — the leaderboard JSON is rebuilt every few minutes
-by the scheduler. There is no real-time push. Visitors see data that is at
-most a few minutes old.
+R2 serves these files via custom domain with zero egress cost. Stale data
+is acceptable — JSON indexes are rebuilt every 2 minutes by the Worker cron.
+No real-time push. Visitors see data that is at most ~2 minutes old.

-### 8.2 Registration API
+### 8.2 Cloudflare Worker (API + Scheduling)

-The one exception to the "everything is static" rule. Bot registration
-requires a server-side process because it must:
+A single Worker (`acb-api`) handles all server-side logic. It has D1 and R2
+bindings.

-1. Generate a shared secret
-2. Perform a health check against the bot's endpoint
-3. Write the bot's record to the data store
-
-This is a **minimal HTTP service** — a single Go binary with three endpoints:
+**API endpoints (HTTP routes):**

 ```
-POST /api/register    → register a new bot
-POST /api/rotate-key  → rotate a bot's shared secret
-GET  /api/status/{id} → check bot health status
+POST /api/register         → register a new bot
+POST /api/rotate-key       → rotate a bot's shared secret
+GET  /api/status/{bot_id}  → check bot health status
+GET  /api/jobs/next         → worker claims next pending match job (authenticated)
+POST /api/jobs/{id}/result  → worker submits match result metadata (authenticated)
 ```

+**Cron triggers (5 available on free tier):**
+
+| Cron | Interval | What It Does |
+|------|----------|--------------|
+| Matchmaker | Every 1 min | Queries active bots from D1, computes pairings, inserts job rows |
+| Index rebuilder | Every 2 min | Reads new results from D1, rebuilds leaderboard.json + bot profiles + match index, writes to R2 |
+| Health checker | Every 15 min | Pings each active bot's `/health` endpoint, updates status in D1 |
+| Stale job reaper | Every 5 min | Marks jobs running >15 min as abandoned, resets to pending |
+| (reserved) | — | Available for evolution pipeline trigger |
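
The stale job reaper row reduces to a small pure function over job rows. A minimal sketch, assuming `claimed_at` holds an ISO-8601 timestamp (the schema stores it as `TEXT`) and that a reset job also clears its claim time; `reapStaleJobs` is a hypothetical name, and a real Worker would likely express this as a single D1 `UPDATE`:

```typescript
type JobStatus = "pending" | "running" | "completed";

interface Job {
  job_id: string;
  status: JobStatus;
  claimed_at: string | null; // ISO-8601 timestamp (assumption)
}

// Jobs that have been running longer than 15 minutes are treated as abandoned.
const STALE_AFTER_MS = 15 * 60 * 1000;

// Return a copy of the jobs with stale running jobs reset to pending.
function reapStaleJobs(jobs: Job[], now: Date): Job[] {
  return jobs.map((job): Job => {
    if (job.status !== "running" || job.claimed_at === null) return job;
    const ageMs = now.getTime() - Date.parse(job.claimed_at);
    return ageMs > STALE_AFTER_MS
      ? { ...job, status: "pending", claimed_at: null }
      : job;
  });
}
```

Because a reclaimed spot instance never reports back, this reset is what returns its job to the queue.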
+
+**CPU time budget (10ms free tier):**
+
+All D1 queries, R2 writes, and `fetch()` calls are I/O — they don't count
+against the 10ms CPU limit. Only JavaScript computation counts. At modest
+scale (~50 bots):
+- Matchmaking sort + pairing: <1ms CPU
+- JSON serialization for index rebuilds: <2ms CPU
+- HMAC computation for registration: <1ms CPU
+- All cron triggers fit comfortably within 10ms
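
To make the "sort + pairing" cost concrete, here is one possible pairing step. The plan does not specify the matchmaking algorithm, so pairing rating-adjacent bots into two-player matches is an assumption made for illustration, and `pairByRating` is a hypothetical name:

```typescript
interface RatedBot {
  bot_id: string;
  rating_mu: number;
}

// Sort by rating (descending) and pair adjacent bots so opponents are close in skill.
// With an odd number of bots, the lowest-rated bot sits out this round.
function pairByRating(bots: RatedBot[]): Array<[string, string]> {
  const sorted = [...bots].sort((a, b) => b.rating_mu - a.rating_mu);
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i + 1 < sorted.length; i += 2) {
    pairs.push([sorted[i].bot_id, sorted[i + 1].bot_id]);
  }
  return pairs;
}
```

The work is one O(n log n) sort plus a linear pass, which is a small amount of computation for around 50 bots.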
+
+**Worker authentication for Rackspace endpoints:**
+
+The `/api/jobs/*` endpoints are called by Rackspace match workers. They
+authenticate with a static API key passed in the `Authorization` header.
+The key is stored in the Worker's environment variables (Cloudflare encrypted
+secrets). This prevents unauthorized job claims or result injection.
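
The key check can be sketched as follows. The plan only says the key travels in the `Authorization` header, so the `Bearer` scheme is an assumption, and `isAuthorizedWorker` and `constantTimeEqual` are hypothetical helper names:

```typescript
// Compare two strings without returning early on the first mismatching character.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Assumption: the key is sent as "Authorization: Bearer <key>".
function isAuthorizedWorker(authorizationHeader: string | null, expectedKey: string): boolean {
  if (authorizationHeader === null) return false;
  const prefix = "Bearer ";
  if (!authorizationHeader.startsWith(prefix)) return false;
  return constantTimeEqual(authorizationHeader.slice(prefix.length), expectedKey);
}
```

The Worker would call this at the top of every `/api/jobs/*` handler and return 401 when it fails. The fixed-time comparison avoids leaking how many leading characters of a guessed key were correct.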
+
+### 8.3 Cloudflare D1 (Database)
+
+D1 is a serverless SQLite database accessible from the Worker.
+
+**Schema:**
+
+```sql
+CREATE TABLE bots (
+    bot_id        TEXT PRIMARY KEY,
+    name          TEXT UNIQUE NOT NULL,
+    owner         TEXT NOT NULL,
+    endpoint_url  TEXT NOT NULL,
+    shared_secret TEXT NOT NULL,  -- encrypted, see §4.4
+    status        TEXT NOT NULL DEFAULT 'pending',
+    rating_mu     REAL NOT NULL DEFAULT 1500.0,
+    rating_phi    REAL NOT NULL DEFAULT 350.0,
+    rating_sigma  REAL NOT NULL DEFAULT 0.06,
+    evolved       INTEGER NOT NULL DEFAULT 0,
+    island        TEXT,
+    generation    INTEGER,
+    description   TEXT,
+    created_at    TEXT NOT NULL,
+    last_active   TEXT
+);
+
+CREATE TABLE matches (
+    match_id      TEXT PRIMARY KEY,
+    map_id        TEXT NOT NULL,
+    status        TEXT NOT NULL DEFAULT 'pending',
+    winner        INTEGER,
+    condition     TEXT,
+    turn_count    INTEGER,
+    scores_json   TEXT,
+    created_at    TEXT NOT NULL,
+    completed_at  TEXT
+);
+
+CREATE TABLE match_participants (
+    match_id      TEXT NOT NULL,
+    bot_id        TEXT NOT NULL,
+    player_slot   INTEGER NOT NULL,
+    score         INTEGER,
+    status        TEXT,
+    PRIMARY KEY (match_id, bot_id)
+);
+
+CREATE TABLE jobs (
+    job_id        TEXT PRIMARY KEY,
+    match_id      TEXT NOT NULL,
+    status        TEXT NOT NULL DEFAULT 'pending',
+    config_json   TEXT NOT NULL,
+    claimed_at    TEXT,
+    completed_at  TEXT
+);
+
+CREATE TABLE rating_history (
+    bot_id        TEXT NOT NULL,
+    match_id      TEXT NOT NULL,
+    rating        REAL NOT NULL,
+    recorded_at   TEXT NOT NULL
+);
+```
+
+**Free tier usage at scale:**
+- Writes: ~1,500/day (match results + job state changes + ratings) vs 100K limit
+- Reads: ~50K/day (matchmaking queries + index rebuilds + API lookups) vs 5M limit
+- Storage: <100 MB after months of operation vs 5 GB limit
+
+### 8.4 Bot Registration
+
 **Registration flow:**

-1. Participant fills out a form on the static site (`/register`)
-2. Form submits to the Registration API:
+1. Participant fills out the form on the static site (`/register`)
+2. Form POSTs to the Worker: `POST /api/register`
   - **Bot name** (unique, alphanumeric + hyphens, 3–32 chars)
   - **Endpoint URL** (HTTPS required for competitive; HTTP allowed for dev)
   - **Owner name** (free text, shown on leaderboard)
   - **Description** (optional)
-3. API generates:
-   - `bot_id`: unique identifier (`b_` prefix + 8 hex chars)
-   - `shared_secret`: 256-bit random, hex-encoded (64 chars)
-4. API performs a **health check**: `GET {endpoint_url}/health`
+3. Worker generates:
+   - `bot_id`: `b_` + 8 hex chars (from `crypto.randomUUID()`)
+   - `shared_secret`: 256-bit random, hex-encoded (`crypto.getRandomValues()`)
+4. Worker performs a **health check**: `fetch(endpoint_url + '/health')`
   - Must return 200 within 5 seconds
-5. API performs a **protocol test**: sends a mock game state to
+5. Worker performs a **protocol test**: sends mock game state to
   `POST {endpoint_url}/turn` with valid HMAC
   - Must return valid moves JSON within 3 seconds
-6. API returns the `bot_id` and `shared_secret` to the participant
+6. Worker inserts bot record into D1
+7. Worker returns `bot_id` and `shared_secret` to the participant
   (displayed once — they must save it)
-7. API writes `bots/{bot_id}.json` to the data directory
-8. Scheduler picks up the new bot on its next cycle and adds it to matchmaking
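
The input rules in step 2 and the `bot_id` format in step 3 can be sketched as pure functions. This is a hedged illustration: `validateRegistration`, the `competitive` flag, and the error strings are hypothetical, and the random bytes are passed in by the caller rather than generated here.

```typescript
interface RegistrationForm {
  name: string;
  endpoint_url: string;
  owner: string;
  description?: string;
}

// Alphanumeric + hyphens, 3 to 32 characters.
const BOT_NAME = /^[A-Za-z0-9-]{3,32}$/;

// Return a list of validation errors; an empty list means the form is acceptable.
// `competitive` selects the HTTPS-only rule; dev registrations may use plain HTTP.
function validateRegistration(form: RegistrationForm, competitive: boolean): string[] {
  const errors: string[] = [];
  if (!BOT_NAME.test(form.name)) {
    errors.push("name must be 3-32 alphanumeric or hyphen characters");
  }
  const isHttps = form.endpoint_url.startsWith("https://");
  const isHttp = form.endpoint_url.startsWith("http://");
  if (!isHttps && !(isHttp && !competitive)) {
    errors.push("endpoint_url must use HTTPS");
  }
  if (form.owner.trim() === "") errors.push("owner is required");
  return errors;
}

// Format a bot id as `b_` + 8 hex chars from the first 4 bytes supplied by the caller.
function formatBotId(bytes: Uint8Array): string {
  const hex = Array.from(bytes.slice(0, 4), (b) => ("0" + b.toString(16)).slice(-2)).join("");
  return `b_${hex}`;
}
```

Running validation before the health check and protocol test avoids spending outbound `fetch()` calls on requests that would be rejected anyway.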

-**No user accounts.** Registration is bot-level, not user-level. The owner
-name is self-reported and shown on the leaderboard. The shared secret is the
-only authentication — whoever has it controls the bot (can rotate the key or
-retire the bot). This avoids needing user auth, sessions, email verification,
-OAuth, and password storage.
+**No user accounts.** Registration is bot-level. The owner name is
+self-reported. The shared secret is the only authentication — whoever has
+it can rotate the key or retire the bot. No OAuth, no sessions, no
+password storage.

 **Bot status lifecycle:**
 ```
 PENDING → ACTIVE → INACTIVE (health check failed)
-                  → RETIRED (by owner via rotate-key endpoint with retire flag)
+                  → RETIRED (by owner via /api/rotate-key with retire flag)
 ```

-Only `ACTIVE` bots participate in matchmaking.
+Only `ACTIVE` bots participate in matchmaking. The health checker cron pings
+each `ACTIVE` and `INACTIVE` bot every 15 min. Three consecutive failures →
+`INACTIVE`; a passing check returns an `INACTIVE` bot to `ACTIVE`.
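
The health-driven part of the lifecycle can be sketched as a small state transition. Two assumptions are made here: `INACTIVE` bots keep being probed (otherwise they could never return to `ACTIVE`), and a `consecutiveFailures` counter is tracked somewhere, although no such column appears in the `bots` schema. `applyHealthCheck` is a hypothetical name.

```typescript
type BotStatus = "PENDING" | "ACTIVE" | "INACTIVE" | "RETIRED";

interface BotHealth {
  status: BotStatus;
  consecutiveFailures: number; // hypothetical counter, not a column in the schema
}

const FAILURE_THRESHOLD = 3;

// Apply one health-check outcome and return the updated state.
function applyHealthCheck(bot: BotHealth, passed: boolean): BotHealth {
  // PENDING and RETIRED bots are not affected by the periodic check.
  if (bot.status !== "ACTIVE" && bot.status !== "INACTIVE") return bot;
  if (passed) return { status: "ACTIVE", consecutiveFailures: 0 };
  const failures = bot.consecutiveFailures + 1;
  return {
    status: failures >= FAILURE_THRESHOLD ? "INACTIVE" : bot.status,
    consecutiveFailures: failures,
  };
}
```

A single passing check resets the counter, so only an unbroken run of three failures removes a bot from matchmaking.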

-**Ongoing health checks:** the scheduler pings each active bot's `/health`
-endpoint every 15 minutes. Three consecutive failures → marked `INACTIVE`
-in the bot's JSON file. Bots automatically return to `ACTIVE` when health
-checks resume passing.
+### 8.5 Leaderboard

-### 8.3 Leaderboard
-
-The leaderboard is a **JSON file** (`/data/leaderboard.json`) rebuilt by the
-scheduler every few minutes after new match results arrive.
+The leaderboard is a **JSON file** in R2 (`data/leaderboard.json`) rebuilt
+by the index rebuilder cron every 2 minutes.

 ```json
 {
@@ -1075,14 +1160,13 @@ scheduler every few minutes after new match results arrive.
 }
 ```

-- Client-side sorting and filtering (by player count tier, time range,
-  human-only vs all)
-- No real-time updates — page refresh or auto-refresh every 60 seconds
-- Public — no login required
+The static site fetches this file directly from R2 (no Worker invocation).
+Client-side sorting and filtering (by player count tier, time range,
+human-only vs all). Auto-refresh every 60 seconds. Public — no login.

-### 8.4 Match History & Profiles
+### 8.6 Match History & Profiles

-**Bot profile** (`/bot/{bot_id}`) — fetches `/data/bots/{bot_id}.json`:
+**Bot profile** (`/bot/{bot_id}`) — fetches `data/bots/{bot_id}.json` from R2:
 - Current rating + rating history (array of `[timestamp, rating]` pairs
  rendered as a chart client-side)
 - Recent matches (last 50) with links to replay viewer
@@ -1090,13 +1174,13 @@ scheduler every few minutes after new match results arrive.
 - Bot description, owner, registration date
 - If evolved: lineage, generation, island

-**Match list** (`/matches`) — fetches `/data/matches/index.json`:
+**Match list** (`/matches`) — fetches `data/matches/index.json` from R2:
 - Paginated list of recent matches
 - Each entry: match_id, participants, scores, date, link to replay

 **Match detail** (`/replay/{match_id}`):
-- Fetches `/data/matches/{match_id}.json` for metadata
-- Fetches `/replays/{match_id}.json.gz` for the replay
+- Fetches `data/matches/{match_id}.json` from R2 for metadata
+- Fetches `replays/{match_id}.json.gz` from R2 for the replay
 - Embedded replay viewer (auto-plays)
 - Score breakdown, participants, match duration

@@ -1106,196 +1190,205 @@ scheduler every few minutes after new match results arrive.

 ### 9.1 Design Principles

-The deployment is designed for **Rackspace Spot** instances, which means:
+The platform is split across two providers based on their strengths:

-- **Instances can be reclaimed at any time** with a short SIGTERM warning
-- **Pricing is significantly cheaper** than on-demand but availability fluctuates
-- **No persistent local storage** — anything on the instance disappears on reclaim
-- **No guaranteed uptime** — the platform must tolerate all workers disappearing
+- **Cloudflare (free tier)** handles everything web-facing: the site, the
+  API, the database, file storage, and scheduling. This tier has zero cost,
+  zero ops burden (no servers to maintain), and global edge distribution.
+- **Rackspace Spot** handles everything compute-heavy: match execution, bot
+  hosting, and the evolution pipeline. These workloads are stateless and
+  interruptible, which makes them a good fit for spot pricing.

-These constraints shape every decision below. The answer is: keep it simple,
-keep it stateless, and put all durable state in object storage.
+All durable state lives in Cloudflare (D1 + R2). Rackspace instances are
+fully ephemeral — they can be reclaimed at any time with zero data loss.

-### 9.2 Container Architecture
+### 9.2 Cloudflare Tier (Free Plan)

-| Image | Base | Purpose | Where It Runs |
-|-------|------|---------|---------------|
-| `acb-web` | Nginx + static files | Static site (HTML/JS/CSS) | CDN or single stable instance |
-| `acb-register` | Go binary on Alpine | Bot registration API (3 endpoints) | Single stable instance |
-| `acb-scheduler` | Go binary on Alpine | Matchmaking + JSON index rebuilds | Single stable instance |
-| `acb-worker` | Go binary on Alpine | Match execution | Rackspace Spot (1–10) |
-| `acb-evolver` | Go binary on Alpine | Evolution pipeline orchestrator | Rackspace Spot (1) |
-| `acb-strategy-random` | Python 3.13 slim | RandomBot | Stable instance (shared) |
-| `acb-strategy-gatherer` | Go on Alpine | GathererBot | Stable instance (shared) |
-| `acb-strategy-rusher` | Rust on Alpine | RusherBot | Stable instance (shared) |
-| `acb-strategy-guardian` | PHP 8.4 CLI Alpine | GuardianBot | Stable instance (shared) |
-| `acb-strategy-swarm` | Node 22 Alpine | SwarmBot (TypeScript) | Stable instance (shared) |
-| `acb-strategy-hunter` | Temurin 21 JRE Alpine | HunterBot (Java) | Stable instance (shared) |
-| `acb-evolved-*` | Varies by language | LLM-generated evolved bots | Stable instance (shared) |
-
-### 9.3 Rackspace Spot: What Runs Where
-
-**Stable instance (1× small, always-on):**
-
-This is the only always-on server. It runs everything that must survive
-spot reclamation:
+| Service | Usage | Free Limit | Headroom |
+|---------|-------|------------|----------|
+| **Pages** | ~1K views/day | Unlimited bandwidth + requests | Unlimited |
+| **Workers** | ~5K requests/day (API + crons) | 100K requests/day | 95% |
+| **Workers CPU** | <5ms per invocation | 10ms per invocation | 50% |
+| **R2 storage** | ~3–5 GB | 10 GB | 50–70% |
+| **R2 Class A** (writes) | ~43K/month | 1M/month | 96% |
+| **R2 Class B** (reads) | ~30K/month | 10M/month | 99.7% |
+| **R2 egress** | Unlimited | Unlimited (always free) | — |
+| **D1 writes** | ~1.5K/day | 100K/day | 98.5% |
+| **D1 reads** | ~50K/day | 5M/day | 99% |
+| **D1 storage** | <100 MB | 5 GB | 98% |
+| **Cron triggers** | 4 used | 5 per account | 1 spare |
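
The Headroom column is the unused share of each limit. A short sketch of that arithmetic, using the usage estimates and limits from the table; `headroomPercent` is a hypothetical helper name:

```typescript
// Headroom = share of the free-tier limit left unused, as a percentage
// rounded to one decimal place.
function headroomPercent(usage: number, limit: number): number {
  return Math.round((1 - usage / limit) * 1000) / 10;
}

// Usage estimates and limits copied from the table.
const quotas = [
  { name: "Workers requests/day", usage: 5000, limit: 100000 },
  { name: "R2 Class A ops/month", usage: 43000, limit: 1000000 },
  { name: "R2 Class B ops/month", usage: 30000, limit: 10000000 },
  { name: "D1 rows written/day", usage: 1500, limit: 100000 },
  { name: "D1 rows read/day", usage: 50000, limit: 5000000 },
];

// One report line per quota, e.g. "<name>: <headroom>%".
const report = quotas.map((q) => `${q.name}: ${headroomPercent(q.usage, q.limit)}%`);
```

Re-running this with measured usage once the platform is live shows which quota is closest to its limit.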

+**Cloudflare deployment:**
 ```
-Single stable instance (2 vCPU, 4 GB RAM):
-├── acb-web (Nginx, serves static site + data directory)
-├── acb-register (registration API, lightweight)
-├── acb-scheduler (matchmaking, job coordination, index rebuilds)
+Cloudflare Account:
+├── Pages project: aicodebattle.com (static site)
+├── Worker: acb-api
+│   ├── Routes: api.aicodebattle.com/*
+│   ├── Crons: matchmaker (1m), indexer (2m), health (15m), reaper (5m)
+│   ├── D1 binding: ACB_DB
+│   └── R2 binding: ACB_DATA
+├── R2 bucket: acb-data
+│   └── Custom domain: data.aicodebattle.com (public read)
+└── D1 database: acb-db
+```
+
+**What Cloudflare handles:**
+- TLS termination (automatic, free)
+- DNS (Cloudflare nameservers)
+- CDN for static assets (Pages, global edge)
+- DDoS protection (free tier includes basic)
+- File serving with zero egress (R2)
+- Database with automatic backups (D1, 7-day Time Travel)
+
+### 9.3 Rackspace Spot Tier
+
+Everything on Rackspace is stateless and interruptible. All durable state
+is in Cloudflare (D1 + R2).
+
+**Container architecture:**
+
+| Image | Base | Purpose | Instances |
+|-------|------|---------|-----------|
+| `acb-worker` | Go binary on Alpine | Match execution | 1–10 (spot) |
+| `acb-evolver` | Go binary on Alpine | Evolution pipeline | 1 (spot) |
+| `acb-strategy-random` | Python 3.13 slim | RandomBot | 1 |
+| `acb-strategy-gatherer` | Go on Alpine | GathererBot | 1 |
+| `acb-strategy-rusher` | Rust on Alpine | RusherBot | 1 |
+| `acb-strategy-guardian` | PHP 8.4 CLI Alpine | GuardianBot | 1 |
+| `acb-strategy-swarm` | Node 22 Alpine | SwarmBot (TypeScript) | 1 |
+| `acb-strategy-hunter` | Temurin 21 JRE Alpine | HunterBot (Java) | 1 |
+| `acb-evolved-*` | Varies by language | LLM-generated bots | 0–50 |
+
+**Deployment layout:**
+```
+Spot instance A (4 vCPU, 8 GB RAM, "bot host"):
 ├── acb-strategy-* (all 6 built-in bots, ~1 GB total)
-├── acb-evolved-* (0–50 evolved bots, dynamic)
-└── /var/acb/data/ (persistent volume, JSON files served by Nginx)
-```
+└── acb-evolved-* (0–50 evolved bots, dynamic)

-The static site, registration API, scheduler, and all bot HTTP servers share
-one machine. This is feasible because:
-- The static site is Nginx serving files (negligible CPU)
-- The registration API handles ~10 requests/day (negligible)
-- The scheduler runs once every 10 seconds (negligible)
-- Strategy bots are idle between matches (only active during their turns)
-
-**Spot instances (1–10×, preemptible):**
-
-Match workers and the evolution pipeline run on spot instances. These are
-the only compute-intensive workloads.
-
-```
-Spot instance A (2 vCPU, 4 GB RAM):
+Spot instance B (2 vCPU, 4 GB RAM, "worker"):
 └── acb-worker (runs 1 match at a time)

-Spot instance B (2 vCPU, 4 GB RAM):
+Spot instance C (2 vCPU, 4 GB RAM, "worker"):
 └── acb-worker (runs 1 match at a time)

-Spot instance C (4 vCPU, 8 GB RAM):
-└── acb-evolver (evolution pipeline, needs more RAM for LLM context)
+Spot instance D (4 vCPU, 8 GB RAM, "evolver"):
+└── acb-evolver (LLM pipeline, sandbox, evaluation)
 ```

-**If all spot instances are reclaimed:**
-- The website continues to work (static site on stable instance)
-- Leaderboard and replays remain visible (JSON files on disk, served by Nginx)
-- Bot registration still works (registration API on stable instance)
-- Built-in bots remain reachable (on stable instance)
-- **Only match execution pauses** — the queue accumulates jobs
-- When spot instances return, workers drain the queue and catch up
-- The platform gracefully degrades: visitors see stale-but-valid data
+### 9.4 Match Job Coordination

-This is the key benefit of the static-site architecture — the entire
-user-facing experience survives spot reclamation.
+Match workers coordinate through the Cloudflare Worker API. The Worker and D1
+together are the single point of coordination.

-### 9.4 Data Layer: Filesystem + SQLite
-
-There is **no database server**. All platform state lives on the stable
-instance's persistent volume as files.
-
-**Why no PostgreSQL/Redis:**
- The platform has low write volume (~60 matches/hour, ~10 registrations/day)
- All "queries" are pre-computed: the scheduler builds the leaderboard JSON,
-  match index JSON, and bot profile JSONs on a schedule
- The static site just fetches these files — no query engine needed
- Eliminates connection pooling, migrations, schema management, and ORM
-
-**Internal state (scheduler's working data):**
-
-The scheduler maintains a small SQLite database for its own bookkeeping:
- Bot registry (which bots are active, their endpoints, ratings)
- Match queue (pending, in-progress, completed)
- Rating history
-
-This SQLite file lives on the stable instance's persistent volume. It is the
-**source of truth** for platform state. The JSON files in the data directory
-are **materialized views** rebuilt from SQLite.
-
-If the SQLite file is lost, it can be reconstructed from the JSON data files
-on disk (match results contain all the data needed to replay ratings).
-
-**Backup:** daily rsync of the data directory + SQLite file to offsite
-storage.
-
-### 9.5 Match Job Coordination
-
-Workers coordinate with the scheduler via HTTP. The scheduler is the single
-point of coordination — workers are pure HTTP clients.
-
-**Job assignment flow:**
-1. Scheduler creates match jobs in SQLite, marks them `pending`
-2. Worker requests a job: `GET scheduler:9090/jobs/next` (over Tailscale)
-3. Scheduler atomically assigns the job (marks `running`, records worker ID),
-   returns the job JSON (map, bot endpoints, match config)
-4. Worker executes the match (all turns, full simulation)
-5. Worker submits results: `POST scheduler:9090/jobs/{job_id}/result`
-   - Body: replay JSON + match result (scores, winner, turn count)
-6. Scheduler writes replay to data directory, updates SQLite, rebuilds
-   affected JSON index files (leaderboard, bot profiles, match list)
+**Job flow:**
+1. Matchmaker cron creates jobs in D1 (`status: 'pending'`)
+2. Rackspace worker polls: `GET api.aicodebattle.com/api/jobs/next`
+   (authenticated with API key)
+3. Worker API atomically claims the job (D1 transaction: set `status: 'running'`,
+   record `claimed_at`), returns job config JSON including:
+   - Map data (or map_id to fetch from R2)
+   - Bot endpoints + shared secrets for HMAC signing
+   - Match config (turns, radii, etc.)
+4. Rackspace worker executes the full match (500 turns, HTTP calls to bots)
+5. Rackspace worker uploads the replay: `PUT` directly to R2 via the
+   S3-compatible API (scoped R2 API token, `PutObject` only on `replays/`
+   prefix)
+6. Rackspace worker submits result metadata:
+   `POST api.aicodebattle.com/api/jobs/{id}/result`
+   - Small JSON body: scores, winner, turn count, condition
+7. Worker API writes result to D1, marks job `completed`
+8. Index rebuilder cron (next 2-min cycle) reads new results, rebuilds
+   leaderboard.json + bot profiles + match index, writes to R2
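
The claim and result steps above amount to a small state machine (`pending` → `running` → `completed`). The following is a minimal in-memory sketch of those transitions. The `Job` shape, the oldest-first ordering by `createdAt`, and the function names are assumptions for illustration, not the project's D1 schema. In D1 the claim would have to be one atomic write (for example a single conditional `UPDATE`) so that two workers polling at the same moment cannot receive the same job.

```typescript
type JobStatus = "pending" | "running" | "completed";

// Hypothetical job record; field names are illustrative only.
interface Job {
  id: string;
  status: JobStatus;
  createdAt: number;        // epoch ms
  claimedAt: number | null; // epoch ms, set when a worker claims the job
}

// Claim the oldest pending job, or return null when nothing is pending.
function claimNextJob(jobs: Job[], nowMs: number): Job | null {
  let oldest: Job | null = null;
  for (const job of jobs) {
    if (job.status !== "pending") continue;
    if (oldest === null || job.createdAt < oldest.createdAt) oldest = job;
  }
  if (oldest === null) return null;
  oldest.status = "running";
  oldest.claimedAt = nowMs;
  return oldest;
}

// Accept a result only for a job that is currently running.
// Returns false for unknown ids and for duplicate submissions.
function submitResult(jobs: Job[], id: string): boolean {
  for (const job of jobs) {
    if (job.id !== id) continue;
    if (job.status !== "running") return false;
    job.status = "completed";
    return true;
  }
  return false;
}
```

Accepting results only from the `running` state keeps a retried `POST` harmless: the second submission is rejected instead of being recorded twice.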

 **Stale job recovery:**
- Scheduler scans for jobs in `running` state older than 15 minutes
- Assumed abandoned (worker was reclaimed by spot)
- Moved back to `pending` for re-execution
+- Reaper cron checks D1 every 5 minutes for jobs `running` >15 minutes
+- Assumed abandoned (spot instance reclaimed)
+- Reset to `pending` for re-execution
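
The reaper rule can be sketched as a pure function over job records. This is a minimal illustration that assumes a hypothetical `Job` shape with a `claimedAt` timestamp in epoch milliseconds; only the 15-minute threshold and the `running` → `pending` reset come from the plan.

```typescript
// Hypothetical job record; field names are illustrative only.
interface Job {
  id: string;
  status: "pending" | "running" | "completed";
  claimedAt: number | null; // epoch ms
}

// 15 minutes, per the stale job rule.
const STALE_AFTER_MS = 15 * 60 * 1000;

// Reset every job that has been running longer than the threshold.
// Returns the ids that were moved back to pending.
function reapStaleJobs(
  jobs: Job[],
  nowMs: number,
  staleAfterMs: number = STALE_AFTER_MS
): string[] {
  const reset: string[] = [];
  for (const job of jobs) {
    if (job.status !== "running" || job.claimedAt === null) continue;
    if (nowMs - job.claimedAt > staleAfterMs) {
      job.status = "pending";
      job.claimedAt = null;
      reset.push(job.id);
    }
  }
  return reset;
}
```

The length of the returned list is the stale job count for that cycle. One consequence worth noting: a slow worker that was not actually reclaimed may submit a result after its job has been reset, so the result endpoint needs a policy for that case.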

-**Why the scheduler is the coordinator:**
- Single process = no distributed coordination, no race conditions
- SQLite handles the job queue (single-writer is fine at this scale)
- Workers only need to reach one HTTP endpoint — no filesystem access,
-  no filesystem access to the data directory, no shared state
- If the scheduler is down, workers simply can't get jobs (they retry
-  with backoff) — no data corruption risk
+### 9.5 Spot Reclamation Behavior
+
+**If bot-host spot instance is reclaimed:**
+- All built-in + evolved bots go offline
+- Health checker cron detects failures, marks bots `INACTIVE` in D1
+- Matchmaker skips inactive bots — only external bots can play
+- When a new bot-host instance starts, bots come back online, health checks
+  pass, matchmaker resumes including them
+- Matches in progress when a bot disappears: that bot times out on every
+  remaining turn, its units hold position, and it effectively loses
+
+**If all worker instances are reclaimed:**
+- Jobs accumulate as `pending` in D1
+- The website, leaderboard, and replays remain fully functional (Cloudflare)
+- When workers return, they drain the queue
+
+**If everything on Rackspace is gone simultaneously:**
+- Visitors see a working website with stale-but-valid data
+- No matches run, no bots respond to health checks
+- All bots eventually marked inactive
+- Full recovery when any Rackspace instances return
+
+The user-facing experience degrades gracefully because all web infrastructure
+is on Cloudflare, not Rackspace.

 ### 9.6 Networking & Security

-**External traffic:**
- `aicodebattle.com` → Nginx on stable instance (static site + data directory)
- `api.aicodebattle.com` → Registration API on stable instance (3 endpoints)
- All behind Caddy for automatic TLS termination
+**External traffic (Cloudflare):**
+- `aicodebattle.com` → Cloudflare Pages (static site)
+- `data.aicodebattle.com` → R2 public bucket (JSON data + replays)
+- `api.aicodebattle.com` → Cloudflare Worker (API endpoints)
+- TLS, CDN, DDoS protection all handled by Cloudflare automatically

-**Internal traffic (over Tailscale):**
- Workers → scheduler: `GET/POST scheduler:9090/jobs/*` (job coordination)
- Workers → strategy bots on stable instance: HTTP to localhost-bound ports
-  exposed via Tailscale
+**Rackspace → Cloudflare:**
+- Match workers → Worker API: HTTPS to `api.aicodebattle.com` (authenticated
+  with API key in `Authorization` header)
+- Match workers → R2: HTTPS via S3-compatible API (scoped R2 API token)
+
+**Rackspace → Bots (during matches):**
+- Workers → built-in/evolved bots: HTTP within Rackspace private network
+  (or Tailscale if across instances)
 - Workers → external participant bots: outbound HTTPS to registered URLs
- No inbound ports on workers from the public internet

 **Security boundaries:**
 - The game engine (workers) never executes bot code — HTTP only
 - All bot responses are schema-validated before processing
 - HMAC authentication prevents request/response forgery
- Registration API validates bot endpoint URLs (no internal IPs, no localhost,
-  no private ranges)
- Data directory is served read-only by Nginx (no write from outside)
- Scheduler's job coordination endpoint is only reachable over Tailscale
- Registration API is rate-limited (10 registrations/hour max)
+- Job coordination endpoints on the Worker API require an API key
+- R2 API token scoped to `PutObject` on `replays/` prefix only
+- Registration endpoint validates bot URLs (no localhost, no internal IPs, no
+  private ranges)
+- D1 is only accessible from the bound Worker (not publicly queryable)
+- R2 data bucket is public-read — contains no secrets
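
The bot URL check at registration could look roughly like the sketch below. It is an assumption-heavy illustration: the function names, the HTTPS-only rule, the rejection of IPv6 literals, and the exact list of blocked ranges are choices made here, not taken from the plan. It inspects only the URL text and does not resolve DNS, so a public hostname that points at a private address would still pass.

```typescript
// True for IPv4 ranges that should never be registered as bot endpoints.
function isBlockedIPv4(a: number, b: number): boolean {
  return (
    a === 0 || a === 10 || a === 127 ||   // "this" network, RFC 1918, loopback
    (a === 169 && b === 254) ||           // link-local
    (a === 172 && b >= 16 && b <= 31) ||  // RFC 1918
    (a === 192 && b === 168) ||           // RFC 1918
    (a === 100 && b >= 64 && b <= 127) || // CGNAT range, also used by Tailscale
    a >= 224                              // multicast and reserved
  );
}

// Hypothetical validator: HTTPS only, public hostnames or public IPv4 only.
function isAllowedBotUrl(raw: string): boolean {
  const m = /^https:\/\/([^\/?#]+)/i.exec(raw);
  if (m === null) return false;
  const authority = m[1];
  if (authority.indexOf("@") !== -1) return false; // no embedded credentials
  if (authority.charAt(0) === "[") return false;   // no IPv6 literals (assumption)
  const host = authority.replace(/:\d+$/, "").toLowerCase();
  if (host.length === 0) return false;
  if (host === "localhost" || /\.(localhost|local|internal)$/.test(host)) {
    return false;
  }
  const quad = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (quad !== null) {
    const octets = [Number(quad[1]), Number(quad[2]), Number(quad[3]), Number(quad[4])];
    for (const o of octets) {
      if (o > 255) return false; // not a valid IPv4 address
    }
    return !isBlockedIPv4(octets[0], octets[1]);
  }
  return true;
}
```

Because match workers call bot URLs from inside the Rackspace network, the same check is worth repeating at request time rather than only at registration.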

-### 9.7 Cost Model (Rackspace Spot)
+### 9.7 Cost Model

-| Component | Instance Type | Spot? | Est. Monthly |
-|-----------|--------------|-------|-------------|
-| Stable instance | 2 vCPU / 4 GB | No (on-demand) | ~$30–50 |
-| Match workers (×3 avg) | 2 vCPU / 4 GB each | Yes | ~$15–30 |
-| Evolver (×1) | 4 vCPU / 8 GB | Yes | ~$10–20 |
-| Persistent volume | 100 GB block storage | No | ~$10 |
-| **Total** | | | **~$65–110/mo** |
+| Component | Provider | Cost |
+|-----------|----------|------|
+| Pages + Worker + D1 + R2 | Cloudflare | **$0/mo** (free tier) |
+| Bot host (×1 avg) | Rackspace Spot | ~$10–20/mo |
+| Match workers (×2–3 avg) | Rackspace Spot | ~$15–30/mo |
+| Evolver (×1) | Rackspace Spot | ~$10–20/mo |
+| **Infrastructure total** | | **~$35–70/mo** |
+| LLM API (evolution pipeline) | Various | ~$150–600/mo |

-LLM API costs for the evolution pipeline are separate and depend on model
-choice and generation volume. At ~96 candidates/day with a mix of fast/strong
-models, estimate ~$5–20/day ($150–600/mo).
+Compared to the previous architecture ($65–110/mo), moving the web tier to
+Cloudflare saves ~$30–40/mo: the on-demand stable instance and its persistent
+volume (~$40–60/mo) are replaced by a spot bot host (~$10–20/mo). It also
+eliminates all web infrastructure ops (no Nginx config, no TLS certs, no
+volume management, no backup scripts for the data directory).
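
The infrastructure totals and the saving can be checked by summing the per-component ranges from the old and new cost tables. The sketch below does that arithmetic; the type and variable names are illustrative, the dollar figures are the ones in the tables, and LLM API costs are excluded because they are the same in both architectures.

```typescript
interface CostItem {
  name: string;
  low: number;  // USD per month, lower estimate
  high: number; // USD per month, upper estimate
}

// Previous architecture (everything on Rackspace).
const previousCosts: CostItem[] = [
  { name: "Stable instance", low: 30, high: 50 },
  { name: "Match workers", low: 15, high: 30 },
  { name: "Evolver", low: 10, high: 20 },
  { name: "Persistent volume", low: 10, high: 10 },
];

// New architecture (Cloudflare free tier + Rackspace Spot).
const newCosts: CostItem[] = [
  { name: "Pages + Worker + D1 + R2", low: 0, high: 0 },
  { name: "Bot host", low: 10, high: 20 },
  { name: "Match workers", low: 15, high: 30 },
  { name: "Evolver", low: 10, high: 20 },
];

// Sum the low and high estimates independently.
function totalCost(items: CostItem[]): { low: number; high: number } {
  let low = 0;
  let high = 0;
  for (const item of items) {
    low += item.low;
    high += item.high;
  }
  return { low, high };
}

const before = totalCost(previousCosts); // { low: 65, high: 110 }
const after = totalCost(newCosts);       // { low: 35, high: 70 }
const saving = { low: before.low - after.low, high: before.high - after.high };
// saving is { low: 30, high: 40 }
```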

 ### 9.8 Monitoring

-Monitoring is lightweight, matching the simple architecture:
-
 | Signal | Method | Alert |
 |--------|--------|-------|
-| Stable instance up | External ping (UptimeRobot or similar) | Down >2 minutes |
-| Active spot workers | Scheduler tracks last worker heartbeat | 0 workers for >30 minutes |
-| Match throughput | Scheduler counts completions per hour | <10 matches/hour for >1 hour |
-| Data volume disk usage | `df` on persistent volume | >80% |
-| Bot health failures | Scheduler's health check log | >50% of bots failing |
-| Stale jobs | Scheduler's reaper count | >10 stale jobs in a cycle |
+| Site up | Cloudflare analytics (built-in) | Auto |
+| Worker errors | Cloudflare Worker analytics | Error rate >5% |
+| D1 usage | Cloudflare dashboard | Approaching free tier limits |
+| R2 storage | Cloudflare dashboard | >8 GB (approaching 10 GB) |
+| Active Rackspace workers | Worker API tracks last job claim time | No claim in >30 min |
+| Match throughput | D1 query: completions per hour | <10/hour for >1 hour |
+| Bot health failures | D1 query in health checker cron | >50% failing |
+| Stale jobs | Reaper cron count | >10 stale in a cycle |

-Alerts via webhook to a notification channel (Slack, Discord, or email).
-No Prometheus/Grafana stack needed at this scale.
+Alerts via Worker → webhook to Discord/Slack. No external monitoring
+service needed — Cloudflare provides built-in analytics for Pages, Workers,
+R2, and D1.
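
The alert thresholds in the table can be expressed as one pure check that a Worker cron could run before posting to a webhook. This is a sketch under assumptions: the `Metrics` field names and alert labels are hypothetical, and the "for >1 hour" throughput condition is simplified to a single trailing-hour count.

```typescript
// Hypothetical metrics snapshot; field names are illustrative only.
interface Metrics {
  workerErrorRate: number;       // fraction of Worker requests that errored, 0..1
  r2StorageGb: number;           // current R2 usage in GB
  minutesSinceLastClaim: number; // time since any match worker claimed a job
  completionsLastHour: number;   // matches completed in the trailing hour
  botFailureFraction: number;    // fraction of bots failing health checks, 0..1
  staleJobsThisCycle: number;    // jobs reset by the reaper in its last run
}

// Return the label of every alert whose threshold is crossed.
function evaluateAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.workerErrorRate > 0.05) alerts.push("worker-errors");
  if (m.r2StorageGb > 8) alerts.push("r2-storage");
  if (m.minutesSinceLastClaim > 30) alerts.push("no-active-workers");
  if (m.completionsLastHour < 10) alerts.push("low-throughput");
  if (m.botFailureFraction > 0.5) alerts.push("bot-health");
  if (m.staleJobsThisCycle > 10) alerts.push("stale-jobs");
  return alerts;
}
```

Keeping the thresholds in one function means the webhook sender only has to format and post whatever labels come back.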

 ---

@ -1758,50 +1851,53 @@ match with all visual elements rendering correctly.
 ### Phase 4: Match Orchestration

 **Deliverables:**
- Match worker service (`acb-worker`): pulls jobs from scheduler over HTTP,
-  runs matches, POSTs replay + result JSON back to scheduler
- Scheduler (`acb-scheduler`): matchmaking algorithm, serves jobs to workers,
-  receives results, updates SQLite, rebuilds leaderboard/index JSON files
-  in the data directory
- Scheduler's SQLite schema for bot registry, match queue, and ratings
- Stale job reaper (recovers abandoned jobs from reclaimed spot instances)
- Match result → Glicko-2 rating update pipeline
- JSON index rebuilder: leaderboard.json, matches/index.json, bots/*.json
+- Cloudflare Worker (`acb-api`): job coordination endpoints
+  (`/api/jobs/next`, `/api/jobs/{id}/result`), authenticated with API key
+- D1 schema: `bots`, `matches`, `match_participants`, `jobs`,
+  `rating_history` tables
+- Worker cron: matchmaker (1 min), stale job reaper (5 min)
+- Worker cron: index rebuilder (2 min) — reads D1, writes leaderboard.json +
+  bot profiles + match index to R2
+- Match worker container (`acb-worker`): claims jobs from Worker API, runs
+  matches, uploads replays to R2 via S3 API, POSTs results to Worker API
+- Glicko-2 rating update logic in the Worker (runs on result submission)

-**Exit criteria:** scheduler creates match jobs as files, workers pick them
-up and execute autonomously, results flow back as JSON, ratings update, and
-all index files rebuild correctly. System recovers from worker disappearance.
+**Exit criteria:** matchmaker cron creates jobs in D1, Rackspace workers claim
+and execute them, replays land in R2, results flow into D1, ratings update,
+and leaderboard.json rebuilds automatically. System recovers from worker
+disappearance via the stale job reaper.

 ### Phase 5: Web Platform

 **Deliverables:**
- Static site (`acb-web`): leaderboard, match history, bot profiles, replay
-  viewer, registration form, docs/getting-started page
- Registration API (`acb-register`): bot signup, health check, key rotation
-  (3 endpoints, single Go binary)
- Bot health check loop in the scheduler (periodic pings)
- All pages load data by fetching JSON from the data directory — no backend
-  rendering
+- Cloudflare Pages static site: leaderboard, match history, bot profiles,
+  replay viewer, registration form, docs/getting-started page
+- Worker API: registration endpoints (`/api/register`, `/api/rotate-key`,
+  `/api/status/{id}`)
+- Worker cron: health checker (15 min) — pings bot endpoints, updates D1
+- R2 bucket with custom domain for public-read data access
+- All pages load data by fetching JSON from R2 — no Worker invocations
+  for page views
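
A single Worker can serve all four cron jobs, because the `scheduled` handler receives the cron expression that fired (`controller.cron`). The sketch below shows only the dispatch table as a pure function. The task names and the specific cron expressions are assumptions chosen to match the stated intervals (1, 2, 5, and 15 minutes), and they would have to match the triggers configured for the Worker.

```typescript
type CronTask = "matchmaker" | "index-rebuilder" | "reaper" | "health-checker";

// Cron expression to task. Expressions are assumed, derived from the intervals.
const CRON_TASKS: Array<[string, CronTask]> = [
  ["* * * * *", "matchmaker"],        // every minute
  ["*/2 * * * *", "index-rebuilder"], // every 2 minutes
  ["*/5 * * * *", "reaper"],          // every 5 minutes
  ["*/15 * * * *", "health-checker"], // every 15 minutes
];

// Look up which task a fired cron expression corresponds to.
// Returns null for an expression that is not in the table.
function taskForCron(cron: string): CronTask | null {
  for (const [expr, task] of CRON_TASKS) {
    if (expr === cron) return task;
  }
  return null;
}
```

Inside the Worker, the `scheduled` handler would call `taskForCron(controller.cron)` and run the matching job, treating `null` as a configuration error.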

 **Exit criteria:** a participant can register a bot via the web form, the
 bot appears on the leaderboard after matches complete, and anyone can browse
-matches and watch replays — all from a static site with no application server.
+matches and watch replays — all served from Cloudflare free tier.

 ### Phase 6: Deployment & Production

 **Deliverables:**
- Container images pushed to registry
- Stable instance: Nginx, Registration API, Scheduler, all strategy
-  bots — single machine with persistent volume for data directory
- Spot instances: match workers configured to pull jobs from scheduler
- Caddy for TLS termination on the stable instance
- DNS setup (aicodebattle.com, data.aicodebattle.com, api.aicodebattle.com)
- Monitoring webhooks (uptime ping, worker count, match throughput)
- Daily rsync of data directory + SQLite to offsite storage
+- Cloudflare: Pages project, Worker deployed via Wrangler, D1 database
+  created, R2 bucket with custom domain, DNS configured
+- Rackspace Spot: match worker containers pulling jobs from Cloudflare
+  Worker API, bot-host container running all strategy bots
+- R2 API token (scoped) distributed to Rackspace workers
+- Worker API key distributed to Rackspace workers
+- Monitoring: Cloudflare analytics + Worker-based alerting webhooks

-**Exit criteria:** platform is publicly accessible, matches run on spot
-instances, the site remains fully functional when all spot instances are
-reclaimed, and external participants can register and play.
+**Exit criteria:** platform is publicly accessible on Cloudflare (zero
+web-tier cost), matches run on Rackspace Spot, the site remains fully
+functional when all Rackspace instances are reclaimed, and external
+participants can register and play.

 ### Phase 7: LLM-Driven Evolution