Rework architecture for static site + Rackspace Spot practicalities

Major changes: - Website is now a static site loading JSON from object store, no app server - Removed PostgreSQL, Redis, PgBouncer, read replicas — replaced with Minio (S3-compatible) for all data and SQLite on the scheduler for bookkeeping - Registration is a minimal 3-endpoint API, no user accounts needed - Match job coordination via file moves in Minio (no message broker) - All user-facing data is pre-built JSON fetched client-side - Single stable instance runs everything except match workers - Spot workers are fully stateless — site stays up when all are reclaimed - Added cost model (~$65-110/mo excluding LLM API) - Simplified monitoring to match the architecture Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:59:17 -04:00 · 2026-03-23 21:59:17 -04:00 · 85a2a3915e
commit 85a2a3915e
parent 63212f02ab
1 changed files with 396 additions and 199 deletions
--- a/docs/plan/plan.md
+++ b/docs/plan/plan.md
@ -16,58 +16,70 @@ implementations for the HTTP protocol.

 ## 2. System Architecture

+The platform is designed around a **static-first** principle: the website is a
+static site that loads JSON data files directly. There is no application server
+rendering pages. All dynamic state (leaderboards, match history, bot profiles)
+is pre-computed as JSON files by backend processes and served as static assets
+alongside the HTML/JS/CSS.
+
 ```
 ┌─────────────────────────────────────────────────────────────────────┐
-│                         Web Platform                                │
+│                    Static Website (CDN / Nginx)                     │
 │  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────────┐ │
 │  │  Leaderboard  │  │ Match History │  │    Replay Viewer (Canvas) │ │
+│  │  (loads JSON) │  │  (loads JSON)│  │    (loads replay JSON)    │ │
 │  └──────────────┘  └──────────────┘  └───────────────────────────┘ │
 └──────────────────────────────┬──────────────────────────────────────┘
-                               │ HTTPS
+                               │ fetches static JSON files
                    ┌──────────▼──────────┐
-                    │     API Server       │
-                    │  (user reg, bot reg, │
-                    │  leaderboard, replay │
-                    │   metadata, health)  │
+                    │  Object Store       │
+                    │  (S3-compat/Minio)  │
+                    │  ┌────────────────┐ │
+                    │  │ /replays/*.json│ │
+                    │  │ /data/         │ │
+                    │  │   leaderboard  │ │
+                    │  │   matches      │ │
+                    │  │   bots         │ │
+                    │  │ /maps/*.json   │ │
+                    │  └────────────────┘ │
                    └──────────┬──────────┘
-                               │
+                               │ writes JSON
              ┌────────────────┼────────────────┐
              │                │                │
     ┌────────▼───────┐ ┌─────▼──────┐ ┌───────▼────────┐
-     │   PostgreSQL    │ │  Match      │ │  Object Store  │
-     │   (users, bots, │ │  Queue      │ │  (S3-compat)   │
-     │   matches,      │ │  (Redis)    │ │  replay JSON   │
-     │   ratings)      │ │             │ │  + map data     │
-     └────────────────┘ └─────┬──────┘ └────────────────┘
-                              │
-                    ┌─────────▼─────────┐
-                    │   Match Workers    │  ← Rackspace Spot instances
-                    │   (stateless,      │
-                    │    interruptible)   │
-                    └─────────┬─────────┘
-                              │ HTTP (per-turn requests)
-              ┌───────────────┼───────────────┐
-              │               │               │
-     ┌────────▼──────┐ ┌─────▼─────┐ ┌───────▼──────┐
-     │ Participant    │ │ Built-in   │ │ Participant   │
-     │ Bot A          │ │ Strategy   │ │ Bot B         │
-     │ (external)     │ │ Bots       │ │ (external)    │
-     └───────────────┘ │ (containers)│ └──────────────┘
-                       └────────────┘
+     │  Match Workers  │ │ Scheduler  │ │  Registration   │
+     │  (run matches,  │ │ (create    │ │  API (minimal,  │
+     │   write replay  │ │  jobs,     │ │  bot signup +   │
+     │   + result JSON)│ │  rebuild   │ │  health check)  │
+     │                 │ │  indexes)  │ │                 │
+     └────────┬───────┘ └────────────┘ └────────────────┘
+              │ HTTP (per-turn requests)
+              │
+     ┌────────▼──────────────────────────────┐
+     │  Bot Endpoints                         │
+     │  ┌────────────┐ ┌──────────┐ ┌──────┐ │
+     │  │ Participant │ │ Built-in │ │ EVO  │ │
+     │  │ Bots       │ │ Strategy │ │ Bots │ │
+     │  │ (external) │ │ Bots     │ │      │ │
+     │  └────────────┘ └──────────┘ └──────┘ │
+     └───────────────────────────────────────┘
 ```

 ### Component Summary

 | Component | Role | Scaling Model |
 |-----------|------|---------------|
-| API Server | REST API for web platform, bot registration, match metadata | Horizontally scaled, always-on |
-| Match Worker | Pulls match jobs from queue, executes full game simulation, uploads replay | Stateless pods on Rackspace Spot |
-| Tournament Scheduler | Creates match jobs based on matchmaking algorithm | Single process, cron-like |
-| Web Frontend | Static SPA — replay viewer, leaderboard, registration | CDN / static hosting |
+| Static Website | HTML/JS/CSS SPA that fetches JSON data files — leaderboard, match lists, replays | CDN or single Nginx container |
+| Object Store | All platform data as static JSON files — replays, leaderboard, match index, bot profiles, maps | S3-compatible (Minio or provider) |
+| Match Worker | Pulls match jobs, runs game simulation, writes replay JSON + updates result index | Stateless containers on Rackspace Spot |
+| Scheduler | Creates match jobs, rebuilds leaderboard/index JSON after matches complete | Single process, always-on |
+| Registration API | Minimal HTTP endpoint for bot signup, health check, secret generation | Single lightweight process, always-on |
 | Strategy Bots | Built-in HTTP bots (one container each) | Always-on, lightweight |
-| PostgreSQL | Users, bots, matches, ratings | Single primary + read replica |
-| Redis | Match job queue, rate limiting, caching | Single instance |
-| Object Store | Replay JSON files, map definitions | S3-compatible (Minio or provider) |
+
+**What's intentionally absent:** no PostgreSQL, no Redis, no read replicas, no
+PgBouncer, no WebSocket server. The data layer is flat JSON files in object
+storage. The scheduler and workers coordinate through the filesystem (object
+store) and a simple SQLite database or flat-file job queue on the scheduler.

 ---

@ -851,13 +863,39 @@ encoding — only recording events that changed from the previous turn.

 ### 7.2 Storage

- Replays are stored in S3-compatible object storage (Minio self-hosted or
-  provider-managed)
- Path: `replays/{year}/{month}/{match_id}.json.gz`
- Retention: indefinite for top-100 matches per month; older matches pruned
-  after 90 days
- Map definitions stored separately: `maps/{map_id}.json`
- The API server returns signed URLs for replay access (no public bucket)
+Replays are plain JSON files in object storage, served directly to the browser
+as static assets. No API intermediary.
+
+**Object store layout:**
+```
+/replays/{match_id}.json.gz          # individual replay files
+/maps/{map_id}.json                  # map definitions
+/data/matches/index.json             # paginated match list (last 1000)
+/data/matches/{match_id}.json        # match metadata (participants, scores)
+/data/leaderboard.json               # current leaderboard snapshot
+/data/bots/{bot_id}.json             # per-bot profile (rating history, recent matches)
+/data/bots/index.json                # bot directory
+/data/evolution/lineage.json         # evolution lineage graph
+/data/evolution/meta.json            # current meta/Nash snapshot
+```
+
+**How data flows:**
+1. Match worker completes a match → writes `replay.json.gz` to object store
+2. Worker writes `match metadata.json` to object store
+3. Scheduler detects new results → rebuilds `leaderboard.json`, updates
+   `bots/{bot_id}.json`, appends to `matches/index.json`
+4. Static site fetches these JSON files directly from the object store via
+   public read URLs
+
+**Retention:**
+- Indefinite for top-100 matches per month
+- Older replays pruned after 90 days (metadata kept)
+- Index files are append-with-rotation: `index.json` holds the last 1000;
+  older pages at `index-{page}.json`
+
+**No signed URLs needed** — the data bucket is public-read. Replays contain
+no secrets (HMAC secrets are never in replay data; game state is already
+fog-of-war filtered per the match, and replays show the omniscient view).

 ### 7.3 Browser Replay Viewer

@ -865,11 +903,15 @@ The replay viewer is a client-side TypeScript application rendered on
 HTML5 Canvas.

 **Rendering pipeline:**
-1. Fetch replay JSON from object storage (via signed URL from API)
+1. Fetch `replay.json.gz` directly from object store (public URL, browser
+   handles gzip decompression via `Accept-Encoding`)
 2. Parse and index: build per-turn game state by replaying events from turn 0
 3. Render the current turn to canvas
 4. User controls advance/rewind the turn index

+No API calls involved — the viewer is a pure static page loading a static
+JSON file.
+
 **Visual design:**

 | Element | Rendering |
@ -903,199 +945,347 @@ replay viewer is the landing page for any match. No login required to watch.

 ## 8. Web Platform

-### 8.1 User Registration
+The website is a **static site** — HTML, JS, CSS, and nothing else. Every
+dynamic-looking page (leaderboard, match history, bot profiles) works by
+fetching pre-built JSON files from the object store and rendering them
+client-side. There is no server-side rendering, no session management, no
+database queries at page-load time.

- Email + password or OAuth (GitHub recommended — target audience is developers)
- Email verification required before bot registration
- Profile: username (unique), display name, avatar (from OAuth provider)
+### 8.1 Static Site Structure

-### 8.2 Bot Registration
+```
+/                          → Landing page, featured replays, leaderboard summary
+/leaderboard               → Full leaderboard (fetches /data/leaderboard.json)
+/matches                   → Match history (fetches /data/matches/index.json)
+/replay/{match_id}         → Replay viewer (fetches /replays/{match_id}.json.gz)
+/bot/{bot_id}              → Bot profile (fetches /data/bots/{bot_id}.json)
+/evolution                 → Evolution dashboard (fetches /data/evolution/*.json)
+/register                  → Bot registration form (submits to Registration API)
+/docs                      → Protocol spec, starter kit links, getting started
+```
+
+**Build:** the static site is built once (e.g., Vite + vanilla TS, or a
+lightweight framework) and deployed as files to CDN or an Nginx container.
+No build-time data fetching — all data is loaded at runtime from the
+object store.
+
+**Data loading pattern:** every page that shows dynamic data does:
+```js
+const data = await fetch('https://data.aicodebattle.com/data/leaderboard.json')
+const leaderboard = await data.json()
+// render client-side
+```
+
+Stale data is acceptable — the leaderboard JSON is rebuilt every few minutes
+by the scheduler. There is no real-time push. Visitors see data that is at
+most a few minutes old.
+
+### 8.2 Registration API
+
+The one exception to the "everything is static" rule. Bot registration
+requires a server-side process because it must:
+
+1. Generate a shared secret
+2. Perform a health check against the bot's endpoint
+3. Write the bot's record to the data store
+
+This is a **minimal HTTP service** — a single Go binary with three endpoints:
+
+```
+POST /api/register    → register a new bot
+POST /api/rotate-key  → rotate a bot's shared secret
+GET  /api/status/{id} → check bot health status
+```

 **Registration flow:**

-1. User navigates to "Register Bot" in their dashboard
-2. Provides:
+1. Participant fills out a form on the static site (`/register`)
+2. Form submits to the Registration API:
   - **Bot name** (unique, alphanumeric + hyphens, 3–32 chars)
-   - **Endpoint URL** (HTTPS required for competitive play; HTTP allowed for
-     development with a flag)
-   - **Description** (optional, shown on leaderboard)
-3. Platform generates:
+   - **Endpoint URL** (HTTPS required for competitive; HTTP allowed for dev)
+   - **Owner name** (free text, shown on leaderboard)
+   - **Description** (optional)
+3. API generates:
   - `bot_id`: unique identifier (`b_` prefix + 8 hex chars)
   - `shared_secret`: 256-bit random, hex-encoded (64 chars)
-4. Platform displays the shared secret **once** — user must copy it
-5. Platform performs a **health check**: `GET {endpoint_url}/health`
+4. API performs a **health check**: `GET {endpoint_url}/health`
   - Must return 200 within 5 seconds
-   - If health check fails, registration is saved but bot is marked **inactive**
-6. Platform performs a **protocol test**: sends a mock turn-0 game state to
+5. API performs a **protocol test**: sends a mock game state to
   `POST {endpoint_url}/turn` with valid HMAC
-   - Bot must return a valid (possibly empty) moves response within 3 seconds
-   - If protocol test fails, bot is marked **inactive** with an error message
+   - Must return valid moves JSON within 3 seconds
+6. API returns the `bot_id` and `shared_secret` to the participant
+   (displayed once — they must save it)
+7. API writes `bots/{bot_id}.json` to the object store
+8. Scheduler picks up the new bot on its next cycle and adds it to matchmaking
+
+**No user accounts.** Registration is bot-level, not user-level. The owner
+name is self-reported and shown on the leaderboard. The shared secret is the
+only authentication — whoever has it controls the bot (can rotate the key or
+retire the bot). This avoids needing user auth, sessions, email verification,
+OAuth, and password storage.

 **Bot status lifecycle:**
 ```
 PENDING → ACTIVE → INACTIVE (health check failed)
-                  → SUSPENDED (manual by admin)
-                  → RETIRED (by owner)
+                  → RETIRED (by owner via rotate-key endpoint with retire flag)
 ```

 Only `ACTIVE` bots participate in matchmaking.

-**Ongoing health checks:** the platform pings each active bot's `/health`
-endpoint every 15 minutes. Three consecutive failures → marked `INACTIVE`.
-Bots automatically return to `ACTIVE` when health checks resume passing.
+**Ongoing health checks:** the scheduler pings each active bot's `/health`
+endpoint every 15 minutes. Three consecutive failures → marked `INACTIVE`
+in the bot's JSON file. Bots automatically return to `ACTIVE` when health
+checks resume passing.

 ### 8.3 Leaderboard

- Default sort: Glicko-2 display rating (mu - 2*phi) descending
- Columns: rank, bot name, owner, rating, games played, win rate, last active
- Filterable by: player count tier (2p, 3p, 4p, 6p), time range
- Updates in near-real-time (WebSocket push or 30-second polling)
- Public — no login required to view
+The leaderboard is a **JSON file** (`/data/leaderboard.json`) rebuilt by the
+scheduler every few minutes after new match results arrive.
+
+```json
+{
+  "updated_at": "2026-03-23T14:35:00Z",
+  "entries": [
+    {
+      "rank": 1,
+      "bot_id": "b_4e8c1d2f",
+      "name": "SwarmBot",
+      "owner": "alice",
+      "rating": 1820,
+      "games": 142,
+      "wins": 98,
+      "losses": 40,
+      "draws": 4,
+      "evolved": false,
+      "last_match": "2026-03-23T14:30:00Z"
+    }
+  ]
+}
+```
+
+- Client-side sorting and filtering (by player count tier, time range,
+  human-only vs all)
+- No real-time updates — page refresh or auto-refresh every 60 seconds
+- Public — no login required

 ### 8.4 Match History & Profiles

-**Bot profile page** (`/bot/{bot_name}`):
- Current rating + rating history chart
+**Bot profile** (`/bot/{bot_id}`) — fetches `/data/bots/{bot_id}.json`:
+- Current rating + rating history (array of `[timestamp, rating]` pairs
+  rendered as a chart client-side)
 - Recent matches (last 50) with links to replay viewer
 - Win/loss/draw breakdown
- Performance vs. each opponent
 - Bot description, owner, registration date
+- If evolved: lineage, generation, island

-**User profile page** (`/user/{username}`):
- List of owned bots
- Aggregate statistics across all bots
+**Match list** (`/matches`) — fetches `/data/matches/index.json`:
+- Paginated list of recent matches
+- Each entry: match_id, participants, scores, date, link to replay

-**Match page** (`/match/{match_id}`):
- Participants, map, final scores
+**Match detail** (`/replay/{match_id}`):
+- Fetches `/data/matches/{match_id}.json` for metadata
+- Fetches `/replays/{match_id}.json.gz` for the replay
 - Embedded replay viewer (auto-plays)
- Turn-by-turn event log (collapsible)
+- Score breakdown, participants, match duration

 ---

 ## 9. Deployment & Infrastructure

-### 9.1 Container Architecture
+### 9.1 Design Principles

-| Image | Base | Purpose | Replicas |
-|-------|------|---------|----------|
-| `acb-api` | Go binary on Alpine | REST API server | 2 (always-on) |
-| `acb-worker` | Go binary on Alpine | Match execution worker | 3–10 (spot) |
-| `acb-scheduler` | Go binary on Alpine | Tournament matchmaking | 1 (always-on) |
-| `acb-web` | Nginx + static files | Frontend SPA | 1 (or CDN) |
-| `acb-strategy-random` | Python 3.13 slim | RandomBot | 1 |
-| `acb-strategy-gatherer` | Go on Alpine | GathererBot | 1 |
-| `acb-strategy-rusher` | Rust on Alpine | RusherBot | 1 |
-| `acb-strategy-guardian` | PHP 8.4 CLI Alpine | GuardianBot | 1 |
-| `acb-strategy-swarm` | Node 22 Alpine | SwarmBot (TypeScript) | 1 |
-| `acb-strategy-hunter` | Temurin 21 JRE Alpine | HunterBot (Java) | 1 |
-| `acb-evolver` | Go binary on Alpine | Evolution pipeline orchestrator | 1 (always-on) |
-| `acb-evolved-*` | Varies by language | LLM-generated evolved bots | 0–50 (dynamic) |
+The deployment is designed for **Rackspace Spot** instances, which means:

-### 9.2 Rackspace Spot Deployment
+- **Instances can be reclaimed at any time** with a short SIGTERM warning
+- **Pricing is significantly cheaper** than on-demand but availability fluctuates
+- **No persistent local storage** — anything on the instance disappears on reclaim
+- **No guaranteed uptime** — the platform must tolerate all workers disappearing

-Match workers are the primary consumers of compute and are **perfectly suited
-for spot instances**:
+These constraints shape every decision below. The answer is: keep it simple,
+keep it stateless, and put all durable state in object storage.

- **Stateless**: workers pull jobs from a queue, execute, and push results.
-  No persistent local state.
- **Interruptible**: if a spot instance is reclaimed mid-match, the match job
-  is re-queued after a staleness timeout (10 minutes with no progress update).
-  The match is replayed from scratch on another worker.
- **Bursty**: match throughput can flex with spot availability. More instances
-  = faster ladder convergence, but no hard deadline.
+### 9.2 Container Architecture

-**Instance sizing:**
- Match workers: 2 vCPU, 4 GB RAM per instance (each runs one match at a time)
- Strategy bots: can share a single small instance (all 6 use <1GB total;
-  Java's JVM is the biggest consumer at ~256MB)
- API server + scheduler: 2 vCPU, 4 GB RAM, always-on (not spot)
+| Image | Base | Purpose | Where It Runs |
+|-------|------|---------|---------------|
+| `acb-web` | Nginx + static files | Static site (HTML/JS/CSS) | CDN or single stable instance |
+| `acb-register` | Go binary on Alpine | Bot registration API (3 endpoints) | Single stable instance |
+| `acb-scheduler` | Go binary on Alpine | Matchmaking + JSON index rebuilds | Single stable instance |
+| `acb-worker` | Go binary on Alpine | Match execution | Rackspace Spot (1–10) |
+| `acb-evolver` | Go binary on Alpine | Evolution pipeline orchestrator | Rackspace Spot (1) |
+| `acb-strategy-random` | Python 3.13 slim | RandomBot | Stable instance (shared) |
+| `acb-strategy-gatherer` | Go on Alpine | GathererBot | Stable instance (shared) |
+| `acb-strategy-rusher` | Rust on Alpine | RusherBot | Stable instance (shared) |
+| `acb-strategy-guardian` | PHP 8.4 CLI Alpine | GuardianBot | Stable instance (shared) |
+| `acb-strategy-swarm` | Node 22 Alpine | SwarmBot (TypeScript) | Stable instance (shared) |
+| `acb-strategy-hunter` | Temurin 21 JRE Alpine | HunterBot (Java) | Stable instance (shared) |
+| `acb-evolved-*` | Varies by language | LLM-generated evolved bots | Stable instance (shared) |

-**Deployment layout:**
+### 9.3 Rackspace Spot: What Runs Where
+
+**Stable instance (1× small, always-on):**
+
+This is the only always-on server. It runs everything that must survive
+spot reclamation:

 ```
-Always-on tier (standard instances):
-├── acb-api (×2, behind load balancer)
-├── acb-scheduler (×1)
-├── acb-evolver (×1, evolution pipeline orchestrator)
-├── acb-web (×1 or CDN)
-├── acb-strategy-* (×1 each, shared instance)
-├── acb-evolved-* (×0–50, dynamic, promoted bots)
-├── PostgreSQL (managed or self-hosted)
-├── Redis (managed or self-hosted)
-└── Minio / S3-compatible store
-
-Spot tier (preemptible instances):
-├── acb-worker (×3 minimum, scale up as available)
-└── (each worker is a standalone container, no coordination needed)
-
-Evolution tier (can share spot or always-on):
-├── nsjail sandbox runners (for validation + evaluation)
-└── (reuses acb-worker for evaluation matches)
+Single stable instance (2 vCPU, 4 GB RAM):
+├── acb-web (Nginx, serves static site)
+├── acb-register (registration API, lightweight)
+├── acb-scheduler (matchmaking + index rebuilds)
+├── acb-strategy-* (all 6 built-in bots, ~1 GB total)
+├── acb-evolved-* (0–50 evolved bots, dynamic)
+└── Minio (object store, data volume mounted)
 ```

-**Spot reclaim handling:**
-1. Worker registers a shutdown hook that catches SIGTERM
-2. On SIGTERM, worker sets the current match status to `interrupted` in Redis
-3. Worker exits gracefully (within the 30-second SIGTERM grace period)
-4. Scheduler's stale-match reaper detects `interrupted` or stale `in_progress`
-   matches and re-queues them
-5. Another worker picks up the job
+The static site, registration API, scheduler, and all bot HTTP servers share
+one machine. This is feasible because:
+- The static site is Nginx serving files (negligible CPU)
+- The registration API handles ~10 requests/day (negligible)
+- The scheduler runs once every 10 seconds (negligible)
+- Strategy bots are idle between matches (only active during their turns)
+- Minio serves static files (mostly read, low CPU)

-### 9.3 Data Stores
+**Spot instances (1–10×, preemptible):**

-**PostgreSQL:**
- Tables: `users`, `bots`, `matches`, `match_participants`, `maps`, `ratings`
- Single primary instance; read replica for leaderboard queries
- Connection pooling via PgBouncer
- Backup: daily automated dumps to object storage
+Match workers and the evolution pipeline run on spot instances. These are
+the only compute-intensive workloads.

-**Redis:**
- Match job queue (Redis Streams or List-based queue)
- Rate limiting (per-bot, per-endpoint)
- Session cache
- Leaderboard cache (sorted sets)
- No persistence required — queue jobs are recoverable from PostgreSQL match
-  records with `queued` status
+```
+Spot instance A (2 vCPU, 4 GB RAM):
+└── acb-worker (runs 1 match at a time)

-**Object Storage (S3-compatible):**
- Replay files (gzipped JSON)
- Map definition files
- Bot submission metadata / logs
- Signed URL generation for replay access (1-hour expiry)
+Spot instance B (2 vCPU, 4 GB RAM):
+└── acb-worker (runs 1 match at a time)

-### 9.4 Networking & Security
+Spot instance C (4 vCPU, 8 GB RAM):
+└── acb-evolver (evolution pipeline, needs more RAM for LLM context)
+```
+
+**If all spot instances are reclaimed:**
+- The website continues to work (static site on stable instance)
+- Leaderboard and replays remain visible (JSON files in Minio)
+- Bot registration still works (registration API on stable instance)
+- Built-in bots remain reachable (on stable instance)
+- **Only match execution pauses** — the queue accumulates jobs
+- When spot instances return, workers drain the queue and catch up
+- The platform gracefully degrades: visitors see stale-but-valid data
+
+This is the key benefit of the static-site architecture — the entire
+user-facing experience survives spot reclamation.
+
+### 9.4 Data Layer: Object Store + Flat Files
+
+There is **no database**. All platform state is stored as JSON files in
+Minio (S3-compatible object storage) on the stable instance.
+
+**Why no PostgreSQL/Redis:**
+- The platform has low write volume (~60 matches/hour, ~10 registrations/day)
+- All "queries" are pre-computed: the scheduler builds the leaderboard JSON,
+  match index JSON, and bot profile JSONs on a schedule
+- The static site just fetches these files — no query engine needed
+- Minio on a single machine with a mounted volume is simpler to operate,
+  back up, and recover than a PostgreSQL instance
+- Eliminates connection pooling, migrations, schema management, and ORM
+
+**Internal state (scheduler's working data):**
+
+The scheduler maintains a small SQLite database locally for its own bookkeeping:
+- Bot registry (which bots are active, their endpoints, ratings)
+- Match queue (pending, in-progress, completed)
+- Rating history
+
+This SQLite file lives on the stable instance's persistent volume. It is the
+**source of truth** for platform state. The JSON files in Minio are
+**materialized views** rebuilt from SQLite.
+
+If the SQLite file is lost, it can be reconstructed from the JSON files in
+Minio (match results contain all the data needed to replay ratings).
+
+**Backup:** daily copy of the SQLite file + full Minio bucket to a second
+storage location (e.g., rsync to another Rackspace volume or offsite).
+
+### 9.5 Match Job Coordination
+
+Without Redis, workers coordinate through the object store and HTTP:
+
+**Job assignment flow:**
+1. Scheduler writes match jobs as JSON files to Minio:
+   `jobs/pending/{job_id}.json`
+2. Worker polls for pending jobs: `GET /jobs/pending/` (list objects)
+3. Worker claims a job by moving it:
+   `jobs/pending/{job_id}.json` → `jobs/running/{job_id}.json`
+   (atomic rename in Minio; if two workers race, one gets a 404 — retry)
+4. Worker executes the match
+5. Worker writes results:
+   - `replays/{match_id}.json.gz` (replay data)
+   - `jobs/done/{job_id}.json` (result + scores)
+   - Deletes `jobs/running/{job_id}.json`
+6. Scheduler detects completed jobs, updates SQLite, rebuilds JSON indexes
+
+**Stale job recovery:**
+- Scheduler scans `jobs/running/` every 5 minutes
+- Any job that has been running for >15 minutes is assumed abandoned
+  (worker was reclaimed)
+- Scheduler moves it back to `jobs/pending/` for re-execution
+
+This is crude but sufficient for the throughput level (~60 matches/hour).
+No message broker, no pub/sub, no distributed coordination. Just files.
+
+### 9.6 Networking & Security

 **External traffic:**
- Web platform: HTTPS only, behind reverse proxy (Caddy or nginx)
- Bot endpoints: engine connects outbound to registered URLs
+- `aicodebattle.com` → Nginx on stable instance (static site)
+- `data.aicodebattle.com` → Minio on stable instance (JSON data, public-read)
+- `api.aicodebattle.com` → Registration API on stable instance (3 endpoints)
+- All behind Caddy for automatic TLS termination

-**Internal traffic:**
- API ↔ PostgreSQL: private network
- API ↔ Redis: private network
- Workers ↔ Redis: private network (workers may be in different regions — use
-  Redis over TLS if cross-region)
- Workers → bot endpoints: public internet (HTTPS required for competitive bots)
- Workers → strategy bots: private network (same infrastructure)
+**Internal traffic (stable instance → bots):**
+- Strategy bots listen on localhost ports (no external exposure)
+- Workers on spot instances reach strategy bots via Tailscale or public IP
+
+**Workers → external participant bots:**
+- Outbound HTTPS to registered bot URLs
+- No inbound ports on workers

 **Security boundaries:**
 - The game engine (workers) never executes bot code — HTTP only
 - All bot responses are schema-validated before processing
 - HMAC authentication prevents request/response forgery
- Rate limiting on API endpoints (registration, health checks)
- Bot endpoint URLs validated at registration (no internal IPs, no localhost)
- Workers run with no inbound ports — they only make outbound HTTP calls
+- Registration API validates bot endpoint URLs (no internal IPs, no localhost,
+  no private ranges)
+- Minio data bucket is public-read, no-write from outside
+- Registration API is rate-limited (10 registrations/hour max)

-### 9.5 Monitoring
+### 9.7 Cost Model (Rackspace Spot)

-| Signal | Tool | Alert Threshold |
-|--------|------|-----------------|
-| Match throughput | Prometheus counter | <10 matches/hour for >30 minutes |
-| Worker count | Prometheus gauge | <2 live workers for >15 minutes |
-| Bot health check failures | Prometheus counter | >50% of active bots failing |
-| API latency (p99) | Prometheus histogram | >500ms |
-| Match queue depth | Redis metric | >100 pending matches |
-| Replay storage usage | S3 metric | >80% of quota |
-| Error rate (5xx) | Access logs | >1% of requests |
+| Component | Instance Type | Spot? | Est. Monthly |
+|-----------|--------------|-------|-------------|
+| Stable instance | 2 vCPU / 4 GB | No (on-demand) | ~$30–50 |
+| Match workers (×3 avg) | 2 vCPU / 4 GB each | Yes | ~$15–30 |
+| Evolver (×1) | 4 vCPU / 8 GB | Yes | ~$10–20 |
+| Storage (Minio volume) | 100 GB block storage | No | ~$10 |
+| **Total** | | | **~$65–110/mo** |
+
+LLM API costs for the evolution pipeline are separate and depend on model
+choice and generation volume. At ~96 candidates/day with a mix of fast/strong
+models, estimate ~$5–20/day ($150–600/mo).
+
+### 9.8 Monitoring
+
+Monitoring is lightweight, matching the simple architecture:
+
+| Signal | Method | Alert |
+|--------|--------|-------|
+| Stable instance up | External ping (UptimeRobot or similar) | Down >2 minutes |
+| Active spot workers | Scheduler counts `jobs/running/` age | 0 workers for >30 minutes |
+| Match throughput | Scheduler counts `jobs/done/` per hour | <10 matches/hour for >1 hour |
+| Minio disk usage | `df` on volume | >80% |
+| Bot health failures | Scheduler's health check log | >50% of bots failing |
+| Stale jobs | Scheduler's reaper count | >10 stale jobs in a cycle |
+
+Alerts via webhook to a notification channel (Slack, Discord, or email).
+No Prometheus/Grafana stack needed at this scale.

 ---

@ -1558,43 +1748,50 @@ match with all visual elements rendering correctly.
 ### Phase 4: Match Orchestration

 **Deliverables:**
- Match worker service (`acb-worker`): pulls from Redis queue, runs matches,
-  uploads replays, records results
- Tournament scheduler (`acb-scheduler`): matchmaking algorithm, creates jobs
- PostgreSQL schema and migrations
- Stale match reaper (handles interrupted spot instances)
+- Match worker service (`acb-worker`): polls Minio for pending jobs, runs
+  matches, writes replay JSON + result JSON back to Minio
+- Scheduler (`acb-scheduler`): matchmaking algorithm, writes job files to
+  Minio, detects completed jobs, updates SQLite, rebuilds leaderboard/index
+  JSON files
+- Scheduler's SQLite schema for bot registry, match queue, and ratings
+- Stale job reaper (recovers abandoned jobs from reclaimed spot instances)
 - Match result → Glicko-2 rating update pipeline
+- JSON index rebuilder: leaderboard.json, matches/index.json, bots/*.json

-**Exit criteria:** scheduler creates matches, workers execute them
-autonomously, ratings update, replays are stored. System recovers from
-worker interruption.
+**Exit criteria:** scheduler creates match jobs as files, workers pick them
+up and execute autonomously, results flow back as JSON, ratings update, and
+all index files rebuild correctly. System recovers from worker disappearance.

 ### Phase 5: Web Platform

 **Deliverables:**
- API server (`acb-api`): user registration, bot registration, leaderboard,
-  match history, replay URLs
- Web frontend (`acb-web`): registration, bot management dashboard,
-  leaderboard, match history, embedded replay viewer
- Bot health check system (periodic + on-registration)
- Shared secret generation, display, rotation
+- Static site (`acb-web`): leaderboard, match history, bot profiles, replay
+  viewer, registration form, docs/getting-started page
+- Registration API (`acb-register`): bot signup, health check, key rotation
+  (3 endpoints, single Go binary)
+- Bot health check loop in the scheduler (periodic pings)
+- All pages load data by fetching JSON from the object store — no backend
+  rendering

-**Exit criteria:** a user can register, add a bot, see it appear on the
-leaderboard after matches are played, and watch replays of its games.
+**Exit criteria:** a participant can register a bot via the web form, the
+bot appears on the leaderboard after matches complete, and anyone can browse
+matches and watch replays — all from a static site with no application server.

 ### Phase 6: Deployment & Production

 **Deliverables:**
 - Container images pushed to registry
- Rackspace Spot deployment for workers
- Always-on deployment for API, scheduler, strategy bots, datastores
- TLS termination, DNS, CDN for static assets
- Monitoring dashboards and alerts
- Backup automation for PostgreSQL and replay storage
+- Stable instance: Nginx, Registration API, Scheduler, Minio, all strategy
+  bots — single machine with persistent volume
+- Spot instances: match workers configured to poll Minio for jobs
+- Caddy for TLS termination on the stable instance
+- DNS setup (aicodebattle.com, data.aicodebattle.com, api.aicodebattle.com)
+- Monitoring webhooks (uptime ping, worker count, match throughput)
+- Daily SQLite + Minio backup to offsite storage

-**Exit criteria:** platform is publicly accessible, matches run continuously,
-strategy bots compete on the ladder, external participants can register and
-play.
+**Exit criteria:** platform is publicly accessible, matches run on spot
+instances, the site remains fully functional when all spot instances are
+reclaimed, and external participants can register and play.

 ### Phase 7: LLM-Driven Evolution