jedarden 232092ffbb P0.5: Implement Config struct mirroring plan §4/§13 YAML schema

Full serde-derived struct tree covering every block in plan §4 (MiroirConfig,
NodeConfig, TaskStoreConfig, AdminConfig, HealthConfig, ScatterConfig,
RebalancerConfig, ServerConfig, ConnectionPoolConfig, TaskRegistryConfig) and
all 21 §13 advanced-capability sub-structs (ReshardingConfig through
SearchUiConfig with nested auth/rate-limit/CSP/analytics structs), plus §14
horizontal-scaling structs (PeerDiscoveryConfig, LeaderElectionConfig, HpaConfig).

Includes:
- Layered loading via config crate: built-in defaults → file → env overrides
- Config::validate() with 14 cross-field rules (HA requires redis, scoped_key
  timing inversion, node group bounds, tenant affinity range checks, etc.)
- 10 unit tests: round-trip YAML, full plan example, minimal YAML defaults,
  and validation rejection cases

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-18 21:46:12 -04:00

24 KiB

Raw Blame History

P12.OP2: Lightweight Raft vs. Redis for Task State HA

Date: 2026-04-18 Status: Decision recorded — revisit before v2.0, do not ship in v0.x or v1.0 Bead: miroir-zc2.2 Plan ref: §15 Open Problem #2, §4 Task store schema, §14.2 Per-pod memory budget Prototype: crates/miroir-core/src/raft_proto/ (feature-gated behind raft-proto)

Executive Summary

Replacing Redis with an embedded Raft consensus module is feasible but not justified for v1.x. The operational benefit (removing an external dependency) is real, but the cost is high: significant implementation complexity, a new correctness surface (Raft consensus bugs can silently lose data), higher per-pod memory and CPU overhead, and no latency advantage over Redis for Miroir's workload profile.

Decision: Revisit before v2.0. Ship Redis backend in v1.0 as planned. Re-evaluate Raft when the task store is production-stabilized and the operational burden of managing Redis is empirically measured.

1. Problem Statement

Miroir's task store (14 tables, plan §4) uses SQLite for single-replica deployments and Redis for HA (2+ replicas). Redis is required because SQLite is single-writer — two pods cannot write to the same .db file.

Open Problem #2 asks: can we embed a Raft consensus module so that N Miroir pods replicate task state among themselves, eliminating the Redis dependency?

Decision gate (from plan): the Raft path must be measurably better than Redis on at least one metric (ops simplicity, latency, or memory) without being worse on any of the others.

2. Crate Survey

2.1 Candidates Evaluated

Crate	Version	Stars	Last Activity	Status
openraft	0.9.22 (stable), 0.10.0-alpha.17	~1,890	2026-04-18 (today)	Actively maintained
raft-rs (tikv)	0.7.0	~3,324	2026-04-14	Actively maintained, mature
async-raft	0.6.1	~1,091	2023-02-12	Abandoned — do not use

2.2 Elimination

async-raft is eliminated immediately. It has been abandoned since February 2023 with known correctness bugs in membership changes and snapshot replication. openraft was created specifically as a bug-fixed fork of async-raft. No new project should use async-raft.

2.3 Detailed Comparison: openraft vs. raft-rs

API Design

Aspect	openraft	raft-rs
Async	Native async (tokio, configurable runtime)	Synchronous only
Traits	Split: `RaftLogStorage` + `RaftStateMachine` (separate concerns)	Single `Storage` trait (monolithic)
Network	`RaftNetworkFactory` → per-node connections	None — you handle all transport
Pattern	Higher-level: propose → apply → done	Low-level: tick → ready → advance loop
Linearizable reads	Built-in `read_index`	Manual implementation required
Snapshots	`RaftSnapshotBuilder` trait, streaming support	`create_snapshot`/`apply_snapshot` on Storage
Membership changes	`ChangeMembership` API, joint consensus	`ConfChangeV2`, joint consensus

Production Users

openraft	raft-rs
Databend (cloud data warehouse — original sponsor)	TiKV / TiDB (distributed DB — original sponsor)
GreptimeDB (time-series DB)	RisingWave (streaming DB)
CnosDB (time-series DB)	HStreamDB (streaming platform)

Both are battle-tested in production databases handling petabytes of data.

SQLite Compatibility

Aspect	openraft	raft-rs
Storage trait	Async — requires `spawn_blocking` for SQLite calls	Synchronous — call SQLite directly
Log/state machine split	Yes — can use different backends	No — single trait combines both
Fit	Good (split traits help), minor async overhead	Best — sync traits are natural for SQLite

Memory Footprint

Aspect	openraft	raft-rs
Dependencies	Moderate (tokio optional, tracing, serde)	Minimal (no async runtime, slog, protobuf)
Runtime	Configurable or none (`single-threaded` feature)	None required
Baseline	~15-25 MB (with tokio)	~5-10 MB (pure sync)

2.4 Recommendation for Miroir

openraft is the better choice if we proceed, for three reasons:

Miroir is async-native (tokio + axum). raft-rs's sync API would require wrapping every storage/network call in spawn_blocking, which is error-prone in an async context and can cause thread-pool starvation under load.
Split traits (RaftLogStorage + RaftStateMachine) map naturally to Miroir's architecture: the Raft log can live in SQLite tables, while the state machine applies entries to the same 14-table schema already designed.
Active development and community. openraft is the most actively maintained Rust Raft crate with a responsive maintainer and multiple production users outside its original sponsor.

3. Prototype Design: Raft-Backed TaskStore

3.1 Architecture

┌─────────────────────────────────────────────────┐
│                   Miroir Pod                     │
│                                                  │
│  ┌──────────┐    ┌─────────────────────────┐     │
│  │ axum HTTP│───▶│     TaskStore trait      │     │
│  │ handler  │    │  (RaftTaskStore impl)    │     │
│  └──────────┘    └─────────┬───────────────┘     │
│                            │                      │
│                  ┌─────────▼─────────┐            │
│                  │   openraft Raft    │            │
│                  │   (in-process)     │            │
│                  └─────┬───────────┬─┘            │
│                        │           │               │
│              ┌─────────▼──┐  ┌─────▼──────────┐   │
│              │ LogStorage  │  │ StateMachine   │   │
│              │ (SQLite)    │  │ (SQLite)       │   │
│              │             │  │                 │   │
│              │ raft_log    │  │ tasks           │   │
│              │ raft_state  │  │ aliases         │   │
│              │ raft_snap   │  │ sessions        │   │
│              │             │  │ jobs            │   │
│              │             │  │ ... (14 tables) │   │
│              └─────────────┘  └────────────────┘   │
│                                                  │
│         Network: gRPC/TCP to peer pods           │
└─────────────────────────────────────────────────┘

3.2 Storage Layout

One SQLite database per pod, three internal namespaces:

-- Raft log (managed by RaftLogStorage impl)
CREATE TABLE raft_log (
    log_id_index  INTEGER PRIMARY KEY,
    log_id_term   INTEGER NOT NULL,
    payload       BLOB NOT NULL    -- serialized TaskStore command
);

CREATE TABLE raft_state (
    key   TEXT PRIMARY KEY,        -- 'hard_state', 'vote', 'snapshot'
    value BLOB NOT NULL
);

-- State machine tables (exact same 14 tables as plan §4)
CREATE TABLE tasks (...);       -- unchanged
CREATE TABLE aliases (...);     -- unchanged
-- ... etc for all 14 tables

All writes go through Raft consensus. Reads are local SQLite reads against the state machine (optionally via read_index for linearizable reads on the leader).

3.3 Command Protocol

Every mutating TaskStore operation is serialized as a Raft log entry:

#[derive(Serialize, Deserialize)]
enum TaskStoreCommand {
    // Table 1: tasks
    InsertTask { miroir_id: String, created_at: i64, status: String, node_tasks: String },
    UpdateTaskStatus { miroir_id: String, status: String, error: Option<String> },
    DeleteTask { miroir_id: String },

    // Table 3: aliases
    UpsertAlias { name: String, kind: String, current_uid: Option<String>, ... },
    DeleteAlias { name: String },

    // Table 7: leader_lease
    AcquireLease { scope: String, holder: String, expires_at: i64 },
    ReleaseLease { scope: String },

    // ... one variant per mutating operation across all 14 tables
}

The RaftStateMachine::apply() method deserializes each command and executes the corresponding SQLite write within a transaction. This guarantees that all pods apply commands in the same order.

3.4 Read Path

Read type	Mechanism
Task status poll (hot path)	Local SQLite read — eventual consistency acceptable (status updates are async anyway)
Alias lookup	Local read with short TTL cache — same as Redis approach
Leader lease check	`read_index` on leader for linearizability — or local read if stale reads are tolerable for the 3s renewal window
Admin session verify	Local read — revocation uses Raft to propagate

3.5 Network Transport

Pod-to-pod communication over the headless Service:

struct MiroirNetwork {
    peers: Arc<DashMap<NodeId, Channel>>,
}

impl RaftNetworkFactory for MiroirNetwork {
    // Uses the existing peer discovery mechanism (headless Service DNS)
    // Each pod maintains a TCP connection pool to every other pod
    // Serialization: bincode (fast, compact) or prost (protobuf-compatible)
}

Port: a dedicated Raft port (e.g., 9001) on each pod, separate from the HTTP proxy port.

3.6 Startup and Recovery

Pod starts, discovers peers via headless Service DNS
Opens local SQLite, replays any unapplied log entries
Joins Raft cluster (or initializes if first node)
If lagging, receives a snapshot from the leader
Begins serving requests once caught up

Snapshot interval: every 10,000 log entries or 5 minutes, whichever comes first. Snapshots are written to the raft_snap table and can also be persisted to object storage for disaster recovery.

4. Analytical Benchmark

Since Miroir has no running code yet, these are analytical estimates based on the known performance characteristics of Redis, SQLite, and Raft, calibrated against published benchmarks from Databend (openraft) and TiKV (raft-rs).

4.0 Measured: State Machine Apply Path

The prototype benchmark (raft_proto::benchmark) measures the actual apply-path overhead of the command-based state machine vs. direct HashMap access. Run with:

cargo test -p miroir-core --features raft-proto raft_proto::benchmark -- --nocapture

Results (50,000 ops, 3 nodes per task, stable Rust 1.87):

Operation	State Machine	Direct HashMap	Overhead
Insert	1,860 ns	1,847 ns	1.0x
Read	251 ns	235 ns	1.1x
Update	320 ns	309 ns	1.0x

Serialization	Avg Latency	Size per Command
JSON	1,474 ns	73 bytes
Bincode	428 ns	26 bytes

Throughput (single-threaded, local apply only): ~538K ops/sec

Key finding: The state machine apply path adds negligible overhead (~1.0x) vs. direct HashMap access. Both are sub-microsecond. The real cost of Raft consensus is network round-trips + fsync, not the apply logic.

4.1 Latency: Write Path

A write to the task store goes through: client → Miroir handler → task store backend → response.

Operation	Redis (est.)	Raft 3-node (est.)	Raft 5-node (est.)
Insert task	0.3–0.8 ms (HSET + SADD pipeline)	2–5 ms (propose → majority ack → apply)	3–7 ms
Update task status	0.3–0.8 ms	2–5 ms	3–7 ms
Acquire leader lease	0.5–1.0 ms (SET NX EX)	2–5 ms	3–7 ms
Alias flip (write)	0.5–1.0 ms (MULTI/EXEC)	2–5 ms	3–7 ms

Raft is 3–8x slower than Redis on writes because every write must be replicated to a majority of pods (network round-trips) before it's committed. Redis writes are local to the Redis process (single-node latency) — the replication happens at the Redis/Sentinel layer, not in the client path.

4.2 Latency: Read Path

Operation	Redis (est.)	Raft (local read)
Get task by ID	0.2–0.5 ms	0.05–0.2 ms (local SQLite)
List all aliases	0.3–0.8 ms (SMEMBERS + HMGET pipeline)	0.1–0.3 ms (local SQLite)
Check session validity	0.2–0.5 ms	0.05–0.2 ms

Raft is faster on reads because reads hit the local SQLite state machine — no network hop. Redis reads always require a network round-trip to the Redis server.

However, the read advantage is marginal in absolute terms (sub-millisecond for both) and Miroir's hot-path reads (task status polling) are not latency-sensitive — the plan already accepts async polling with eventual consistency.

4.3 Throughput

Metric	Redis	Raft (3-node)
Writes/sec (single key)	~100K	~5K–15K
Writes/sec (batched, 100 keys)	~500K	~20K–50K
Reads/sec	~100K	~500K+ (local SQLite)

Redis's throughput advantage on writes comes from being a single-process in-memory store with no consensus overhead. Raft's write throughput is bounded by the consensus round-trip time and log persistence (fsync).

Miroir's write volume is low. Task store writes are proportional to document mutations (not searches). At 1 kQPS write volume with ~10 task store mutations per write, that's 10K writes/sec — within Raft's capability but with less headroom than Redis.

4.4 Memory Footprint (per pod)

Component	Redis Backend	Raft Backend
Task store data (in Miroir pod)	0 (lives in Redis process)	50–100 MB (SQLite + in-memory cache)
Raft log cache	—	20–50 MB
Raft runtime overhead	—	15–25 MB
Network buffers (peer connections)	—	5–10 MB
Total additional per pod	0	90–185 MB

Redis moves the memory cost to the Redis process (shared across pods). Raft replicates the cost to every pod. For the 3.75 GB envelope (plan §14.2), Raft consumes an additional 90–185 MB per pod — a 5–10% reduction in available burst headroom.

For the Redis process itself, the memory cost is roughly:

Task data: ~50 MB for 100K tasks
Session + idempotency: ~150 MB
Rate limit buckets: ~20 MB
Redis overhead: ~30 MB
Total: ~250 MB (shared across all Miroir pods)

4.5 Operational Complexity

Dimension	Redis	Raft
External dependency	Redis server + Sentinel or cluster	None
Backup	`redis-cli --rdb` or `SAVE`	SQLite file copy (per pod) + consensus guarantees
Monitoring	Redis metrics (latency, memory, connected clients)	Raft-specific metrics (leader status, log lag, commit index)
Failure mode	Redis down → all pods lose shared state	Pod down → Raft continues; majority lost → cluster stalls
Recovery	Redis restart → RDB/AOF replay	Pod restart → replay log from SQLite; cluster restart → quorum recovery
Secret rotation	Redis password (if used)	No secrets, but must manage Raft membership
Operator familiarity	High — Redis is widely known	Low — embedded Raft is niche
Helm chart complexity	Redis as a dependency (subchart or external)	No external deps, but membership bootstrap logic

4.6 Correctness Risk

This is the most important dimension.

Risk	Redis	Raft
Data loss	Redis AOF/RDB persistence can lose last 1s of writes	Raft guarantees committed entries survive minority failures
Split brain	Redis Sentinel can theoretically split-brain	Raft's term-based voting prevents split-brain by protocol
Implementation bugs	Redis is 15+ years old, battle-tested	openraft is ~4 years old, used in 3–4 production systems
Operational mistakes	Misconfiguring Redis persistence is common	Misconfiguring Raft membership can leave cluster inoperable

Redis is boring and well-understood. Raft is correct in theory but the implementation is newer and less battle-tested at Miroir's scale. A Raft bug in openraft could silently lose or duplicate task state in ways that are extremely difficult to diagnose.

5. Decision Matrix

Applying the plan's decision gate: Raft must be measurably better on at least one metric without being worse on any other.

Metric	Redis	Raft	Verdict
Write latency	0.3–0.8 ms	2–5 ms	Redis wins (3–8x)
Read latency	0.2–0.5 ms	0.05–0.2 ms	Raft wins (2–5x)
Write throughput	~100K ops/s	~5–15K ops/s	Redis wins (7–20x)
Read throughput	~100K ops/s	~500K+ ops/s	Raft wins (5x)
Memory (per pod)	0 additional	+90–185 MB	Redis wins
Memory (total cluster)	~250 MB shared	90–185 MB × N pods	Tie at 2 pods; Redis wins at 3+
Ops simplicity (deps)	Requires Redis	No external dep	Raft wins
Ops simplicity (failure)	Single failure domain (Redis)	Distributed failure (Raft quorum)	Redis wins (simpler mental model)
Correctness maturity	Very high (15+ years)	Moderate (~4 years, 3–4 prod users)	Redis wins
Backup/restore	Standard tooling	Custom (SQLite + Raft recovery)	Redis wins

Score

Raft wins on: ops simplicity (no external dep), read latency, read throughput
Raft loses on: write latency, write throughput, memory per pod, correctness maturity, operational tooling

Raft does not pass the decision gate. It is better on some metrics but worse on others — specifically worse on the metric that matters most for a consensus system: correctness maturity and write latency.

6. Decision

Ship: No.

Do not ship a Raft-backed task store in v0.x or v1.0.

Revisit: Before v2.0.

Re-evaluate when all of the following are true:

Redis backend is production-stabilized (at least 6 months of production traffic with no data-loss incidents)
The operational cost of Redis is empirically measured — how often does Redis cause incidents? How much operator time does it consume? If the answer is "almost never," Raft is unnecessary.
openraft reaches v1.0 stable — the current v0.10 alpha series has frequent breaking changes. Waiting for API stability avoids rewriting the integration.
Miroir has a working backup/restore story for Raft — before shipping, we need a documented procedure for recovering a Raft cluster after losing majority, and a tested snapshot-to-fresh-cluster restore path.

Rationale

Redis works. It's the industry-standard solution for shared state across stateless replicas. The operational burden of running Redis is well-understood and can be delegated to managed services (ElastiCache, Upstash, Redis Cloud) if self-hosting is undesirable.
The write latency penalty is material. Miroir's task store writes happen on the critical path of document mutations. Adding 2–5 ms of consensus latency per write, when Redis adds <1 ms, is a measurable degradation that the decision gate explicitly forbids.
The complexity budget is better spent elsewhere. Miroir's v1.0 has 21 advanced capabilities to ship (§13.1–§13.21), each with its own correctness surface. Adding a Raft implementation to the v1.0 scope would be a significant distraction with high downside risk.
Raft's advantage (no external dependency) is modest for K8s deployments. In Kubernetes, Redis is a standard add-on (Helm subchart, Bitnami chart, or managed service). It is not a novel operational burden. The real benefit of eliminating Redis would be for single-node deployments — but those already use SQLite.
The read advantage is irrelevant for Miroir. Sub-millisecond reads from Redis vs. sub-millisecond reads from local SQLite — the difference is invisible to clients and to the proxy's p99 latency budget.

Possible Future: Hybrid Approach

If we revisit and decide to ship Raft, the cleanest path is:

Implement TaskStore trait as planned (SQLite backend first, Redis backend second)
Add a third RaftTaskStore that composes SqliteTaskStore as the state machine, wrapped by openraft
All three backends share the same trait — the only difference is config
Migration path: sqlite → redis → raft is a config change, not a code rewrite

This preserves the investment in the SQLite and Redis backends and avoids forcing a binary choice.

Compilation Note

openraft 0.9.22 fails to compile on stable Rust 1.87 because its dependency validit 0.2.5 uses the unstable let_chains feature. The prototype works around this by simulating Raft consensus rather than depending on openraft directly — only bincode is needed for the serialization benchmarks. This compilation failure is itself a data point: a dependency that requires nightly Rust is not suitable for production use in v1.0.

7. Alternative Considered: LiteFS

LiteFS is a FUSE-based SQLite replication tool that transparently replicates SQLite writes to other nodes. It was considered as an alternative to both Redis and Raft.

Eliminated because:

Requires FUSE support in the container (not available in all K8s environments, especially hardened/flatcar nodes)
Single-writer model (one primary, others are read-only replicas) — the primary failover requires an external consul/election mechanism
Adds a FUSE filesystem layer between SQLite and the kernel, introducing latency and debugging complexity
Designed for Fly.io's infrastructure; using it elsewhere is possible but not its primary use case

Not suitable for Miroir's multi-writer K8s deployment model.

8. Appendix: Crate Deep-Dive

openraft v0.9.22 (Stable)

[dependencies]
openraft = { version = "0.9", features = ["serde", "type-alias"] }

Key types:

Raft<C> — the main Raft node; generic over config type C
RaftLogStorage<C> — persistent log storage trait
RaftStateMachine<C> — state machine application trait
RaftNetworkFactory<C> — creates per-peer network connections
Entry<C> — a log entry with payload
Snapshot<C> — a state machine snapshot

Configuration knobs relevant to Miroir:

heartbeat_interval — default 500ms (Miroir: 3s to match current leader lease interval)
election_timeout_min/max — default 150–300ms (Miroir: 3–5s for K8s network)
max_payload_entries — default 300 (batch log appends for throughput)
snapshot_policy — SnapshotPolicy::LogsSinceLast(10000) (snapshot every 10K entries)

raft-rs v0.7.0

Not recommended for Miroir due to sync-only API (see §2.4), but included for completeness:

[dependencies]
raft = "0.7"