Implements plan §2 topology changes and §4 rebalancer with full elastic
cluster operations: node addition/removal, replica group management, and
unplanned failure handling.

Core changes:
- topology.rs: Add GroupState::Draining for group removal flow
- router.rs: query_group_active() excludes draining groups via is_routing()
- scatter.rs: Health filtering with cross-group fallback for failed nodes
- rebalancer.rs: Add handle_node_recovery() for RF restore after recovery
- main.rs: Unplanned node failure detection with consecutive failure/success
  tracking, automatic Degraded/Failed transitions, and recovery event triggers
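A minimal sketch of the consecutive failure/success tracking described for main.rs. It is illustrative only: the type names (NodeHealth, HealthEvent, HealthTracker), the three thresholds, and the Healthy state are assumptions made for this example and are not taken from the actual code; only the Degraded and Failed states are named in this commit.

```rust
/// Hypothetical health states. Only Degraded and Failed are named in the
/// commit; Healthy is an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeHealth {
    Healthy,
    Degraded,
    Failed,
}

/// Hypothetical event raised when a node returns to Healthy, e.g. to
/// trigger something like handle_node_recovery().
#[derive(Debug, PartialEq, Eq)]
pub enum HealthEvent {
    Recovered,
}

/// Tracks consecutive probe results for one node. All thresholds are
/// assumed parameters, not values from the real implementation.
pub struct HealthTracker {
    state: NodeHealth,
    consecutive_failures: u32,
    consecutive_successes: u32,
    degraded_after: u32,
    failed_after: u32,
    recover_after: u32,
}

impl HealthTracker {
    pub fn new(degraded_after: u32, failed_after: u32, recover_after: u32) -> Self {
        Self {
            state: NodeHealth::Healthy,
            consecutive_failures: 0,
            consecutive_successes: 0,
            degraded_after,
            failed_after,
            recover_after,
        }
    }

    pub fn state(&self) -> NodeHealth {
        self.state
    }

    /// Record one health probe result. Returns a recovery event when a
    /// Degraded or Failed node has seen enough consecutive successes.
    pub fn record(&mut self, probe_ok: bool) -> Option<HealthEvent> {
        if probe_ok {
            self.consecutive_failures = 0;
            self.consecutive_successes = self.consecutive_successes.saturating_add(1);
            if self.state != NodeHealth::Healthy
                && self.consecutive_successes >= self.recover_after
            {
                self.state = NodeHealth::Healthy;
                return Some(HealthEvent::Recovered);
            }
        } else {
            self.consecutive_successes = 0;
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
            if self.consecutive_failures >= self.failed_after {
                self.state = NodeHealth::Failed;
            } else if self.consecutive_failures >= self.degraded_after
                && self.state == NodeHealth::Healthy
            {
                // Only Healthy nodes move to Degraded; a Failed node stays Failed.
                self.state = NodeHealth::Degraded;
            }
        }
        None
    }
}
```

With thresholds of 2, 4, and 3, two failed probes in a row would mark the node Degraded, four would mark it Failed, and three successful probes in a row would then return it to Healthy and emit the recovery event.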

Admin API:
- POST /_miroir/nodes/{id}/recover - Mark failed node as recovered
- DELETE /_miroir/nodes/{id} - Remove node (after drain)
- POST /_miroir/nodes/{id}/drain - Start node drain for removal
- POST /_miroir/nodes/{id}/fail - Mark node as failed
- POST /_miroir/replica_groups - Add replica group
- GET /_miroir/replica_groups/{id}/status - Group sync progress
- POST /_miroir/replica_groups/{id}/activate - Mark group active
- DELETE /_miroir/replica_groups/{id} - Remove replica group
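The replica group endpoints above suggest a lifecycle of add, sync, activate, and remove. The sketch below shows one way routing could respect that lifecycle. GroupState::Draining and is_routing() are named in this commit; the Syncing and Active variants, the ReplicaGroup struct, the routable_group_ids helper, and the rule that only Active groups route are assumptions made for illustration.

```rust
/// Draining comes from this commit; Syncing and Active are assumed
/// variants for the states before and after activation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupState {
    Syncing,
    Active,
    Draining,
}

impl GroupState {
    /// Assumed rule: only Active groups receive queries, so groups that
    /// are still syncing or already draining are skipped.
    pub fn is_routing(self) -> bool {
        matches!(self, GroupState::Active)
    }
}

/// Hypothetical, simplified view of a replica group.
pub struct ReplicaGroup {
    pub id: u32,
    pub state: GroupState,
}

/// Hypothetical helper in the spirit of query_group_active(): returns the
/// ids of the groups that may serve queries.
pub fn routable_group_ids(groups: &[ReplicaGroup]) -> Vec<u32> {
    groups
        .iter()
        .filter(|g| g.state.is_routing())
        .map(|g| g.id)
        .collect()
}
```

Under these assumptions, a group created via POST would stay out of rotation until activated, and a group marked Draining would stop receiving queries before its DELETE completes.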

Tests:
- p4_topology_chaos.rs: All 5 chaos tests pass
  * Add node mid-indexing: docs readable, no duplicates
  * Drain node while querying: zero client-visible failures
  * Add replica group while querying: existing groups unaffected
  * Rebalance moves ≤ 2×(1/4) of docs (optimal)
  * Restart node mid-rebalance: pauses + resumes, no data loss
- p25_task_reconciliation.rs: Task ID reconciliation acceptance tests

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>