Add comprehensive chaos testing infrastructure for Miroir failure scenarios:

- **TestCluster** harness with chaos helpers:
  - `kill_meili()` / `restart_meili()` for node failure simulation
  - `apply_netem()` / `remove_netem()` for network delay injection
  - `kill_miroir()` / `restart_miroir()` for orchestrator failure
  - Docker-compose stack lifecycle management
- **6 chaos test scenarios** (all marked `#[ignore]`):
  1. Kill 1 of 3 nodes (RF=2) → continuous search, no degraded header
  2. Kill 2 of 3 nodes (RF=2) → 503 or partial results with degraded header
  3. Kill 1 of 2 Miroir replicas → zero client-visible downtime
  4. tc netem 500ms delay → searches slow but succeed, no errors
  5. Restart killed node → Miroir detects recovery within health check interval
  6. Kill node mid-rebalance → rebalancer pauses, resumes on recovery
- **Runbooks** in `tests/chaos/runbooks/scenario*.md`:
  - Manual reproduction steps
  - Expected observables (metrics, headers, errors)
  - Recovery procedures
  - HA vs single-instance differences
  - Operator notes and common causes
- **Updated docker-compose files**:
  - Added `CAP_NET_ADMIN` to all Meilisearch containers for tc netem support

Tests are slow (30+ seconds each) and require docker-compose. Run with:

    cargo test --test chaos -- --ignored --test-threads=1

Closes: miroir-89x.4
- runbooks
- 01_kill_one_node_rf2.md
- 02_kill_two_nodes_rf2.md
- 03_kill_miroir_replica.md
- 04_network_delay.md
- 05_restart_node.md
- 06_kill_during_rebalance.md
- README.md
- run_all_chaos_tests.sh
Miroir Chaos Tests
This directory contains chaos engineering tests for Miroir. These tests verify system behavior under failure conditions and ensure the system meets its availability and consistency guarantees.
Prerequisites
Start the test environment:
cd /home/coding/miroir/examples
docker-compose -f docker-compose-dev.yml up -d
Wait for all services to be healthy:
docker-compose -f docker-compose-dev.yml ps
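Waiting for the services to become healthy can also be automated from a Rust test instead of checking by eye. The following is a minimal sketch, not part of the Miroir harness: `wait_until` is a hypothetical helper that polls any readiness check until it succeeds or a timeout expires.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `probe` until it returns true or `timeout` elapses.
/// Returns true if the probe succeeded in time, false otherwise.
/// Hypothetical helper for illustration; not taken from the Miroir test code.
pub fn wait_until<F>(timeout: Duration, interval: Duration, mut probe: F) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Always probe at least once, even with a zero timeout.
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

A chaos test could pass a closure that checks each service's health endpoint and fail fast when `wait_until` returns false, rather than sleeping for a fixed period.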
Scenarios
Each scenario has its own runbook with detailed steps:
- Kill 1 of 3 nodes (RF=2) — Continuous search; degraded writes warn via header
- Kill 2 of 3 nodes (RF=2) — Shard loss; 503 or partial per policy
- Kill 1 of 2 Miroir replicas — Zero client-visible downtime
- [Network delay 500ms](./04_network_delay.md) — Search slows, no errors
- Restart killed node — Miroir detects within health interval
- Kill node mid-rebalance — Pause + resume; no data loss
Running Tests
Automated
cd /home/coding/miroir/tests/chaos
./run_all_chaos_tests.sh
Manual
Follow the steps in each scenario's runbook.
Cleanup
Stop the test environment:
cd /home/coding/miroir/examples
docker-compose -f docker-compose-dev.yml down -v
Expected Behaviors
RF=2 Configuration
- 1 node down: Continued reads, writes degrade with warning header
- 2 nodes down: Shard unavailability, 503 errors or partial results
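The two RF=2 outcomes follow from how many replicas of each shard survive. The sketch below models this with an assumed ring placement (replica `i` of shard `s` lives on node `(s + i) % nodes`). Miroir's real placement strategy is not described here, so the function names and the placement rule are illustrative only.

```rust
/// Health of a single shard given the set of failed nodes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShardHealth {
    /// All replicas are reachable.
    Healthy,
    /// At least one replica is reachable, but fewer than RF.
    Degraded,
    /// No replica is reachable.
    Unavailable,
}

/// Nodes holding a copy of `shard`.
/// Assumption for illustration: simple ring placement, not Miroir's actual algorithm.
pub fn replica_nodes(shard: usize, nodes: usize, rf: usize) -> Vec<usize> {
    (0..rf).map(|i| (shard + i) % nodes).collect()
}

/// Classifies one shard by counting its replicas on nodes that are still up.
pub fn shard_health(shard: usize, nodes: usize, rf: usize, down: &[usize]) -> ShardHealth {
    let live = replica_nodes(shard, nodes, rf)
        .into_iter()
        .filter(|n| !down.contains(n))
        .count();
    match live {
        0 => ShardHealth::Unavailable,
        n if n < rf => ShardHealth::Degraded,
        _ => ShardHealth::Healthy,
    }
}

/// Health of every shard in the cluster, indexed by shard number.
pub fn cluster_health(shards: usize, nodes: usize, rf: usize, down: &[usize]) -> Vec<ShardHealth> {
    (0..shards).map(|s| shard_health(s, nodes, rf, down)).collect()
}
```

Under this placement, with 3 nodes and RF=2, any single node failure leaves every shard with at least one live replica, so reads continue. Any two node failures leave exactly one shard with no replica, which corresponds to the 503 or partial-result case.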
Miroir Replica Resilience
- 1 replica down: Zero client-visible downtime (load balancer fails over)
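The failover behavior this scenario expects can be sketched as "try each replica in order and return the first success". The helper below is hypothetical and only illustrates the idea. In the real setup the load balancer performs this step, and the replica addresses are whatever the deployment defines.

```rust
/// Tries each replica in order and returns the first successful response.
/// If every replica fails, returns all errors in the order the replicas were tried.
/// Hypothetical illustration of load-balancer failover; not Miroir code.
pub fn call_with_failover<T, E, F>(replicas: &[&str], mut call: F) -> Result<T, Vec<E>>
where
    F: FnMut(&str) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for &replica in replicas {
        match call(replica) {
            // The client only ever sees this successful response.
            Ok(response) => return Ok(response),
            // Remember the failure and move on to the next replica.
            Err(err) => errors.push(err),
        }
    }
    Err(errors)
}
```

With two replicas and one of them killed, the call still returns `Ok`, which is what "zero client-visible downtime" means from the client's side.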
Rebalance Safety
- Node killed during rebalance: Pauses, resumes on restart, no data loss
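The pause-and-resume behavior can be pictured as a small state machine in which progress is carried across the pause. This is an illustrative model only. The `Rebalance` and `Event` types are assumptions and do not reflect how Miroir's rebalancer is actually implemented.

```rust
/// Simplified rebalance state. `moved` counts shards already relocated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rebalance {
    Running { moved: u32, total: u32 },
    Paused { moved: u32, total: u32 },
    Done,
}

/// Events the model reacts to (hypothetical names).
#[derive(Debug, Clone, Copy)]
pub enum Event {
    ShardMoved,
    NodeDown,
    NodeUp,
}

impl Rebalance {
    /// Returns the next state; events that do not apply leave the state unchanged.
    pub fn on(self, event: Event) -> Rebalance {
        use Rebalance::*;
        match (self, event) {
            // Moving the last shard finishes the rebalance.
            (Running { moved, total }, Event::ShardMoved) if moved + 1 >= total => Done,
            (Running { moved, total }, Event::ShardMoved) => Running { moved: moved + 1, total },
            // A node failure pauses the rebalance but keeps the progress counter.
            (Running { moved, total }, Event::NodeDown) => Paused { moved, total },
            // Recovery resumes from the same point, so no work is lost or repeated.
            (Paused { moved, total }, Event::NodeUp) => Running { moved, total },
            (state, _) => state,
        }
    }
}
```

The property the chaos test checks corresponds to the `Paused` to `Running` transition preserving `moved`: nothing relocated before the failure is discarded.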
Monitoring
During chaos tests, monitor:
- Miroir logs: `docker logs miroir-orchestrator -f`
- Meilisearch logs: `docker logs miroir-meili-0 -f`
- Health status: `curl http://localhost:7700/health`
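The health check above can also be issued from a Rust test using only the standard library. The sketch below is a hedged example: `health_ok` and `is_http_200` are hypothetical names, and it assumes the endpoint answers plain HTTP on the given address and signals health with a 200 status.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Returns true if the first line of a raw HTTP response carries status code 200.
pub fn is_http_200(response: &str) -> bool {
    response
        .lines()
        .next()
        .and_then(|status_line| status_line.split_whitespace().nth(1))
        == Some("200")
}

/// Sends `GET /health` to `addr` (for example "localhost:7700") and reports
/// whether the service answered with HTTP 200. Any connection or write
/// failure counts as unhealthy. Hypothetical helper, not Miroir code.
pub fn health_ok(addr: &str) -> bool {
    let Ok(mut stream) = TcpStream::connect(addr) else {
        return false;
    };
    // Bound the wait so a hung node does not stall the test run.
    let timeout = Some(Duration::from_secs(2));
    let _ = stream.set_read_timeout(timeout);
    let _ = stream.set_write_timeout(timeout);

    let request = format!("GET /health HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }

    // Read errors are ignored; the status line check decides the outcome.
    let mut response = String::new();
    let _ = stream.read_to_string(&mut response);
    is_http_200(&response)
}
```

Calling `health_ok("localhost:7700")` before and after killing a node gives a simple programmatic signal to pair with the log output.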