miroir/tests/chaos
jedarden 304879d32a feat(tests): add chaos test scenarios and runbooks (plan §8, P9.4)
Add comprehensive chaos testing infrastructure for Miroir failure scenarios:

- **TestCluster** harness with chaos helpers:
  - `kill_meili()` / `restart_meili()` for node failure simulation
  - `apply_netem()` / `remove_netem()` for network delay injection
  - `kill_miroir()` / `restart_miroir()` for orchestrator failure
  - Docker-compose stack lifecycle management

- **6 chaos test scenarios** (all marked `#[ignore]`):
  1. Kill 1 of 3 nodes (RF=2) → continuous search, no degraded header
  2. Kill 2 of 3 nodes (RF=2) → 503 or partial results with degraded header
  3. Kill 1 of 2 Miroir replicas → zero client-visible downtime
  4. tc netem 500ms delay → searches slow but succeed, no errors
  5. Restart killed node → Miroir detects recovery within health check interval
  6. Kill node mid-rebalance → rebalancer pauses, resumes on recovery

- **Runbooks** in `tests/chaos/runbooks/scenario*.md`:
  - Manual reproduction steps
  - Expected observables (metrics, headers, errors)
  - Recovery procedures
  - HA vs single-instance differences
  - Operator notes and common causes

- **Updated docker-compose files**:
  - Added `CAP_NET_ADMIN` to all Meilisearch containers for tc netem support

Tests are slow (30+ seconds each) and require docker-compose. Run with:
  cargo test --test chaos -- --ignored --test-threads=1

Closes: miroir-89x.4
2026-05-24 10:23:24 -04:00
Directory contents:

  • runbooks/ (last commit: feat(tests): add chaos test scenarios and runbooks (plan §8, P9.4), 2026-05-24 10:23:24 -04:00)
  • 01_kill_one_node_rf2.md
  • 02_kill_two_nodes_rf2.md
  • 03_kill_miroir_replica.md
  • 04_network_delay.md
  • 05_restart_node.md
  • 06_kill_during_rebalance.md
  • README.md
  • run_all_chaos_tests.sh

All entries other than runbooks/ were last changed by feat(multi-search): implement timeout enforcement and acceptance tests (§13.11), 2026-05-24 01:54:20 -04:00.

Miroir Chaos Tests

This directory contains chaos engineering tests for Miroir. These tests verify Miroir's behavior under failure conditions and check that it meets its availability and consistency guarantees.

Prerequisites

Start the test environment:

cd /home/coding/miroir/examples
docker-compose -f docker-compose-dev.yml up -d

Wait for all services to be healthy:

docker-compose -f docker-compose-dev.yml ps
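
Checking the service list by hand is fine for manual runs; an automated harness would more likely poll until the stack is ready. A minimal sketch of such a poll loop follows. `wait_until` is a hypothetical helper written for this README, not a function confirmed to exist in the Miroir test harness.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `check` repeatedly, sleeping `interval` between attempts.
/// Returns the elapsed time once `check` returns true, or `None`
/// if `timeout` elapses first. (Hypothetical helper, for illustration.)
fn wait_until<F: FnMut() -> bool>(
    mut check: F,
    timeout: Duration,
    interval: Duration,
) -> Option<Duration> {
    let start = Instant::now();
    loop {
        if check() {
            return Some(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return None;
        }
        sleep(interval);
    }
}
```

In a chaos test, `check` could wrap a health probe, and the returned duration could be compared against the configured health check interval (as in scenario 5).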

Scenarios

Each scenario has its own runbook with detailed steps:

  1. Kill 1 of 3 nodes (RF=2), runbook 01_kill_one_node_rf2.md: Continuous search; degraded writes warn via header
  2. Kill 2 of 3 nodes (RF=2), runbook 02_kill_two_nodes_rf2.md: Shard loss; 503 or partial per policy
  3. Kill 1 of 2 Miroir replicas, runbook 03_kill_miroir_replica.md: Zero client-visible downtime
  4. Network delay 500ms, runbook 04_network_delay.md: Search slows, no errors
  5. Restart killed node, runbook 05_restart_node.md: Miroir detects within health interval
  6. Kill node mid-rebalance, runbook 06_kill_during_rebalance.md: Pause + resume; no data loss
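
Scenario 4 injects latency with `tc netem`, which requires `CAP_NET_ADMIN` on the target container. As a rough sketch of the commands involved, the helpers below build the `docker` argument lists for adding and removing a fixed delay. The function names, the interface name `eth0`, and the use of `docker exec` are assumptions for illustration; the harness's actual `apply_netem()` / `remove_netem()` may work differently.

```rust
use std::process::Command;

/// Arguments for: docker exec <container> tc qdisc add dev <iface> root netem delay <N>ms
/// (Hypothetical helper; not the harness's confirmed implementation.)
fn netem_add_args(container: &str, iface: &str, delay_ms: u32) -> Vec<String> {
    vec![
        "exec".to_string(),
        container.to_string(),
        "tc".to_string(),
        "qdisc".to_string(),
        "add".to_string(),
        "dev".to_string(),
        iface.to_string(),
        "root".to_string(),
        "netem".to_string(),
        "delay".to_string(),
        format!("{}ms", delay_ms),
    ]
}

/// Arguments for: docker exec <container> tc qdisc del dev <iface> root
fn netem_del_args(container: &str, iface: &str) -> Vec<String> {
    ["exec", container, "tc", "qdisc", "del", "dev", iface, "root"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Runs `docker` with the given arguments; returns true if it exited with status 0.
fn run_docker(args: &[String]) -> std::io::Result<bool> {
    Ok(Command::new("docker").args(args).status()?.success())
}
```

For example, `run_docker(&netem_add_args("miroir-meili-0", "eth0", 500))` would add the 500ms delay, and the matching `netem_del_args` call would remove it during cleanup.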

Running Tests

Automated

cd /home/coding/miroir/tests/chaos
./run_all_chaos_tests.sh

Manual

Follow the steps in each scenario's runbook.

Cleanup

Stop the test environment:

cd /home/coding/miroir/examples
docker-compose -f docker-compose-dev.yml down -v

Expected Behaviors

RF=2 Configuration

  • 1 node down: Continued reads, writes degrade with warning header
  • 2 nodes down: Shard unavailability, 503 errors or partial results
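
The reasoning behind these two cases can be checked with a small model. The sketch below assumes a round-robin placement in which replica r of shard s lives on node (s + r) mod N; that placement rule is an assumption for illustration, and Miroir's real shard assignment may differ. The conclusion for a single failure holds for any placement that puts a shard's two replicas on distinct nodes.

```rust
use std::collections::HashSet;

/// Nodes holding `shard` under the assumed round-robin placement:
/// replica r of shard s lives on node (s + r) % nodes.
fn replicas(shard: usize, nodes: usize, rf: usize) -> Vec<usize> {
    (0..rf).map(|r| (shard + r) % nodes).collect()
}

/// Shards that have no live replica once every node in `down` is killed.
fn unavailable_shards(
    shards: usize,
    nodes: usize,
    rf: usize,
    down: &HashSet<usize>,
) -> Vec<usize> {
    (0..shards)
        .filter(|&s| replicas(s, nodes, rf).iter().all(|n| down.contains(n)))
        .collect()
}
```

With 3 nodes, 3 shards, and RF=2, killing node 0 leaves every shard with one live replica, so reads continue. Killing nodes 0 and 1 removes both replicas of shard 0 in this model, which is the case where Miroir returns 503 or partial results.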

Miroir Replica Resilience

  • 1 replica down: Zero client-visible downtime (load balancer fails over)

Rebalance Safety

  • Node killed during rebalance: Pauses, resumes on restart, no data loss
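
The pause-and-resume expectation can be pictured as a small state machine. The sketch below is a toy model of the observable behavior only; the type and event names are invented for this README and say nothing about how Miroir's rebalancer is actually implemented.

```rust
/// Toy model of rebalance progress (names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Rebalance {
    Running { moved: u32, total: u32 },
    Paused { moved: u32, total: u32 },
    Done,
}

enum Event {
    ShardMoved,
    NodeDown,
    NodeUp,
}

/// Applies one event. Progress (`moved`) is carried across a pause,
/// and shard moves are ignored while paused.
fn step(state: Rebalance, event: Event) -> Rebalance {
    use Rebalance::*;
    match (state, event) {
        (Running { moved, total }, Event::ShardMoved) => {
            if moved + 1 >= total {
                Done
            } else {
                Running { moved: moved + 1, total }
            }
        }
        (Running { moved, total }, Event::NodeDown) => Paused { moved, total },
        (Paused { moved, total }, Event::NodeUp) => Running { moved, total },
        // Every other combination leaves the state unchanged.
        (other, _) => other,
    }
}
```

A test for scenario 6 would assert the equivalent of this sequence: the move count seen before the node was killed is the count from which the rebalance resumes after the restart.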

Monitoring

During chaos tests, monitor:

  • Miroir logs: docker logs miroir-orchestrator -f
  • Meilisearch logs: docker logs miroir-meili-0 -f
  • Health status: curl http://localhost:7700/health
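
For automated runs, the same health probe can be issued from Rust without extra crates. The sketch below sends a plain HTTP/1.1 GET over a TCP socket and reads back the status code. The helper names are hypothetical, and the address and path are taken from the `curl` example above; a real harness would more likely use an HTTP client library.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Extracts the status code from a status line such as "HTTP/1.1 200 OK".
fn parse_status_line(line: &str) -> Option<u16> {
    let mut parts = line.split_whitespace();
    let version = parts.next()?;
    if !version.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

/// Sends `GET <path>` to `addr` (for example "localhost:7700") and returns
/// the response status code, or `None` if the status line is malformed.
fn health_status(addr: &str, path: &str, timeout: Duration) -> std::io::Result<Option<u16>> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(timeout))?;
    write!(
        stream,
        "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        path, addr
    )?;
    let mut line = String::new();
    BufReader::new(stream).read_line(&mut line)?;
    Ok(parse_status_line(&line))
}
```

Calling `health_status("localhost:7700", "/health", Duration::from_secs(2))` mirrors the `curl` command; a connection error while a node is killed is expected and should be treated as "unhealthy" rather than as a test failure.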