miroir/tests/chaos/02_kill_two_nodes_rf2.md
jedarden 158752fe7b feat(multi-search): implement timeout enforcement and acceptance tests (§13.11)
- Add per-query and total timeout enforcement to MultiSearchExecutor
- Improve SearchResult with helper methods (ok, err, timeout, is_success)
- Fix ModeACoordinator feature-gate compilation issues
- Add comprehensive acceptance tests for multi-search:
  - 5-query batch completion
  - Slow query doesn't block fast queries (parallel execution)
  - Partial failure handling
  - Per-query timeout
  - Total timeout
  - 100-query batch completion

Closes: miroir-uhj.11
2026-05-24 01:54:20 -04:00

5.4 KiB

Chaos Test 02: Kill 2 of 3 Nodes (RF=2)

Objective

Verify that with 2 of 3 nodes down (RF=2), the system properly handles shard loss with appropriate error responses (503 or partial results per policy).

Preconditions

  • 3-node Meilisearch cluster running (RF=2)
  • Miroir orchestrator healthy
  • Test index with documents distributed across shards

Test Steps

1. Setup: Create test index with distributed data

# Create index
curl -X POST 'http://localhost:7700/indexes' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "uid": "chaos_test_02",
    "primaryKey": "id"
  }'

# Add documents (will distribute across shards)
for i in {1..20}; do
  curl -X POST "http://localhost:7700/indexes/chaos_test_02/documents" \
    -H 'Authorization: Bearer dev-key' \
    -H 'Content-Type: application/json' \
    --data-binary "{\"id\": $i, \"title\": \"Document $i\", \"batch\": $((i/10+1))}"
done

# Wait for indexing
sleep 10

2. Verify baseline: All documents searchable

curl -X POST 'http://localhost:7700/indexes/chaos_test_02/search' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  --data-binary '{"q": "", "limit": 100}' \
  | jq '{totalHits: .totalHits, retrieved: .hits | length}'

# Expected: 20 total hits

3. Kill first Meilisearch node

docker stop miroir-meili-0
docker kill miroir-meili-0

echo "Node 0 killed. System should still be operational (RF=2)."
sleep 5

4. Verify: Still operational (1 node down)

curl -X POST 'http://localhost:7700/indexes/chaos_test_02/search' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  --data-binary '{"q": "", "limit": 100}' \
  -w '\nHTTP Status: %{http_code}\n'

# Expected: 200 OK, all data still available (RF=2 means replicas exist)

5. Kill second Meilisearch node (critical failure)

docker stop miroir-meili-1
docker kill miroir-meili-1

echo "CRITICAL: 2 of 3 nodes down. Some shards may be unavailable."
sleep 5

6. Test: Shard loss behavior

# Attempt search - behavior depends on policy
echo "=== Testing search with 2 nodes down ==="
curl -X POST 'http://localhost:7700/indexes/chaos_test_02/search' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  --data-binary '{"q": "", "limit": 100}' \
  -w '\nHTTP Status: %{http_code}\n' \
  -D -

# Expected behaviors (per policy):
# Option A: 503 Service Unavailable with shard loss error
# Option B: Partial results with warning header indicating degraded response

7. Test: Write behavior during shard loss

echo "=== Testing write with 2 nodes down ==="
curl -X POST 'http://localhost:7700/indexes/chaos_test_02/documents' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  -i \
  --data-binary '{"id": 999, "title": "Emergency Document"}'

# Expected: Either 503 (write rejected) or 202 with warning (write queued/failed)

8. Verify error message format

# Capture full error response
RESPONSE=$(curl -X POST 'http://localhost:7700/indexes/chaos_test_02/search' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  --data-binary '{"q": "document"}' \
  -s)

echo "$RESPONSE" | jq '.'

# Expected error structure (if 503):
# {
#   "message": "shard unavailable: insufficient replicas for shard X",
#   "code": "shard_unavailable",
#   "type": "system_error",
#   "link": "https://docs.miroir.dev/errors/shard_unavailable"
# }

9. Recovery: Restart nodes

echo "=== Recovery: Restarting nodes ==="
docker start miroir-meili-0
docker start miroir-meili-1

# Wait for cluster recovery
echo "Waiting for cluster recovery..."
sleep 30

# Verify health
curl -X GET 'http://localhost:7700/health' -H 'Authorization: Bearer dev-key' | jq '.'

10. Verify: Full functionality restored

curl -X POST 'http://localhost:7700/indexes/chaos_test_02/search' \
  -H 'Authorization: Bearer dev-key' \
  -H 'Content-Type: application/json' \
  --data-binary '{"q": "", "limit": 100}' \
  | jq '{totalHits: .totalHits, retrieved: .hits | length}'

# Expected: 20 total hits, full functionality restored

Expected Results

State Search Behavior Write Behavior
1 node down 200 OK, full results 202 with degradation warning
2 nodes down 503 OR partial results 503 OR 202 with error

Success Criteria

  • System operational with 1 node down (RF=2)
  • System properly handles 2 nodes down per policy (503 or partial)
  • Error messages include clear shard availability information
  • All nodes recover gracefully on restart
  • Full functionality restored after recovery

Policy Notes

The expected behavior for 2 nodes down depends on Miroir's configuration policy:

Fail-Closed Policy (Recommended for consistency):

  • Returns 503 Service Unavailable
  • Clear error message indicating shard loss
  • No partial results served

Fail-Open Policy (For availability):

  • Returns partial results from available shards
  • Includes warning header: X-Miroir-Partial-Results: true
  • Response includes availableShards count

Cleanup

# Delete test index
curl -X DELETE 'http://localhost:7700/indexes/chaos_test_02' \
  -H 'Authorization: Bearer dev-key'

# Ensure all nodes are running
docker start miroir-meili-0 || true
docker start miroir-meili-1 || true
docker start miroir-meili-2 || true