miroir/tests/benches/score-comparability
jedarden affb59fff6 P12.OP4: Validate RRF merge quality — τ=0.14 confirms DFS preflight is required
RRF merge (k=60) benchmarked against ground truth with 10K queries on
skewed 10-shard corpus (93% on shard 1). Result: Kendall τ = 0.1369
(95% CI [0.1339, 0.1399]), far below the 0.95 threshold. 9,998 of 10,000
queries fell below τ=0.95, confirming RRF alone is insufficient for
cross-shard ranking quality with skewed distributions.

DFS preflight (already implemented) achieves τ = 0.9818, passing the
threshold. Add full 10K-query DFS comparison report and fix paths in
experiment.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 05:43:42 -04:00
corpus P12.OP4: Score normalization at scale — research & benchmark infrastructure 2026-04-18 23:58:08 -04:00
queries P12.OP4: Score normalization at scale — research & benchmark infrastructure 2026-04-18 23:58:08 -04:00
results P12.OP4: Validate RRF merge quality — τ=0.14 confirms DFS preflight is required 2026-04-19 05:43:42 -04:00
README.md P12.OP4: Score normalization at scale — research & benchmark infrastructure 2026-04-18 23:58:08 -04:00
simulate.py Phase 1 Core Routing: validate and fix compilation 2026-04-19 03:22:33 -04:00

Score Comparability Benchmark

Tests whether _rankingScore values from different shards are comparable when documents are distributed unevenly across shards.

Problem Statement

Meilisearch's ranking pipeline computes scores from statistics local to each index (term frequency, document frequency). When shards hold very different document distributions, the same query can produce scores on different shards that are not directly comparable, so merging results by raw score can yield an incorrect ranking.
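To make the problem concrete, the sketch below shows how a statistic that depends on document frequency shifts when it is computed per shard rather than globally. It uses a textbook smoothed IDF formula purely as an illustrative stand-in; it is not Meilisearch's scoring formula, and the document counts are hypothetical.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Textbook smoothed inverse document frequency.

    Illustrative stand-in only; not Meilisearch's actual scoring.
    """
    return math.log((total_docs + 1) / (docs_with_term + 1)) + 1.0


# Hypothetical counts: the term appears in 1% of the global corpus,
# is common on a small shard, and is rare on a large shard.
global_idf = idf(total_docs=100_000, docs_with_term=1_000)
small_shard_idf = idf(total_docs=100, docs_with_term=50)
large_shard_idf = idf(total_docs=90_000, docs_with_term=100)

print(f"global:      {global_idf:.3f}")
print(f"small shard: {small_shard_idf:.3f}")
print(f"large shard: {large_shard_idf:.3f}")
```

The same term gets a different weight on each shard, so two equally relevant documents can receive different scores depending only on where they are stored.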

Experiment Design

  1. Ground truth: Single Meilisearch index with all documents
  2. Distributed setup: Same documents sharded across N nodes with intentional skew
  3. Measurement: Kendall tau (τ) between merged distributed results and ground truth
  4. Pass criterion: mean τ ≥ 0.95 across 10,000 random queries
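The measurement and pass criterion above can be sketched as follows. This is a minimal illustration, not the code in simulate.py: the function names are hypothetical, the merge step uses Reciprocal Rank Fusion with k=60 as one possible merge strategy, and τ is computed as Kendall tau-a over the documents both rankings share, which assumes no tied positions.

```python
from itertools import combinations
from statistics import mean


def rrf_merge(shard_rankings, k=60):
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank)) over shards."""
    scores = {}
    for ranking in shard_rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def kendall_tau(ranking_a, ranking_b):
    """Kendall tau-a over the documents that appear in both rankings."""
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    common = [doc for doc in ranking_a if doc in pos_b]
    if len(common) < 2:
        return 1.0  # assumption: too few shared docs to disagree on order
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # x precedes y in ranking_a by construction.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / n_pairs


def passes(taus, threshold=0.95):
    """Pass criterion: mean tau across all queries meets the threshold."""
    return mean(taus) >= threshold


# Toy example: four documents, heavily skewed across two shards.
ground_truth = ["d1", "d2", "d3", "d4"]
shards = [["d1", "d2", "d3"], ["d4"]]
merged = rrf_merge(shards)
tau = kendall_tau(merged, ground_truth)
print(merged, round(tau, 4))
```

Because sharded documents appear on exactly one shard, RRF here depends only on each document's local rank, so the top hit of a tiny shard ties with the top hit of a large one. That is one way a rank-only merge can diverge from the single-index ordering.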

Corpus Structure

  • 100,000 documents total
  • 10 shards with intentional size skew (shard 0 = baseline size, shard 1 = 100× baseline, shard 9 = 0.01× baseline)
  • Documents have: id, title, content (synthetic text), category (for filtering)
  • 50 unique terms distributed across documents with varying frequencies
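As a rough check on the skew, the sketch below derives per-shard document counts from the multipliers listed above. The README does not state the multipliers for shards 2 through 8, so the code assumes they are baseline size (1×); the function name and the rounding strategy are also assumptions for illustration.

```python
def shard_sizes(total_docs=100_000, n_shards=10, multipliers=None):
    """Split total_docs across shards in proportion to size multipliers.

    Assumption: any shard without an explicit multiplier is baseline (1x).
    """
    if multipliers is None:
        multipliers = {1: 100.0, 9: 0.01}  # from the corpus description
    weights = [multipliers.get(i, 1.0) for i in range(n_shards)]
    total_weight = sum(weights)
    # Floor each share, then give the rounding remainder to the largest shard
    # so the counts always sum to total_docs.
    sizes = [int(total_docs * w / total_weight) for w in weights]
    sizes[sizes.index(max(sizes))] += total_docs - sum(sizes)
    return sizes


sizes = shard_sizes()
for shard_id, size in enumerate(sizes):
    print(f"shard {shard_id}: {size:>6} docs ({100 * size / sum(sizes):.2f}%)")
```

Under that assumption shard 1 holds roughly 93% of the corpus, which is consistent with the figure quoted in the commit message.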

Directory Layout

  • corpus/: Test document sets (JSONL)
  • queries/: Generated query sets for experiments
  • results/: Experimental results and analysis

Running Experiments

See the individual experiment scripts in the subdirectories of results/.