miroir/tests/benches/score-comparability/README.md
jedarden 72f9a197b5 P12.OP4: Score normalization at scale — research & benchmark infrastructure
Completed Plan §15 Open Problem #4 research on cross-shard score comparability.

## Key Finding
Average Kendall tau: 0.79 vs. 0.95 threshold — FAIL

Cross-shard score comparability is a significant issue:
- Common-term queries: τ = 0.15 (catastrophic)
- Local IDF statistics cause score inflation on small shards
- Documents from 10-doc shards outrank 93K-doc shard results

## Recommendation
Implement Reciprocal Rank Fusion (RRF) for result merging.
Follow-up bead: miroir-nsu

## Artifacts Added
- Benchmark infrastructure: tests/benches/score-comparability/
  - Corpus generator with extreme shard skew (100× variance)
  - Query generator (10K random queries across 5 types)
  - BM25-based simulation with global vs local IDF
  - Kendall tau comparison tool
  - Full experimental results (τ = 0.79 ± 0.01, 95% CI)
- Research writeup: docs/research/score-normalization-at-scale.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:58:08 -04:00


Score Comparability Benchmark

Tests whether _rankingScore values from different shards are comparable when documents are distributed unevenly across shards.

Problem Statement

Meilisearch's ranking pipeline computes scores from statistics local to each index (term frequency, document frequency). When shards hold very different document distributions, the same query can produce scores on different shards that are not directly comparable, so merging results by raw score yields incorrect rankings.
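To make the effect concrete, here is a minimal sketch of how a per-shard document frequency changes the IDF weight of a single term. It assumes the Lucene-style BM25 IDF formula, which may differ from the formula used in the actual simulation, and the document counts are hypothetical values chosen for illustration.

```python
import math


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical counts for one common term (not taken from the benchmark data):
# it appears in half of all documents, but in only 1 of 10 on a tiny shard.
global_idf = bm25_idf(num_docs=100_000, doc_freq=50_000)
large_shard_idf = bm25_idf(num_docs=93_000, doc_freq=46_500)
small_shard_idf = bm25_idf(num_docs=10, doc_freq=1)

print(f"global IDF:      {global_idf:.3f}")
print(f"large-shard IDF: {large_shard_idf:.3f}")
print(f"small-shard IDF: {small_shard_idf:.3f}")
print(f"inflation on the small shard: {small_shard_idf / global_idf:.2f}x")
```

Under these assumptions the large shard sees the same IDF as the global index, while the small shard treats the term as rare and weights it several times higher.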

Experiment Design

  1. Ground truth: Single Meilisearch index with all documents
  2. Distributed setup: Same documents sharded across N nodes with intentional skew
  3. Measurement: Kendall tau (τ) between merged distributed results and ground truth
  4. Pass criterion: mean τ ≥ 0.95 across 10,000 random queries
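The measurement step can be sketched as follows. This is an illustrative implementation of Kendall tau-a, not the comparison tool shipped with the benchmark. It assumes document ids are unique within each ranking and that only documents present in both result lists are compared. The function name and the sample ids are hypothetical.

```python
from itertools import combinations


def kendall_tau(truth: list[str], merged: list[str]) -> float:
    """Kendall tau-a over the document ids present in both rankings."""
    merged_pos = {doc_id: i for i, doc_id in enumerate(merged)}
    # Keep only shared documents, in ground-truth order.
    shared = [doc_id for doc_id in truth if doc_id in merged_pos]
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared documents")

    concordant = discordant = 0
    for a, b in combinations(shared, 2):
        # a precedes b in the ground truth; check whether the merged list agrees.
        if merged_pos[a] < merged_pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


ground_truth = ["d1", "d2", "d3", "d4"]
distributed = ["d1", "d3", "d2", "d4"]  # one adjacent pair swapped
tau = kendall_tau(ground_truth, distributed)
print(f"tau = {tau:.3f}, passes 0.95 threshold: {tau >= 0.95}")
```

A single swapped pair in a four-document list is already enough to fall below the 0.95 threshold, which shows how strict the pass criterion is for short result lists.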

Corpus Structure

  • 100,000 documents total
  • 10 shards with intentional size skew (shard 0 = normal size, shard 1 = 100× normal, shard 9 = 0.01× normal)
  • Documents have: id, title, content (synthetic text), category (for filtering)
  • 50 unique terms distributed across documents with varying frequencies
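As a worked example of what this skew means in document counts, the sketch below splits the 100,000 documents by relative shard weight. The weights for shards 0, 1 and 9 come from the list above. Shards 2 through 8 are not specified there and are assumed to be normal size, so the corpus generator may allocate documents differently.

```python
TOTAL_DOCS = 100_000

# Relative shard weights. Shards 0, 1 and 9 follow the corpus description;
# shards 2-8 are ASSUMED to be "normal" (weight 1.0).
WEIGHTS = [1.0, 100.0] + [1.0] * 7 + [0.01]


def shard_sizes(total: int, weights: list[float]) -> list[int]:
    """Split `total` documents across shards in proportion to `weights`."""
    unit = total / sum(weights)
    sizes = [round(w * unit) for w in weights]
    # Absorb any rounding remainder in the largest shard so sizes sum to total.
    sizes[sizes.index(max(sizes))] += total - sum(sizes)
    return sizes


sizes = shard_sizes(TOTAL_DOCS, WEIGHTS)
for shard_id, size in enumerate(sizes):
    print(f"shard {shard_id}: {size} documents")
```

Under this assumption the largest shard holds roughly 93K documents and the smallest on the order of 10, which is in line with the shard sizes mentioned in the commit message.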

Directory Layout

  • corpus/: Test document sets (JSONL)
  • queries/: Generated query sets for experiments
  • results/: Experimental results and analysis

Running Experiments

See the individual experiment scripts under results/.