jedarden/miroir


Commit graph


Author

SHA1

Message

Date

jedarden

9ce1b36206

P12.OP4: Add confidence intervals to score comparability benchmark

Research doc updated with precise 95% CIs per query type. compare.py
now computes and reports confidence intervals. Kendall τ = 0.79
(95% CI [0.7873, 0.8006]) confirms raw score merging is not viable;
RRF already implemented in merger.rs as mitigation. Follow-up bead
created (miroir-zfo) for RRF quality validation.
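
The commit does not say how compare.py derives its interval. One common approach, shown here only as a sketch, is a normal-approximation 95% CI on the mean of per-query τ values. The function name `mean_ci95` and the sample values are hypothetical and are not taken from the benchmark.

```python
import math
import statistics

def mean_ci95(values):
    """Return (mean, low, high) using a normal approximation (assumed method)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width

# Hypothetical per-query tau values, for illustration only.
taus = [0.9, 0.8, 0.7, 0.85, 0.75]
mean, low, high = mean_ci95(taus)
print(f"tau = {mean:.2f} (95% CI [{low:.4f}, {high:.4f}])")
```

With thousands of queries the standard error shrinks with the square root of the sample size, which would be consistent with an interval as narrow as the one reported, although the actual method may differ.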

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-19 00:07:42 -04:00

jedarden

72f9a197b5

P12.OP4: Score normalization at scale — research & benchmark infrastructure

Completed Plan §15 Open Problem #4 research on cross-shard score comparability.

## Key Finding
Average Kendall tau: 0.79 vs. 0.95 threshold — FAIL

Cross-shard score comparability is a significant issue:
- Common-term queries: τ = 0.15 (catastrophic)
- Local IDF statistics cause score inflation on small shards
- Documents from 10-doc shards outrank 93K-doc shard results
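
To illustrate the inflation effect described above, here is a minimal sketch using the Lucene-style BM25 IDF formula. The commit does not state which IDF variant the simulation uses, so the formula choice, the helper name `bm25_idf`, and the document frequencies below are all assumptions.

```python
import math

def bm25_idf(num_docs, doc_freq):
    """Lucene-style BM25 IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical counts: a term found in half of a 93,000-document shard
# and, by chance, in only one document of a 10-document shard.
large_shard_idf = bm25_idf(93_000, 46_500)
small_shard_idf = bm25_idf(10, 1)
# IDF computed over both shards together (global statistics).
global_idf = bm25_idf(93_010, 46_501)

print(f"large shard: {large_shard_idf:.3f}")
print(f"small shard: {small_shard_idf:.3f}")
print(f"global:      {global_idf:.3f}")
```

Under these assumed counts the small shard assigns the term well over twice the IDF weight that the large shard or the global statistics do, so a raw-score merge would favor the small shard's documents.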

## Recommendation
Implement Reciprocal Rank Fusion (RRF) for result merging.
Follow-up bead: miroir-nsu
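
As a rough illustration of the recommendation, the sketch below fuses per-shard rankings with RRF, scoring each document as the sum of 1 / (k + rank) over the lists it appears in. This is not the code in merger.rs: the function name `rrf_merge`, the tie-breaking rule, and k = 60 (the conventional default) are assumptions.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs with score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical per-shard result lists, best hit first.
shard_a = ["a1", "a2", "a3"]
shard_b = ["b1", "b2"]
merged = rrf_merge([shard_a, shard_b])
print(merged)
```

Because each document lives on exactly one shard, the fused score depends only on its rank within that shard, so the merge ignores raw scores entirely and no longer inherits shard-local IDF inflation.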

## Artifacts Added
- Benchmark infrastructure: tests/benches/score-comparability/
  - Corpus generator with extreme shard skew (100× variance)
  - Query generator (10K random queries across 5 types)
  - BM25-based simulation with global vs local IDF
  - Kendall tau comparison tool
  - Full experimental results (τ = 0.79 ± 0.01, 95% CI)
- Research writeup: docs/research/score-normalization-at-scale.md
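
For readers unfamiliar with the metric, a minimal pure-Python sketch of Kendall τ between two orderings of the same documents follows. It computes the tau-a variant and assumes no ties; the comparison tool in the repository may use a different variant or a library routine, and the document IDs are hypothetical.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall tau-a between two orderings of the same items (no ties)."""
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    concordant = discordant = 0
    # ranking_a lists x before y, so a pair is concordant when b agrees.
    for x, y in combinations(ranking_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical rankings: global-IDF order vs. an order merged from local scores.
global_order = ["d1", "d2", "d3", "d4"]
merged_order = ["d1", "d3", "d2", "d4"]
tau = kendall_tau(global_order, merged_order)
print(f"tau = {tau:.3f}")
```

A value of 1.0 means the two orderings agree on every pair, and -1.0 means they are exact reverses; the single swapped pair here yields (5 - 1) / 6.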

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-18 23:58:08 -04:00

2 commits