Grounded Rao-Kupper Leaderboards for Music Arena

FARS·2026-03-02·Run ID: FA0393

Abstract

Arena-style evaluation via pairwise comparisons is the gold standard for generative AI, but current methods discard valuable information when users vote ``both outputs are bad.'' Bradley-Terry cannot model this outcome; alternatives that add separate badness parameters decouple acceptability from skill. We propose Grounded Rao-Kupper (GRK), which treats BOTH\_BAD as an outside option anchored to a fictitious competitor with score 0. This structural coupling ensures that BOTH\_BAD probability increases when both systems have low quality, converting an ignored UI artifact into a signal about absolute acceptability. On Music Arena (3,274 battles, 12 text-to-music systems), GRK achieves 7.9% lower 4-way negative log-likelihood and 12.4% lower BOTH\_BAD Brier score than a decoupled baseline, with bootstrap 95% confidence intervals excluding zero. GRK's implied acceptability correlates with empirical BOTH\_BAD rates (r=0.60, p=0.041), enabling quality-aware leaderboards that report absolute acceptability alongside relative rankings.
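The coupling the abstract describes can be illustrated with a small sketch. This is not the paper's exact likelihood; it assumes a standard Rao-Kupper-style parameterization with strengths pi_i = exp(s_i), an illustrative tie parameter nu, and the BOTH\_BAD outside option given the strength of a fictitious competitor anchored at score 0 (i.e., exp(0) = 1). The function name and symbols are hypothetical.

```python
import numpy as np

def grk_probs(s_a, s_b, nu=0.5):
    """Sketch of 4-way GRK outcome probabilities for one battle.

    Returns (P(A wins), P(B wins), P(tie), P(BOTH_BAD)). Assumes
    exponential strengths, a Rao-Kupper-style geometric-mean tie term,
    and an outside option anchored at score 0 -- an illustration of the
    structural coupling, not the paper's exact parameterization.
    """
    pi_a, pi_b = np.exp(s_a), np.exp(s_b)
    tie = nu * np.sqrt(pi_a * pi_b)   # tie mass grows with both strengths
    bad = np.exp(0.0)                 # fictitious competitor at score 0
    z = pi_a + pi_b + tie + bad
    return pi_a / z, pi_b / z, tie / z, bad / z

# Two strong systems leave little mass on BOTH_BAD; two weak systems
# (scores well below 0) push most of the mass onto it.
strong = grk_probs(2.0, 1.5)
weak = grk_probs(-2.0, -2.5)
```

Because the outside option's strength is fixed at exp(0), the BOTH\_BAD probability is large exactly when both pi values are small, which is the anchoring property the abstract attributes to GRK.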

Resources