Pairwise comparison is non-transitive, which can lead to unstable evaluation rankings: even when A is preferred to B and B is preferred to C, A is not necessarily preferred to C. As a result, rankings can vary depending on which model is chosen as the baseline. In other words, transitivity does not hold, and rock-paper-scissors-like preference loops can occur.
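To make this concrete, here is a minimal sketch with made-up head-to-head win rates that form a preference loop. The numbers are illustrative assumptions, not results from the paper; the point is that each choice of baseline produces a different ranking of the remaining models.

```python
# Hypothetical head-to-head win rates forming a preference loop:
# A beats B, B beats C, yet C beats A (rock-paper-scissors).
beats = {("A", "B"): 0.60, ("B", "C"): 0.60, ("C", "A"): 0.60}

def win_rate(x, y):
    """Win rate of model x against model y."""
    if (x, y) in beats:
        return beats[(x, y)]
    return 1.0 - beats[(y, x)]

models = ["A", "B", "C"]
for baseline in models:
    others = [m for m in models if m != baseline]
    ranking = sorted(others, key=lambda m: -win_rate(m, baseline))
    print(f"baseline={baseline}: {ranking}")
# baseline=A: ['C', 'B']
# baseline=B: ['A', 'C']
# baseline=C: ['B', 'A']
```

Each baseline crowns a different winner, which is exactly the instability that baseline-anchored pairwise evaluation suffers from.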
To address this, the authors combine a round-robin tournament over all models with the Bradley-Terry model to obtain more consistent rankings, and propose Swiss-Wise Iterative Matchmaking (SWIM) to reduce the computational cost of the full round robin. The proposed method improved correlation with the human-evaluation benchmark Chatbot Arena from 95.0% to 96.4% (Spearman) and from 82.1% to 86.3% (Kendall).
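For intuition, below is a minimal sketch of how Bradley-Terry strengths can be fit from a round-robin win-count matrix using the classic MM (Zermelo) iteration, followed by a generic Swiss-style pairing step. The win counts are made up, and `swiss_pair` is a plain adjacent-pairing heuristic for illustration only; the paper's actual SWIM matchmaking may differ in its details.

```python
import numpy as np

def bradley_terry(wins, iters=200, tol=1e-8):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j.
    Returns a strength vector p (sums to 1); higher means stronger.
    Uses the classic MM (Zermelo) fixed-point iteration.
    """
    n = wins.shape[0]
    games = wins + wins.T          # total comparisons per pair
    total_wins = wins.sum(axis=1)  # W_i, assumed > 0 for every model
    p = np.ones(n) / n
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p

def swiss_pair(strengths, played):
    """Pair adjacent models in the current strength ranking, skipping
    pairs already matched (a generic Swiss-tournament step, not the
    paper's exact SWIM procedure)."""
    order = sorted(range(len(strengths)), key=lambda i: -strengths[i])
    pairs, used = [], set()
    for idx, i in enumerate(order):
        if i in used:
            continue
        for j in order[idx + 1:]:
            if j not in used and (i, j) not in played and (j, i) not in played:
                pairs.append((i, j))
                used.update((i, j))
                break
    return pairs

# Illustrative round-robin win counts for three models with a cycle:
wins = np.array([
    [0, 6, 4],
    [4, 0, 6],
    [6, 4, 0],
])
print(bradley_terry(wins))  # cyclic data -> near-uniform strengths
```

Note how the Bradley-Terry fit collapses the cyclic preferences into a single consistent strength scale (here, a near-tie), while the Swiss-style pairing concentrates future comparisons on closely ranked models so the full quadratic round robin is not needed.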