Pairwise LLM Judge

Creator
Seonglae Cho
Created
2025 Jun 7 15:55
Edited
2025 Jun 17 11:0

Pairwise comparison by an LLM judge can be non-transitive, which leads to unstable evaluation rankings. That is, even when A > B and B > C, A > C does not necessarily follow, so rankings vary depending on the choice of baseline model. In other words, the Syllogism does not hold, and rock-paper-scissors-like preference loops can occur.
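To make the preference loop concrete, here is a minimal sketch that scans a set of pairwise verdicts for three-model cycles. The function name `find_preference_cycles` and the `(winner, loser)` encoding are assumptions made for illustration, not part of any cited method.

```python
from itertools import permutations


def find_preference_cycles(prefers):
    """Return every 3-cycle (a, b, c) where a > b, b > c, and c > a.

    `prefers` is a set of (winner, loser) pairs, one per judged pair.
    """
    models = sorted({m for pair in prefers for m in pair})
    cycles = []
    for a, b, c in permutations(models, 3):
        # Report each cycle once, starting from its smallest member.
        if a < b and a < c and {(a, b), (b, c), (c, a)} <= prefers:
            cycles.append((a, b, c))
    return cycles


# A rock-paper-scissors pattern: A beats B, B beats C, C beats A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# A transitive pattern: A beats B, B beats C, A beats C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

print(find_preference_cycles(cyclic))
print(find_preference_cycles(transitive))
```

With a cycle like this, whichever model is picked as the fixed baseline changes which of the other two looks stronger, which is why baseline-relative rankings become unstable.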
To address this, a round-robin tournament that compares every pair of models is combined with the Bradley-Terry model to obtain more consistent rankings. Because a full round-robin is computationally expensive, Swiss-Wise Iterative Matchmaking (SWIM) is proposed to reduce the number of comparisons. The proposed method improved correlation with the human-evaluation benchmark Chatbot Arena from 95.0% to 96.4% (Spearman) and from 82.1% to 86.3% (Kendall).
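As a rough illustration of the Bradley-Terry step, the sketch below fits one strength score per model from a round-robin win matrix using the standard minorization-maximization update. It is a generic textbook fit under stated assumptions (no ties, at least one recorded comparison), not the implementation behind the numbers above; `fit_bradley_terry` and the win counts are made up for the example.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is the number of times model i beat model j.
    Under the model, P(i beats j) = p_i / (p_i + p_j).
    Assumes no ties and at least one recorded comparison.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Sum games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]  # normalize to sum to 1
    return strengths


# Hypothetical round-robin, 10 games per pair, with a mild preference loop:
# model 0 beats 1 (7-3), model 1 beats 2 (7-3), model 2 beats 0 (6-4).
wins = [
    [0, 7, 4],
    [3, 0, 7],
    [6, 3, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
print(strengths, ranking)
```

Because the fit pools every pairwise result into a single score per model, it still yields one global ordering even when individual comparisons form a loop, and that ordering does not depend on a chosen baseline.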
 
 
