LLM Leaderboard

Created
2023 Aug 29 8:38
Creator
Seonglae Cho
Edited
2025 Jul 15 9:27
 
 

Illusion

Major providers such as Meta, Google, and Amazon privately test multiple model variants on the Arena and publish only the highest score. This selective reporting violates the fair sampling assumption of the Bradley-Terry model used to rank models. Providers also gain unequal advantages from differences in API access, sampling rates (how often a model appears in battles), and model maintenance policies. Scores can be raised further by fine-tuning on Arena data, and officially announced model withdrawals do not match the models that are effectively eliminated through voting.
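To make the selection effect concrete, here is a minimal simulation sketch. It assumes a two-player Bradley-Terry setup in which every private variant has exactly the same true strength as a reference model, so any gap in the reported score comes only from picking the luckiest variant. The function names and the numbers (10 variants, 200 battles, 1000 trials) are illustrative assumptions, not values from the Arena or from any published analysis.

```python
import math
import random


def bt_win_prob(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry probability that A beats B (strengths on a log scale)."""
    return 1.0 / (1.0 + math.exp(strength_b - strength_a))


def estimated_strength(true_strength: float, n_battles: int, rng: random.Random) -> float:
    """Play n_battles against a reference model of strength 0.

    For two players, the Bradley-Terry maximum-likelihood estimate of the
    strength gap is the logit of the observed win rate.
    """
    p = bt_win_prob(true_strength, 0.0)
    wins = sum(rng.random() < p for _ in range(n_battles))
    # Clamp so the logit stays finite even after an all-win or all-loss run.
    wins = min(max(wins, 1), n_battles - 1)
    rate = wins / n_battles
    return math.log(rate / (1.0 - rate))


def mean_reported_strength(n_variants: int, n_battles: int = 200,
                           trials: int = 1000, seed: int = 0) -> float:
    """Average published score when only the best of n_variants is reported.

    Every variant has true strength 0 (hypothetical setup), so an unbiased
    report would average close to 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(estimated_strength(0.0, n_battles, rng)
                     for _ in range(n_variants))
    return total / trials


single = mean_reported_strength(n_variants=1)
best_of_10 = mean_reported_strength(n_variants=10)
print(f"one variant, always reported: {single:+.3f}")
print(f"best of 10 private variants:  {best_of_10:+.3f}")
```

Under these assumptions the single-variant average stays near zero while the best-of-10 average is clearly positive, even though no variant is actually stronger than the reference. The size of the gap depends on the number of variants and battles chosen here, so it should be read as a qualitative illustration only.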

LLM Leaderboard

 

Leaderboard

Per model layer analysis

Korean Leaderboard

 
 
 
