Illusion
Major companies such as Meta, Google, and Amazon privately test multiple model versions and publish only the highest score. This selective reporting violates the fair sampling assumption of the Bradley-Terry model. Some providers also gain unfair advantages from differences in API calls, sampling rates, and model maintenance policies. Scores can be further improved by fine-tuning on Arena data, and there are discrepancies between official withdrawals and vote-based eliminations.
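
The selection effect behind "publish only the highest score" can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the procedure used by any leaderboard: every private variant is assumed to have the same true strength (a 50% win probability against a fixed reference opponent), ratings use the common Elo-style scaling of the Bradley-Terry model (base 10, scale 400), and all function names and parameters (`estimated_rating_gap`, `published_score`, `mean_published_score`, `n_battles`, `n_variants`) are hypothetical.

```python
import math
import random


def estimated_rating_gap(true_win_prob, n_battles, rng):
    """Estimate the rating gap to a fixed reference opponent from n_battles votes."""
    wins = sum(rng.random() < true_win_prob for _ in range(n_battles))
    # Clamp the observed win rate so the log below is always defined.
    lo = 1 / (2 * n_battles)
    win_rate = min(max(wins / n_battles, lo), 1 - lo)
    # Bradley-Terry with Elo scaling: P(win) = 1 / (1 + 10 ** (-gap / 400)),
    # solved for gap given the observed win rate.
    return 400 * math.log10(win_rate / (1 - win_rate))


def published_score(n_variants, true_win_prob, n_battles, rng):
    """Test n_variants privately and report only the best estimated gap."""
    return max(
        estimated_rating_gap(true_win_prob, n_battles, rng)
        for _ in range(n_variants)
    )


def mean_published_score(n_variants, trials=500, true_win_prob=0.5,
                         n_battles=200, seed=0):
    """Average published gap over many repetitions of the whole process."""
    rng = random.Random(seed)
    total = sum(
        published_score(n_variants, true_win_prob, n_battles, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # All variants are equally strong, so any growth with n_variants is pure
    # selection bias from reporting the maximum of noisy estimates.
    for k in (1, 5, 10):
        print(f"variants tested: {k:2d}  mean published gap: "
              f"{mean_published_score(k):.1f}")
```

Because every variant has identical true strength, the average published gap should stay near zero when one variant is tested and rise as more variants are tested privately. The increase comes entirely from taking the maximum of noisy estimates, which is why selective reporting conflicts with the assumption that each rated model is sampled fairly.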
LLM Leaderboard
Leaderboard
Per model layer analysis
Korean Leaderboard
OpenRouter accounts for 1% of API usage, but its usage data approximately reflects market share

Seonglae Cho
