Illusion
Major companies such as Meta, Google, and Amazon privately test multiple model variants on Chatbot Arena and publish only their highest score. This selective disclosure violates the fair sampling assumption of the Bradley-Terry model used to rank models. Providers also gain unequal advantages from differences in API calls, sampling rates, and model maintenance policies. Scores can be raised further by fine-tuning on Arena data, and there are discrepancies between officially announced withdrawals and vote-based eliminations.
The Leaderboard Illusion
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has...
https://arxiv.org/abs/2504.20879
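
The selective-publication effect can be illustrated with a small simulation. This is only a sketch under assumed numbers, not the paper's method: every private variant has the same true rating (1200), plays 200 battles against an equally rated opponent, and its rating is estimated by inverting the Bradley-Terry win probability on an Elo-style scale. The function names and all parameters are hypothetical.

```python
import math
import random


def bt_win_prob(rating_a, rating_b):
    """Bradley-Terry win probability of A over B on an Elo-style scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def estimate_rating(wins, battles, opponent_rating):
    """Invert the Bradley-Terry formula from an observed win rate."""
    # Clamp the win rate to avoid log(0) if a variant wins or loses every battle.
    p = min(max(wins / battles, 1e-3), 1 - 1e-3)
    return opponent_rating + 400.0 * math.log10(p / (1.0 - p))


def published_score(n_variants, battles, true_rating, opponent_rating, rng):
    """Privately test n_variants identical models and publish only the best estimate."""
    p = bt_win_prob(true_rating, opponent_rating)
    estimates = []
    for _ in range(n_variants):
        wins = sum(rng.random() < p for _ in range(battles))
        estimates.append(estimate_rating(wins, battles, opponent_rating))
    return max(estimates)


def mean_published(n_variants, trials=300, battles=200, seed=0):
    """Average published rating over many repetitions (assumed true rating: 1200)."""
    rng = random.Random(seed)
    scores = [
        published_score(n_variants, battles, 1200.0, 1200.0, rng)
        for _ in range(trials)
    ]
    return sum(scores) / trials


if __name__ == "__main__":
    # All variants are equally strong; only the number of private tests changes.
    for n in (1, 5, 20):
        print(n, round(mean_published(n), 1))
```

With a single submission the average published rating stays near the true value, while publishing the best of several private variants pushes it upward even though no variant is actually stronger. The size of the gap depends entirely on the assumed number of battles and variants.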

LLM Leaderboard
Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
Discover amazing ML apps made by the community
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Considerations for model evaluation
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://huggingface.co/docs/evaluate/considerations
LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis
https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard
IQ test (it feels accurate that o1-preview is far better than o1)
Tracking AI
Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithms behind the technology revolution.
https://trackingai.org/home
O1 is less powerful than O1-preview due to the less time it spends on thinking (compute time)
389 votes, 135 comments. I have asked the same coding question to both models 10 times and checked whether the code each model produced compiled…
https://www.reddit.com/r/OpenAI/comments/1h7qtaf/o1_is_less_powerful_than_o1preview_due_to_the/
Per model layer analysis
Exploring Various Transformer Models on Hugging Face
DEVOCEAN tech blog, a developer community and a platform for internal and external communication and growth
https://devocean.sk.com/blog/techBoardDetail.do?ID=165670&boardType=techBlog&searchData=&page=&subIndex=최신+기술+블로그

Korean Leaderboard
Open Ko-LLM Leaderboard - a Hugging Face Space by upstage
https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard
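Will they break the 'impassable barrier' of GPT-4? Korean language models that took the world No. 1 spot four times
Korean artificial intelligence (AI) companies have taken first place one after another on Hugging Face's 'Open LLM Leaderboard', known as the 'college entrance exam for large language models (LLMs)'. This is seen as a sign that Korean companies have technology comparable to overseas big tech (large information technology companies). There is also interest in whether they can reach the level of OpenAI's GPT-4, currently the best model. As of the 24th, the Hugging Face Open LLM Leaderboard shows...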
https://v.daum.net/v/20240125070054408
OpenRouter accounts for 1% of API usage, but its usage data roughly reflects overall market share

Seonglae Cho