LLM Leaderboard

Creator: Seonglae Cho
Created: 2023 Aug 29 8:38
Edited: 2025 Dec 12 17:07
 
 

Illusion

Major companies such as Meta, Google, and Amazon privately test many model variants on Chatbot Arena and publish only the best-scoring one. This selective disclosure violates the unbiased-sampling assumption of the Bradley-Terry model used to compute the rankings. Providers also gain unfair advantages from differences in API access, sampling rates (how often each model is shown to voters), and model deprecation policies. Scores can be inflated by fine-tuning on Arena data, and the models that are officially deprecated do not match the models that are actually removed from battles.
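The selective-disclosure effect can be illustrated with a small simulation. This is a minimal sketch, not Chatbot Arena's actual pipeline: it assumes every private variant has the same true Bradley-Terry rating, each variant is rated only from a fixed number of battles against a single reference model rated 0, and only the maximum estimate is published. The Elo-style 400-point scale, the 200-battle budget, and all function names are illustrative assumptions.

```python
import math
import random

# Assumed Elo-style display scale: a rating gap d gives
# P(win) = 1 / (1 + 10 ** (-d / 400)), i.e. a logistic with this scale.
SCALE = 400 / math.log(10)


def win_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that model A beats model B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b) / SCALE))


def estimate_rating(true_rating: float, n_battles: int, rng: random.Random) -> float:
    """Estimate a rating from battles against one reference model rated 0.

    With a single opponent, the Bradley-Terry maximum-likelihood estimate is
    the logit of the observed win rate, rescaled to the Elo-style scale.
    Add-0.5 smoothing keeps the logit finite when a variant wins 0 or all games.
    """
    p = win_prob(true_rating, 0.0)
    wins = sum(rng.random() < p for _ in range(n_battles))
    rate = (wins + 0.5) / (n_battles + 1.0)
    return SCALE * math.log(rate / (1.0 - rate))


def published_score(n_variants: int, true_rating: float, n_battles: int,
                    rng: random.Random) -> float:
    """Test n_variants equally strong variants privately; report only the best."""
    return max(estimate_rating(true_rating, n_battles, rng) for _ in range(n_variants))


def mean_published(n_variants: int, trials: int = 1000, true_rating: float = 0.0,
                   n_battles: int = 200, seed: int = 0) -> float:
    """Average published score over many repetitions of the private-testing game."""
    rng = random.Random(seed)
    total = sum(published_score(n_variants, true_rating, n_battles, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:2d} private variants -> mean published rating {mean_published(n):6.1f}")
```

Under these assumptions the mean published rating stays near 0 for a single variant but climbs by tens of points as more variants are tested privately, even though none of them is actually stronger. This upward bias is what reporting only the maximum does to an estimator that assumes models were sampled fairly.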
The Leaderboard Illusion
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has...

LLM Leaderboard

Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
Considerations for model evaluation
LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis
 

Leaderboard

chat.lmsys.org
Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4

Per model layer analysis

Exploring Various Transformer Models on Hugging Face
DEVOCEAN tech blog: a developer community and a platform for internal and external communication and growth

Korean Leaderboard

Open Ko-LLM Leaderboard - a Hugging Face Space by upstage
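Will the 'Magic Barrier' of GPT-4 Fall? The K-Language Models That Hit World No. 1 Four Times
Korean artificial intelligence (AI) companies have taken first place one after another on Hugging Face's 'Open LLM Leaderboard', often called the 'college entrance exam (Suneung) for large language models (LLMs)'. This is seen as evidence that Korean companies have technology comparable to overseas Big Tech (large information technology companies). There is also interest in whether they can reach the level of OpenAI's GPT-4, currently the best-performing model. Looking at the Hugging Face Open LLM Leaderboard as of the 24th…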
OpenRouter accounts for only about 1% of total API usage, but its rankings give a rough picture of market share.
LLM Rankings | OpenRouter
Language models ranked and analyzed by usage across apps
 
