LLM Leaderboard

Creator

Seonglae Cho

Created

2023 Aug 29 8:38

Editor

Seonglae Cho

Edited

2025 Dec 12 17:7

Refs

Transformer Model

Meta Learning

Illusion

Major companies such as Meta, Google, and Amazon privately test multiple model versions on Chatbot Arena and publish only their highest scores. This selective reporting violates the fair sampling assumption of the Bradley-Terry model behind the rankings. Differences in API access, sampling rates, and model maintenance policies create further unfair advantages. Scores can also be improved by fine-tuning on Arena data, and official model withdrawals do not line up with vote-based eliminations.
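
The selection effect described above can be illustrated with a small simulation. This is a minimal sketch, not the method used in the paper: it assumes every private variant has the same true Bradley-Terry log-strength, that each variant is rated from 200 simulated battles against a single fixed opponent, and that the provider publishes only the maximum estimate. All function names and parameter values here are hypothetical.

```python
import math
import random


def bt_win_prob(strength_a, strength_b):
    """Bradley-Terry probability that A beats B, given log-strengths."""
    return 1.0 / (1.0 + math.exp(strength_b - strength_a))


def estimated_strength(true_strength, n_battles, rng, opponent=0.0):
    """Estimate a log-strength from simulated battles against one opponent."""
    p = bt_win_prob(true_strength, opponent)
    wins = sum(rng.random() < p for _ in range(n_battles))
    # Light smoothing (an assumption) avoids log(0) on a clean sweep.
    rate = (wins + 0.5) / (n_battles + 1.0)
    # Inverting the two-player Bradley-Terry formula gives the logit.
    return opponent + math.log(rate / (1.0 - rate))


def published_score(n_variants, true_strength, n_battles, rng):
    """Test n_variants privately and report only the best estimate."""
    return max(
        estimated_strength(true_strength, n_battles, rng)
        for _ in range(n_variants)
    )


def mean_published(n_variants, trials=500, true_strength=0.0,
                   n_battles=200, seed=0):
    """Average published score over many repetitions of the experiment."""
    rng = random.Random(seed)
    total = sum(
        published_score(n_variants, true_strength, n_battles, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(f"variants={n:2d}  mean published score={mean_published(n):+.3f}")
```

Under these assumptions the true strength is 0 for every variant, so any positive average for the published score comes purely from picking the best of several noisy estimates, not from a better model.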

The Leaderboard Illusion

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has...

https://arxiv.org/abs/2504.20879

LLM Leaderboard

Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Considerations for model evaluation

https://huggingface.co/docs/evaluate/considerations

LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis

https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard

IQ scores (which feel accurate) show o1-preview far ahead of o1

Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithms behind the technology revolution.

https://trackingai.org/home

O1 is less powerful than O1-preview due to the less time it spends on thinking (compute time)

389 votes, 135 comments. I have asked the same coding question to both models 10 times and checked whether the code each model produced compiled…

https://www.reddit.com/r/OpenAI/comments/1h7qtaf/o1_is_less_powerful_than_o1preview_due_to_the/

Leaderboard

https://chat.lmsys.org/?leaderboard

Per model layer analysis

Exploring the various Transformer models on Hugging Face

DEVOCEAN tech blog: a developer community and a platform for internal and external communication and growth

https://devocean.sk.com/blog/techBoardDetail.do?ID=165670&boardType=techBlog&searchData=&page=&subIndex=최신+기술+블로그

Korean Leaderboard

Open Ko-LLM Leaderboard - a Hugging Face Space by upstage

https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard

Can they break the 'insurmountable wall' of GPT-4? Korean language models that hit world No. 1 four times

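Korean artificial intelligence (AI) companies have taken first place one after another on Hugging Face's Open LLM Leaderboard, often called "the college entrance exam for large language models (LLMs)". This is seen as evidence that Korean companies have technology comparable to overseas Big Tech firms. Attention is also on whether they can reach the level of OpenAI's GPT-4, currently the best-performing model. According to the Hugging Face Open LLM Leaderboard as of the 24th...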

https://v.daum.net/v/20240125070054408

OpenRouter accounts for only about 1% of API usage, but its rankings roughly approximate market share

LLM Rankings | OpenRouter

Language models ranked and analyzed by usage across apps

https://openrouter.ai/rankings
