LLM Evaluation

Language Model Evaluation

LLM Benchmarks

LiveBench

AlpacaEval

AI2 Reasoning Challenge

GPQA

monotonicity

low variance

https://www.youtube.com/watch?v=2-SPH9hIKT8
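
The notes above only name the two properties, monotonicity and low variance. One possible reading, assumed here rather than taken from the video, is that a useful benchmark's score should mostly rise as training progresses and should differ little between runs with different seeds. A minimal sketch under that assumption follows; the function names and all numbers are hypothetical, made up for illustration.

```python
from statistics import pstdev


def monotonicity(scores):
    """Fraction of consecutive checkpoint pairs where the score does not decrease."""
    pairs = list(zip(scores, scores[1:]))
    return sum(b >= a for a, b in pairs) / len(pairs)


def seed_spread(scores_by_seed):
    """Population standard deviation of final scores across seeds."""
    return pstdev(scores_by_seed)


# Made-up benchmark scores at successive training checkpoints.
checkpoint_scores = [0.25, 0.31, 0.30, 0.38, 0.44]
# Made-up final scores for the same setup trained with three seeds.
seed_scores = [0.44, 0.45, 0.43]

mono = monotonicity(checkpoint_scores)
spread = seed_spread(seed_scores)
print(f"monotonicity={mono:.2f}, seed spread={spread:.4f}")
```

Under this reading, a benchmark with monotonicity close to 1 and a seed spread that is small relative to the gaps between models gives a cleaner signal than one that fluctuates between checkpoints.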

LLM Evaluation Methods

Shepherd

LLM Evaluation Tools

Benchmark scores are unreliable; check arena results or a trustworthy third party instead.

Jim Fan on Twitter / X

It is *incredibly* easy to game the LLM benchmarks. Training on test set is for the rookies. Here're some tricks to practice magic at home:

1. Train on paraphrased examples of the test set. "LLM-decontaminator" paper from LMSys found that you can beat GPT-4 with a 13B model (!!)… pic.twitter.com/iMKHBJH4eG

Jim Fan (@DrJimFan), September 9, 2024

https://x.com/DrJimFan/status/1833160432833716715
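
The first trick in the tweet works because simple contamination checks tend to look for verbatim overlap. The sketch below is an illustration only, not the method from the LLM-decontaminator paper: it uses a hypothetical word-level n-gram overlap check and made-up example strings to show a verbatim copy being flagged while a paraphrase of the same question scores zero.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(train_text, test_text, n=3):
    """Fraction of the test item's n-grams that also occur in the training text."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)


# Made-up examples: one test question, a verbatim copy, and a paraphrase.
test_item = "What is the capital city of France and on which river does it lie"
verbatim = "What is the capital city of France and on which river does it lie"
paraphrase = "Name France's capital and the river that runs through that town"

verbatim_score = ngram_overlap(verbatim, test_item)
paraphrase_score = ngram_overlap(paraphrase, test_item)
print(f"verbatim={verbatim_score:.2f}, paraphrase={paraphrase_score:.2f}")
```

A filter that thresholds this score would drop the verbatim copy and keep the paraphrase, even though both leak the same test question into training data.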

LLM Leaderboard

Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Considerations for model evaluation

https://huggingface.co/docs/evaluate/considerations

LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis

https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard

arxiv.org

https://arxiv.org/pdf/2309.16609.pdf

Evaluating LLMs is complex, so more comprehensive and purpose-specific evaluation methods are needed to assess their capabilities for various real-world applications.

Evaluations are all we need

On analysing talent in LLMs

https://www.strangeloopcanon.com/p/evaluations-are-all-we-need

Types

Alex Strick van Linschoten - How to think about creating a dataset for LLM finetuning evaluation

I summarise the kinds of evaluations that are needed for a structured data generation task.

https://mlops.systems/posts/2024-06-25-evaluation-finetuning-manual-dataset.html

Recommendations