AI Benchmark
Good benchmark principles
- Make sure you can detect a 1% improvement (see the sample-size sketch after this list)
- Easy to understand the result
- Hard enough that SOTA models cannot saturate it
- Use a standard metric and keep it comparable over time (avoid frequent updates)
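A quick way to check the 1% requirement is a back-of-the-envelope power calculation for a two-proportion test. A minimal sketch; the 70% baseline accuracy, 5% significance level, and 80% power are assumed for illustration:

```python
# Back-of-the-envelope power calculation: roughly how many eval questions are
# needed to detect a 1 percentage-point accuracy gain between two models with
# a two-proportion z-test. Baseline accuracy, alpha, and power are assumptions.
import math

def questions_needed(delta: float, baseline_acc: float,
                     z_alpha: float = 1.96,   # two-sided alpha = 0.05
                     z_power: float = 0.84    # power = 0.80
                     ) -> int:
    """Approximate number of questions per model to detect a gain of `delta`."""
    p_bar = baseline_acc + delta / 2           # average accuracy across both models
    variance = 2 * p_bar * (1 - p_bar)         # pooled Bernoulli variance, two arms
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

if __name__ == "__main__":
    # Detecting +1% over a 70% baseline takes roughly 33,000 questions,
    # which is why small benchmarks cannot resolve such improvements.
    print(questions_needed(delta=0.01, baseline_acc=0.70))
```

The takeaway: reliably resolving a one-point gap needs tens of thousands of independent questions, far more than most benchmarks contain.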
Extension
- Can include human baseline
- Includes vetting by others
Validation
Benchmarks where AI performs below human level are meaningful
AI Benchmarks
- monotonicity (stronger models should score higher; see the sketch below)
- low variance (repeated runs or resampled question sets should give similar scores)
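A minimal sketch of how these two checks can be run in code; the model scores and the 100-question result vector are invented for illustration:

```python
# Two quick validation checks for a benchmark: monotonicity (stronger models
# score higher) and low variance (the score is stable under resampling of the
# question set). All numbers below are invented for illustration.
import random
from statistics import mean, pstdev

# Accuracy of models ordered from weakest to strongest (hypothetical).
scores_by_model = [0.42, 0.51, 0.58, 0.66, 0.71]
monotonic = all(a <= b for a, b in zip(scores_by_model, scores_by_model[1:]))
print("monotonic:", monotonic)

# Bootstrap one model's per-question results to see how noisy the score is.
per_question = [1] * 66 + [0] * 34   # 66% accuracy on a 100-question benchmark
resampled = [mean(random.choices(per_question, k=len(per_question)))
             for _ in range(1000)]
print(f"bootstrap std of the score: {pstdev(resampled):.3f}")  # ~0.047 for n=100
```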

AI Evaluation
To measure is to know; if you cannot measure it, you cannot improve it. - Lord Kelvin
OpenAI
Central Limit Theorem approach to fix the lack of statistical rigor in evals, from Anthropic
Adding Error Bars to Evals: A Statistical Approach to Language...
Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the...
https://arxiv.org/abs/2411.00640
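The paper's core recommendation is to treat an eval as an experiment and report a standard error or confidence interval rather than a bare score. A minimal CLT-based sketch, with invented per-question scores:

```python
# CLT-based error bars for an eval score, in the spirit of arXiv:2411.00640:
# treat each question's score as an independent draw and report mean ± z * SE
# instead of a bare accuracy number. Per-question scores here are invented.
import math
from statistics import mean, stdev

def score_with_ci(per_question_scores: list[float], z: float = 1.96):
    """Return (mean score, 95% CI half-width) under a CLT approximation."""
    n = len(per_question_scores)
    se = stdev(per_question_scores) / math.sqrt(n)   # standard error of the mean
    return mean(per_question_scores), z * se

if __name__ == "__main__":
    scores = [1.0] * 132 + [0.0] * 68                # 66% accuracy on 200 questions
    m, half_width = score_with_ci(scores)
    print(f"accuracy = {m:.3f} ± {half_width:.3f} (95% CI)")  # ≈ 0.660 ± 0.066
```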

Benchmarks are unreliable; check arena results (e.g. LMSYS Chatbot Arena) or a trustworthy third party instead
Jim Fan on Twitter / X
It is *incredibly* easy to game the LLM benchmarks. Training on test set is for the rookies. Here're some tricks to practice magic at home: 1. Train on paraphrased examples of the test set. "LLM-decontaminator" paper from LMSys found that you can beat GPT-4 with a 13B model (!!)… - Jim Fan (@DrJimFan), September 9, 2024
https://x.com/DrJimFan/status/1833160432833716715
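The "LLM-decontaminator" mentioned above flags training examples that are paraphrases of test questions rather than only exact matches. A rough sketch of the embedding-retrieval step of that idea, assuming the sentence-transformers package and a hand-picked 0.85 threshold (the actual method additionally uses an LLM to judge each retrieved pair):

```python
# Paraphrase-aware contamination check: embed train and test examples and flag
# training items whose nearest test question exceeds a similarity threshold.
# This is only the retrieval step; an LLM judge would then confirm whether each
# flagged pair is really a rephrase. Model choice and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = ["What is the capital of France? Paris.",
                  "Name the French capital city."]          # toy data
test_questions = ["What is the capital of France?"]

train_emb = model.encode(train_examples, convert_to_tensor=True, normalize_embeddings=True)
test_emb = model.encode(test_questions, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(train_emb, test_emb)               # (train x test) matrix
for i, example in enumerate(train_examples):
    best = float(similarity[i].max())
    if best > 0.85:                                          # assumed threshold
        print(f"possible contamination (sim={best:.2f}): {example!r}")
```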
Types
Alex Strick van Linschoten - How to think about creating a dataset for LLM finetuning evaluation
I summarise the kinds of evaluations that are needed for a structured data generation task.
https://mlops.systems/posts/2024-06-25-evaluation-finetuning-manual-dataset.html
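For a structured data generation task like the one the post describes, the simplest automatic evaluation is checking that each output parses and that required fields match a reference. A toy sketch; the schema and examples are invented, not taken from the post:

```python
# Toy eval for a structured (JSON) generation task: parse each model output,
# check that required fields are present, and compare field values to a
# reference record. The schema and example outputs are invented.
import json

REQUIRED_FIELDS = {"name", "date", "amount"}

def score_output(raw_output: str, reference: dict) -> dict:
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "fields_present": 0.0, "field_accuracy": 0.0}
    present = REQUIRED_FIELDS & parsed.keys()
    correct = [f for f in present if parsed.get(f) == reference.get(f)]
    return {
        "valid_json": True,
        "fields_present": len(present) / len(REQUIRED_FIELDS),
        "field_accuracy": len(correct) / len(REQUIRED_FIELDS),
    }

if __name__ == "__main__":
    ref = {"name": "ACME", "date": "2024-06-25", "amount": 42.0}
    out = '{"name": "ACME", "date": "2024-06-25", "amount": 41.0}'
    print(score_output(out, ref))  # 2 of 3 fields correct
```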


Seonglae Cho