AI Benchmark
Good benchmark principles
- Make sure you can detect a 1% improvement (requires enough samples; a sample-size sketch follows this list)
- Easy to understand the result
- Hard enough that SOTA models cannot saturate it
- Use a standard metric and keep it comparable over time (do not update it often)
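A minimal sketch of what "detect a 1% improvement" implies in practice, using the standard two-proportion power calculation; the base accuracy, significance level, and power below are illustrative assumptions, not values from this note.

```python
from math import ceil
from statistics import NormalDist

def samples_to_detect(delta=0.01, p=0.5, alpha=0.05, power=0.80):
    """Per-model sample size for a two-proportion z-test that can
    detect an accuracy gain of `delta` around base accuracy `p`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    return ceil(n)

print(samples_to_detect())  # ~39,245 questions per model at p = 0.5
```

The takeaway: resolving a 1% gap near 50% accuracy takes tens of thousands of questions, which is why small benchmarks cannot rank closely matched models.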
Extension
- Can include a human baseline
- Include vetting by others (third-party review)
Validation
Benchmarks on which AI still performs below human level are the meaningful ones
AI Benchmarks
- monotonicity (better models should score higher)
- low variance (stable scores across reruns); see the sanity-check sketch below
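A quick sanity check for both properties; the rerun scores and model-size ladder below are made-up numbers for illustration only.

```python
from statistics import stdev

# Hypothetical numbers, not real benchmark results.
reruns = [0.712, 0.705, 0.718, 0.709, 0.714]             # same model, 5 runs
scores_by_scale = {"1B": 0.41, "7B": 0.55, "70B": 0.68}  # model-size ladder

# Low variance: rerun noise should sit well below the effect size you
# care about (e.g. the 1% improvement from the principles above).
assert stdev(reruns) < 0.01

# Monotonicity: scores should not decrease as model scale increases.
ordered = list(scores_by_scale.values())
assert all(a <= b for a, b in zip(ordered, ordered[1:]))
print("benchmark passes both sanity checks")
```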

AI Evaluation Notion
"To measure is to know; if you cannot measure it, you cannot improve it." - Lord Kelvin
OpenAI
Anthropic uses the Central Limit Theorem to fix the lack of statistical rigor in eval reporting (error bars on benchmark scores)
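A minimal sketch of the CLT-based approach: treat per-question grades as samples, report the mean with a standard error and confidence interval. The 0/1 grading, sample count, and 70% accuracy below are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, variance

def eval_score_ci(scores, confidence=0.95):
    """Mean eval score with a CLT-based confidence interval.
    `scores` holds per-question results, e.g. 0/1 correctness."""
    n = len(scores)
    m = mean(scores)
    se = (variance(scores) / n) ** 0.5              # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for 95%
    return m, m - z * se, m + z * se

# Made-up run: 0/1 grades on 1,000 questions from a ~70%-accurate model
random.seed(0)
scores = [1 if random.random() < 0.7 else 0 for _ in range(1000)]
print("score = %.3f, 95%% CI = [%.3f, %.3f]" % eval_score_ci(scores))
```

Two models whose confidence intervals overlap cannot be reliably ranked from that eval alone.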
Benchmarks are unreliable; cross-check against results from Chatbot Arena or a trustworthy third party