AI Benchmark
Good benchmark principles
- Make sure you can detect a 1% improvement
- Results are easy to understand
- Hard enough that current SOTA models cannot solve it
- Use a standard metric and make it comparable over time (do not update often)
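To make the "detect a 1% improvement" principle concrete, here is a rough sketch of how many test items that takes. It assumes accuracy is the metric, the two models are scored on independent samples, and a normal approximation applies. The function name `min_items_to_detect` and the defaults for `alpha` and `power` are illustrative choices, not part of the original notes.

```python
import math
from statistics import NormalDist


def min_items_to_detect(p_base, delta, alpha=0.05, power=0.8):
    """Approximate items per model needed to detect an accuracy gain of `delta`.

    Uses the standard two-proportion sample-size approximation
    (two-sided test, unpaired samples). Hypothetical helper for illustration.
    """
    p_new = p_base + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired detection power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


# A 1-point gain at 50% accuracy needs roughly 39,000 items per model.
n_items = min_items_to_detect(0.50, 0.01)
print(n_items)
```

Comparing both models on the same items (a paired comparison) usually needs fewer items than this unpaired estimate, so treat the number as a conservative upper bound.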
Extension
- Can include human baseline
- Includes vetting by others
LLM Benchmarks
Validation
Benchmarks where AI still performs below human level are the meaningful ones
AI Benchmarks
- monotonicity
- low variance
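A minimal sketch of checking both properties, under two assumptions that the notes do not spell out: "monotonicity" means the score rises as checkpoints get stronger, and "low variance" means a small spread across repeated runs (seeds) of the same checkpoint. All scores below are made-up placeholder data, and the helper names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical scores: one row per checkpoint (weakest to strongest),
# one column per random seed.
scores_by_checkpoint = [
    [0.41, 0.43, 0.42],
    [0.48, 0.47, 0.49],
    [0.55, 0.54, 0.56],
    [0.61, 0.60, 0.62],
]


def monotonic_fraction(values):
    """Fraction of consecutive steps where the value does not decrease."""
    steps = list(zip(values, values[1:]))
    return sum(later >= earlier for earlier, later in steps) / len(steps)


def worst_seed_std(rows):
    """Largest sample standard deviation across seeds for any checkpoint."""
    return max(stdev(row) for row in rows)


means = [mean(row) for row in scores_by_checkpoint]
mono = monotonic_fraction(means)             # 1.0 means fully monotonic
spread = worst_seed_std(scores_by_checkpoint)
print(mono, spread)
```

A benchmark whose seed-to-seed spread is larger than the gap between adjacent checkpoints cannot rank those checkpoints reliably, which is why the two properties are worth checking together.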
Language Model Metrics
NLP
Model Evaluation Tools
"To measure is to know. If you cannot measure it, you cannot improve it." - Lord Kelvin
Anthropic proposes using the Central Limit Theorem to fix the lack of statistical rigor in evaluations
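As a sketch of what a Central Limit Theorem treatment looks like in practice: treat per-question scores as independent draws, report the mean with its standard error, and attach a 95% confidence interval. The independence assumption, the function name `mean_with_ci`, and the sample scores are illustrative and not taken from Anthropic's write-up.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, standard error, (low, high)) using a normal approximation.

    Assumes per-question scores are independent; clustered questions
    (e.g. several questions about one passage) would need a wider interval.
    """
    m = mean(scores)
    sem = stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, sem, (m - z * sem, m + z * sem)


# Hypothetical per-question results: 1 = correct, 0 = incorrect.
results = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
avg, sem, (low, high) = mean_with_ci(results)
print(f"{avg:.2f} +/- {sem:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With only ten questions the interval is very wide, which illustrates why small benchmarks rarely support claims about small score differences.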
Benchmarks are unreliable; rely on arena results or a trustworthy third party instead
Evaluating LLMs is complex, so more comprehensive and purpose-specific evaluation methods are needed to assess their capabilities for various real-world applications
Types
Every time we solve something previously out of reach, it turns out that human-level generality is even further out of reach.