AI Evaluation

Creator
Seonglae Cho
Created
2023 Jun 2 12:28
Editor
Edited
2025 May 20 16:17

AI Benchmark

Good benchmark principles

  • Make sure you can detect a 1% improvement (see the sketch after this list)
  • The result is easy to understand
  • Hard enough that SOTA models cannot solve it
  • Use a standard metric and keep it comparable over time (do not update often)
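To make the 1% criterion concrete, here is a minimal sketch of how many questions a benchmark roughly needs before a 1% accuracy gain is distinguishable from sampling noise. It assumes a ~70% baseline accuracy and a standard two-proportion power calculation; both are illustrative choices, not values from this note.

```python
import math

def samples_to_detect(baseline: float = 0.70, delta: float = 0.01,
                      z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Rough per-model sample size so that a `delta` accuracy gain over `baseline`
    is detectable with a two-proportion z-test (~5% alpha, ~80% power).
    The 70% baseline and power settings are illustrative assumptions."""
    improved = baseline + delta
    variance = baseline * (1 - baseline) + improved * (1 - improved)
    n = ((z_alpha + z_beta) ** 2) * variance / delta ** 2
    return math.ceil(n)

# Roughly 33,000 questions per model to separate 70% from 71% accuracy
print(samples_to_detect())
```

At a 70% baseline this comes out to roughly 33,000 questions per model, which is why small benchmarks cannot resolve 1% differences.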

Extension

  • Can include a human baseline
  • Includes vetting by others
LLM Benchmarks
 

Validation

Benchmarks where AI performance is still below human performance are the meaningful ones
AI Benchmarks
  • monotonicity
  • low variance (see the sketch below)
https://hai.stanford.edu/ai-index/2025-ai-index-report
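A minimal sketch of how these validation signals could be checked, assuming you have scores from a sequence of improving model checkpoints and from repeated runs of one model; the variance threshold is an illustrative assumption.

```python
import statistics

def validate_benchmark(scores_by_checkpoint: list[float],
                       rerun_scores: list[float],
                       max_std: float = 0.01) -> dict:
    """Two validation signals for a benchmark:
    - monotonicity: scores should not decrease as model checkpoints improve
    - low variance: repeated runs of the same model should agree closely
    The 0.01 std threshold is an illustrative assumption."""
    monotonic = all(a <= b for a, b in zip(scores_by_checkpoint,
                                           scores_by_checkpoint[1:]))
    rerun_std = statistics.stdev(rerun_scores) if len(rerun_scores) > 1 else 0.0
    return {"monotonic": monotonic, "rerun_std": rerun_std,
            "low_variance": rerun_std <= max_std}

print(validate_benchmark([0.41, 0.48, 0.55, 0.61], [0.548, 0.551, 0.553]))
```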
Language Model Metrics
 

NLP

Model Evaluation Tools
 
 
 
"To measure is to know; if you cannot measure it, you cannot improve it." - Lord Kelvin
 
 
While MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly affect performance scores. On the other hand, evaluations like BBQ Benchmark, Big-Bench, and HELM are noted for their complexity due to challenges in implementation, interpretation, and technical intricacies that make it difficult to accurately measure model performance.
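As an illustration of that formatting sensitivity, the hypothetical sketch below renders the same multiple-choice question with two option styles; re-scoring a model on both renderings is a quick way to probe how fragile a benchmark's prompt format is. The question and helper are made up for this example, not MMLU's actual templates.

```python
def render_mcq(question: str, options: list[str], style: str = "letters") -> str:
    """Render the same multiple-choice question with two option formats.
    Formatting changes this small are the kind reported to shift MMLU scores.
    The example question is made up for illustration."""
    if style == "letters":
        lines = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
    else:  # parenthesized numbers
        lines = [f"({i + 1}) {opt}" for i, opt in enumerate(options)]
    return question + "\n" + "\n".join(lines) + "\nAnswer:"

question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter", "Saturn"]
print(render_mcq(question, options, "letters"))
print(render_mcq(question, options, "numbers"))
```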

Anthropic proposed applying the Central Limit Theorem to fix the lack of statistical rigor in eval reporting
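A minimal sketch of what CLT-based error bars look like: treat per-question correctness as samples and report the mean with a standard-error confidence interval. This is only an illustration of the idea under simple assumptions (independent 0/1 scores), not Anthropic's actual implementation.

```python
import math

def score_with_ci(per_question: list[int], z: float = 1.96) -> tuple[float, float, float]:
    """Eval accuracy with a CLT-based ~95% confidence interval, in the spirit of
    reporting error bars on eval scores. Assumes independent 0/1 correctness."""
    n = len(per_question)
    mean = sum(per_question) / n
    se = math.sqrt(mean * (1 - mean) / n)  # standard error of the mean
    return mean, mean - z * se, mean + z * se

# Illustrative per-question correctness values
print(score_with_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]))
```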

Benchmarks are unreliable; see results from arenas or trustworthy third parties.
Evaluating LLMs is complex, so more comprehensive and purpose-specific evaluation methods are needed to assess their capabilities across various real-world applications.
Types
 
 
 

Recommendations