AI Evaluation

Creator: Seonglae Cho
Created: 2023 Jun 2 12:28
Edited: 2025 Oct 21 23:17

AI Benchmark

Good benchmark principles

  • Make sure you can detect a 1% improvement (see the sketch after this list)
  • Results should be easy to understand
  • Hard enough that SOTA models cannot saturate it
  • Use a standard metric and keep it comparable over time (do not update often)
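
A back-of-the-envelope check for the first principle, assuming per-question scores are binary and using a normal (CLT) approximation; the baseline accuracy p = 0.7 below is an illustrative assumption, not a number from this page:

```python
import math

def min_questions_to_detect(delta: float, p: float = 0.7, z: float = 1.96) -> int:
    """Rough benchmark size so a score gap of `delta` between two
    independent runs exceeds the 95% confidence band of the difference."""
    # SE of a difference of two proportions: sqrt(2 * p * (1 - p) / n).
    # Require z * SE < delta and solve for n.
    return math.ceil(2 * p * (1 - p) * (z / delta) ** 2)

print(min_questions_to_detect(0.01))  # ≈ 16,135 questions at p = 0.7
```

A benchmark with only a few hundred questions therefore cannot resolve a 1-point gap; the noise band is wider than the effect.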

Extension

  • Can include a human baseline
  • Includes vetting by others

Validation

Benchmarks where AI performs below human level are meaningful
AI Benchmarks
  • Monotonicity: scores should improve as models improve (see the sketch below)
  • Low variance: scores should be stable across reruns
https://hai.stanford.edu/ai-index/2025-ai-index-report
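
A quick way to sanity-check both properties; the model names and scores below are hypothetical illustration data, not results from the report:

```python
import statistics

# Hypothetical: keys = model generations (older → newer), values = rerun scores
runs = {
    "model-v1": [0.52, 0.50, 0.53],
    "model-v2": [0.61, 0.63, 0.60],
    "model-v3": [0.71, 0.70, 0.72],
}

means = [statistics.mean(scores) for scores in runs.values()]
# Monotonicity: each newer generation should score higher than the last
print("monotonic:", all(a < b for a, b in zip(means, means[1:])))
# Low variance: rerun scores for the same model should stay tight
for model, scores in runs.items():
    print(f"{model}: {statistics.mean(scores):.2f} ± {statistics.stdev(scores):.3f}")
```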
AI Evaluation Notion
To measure is to know; if you cannot measure it, you cannot improve it. - Lord Kelvin

OpenAI

Central Limit Theorem
Anthropic proposes applying the Central Limit Theorem to fix the lack of statistical rigor in eval reporting: treat per-question scores as samples and report a standard error or confidence interval rather than a bare point estimate.
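
A minimal sketch of that approach; the per-question scores below are made-up illustration data:

```python
import math

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """CLT-based mean and 95% confidence half-width for an eval score."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)                   # z * standard error

# Hypothetical per-question correctness (1 = right, 0 = wrong)
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, hw = mean_with_ci(scores)
print(f"accuracy = {mean:.2f} ± {hw:.2f}")  # report the interval, not a point
```

With only 10 questions the interval is roughly ±0.30, which is exactly the kind of uncertainty a bare leaderboard number hides.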

Benchmarks are unreliable; prefer results from crowd-sourced arenas (e.g. Chatbot Arena) or trustworthy third parties
Types

Recommendations