AI Evaluation

Creator
Seonglae Cho
Created
2023 Jun 2 12:28
Edited
2025 Oct 21 23:17

AI Benchmark

Good benchmark principles

  • Make sure you can detect a 1% improvement (see the power-calculation sketch after this list)
  • Results should be easy to understand
  • Hard enough that SOTA models cannot solve it
  • Use a standard metric and keep it comparable over time (do not update it often)
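A quick way to check the first principle is a power calculation: how many questions a benchmark needs before a 1% accuracy gap is detectable at all. The sketch below uses a standard two-proportion z-test; the 70% baseline accuracy, 5% significance level, and 80% power are illustrative assumptions, not values from any particular benchmark.

```python
# Minimal power-calculation sketch (assumed baseline accuracy and thresholds).
from statistics import NormalDist

def questions_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Questions needed to distinguish accuracies p1 vs p2 with a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)            # critical value for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Detecting 70% vs 71% accuracy requires on the order of tens of thousands of questions.
print(questions_needed(0.70, 0.71))
```

The takeaway: a benchmark with only a few hundred questions cannot resolve a 1% improvement, so sample size is part of benchmark design, not an afterthought.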

Extension

  • Can include a human baseline
  • Includes vetting by others

Validation

Benchmarks where AI performs below human level are meaningful
AI Benchmarks
  • Monotonicity (better models should score higher; see the sketch below)
  • Low variance (scores should be stable across repeated runs)
https://hai.stanford.edu/ai-index/2025-ai-index-report
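A toy sketch of checking those two properties across a model family; the model names, scores, and number of runs below are made up for illustration.

```python
# Toy validation check: monotonicity over model scale and run-to-run variance.
import statistics

# Accuracy from several independent eval runs per model (fabricated numbers).
scores = {
    "small": [0.52, 0.54, 0.53],
    "medium": [0.61, 0.60, 0.62],
    "large": [0.71, 0.70, 0.72],
}

means = [statistics.mean(runs) for runs in scores.values()]
is_monotonic = all(a <= b for a, b in zip(means, means[1:]))   # stronger model -> higher score
worst_run_stdev = max(statistics.stdev(runs) for runs in scores.values())  # noise across runs

print(f"monotonic over model scale: {is_monotonic}")
print(f"worst-case run-to-run stdev: {worst_run_stdev:.3f}")
```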
AI Evaluation Notion
To measure is to know; if you cannot measure it, you cannot improve it. - Lord Kelvin

OpenAI

OpenAI Evals

Central Limit Theorem
Anthropic proposes CLT-based error bars to fix the lack of statistical rigor in eval reporting

Adding Error Bars to Evals: A Statistical Approach to Language...
Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the...
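A minimal sketch of the core idea in that post: treat per-question scores as samples and report a CLT-based standard error next to the mean, instead of a bare accuracy number. The toy scores and the 1.96 multiplier for a 95% interval are assumptions; the post also covers refinements (e.g. clustered standard errors, paired comparisons) not shown here.

```python
# CLT-based error bars for an eval score (sketch; toy per-question scores).
import math

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(sample_var / n)   # standard error of the mean via the CLT
    return mean, z * sem

per_question = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]  # 1 = correct, 0 = incorrect
mean, half_width = mean_with_ci(per_question)
print(f"accuracy = {mean:.3f} ± {half_width:.3f}")
```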
Benchmarks are unreliable and easy to game; prefer results from Chatbot Arena or a trustworthy third party
Jim Fan on Twitter / X
It is *incredibly* easy to game the LLM benchmarks. Training on test set is for the rookies. Here're some tricks to practice magic at home: 1. Train on paraphrased examples of the test set. "LLM-decontaminator" paper from LMSys found that you can beat GPT-4 with a 13B model (!!)… - Jim Fan (@DrJimFan), September 9, 2024
Types
Alex Strick van Linschoten - How to think about creating a dataset for LLM finetuning evaluation
I summarise the kinds of evaluations that are needed for a structured data generation task.

Recommendations