AI Benchmark
Good benchmark principles
- Make sure you can detect a 1% improvement (see the sample-size sketch below)
- Results are easy to understand
- Hard enough that current SOTA models cannot solve it
- Use a standard metric and make it comparable over time (do not update often)
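As a rough sanity check on the first principle, a standard two-proportion power calculation gives the order of magnitude of questions needed before a 1% accuracy gain is distinguishable from noise. A minimal sketch; the 70% baseline accuracy, significance level, and power are assumed example values:

```python
import math

def min_questions_for_delta(p_baseline: float, delta: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate sample size for a two-proportion z-test to detect an
    accuracy gain of `delta` over `p_baseline` (alpha=0.05, power=0.80)."""
    p1 = p_baseline
    p2 = p_baseline + delta
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / delta ** 2)

# Assumed example: a benchmark where the baseline model scores 70% accuracy.
print(min_questions_for_delta(0.70, 0.01))  # roughly tens of thousands of questions
```

With these assumed numbers the answer is on the order of tens of thousands of questions, which is why small benchmarks cannot resolve a 1% difference.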
Extension
- Can include a human baseline
- Includes vetting by others
LLM Benchmarks
Validation
Benchmarks where AI performance is still below human performance are the meaningful ones.
AI Benchmarks
- Monotonicity (better models should score higher)
- Low variance (repeated runs should give stable scores)
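A quick way to check both properties, sketched below with hypothetical data: rank correlation between presumed model capability and benchmark score as a monotonicity proxy, and run-to-run standard deviation as the variance measure. The model names and scores are made up.

```python
from statistics import mean, stdev
from scipy.stats import spearmanr  # rank correlation as a monotonicity proxy

# Hypothetical scores: models ordered by presumed capability (weakest first),
# each evaluated over several runs of the same benchmark.
runs_per_model = {
    "model_small":  [0.42, 0.44, 0.41],
    "model_medium": [0.55, 0.53, 0.56],
    "model_large":  [0.63, 0.65, 0.62],
}

capability_rank = list(range(len(runs_per_model)))     # 0, 1, 2, ...
mean_scores = [mean(v) for v in runs_per_model.values()]

rho, _ = spearmanr(capability_rank, mean_scores)
print(f"monotonicity (Spearman rho): {rho:.2f}")       # 1.0 == perfectly monotone

for name, scores in runs_per_model.items():
    print(f"{name}: run-to-run stdev = {stdev(scores):.3f}")  # low == stable
```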
Language Model Metrics
NLP
Model Evaluation Tools
To measure is to know; if you cannot measure it, you cannot improve it. - Lord Kelvin
While MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly shift its scores. At the other extreme, evaluations like the BBQ Benchmark, Big-Bench, and HELM are complex to implement and interpret, which makes it difficult to measure model performance accurately.
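To make the formatting-sensitivity point concrete, the sketch below renders the same multiple-choice question under different option-marker styles; the question and styles are invented, and the claim is only that surface variants like these are what can move scores.

```python
# Hypothetical MMLU-style question; only the option marker style changes.
question = "What is the boiling point of water at sea level?"
options = ["90 C", "100 C", "110 C", "120 C"]

def render(question: str, options: list[str], style: str) -> str:
    markers = {
        "letters_dot":   [f"{chr(65 + i)}." for i in range(len(options))],
        "letters_paren": [f"({chr(65 + i)})" for i in range(len(options))],
        "numbers":       [f"{i + 1}." for i in range(len(options))],
    }[style]
    lines = [question] + [f"{m} {o}" for m, o in zip(markers, options)]
    return "\n".join(lines) + "\nAnswer:"

for style in ("letters_dot", "letters_paren", "numbers"):
    print(render(question, options, style), end="\n\n")
```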
Anthropic proposed using the Central Limit Theorem to fix the lack of statistical rigor in eval reporting (error bars on benchmark scores).
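The CLT fix amounts to treating each question as an independent Bernoulli trial and attaching a standard error to the reported accuracy. A minimal sketch with made-up per-question results:

```python
import math

def accuracy_with_ci(per_question_correct: list[int], z: float = 1.96):
    """Mean accuracy plus a CLT-based 95% confidence interval.

    Each entry is 1 (correct) or 0 (incorrect); questions are assumed i.i.d.
    """
    n = len(per_question_correct)
    acc = sum(per_question_correct) / n
    se = math.sqrt(acc * (1 - acc) / n)   # standard error of the mean
    return acc, (acc - z * se, acc + z * se)

# Made-up results: 720 correct out of 1000 questions.
results = [1] * 720 + [0] * 280
acc, (lo, hi) = accuracy_with_ci(results)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Two models whose intervals overlap cannot be confidently ranked on that benchmark alone.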
Benchmarks are unreliable; check results from arenas or a trustworthy third party instead.
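Arena results come from pairwise human preferences aggregated into a rating; an Elo-style update is one common way to do that aggregation. The sketch below uses fabricated battle outcomes and is not any particular leaderboard's actual implementation:

```python
from collections import defaultdict

def elo_update(ratings, winner, loser, k=32):
    """One Elo update: shift both ratings toward the observed battle outcome."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser]  -= k * (1 - expected_win)

# Fabricated pairwise battles: (winner, loser) as judged by human preference.
battles = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b")]

ratings = defaultdict(lambda: 1000.0)
for winner, loser in battles:
    elo_update(ratings, winner, loser)

for name, r in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"{name}: {r:.0f}")
```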
Evaluating LLMs is complex, so more comprehensive and purpose-specific evaluation methods are needed to assess their capabilities for various real-world applications.
Types
Every time we solve something previously out of reach, it turns out that human-level generality is even further out of reach.