AI Evaluation

Creator

Seonglae Cho

Created

2023 Jun 2 12:28

Editor

Seonglae Cho

Edited

2024 Dec 28 14:25

Refs

Multimodal Benchmark

Good benchmark principles

Make sure you can detect a 1% improvement

Results are easy to understand

Hard enough (SOTA models cannot yet solve it)

Use a standard metric and make it comparable over time (do not update often)
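A rough way to check the "detect a 1% improvement" principle is to ask how many questions a benchmark needs before a 1-point accuracy gap is larger than the sampling noise. The sketch below is only an illustration under stated assumptions: scores are per-question 0/1 accuracy, the two models are treated as independent (unpaired), and the function names are hypothetical. Under those assumptions, a benchmark near 50% accuracy needs roughly 19,000 questions for a 1% gap to clear a 95% interval.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of a 0/1 accuracy estimated from n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def questions_needed(accuracy: float, min_diff: float, z: float = 1.96) -> int:
    """Approximate number of questions so that a gap of min_diff between two
    independent models exceeds z standard errors of the difference.

    Assumption: both models sit near the same accuracy, so the variance of
    the difference is about 2 * p * (1 - p) / n.
    """
    variance_per_question = 2.0 * accuracy * (1.0 - accuracy)
    return math.ceil(variance_per_question * (z / min_diff) ** 2)


if __name__ == "__main__":
    for acc in (0.5, 0.8, 0.95):
        n = questions_needed(acc, min_diff=0.01)
        print(f"accuracy={acc:.2f}: about {n} questions to detect a 1% gap")
```

Comparing the two models on the same questions (a paired comparison) usually needs fewer items than this unpaired estimate, because shared question difficulty cancels out.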

Extension

Can include human baseline

Includes vetting by others

LLM Benchmarks

Human Evaluation

Agentic Evaluation

Proof-based Evaluation

Dataset-based AI Benchmark

monotonicity

low variance

https://www.youtube.com/watch?v=2-SPH9hIKT8

Language Model Metrics

Model Evaluation Tools

HuggingFace Evaluate

Anthropic: use the Central Limit Theorem to address the lack of statistical rigor in evals

Adding Error Bars to Evals: A Statistical Approach to Language...

Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the...

https://arxiv.org/abs/2411.00640
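As a minimal sketch of what "adding error bars" can look like in the simplest case, the snippet below reports a mean eval score with a standard error and a 95% confidence interval derived from the Central Limit Theorem. It assumes per-question scores are independent and that there are enough questions for the normal approximation to hold; the function name `mean_with_ci` is hypothetical and this is not taken from the paper's code.

```python
import math
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, standard error, (low, high)) for per-question scores.

    Assumes independent questions; the interval uses the normal
    approximation justified by the Central Limit Theorem.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    m = mean(scores)
    sem = stdev(scores) / math.sqrt(n)  # sample standard deviation / sqrt(n)
    return m, sem, (m - z * sem, m + z * sem)


if __name__ == "__main__":
    # Toy data: 1 = correct, 0 = incorrect
    scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
    m, sem, (low, high) = mean_with_ci(scores)
    print(f"accuracy = {m:.3f} +/- {z_sem:.3f}" if False else
          f"accuracy = {m:.3f}, SE = {sem:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

If questions come in related groups (for example, several questions about one passage), the independence assumption breaks and this interval will be too narrow.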

Benchmark scores are unreliable on their own; cross-check them against arena results or a trustworthy third-party evaluation

Jim Fan on Twitter / X

It is *incredibly* easy to game the LLM benchmarks. Training on test set is for the rookies. Here're some tricks to practice magic at home: 1. Train on paraphrased examples of the test set. "LLM-decontaminator" paper from LMSys found that you can beat GPT-4 with a 13B model (!!)… (Jim Fan, @DrJimFan, September 9, 2024)

https://x.com/DrJimFan/status/1833160432833716715
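To illustrate why paraphrased test examples are hard to catch, here is a minimal sketch of a plain n-gram overlap check between a training sample and a test sample. This is a generic string-matching baseline, not the LLM-decontaminator method; the function names and the example sentences are made up for illustration. A verbatim copy overlaps fully, while a paraphrase of the same question can share no n-grams at all.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_text: str, test_text: str, n: int = 3) -> float:
    """Fraction of the test sample's n-grams that also occur in the training sample."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)


if __name__ == "__main__":
    test_item = "what is the capital city of france"
    verbatim = "what is the capital city of france"
    paraphrase = "name the french capital"
    print("verbatim overlap:", overlap_ratio(verbatim, test_item))
    print("paraphrase overlap:", overlap_ratio(paraphrase, test_item))
```

A check like this only flags near-verbatim leakage, which is one reason to prefer arena-style or third-party results over self-reported benchmark numbers.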

LLM Leaderboard

Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Considerations for model evaluation

https://huggingface.co/docs/evaluate/considerations

LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis

https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard

https://arxiv.org/pdf/2309.16609.pdf

Evaluating LLMs is complex, so more comprehensive and purpose-specific evaluation methods are needed to assess their capabilities across real-world applications

Evaluations are all we need

On analysing talent in LLMs

https://www.strangeloopcanon.com/p/evaluations-are-all-we-need

Types

Alex Strick van Linschoten - How to think about creating a dataset for LLM finetuning evaluation

I summarise the kinds of evaluations that are needed for a structured data generation task.

https://mlops.systems/posts/2024-06-25-evaluation-finetuning-manual-dataset.html

Validation

Benchmarks on which AI still performs below human level are the meaningful ones

AI Benchmarks

Recommendations
