Language Model Evaluation
LLM Benchmarks
- monotonicity
- low variance
Model Evaluation Tools
Benchmark scores alone are unreliable; check results from an arena or a trustworthy third party instead.
LLM Leaderboard
Evaluating LLMs is complex, so more comprehensive and purpose-specific evaluation methods are needed to assess their capabilities for various real-world applications.
Types