LLM as a Judge

Creator
Seonglae Cho
Created
2024 Nov 22 23:37
Edited
2026 Mar 6 19:02
Refs

LLM Judge

Continuous scores are ineffective for LLM-as-a-judge; LLMs perform better when making categorical judgments. Have LLM judges make categorical assessments first, which can then be aggregated into continuous metrics if needed.
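The categorical-then-aggregate pattern can be sketched as follows. This is a minimal illustration: the `judge` function is a hypothetical stand-in for a real LLM judge call, and the label-to-number mapping is an assumed example, not a fixed convention.

```python
# Assumed categorical labels and their numeric mapping (illustrative only).
LABELS = {"fail": 0.0, "partial": 0.5, "pass": 1.0}

def judge(sample: str) -> str:
    """Hypothetical stand-in for an LLM judge prompted to pick one
    categorical label from LABELS for a given sample."""
    return "pass" if "correct" in sample else "fail"

def aggregate(samples: list[str]) -> float:
    """Collect categorical judgments, then map them to numbers and
    average into a continuous metric after the fact."""
    labels = [judge(s) for s in samples]
    return sum(LABELS[label] for label in labels) / len(labels)

samples = ["correct answer", "wrong answer", "correct reasoning"]
print(aggregate(samples))  # 2 pass, 1 fail -> 0.666...
```

The key point is that the judge only ever emits a discrete label; the continuous score exists only downstream, in the aggregation step.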

Limitation

LLM Judge Types
weakness (arxiv.org)

political-neutrality-eval (anthropics), updated 2026 Mar 5 7:22

Inter-model evaluation agreement rates:
  • Claude Sonnet 4.5 ↔ GPT-5: 92% agreement
  • Claude Opus 4.1 ↔ Sonnet 4.5: 94% agreement
Human evaluator agreement is around 85% → model graders are more consistent than human raters.
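An agreement rate like those above is simply the fraction of items on which two graders assign the same label. A minimal sketch, with toy labels standing in for two models' verdicts on the same items:

```python
def agreement_rate(a: list[str], b: list[str]) -> float:
    """Fraction of items where two graders give the same label."""
    if len(a) != len(b):
        raise ValueError("graders must label the same items")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical verdicts from two judge models on five shared items.
grader_1 = ["pass", "pass", "fail", "pass", "fail"]
grader_2 = ["pass", "pass", "fail", "fail", "fail"]
print(agreement_rate(grader_1, grader_2))  # 4/5 -> 0.8
```

Raw agreement does not correct for chance; for labels with skewed base rates, a chance-corrected statistic such as Cohen's kappa gives a fairer comparison between model graders and human raters.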
Measuring political bias in Claude
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Recommendations