LLM as a Judge

Creator
Seonglae Cho
Created
2024 Nov 22 23:37
Edited
2026 Mar 6 19:02
Refs

LLM Judge

Continuous scores are ineffective for LLM-as-a-judge; LLMs perform better when making categorical judgments. Have LLM judges make categorical assessments first, which can then be aggregated into continuous metrics if needed.
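The categorical-then-aggregate pattern can be sketched as follows. This is a minimal illustration: the `judge` function is a hypothetical stand-in for a real LLM judge call, and the label-to-number mapping is an assumed example, not a fixed convention.

```python
# Assumed categorical labels and their numeric mapping (illustrative only).
LABELS = {"fail": 0.0, "partial": 0.5, "pass": 1.0}

def judge(sample: str) -> str:
    """Hypothetical stand-in for an LLM judge prompted to pick one
    categorical label from LABELS for a given sample."""
    return "pass" if "correct" in sample else "fail"

def aggregate(samples: list[str]) -> float:
    """Collect categorical judgments, then map them to numbers and
    average into a continuous metric after the fact."""
    labels = [judge(s) for s in samples]
    return sum(LABELS[label] for label in labels) / len(labels)

samples = ["correct answer", "wrong answer", "correct reasoning"]
print(aggregate(samples))  # 2 pass, 1 fail -> 0.666...
```

The key point is that the judge only ever emits a discrete label; the continuous score exists only downstream, in the aggregation step.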

Limitation

LLM Judge Types
weakness (arxiv.org)

political-neutrality-eval (anthropics), updated 2026 Mar 5 7:22

Inter-model evaluation agreement rates:
  • Claude Sonnet 4.5 ↔ GPT-5: 92% agreement
  • Claude Opus 4.1 ↔ Sonnet 4.5: 94% agreement
Human evaluator agreement is around 85% → model graders are more consistent than human raters.
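An agreement rate like those above is simply the fraction of items on which two graders assign the same label. A minimal sketch, with toy labels standing in for two models' verdicts on the same items:

```python
def agreement_rate(a: list[str], b: list[str]) -> float:
    """Fraction of items where two graders give the same label."""
    if len(a) != len(b):
        raise ValueError("graders must label the same items")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical verdicts from two judge models on five shared items.
grader_1 = ["pass", "pass", "fail", "pass", "fail"]
grader_2 = ["pass", "pass", "fail", "fail", "fail"]
print(agreement_rate(grader_1, grader_2))  # 4/5 -> 0.8
```

Raw agreement does not correct for chance; for labels with skewed base rates, a chance-corrected statistic such as Cohen's kappa gives a fairer comparison between model graders and human raters.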
Measuring political bias in Claude
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Recommendations