LLM Judge
Asking an LLM judge to output a continuous score directly is not effective; LLMs perform better when making categorical judgments. The recommended approach is to have the judge make categorical assessments first, then aggregate them into continuous metrics if needed.
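A minimal sketch of the "categorical first, aggregate later" idea. The label set (`pass`, `partial`, `fail`), the weights in `LABEL_SCORES`, and the function name `aggregate` are illustrative assumptions, not taken from any particular judge setup.

```python
from collections import Counter

# Hypothetical label set and weights, chosen only for illustration.
LABEL_SCORES = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def aggregate(verdicts):
    """Turn a list of categorical verdicts into a mean score and label fractions."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    unknown = set(verdicts) - set(LABEL_SCORES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    n = len(verdicts)
    counts = Counter(verdicts)
    # Continuous metric: average of the per-label weights.
    mean_score = sum(LABEL_SCORES[v] for v in verdicts) / n
    # Share of each label, including labels that never occurred.
    fractions = {label: counts.get(label, 0) / n for label in LABEL_SCORES}
    return mean_score, fractions


# Example: four categorical verdicts from a judge (made-up data).
mean_score, fractions = aggregate(["pass", "fail", "pass", "partial"])
print(mean_score, fractions)
```

The judge only ever emits one of a few labels; the continuous number exists only at the aggregation step, so the weights can be changed later without re-running the judge.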
Limitation
- Egocentric bias: a judge model tends to prefer its own outputs (see AI Introspection)
LLM Judge Types
weakness
https://arxiv.org/pdf/2505.15795
political-neutrality-eval
anthropics • Updated 2026 Mar 5 7:22
Inter-model evaluation agreement rates:
- Claude Sonnet 4.5 ↔ GPT-5: 92% agreement
- Claude Opus 4.1 ↔ Sonnet 4.5: 94% agreement
Agreement between human evaluators is around 85%, so by this measure model graders are more consistent with each other than human raters are.
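For reference, a pairwise agreement rate like the ones above can be computed as the fraction of items on which two graders give the same label. This is a minimal sketch under that assumption; the function name `agreement_rate` and the sample labels are hypothetical, and the source does not specify exactly how its percentages were calculated.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two graders assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("graders must label the same number of items")
    if not labels_a:
        raise ValueError("no labels to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels from two graders on the same five items.
grader_1 = ["even", "even", "biased", "even", "biased"]
grader_2 = ["even", "biased", "biased", "even", "biased"]
print(agreement_rate(grader_1, grader_2))
```

Raw agreement does not correct for agreement expected by chance, so a chance-corrected statistic such as Cohen's kappa may be preferable when the label distribution is skewed.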
Measuring political bias in Claude
https://www.anthropic.com/news/political-even-handedness

Seonglae Cho