Finding GPT-4’s mistakes with GPT-4
CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/

Judge-only model
Selene
Frontier AI needs frontier evaluators. Meet Selene.
Discover Atla Selene 1, the world's most accurate LLM Judge for evaluating AI responses. Outperforming frontier models from OpenAI, Anthropic, DeepSeek, and others across 11 benchmarks for evaluators, Selene provides accurate scores and actionable feedback. Make effective custom eval criteria in a few steps using our new Alignment Platform. Start for free today.
https://www.atla-ai.com/post/selene-1

Announcing Atla’s native integration with Langfuse
Atla now integrates natively with Langfuse's LLM observability platform. Use Selene 1 as an “LLM-as-a-Judge” to run evals in Langfuse. Monitor app performance in production and run experiments over datasets. Discover how this powerful integration enhances LLM evaluations.
https://www.atla-ai.com/post/langfuse-native-integration
Thinking-LLM-as-a-Judge (CoT judge)
EvalPlanner is a specific implementation of Thinking-LLM-as-a-Judge: it first generates an unconstrained evaluation plan for the given instruction, then executes that plan step by step to reach the final judgment, as sketched below.
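A minimal sketch of the plan-then-execute judging pattern, assuming an OpenAI-compatible chat API; the model name and prompts are illustrative assumptions, not EvalPlanner's exact training or prompting recipe.

```python
# Sketch of a plan-then-execute CoT judge (in the spirit of EvalPlanner).
# Assumptions: an OpenAI-compatible chat API and "gpt-4o" as a stand-in judge model.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed judge model, not the paper's checkpoint

def judge(instruction: str, response_a: str, response_b: str) -> str:
    # Step 1: generate an evaluation plan tailored to this instruction.
    plan = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Write a step-by-step evaluation plan for judging responses to:\n{instruction}",
        }],
    ).choices[0].message.content

    # Step 2: execute the plan against both responses and emit a verdict.
    verdict = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Evaluation plan:\n{plan}\n\n"
                f"Instruction:\n{instruction}\n\n"
                f"Response A:\n{response_a}\n\n"
                f"Response B:\n{response_b}\n\n"
                "Follow the plan step by step, then answer with 'A' or 'B'."
            ),
        }],
    ).choices[0].message.content
    return verdict
```

Separating the plan from its execution lets the judge commit to evaluation criteria before seeing how either response fares against them, which is the core idea behind the Thinking-LLM-as-a-Judge framing.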

Seonglae Cho
