Finding GPT-4’s mistakes with GPT-4
CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/

Judge-only model
Selene
Frontier AI needs frontier evaluators. Meet Selene.
Discover Atla Selene 1, the world's most accurate LLM Judge for evaluating AI responses. Outperforming frontier models from OpenAI, Anthropic, DeepSeek, and others across 11 benchmarks for evaluators, Selene provides accurate scores and actionable feedback. Make effective custom eval criteria in a few steps using our new Alignment Platform. Start for free today.
https://www.atla-ai.com/post/selene-1

Announcing Atla’s native integration with Langfuse
Atla now integrates natively with Langfuse's LLM observability platform. Use Selene 1 as an “LLM-as-a-Judge” to run evals in Langfuse. Monitor app performance in production and run experiments over datasets. Discover how this powerful integration enhances LLM evaluations.
https://www.atla-ai.com/post/langfuse-native-integration
Thinking-LLM-as-a-Judge (CoT judge)
EvalPlanner is a specific implementation of Thinking-LLM-as-a-Judge: it first generates an unconstrained evaluation plan for the given instruction, then executes that plan step by step to reach the final judgment, as sketched below.
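A minimal sketch of the plan-then-execute judging pattern, assuming an OpenAI-compatible chat API; the model name and prompts are illustrative assumptions, not EvalPlanner's exact training or prompting recipe.

```python
# Sketch of a plan-then-execute CoT judge (in the spirit of EvalPlanner).
# Assumptions: an OpenAI-compatible chat API and "gpt-4o" as a stand-in judge model.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed judge model, not the paper's checkpoint

def judge(instruction: str, response_a: str, response_b: str) -> str:
    # Step 1: generate an evaluation plan tailored to this instruction.
    plan = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Write a step-by-step evaluation plan for judging responses to:\n{instruction}",
        }],
    ).choices[0].message.content

    # Step 2: execute the plan against both responses and emit a verdict.
    verdict = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Evaluation plan:\n{plan}\n\n"
                f"Instruction:\n{instruction}\n\n"
                f"Response A:\n{response_a}\n\n"
                f"Response B:\n{response_b}\n\n"
                "Follow the plan step by step, then answer with 'A' or 'B'."
            ),
        }],
    ).choices[0].message.content
    return verdict
```

Separating the plan from its execution lets the judge commit to evaluation criteria before seeing how either response fares against them, which is the core idea behind the Thinking-LLM-as-a-Judge framing.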

Seonglae Cho
