ART (Agent Reinforcement Trainer)
ART keeps GRPO unchanged but introduces "trajectory units" to enable learning of multi-step reasoning and agent interaction. A trajectory is divided into steps by agent action units: a step is the moment when the agent performs one action and receives an observation from the environment (or tool).
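A minimal sketch of the step/trajectory structure described above (the class and field names here are illustrative, not ART's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent action unit: an action plus the observation it triggers."""
    action: str       # e.g. a tool call or message emitted by the agent
    observation: str  # what the environment (or tool) returns

@dataclass
class Trajectory:
    """A trajectory unit: the ordered steps of one rollout."""
    steps: list = field(default_factory=list)

    def record(self, action: str, observation: str) -> None:
        # One step = one action followed by one observation
        self.steps.append(Step(action, observation))

traj = Trajectory()
traj.record("search('GRPO')", "3 results found")
traj.record("open(result_1)", "page contents ...")
print(len(traj.steps))  # number of action units in this trajectory
```

Grouping rollouts at the trajectory level is what lets GRPO's group-relative advantage estimation apply to multi-step agents without modifying the algorithm itself.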
RULER (Relative Universal LLM-Elicited Rewards)
GRPO normalizes rewards across samples, while RULER adds a relative scoring step within each group of trajectories: an LLM judge assigns each trajectory a score in the 0-1 range relative to the others in its group. Because the scores are bounded and group-relative, this acts like a normalization, though not an exact statistical one.
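A minimal sketch of how such judge scores feed into GRPO-style group-relative advantages (the scores below are hypothetical stand-ins for an LLM judge's 0-1 outputs; this is not RULER's actual implementation):

```python
from statistics import mean, pstdev

def group_advantages(judge_scores):
    """Turn per-trajectory judge scores (each in [0, 1]) into
    group-relative advantages, GRPO-style: (score - mean) / std."""
    mu = mean(judge_scores)
    sigma = pstdev(judge_scores) or 1.0  # guard against a zero-variance group
    return [(s - mu) / sigma for s in judge_scores]

# Hypothetical judge scores for four rollouts of the same task:
scores = [0.9, 0.4, 0.7, 0.2]
print([round(a, 2) for a in group_advantages(scores)])
```

Because the advantages are computed relative to the group mean, only the judge's ordering and spread within a group matter, which is why a bounded 0-1 judge score is enough to drive training.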
Training AI Agents with RL | Unsloth Documentation
Learn how to train AI agents for real-world tasks using Reinforcement Learning (RL).
https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/training-ai-agents-with-rl

RULER - ART
Learn how to use RULER to automatically reward your agents.
https://art.openpipe.ai/fundamentals/ruler

Seonglae Cho