ART
While keeping GRPO as is, ART introduces the concept of "trajectory units" to enable learning of multi-step reasoning and agent interaction. A trajectory is divided into steps by agent action units: a step is "the moment when the agent performs one action and receives an observation from the environment (or tool)."
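A minimal sketch of this trajectory/step structure, assuming illustrative names only (Step and Trajectory here are not the actual ART API):

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent action and the observation returned by the environment (or tool)."""
    action: str
    observation: str


@dataclass
class Trajectory:
    """A trajectory unit: the ordered steps of one multi-step rollout, plus its reward."""
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0

    def add_step(self, action: str, observation: str) -> None:
        # Each action/observation pair forms one step of the trajectory.
        self.steps.append(Step(action=action, observation=observation))
```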
Relative Universal LLM-Elicited Rewards (RULER)
GRPO focuses on reward normalization across the samples in a group, while RULER applies relative scoring within a group of trajectories. The LLM judge assigns each trajectory a score in [0, 1], so the result is effectively a normalization, though not an exact statistical one.
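A sketch contrasting the two reward treatments, assuming the standard GRPO group normalization and a placeholder LLM judge (the `judge` callable below is a stand-in, not a real API):

```python
import statistics


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards across the samples in one group (GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


def ruler_scores(trajectories: list[str], judge) -> list[float]:
    """Score a group of trajectories relative to each other (RULER-style).

    `judge` is a hypothetical callable that sees the whole group at once and
    returns one score in [0, 1] per trajectory; the scores act as relative
    rewards rather than an exact statistical normalization.
    """
    scores = judge(trajectories)
    return [min(max(s, 0.0), 1.0) for s in scores]


if __name__ == "__main__":
    print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
    # Stub judge: in practice an LLM compares the trajectories and returns scores.
    print(ruler_scores(["rollout A", "rollout B"], judge=lambda ts: [0.9, 0.4]))
```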

Seonglae Cho