Agent Reinforcement Trainer

Creator: Seonglae Cho
Created: 2025 Oct 6 9:48
Edited: 2025 Oct 8 23:37

ART

While keeping GRPO itself unchanged, ART introduces the concept of "trajectory units" to enable learning of multi-step reasoning and agent interaction. Trajectories are divided into steps by agent action: a step is the moment when the agent performs one action and receives an observation back from the environment (or tool).
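As a rough illustration, a trajectory can be represented as an ordered list of steps, each pairing one agent action with the resulting observation. The names below (`Step`, `Trajectory`) are hypothetical stand-ins, not ART's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical data structures illustrating the "trajectory unit" idea;
# these are not ART's actual classes.

@dataclass
class Step:
    action: str       # what the agent did (e.g. a tool call or message)
    observation: str  # what the environment/tool returned

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0  # one scalar reward for the whole trajectory

    def add(self, action: str, observation: str) -> None:
        # One step = one action followed by one observation.
        self.steps.append(Step(action, observation))
```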

Relative Universal LLM-Elicited Rewards

GRPO focuses on reward normalization across samples, while RULER performs reward normalization across trajectories within a group. The LLM judge model assigns scores on a 0-1 scale, so the result is effectively normalized, though not normalization in the strict sense.
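A minimal sketch of how 0-1 judge scores could feed GRPO-style group normalization; the helper below is illustrative, not RULER's implementation, and the example scores are made up.

```python
import statistics

def group_advantages(scores: list[float]) -> list[float]:
    """GRPO-style normalization: center and scale the judge's 0-1
    scores within a group so they become relative advantages."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # avoid division by zero
    return [(s - mean) / std for s in scores]

# Example: 0-1 scores from an LLM judge for four trajectories.
print(group_advantages([0.9, 0.4, 0.7, 0.2]))
```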

Recommendations