Two agents starting from the same base LLM are co-evolved:
- Curriculum Agent: Uses RL to generate frontier tasks where the Executor has the highest uncertainty
- Executor Agent: Learns to solve those tasks via RL
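A minimal sketch of this co-evolution loop, assuming hypothetical callbacks (`propose`, `estimate_success_rate`, `update_curriculum`, `update_executor`) and an illustrative difficulty band; none of these names or values come from the paper.

```python
def co_evolve(propose, estimate_success_rate, update_curriculum, update_executor,
              n_rounds=10, tasks_per_round=256, band=(0.25, 0.75)):
    """One co-evolution schedule; callers supply the four callbacks."""
    low, high = band
    for _ in range(n_rounds):
        # Curriculum proposes candidate tasks; its reward peaks where the
        # Executor's empirical success rate p_hat is closest to 0.5.
        tasks = propose(tasks_per_round)
        p_hats = [estimate_success_rate(t) for t in tasks]
        rewards = [1.0 - 2.0 * abs(p - 0.5) for p in p_hats]
        update_curriculum(tasks, rewards)

        # Executor trains only on frontier tasks that are neither trivially
        # easy nor hopelessly hard (the band is an illustrative choice).
        frontier = [t for t, p in zip(tasks, p_hats) if low <= p <= high]
        update_executor(frontier)
```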
Tool integration as a growth engine: when a code-interpreter tool is added to the Executor, its problem-solving ability improves; this pressures the Curriculum to become more tool-aware and to create harder problems, forming a virtuous cycle in which difficulty and capability rise together.
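A hedged sketch of what adding the interpreter could look like: the Executor emits code, the code runs in a fresh interpreter, and the output is fed back before generation continues. The `<code>` tag convention and the `generate` callable are assumptions for illustration, not the paper's interface.

```python
import re
import subprocess

# Assumed tool-call convention: the model wraps code in <code> ... </code> tags.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def solve_with_interpreter(generate, prompt, max_turns=4, timeout=10):
    """`generate` is any text-in/text-out callable wrapping the Executor LLM."""
    context = prompt
    for _ in range(max_turns):
        reply = generate(context)
        context += reply
        match = CODE_TAG.search(reply)
        if match is None:                 # no tool call: treat the reply as final
            return context
        # Execute the emitted snippet in a fresh interpreter, then feed the
        # stdout (or error) back into the context before the model continues.
        run = subprocess.run(["python", "-c", match.group(1)],
                             capture_output=True, text=True, timeout=timeout)
        context += f"\n[interpreter output]\n{run.stdout or run.stderr}\n"
    return context
```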
Reward/Learning Design:
- Curriculum rewards: (1) the Executor's self-consistency-based uncertainty (maximal when p̂ is near 0.5), (2) tool-usage frequency, and (3) a repetition penalty that encourages diversity.
- Executor: filters tasks by p̂ to learn only from data that is neither too easy nor too hard, uses majority-vote pseudo-labels, and applies ambiguity-aware ADPO to modulate update strength and reduce label noise.
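A numerical sketch of these two pieces, assuming p̂ is the majority-vote fraction over k sampled answers; the term weights, the 0.25–0.75 band, and the confidence-based update weight (a stand-in for ADPO's actual objective) are illustrative assumptions, not the paper's values.

```python
from collections import Counter

def curriculum_reward(answers, tool_calls, is_repeat, w_tool=0.2, w_repeat=0.5):
    """Reward for one proposed task, computed from k sampled Executor answers."""
    counts = Counter(answers)
    p_hat = counts.most_common(1)[0][1] / len(answers)   # self-consistency rate
    uncertainty = 1.0 - 2.0 * abs(p_hat - 0.5)           # peaks at p_hat = 0.5
    tool_bonus = w_tool * min(tool_calls / len(answers), 1.0)
    penalty = w_repeat if is_repeat else 0.0             # diversity pressure
    return uncertainty + tool_bonus - penalty

def executor_training_example(answers, low=0.25, high=0.75):
    """Keep mid-difficulty tasks; weight the update by pseudo-label confidence."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]              # majority-vote pseudo-label
    p_hat = votes / len(answers)
    if not (low <= p_hat <= high):                       # too easy or too hard: skip
        return None
    weight = p_hat                                       # ambiguity-aware update strength
    return {"pseudo_label": label, "weight": weight}
```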

Seonglae Cho