Two agents starting from the same base LLM are co-evolved:
- Curriculum Agent: Uses RL to generate frontier tasks where the Executor has the highest uncertainty
- Executor Agent: Learns to solve those tasks via RL
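A minimal sketch of this co-evolution loop, assuming hypothetical callbacks (`propose`, `estimate_success_rate`, `update_curriculum`, `update_executor`) and an illustrative difficulty band; none of these names or values come from the paper.

```python
def co_evolve(propose, estimate_success_rate, update_curriculum, update_executor,
              n_rounds=10, tasks_per_round=256, band=(0.25, 0.75)):
    """One co-evolution schedule; callers supply the four callbacks."""
    low, high = band
    for _ in range(n_rounds):
        # Curriculum proposes candidate tasks; its reward peaks where the
        # Executor's empirical success rate p_hat is closest to 0.5.
        tasks = propose(tasks_per_round)
        p_hats = [estimate_success_rate(t) for t in tasks]
        rewards = [1.0 - 2.0 * abs(p - 0.5) for p in p_hats]
        update_curriculum(tasks, rewards)

        # Executor trains only on frontier tasks that are neither trivially
        # easy nor hopelessly hard (the band is an illustrative choice).
        frontier = [t for t, p in zip(tasks, p_hats) if low <= p <= high]
        update_executor(frontier)
```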
Tool integration as a growth engine: when a code-interpreter tool is added to the Executor, its problem-solving ability improves; this pressures the Curriculum to become more tool-aware and to create harder problems, forming a virtuous cycle in which difficulty and capability rise together.
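A hedged sketch of what adding the interpreter could look like: the Executor emits code, the code runs in a fresh interpreter, and the output is fed back before generation continues. The `<code>` tag convention and the `generate` callable are assumptions for illustration, not the paper's interface.

```python
import re
import subprocess

# Assumed tool-call convention: the model wraps code in <code> ... </code> tags.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def solve_with_interpreter(generate, prompt, max_turns=4, timeout=10):
    """`generate` is any text-in/text-out callable wrapping the Executor LLM."""
    context = prompt
    for _ in range(max_turns):
        reply = generate(context)
        context += reply
        match = CODE_TAG.search(reply)
        if match is None:                 # no tool call: treat the reply as final
            return context
        # Execute the emitted snippet in a fresh interpreter, then feed the
        # stdout (or error) back into the context before the model continues.
        run = subprocess.run(["python", "-c", match.group(1)],
                             capture_output=True, text=True, timeout=timeout)
        context += f"\n[interpreter output]\n{run.stdout or run.stderr}\n"
    return context
```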
Reward/Learning Design:
- Curriculum rewards: (1) the Executor's self-consistency-based uncertainty (maximal when p̂ is near 0.5), (2) tool-usage frequency, and (3) a repetition penalty that encourages diversity.
- Executor: filters tasks by p̂ to learn only from data that is neither too easy nor too hard, uses majority-vote pseudo-labels, and applies ambiguity-aware ADPO to modulate update strength and reduce label noise.
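A numerical sketch of these two pieces, assuming p̂ is the majority-vote fraction over k sampled answers; the term weights, the 0.25–0.75 band, and the confidence-based update weight (a stand-in for ADPO's actual objective) are illustrative assumptions, not the paper's values.

```python
from collections import Counter

def curriculum_reward(answers, tool_calls, is_repeat, w_tool=0.2, w_repeat=0.5):
    """Reward for one proposed task, computed from k sampled Executor answers."""
    counts = Counter(answers)
    p_hat = counts.most_common(1)[0][1] / len(answers)   # self-consistency rate
    uncertainty = 1.0 - 2.0 * abs(p_hat - 0.5)           # peaks at p_hat = 0.5
    tool_bonus = w_tool * min(tool_calls / len(answers), 1.0)
    penalty = w_repeat if is_repeat else 0.0             # diversity pressure
    return uncertainty + tool_bonus - penalty

def executor_training_example(answers, low=0.25, high=0.75):
    """Keep mid-difficulty tasks; weight the update by pseudo-label confidence."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]              # majority-vote pseudo-label
    p_hat = votes / len(answers)
    if not (low <= p_hat <= high):                       # too easy or too hard: skip
        return None
    weight = p_hat                                       # ambiguity-aware update strength
    return {"pseudo_label": label, "weight": weight}
```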

Seonglae Cho