Dr. Zero

Creator: Seonglae Cho
Created: 2026 Jan 29 10:44
Edited: 2026 Feb 3 14:28
As human-generated QA data becomes expensive and scarce, self-evolution without training data, where agents create and solve their own problems to improve, has become critical. However, search-based multi-turn agents face three issues: (i) limited question diversity (a bias toward one-hop questions), (ii) difficulty in scaling question difficulty into a curriculum, and (iii) excessive computational cost from GRPO's nested sampling.
Dr. Zero / DeepResearch-Zero: Two agents (Proposer–Solver) start from the same base LLM. The Proposer interleaves search to generate questions that are "verifiable, challenging yet solvable." The Solver attempts these questions and updates via reinforcement learning. As the Solver improves, the Proposer is incentivized to create harder questions, forming an automatic curriculum loop.
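A minimal toy sketch of this incentive, not the paper's exact objective: the Proposer's reward is assumed to peak when the Solver succeeds only part of the time ("challenging yet solvable"), so both trivial and unsolvable questions earn little. The reward shape, the stand-in "difficulty" variable, and the success model are all hypothetical.

```python
import random

def proposer_reward(solve_rate: float) -> float:
    """Hypothetical shaping: reward peaks when the Solver succeeds about half
    the time and drops to zero for questions that are trivial (solve_rate ~ 1)
    or unsolvable (solve_rate ~ 0). This shape is an assumption for
    illustration, not the paper's formula."""
    return max(0.0, 1.0 - abs(solve_rate - 0.5) * 2.0)

# Toy stand-ins: each "question" is just a difficulty level, and the "Solver"
# succeeds with probability 1 - difficulty. Real agents would interleave
# retrieval, generate text, and verify answers against source documents.
random.seed(0)
n_attempts = 8
for difficulty in (0.1, 0.5, 0.9):
    successes = sum(random.random() > difficulty for _ in range(n_attempts))
    solve_rate = successes / n_attempts
    print(f"difficulty={difficulty:.1f}  solve_rate={solve_rate:.2f}  "
          f"proposer_reward={proposer_reward(solve_rate):.2f}")
```

Under this kind of reward, questions the current Solver can barely handle are favored, which is what pushes the Proposer toward progressively harder questions as the Solver improves.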
To reduce computational cost, they propose HRPO (hop-grouped RPO): questions with similar structure (e.g., the same hop count, from 1-hop to 4-hop) are grouped together, and advantages are computed against group-level baselines. This avoids GRPO's nested sampling of multiple answers for each of multiple questions while maintaining training stability.
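A minimal sketch of the group-baseline idea as described above, assuming binary rewards, hop count as the grouping key, and std-normalized advantages; the field names and normalization details are assumptions, not the paper's exact estimator. Each question contributes only a single sampled rollout, with the baseline shared across its hop group instead of being estimated from multiple answers per question.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hop_grouped_advantages(rollouts, eps=1e-6):
    """Group-baseline advantage estimation: rollouts for questions with the
    same hop count share a baseline, so one sampled answer per question
    suffices (no per-question answer group as in GRPO).

    rollouts: list of dicts with keys "hops" (int) and "reward" (float)."""
    by_hops = defaultdict(list)
    for r in rollouts:
        by_hops[r["hops"]].append(r["reward"])

    advantages = []
    for r in rollouts:
        rewards = by_hops[r["hops"]]
        baseline = mean(rewards)            # group-level baseline
        scale = pstdev(rewards) + eps       # avoid division by zero for uniform groups
        advantages.append((r["reward"] - baseline) / scale)
    return advantages

# Example: six single-sample rollouts, grouped into 2-hop and 4-hop questions.
rollouts = [
    {"hops": 2, "reward": 1.0}, {"hops": 2, "reward": 0.0}, {"hops": 2, "reward": 1.0},
    {"hops": 4, "reward": 0.0}, {"hops": 4, "reward": 1.0}, {"hops": 4, "reward": 0.0},
]
print(hop_grouped_advantages(rollouts))
```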
Dr. Zero: Self-Evolving Search Agents without Training Data
As high-quality data becomes increasingly difficult to obtain, data-free self-evolution has emerged as a promising paradigm. This approach allows large language models (LLMs) to autonomously...