Self-play has worked well in RL, but it hasn't worked as well for LLMs so far, mainly because the self-play process doesn't involve reality. The core of self-play is that agents improve through direct interaction with an environment, so what matters is whether a real environment is in the loop. Unlike Go, where the environment can be fully captured by fixed rules, LLMs only touch the limited slice of reality that is reachable through the interface of natural language.
On top of this, the diversity of the models themselves is limited. Most modern LLMs are pretrained on nearly identical web datasets and use that data in much the same way. RL then adds preferences, but it mostly shifts preferences within the same structure without meaningfully expanding the slice of reality the model interacts with. So unlike Go, where the entire environment is in the loop, we only include the parts of the world that touch natural language, which means the parts not accessed through language or vision also need to be brought into this loop.
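To make the contrast concrete, here is a minimal sketch, not from the source: the callables `env_step`, `generator`, and `critic` are hypothetical stand-ins. In a rule-grounded game the environment itself returns the outcome, while in language-only self-play the reward comes from another model, so nothing outside the models grounds the signal.

```python
def grounded_self_play(policy_a, policy_b, env_step, episodes=100):
    """Self-play where a fixed-rule environment returns the true outcome."""
    results = []
    for _ in range(episodes):
        state, reward, done = 0, 0, False
        while not done:
            move_a = policy_a(state)
            move_b = policy_b(state)
            # The environment (fixed rules, like Go) decides the result.
            state, reward, done = env_step(state, move_a, move_b)
        results.append(reward)  # ground-truth signal from reality
    return results


def language_only_self_play(generator, critic, prompts):
    """'Self-play' where feedback comes only from another model's judgment."""
    scores = []
    for prompt in prompts:
        answer = generator(prompt)
        # The critic is itself a model; nothing outside the models grounds
        # this score, so shared blind spots can reinforce themselves.
        scores.append(critic(prompt, answer))
    return scores
```

The design point of the sketch is only the location of the reward: in the first loop it comes from the environment's rules, in the second it never leaves the models.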
Self Play Methods
Ilya Sutskever 2025
AGI is intelligence that can learn to do anything. Gradualism is an inherent component of any deployment plan for AGI, because predictions typically fail to account for how the future actually arrives, which is gradually. The real difference between plans lies in what to release first.
The term AGI itself was born as a reaction to earlier criticisms of narrow AI; it was needed to describe the end state of AI. Pre-training became the keyword for a new kind of generalization and had a strong influence. The fact that RL is currently task-specific is part of the process of erasing this imprint of generality. After all, humans don't memorize all information the way pre-training does. Rather, human intelligence is well optimized for Continual Learning, adapting to anything while managing the Complexity-Robustness Tradeoff.

Seonglae Cho