Self-play has worked well in RL, but so far it has not worked as well for LLMs, and the main reason is that the self-play loop does not include reality. The core of self-play is that multiple agents improve through choices made by directly interacting with an environment, so what matters is whether that environment is actually inside the loop. Go can be simulated exactly because its rules are fixed; LLMs, by contrast, touch reality only through the narrow interface of natural language.
On top of this, the diversity of the models themselves is limited. Most modern LLMs are pretrained on nearly identical web datasets, used in much the same way. RL adds preferences, but it mostly reshapes preferences within the same structure rather than meaningfully expanding the slice of reality the model interacts with. So unlike Go, where the entire environment is in the loop, we include only the parts of the world reachable through natural language, which means the loop would also need to incorporate the parts of reality that are not accessed through language or vision.
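A minimal Python sketch of this structural difference (hypothetical names and toy rules, not from the source): in the game case the fixed rules of the environment decide the outcome inside the loop, while in the language case the "environment" is just another learned model, so the loop never touches reality.

```python
import random

# Self-play with a real environment: the fixed rules decide the outcome.
def play_fixed_rules_game(policy_a, policy_b):
    """Two agents play a toy fixed-rules game ("pick the larger number").
    The environment (the comparison rule) is fully inside the loop, so the
    result is determined by the rules themselves, not by another model."""
    move_a, move_b = policy_a(), policy_b()
    if move_a == move_b:
        return 0                             # draw
    return 1 if move_a > move_b else -1      # ground-truth rule decides

# "Self-play" for language: the environment is replaced by another model.
def play_language_game(generator, judge):
    """One model produces text, another model scores it. Nothing here is
    grounded in the external world; the loop only sees what the judge model
    already encodes, so it cannot expand the slice of reality it touches."""
    text = generator()
    return judge(text)                       # a model's opinion, not a rule of reality

if __name__ == "__main__":
    rng = random.Random(0)
    policy = lambda: rng.randint(1, 9)
    print("game outcome:", play_fixed_rules_game(policy, policy))

    generator = lambda: "a plausible-sounding answer"
    judge = lambda text: len(text) / 100.0   # stand-in preference model
    print("language score:", play_language_game(generator, judge))
```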
Self Play Methods
Ilya Sutskever 2025

Seonglae Cho