Token Entropy (Qwen)
Token entropy represents the flatness of the next-token probability distribution and indicates whether a position is a reasoning branch point. In Chain of Thought (CoT) generation, about 80% of tokens have low entropy while about 20% have high entropy. RLVR training largely preserves the base model's token-entropy patterns and mainly adjusts only the high-entropy tokens, which suggests that controlling branch points is sufficient for reaching correct answers. This was shown experimentally: restricting policy-gradient updates to only the top 20% highest-entropy tokens maintained or improved reasoning performance compared to updating on all tokens.
https://www.arxiv.org/pdf/2506.01939
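The top-20% experiment can be sketched as below. This is a minimal NumPy sketch, not the paper's code: the function names, the rounding-based threshold, and the simple REINFORCE-style loss are all assumptions for illustration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of each next-token distribution. logits: (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)      # (T,)

def high_entropy_mask(logits, top_frac=0.2):
    """Boolean mask selecting the top-`top_frac` highest-entropy positions."""
    h = token_entropy(logits)
    k = max(1, int(round(top_frac * h.size)))
    threshold = np.sort(h)[-k]
    return h >= threshold

def masked_pg_loss(logp_actions, advantages, mask):
    """REINFORCE-style loss averaged over the retained (high-entropy) tokens."""
    mask = mask.astype(float)
    return -(mask * logp_actions * advantages).sum() / mask.sum()
```

With `top_frac=0.2` only the ~20% of positions where the model is most uncertain (the branch points) contribute to the gradient; the remaining low-entropy tokens are masked out.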
Entropy Advantage
Here entropy is used by adding an entropy-based term to the advantage of every token. It shares the observation that high-entropy forking tokens mark reasoning branch points.
https://www.arxiv.org/pdf/2506.14758
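The shaping idea reduces to one line. A minimal sketch under assumptions: `beta` is a hypothetical scaling coefficient, and the paper may additionally clip or detach the entropy term, so treat this as illustrative only.

```python
import numpy as np

def entropy_shaped_advantage(advantages, entropy, beta=0.1):
    """Add an entropy-based bonus to every token's advantage (sketch).

    Higher-entropy (branching) tokens receive a larger advantage, which
    encourages exploration at reasoning branch points.
    """
    advantages = np.asarray(advantages, dtype=float)
    entropy = np.asarray(entropy, dtype=float)
    return advantages + beta * entropy
```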
DeepConf calculates group confidence over sliding windows of tokens (e.g., the most recent 2k tokens) rather than for individual tokens. It discards low-confidence traces and aggregates votes only from the high-confidence traces.
https://jiaweizzhao.github.io/deepconf/static/pdfs/deepconf_arxiv.pdf
Deep Think with Confidence
Deep Think with Confidence (DeepConf): A simple yet powerful method that significantly improves both reasoning efficiency and performance at test time.
https://jiaweizzhao.github.io/deepconf/
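The windowed filtering-and-voting scheme can be sketched as follows. This is an assumption-laden sketch, not DeepConf's implementation: the per-token confidence scores, the `keep_frac` quantile cutoff, and scoring a trace by its lowest window are illustrative choices.

```python
import numpy as np

def group_confidence(token_confidences, window=2048):
    """Sliding-window (group) confidence over one trace.

    token_confidences: per-token confidence scores (higher = more confident).
    Returns one mean score per window position.
    """
    c = np.asarray(token_confidences, dtype=float)
    if c.size <= window:
        return np.array([c.mean()])
    kernel = np.ones(window) / window
    return np.convolve(c, kernel, mode="valid")

def filter_and_vote(traces, answers, window=2048, keep_frac=0.9):
    """Score each trace by its weakest window, drop the low-confidence
    tail, and majority-vote over the surviving traces' answers."""
    scores = [group_confidence(t, window).min() for t in traces]
    cutoff = np.quantile(scores, 1 - keep_frac)
    kept = [a for a, s in zip(answers, scores) if s >= cutoff]
    votes = {}
    for a in kept:
        votes[a] = votes.get(a, 0) + 1
    return max(votes, key=votes.get)
```

Scoring a trace by its minimum window confidence means a single low-confidence stretch is enough to disqualify it, even if the rest of the trace looks confident.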

Seonglae Cho