Token Entropy (Qwen)
Token entropy represents the flatness of the next-token probability distribution and indicates whether a position is a reasoning branch point. In Chain-of-Thought (CoT) generation, roughly 80% of tokens had low entropy while 20% had high entropy. RLVR training largely preserves the token-entropy patterns of the base model, mainly adjusting only the high-entropy tokens, suggesting that controlling branch points is sufficient for reaching correct answers. This was verified experimentally: updating policy gradients using only the top 20% highest-entropy tokens maintained or improved reasoning performance compared to using all tokens.
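A minimal sketch of the selection step: compute per-position Shannon entropy from the logits, then keep only the top 20% highest-entropy positions for the policy-gradient update. The function names and the masking convention are illustrative, not the paper's actual code.

```python
import numpy as np

def token_entropy(logits):
    # softmax over the vocab, then Shannon entropy per token position
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def high_entropy_mask(logits, top_frac=0.2):
    # keep only the top-`top_frac` highest-entropy (forking) positions
    ent = token_entropy(logits)
    k = max(1, int(len(ent) * top_frac))
    thresh = np.sort(ent)[-k]
    return ent >= thresh

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 50))  # 10 token positions, vocab of 50
mask = high_entropy_mask(logits)
# the PG loss would then be averaged over masked positions only:
# loss = (mask * per_token_pg_loss).sum() / mask.sum()
```

Low-entropy positions contribute zero gradient, so only the branch points drive the update.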
Applied by adding an entropy-based bonus to the advantage of every token. Shares the observation that forking (high-entropy) tokens mark reasoning branch points.
DeepConf calculates group confidence by bundling tokens into window-sized groups (e.g., the most recent 2k tokens) rather than scoring individual tokens. It discards low-confidence traces and votes only with the high-confidence ones.
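A sketch of the filter-then-vote idea under assumptions: each trace is scored by the minimum of its sliding-window mean confidence, the top fraction of traces is kept, and the answer is chosen by majority vote. The window size, the min-over-windows scoring, and `keep_frac` are illustrative choices, not necessarily DeepConf's exact settings.

```python
import numpy as np

def group_confidence(token_confidences, window=2048):
    # sliding-window mean of per-token confidences over the trace
    c = np.asarray(token_confidences, dtype=float)
    w = min(window, len(c))
    kernel = np.ones(w) / w
    return np.convolve(c, kernel, mode="valid")

def filter_and_vote(traces, answers, keep_frac=0.5, window=2048):
    # score each trace by its worst window, keep the top fraction,
    # then majority-vote over the surviving traces' answers
    scores = [group_confidence(t, window).min() for t in traces]
    order = np.argsort(scores)[::-1]
    k = max(1, int(len(traces) * keep_frac))
    kept = [answers[i] for i in order[:k]]
    vals, counts = np.unique(kept, return_counts=True)
    return vals[np.argmax(counts)]

voted = filter_and_vote(
    [[0.9] * 10, [0.9] * 10, [0.1] * 10, [0.1] * 10],
    ["A", "A", "B", "B"],
)
```

Windowed scoring lets a single low-confidence stretch sink a trace even when its overall average looks fine, which is the point of grouping rather than averaging over the whole trace.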