Control RL POC

Recently, I ran a minimal proof-of-concept (POC) experiment to improve the MMLU score of Gemma-2B (https://github.com/seonglae/steer-rl). The best policy appears to mainly mitigate simple hallucinations rather than improve reasoning performance. The likely cause is the huge action space, which makes exploration difficult. The setup needs better tuning and tighter constraints to learn effectively.

Hello!

My name is Seonglae Cho, and I am an MSc student at UCL. I am preparing my thesis on steering LLMs with RL over a sparse autoencoder (SAE). I was inspired by the AutoSteer work and am extending it to unlearning and general reasoning tasks. The idea is that the reward comes from the task, and the observation is the hidden state (residual stream) extracted from the LLM, which acts as the environment. The policy then decides how to manipulate the SAE dictionary for steering. The main challenge is the large action space, which grows with the dictionary size and makes exploration difficult.
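To make the observation/action loop concrete, here is a minimal sketch in Python with NumPy. It is a toy illustration, not the code in the repository: the sizes, the random stand-in for the SAE decoder (`W_dec`), the linear policy (`W_policy`), and the helper names `act` and `steer` are all assumptions made for this example. Restricting the action to the top-k features is likewise only one possible way to shrink the action space, not something the experiment is known to use. The task reward is not modeled here; in the real setup it would come from evaluating the steered model on the task.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 16      # toy residual-stream width (hypothetical)
N_FEATURES = 64   # toy SAE dictionary size (hypothetical)

# Stand-in for a trained SAE decoder: one unit-norm direction per feature.
W_dec = rng.normal(size=(N_FEATURES, D_MODEL))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Toy linear policy: observation (hidden state) -> one score per feature.
W_policy = rng.normal(scale=0.1, size=(D_MODEL, N_FEATURES))


def act(hidden, k=4, scale=1.0):
    """Return a sparse coefficient vector over the SAE dictionary.

    Only the k features with the largest absolute score get a
    non-zero coefficient (an assumed constraint on the action space).
    """
    scores = hidden @ W_policy
    top = np.argsort(-np.abs(scores))[:k]
    coeffs = np.zeros(N_FEATURES)
    coeffs[top] = scale * np.tanh(scores[top])
    return coeffs


def steer(hidden, coeffs):
    """Add the selected dictionary directions to the residual stream."""
    return hidden + coeffs @ W_dec


hidden = rng.normal(size=D_MODEL)   # observation from the LLM (toy)
coeffs = act(hidden)                # action chosen by the policy
steered = steer(hidden, coeffs)     # hidden state passed back to the LLM
```

With a dense action the policy would have to explore all `N_FEATURES` coefficients at once. The top-k mask in this sketch cuts each action down to `k` active features.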

