Motivation
- Brain analogy
- Emergent misalignment
Research Objective
- Task Circuit Discovery
- RLVR
- Practical Interpretability
Scientific Contributions
- Extending Steering
- Training Method
Method
- RL amplification (with citation)
Results
Future Work
- pretraining based on follow activation making non selected
- Local CorrSteer: GRPO-like normalization for correlation estimation
- Token entropy reward
Seonglae Cho