Control RL Future Work

Inference-time alignment

Propose a way to track performance gains; perhaps something along the lines of model diffing?

Switch from a linear layer to deep layers

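If a CoT-trained model is available, is it better to have a model that selects SAE features so that the output follows that model's token distribution? Or could we simply train one model, transcoder-style, that maps the residual stream to the CoT model's residual stream? The latter might not be interpretable.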
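A minimal sketch of the second option, assuming paired residual-stream vectors from the base model and the CoT-trained model are already collected. A plain linear map stands in for the transcoder here (a real transcoder would add a sparse bottleneck), and the names `fit_linear_map`, `true_map`, and the toy data are hypothetical, not from any existing codebase.

```python
import random

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def mse(w, pairs):
    """Mean squared error of the map w over (input, target) pairs."""
    total = 0.0
    for x, y in pairs:
        pred = matvec(w, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
    return total / len(pairs)

def fit_linear_map(pairs, dim, lr=0.05, epochs=200):
    """Fit y ~ W x by per-sample gradient descent on squared error.

    The factor 2 from the squared-error gradient is folded into lr.
    """
    w = [[0.0] * dim for _ in range(dim)]
    for _ in range(epochs):
        for x, y in pairs:
            pred = matvec(w, x)
            for i in range(dim):
                err = pred[i] - y[i]
                for j in range(dim):
                    w[i][j] -= lr * err * x[j]
    return w

# Toy stand-in data (assumption): the "CoT residual" is a fixed linear
# function of the base residual, so a linear map can recover it exactly.
rng = random.Random(0)
DIM = 3
true_map = [[1.0, 0.5, 0.0], [0.0, 1.0, -0.5], [0.2, 0.0, 1.0]]
base_resid = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(32)]
pairs = [(x, matvec(true_map, x)) for x in base_resid]

loss_before = mse([[0.0] * DIM for _ in range(DIM)], pairs)
w = fit_linear_map(pairs, DIM)
loss_after = mse(w, pairs)
```

The learned matrix `w` is dense, which illustrates the interpretability concern: nothing forces its directions to line up with individual SAE features.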

Use the same SAE to compare the CoT-trained model's features with my own features
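One way this comparison could look, as a rough sketch: encode residual vectors from both models with the same SAE encoder, take the top-k active features of each, and measure their overlap. The encoder weights, the ReLU encoder form, and the residual vectors below are toy assumptions, and Jaccard overlap is just one possible metric.

```python
def sae_encode(x, w_enc, b_enc):
    """ReLU(W_enc x + b_enc): one activation per SAE feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

def top_k_features(acts, k):
    """Indices of the k largest strictly positive activations."""
    active = [i for i, a in enumerate(acts) if a > 0.0]
    return set(sorted(active, key=lambda i: acts[i], reverse=True)[:k])

def jaccard(a, b):
    """Overlap between two feature-index sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy shared SAE: 3-dim residual stream, 4 features (made-up weights).
W_ENC = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
B_ENC = [0.0, 0.0, 0.0, -0.5]

resid_cot = [1.0, 0.2, 0.0]   # residual from the CoT-trained model (toy)
resid_mine = [0.9, 0.0, 0.8]  # residual from the controlled model (toy)

top_cot = top_k_features(sae_encode(resid_cot, W_ENC, B_ENC), k=2)
top_mine = top_k_features(sae_encode(resid_mine, W_ENC, B_ENC), k=2)

overlap = jaccard(top_cot, top_mine)
only_cot = top_cot - top_mine    # features the CoT model uses but mine does not
only_mine = top_mine - top_cot   # features mine uses but the CoT model does not
```

The set differences `only_cot` and `only_mine` are the part that connects to the model-diffing idea: they name which features distinguish the two models on a given input.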

Where is the KL term currently applied during GRPO? There is no reference policy; the original LLM would play that role

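The KL term is increasingly being dropped these days, so it may not be needed, but if it has to be included it should probably be over the token distribution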
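If the KL term is kept, a token-distribution version could be sketched as below: the per-position KL between the steered model's next-token distribution and the original LLM's, averaged over the sequence. The function names and logits are illustrative assumptions; this computes the exact KL over the vocabulary, not the sampled estimators that GRPO implementations often use.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(policy_logits, ref_logits):
    """KL(policy || reference) over the vocabulary at one token position."""
    p = softmax(policy_logits)
    q = softmax(ref_logits)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def sequence_kl(policy_seq, ref_seq):
    """Mean per-token KL across a sequence of logit vectors."""
    kls = [token_kl(p, r) for p, r in zip(policy_seq, ref_seq)]
    return sum(kls) / len(kls)

# Toy 4-token vocabulary, 2 positions (made-up logits).
steered = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.0, 0.0, 0.0]]  # with control applied
base = [[1.5, 0.7, 0.1, -0.5], [0.0, 1.0, 0.0, 0.0]]     # original LLM as reference

penalty = sequence_kl(steered, base)
```

In this framing the original, unsteered LLM is the reference policy, so no separate reference model needs to be stored.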

The control model does not necessarily have to be an MLP, so a better control model is possible

