Dynamic steering adaptation + simple gating / simple feature choice
First, the abstract.tex hook sentence.
Many sidecar networks are emerging that attach to an LLM's activations for special purposes. Sparse autoencoders decompose the activations, and natural language autoencoders interpret the model. We could instead train a sidecar model specialized for steering: its fundamental purpose is steering without breaking the model, rather than interpretability. Steering Autoencoder: reconstruct the model's activations while refining steering capability. The model is incentivized to find directions that minimally degrade the model's intelligence while maximizing semantic change.
Just as LoRA attaches to the parameter space, we can attach a model to the activations alone to steer the model.
Motivation: CorrSteer is trained with no steering applied, but it is tested with steering applied. What about steering features during training and searching for the optimal features directly?
- instead of picking a few features per layer, should we select features by global correlation only?
A PoC has already verified that this improves performance.
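The global-correlation idea above can be sketched as follows. This is a minimal illustration, not the CorrSteer implementation: it assumes each feature has one pooled activation per sample and a binary correctness label, and the names `pearson`, `select_global_top_k`, and the toy data are hypothetical. All features from all layers are ranked in a single pool, so one layer may contribute several features and another none.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_global_top_k(acts, correct, k):
    """Rank every (layer, feature) pair in one pool by correlation with correctness.

    acts:    {(layer, feature_idx): [pooled activation per sample]}  (assumed layout)
    correct: [0 or 1 per sample]
    Returns the k most positively correlated features as ((layer, idx), r) pairs.
    """
    scored = [(pearson(values, correct), key) for key, values in acts.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(key, r) for r, key in scored[:k]]

# Toy data: four samples, two layers, two features per layer.
acts = {
    (0, 3): [0.1, 0.9, 0.2, 0.8],
    (0, 5): [0.0, 1.0, 0.0, 1.0],
    (1, 2): [0.9, 0.1, 0.8, 0.3],
    (1, 7): [0.5, 0.5, 0.5, 0.5],
}
correct = [0, 1, 0, 1]
top = select_global_top_k(acts, correct, k=2)
```

In this toy case both selected features come from layer 0, which a fixed per-layer quota would not allow.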
On top of this, add a gate so that steering is not applied to the selected features when the model's stability is low or its confidence is high.
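A minimal sketch of such a gate, under stated assumptions: confidence is taken to be the top-1 softmax probability of the next-token logits, and `stability` is a hypothetical externally supplied score in [0, 1], since the notes do not define it. The thresholds `conf_max` and `stab_min` are illustrative, not tuned values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steering_gate(logits, stability, conf_max=0.9, stab_min=0.5):
    """Return 1.0 to apply steering and 0.0 to skip it.

    Steering is skipped when the model is already confident (top-1 probability
    at or above conf_max) or when the assumed stability score is below stab_min.
    """
    confidence = max(softmax(logits))
    if confidence >= conf_max or stability < stab_min:
        return 0.0
    return 1.0

def apply_steering(activation, direction, coeff, gate):
    """Add the gated, scaled steering direction to an activation vector."""
    return [a + gate * coeff * d for a, d in zip(activation, direction)]

# A confident prediction closes the gate; an uncertain, stable one opens it.
gate_confident = steering_gate([10.0, 0.0, 0.0], stability=0.9)
gate_uncertain = steering_gate([1.0, 0.9, 0.8], stability=0.9)
steered = apply_steering([1.0, 2.0], [0.5, -0.5], coeff=2.0, gate=gate_uncertain)
```

A hard 0/1 gate is the simplest choice; a soft gate (for example a sigmoid of the confidence margin) would be a drop-in replacement for the return value.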
How to select different features even within the same sample is still an open question. The gating problem above can be handled in several ways, but feature selection remains unresolved.
- a neural-network approach, but unsupervised
- apply CorrSteer to each token branch using a few samples
- the prior is over the sample or activation: p(feature | activation) that maximizes accuracy. This is currently trained with RL, but with a Bayesian approach we can go from p(reward | activation) and p(reward | feature, activation) to p(feature | reward, activation) via Bayes' rule
- inductive bias to measure correlation between activation and correctness
- maybe we still need a network?
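The Bayesian bullet above amounts to p(feature | reward, activation) = p(reward | feature, activation) p(feature | activation) / p(reward | activation), where the denominator is the sum of the numerator over features. A minimal sketch for one fixed activation context and a discrete feature set follows. The function name, feature labels, and probabilities are hypothetical, and how the prior and the reward likelihood would be estimated is left open, as in the notes.

```python
def feature_posterior(prior, reward_lik):
    """Bayes' rule over a discrete feature set for one activation context.

    prior:      {feature: p(feature | activation)}
    reward_lik: {feature: p(reward=1 | feature, activation)}
    Returns (posterior, evidence), where posterior is
    {feature: p(feature | reward=1, activation)} and evidence is
    p(reward=1 | activation).
    """
    evidence = sum(prior[f] * reward_lik[f] for f in prior)
    if evidence == 0:
        raise ValueError("no feature has non-zero probability of reward")
    posterior = {f: prior[f] * reward_lik[f] / evidence for f in prior}
    return posterior, evidence

# Toy numbers: "f12" is a priori the most likely feature but rarely yields reward.
prior = {"f12": 0.5, "f40": 0.25, "f77": 0.25}
reward_lik = {"f12": 0.2, "f40": 0.8, "f77": 0.4}
posterior, evidence = feature_posterior(prior, reward_lik)
best_feature = max(posterior, key=posterior.get)
```

Conditioning on reward shifts the mass from the a priori favorite to the feature that is most often rewarded in this context.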
token manifold?
We need to model, or at least analyze, the continuous distribution shift as a function of the steering coefficient.
remove SAE dependency
realtime
- PLS Regression based CorrSteer
dynamic (different feature)
conditional (some token)
interpretable
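One reading of the PLS-regression item above, sketched under assumptions: with a single target (correctness), the first PLS1 weight vector is the normalized cross-covariance between the centered activations and the centered target, w ∝ Xᵀy. It is the unit direction whose projection has maximal covariance with correctness, and it is computed on raw activations, so no SAE is involved. The function names and toy data are hypothetical, and only the first component is computed (no deflation).

```python
import math

def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def pls1_first_direction(X, y):
    """First PLS1 weight vector for activations X and a scalar target y.

    X: list of samples, each a list of d raw activation values
    y: list of correctness scores, one per sample
    Returns the unit vector proportional to X_c^T y_c (centered X and y).
    """
    d = len(X[0])
    columns = [center([row[j] for row in X]) for j in range(d)]
    y_centered = center(y)
    w = [sum(c * t for c, t in zip(col, y_centered)) for col in columns]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("activations have zero covariance with the target")
    return [v / norm for v in w]

# Toy data: dim 0 tracks correctness, dim 1 is constant, dim 2 is anti-correlated.
X = [
    [0.0, 5.0, 1.0],
    [1.0, 5.0, 0.0],
    [0.0, 5.0, 1.0],
    [1.0, 5.0, 0.0],
]
y = [0.0, 1.0, 0.0, 1.0]
direction = pls1_first_direction(X, y)
```

The resulting `direction` could then serve as a candidate steering vector; whether it steers well in practice is an empirical question these notes leave open.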
AI Overthinking Manifold Steering
Seonglae Cho