Latent Feature Level Steering
- Top-1 influential layer identification: Train a classifier to predict consistency from the hidden states of paraphrase pairs → Select the layer with the highest influence
- Feature selection: Compute feature differences between paraphrase pairs that yield correct vs. incorrect answers → Select the features whose difference exceeds a threshold
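
The two steps above can be sketched on toy data. This is only an illustration under several assumptions that are not taken from the paper: each paraphrase pair is represented by a synthetic vector standing in for the elementwise difference of its two hidden states, a single binary label per pair stands in for both "consistent" and "correct", a nearest-centroid probe replaces whatever classifier the authors train, held-out probe accuracy is used as a proxy for a layer's "influence", and `THRESHOLD`, `SIGNAL_LAYER`, and `SIGNAL_FEATS` are hypothetical values chosen for the demo.

```python
import random

random.seed(0)
N_LAYERS, DIM, N_PAIRS = 4, 6, 200        # toy sizes (hypothetical)
SIGNAL_LAYER, SIGNAL_FEATS = 2, (1, 4)    # signal planted for the demo only
THRESHOLD = 1.0                           # hypothetical selection threshold

# One label per pair: 1 = consistent/correct, 0 = otherwise (simplification).
labels = [i % 2 for i in range(N_PAIRS)]


def make_pair_features(layer, label):
    """Synthetic stand-in for |h(a) - h(b)| of one paraphrase pair at one layer."""
    v = [abs(random.gauss(0, 1)) for _ in range(DIM)]
    if layer == SIGNAL_LAYER and label == 0:
        for j in SIGNAL_FEATS:
            v[j] += 3.0                   # inconsistent pairs diverge on these features
    return v


# feats[layer][pair] is the feature vector of that pair at that layer.
feats = [[make_pair_features(l, y) for y in labels] for l in range(N_LAYERS)]


def centroid(rows):
    """Column-wise mean of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def probe_accuracy(X, y):
    """Fit a nearest-centroid probe on the first half, score it on the second half."""
    half = len(X) // 2
    cents = {k: centroid([x for x, t in zip(X[:half], y[:half]) if t == k])
             for k in (0, 1)}

    def predict(x):
        dist = {k: sum((a - b) ** 2 for a, b in zip(x, cents[k])) for k in (0, 1)}
        return min(dist, key=dist.get)

    hits = sum(predict(x) == t for x, t in zip(X[half:], y[half:]))
    return hits / (len(X) - half)


# Step 1: pick the layer whose probe predicts the label best.
accuracies = [probe_accuracy(feats[l], labels) for l in range(N_LAYERS)]
best_layer = max(range(N_LAYERS), key=accuracies.__getitem__)

# Step 2: per-feature gap between group means at that layer, then threshold.
X = feats[best_layer]
mu_pos = centroid([x for x, t in zip(X, labels) if t == 1])
mu_neg = centroid([x for x, t in zip(X, labels) if t == 0])
gap = [abs(a - b) for a, b in zip(mu_pos, mu_neg)]
selected = [j for j, g in enumerate(gap) if g > THRESHOLD]

print("best layer:", best_layer, "selected features:", selected)
```

On this toy data the probe should recover the planted layer and the thresholding step the planted features. With real hidden states the features, the classifier, and the influence measure would follow the paper rather than these stand-ins.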
https://arxiv.org/pdf/2501.11036v2

Seonglae Cho