Deep Causal Transcoding
Like a transcoder, but the goal is steering rather than reconstruction: it extracts latent behavior vectors
- Jailbreak vector
Also considers non-linear relations
- Jacobian
- Hessian
- Exponential DCT
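
A minimal sketch of how the Jacobian item might be read. It assumes the first-order variant takes the top right singular vector of the Jacobian of later-layer activations with respect to an earlier-layer perturbation as a candidate steering direction. The toy map `make_toy_slice`, the finite-difference `jacobian`, and the power-iteration helper are all hypothetical stand-ins for illustration, not the post's implementation, and no real model is involved.

```python
import math
import random


def make_toy_slice(d_in, d_hidden, d_out, seed=0):
    """Toy nonlinear map standing in for a slice of model layers (hypothetical)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]

    def f(theta):
        # theta plays the role of a perturbation added at the source layer.
        h = [math.tanh(sum(w * t for w, t in zip(row, theta))) for row in w1]
        return [sum(w * x for w, x in zip(row, h)) for row in w2]

    return f


def jacobian(f, theta, eps=1e-5):
    """Central finite-difference Jacobian of f at theta, shape (d_out, d_in)."""
    cols = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += eps
        minus[j] -= eps
        cols.append([(a - b) / (2 * eps) for a, b in zip(f(plus), f(minus))])
    # cols[j] is column j of the Jacobian; transpose into row-major form.
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(cols[0]))]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rmatvec(m, v):
    """Compute m^T v."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_right_singular_vector(jac, iters=200, seed=0):
    """Power iteration on J^T J: returns (unit input direction, singular value)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in jac[0]]
    n = norm(v)
    v = [x / n for x in v]
    for _ in range(iters):
        w = rmatvec(jac, matvec(jac, v))
        n = norm(w)
        v = [x / n for x in w]
    return v, norm(matvec(jac, v))


# Usage: the input direction whose small perturbation changes the output most.
f = make_toy_slice(d_in=4, d_hidden=8, d_out=4)
J = jacobian(f, [0.0] * 4)
v, sigma = top_right_singular_vector(J)
```

The Jacobian only captures the first-order (linear) effect of a perturbation. The Hessian and exponential variants listed above would go beyond this, and they are not covered by this sketch.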
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models — LessWrong
Based off research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from t…
https://www.lesswrong.com/posts/fSRg5qs9TPbNy3sm5/deep-causal-transcoding-a-framework-for-mechanistically

Seonglae Cho