WMDP Bench, RLHF, Linear Representation Hypothesis, Alignment Faking, Sparse Feature Circuit, SAEBench, NNSight, Activation Steering
https://scholar.google.com/citations?user=fW7yK10AAAAJ&hl=en