CLT (Cross-Layer Transcoder)

Unlike the Crosscoder, which only shares latent dimensions across layers, the CLT pairs each feature's single encoder with multiple decoders that write to subsequent layers. While the PLT (Per-Layer Transcoder) was trained to mimic each layer's MLP input-to-output function in order to learn causality, the CLT scales this approach with n encoder-decoder groups: it keeps one encoder per layer, but gives each layer's features decoders for the MLP outputs of that layer and every subsequent layer (including itself), capturing much more diverse cross-layer causality. With n encoders trained, this yields a triangular, "half fully-connected" decoder structure; various pruning techniques are then combined with correlation-based importance scores and metrics like TWERA and ERA to obtain the final Attribution Graph. Causality is subsequently verified through patching, though the graph construction itself is not based on patching-derived causality.
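
A minimal sketch of this architecture, assuming a toy PyTorch setup: one encoder per layer reads that layer's MLP input, and each source layer owns decoders for its own and every subsequent layer's MLP output, giving the triangular decoder structure described above. All names and dimensions here are illustrative, and plain ReLU stands in for the JumpReLU nonlinearity used in the paper.

```python
import torch
import torch.nn as nn


class CrossLayerTranscoder(nn.Module):
    """Illustrative CLT: n_layers encoders, triangular set of decoders."""

    def __init__(self, n_layers: int, d_model: int, n_feat: int):
        super().__init__()
        self.n_layers = n_layers
        # One encoder per layer, reading that layer's MLP input.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, n_feat) * 0.01)
        # One decoder per (source, target) pair with source <= target:
        # features at layer l write to the MLP outputs of layers l..n-1,
        # the "half fully-connected" structure.
        self.W_dec = nn.ParameterDict({
            f"{src}_{tgt}": nn.Parameter(torch.randn(n_feat, d_model) * 0.01)
            for src in range(n_layers) for tgt in range(src, n_layers)
        })

    def forward(self, mlp_inputs: torch.Tensor):
        # mlp_inputs: (n_layers, batch, d_model), the per-layer MLP inputs.
        # ReLU for simplicity; the paper uses JumpReLU.
        acts = [torch.relu(mlp_inputs[l] @ self.W_enc[l])
                for l in range(self.n_layers)]
        # Each layer's reconstructed MLP output sums contributions from the
        # features of its own and all earlier layers.
        recons = [
            sum(acts[src] @ self.W_dec[f"{src}_{tgt}"] for src in range(tgt + 1))
            for tgt in range(self.n_layers)
        ]
        return acts, recons
```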

Decoder sparsity loss
Building on the Crosscoder's decoder sparsity loss, the CLT shapes its sparsity penalty with tanh, applied to each feature's activation scaled by its decoder norm: tanh behaves linearly near 0 and saturates to 1 for larger values, so strongly active features are not shrunk further, which gives appropriate regularization and stabilizes training.
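
A minimal sketch of this penalty, assuming it takes the form λ · Σ tanh(c · ‖W_dec‖ · a), where the norm is taken over each feature's concatenated cross-layer decoders; `lam` and `c` are hypothetical hyperparameter names.

```python
import torch


def clt_sparsity_loss(acts: torch.Tensor, dec_norms: torch.Tensor,
                      lam: float = 1e-3, c: float = 4.0) -> torch.Tensor:
    """acts: (batch, n_feat) non-negative feature activations.
    dec_norms: (n_feat,) L2 norms of each feature's concatenated decoders."""
    # tanh is ~linear near 0 (a gentle L1-like pull on small activations)
    # and saturates to 1 for large values (approximating an L0 count),
    # so features that are already strongly active are not shrunk further.
    return lam * torch.tanh(c * dec_norms * acts).sum(dim=-1).mean()
```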

Seonglae Cho