Feature Cluster Resampling

Creator
Seonglae Cho
Created
2025 Mar 23 21:27
Editor
Edited
2025 Mar 23 21:54
Refs

SAE Debugger

  1. Select Text Parts
  1. Remove Unnecessary Features: Apply KL divergence ablation at each layer to remove SAE latent features that have minimal impact on the model's output.
  1. Apply Clustering: Use K-nearest Neighbor clustering on the remaining latent vectors to group similar features.
  1. Resample and Compute Edge Weights: For each feature, resample the ablated latents from the corresponding cluster and compute the MSE to assign weights to the edges between nodes. This involves taking latent values from similar data points within the cluster to estimate the ablated feature. A low MSE indicates that the feature's influence can be substituted by other similar features within the cluster, while a high MSE suggests that the feature plays a unique and important role.
  1. Visualize the Circuit: Construct a visual representation of the resulting circuit, showing nodes (key features) and edges (their causal relationships) for debugging and analysis purposes. The circuit shows which latent features play major roles within the model and what causal connections (influence relationships) exist between them.
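
The pruning, clustering, and resampling steps above can be sketched in NumPy on toy data. This is only an illustration under stated assumptions, not the SAE Debugger implementation: the downstream computation is stood in for by a hypothetical linear `readout` matrix, the "cluster" of a data point is taken to be its `k` nearest neighbors in the space of the other latents, and all function names (`prune_features`, `knn_indices`, `resampling_edge_weights`) are made up for this example.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis for probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def prune_features(latents, readout, threshold=1e-6):
    """Keep features whose zero-ablation changes the output distribution.

    Assumption: the model output is softmax(latents @ readout).
    """
    base = softmax(latents @ readout)
    keep = []
    for j in range(latents.shape[1]):
        ablated = latents.copy()
        ablated[:, j] = 0.0  # zero-ablate feature j
        mean_kl = kl_divergence(base, softmax(ablated @ readout)).mean()
        if mean_kl > threshold:
            keep.append(j)
    return np.array(keep, dtype=int)

def knn_indices(points, k):
    """For each row, indices of its k nearest other rows (Euclidean)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def resampling_edge_weights(latents, readout, feature, k, rng):
    """MSE on each downstream unit when `feature` is resampled from neighbors.

    Neighbors are found using all latents except `feature`, so "similar data
    points" means similar in everything but the feature being replaced.
    """
    others = np.delete(latents, feature, axis=1)
    neighbors = knn_indices(others, k)
    n = latents.shape[0]
    pick = neighbors[np.arange(n), rng.integers(0, k, size=n)]
    patched = latents.copy()
    patched[:, feature] = latents[pick, feature]  # resample from the cluster
    diff = patched @ readout - latents @ readout
    return np.mean(diff ** 2, axis=0)  # one edge weight per downstream unit

# Toy usage: 20 data points, 4 latent features, 3 downstream units.
rng = np.random.default_rng(0)
latents = rng.random((20, 4))
readout = rng.normal(size=(4, 3))
kept = prune_features(latents, readout)
edges = {int(j): resampling_edge_weights(latents, readout, j, k=3, rng=rng)
         for j in kept}
```

Each entry of `edges` maps a surviving feature to a vector of MSE values, one per downstream unit, which could serve as edge weights when drawing the circuit. In a real setting the linear readout would be replaced by the actual forward pass from the ablated layer to the next node.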
