SAE Debugger
- Select Text Parts
- Remove Unnecessary Features: Apply KL divergence ablation at each layer to remove SAE latent features that have minimal impact on the model's output.
- Apply Clustering: Use k-nearest-neighbor clustering on the remaining latent vectors to group similar features.
- Resample and Compute Edge Weights: For each feature, resample the ablated latents from the corresponding cluster and compute the mean squared error (MSE) to assign weights to the edges between nodes. Resampling takes latent values from similar data points within the cluster to estimate the ablated feature. A low MSE indicates that the feature's influence can be substituted by other similar features within the cluster, while a high MSE suggests that the feature plays a unique and important role.
- Visualize the Circuit: Construct a visual representation of the resulting circuit, showing nodes (key features) and edges (their causal relationships), for debugging and analysis. The circuit shows which latent features play major roles within the model and what causal connections (influence relationships) exist between them.
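
The pruning, clustering, and resampling steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `ablation_kl_scores` and `resample_mse` are hypothetical, a random linear `readout` followed by a softmax stands in for the model, ablation is taken to mean zeroing a latent, and the threshold `1e-6` and neighbor count `k=5` are arbitrary. The source does not specify how an MSE value maps to an edge between two particular nodes, so the sketch only measures the MSE at a single downstream output.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ablation_kl_scores(latents, readout, eps=1e-12):
    """Mean KL(clean || ablated) of the toy output when each latent is zeroed."""
    clean = softmax(latents @ readout)
    scores = np.empty(latents.shape[1])
    for j in range(latents.shape[1]):
        ablated = latents.copy()
        ablated[:, j] = 0.0  # assumption: "ablation" means zero ablation
        q = softmax(ablated @ readout)
        kl = np.sum(clean * (np.log(clean + eps) - np.log(q + eps)), axis=-1)
        scores[j] = kl.mean()
    return scores

def resample_mse(latents, readout, feature, k=5):
    """MSE on the toy output after replacing one feature with its neighbours' mean.

    Neighbours are the k nearest data points, measured on all other features.
    """
    others = np.delete(latents, feature, axis=1)
    dist = np.linalg.norm(others[:, None, :] - others[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    patched = latents.copy()
    patched[:, feature] = latents[neighbours, feature].mean(axis=1)
    return float(np.mean((patched @ readout - latents @ readout) ** 2))

# Toy data: 64 data points, 6 non-negative latents, 4 output logits.
rng = np.random.default_rng(0)
latents = np.abs(rng.normal(size=(64, 6)))
readout = rng.normal(size=(6, 4))
readout[5] = 0.0  # feature 5 cannot affect the output, so it should be pruned

scores = ablation_kl_scores(latents, readout)
keep = scores > 1e-6  # hypothetical pruning threshold
kept_latents, kept_readout = latents[:, keep], readout[keep]

# One resampling MSE per surviving feature; low values mean "substitutable".
weights = [resample_mse(kept_latents, kept_readout, j)
           for j in range(kept_latents.shape[1])]
```

In a real setting the latents would come from an SAE attached to a model layer and the KL divergence would be computed on the model's actual output distribution; the brute-force pairwise distance used here is only practical for small batches.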