Diversity Hypothesis

Creator
Seonglae Cho
Created
2025 Feb 4 11:05
Editor
Seonglae Cho
Edited
2025 Feb 27 20:57
Refs
Grokking
Model Generalization
Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).
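A minimal toy sketch of the hypothesis (illustrative only; the setup and names are not from the cited posts): when two input features are perfectly correlated in the training distribution, a linear model cannot distinguish the interpretable weights `[1, 1]` from an entangled solution such as `[2, 0]` — both fit perfectly, so the learned features are not identifiable. A diverse distribution, where the features vary independently, leaves only the interpretable solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: y = x0 + x1, i.e. two separate "true features".
def target(X):
    return X[:, 0] + X[:, 1]

# Narrow distribution: x0 and x1 are perfectly correlated (x1 == x0).
X_narrow = np.repeat(rng.normal(size=(100, 1)), 2, axis=1)
# Diverse distribution: x0 and x1 vary independently.
X_diverse = rng.normal(size=(100, 2))

def loss(w, X):
    return np.mean((X @ w - target(X)) ** 2)

w_true = np.array([1.0, 1.0])       # the interpretable solution
w_entangled = np.array([2.0, 0.0])  # collapses both features onto x0

# On narrow data both solutions reach (near-)zero loss:
# the training distribution cannot distinguish them.
print(loss(w_true, X_narrow), loss(w_entangled, X_narrow))
# On diverse data only the interpretable weights fit;
# the entangled solution has large error (~2 in expectation).
print(loss(w_true, X_diverse), loss(w_entangled, X_diverse))
```

This mirrors the claim at the feature level: diversity in the training distribution is what forces the model to represent each underlying factor separately rather than as an entangled mixture.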
Decision Transformer Interpretability — LessWrong
TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circui…
https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
RL Vision Interpretability
Understanding RL Vision
With diverse environments, we can analyze, diagnose and edit deep reinforcement learning models using attribution.
https://distill.pub/2020/understanding-rl-vision/
Copyright Seonglae Cho