This can be viewed as an adaptive learning rate without smoothing (the scale is recomputed from the current step rather than from a running average).
To prevent gradient explosion, gradient values are clipped so that they do not exceed a threshold. In practice (for example, when values would otherwise overflow in floating point), gradients are typically clipped by their norm, i.e. rescaled so that the overall gradient norm stays below the threshold, rather than clamped element-wise.
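A minimal sketch of norm-based clipping with PyTorch's torch.nn.utils.clip_grad_norm_; the tiny model, optimizer, and random batch below are placeholders standing in for an actual training loop:

import torch
import torch.nn as nn

# Placeholder model, optimizer, and batch; substitute your own training setup.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 10)

optimizer.zero_grad()
loss = model(x).pow(2).mean()
loss.backward()

# Rescale all gradients so their global L2 norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Element-wise alternative: clamp each gradient value into [-1.0, 1.0].
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

optimizer.step()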
Tracing anomalies (e.g. NaN/Inf appearing in the backward pass)
with torch.autograd.set_detect_anomaly(True):
    # Any NaN produced during backward raises an error whose traceback
    # points to the forward operation that created it.
    loss.backward()
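Anomaly detection adds noticeable overhead, so it is best enabled only while debugging and turned off for normal training.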