Temporal Difference

Creator: Seonglae Cho
Created: 2023 Sep 10 7:57
Edited: 2025 Jul 18 1:24

TD learning

TD learning calculates the error (δ) as the difference between the predicted value and the observed reward plus the next state's value:
  • State-value TD (V-learning): $V(s_{t+1}) - V(s_t)$
  • Action-value TD (Q learning / SARSA): $Q(s_{t+1}, a') - Q(s_t, a_t)$
  • Advantage: $Q(s, a) - V(s)$ isolates only the effect of the action
This is a "bootstrapped" estimate: the value is updated immediately at each step, using the current estimate of the next state's value rather than waiting for the full return.
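
As a minimal sketch of the three error forms, using hypothetical tabular values and a hypothetical discount γ (the bullets above list only the value-difference part; the reward term from the definition is included here):

```python
# Hypothetical tabular values; gamma is an assumed discount factor.
gamma = 0.99
r = 0.5  # observed reward for the transition s0 -> s1

# State-value TD error: r + gamma * V(s') - V(s)
V = {"s0": 1.0, "s1": 1.5}
delta_v = r + gamma * V["s1"] - V["s0"]

# Action-value TD error: SARSA uses the action a' actually taken next;
# Q-learning instead bootstraps from max_a Q(s', a).
Q = {("s0", "left"): 0.8, ("s1", "left"): 1.2, ("s1", "right"): 1.6}
a, a_next = "left", "right"
delta_sarsa = r + gamma * Q[("s1", a_next)] - Q[("s0", a)]
delta_qlearn = r + gamma * max(Q[("s1", "left")], Q[("s1", "right")]) - Q[("s0", a)]

# Advantage: Q(s, a) - V(s) isolates how much better the action is than average.
advantage = Q[("s0", a)] - V["s0"]

print(delta_v, delta_sarsa, delta_qlearn, advantage)
```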

Temporal difference target (TD target)

$y_i = r_t^i + V_\phi^{\pi_\theta}(s_{t+1}^i)$
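
A minimal sketch of computing this bootstrapped target from sampled transitions, assuming a small PyTorch value network `v_phi` (names, shapes, and the discount factor are assumptions; the formula above omits discounting):

```python
import torch
import torch.nn as nn

# Hypothetical value network V_phi(s) for a 4-dimensional state.
v_phi = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def td_target(rewards, next_states, dones, gamma=0.99):
    """y_i = r_t^i + gamma * V_phi(s_{t+1}^i); bootstrapping is cut at terminal states."""
    with torch.no_grad():  # the target is treated as a fixed regression label
        next_values = v_phi(next_states).squeeze(-1)
    return rewards + gamma * (1.0 - dones) * next_values
```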
 

Supervised regression (TD error)

$L(\phi) = \frac{1}{2}\sum_i \left\| V_\phi^{\pi_\theta}(s^i) - y_i \right\|^2$
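
Fitting $V_\phi$ to those targets is then ordinary squared-error regression; a sketch continuing the snippet above (the optimizer choice and batch contents are assumptions):

```python
optimizer = torch.optim.Adam(v_phi.parameters(), lr=1e-3)

def td_regression_step(states, rewards, next_states, dones):
    """One gradient step on L(phi) = 1/2 * sum_i ||V_phi(s^i) - y_i||^2."""
    targets = td_target(rewards, next_states, dones)  # y_i, held fixed
    values = v_phi(states).squeeze(-1)                # V_phi(s^i)
    loss = 0.5 * ((values - targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a hypothetical batch of 32 transitions:
# loss = td_regression_step(torch.randn(32, 4), torch.zeros(32),
#                           torch.randn(32, 4), torch.zeros(32))
```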
