TD learning
Computes the error (δ) as the difference between the target "observed reward + next-state value" and the current predicted value:
- State‑value TD (V-learning): $\delta_t = r_t + V(s_{t+1}) - V(s_t)$
- Action‑value TD (Q-learning/SARSA): $\delta_t = r_t + Q(s_{t+1}, a') - Q(s_t, a_t)$, where $a' = \arg\max_a Q(s_{t+1}, a)$ for Q-learning and the actually chosen next action $a_{t+1}$ for SARSA
- Advantage: $A(s, a) = Q(s, a) - V(s)$, which isolates the effect of the action from the value of the state itself
"Bootstrapped" estimation: instead of waiting for the full return, each step updates immediately using the current estimate of the next state's value (a minimal sketch follows).
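A minimal tabular sketch of the state-value case, assuming a toy environment with a `reset()` / `step(a) -> (s_next, r, done)` interface and a `policy(s)` callable (none of which come from the note); `gamma` is the discount factor, with `gamma=1.0` matching the undiscounted form above:

```python
def td0(env, policy, episodes=500, alpha=0.1, gamma=1.0):
    """Tabular TD(0): bootstrapped, per-step updates of V."""
    V = {}  # state -> value estimate (default 0.0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target bootstraps on the current estimate of the next state
            target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
            delta = target - V.get(s, 0.0)        # TD error
            V[s] = V.get(s, 0.0) + alpha * delta  # update immediately, no full return needed
            s = s_next
    return V
```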
Temporal difference target (TD target)
$y_i = r_t^i + V_\phi^{\pi_\theta}(s_{t+1}^i)$
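A sketch of how this target can be formed in practice, assuming a PyTorch value network `value_net` and batched tensors (the names and the discount `gamma` are illustrative, not from the note); the bootstrap term is held fixed so the target acts as a constant regression label:

```python
import torch

def td_targets(rewards, next_states, value_net, gamma=1.0):
    # y_i = r_i + gamma * V_phi(s'_i); no_grad freezes the bootstrap
    # estimate so gradients do not flow through the target.
    with torch.no_grad():
        next_values = value_net(next_states).squeeze(-1)
    return rewards + gamma * next_values
```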
Supervised regression onto the TD targets (squared TD error loss)
$\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\| V_\phi^{\pi_\theta}(s_i) - y_i \right\|^2$
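Continuing the sketch above, one regression step on a batch (an `optimizer` over `value_net`'s parameters is assumed):

```python
def fit_value(value_net, optimizer, states, targets):
    values = value_net(states).squeeze(-1)
    # L(phi) = 1/2 * sum_i ||V_phi(s_i) - y_i||^2
    loss = 0.5 * ((values - targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```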