TD learning
Calculates the error (δ) as the difference between the current value estimate and the observed reward plus the discounted value of the next state: δ = r + γV(s′) − V(s) (a minimal update sketch follows the list below)
- State-value TD (V-learning): learns V(s), the value of states
- Action-value TD (Q-learning / SARSA): learns Q(s, a), the value of state-action pairs
- Advantage: isolates the effect of the action itself, A(s, a) = Q(s, a) − V(s)
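
A minimal sketch of the tabular TD(0) state-value update in Python. The state count, alpha, and gamma here are illustrative assumptions, not values from these notes:

```python
import numpy as np

# Assumed sizes and hyperparameters, for illustration only
n_states = 5
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

V = np.zeros(n_states)  # tabular state-value estimates

def td0_update(s, r, s_next, done):
    """One TD(0) update of V after observing (s, r, s_next)."""
    # TD target: observed reward + discounted value of the next state
    # (the next-state value is 0 if the episode terminated)
    target = r + gamma * V[s_next] * (not done)
    delta = target - V[s]   # TD error δ
    V[s] += alpha * delta   # move V(s) a small step toward the target
    return delta
```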
“Bootstrapped” estimation: updates immediately at each step, using the current estimate of the next state's value in place of the true return (bootstrapping), rather than waiting until the end of the episode
Temporal difference target (TD target): the quantity the estimate is pulled toward, r + γV(s′) for state values (or r + γ max_a′ Q(s′, a′) for Q-learning); δ is the gap between the TD target and the current estimate
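
A worked numerical sketch of how the TD target differs between Q-learning (off-policy) and SARSA (on-policy). The reward, Q-values, and chosen action are made-up numbers for illustration:

```python
import numpy as np

gamma = 0.9
r = 1.0                              # assumed reward for this transition
Q_next = np.array([0.2, 0.5, 0.1])   # assumed Q(s', a) for each action a
a_next = 0                           # action actually taken in s' (SARSA)

# Q-learning bootstraps from the best next action, regardless of policy
q_learning_target = r + gamma * Q_next.max()    # 1.0 + 0.9 * 0.5 = 1.45
# SARSA bootstraps from the action the policy actually took
sarsa_target = r + gamma * Q_next[a_next]       # 1.0 + 0.9 * 0.2 = 1.18
```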
