TD learning: "bootstrapped" estimation

Temporal difference target (TD target):

$$r_t^i + V_\phi^{\pi_\theta}(s_{t+1}^i)$$

Supervised regression (TD error):

$$L(\phi) = \frac{1}{2}\sum_i \left\| V_\phi^{\pi_\theta}(s^i) - y_i \right\|^2$$
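
As a rough illustration, the target and the regression loss can be sketched in NumPy. This assumes that $y_i$ in the loss is the TD target, that the critic is a hypothetical linear model $V_\phi(s) = s^\top \phi$, and that the targets are held fixed during the update (a semi-gradient step). The data, the step size, and the function names `td_targets` and `td_loss` are made up for the example. No discount factor is used, matching the formula above.

```python
import numpy as np

def td_targets(rewards, next_values):
    """Bootstrapped targets: r_t^i + V(s_{t+1}^i), treated as fixed labels."""
    return rewards + next_values

def td_loss(values, targets):
    """Regression loss: 1/2 * sum_i (V(s^i) - y_i)^2."""
    return 0.5 * np.sum((values - targets) ** 2)

# Hypothetical linear critic V_phi(s) = s @ phi on made-up transitions.
states = np.array([[1.0, 0.0], [0.0, 1.0]])
next_states = np.array([[0.0, 1.0], [0.0, 0.0]])
rewards = np.array([1.0, 2.0])
phi = np.zeros(2)

# Targets are computed from the current value estimates (bootstrapping).
y = td_targets(rewards, next_states @ phi)
loss_before = td_loss(states @ phi, y)

# One gradient step on L(phi) with y held constant (assumed step size 0.5).
grad = states.T @ (states @ phi - y)
phi_new = phi - 0.5 * grad
loss_after = td_loss(states @ phi_new, y)
```

In practice the targets would be recomputed from the updated critic after each step, since they depend on $V_\phi$ itself.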