Learning both a Q function and an explicit policy at once is cumbersome, so learn only the Q function; there is no dependency on policy gradients.
- Collect a dataset of transitions (s_i, a_i, s'_i, r_i) by rolling out any policy (off-policy data is fine)
- For each data point, set the target y_i ← r(s_i, a_i) + γ max_{a'} Q_φ(s'_i, a') and fit Q_φ(s_i, a_i) to y_i by minimizing the squared TD error (see the sketch after this list)
- The policy improves without an explicit update on π, which is defined implicitly via the greedy argmax: π(s) = argmax_a Q_φ(s, a)
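As a minimal sketch of the loop described above, the PyTorch code below performs one fitted-Q regression step on a batch of off-policy transitions. All names (q_net, fitted_q_update), the network shape, and the random placeholder batch are illustrative assumptions, not something specified in these notes.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for a small discrete-action task.
STATE_DIM, NUM_ACTIONS, GAMMA = 4, 2, 0.99

# One Q network: state -> Q(s, a) for every discrete action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_update(batch):
    """One fitted-Q step on a batch of off-policy transitions."""
    s, a, r, s_next, done = batch  # collected by rolling out any policy

    # Target: y = r + gamma * max_a' Q(s', a'); no gradient flows through it.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * q_net(s_next).max(dim=1).values

    # Regress Q(s, a) toward y (squared TD error).
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - y) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random placeholder data standing in for a replay buffer.
batch = (
    torch.randn(32, STATE_DIM),            # s
    torch.randint(0, NUM_ACTIONS, (32,)),  # a
    torch.randn(32),                       # r
    torch.randn(32, STATE_DIM),            # s'
    torch.zeros(32),                       # done flags
)
print(fitted_q_update(batch))
```

Note that the policy never appears explicitly: acting greedily just means taking argmax over the Q network's output.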
Properties
- Only one Q network to learn, no high-variance policy gradient
- However, the target itself is computed from the network being trained (moving target), so naturally there are no convergence guarantees; it requires lots of tricks to make it work (see the sketch after this list)
- Evaluating Q-values for all possible actions is infeasible with a continuous action space (can be harder to learn than just a policy)
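To make the moving-target point concrete: y depends on Q_φ, which changes after every gradient step, so the regression target keeps shifting. The notes do not say which tricks are meant; as one hedged illustration, the sketch below (hypothetical names, same network shape as above) computes targets from a frozen copy of the Q network that is only refreshed periodically, which keeps the target fixed between refreshes.

```python
import copy
import torch
import torch.nn as nn

GAMMA = 0.99

# Illustrative Q network (same shape assumptions as the sketch above).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# Frozen copy used only for computing targets.
target_net = copy.deepcopy(q_net)

def td_targets(r, s_next, done):
    # Targets come from the frozen copy, so they stay fixed between refreshes.
    with torch.no_grad():
        return r + GAMMA * (1.0 - done) * target_net(s_next).max(dim=1).values

def refresh_target(step, period=1000):
    # Copy the online weights into the frozen network every `period` updates.
    if step % period == 0:
        target_net.load_state_dict(q_net.state_dict())
```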
Value Iteration & Q Iteration
Q-Learning
Q Learning Notion