Q learning

Creator: Seonglae Cho
Created: 2024 Apr 27 10:1
Edited: 2024 Apr 27 16:21

Because it is hard to handle both the Q function and the policy at once, Q-learning learns only the Q function and has no policy gradient dependency.

  1. Collect $\{s_i, a_i, r_i\}$ by rolling out any policy
  2. Set target $y_i = r_i + \max_{a_i'} Q_\phi(s_i', a_i')$ and fit $Q_\phi(s_i, a_i)$ to $y_i$ for each data point
    1. $Q_\phi(s, a) \leftarrow r(s, a) + \max_{a'} Q_\phi(s', a')$, therefore
      $\phi \leftarrow \arg\min_\phi \sum_i \|Q_\phi(s_i, a_i) - y_i\|^2$ (TD error)
  3. $\pi'(s) = \arg\max_a Q^\pi(s, a)$ without an explicit update on $\pi$, which is defined using $Q_\phi$ (see the sketch below)
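A minimal sketch of this loop for a small discrete MDP with a tabular Q; the environment, `rollout_policy`, learning rate, and discount `gamma` are illustrative assumptions not in the note (the note's formulas omit the discount):

```python
import numpy as np

# Illustrative sizes; any small discrete MDP works for this sketch
n_states, n_actions = 5, 3
gamma = 0.99   # discount factor (assumed; the note's target omits it)
lr = 0.1       # step size standing in for the argmin over phi

Q = np.zeros((n_states, n_actions))   # tabular Q_phi(s, a)

def rollout_policy(s):
    # Any behavior policy can collect the data (Q-learning is off-policy)
    return np.random.randint(n_actions)

def env_step(s, a):
    # Placeholder dynamics: returns (next_state, reward)
    return np.random.randint(n_states), float(np.random.randn())

for _ in range(1000):
    # 1. Collect (s_i, a_i, r_i, s_i') by rolling out any policy
    s = np.random.randint(n_states)
    a = rollout_policy(s)
    s_next, r = env_step(s, a)

    # 2. Set target y_i = r_i + gamma * max_a' Q_phi(s_i', a')
    y = r + gamma * Q[s_next].max()

    #    Fit Q_phi(s_i, a_i) to y_i: one step on the squared TD error
    Q[s, a] -= lr * (Q[s, a] - y)

# 3. Greedy policy read off Q_phi, with no explicit policy parameters
greedy_policy = Q.argmax(axis=1)
```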

Properties

  • Only one Q network to learn, no high-variance policy gradient
  • However, since only the Q function is learned and the target is bootstrapped from it (moving target), there are naturally no convergence guarantees; it requires lots of tricks to make it work (one such trick is sketched below)
  • Evaluating Q-values for all possible actions is infeasible with a continuous action space (could be harder to learn than just a policy)
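One standard trick for the moving-target problem is a target network: the bootstrap target is computed from a frozen copy of the Q function that is only refreshed occasionally. A minimal tabular sketch, assuming a periodic hard copy every `update_every` steps and a random placeholder environment (both are assumptions, not from the note):

```python
import numpy as np

n_states, n_actions = 5, 3
gamma, lr = 0.99, 0.1
update_every = 100                      # target refresh period (assumed)

Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()                     # frozen copy used only for targets

for step in range(10_000):
    # Random placeholder transitions stand in for real environment rollouts
    s, a = np.random.randint(n_states), np.random.randint(n_actions)
    s_next, r = np.random.randint(n_states), float(np.random.randn())

    # The target uses Q_target, so it does not shift after every update of Q
    y = r + gamma * Q_target[s_next].max()
    Q[s, a] -= lr * (Q[s, a] - y)

    if step % update_every == 0:
        Q_target = Q.copy()             # periodic hard update of the frozen copy
```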

Value Iteration & Q Iteration

Value iteration is model-based RL: taking the expectation over next states requires the transition model $p(s_i' \mid s_i, a_i)$. Q iteration replaces that expectation with sampled transitions, so no model of the dynamics is needed.
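A small sketch of why value iteration needs the model: the backup below uses the full transition tensor $p(s' \mid s, a)$, which Q iteration replaces with sampled next states. The random `P` and `R` are placeholder assumptions:

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9

# Known dynamics p(s' | s, a) and rewards r(s, a): having these is what
# makes value iteration model-based
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (s, a, s')
R = np.random.randn(n_states, n_actions)

V = np.zeros(n_states)
for _ in range(100):
    # Q(s, a) = r(s, a) + gamma * E_{s' ~ p(.|s,a)}[V(s')] requires the model P
    Q = R + gamma * P @ V
    V = Q.max(axis=1)

# Q iteration / Q-learning drops the explicit expectation over P and instead
# bootstraps from sampled s', which is why no dynamics model is needed.
```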