SAC uses a stochastic policy, unlike DDPG, which learns a deterministic one.
A central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy. (A deterministic policy is just the zero-variance special case of a stochastic policy.)
An entropy bonus term, weighted by a temperature hyperparameter, is added to the objective being maximized; this is effective for continuous action spaces.
Epsilon-greedy provides exploration in the discrete case; in the continuous case, the entropy of the policy measures how random it is.
A random (stochastic) policy is preferred in RL because it enables more exploration.
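For concreteness, here is a minimal sketch of the squashed-Gaussian stochastic policy SAC typically uses, contrasted with a deterministic DDPG-style actor. It assumes a PyTorch setup; `obs_dim`, `act_dim`, and the network sizes are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class StochasticPolicy(nn.Module):
    """Squashed-Gaussian actor: samples actions, unlike a deterministic DDPG actor."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)        # Gaussian mean
        self.log_std = nn.Linear(hidden, act_dim)   # state-dependent log std

    def forward(self, obs):
        h = self.body(obs)
        mu = self.mu(h)
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = Normal(mu, std)
        u = dist.rsample()             # reparameterized sample -> differentiable exploration
        a = torch.tanh(u)              # squash action into [-1, 1]
        # log-prob with the tanh change-of-variables correction
        logp = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, logp

# A deterministic DDPG-style actor would just return torch.tanh(mu):
# zero variance, no sampling, and therefore no entropy to regularize.
```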
Soft Q value
The Q-function is augmented with an entropy term, weighted by a temperature parameter.
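As a sketch of how this shows up in the critic update, the entropy term enters through the soft Bellman target. The names here (`pi`, `q1_targ`, `q2_targ`, and the replay-batch tensors `rew`, `obs2`, `done`) are placeholders.

```python
import torch

def soft_q_target(rew, obs2, done, pi, q1_targ, q2_targ, gamma=0.99, alpha=0.2):
    """Soft Bellman backup target: r + gamma * (min Q_targ - alpha * log pi)."""
    with torch.no_grad():
        a2, logp_a2 = pi(obs2)                                     # sample next action from the stochastic policy
        q_next = torch.min(q1_targ(obs2, a2), q2_targ(obs2, a2))   # clipped double-Q
        # the entropy bonus appears as -alpha * log pi(a'|s') inside the target
        return rew + gamma * (1 - done) * (q_next - alpha * logp_a2)
```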
Instead of being fixed, the temperature hyperparameter can also be tuned automatically.
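A common way to do this, following the automatic-temperature variant of SAC, is to learn log alpha by gradient descent so that the policy's entropy stays near a target value (often -act_dim). The variable names below are placeholders.

```python
import torch

act_dim = 6                                     # assumed action dimension
target_entropy = -float(act_dim)                # common heuristic for the entropy target
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(logp_pi):
    """One gradient step on alpha; logp_pi are log-probs of freshly sampled actions."""
    # pushes alpha up when entropy (-logp_pi) drops below the target, down otherwise
    alpha_loss = -(log_alpha * (logp_pi + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()               # current temperature alpha
```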
The maximum-entropy RL objective augments the expected return with the policy's entropy at every timestep.
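In symbols, with $\rho_\pi$ the state-action distribution induced by $\pi$ and $\alpha$ the temperature (this is the form used in the SAC paper):

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$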