Uses a stochastic policy, unlike DDPG
A central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy. (A deterministic policy is the zero-variance limit of a stochastic policy.)
An entropy bonus term, weighted by a temperature hyperparameter, is added to the objective being maximized, which is effective for continuous action spaces.
Epsilon-greedy handles exploration for discrete actions; in the continuous case, entropy measures how random the policy is. A more random policy is preferred in RL because it enables more exploration.
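A minimal PyTorch sketch of this idea (network sizes, the stand-in critic, and variable names are illustrative assumptions, not the reference implementation): sample actions from a squashed Gaussian policy with the reparameterization trick and minimize α·log π(a|s) − Q(s, a), i.e. maximize return plus entropy.

```python
# Sketch: squashed Gaussian policy and the entropy-regularized actor loss.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def sample(self, obs: torch.Tensor):
        h = self.body(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                      # reparameterized sample (keeps gradients)
        a = torch.tanh(u)                       # squash action to [-1, 1]
        # log-prob with tanh change-of-variables correction
        log_prob = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_prob

obs_dim, act_dim, alpha = 3, 1, 0.2             # alpha = entropy temperature
policy = GaussianPolicy(obs_dim, act_dim)
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # stand-in critic

obs = torch.randn(32, obs_dim)                  # dummy batch of states
action, log_prob = policy.sample(obs)
q_value = q_net(torch.cat([obs, action], dim=-1)).squeeze(-1)
actor_loss = (alpha * log_prob - q_value).mean()  # maximize Q + alpha * entropy
actor_loss.backward()                             # gradients flow through rsample()
```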
Soft Q value
The soft Q value augments the standard expected return with an entropy bonus scaled by the temperature parameter α.
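Following the SAC paper, the soft Bellman backup and the soft state value with temperature $\alpha$ are:

$$
Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V(s_{t+1}) \big],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big]
$$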

Automatically tuning the temperature hyperparameter
The Max Ent RL objective is

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
$$

In the automatic-tuning variant, α is treated as a learnable Lagrange multiplier: it is adjusted so that the policy's entropy stays near a chosen target rather than being fixed by hand.
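A minimal sketch of the automatic tuning step, assuming a log-parameterized α and the common target-entropy heuristic of −dim(A) (variable names and the dummy batch are illustrative):

```python
# Sketch: automatic temperature (alpha) tuning toward a target entropy.
import torch

act_dim = 1
target_entropy = -float(act_dim)                 # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)   # learn log(alpha) so alpha stays positive
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

# log_prob comes from the current policy's sampled actions; detached because
# this update moves alpha only, not the policy.
log_prob = torch.randn(32)                       # dummy stand-in for pi's log-probs

alpha_loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
alpha_opt.zero_grad()
alpha_loss.backward()
alpha_opt.step()

alpha = log_alpha.exp().detach()                 # use the updated alpha in actor/critic losses
```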
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
https://arxiv.org/abs/1801.01290

Soft Actor-Critic — Spinning Up documentation
Soft Actor Critic (SAC) is an algorithm that optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. It isn’t a direct successor to TD3 (having been published roughly concurrently), but it incorporates the clipped double-Q trick, and due to the inherent stochasticity of the policy in SAC, it also winds up benefiting from something like target policy smoothing.
https://spinningup.openai.com/en/latest/algorithms/sac.html
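As a rough illustration of the clipped double-Q trick mentioned above (the policy, target critics, and tensor shapes are assumptions, reusing the policy sketch from earlier): the critic's Bellman target takes the minimum of two target Q networks and subtracts the entropy term.

```python
# Sketch: clipped double-Q target for the SAC critic update.
import torch

def sac_target(reward, done, next_obs, policy, q1_targ, q2_targ, alpha, gamma=0.99):
    """Bellman target using the minimum of two target critics (clipped double-Q)."""
    with torch.no_grad():
        next_action, next_log_prob = policy.sample(next_obs)
        q_in = torch.cat([next_obs, next_action], dim=-1)
        min_q = torch.min(q1_targ(q_in), q2_targ(q_in)).squeeze(-1)
        soft_value = min_q - alpha * next_log_prob          # soft state value V(s')
        return reward + gamma * (1.0 - done) * soft_value
```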

Seonglae Cho