The objective in maximum entropy RL is to maximize the expected return while also maximizing the entropy of the policy. This encourages exploration and helps prevent the policy from becoming too deterministic too quickly.
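Written out, using a common formulation with an entropy temperature $\alpha$ (this particular notation is an assumption in the style of soft actor-critic, not quoted from a specific source), the objective is:

$$
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
$$

where $\alpha > 0$ trades off reward against entropy.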
The optimal policy is proportional to $\exp\big(Q(s,a)/\alpha\big)$, because the maximum entropy regularizer encourages a policy that not only seeks high reward but also stays as random as possible.
To find the optimal policy $\pi^*$, we need to maximize $\mathbb{E}_{a \sim \pi}\big[Q(s,a)\big] + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s)\big)$ over $\pi(\cdot \mid s)$. This can be done by taking the derivative with respect to $\pi(a \mid s)$, with a Lagrange multiplier enforcing that the probabilities sum to one, and setting it to zero.
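A sketch of that calculation for a single state $s$, with a Lagrange multiplier $\lambda$ for the normalization constraint (the intermediate algebra is filled in here under the notation above):

$$
\begin{aligned}
\mathcal{L}(\pi) &= \sum_a \pi(a \mid s)\, Q(s,a) - \alpha \sum_a \pi(a \mid s) \log \pi(a \mid s) + \lambda \Big( \sum_a \pi(a \mid s) - 1 \Big), \\
0 &= \frac{\partial \mathcal{L}}{\partial \pi(a \mid s)} = Q(s,a) - \alpha \big( \log \pi(a \mid s) + 1 \big) + \lambda
\;\;\Longrightarrow\;\;
\pi^*(a \mid s) \propto \exp\!\big( Q(s,a) / \alpha \big).
\end{aligned}
$$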
The log-sum-exp trick is common in machine learning for numerical stability. Estimating the normalizing term for this policy involves computing the log of a sum of exponentials of Q-values, which can be computationally intensive, especially in continuous action spaces where the sum becomes an integral.
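For reference, the identity behind the trick, with $x_a = Q(s,a)/\alpha$ and $m = \max_a x_a$; subtracting the maximum before exponentiating keeps every exponent non-positive, so nothing overflows:

$$
\log \sum_{a} \exp(x_a) = m + \log \sum_{a} \exp(x_a - m).
$$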
To turn this proportionality into a valid probability distribution, we normalize by exactly that sum (or integral) of exponentiated Q-values.
Often, for simplicity, the temperature $\alpha$ is set to 1 or absorbed into the Q-function.
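Putting this together, in the discrete and continuous cases respectively (the continuous form is the natural analogue with the sum replaced by an integral; this is the standard soft-Q formulation rather than an equation stated explicitly above):

$$
\pi^*(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{\sum_{a'} \exp\big(Q(s,a')/\alpha\big)}
\qquad\text{or}\qquad
\pi^*(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{\int \exp\big(Q(s,a')/\alpha\big)\, da'}.
$$

The log of the denominator is exactly the log-sum-exp term discussed above; scaled by $\alpha$, it is often written as the soft value $V(s)$.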
Discrete Action Space
In a discrete action space, the computation is straightforward (see the short sketch after this list):
- Compute the exponentials of the Q-values for each action.
- Sum these exponentials.
- Take the logarithm of this sum.
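A minimal sketch of these three steps in Python/NumPy, using the max-subtraction form of the log-sum-exp for stability (the Q-values and the `soft_log_sum_exp` helper below are illustrative, not from the original text):

```python
import numpy as np

def soft_log_sum_exp(q_values, alpha=1.0):
    """Discrete case: exponentiate the Q-values, sum them, take the log,
    done in the numerically stable max-subtraction form."""
    scaled = np.asarray(q_values, dtype=np.float64) / alpha
    m = scaled.max()                               # subtract the max for stability
    log_sum_exp = m + np.log(np.exp(scaled - m).sum())
    policy = np.exp(scaled - log_sum_exp)          # exp(Q/alpha) / sum(exp(Q/alpha))
    return log_sum_exp, policy

# Illustrative Q-values for three actions.
lse, pi = soft_log_sum_exp([1.0, 2.5, 0.3], alpha=1.0)
print(lse, pi, pi.sum())  # pi is a valid distribution: it sums to 1
```

Exponentiating $Q/\alpha$ minus the log-sum-exp gives the normalized policy as a byproduct, which is why the same quantity shows up in both the value estimate and the policy.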
Continuous Action Space
In a continuous action space, exact computation is infeasible, so we use sampling methods to approximate it. Here’s how (a sketch follows the list):
- Sample Actions: Draw a set of sample actions $\{a_1, a_2, \dots, a_n\}$ from the action space, typically using a proposal distribution.
- Evaluate Q-Values: Calculate the Q-values for these sampled actions.
- Compute the Approximation: Use the sampled Q-values to estimate the log-sum-exp.
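A minimal sketch of this procedure, assuming a plain importance-sampling estimator $\log \int \exp(Q(s,a))\,da \approx \log\!\big(\tfrac{1}{n}\sum_i \exp(Q(s,a_i))/q(a_i)\big)$ with proposal density $q$; the toy Q-function, the uniform proposal, and the helper names are all illustrative assumptions:

```python
import numpy as np

def log_sum_exp_continuous(q_fn, sample_proposal, proposal_log_prob, n_samples=256):
    """Importance-sampling estimate of log ∫ exp(Q(s, a)) da.

    q_fn:              maps an array of sampled actions to their Q-values.
    sample_proposal:   draws n actions from the proposal distribution q(a).
    proposal_log_prob: log-density of the proposal at those actions.
    """
    actions = sample_proposal(n_samples)          # 1. sample actions from the proposal
    q_values = q_fn(actions)                      # 2. evaluate their Q-values
    # 3. log( (1/n) * sum_i exp(Q(a_i)) / q(a_i) ), computed stably
    log_terms = q_values - proposal_log_prob(actions) - np.log(n_samples)
    m = log_terms.max()                           # max-subtraction for stability
    return m + np.log(np.exp(log_terms - m).sum())

# Illustrative setup: uniform proposal on [-1, 1] and a toy quadratic Q-function.
q_fn = lambda a: -(a - 0.3) ** 2
sample_proposal = lambda n: np.random.uniform(-1.0, 1.0, size=n)
proposal_log_prob = lambda a: np.full_like(a, np.log(0.5))  # uniform density = 1/2
print(log_sum_exp_continuous(q_fn, sample_proposal, proposal_log_prob))
```

With a uniform proposal the importance weights are constant; a proposal that puts more mass where the Q-values are large reduces the variance of the estimate.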