Maximum Entropy Objective

Creator: Seonglae Cho
Created: 2024 Jun 16 11:10
Edited: 2024 Jun 16 11:16
The objective in maximum entropy RL is to maximize the expected return while also maximizing the entropy of the policy. This encourages exploration and helps prevent the policy from becoming too deterministic too quickly.
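Written in the standard soft Q-learning / SAC notation, with a temperature $\alpha$ weighting the entropy bonus (the symbols here follow that convention, not the original figure), the objective is

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big]
$$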
The optimal policy is proportional to the exponentiated Q-values, because the maximum entropy regularizer encourages a policy that not only seeks high reward but also stays stochastic:

$$
\pi^*(a \mid s) \propto \exp\!\Big(\tfrac{1}{\alpha} Q(s, a)\Big)
$$
To find the optimal policy $\pi^*$, we need to maximize this objective. For a single state, this can be done by taking the derivative of $\mathbb{E}_{a \sim \pi}\big[Q(s, a)\big] + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s)\big)$ with respect to $\pi(a \mid s)$ (with a Lagrange multiplier enforcing normalization) and setting it to zero:

$$
Q(s, a) - \alpha \log \pi(a \mid s) + \text{const} = 0
\quad\Longrightarrow\quad
\pi^*(a \mid s) \propto \exp\!\Big(\tfrac{1}{\alpha} Q(s, a)\Big)
$$
The log-sum-exp trick is common in machine learning for numerical stability. Estimating this normalization term involves computing the log of a sum of exponentials of Q-values, which can be computationally intensive, especially in continuous action spaces.
$$
V(s) = \alpha \log \sum_{a} \exp\!\Big(\tfrac{1}{\alpha} Q(s, a)\Big)
$$
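Concretely, the trick subtracts the largest exponent before summing, which avoids overflow (here $x_a = Q(s, a)/\alpha$):

$$
\log \sum_{a} \exp x_a = m + \log \sum_{a} \exp\big(x_a - m\big), \qquad m = \max_{a} x_a
$$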
To turn this proportionality into a valid probability distribution, we normalize:

$$
\pi^*(a \mid s) = \frac{\exp\big(\tfrac{1}{\alpha} Q(s, a)\big)}{\sum_{a'} \exp\big(\tfrac{1}{\alpha} Q(s, a')\big)} = \exp\!\Big(\tfrac{1}{\alpha}\big(Q(s, a) - V(s)\big)\Big)
$$
Often, for simplicity, the temperature $\alpha$ is set to 1 or absorbed into the Q-function, giving

$$
\pi(a \mid s) = \exp\big(Q(s, a) - V(s)\big), \qquad V(s) = \log \sum_{a} \exp Q(s, a)
$$

Discrete Action Space

In a discrete action space, the computation is straightforward (a minimal code sketch follows the list):
  1. Compute the exponentials of the Q-values for each action.
  2. Sum these exponentials.
  3. Take the logarithm of this sum.
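A minimal sketch of these three steps, assuming NumPy arrays of Q-values and a temperature `alpha` (both names are illustrative):

```python
import numpy as np

def soft_value_discrete(q_values: np.ndarray, alpha: float = 1.0) -> float:
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed with the
    max-shift log-sum-exp trick for numerical stability."""
    scaled = q_values / alpha
    m = scaled.max()
    return float(alpha * (m + np.log(np.exp(scaled - m).sum())))

def soft_policy_discrete(q_values: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """pi(a | s) proportional to exp(Q(s, a) / alpha), normalized over actions."""
    scaled = q_values / alpha
    weights = np.exp(scaled - scaled.max())
    return weights / weights.sum()

# Toy example with three actions.
q = np.array([1.0, 2.0, 0.5])
print(soft_value_discrete(q))   # soft value V(s)
print(soft_policy_discrete(q))  # softmax-like action distribution
```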

Continuous Action Space

In a continuous action space, exact computation is infeasible, so we use sampling methods to approximate it. Here’s how:
  1. Sample Actions: Draw a set of sample actions $\{a_1, a_2, ..., a_n\}$ from the action space, typically using a proposal distribution.
  2. Evaluate Q-Values: Calculate the Q-values for these sampled actions.
  3. Compute the Approximation: Use the sampled Q-values to estimate the log-sum-exp (see the sketch after this list).
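A minimal sketch of this sampling-based estimate, written as importance sampling under an explicit proposal density; the function and proposal names are illustrative assumptions, not from the source:

```python
import numpy as np

def soft_value_continuous(q_fn, sample_actions, proposal_logpdf,
                          n_samples: int = 256, alpha: float = 1.0) -> float:
    """Importance-sampling estimate of the soft value
    V(s) = alpha * log integral of exp(Q(s, a) / alpha) da:
    draw a_i from the proposal, weight by 1 / q(a_i), and average."""
    actions = sample_actions(n_samples)                        # shape (n, action_dim)
    log_w = q_fn(actions) / alpha - proposal_logpdf(actions)   # log[exp(Q/alpha) / q(a_i)]
    m = log_w.max()
    # stable log of the mean of exp(log_w)
    return float(alpha * (m + np.log(np.exp(log_w - m).mean())))

# Toy example: quadratic Q-function, uniform proposal on [-1, 1].
rng = np.random.default_rng(0)
q_fn = lambda a: -np.sum(a ** 2, axis=-1)
sample = lambda n: rng.uniform(-1.0, 1.0, size=(n, 1))
logpdf = lambda a: np.full(a.shape[0], np.log(0.5))  # uniform density on [-1, 1]
print(soft_value_continuous(q_fn, sample, logpdf))
```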