Skills should emerge from an induced AI incentive
RL agent learns Skills (Options) without env reward
- Information-theoretic discovery
- Plot the distribution over observations
- Compute its entropy (how random / how broad the distribution is); see the sketch below
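A minimal sketch of this idea, assuming 1-D observations discretized into histogram bins (the function name and binning are illustrative, not from any specific method):

```python
import numpy as np

def state_entropy(visited_states, num_bins=20):
    """Estimate the entropy of the visitation distribution over (discretized) observations."""
    counts, _ = np.histogram(visited_states, bins=num_bins)
    probs = counts / counts.sum()
    probs = probs[probs > 0]            # drop empty bins to avoid log(0)
    return -(probs * np.log(probs)).sum()

# Broad coverage of the observation space -> high entropy
broad = np.random.uniform(-1.0, 1.0, size=10_000)
# Collapsed coverage (agent keeps revisiting the same region) -> low entropy
narrow = np.random.normal(0.0, 0.01, size=10_000)
print(state_entropy(broad), state_entropy(narrow))
```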
Maximum entropy RL
An agent that knows only one solution easily falls into local optima and is not robust to environmental changes. However, action entropy is not the same as state entropy (RL exploration): diverse actions do not guarantee diverse states.
We want low diversity within a fixed skill and high diversity across skills, so the agent is controllable (different skills should visit different parts of the state-action space).
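A rough numerical sketch of this objective, with hypothetical 1-D states and four skills that each cluster in their own region of the state space (all names and numbers are illustrative):

```python
import numpy as np

def entropy(samples, bins=20, value_range=(-3, 3)):
    counts, _ = np.histogram(samples, bins=bins, range=value_range)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
# Each skill visits a narrow, distinct region of the state space
skill_states = [rng.normal(loc=mu, scale=0.1, size=1000) for mu in (-2, -0.7, 0.7, 2)]

within = np.mean([entropy(s) for s in skill_states])    # low: each skill is predictable
across = entropy(np.concatenate(skill_states))           # high: together they cover the space
print(f"within-skill entropy {within:.2f} < across-skill entropy {across:.2f}")
```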
Skill policy
Conditioned on a skill vector, the skill policy learns to visit the states associated with that skill.
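A minimal sketch of such a skill-conditioned policy, assuming PyTorch, a continuous skill vector, and a simple Gaussian action head (architecture details are assumptions, not a specific paper's implementation):

```python
import torch
import torch.nn as nn

class SkillPolicy(nn.Module):
    """pi(a | s, z): the skill vector z is concatenated to the state."""
    def __init__(self, state_dim, skill_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state, skill):
        h = self.net(torch.cat([state, skill], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

# Sample an action for skill z in state s
policy = SkillPolicy(state_dim=8, skill_dim=4, action_dim=2)
s, z = torch.randn(1, 8), torch.randn(1, 4)
action = policy(s, z).sample()
```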
Discriminator
The goal of the skill policy is to minimize $-\log q_\phi(z \mid s)$, which means maximizing $\log q_\phi(z \mid s)$, by setting the intrinsic reward to the discriminator's log-probability.
- The goal of the skill policy is to minimize $-\log q_\phi(z \mid s)$, which corresponds to minimizing the distance between the state embedding and the skill vector $z$
- This objective is equivalent to maximizing $\log q_\phi(z \mid s)$, i.e. maximizing the probability of the skill given the state
- To achieve this, we set the reward function as $r(s, z) = \log q_\phi(z \mid s)$. This way, when the agent selects skill $z$ in state $s$, it receives the discriminator's log probability as a reward (see the sketch below)
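A minimal sketch of this discriminator reward in the DIAYN style, assuming discrete skills and a softmax discriminator (class and function names are illustrative; the constant $-\log p(z)$ baseline used with a uniform skill prior is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """q_phi(z | s): predicts which (discrete) skill produced state s."""
    def __init__(self, state_dim, num_skills, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills

def skill_reward(disc, state, skill_idx):
    """Intrinsic reward r(s, z) = log q_phi(z | s) for the skill being executed."""
    log_probs = F.log_softmax(disc(state), dim=-1)
    return log_probs.gather(-1, skill_idx.unsqueeze(-1)).squeeze(-1)

disc = Discriminator(state_dim=8, num_skills=16)
state = torch.randn(32, 8)
skill = torch.randint(0, 16, (32,))
reward = skill_reward(disc, state, skill)           # reward given to the skill policy
disc_loss = F.cross_entropy(disc(state), skill)     # discriminator trained to predict z from s
```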
LSD adds a distance consideration to skill-policy learning by adding a term to maximize, $(\phi(s_{t+1}) - \phi(s_t))^\top z$, and regulates $\phi$ to reflect distances in state space via the Lipschitz constraint $\lVert \phi(s) - \phi(s') \rVert \le \lVert s - s' \rVert$, preventing $\lVert \phi(s_{t+1}) - \phi(s_t) \rVert$ from becoming infinitely large.
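A sketch of this LSD-style objective under the reconstruction above, assuming PyTorch and continuous skills; the hinge penalty stands in for the paper's constrained optimization of the Lipschitz condition, so treat it as an illustrative simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateEmbedding(nn.Module):
    """phi(s): maps a state into the skill space."""
    def __init__(self, state_dim, skill_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, skill_dim),
        )

    def forward(self, s):
        return self.net(s)

phi = StateEmbedding(state_dim=8, skill_dim=4)
s, s_next = torch.randn(32, 8), torch.randn(32, 8)
z = F.normalize(torch.randn(32, 4), dim=-1)          # skill direction

# Intrinsic reward: directional progress of the embedding along the skill vector
reward = ((phi(s_next) - phi(s)) * z).sum(dim=-1)

# Stand-in for the 1-Lipschitz constraint ||phi(x) - phi(y)|| <= ||x - y||:
# penalize violations so embedding distances cannot grow without bound
violation = (phi(s_next) - phi(s)).norm(dim=-1) - (s_next - s).norm(dim=-1)
lipschitz_penalty = F.relu(violation).mean()
```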
Simply put, increasing the probability of assigning a skill to a state means reducing the probability of assigning the other skills to it, which effectively separates states across skills. This is the same as maximizing the mutual information $I(S;Z)$.
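As a worked formula, the standard variational lower bound connects this to the reward above: the learned discriminator $q_\phi$ stands in for the intractable posterior $p(z \mid s)$, and with a fixed (e.g. uniform) skill prior the $-\log p(z)$ term is a constant, so maximizing the bound matches $r(s,z) = \log q_\phi(z \mid s)$.

```latex
I(S;Z) = H(Z) - H(Z \mid S)
       = \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\big[\log p(z \mid s) - \log p(z)\big]
       \ge \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\big[\log q_\phi(z \mid s) - \log p(z)\big]
```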
However, distance-based skill discovery for RL exploration has issues with the stop button problem, the Waluigi effect, and instrumental convergence
Skill discovery methods
Discovering distinct skills by maximizing $I(S;Z)$
- Multiple ways to approximate the MI
- MI can be maximized with only small state changes, so skills may barely move through the state space
- Any distance metric can be used to improve exploration
- Distance-based methods may not learn static skills because the distance term encourages ever larger state changes

Seonglae Cho