Skill discovery

Creator: Seonglae Cho
Created: 2024 May 18 9:31
Edited: 2025 Dec 16 13:07

Skills should be emergent, induced by AI Incentive.

An RL agent learns skills (Options) without environment reward.

  • Information-theoretic discovery
    • Plot the distribution over observations
    • Compute its entropy (how random / how broad the distribution is)
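The entropy computation above can be sketched as follows. This is a minimal illustration; `visitation_entropy` is a hypothetical helper, not from any specific library:

```python
# Hypothetical sketch: estimate the Shannon entropy H(S) of a discrete
# state-visitation distribution from rollout counts.
from collections import Counter
import math

def visitation_entropy(states):
    """Shannon entropy (nats) of the empirical state distribution."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# A broad (near-uniform) visitation has higher entropy than a narrow one.
broad = ["s0", "s1", "s2", "s3"]    # uniform over 4 states
narrow = ["s0", "s0", "s0", "s1"]   # concentrated on s0
assert visitation_entropy(broad) > visitation_entropy(narrow)
```

A broad distribution over observations signals wide exploration; a peaked one signals a narrow behavior.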

Maximum entropy RL

Knowing only one solution easily falls into local optima and is not robust to environmental changes. However, action entropy is not the same as state entropy (RL Exploration): diverse actions do not guarantee diverse states.
We want low diversity within a fixed skill and high diversity across skills for a controllable agent (different skills should visit different regions of the state-action space).

Skill policy

Conditioned on a skill vector $z$, the skill policy is trained to visit the desired states.

Discriminator

  1. The goal of the skill policy is to minimize $-\log q_\phi(z \mid s)$, which means minimizing the distance between the state embedding and the skill vector.
  2. This objective is equivalent to maximizing $\log q_\phi(z \mid s)$, which maximizes the probability of the skill given the state.
  3. To achieve this, we set the reward function as $r(s, z) = \log q_\phi(z \mid s)$. This way, when the agent selects skill $z$ in state $s$, it receives the log probability value as a reward.
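The reward in step 3 can be sketched as below. This is a minimal illustration under assumptions: discrete skills, and a softmax-over-negative-distances discriminator; the names and parameterization are hypothetical, not the exact form used in any particular paper:

```python
# Hypothetical sketch of the intrinsic reward r(s, z) = log q(z | s),
# with q(z | s) a softmax over negative squared distances ||phi(s) - z||^2.
import math

def discriminator_log_prob(state_embedding, skills, z_idx):
    """log q(z|s): how attributable the state embedding is to skill z_idx."""
    logits = [-sum((se - zk) ** 2 for se, zk in zip(state_embedding, z))
              for z in skills]
    log_norm = math.log(sum(math.exp(l) for l in logits))
    return logits[z_idx] - log_norm

skills = [(1.0, 0.0), (0.0, 1.0)]   # two 2-D skill vectors
phi_s = (0.9, 0.1)                  # embedding of the current state

# Reward is higher when the state is easy to attribute to the chosen skill.
r0 = discriminator_log_prob(phi_s, skills, 0)
r1 = discriminator_log_prob(phi_s, skills, 1)
assert r0 > r1   # this state is rewarded under skill 0, penalized under skill 1
```

In practice $q_\phi$ is a learned network trained jointly with the policy, but the reward shape is the same: states that identify their skill earn higher reward.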
LSD adds a distance consideration to skill-policy learning by adding a term to maximize $(\phi(s_{t+1}) - \phi(s_t))^\top z$, and regularizes $\phi$ to reflect distance with the Lipschitz constraint $\|\phi(s) - \phi(s')\| \le \|s - s'\|$, preventing $\phi$ from becoming infinitely large.
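A minimal sketch of this directional reward, under assumptions: `phi` values are given as plain tuples, and the Lipschitz constraint on $\phi$ is assumed to be enforced elsewhere during training:

```python
# Hypothetical sketch of an LSD-style reward r = (phi(s') - phi(s))^T z:
# the agent is rewarded for moving the state embedding along skill direction z.
def lsd_reward(phi_s, phi_s_next, z):
    """Directional progress of the embedding along the skill vector z."""
    return sum((pn - p) * zi for p, pn, zi in zip(phi_s, phi_s_next, z))

z = (1.0, 0.0)                          # skill: move in the +x embedding direction
phi_s, phi_s_next = (0.0, 0.0), (0.5, 0.2)
assert lsd_reward(phi_s, phi_s_next, z) > 0   # progress along z is rewarded
```

Because the Lipschitz constraint bounds embedding distance by state distance, large rewards require genuinely large movements in state space, not inflated embeddings.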
Simply put, increasing the probability of assigning a skill to a state means reducing the probability of the other skills, which effectively separates states across skills. This is the same as maximizing Mutual information $I(S; Z)$.
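The decomposition $I(S; Z) = H(S) - H(S \mid Z)$ can be computed directly from a joint visitation table. A minimal sketch, assuming discrete states and skills and a hypothetical `joint[z][s]` probability table:

```python
# Hypothetical sketch: I(S; Z) = H(S) - H(S | Z) from joint probabilities.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """joint[z][s] = p(z, s); returns I(S; Z) in nats."""
    p_s = [sum(row[s] for row in joint) for s in range(len(joint[0]))]
    h_s = entropy(p_s)                                   # marginal H(S)
    h_s_given_z = sum(sum(row) * entropy([p / sum(row) for p in row])
                      for row in joint)                  # conditional H(S|Z)
    return h_s - h_s_given_z

# Two skills visiting disjoint states are maximally informative; two skills
# visiting the same states are indistinguishable (zero mutual information).
separated = [[0.5, 0.0], [0.0, 0.5]]
overlapping = [[0.25, 0.25], [0.25, 0.25]]
assert mutual_information(separated) > mutual_information(overlapping)
```

High $I(S; Z)$ means observing a state tells you which skill produced it, which is exactly the separation the discriminator objective encourages.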
However, distance-based skill discovery for RL Exploration has issues with the Stop button problem, the Waluigi Effect, and Instrumental Convergence.
Skill discovery Methods
 

Discovering distinct skills by maximizing mutual information $I(S; Z)$:

  • There are multiple ways to approximate the MI objective.
  • Even small state changes can maximize MI.
  • Any distance metric can be used to improve exploration.
  • It may not learn static skills, because the distance factor encourages ever-larger movement.