Skill discovery

An RL agent learns skills (options) without environment reward.

Information Theoretic discovery

Plot the distribution over observations.
Compute its entropy (how random, how broad it is): $H(p(x)) = -E_{x\sim p(x)} [\log p(x)]$
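
As a rough illustration of the entropy formula above, the sketch below estimates $H(p(x))$ from samples by discretizing 1-D observations into a histogram. The bin count, the fixed range `(0, 1)`, and the two sample distributions are assumptions made only for this example.

```python
import numpy as np

def empirical_entropy(observations, bins=10, value_range=(0.0, 1.0)):
    """Estimate H(p(x)) in nats from 1-D samples via a fixed-range histogram."""
    counts, _ = np.histogram(observations, bins=bins, range=value_range)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # treat 0 * log 0 as 0
    return float(-np.sum(probs * np.log(probs)))

rng = np.random.default_rng(0)
broad = rng.uniform(0.0, 1.0, size=10_000)   # spread over the whole range
narrow = rng.normal(0.5, 0.01, size=10_000)  # concentrated near 0.5

h_broad = empirical_entropy(broad)
h_narrow = empirical_entropy(narrow)
```

The broad distribution gets an entropy close to the maximum $\log 10$ for ten bins, while the concentrated one scores much lower.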

Maximum entropy RL

Knowing only one solution makes an agent prone to local optima and not robust to environmental changes. However, action entropy is not the same as state entropy (RL Exploration): diverse actions do not guarantee diverse states.
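
A toy example can make the gap between action entropy and state entropy concrete. The sketch below assumes a hypothetical unbounded 1-D chain where each action moves the agent one step left or right. It compares a uniformly random policy (maximum action entropy) with a policy that always moves right (zero action entropy).

```python
import numpy as np

def visited_states(actions):
    """States reached on a 1-D chain starting at 0; each action is -1 or +1."""
    return np.concatenate([[0], np.cumsum(actions)])

rng = np.random.default_rng(0)
T = 400
random_actions = rng.choice([-1, 1], size=T)  # maximum action entropy per step
fixed_actions = np.ones(T, dtype=int)         # zero action entropy

n_random = len(np.unique(visited_states(random_actions)))
n_fixed = len(np.unique(visited_states(fixed_actions)))
```

The random policy keeps revisiting states near the start, so it covers far fewer distinct states than the deterministic one, even though its actions are as diverse as possible.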

We can lower diversity for a fixed skill $z$ and keep diversity high across options, giving a controllable agent (different skills should visit different regions of the state-action space).

Skill policy $\pi(a | s,z)$

Conditioned on the skill vector, the skill policy is trained to visit the desired states.

Discriminator $\hat p_\theta(z|s')$

The goal of the skill policy is to minimize $\|\phi(s') - z\|$, the distance between the embedding $\phi(s')$ of the state $s'$ and the skill $z$.

This amounts to maximizing $p(z | \phi(s'))$, the probability of the skill $z$ given the state $s'$.

To do so, the reward function is set to $r(s', z) = \log p(z | s')$, so an agent executing skill $z$ is rewarded at state $s'$ with the log-probability of that skill.
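
One way to see why minimizing the distance and maximizing the log-probability coincide is to assume a unit-variance Gaussian model $p(z | \phi(s')) = \mathcal{N}(z; \phi(s'), I)$. That modelling choice, and the function name `gaussian_skill_reward`, are assumptions for this sketch and not something the notes specify.

```python
import numpy as np

def gaussian_skill_reward(phi_next, z):
    """r(s', z) = log N(z; phi(s'), I); larger when phi(s') is closer to z."""
    d = z.shape[-1]
    sq_dist = np.sum((phi_next - z) ** 2, axis=-1)
    return -0.5 * sq_dist - 0.5 * d * np.log(2.0 * np.pi)

z = np.array([1.0, 0.0])      # skill vector
near = np.array([0.9, 0.1])   # hypothetical embedding phi(s') close to z
far = np.array([-1.0, 0.5])   # hypothetical embedding phi(s') far from z

r_near = gaussian_skill_reward(near, z)
r_far = gaussian_skill_reward(far, z)
```

Under this assumption the reward is $-\tfrac{1}{2}\|\phi(s') - z\|^2$ plus a constant, so states whose embedding lies closer to $z$ receive a higher reward.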

LSD learns the skill policy together with $\phi(s)$. To take distance into account, it adds a $\cdot z$ term and maximizes $(\phi(s') - \phi(s)) \cdot z$, and it regularizes $\phi(s)$ to reflect distance in $s$, $\|\phi(s') - \phi(s)\| \le \|s' - s\|$, which prevents $\phi(s)$ from becoming infinitely large.
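
The objective and the constraint can be sketched with a linear embedding $\phi(s) = W s$, used here only as a stand-in for a learned network. Rescaling $W$ by its largest singular value is one simple way to enforce the 1-Lipschitz condition. The helper names and the random data are hypothetical.

```python
import numpy as np

def make_lipschitz_linear(W):
    """Rescale W so that phi(s) = W @ s is 1-Lipschitz in the Euclidean norm."""
    sigma_max = np.linalg.norm(W, ord=2)  # largest singular value of W
    return W / max(1.0, sigma_max)

def lsd_reward(W, s, s_next, z):
    """(phi(s') - phi(s)) . z for the linear phi above."""
    return float((W @ s_next - W @ s) @ z)

rng = np.random.default_rng(0)
W = make_lipschitz_linear(rng.normal(size=(2, 4)) * 5.0)
s = rng.normal(size=4)
s_next = s + rng.normal(size=4) * 0.1
z = np.array([1.0, 0.0])

r = lsd_reward(W, s, s_next, z)
embed_dist = np.linalg.norm(W @ s_next - W @ s)
state_dist = np.linalg.norm(s_next - s)
```

Because the embedding distance cannot exceed the state distance, the reward is bounded by $\|s' - s\| \, \|z\|$, and the only way to increase it is to actually move farther in state space along the direction $z$.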

Put simply, raising the probability of the skill assigned to a state means lowering the probability of the other skills, so states and skills are tied together and well separated, which is the same as maximizing Mutual information.
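
The link to mutual information can be written as a standard variational lower bound. Here $H(Z)$ is the entropy of the skill distribution and $\hat p_\theta$ is the discriminator. This is a sketch of the usual argument and not necessarily the exact objective of each method listed on this page.

$$
I(Z; S') = H(Z) - H(Z | S') = H(Z) + E_{z, s'}[\log p(z | s')] \ge H(Z) + E_{z, s'}[\log \hat p_\theta(z | s')]
$$

The inequality holds because the KL divergence between the true posterior $p(z | s')$ and the discriminator $\hat p_\theta(z | s')$ is non-negative, so raising the expected $\log \hat p_\theta(z | s')$ raises a lower bound on the mutual information.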

However, for the RL Exploration of distance-based skill discovery, see: Stop button problem, Waluigi Effect, Instrumental Convergence

Skill discovery Methods

METRA

CSD

DIAYN

Lipschitz-constrained Skill Discovery

DADS

CIC

Discovering distinct skills by maximizing

Multiple ways to approximate MI

MI can be maximized with only small state changes

Any distance can be used to improve exploration

May not learn static skills, because the distance factor encourages more and more
