Transition Policy

Creator: Seonglae Cho
Created: 2024 Jun 18 14:54
Edited: 2024 Jun 18 14:54
Naively fine-tuning skills so they can be chained fails because each skill ends up with a very large terminal state distribution, so the next skill rarely starts from a state it can handle.
The reward for learning a transition policy is the success of the following skill's policy. A proximity predictor that provides a dense proximity reward, rather than a sparse binary success reward, gives a better estimate of how close the current state is to a good initial state for the next skill. The paper defines proximity as $P(s) = \delta^{\text{step}}$ with $\delta \in (0,1)$ acting like a discount factor, where step is the number of steps from $s$ to success, and provides a proximity reward of $P(s_{t+1}) - P(s_t)$ at every step. The proximity predictor is trained from a success buffer and a failure buffer.