RL tasks can be solved by casting them as transformer sequence modelling (Decision Transformer)
Tokenizes states, actions, and returns-to-go, and trains the transformer to predict the next action from these sequences
- Compute the return-to-go at each timestep and train on the resulting trajectory sequences (a sketch of the computation follows)
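A minimal sketch of the return-to-go computation, assuming a per-timestep reward array; the function name `returns_to_go` is illustrative, not from the paper's code. Decision Transformer conditions on undiscounted returns, i.e. suffix sums of the rewards:

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Suffix sums of the reward sequence: RTG_t = sum over t' >= t of r_t'."""
    return np.cumsum(rewards[::-1])[::-1].copy()

# Example: rewards [1, 0, 2] -> returns-to-go [3, 2, 2]
print(returns_to_go(np.array([1.0, 0.0, 2.0])))
```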
- The return-conditioned policy can be used for policy rollouts (see the sketch after this list)
  - Prompt with the desired return at the start of the episode
  - After each step, subtract the observed reward from the return-to-go
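A sketch of the evaluation-time rollout loop under these assumptions: `model.predict` is a hypothetical interface that maps the recent (return-to-go, state, action) context to the next action, `env` follows the Gymnasium step API, and `target_return` and `context_len` are placeholder parameters:

```python
import numpy as np

def rollout(model, env, target_return: float, context_len: int = 20) -> float:
    """Roll out a return-conditioned policy, Decision Transformer style."""
    obs, _ = env.reset()
    rtg, states, actions = [target_return], [obs], []
    done, total_reward = False, 0.0
    while not done:
        # Condition on the most recent K timesteps of (RTG, state, action) tokens.
        action = model.predict(
            np.array(rtg[-context_len:]),
            np.array(states[-context_len:]),
            np.array(actions[-context_len:]) if actions else None,
        )
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        # Decrement the return-to-go by the reward actually received,
        # so the conditioning stays consistent with training.
        rtg.append(rtg[-1] - reward)
        states.append(obs)
        actions.append(action)
    return total_reward
```

The key step is the decrement of the return-to-go after every environment step; the remaining desired return is what the model is conditioned on at the next timestep.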
Works well on long-horizon, sparse-reward tasks compared to traditional RL
What explains the success of this approach?
- More supervision (a prediction target at every timestep), and the MDP can differ per task
NeurIPS 2021