RL tasks can be solved with transformer sequence modelling
Tokenizes returns-to-go, states, and actions, and trains the transformer to predict future actions from these sequences
- Compute the return-to-go for each timestep and train on the trajectory sequences (see the sketch after this list)
- The return-conditioned policy can then be used for policy rollout (rollout sketch also below)
- At rollout, condition on a desired return
- After each step, subtract the observed reward from the return-to-go
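
A minimal sketch of the return-to-go computation used to build training sequences. Function name and the gamma parameter are illustrative; the Decision Transformer paper uses undiscounted returns (gamma = 1).

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Return-to-go R_t = sum over t' >= t of gamma^(t'-t) * r_t'.
    Each timestep then contributes a (return-to-go, state, action)
    token triple, and the model is trained to predict the action."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the end
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: sparse reward of 1 only at the final step
print(returns_to_go(np.array([0.0, 0.0, 0.0, 1.0])))  # [1. 1. 1. 1.]
```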
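A sketch of the return-conditioned rollout loop, assuming a Gym-style `env` whose `step` returns `(state, reward, done, info)` and a hypothetical `model.predict_action` interface that consumes the (return-to-go, state, action) history; neither is the paper's actual API.

```python
def rollout(env, model, target_return, max_steps=1000):
    """Start from the user-chosen desired return, then subtract each
    observed reward from the return-to-go after every environment step."""
    state = env.reset()
    rtg = target_return                       # desired return to condition on
    rtgs, states, actions = [rtg], [state], []
    for _ in range(max_steps):
        # Hypothetical interface: predict the next action from the history
        action = model.predict_action(rtgs, states, actions)
        state, reward, done, _ = env.step(action)
        rtg -= reward                         # decrement return-to-go
        rtgs.append(rtg); states.append(state); actions.append(action)
        if done:
            break
    return states, actions
```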

Works well on long-horizon, sparse-reward tasks compared to traditional RL
Why does this sequence-modeling approach succeed?
- It provides more supervision, and the underlying MDP can differ per task
NeurIPS 2021
Decision Transformer: Reinforcement Learning via Sequence Modeling
We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in…
https://ar5iv.labs.arxiv.org/html/2106.01345

synthetic data training

Seonglae Cho