Option 1: Distill planner’s actions into a policy
No longer compute-intensive at test time, but still limited to short-horizon problems
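A minimal sketch of this distillation, assuming a toy linear `planner` as a hypothetical stand-in for an expensive search procedure; the policy is fit by supervised regression on the planner's chosen actions:

```python
import numpy as np

rng = np.random.default_rng(0)

def planner(state):
    """Hypothetical expensive planner: here just a fixed linear rule."""
    return -2.0 * state  # pretend this action came from search

# Collect (state, action) pairs from the planner, then fit a policy
# by supervised learning (least squares for a linear policy).
states = rng.normal(size=(100, 1))
actions = np.array([planner(s) for s in states])

# Linear policy a = s @ w, fit with least squares.
w, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(state):
    return state @ w  # cheap at test time: no planning needed

s = np.array([1.5])
print(np.allclose(policy(s), planner(s)))
```

The distilled policy avoids planning at test time, but it only covers the horizons the planner itself handled.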
Option 2: Plan with terminal value function
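A toy sketch of this idea: plan over a short horizon and score each candidate action sequence by its summed rewards plus a terminal value estimate `V(s_H)`. The dynamics, reward, and value function here are illustrative assumptions, not a real environment:

```python
import numpy as np

def step(s, a):
    return s + a  # toy deterministic dynamics

def reward(s, a):
    return -(s ** 2) - 0.1 * (a ** 2)  # quadratic cost around 0

def terminal_value(s):
    return -5.0 * (s ** 2)  # hypothetical learned V(s)

def plan(s0, horizon=3):
    """Score constant-action candidates: sum of rewards + V(final state)."""
    best_a, best_score = None, -np.inf
    for a in np.linspace(-1.0, 1.0, 21):
        s, score = s0, 0.0
        for _ in range(horizon):
            score += reward(s, a)
            s = step(s, a)
        score += terminal_value(s)  # V stands in for the rest of the return
        if score > best_score:
            best_a, best_score = a, score
    return best_a

print(plan(3.0))
```

The terminal value function lets a short-horizon planner account for long-horizon consequences without rolling the model out further.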
Option 3: Augment model-free RL methods with data from model
When the model generates full trajectories from initial states, it may be inaccurate over long horizons. Generating partial trajectories only from initial states also gives poor coverage of later states. Instead, augment the data by generating short partial trajectories from all states in the dataset.
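The branching scheme above can be sketched as follows; the learned `model` and `policy` are toy stand-ins, and the point is the loop structure, short rollouts from every state in the real-data buffer rather than long rollouts from initial states:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(s, a):
    """Hypothetical learned dynamics model."""
    return 0.9 * s + a

def policy(s):
    return -0.5 * s + rng.normal(scale=0.1)  # exploratory policy

real_states = list(rng.normal(size=5))  # states collected in the real env

def augment(states, k=3):
    """Branch a length-k model rollout from EVERY state in the buffer."""
    synthetic = []
    for s in states:
        for _ in range(k):  # short horizon keeps model error small
            a = policy(s)
            s_next = model(s, a)
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic

data = augment(real_states)
print(len(data))  # 5 buffer states * 3 model steps = 15 synthetic transitions
```

Because every visited state spawns its own short rollout, the synthetic data covers later states while each individual rollout stays within the model's accurate horizon.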