Passive reinforcement learning

Created: 2019 Nov 5 5:18
Edited: 2024 Apr 4 13:55
T and R are unknown and not modeled → simplified task - just evaluate a fixed policy by the value of each state
So the goal is to compute the value of each state under the policy (input is the policy → train on observed episodes → output is a value per state)
Direct evaluation - for each state, sum the returns observed after visiting it and divide by the number of visits across episodes
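A minimal sketch of direct evaluation under stated assumptions (the episode format and all names here are hypothetical, not from the note): each episode is a list of `(state, reward)` steps from the fixed policy, and V(s) is the average discounted return observed after visiting s.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Estimate V(s) as the average discounted return observed
    after each visit to s.  Each episode is [(state, reward), ...]."""
    totals = defaultdict(float)  # sum of observed returns per state
    counts = defaultdict(int)    # number of visits per state
    for episode in episodes:
        # Work backwards so each step's return is reward + gamma * later return.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Example: two short episodes produced by some fixed policy.
episodes = [
    [("A", 0), ("B", 1)],
    [("A", 0), ("B", 1)],
]
V = direct_evaluation(episodes, gamma=1.0)  # → {"A": 1.0, "B": 1.0}
```

Note the weakness the next line addresses: V only improves after whole episodes finish, and each state is estimated independently.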
Sample-based policy evaluation - improve V on every transition, using the sampled outcome in place of an expectation over T and R
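A sketch of one such per-transition update (temporal-difference style; the function name and dictionary representation are assumptions for illustration): V(s) is nudged toward the sampled target `reward + gamma * V(s')` at learning rate alpha.

```python
def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """One sample-based update: move V(s) toward the sampled
    target reward + gamma * V(s_next), at learning rate alpha."""
    sample = reward + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V

# One observed transition A --(reward 1.0)--> B, starting from V = 0.
V = td_update({}, "A", 1.0, "B", alpha=0.5, gamma=1.0)  # → {"A": 0.5}
```

No model of T or R is built; each sampled transition stands in for the expectation.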
 
The optimal value function corresponds to high Q values
Q value - prediction of all future rewards R after taking an action
V value - prediction of all future rewards R from a state
Policy - the action per state which maximizes V
Alpha - the learning rate
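In symbols (standard RL notation, not spelled out in the note; assumes a deterministic policy π and discount γ):

```latex
Q^{\pi}(s,a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big]
\qquad
V^{\pi}(s) = Q^{\pi}(s, \pi(s))
\qquad
\pi^{*}(s) = \arg\max_a Q^{*}(s,a)
```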
So most reinforcement learning in practice is Q-learning - the MDP model is not used directly, it is just for understanding
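The Q-learning update the line above refers to can be sketched as follows (a minimal tabular version; the function name, tuple-keyed dictionary, and action list are assumptions for illustration):

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=1.0):
    """One Q-learning update: move Q(s, a) toward the sampled target
    reward + gamma * max over the next state's actions."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    sample = reward + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q

# One observed transition: in state A, action "right" earned reward 1.0.
Q = q_update({}, "A", "right", 1.0, "B", actions=["left", "right"])
```

Unlike passive evaluation above, this learns Q values for every action, so no model of T or R is ever needed to extract a policy (pick the argmax action per state).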
 
 
