T and R are not known → simplified task - just evaluate a fixed policy by state values
So the goal is to compute the value of each state under the policy (input is the policy → (learn from observed episodes) → output is a value per state)
Direct evaluation - for each state, sum the returns observed after visiting it and divide by the number of visits, averaging over episodes (sketch below)
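A minimal sketch of direct evaluation, assuming each episode is a list of (state, reward) pairs collected while following the fixed policy; the names (direct_evaluation, episodes, gamma) and the episode format are illustrative assumptions, not from the notes.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    # episodes: list of episodes, each a list of (state, reward) pairs (assumed format)
    returns_sum = defaultdict(float)   # total return observed after each state
    visit_count = defaultdict(int)     # how many times each state was visited

    for episode in episodes:
        # Walk the episode backwards, accumulating the discounted return G
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            visit_count[state] += 1

    # V(s) = average return observed from s over all visits
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

# Tiny usage example (hypothetical episodes):
episodes = [[("A", 0), ("B", 1)], [("A", 0), ("C", -1)]]
print(direct_evaluation(episodes))   # {'B': 1.0, 'A': 0.0, 'C': -1.0}
```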
Sample-based policy evaluation (temporal-difference style) - improve V after every transition, using the sampled outcome in place of the unknown T and R (update sketch below)
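A minimal sketch of the per-transition update, assuming we observe a sample (s, r, s') while following the policy; alpha is the learning rate mentioned below and gamma the discount. Function and variable names are illustrative.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Sample target: observed reward plus discounted value of the observed next state
    sample = r + gamma * V.get(s_next, 0.0)
    # Move V(s) a fraction alpha toward the sample (running average of samples)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```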
Optimal value function = the highest Q-value in each state (V(s) = max over actions of Q(s, a))
Q-value = prediction of all future reward from taking an action in a state
V-value = prediction of all future reward from a state
Policy = the action chosen in each state, which should maximize V
Alpha (α) = learning rate
So most reinforcement learning in practice is Q-learning - the explicit MDP model (known T and R) is not used, it is mainly for understanding (update sketch below)
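A minimal sketch of the Q-learning update under the same assumptions (one observed transition (s, a, r, s'), learning rate alpha, discount gamma); the target uses the best next action rather than the policy's action. Names, defaults, and the Q dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Target: reward plus discounted Q-value of the best action in the next state
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    sample = r + gamma * best_next
    # Blend the old estimate with the sample, weighted by the learning rate alpha
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# Usage: Q maps (state, action) pairs to values; defaultdict gives 0.0 defaults
Q = defaultdict(float)
q_update(Q, s="A", a="right", r=1.0, s_next="B", actions=["left", "right"])
# As in the notes above: V(s) = max over a of Q(s, a), and the policy picks that argmax action
```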