T and R are not known → simplified task - just evaluate a fixed policy by state values
So the goal is to compute the value of each state under the policy (input is the policy → (learn from observed episodes) → output is a value per state)
Direct evaluation - for each state, sum the returns observed after visiting it and divide by the number of visits, averaging over episodes (sketch below)
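A minimal sketch of direct evaluation, assuming each episode is a list of (state, reward) pairs collected while following the fixed policy; the names (direct_evaluation, episodes, gamma) and the episode format are illustrative assumptions, not from the notes.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    # episodes: list of episodes, each a list of (state, reward) pairs (assumed format)
    returns_sum = defaultdict(float)   # total return observed after each state
    visit_count = defaultdict(int)     # how many times each state was visited

    for episode in episodes:
        # Walk the episode backwards, accumulating the discounted return G
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            visit_count[state] += 1

    # V(s) = average return observed from s over all visits
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

# Tiny usage example (hypothetical episodes):
episodes = [[("A", 0), ("B", 1)], [("A", 0), ("C", -1)]]
print(direct_evaluation(episodes))   # {'B': 1.0, 'A': 0.0, 'C': -1.0}
```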
Sample-based policy evaluation (temporal-difference style) - improve V after every transition, using the sampled outcome in place of the unknown T and R (update sketch below)
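A minimal sketch of the per-transition update, assuming we observe a sample (s, r, s') while following the policy; alpha is the learning rate mentioned below and gamma the discount. Function and variable names are illustrative.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Sample target: observed reward plus discounted value of the observed next state
    sample = r + gamma * V.get(s_next, 0.0)
    # Move V(s) a fraction alpha toward the sample (running average of samples)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```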
Optimal value function = the highest Q-value in each state (V(s) = max over actions of Q(s, a))
Q-value = prediction of all future reward from taking an action in a state
V-value = prediction of all future reward from a state
Policy = the action chosen in each state, which should maximize V
Alpha (α) = learning rate
So most reinforcement learning in practice is Q-learning - the explicit MDP model (known T and R) is not used, it is mainly for understanding (update sketch below)
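A minimal sketch of the Q-learning update under the same assumptions (one observed transition (s, a, r, s'), learning rate alpha, discount gamma); the target uses the best next action rather than the policy's action. Names, defaults, and the Q dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Target: reward plus discounted Q-value of the best action in the next state
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    sample = r + gamma * best_next
    # Blend the old estimate with the sample, weighted by the learning rate alpha
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# Usage: Q maps (state, action) pairs to values; defaultdict gives 0.0 defaults
Q = defaultdict(float)
q_update(Q, s="A", a="right", r=1.0, s_next="B", actions=["left", "right"])
# As in the notes above: V(s) = max over a of Q(s, a), and the policy picks that argmax action
```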