Basic idea - take a random action → check reward → observe the resulting state → learn → improve the policy (sketched in the loop below)
Receive feedback in the form of rewards
Agent’s utility is defined by the reward function
Must (learn to) act so as to maximize expected rewards
All learning is based on observed samples of outcomes
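A minimal sketch of this act → reward → observe → learn loop, assuming a hypothetical `env` with Gym-style `reset()`/`step()` methods and a placeholder random-action agent (all names here are illustrative, not from the notes):

```python
import random

# Sketch of the basic RL interaction loop.
# Assumptions: env.reset() returns a state, env.step(action) returns
# (next_state, reward, done); `learn` is any callback that consumes
# the observed sample (state, action, reward, next_state).

def run_episode(env, actions, learn):
    state = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = random.choice(actions)               # random action (exploration)
        next_state, reward, done = env.step(action)   # check reward, observe next state
        learn(state, action, reward, next_state)      # learn from the observed sample
        total_reward += reward
        state = next_state
    return total_reward
```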
States
environment state - the environment's own internal representation (generally not visible to the agent)
agent state - the agent's internal representation - the state most commonly used to choose actions
information state (Markov state) - the probability of the next state given the current state equals the probability given the entire history → the state is Markov (the future is independent of the past given the present)
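Stated precisely (standard notation, with S_t the state at time t; the symbols are the usual convention, not from these notes):

```latex
% Markov property: the current state is a sufficient statistic of the history
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, S_2, \ldots, S_t]
```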
Others
history - the sequence of observations, actions, and rewards up to time t: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
value iteration - repeatedly apply the Bellman optimality backup until the values converge, then read off the greedy policy
policy iteration - alternate policy evaluation (compute the value of the current policy) and policy improvement (act greedily with respect to it) until the policy stops changing
MDP - the underlying model (states, actions, transitions, rewards); solving it yields a policy and a value function
Both value iteration and policy iteration compute the same thing (all optimal values); a sketch of both follows below
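A minimal tabular sketch of both methods, under assumed (hypothetical) inputs: `states` and `actions` are lists, `P[s][a]` is a list of `(prob, next_state, reward)` transitions, and `gamma` is the discount factor. This is an illustration of the standard algorithms, not code from the course:

```python
def value_iteration(states, actions, P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: V(s) = max_a sum_s' p * (r + gamma * V(s'))
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy extraction from the converged values
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in states
    }
    return V, policy


def policy_iteration(states, actions, P, gamma=0.9, theta=1e-8):
    policy = {s: actions[0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate Bellman expectation backups for the current policy
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated values
        stable = True
        for s in states:
            best_a = max(actions,
                         key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return V, policy
```

Both return the same optimal values in the limit: value iteration folds the improvement step into every backup, while policy iteration fully evaluates the current policy before improving it.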