Structure
- Agent, Environment, Action, State, Reward
Basic idea - take an action (random at first) → check the reward → observe the new state → learn from that sample → improve the policy (see the loop sketch after this list)
- Receive feedback in the form of rewards
- Agent’s utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards
- All learning is based on observed samples of outcomes
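
A minimal Python sketch of that loop. The `env.reset()`/`env.step()` interface and the `learn` callback are assumptions (gym-style), not anything defined in these notes:

```python
import random

def random_policy(state, actions=("left", "right")):
    """Pure exploration: pick an action uniformly at random."""
    return random.choice(actions)

def run_episode(env, policy, learn):
    """One pass of the loop: act -> check reward -> observe new state -> learn."""
    state = env.reset()                              # initial observation
    done = False
    while not done:
        action = policy(state)                       # random at first, greedier later
        next_state, reward, done = env.step(action)  # environment feedback
        learn(state, action, reward, next_state)     # update from the observed sample
        state = next_state
```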
Reinforcement Learning Terms
States
- environment state - the environment's own internal representation; usually not fully visible to the agent
- agent state - the agent's internal representation; the state most RL algorithms actually work with
- information state (Markov state) - the probability of the next state given the current state equals the probability given the entire history → the future is independent of the past given the present
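
Written out (the standard formulation behind that shorthand), a state S_t is Markov if and only if:

```latex
\Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, S_2, \ldots, S_t]
```

That is, the current state already carries everything the full history could tell us about the future.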
Others
- history - the sequence of observations, actions, and rewards seen so far
- value iteration - repeatedly apply the Bellman optimality backup until the state values converge
- policy iteration - alternate policy evaluation (compute the value of the current policy) with greedy policy improvement
- MDP (Markov Decision Process) - the formal model: states, actions, transition probabilities, rewards; solving it yields a policy and a value function
Both value iteration and policy iteration compute the same thing (all optimal values); the sketch below shows this on a toy MDP.
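
A runnable comparison on a made-up two-state, two-action MDP. The transition tensor `P`, reward matrix `R`, and discount `gamma` below are illustrative numbers, not from these notes; both routines should print the same optimal state values:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(tol=1e-8):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = np.zeros(2)
    while True:
        Q = R + gamma * P @ V    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)    # act greedily in every state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    """Alternate exact policy evaluation with greedy policy improvement."""
    policy = np.zeros(2, dtype=int)
    while True:
        # Evaluation: solve the linear system V = R_pi + gamma * P_pi @ V exactly.
        P_pi = P[np.arange(2), policy]
        R_pi = R[np.arange(2), policy]
        V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to V.
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V
        policy = new_policy

print(value_iteration())   # same optimal values...
print(policy_iteration())  # ...from both algorithms
```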