Map situations to actions so as to maximize a numeric reward signal; this mapping is the policy
Unlike supervised and unsupervised learning, which assume i.i.d. data, RL must account for compounding error, because each action affects the states seen later
An agent defined within an environment perceives its current state and, among the available actions, selects the one that maximizes reward
We obtain the world state by interacting with the world (perception)
The expected return of a policy is the return averaged over all possible trajectories, each weighted by its probability under that policy
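
A minimal sketch of the expected-return definition, assuming a hypothetical toy problem that is not from these notes: the agent picks "left" or "right" for two steps, with made-up action probabilities and rewards. The names `POLICY`, `REWARD`, `HORIZON`, and `expected_return` are illustrative only. Because the problem is tiny, every trajectory can be enumerated exactly instead of sampled.

```python
from itertools import product

# Hypothetical toy setup (all values are assumptions for illustration).
POLICY = {"left": 0.25, "right": 0.75}  # state-independent action probabilities
REWARD = {"left": 0.0, "right": 1.0}    # reward received for each action
HORIZON = 2                             # number of steps per trajectory


def expected_return(policy, reward, horizon):
    """Sum P(trajectory) * return(trajectory) over every possible trajectory."""
    total = 0.0
    # Each trajectory is one sequence of actions of length `horizon`.
    for trajectory in product(policy, repeat=horizon):
        prob = 1.0  # probability of this trajectory under the policy
        ret = 0.0   # undiscounted return of this trajectory
        for action in trajectory:
            prob *= policy[action]
            ret += reward[action]
        total += prob * ret
    return total


value = expected_return(POLICY, REWARD, HORIZON)
print(value)
```

In realistic problems the set of trajectories is far too large to enumerate, so the same quantity is usually estimated by sampling trajectories and averaging their returns.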
- Sequential decision-making problems (sequential decision making is everywhere)
- An approach for learning to make decisions
Reinforcement Learning Notion
Reinforcement Learning Usages
OpenAI
CS285