Maps situations to actions via a numeric reward signal in order to learn a policy model
An approach to sequential decision-making problems (and sequential decision making is everywhere)
Reinforcement Learning is the process of solving an MDP (Markov Decision Process)
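A minimal sketch of the MDP interaction loop this describes, assuming a Gymnasium-style environment (the CartPole environment and the random policy are illustrative placeholders, not part of the notes above):

```python
import gymnasium as gym

# The RL loop: observe state, choose action, receive reward, repeat.
env = gym.make("CartPole-v1")           # illustrative environment choice
state, _ = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy pi(a|s)
    state, reward, terminated, truncated, _ = env.step(action)
    episode_return += reward            # numeric reward signal from the environment
    done = terminated or truncated
env.close()
print(f"episode return: {episode_return}")
```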
Unlike supervised and unsupervised learning, which assume i.i.d. data (no data point depends on any other), RL must contend with compounding error: the Markov chain of states and actions shifts the data distribution across different parts of the sequence. RL accounts for this distribution shift at every time step, whereas supervised learning, even on time-series data, does not consider the shift at each inference and instead accounts for it only indirectly within the model.
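A toy simulation of why error compounds, assuming a behavior-cloned policy that errs with probability eps per step and, as a deliberate simplification, never recovers once it leaves the expert's state distribution:

```python
import random

# Once the first error pushes the agent off-distribution, it keeps paying
# cost for the rest of the episode, so expected cost grows roughly
# O(eps * T^2) in the horizon T rather than O(eps * T).
def episode_cost(T: int, eps: float, rng: random.Random) -> int:
    off_distribution, cost = False, 0
    for _ in range(T):
        if off_distribution:
            cost += 1                   # simplification: no recovery mechanism
        elif rng.random() < eps:
            off_distribution, cost = True, cost + 1
    return cost

rng, eps = random.Random(0), 0.01
for T in (10, 100, 1000):
    mean = sum(episode_cost(T, eps, rng) for _ in range(2000)) / 2000
    print(f"T={T:4d}  mean cost={mean:8.2f}  (eps*T = {eps*T:.1f})")
```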
During actual training the data structures look the same, but rewards are derived from the environment rather than from ground-truth labels as in supervised learning. The practical difference between supervised and reinforcement learning is whether the learner interacts with the environment, and whether that interaction happens offline, online, or through a model-based approach.
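A sketch contrasting the two update rules under that framing: in the supervised step the target is a ground-truth label, while in the RL (REINFORCE-style) step the signal is a reward returned by the environment. Here `model`, `optimizer`, and `env_reward_fn` are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

# Supervised step: the learning signal is a label from the dataset.
def supervised_step(model, x, label, optimizer):
    loss = F.cross_entropy(model(x), label)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# RL step (REINFORCE-style): the learning signal is a reward from the
# environment for the action the policy actually sampled.
def reinforce_step(model, x, optimizer, env_reward_fn):
    dist = torch.distributions.Categorical(logits=model(x))
    action = dist.sample()
    reward = env_reward_fn(action)           # comes from interaction, not labels
    loss = -dist.log_prob(action) * reward   # score-function policy gradient
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```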
Indirect supervision: an agent defined within an environment observes the current state and selects, among the available actions, those that maximize reward.
One way to think about reinforcement learning mirrors the generator and discriminator pairing of models such as GANs. Typical high-level AI development follows this pattern and requires automated evaluation. Images can be compared visually, but text, code, and audio are much harder to evaluate. A good AI coding assistant should therefore not just deliver results; it should help by breaking tasks down into smaller, easily verifiable steps. This emphasis on verifiability aligns with the idea of Verifiable Reward, and suggests gradually incorporating larger units such as code blocks or video clips.
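A hedged sketch of what a verifiable reward can look like for generated code: score a candidate by the fraction of small, independently checkable tests it passes. The `add` task and the test cases are illustrative assumptions:

```python
# Reward = fraction of unit tests the generated code passes.
def verifiable_reward(candidate_src: str, tests) -> float:
    namespace = {}
    try:
        exec(candidate_src, namespace)  # in practice, run inside a sandbox
        fn = namespace["add"]           # hypothetical task: implement add()
        passed = sum(1 for args, expected in tests if fn(*args) == expected)
        return passed / len(tests)
    except Exception:
        return 0.0                      # unverifiable output earns no reward

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(verifiable_reward("def add(a, b):\n    return a + b", tests))  # 1.0
print(verifiable_reward("def add(a, b):\n    return a - b", tests))  # ~0.33
```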
Datasets for AI come in three types (illustrative record formats are sketched after this list):
- Background information - Pretraining
- Problems with solutions - SFT
- Practice problems - Reinforcement Learning
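A minimal sketch of what one record of each type might look like. The field names and the toy arithmetic task are assumptions for illustration, not a standard schema:

```python
# Illustrative record formats for the three dataset types above.
pretraining_record = {"text": "An agent interacts with an environment to maximize reward..."}

sft_record = {
    "prompt": "Solve: what is 12 * 7?",
    "response": "12 * 7 = 84",           # the worked solution supervises the model directly
}

rl_record = {
    "prompt": "Solve: what is 12 * 7?",  # a practice problem: no solution included
    "reward_fn": lambda answer: 1.0 if answer.strip() == "84" else 0.0,
}
```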
Reinforcement Learning Notion
Reinforcement Learning Usages
OpenAI
Welcome to Spinning Up in Deep RL! — Spinning Up documentation
https://spinningup.openai.com/en/latest/index.html
CS285
https://rail.eecs.berkeley.edu/deeprlcourse-fa20/
Reinforcement Learning Lecture (CS234), Lecture 7 - Imitation Learning / Inverse RL
- This post summarizes Lecture 7 of CS234. * The beginning of the lecture reviews DQN, but that part is skipped here. Today's topics are Behavioral Cloning, Inverse Reinforcement Learning, and Apprenticeship Learning, among others. So far we have studied optimization and generalization: how to find the optimal policy and how to generalize it. What we look at this time is efficiency, that is, how to find the optimal policy without using a lot of computing power. In a typical MDP, finding a good policy ..
https://cding.tistory.com/71
What is Reinforcement Learning · Fundamental of Reinforcement Learning
https://dnddnjs.gitbooks.io/rl/content/what_is_reinforcement_learning.html

Seong-lae Cho