Reinforcement Learning

Created: 2019 Nov 5 5:18
Edited: 2025 Jun 21 18:19

Map situations to actions via a numeric reward signal, for a policy model $\pi$

An approach to sequential decision-making problems (sequential decision making is everywhere)
Reinforcement Learning is the process of solving an MDP
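Since these notes frame RL as solving an MDP, a minimal sketch may help. The chain MDP below (its states, dynamics, and rewards are my own toy assumptions, not from these notes) is solved with tabular Q-learning, which learns the optimal policy purely from sampled transitions and rewards:

```python
# Minimal sketch: tabular Q-learning on a hand-written chain MDP.
# The toy environment is an illustrative assumption, not any library's API.
import random

# States 0..4; actions: 0 = left, 1 = right. Reaching state 4 pays +1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    """Environment dynamics P(s' | s, a) and reward R(s, a)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, eps = 0.1, 0.9, 0.1

for _ in range(2000):                     # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        if random.random() < eps:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda act: Q[s][act])
        s_next, r, done = step(s, a)
        # TD(0) update toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# Greedy policy after training: "go right" (action 1) in every state.
print([max(range(N_ACTIONS), key=lambda act: Q[s][act]) for s in range(N_STATES)])
```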
Unlike supervised and unsupervised learning, which assume iid data (no data point's distribution depends on any other), RL considers Compounding Error with a Markov Chain changing the distribution at different parts of the sequence. RL addresses compounding error at each time step by considering distribution shift. Supervised learning, even with time-series data, does not consider distribution shift at each inference, but instead accounts for it indirectly within the model.
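A toy simulation of the compounding-error argument (my own construction; the error rates EPS and EPS_OFF are assumed numbers): a model that errs with small probability per step on in-distribution states, but whose error rate jumps once a single mistake pushes it off the training distribution, accumulates mistakes superlinearly in the horizon.

```python
# Hedged illustration: per-step errors compound over a sequential rollout.
# The imitator errs with probability EPS on familiar states; after one slip
# it is off-distribution and errs with the much higher probability EPS_OFF.
import random

EPS, EPS_OFF, TRIALS = 0.02, 0.5, 10_000

for T in (1, 10, 50, 100):
    total_mistakes = 0
    for _ in range(TRIALS):
        on_distribution = True
        for _ in range(T):
            p_err = EPS if on_distribution else EPS_OFF
            if random.random() < p_err:
                total_mistakes += 1
                on_distribution = False   # one slip leaves the training distribution
    # Mean mistakes per trajectory grows much faster than EPS * T.
    print(T, total_mistakes / TRIALS)
```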
In actual training the data structure is the same, but rewards are derived from the environment rather than from ground-truth data as in supervised learning. The practical difference between supervised learning and reinforcement learning is whether there is interaction with the environment, be it online, offline, or mediated by a model-based approach.
$\pi(a \mid s, \theta), \quad \max \sum_i r_i$
Indirect supervision: an agent defined within an environment observes the current state and maximizes reward over the actions available to it.
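A minimal sketch of this objective, assuming a made-up two-armed bandit environment: plain REINFORCE adjusts $\theta$ so that $\pi(a \mid s, \theta)$ concentrates on high-reward actions, maximizing the expected $\sum_i r_i$. This illustrates the objective; it is not a specific algorithm from these notes.

```python
# REINFORCE on a 2-armed bandit (single state). TRUE_MEANS are assumed
# reward means for illustration only.
import math, random

TRUE_MEANS = [0.2, 0.8]          # expected reward per arm (assumption)
theta = [0.0, 0.0]               # policy parameters (softmax logits)
lr = 0.1

def pi(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(5000):
    probs = pi(theta)
    a = random.choices([0, 1], weights=probs)[0]   # sample a ~ pi(. | theta)
    r = random.gauss(TRUE_MEANS[a], 0.1)           # reward from the environment
    # Score-function gradient: grad_theta log pi(a) = onehot(a) - probs
    for i in range(2):
        theta[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print(pi(theta))   # probability mass concentrates on the better arm (index 1)
```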
One approach to reinforcement learning involves generative and discriminative models, as in a GAN. Typical high-level AI development follows this approach and requires automation. While images can be compared visually, text, code, and audio are much harder to evaluate. Therefore, a good AI coding assistant should not just deliver results, but should help by breaking tasks down into smaller, easily verifiable steps. In other words, the importance of verifiability aligns with Verifiable Reward, suggesting that larger units like code blocks or video clips should be incorporated gradually.
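One hedged sketch of what a verifiable reward could look like for code (the `solve` entry-point convention and the `verifiable_reward` helper are hypothetical, not an existing API): the reward is computed mechanically as the fraction of unit tests a candidate program passes, so each small step is cheap to verify.

```python
# Verifiable reward sketch: no learned judge, just run the candidate code
# against input/output test cases and score the pass rate.
def verifiable_reward(candidate_src: str, tests: list[tuple[tuple, object]]) -> float:
    """Return the fraction of test cases the candidate passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        f = namespace["solve"]           # hypothetical convention: entry point is `solve`
    except Exception:
        return 0.0                       # code that does not even load earns nothing
    passed = 0
    for args, expected in tests:
        try:
            if f(*args) == expected:
                passed += 1
        except Exception:
            pass                         # a crashing case simply fails
    return passed / len(tests)

# Usage: small, easily verified steps give a dense, trustworthy signal.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
print(verifiable_reward("def solve(a, b):\n    return a + b", tests))  # 1.0
```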

Datasets for AI come in three types:

  • Problems with solutions - SFT
Reinforcement Learning Notion

Reinforcement Learning Usages

OpenAI

CS285

Recommendations