YSU RL Midterm

Created: 2024 Mar 13 1:13

Creator: Seonglae Cho

Editor: Seonglae Cho

Edited: 2024 Oct 25 22:16

The exam comprises 25 sub-problems, including 8 multiple-choice questions, 10 short-answer questions, 4 simple calculation questions, and 3 proof questions.

He will ask how to interpret the results and the effects of implementation choices, but you won't have to write code.

Time: 10:05-11:45am (100 minutes)

Location: The exam will be held in the same classroom as usual (D504)

Coverage: The midterm will encompass material covered in lectures until this Wednesday (April 17th) + HW1 + HW2 (so, no questions related to offline RL in this midterm)

Question types ("rough" distribution): Multiple-choice questions (~50%), short writing questions (~30%), proof/derivation/calculation questions (~20%)

Imitation Learning

Behavior Cloning

DAgger: its drawback is that it requires an expert policy to label the states visited by the learner

Policy Gradient Theorem*

Markov Decision Process

Gradient*
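The gradient equation did not survive the export. A standard form of the REINFORCE estimator, written in assumed notation ($p_\theta(\tau)$ for the trajectory distribution, $N$ sampled trajectories of length $T$), is:

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]
\approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)
```

The dynamics terms of $\log p_\theta(\tau)$ do not depend on $\theta$, which is why only $\log \pi_\theta$ appears in the sample estimate.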

Reward to Go
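Reward to go replaces the full return with the sum of rewards from step $t$ onward, since the action at $t$ cannot affect earlier rewards. A minimal sketch, assuming a single episode's rewards as a Python list and an optional discount `gamma` (the function name is hypothetical):

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return [sum_{t' >= t} gamma^(t' - t) * r_{t'} for each t] for one episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the already-discounted tail sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


print(rewards_to_go([1.0, 1.0, 1.0]))             # [3.0, 2.0, 1.0]
print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward sweep keeps the cost linear in episode length instead of quadratic.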

Policy Gradient Baseline*: unbiased if the baseline $b$ does not depend on the action $a_t$ (i.e., $b$ and $a_t$ are independent)

expectation

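The expectation image is missing. The standard argument that a constant baseline $b$ adds zero bias uses the log-derivative trick $p_\theta \nabla_\theta \log p_\theta = \nabla_\theta p_\theta$ (notation assumed to match the gradient above):

```latex
\mathbb{E}_{\tau\sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, b\right]
= \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\, b\, d\tau
= \int \nabla_\theta p_\theta(\tau)\, b\, d\tau
= b\,\nabla_\theta \int p_\theta(\tau)\, d\tau
= b\,\nabla_\theta 1 = 0
```

Pulling $b$ out of the integral is exactly the step that breaks if $b$ depends on the action.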

variance

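The variance image is also missing. A standard decomposition, written with $g(\tau) = \nabla_\theta \log p_\theta(\tau)$ (assumed notation, one gradient component at a time), shows that only the first term depends on $b$ and gives the variance-minimizing constant baseline:

```latex
\begin{aligned}
\operatorname{Var} &= \mathbb{E}\left[\big(g(\tau)(r(\tau)-b)\big)^2\right] - \Big(\mathbb{E}\left[g(\tau)(r(\tau)-b)\right]\Big)^2 \\
&= \mathbb{E}\left[g(\tau)^2(r(\tau)-b)^2\right] - \Big(\mathbb{E}\left[g(\tau)\,r(\tau)\right]\Big)^2 \\
\frac{d\operatorname{Var}}{db} &= -2\,\mathbb{E}\left[g(\tau)^2 r(\tau)\right] + 2b\,\mathbb{E}\left[g(\tau)^2\right] = 0
\;\Rightarrow\; b^* = \frac{\mathbb{E}\left[g(\tau)^2 r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}
\end{aligned}
```

The second line uses $\mathbb{E}[g(\tau)\, b] = 0$ from the expectation result.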

You cannot simply use a state-action-dependent baseline and still get unbiased policy gradient estimates.

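The estimator stays unbiased → subtracting the baseline does not change its expectation, so the right-hand (squared-mean) term of the variance stays fixed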

Discount factor

Advantage function
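For reference, the advantage and its usual one-step estimate with a learned critic $\hat V_\phi$ (the critic name is an assumption) are:

```latex
A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V^\pi(s_t)
\approx r(s_t,a_t) + \gamma\,\hat V_\phi(s_{t+1}) - \hat V_\phi(s_t)
```

Using $V^\pi(s_t)$ as the baseline is allowed because it depends only on the state, not on $a_t$.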

Actor Critic

Reinforcement Learning Method

Importance sampling
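Importance sampling estimates an expectation under $p$ using samples from a different distribution $q$ by reweighting each sample with $p(x)/q(x)$; in policy gradients, $p$ plays the role of the new policy and $q$ the old one. A minimal sketch on hypothetical discrete distributions:

```python
import random


def is_estimate(f, p, q, support, n=100_000, seed=0):
    """Estimate E_{x~p}[f(x)] from samples drawn from q (discrete distributions)."""
    rng = random.Random(seed)
    xs = rng.choices(support, weights=[q[x] for x in support], k=n)
    # Each sample from q is reweighted by the likelihood ratio p(x) / q(x).
    return sum(p[x] / q[x] * f(x) for x in xs) / n


support = [0, 1, 2]
p = {0: 0.1, 1: 0.3, 2: 0.6}  # target distribution (e.g. new policy), hypothetical
q = {0: 0.4, 1: 0.4, 2: 0.2}  # sampling distribution (e.g. old policy), hypothetical
f = lambda x: float(x)

exact = sum(p[x] * f(x) for x in support)
print(exact, is_estimate(f, p, q, support))
```

The estimate is unbiased only where $q(x) > 0$ whenever $p(x) > 0$, and its variance grows as the two distributions drift apart, which is part of PPO's motivation for keeping the ratio near 1.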

GAE
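A minimal GAE sketch over one rollout, assuming plain Python lists where `dones[t] = 1.0` means the episode ended after step `t` and `last_value` is the critic's estimate for the state after the final step (all names hypothetical):

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation; returns (advantages, value targets)."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]  # stop bootstrapping across episode boundaries
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * lam * not_done * gae  # exponentially weighted sum
        advantages[t] = gae
    # Critic targets are advantages plus the baseline values.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `lam=0` the advantage is the one-step TD error (low variance, more bias); with `lam=1` and zero values it reduces to discounted reward to go (high variance, no critic bias). Forgetting the `not_done` mask in either line is a common implementation bug.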

PPO
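The PPO clipped surrogate objective, with probability ratio $r_t(\theta)$, advantage estimate $\hat A_t$ (e.g. from GAE) and clip range $\epsilon$:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat A_t\big)\right]
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for moving the ratio outside $[1-\epsilon, 1+\epsilon]$ in the direction that improves the surrogate.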

larger GAE → larger larger and then larger


Bellman Expectation Equation
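The Bellman expectation equations for a fixed policy $\pi$, assuming a deterministic reward $r(s,a)$ and transition probabilities $P(s'\mid s,a)$:

```latex
\begin{aligned}
V^\pi(s) &= \sum_a \pi(a\mid s)\sum_{s'} P(s'\mid s,a)\left[r(s,a) + \gamma V^\pi(s')\right] \\
Q^\pi(s,a) &= r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\sum_{a'} \pi(a'\mid s')\,Q^\pi(s',a')
\end{aligned}
```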

Value-Based Learning

Temporal Difference
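A tabular TD(0) update moves $V(s)$ toward the bootstrapped target $r + \gamma V(s')$. A minimal sketch, assuming the value table is a Python dict (function name hypothetical):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    # Terminal transitions do not bootstrap from the next state.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", 0.5, "B", False, alpha=0.5, gamma=0.9)
print(V["A"])  # approximately 0.7
```

Unlike a Monte Carlo update, this uses one real reward plus the current estimate, trading some bias for lower variance.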

DQN
RL Target Network to prevent moving target

RL Target Network
Double DQN
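Both targets below are computed with a frozen target network so the regression target does not move on every gradient step. Double DQN additionally lets the online network choose $a'$ while the target network scores it, which reduces overestimation from the max. A minimal sketch, assuming each network's Q-values for $s'$ are given as a list (function names hypothetical):

```python
def dqn_target(r, q_target_next, done, gamma=0.99):
    """Vanilla DQN: max over the target network's Q-values at s'."""
    return r + (0.0 if done else gamma * max(q_target_next))


def double_dqn_target(r, q_online_next, q_target_next, done, gamma=0.99):
    """Double DQN: online net selects a', target net evaluates it."""
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


print(dqn_target(0.0, [2.0, 0.5], False, gamma=1.0))                     # 2.0
print(double_dqn_target(0.0, [1.0, 3.0], [2.0, 0.5], False, gamma=1.0))  # 0.5
```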

Value-based actor-critic

The actual exam

Multiple-choice questions with multiple correct answers

Some code questions appeared, especially multiple-choice code completion: filling in the HW code, mainly the loss part

DDPG: how it enables continuous actions, starting from DQN

target network, actor critic

Soft actor-critic (SAC) and TD3: how they prevent Q-value overestimation

TD3+BC for TD3
for SAC
Double Q-learning (DDQN) for DQN
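TD3 and SAC both fight overestimation by taking the minimum of two target critics (clipped double Q); SAC also subtracts an entropy term. A minimal sketch of the shared target, assuming the two target critics are already evaluated at $(s', a')$; the function name and the `alpha`/`logp_next` parameters are hypothetical:

```python
def clipped_double_q_target(r, q1_next, q2_next, done, gamma=0.99,
                            alpha=0.0, logp_next=0.0):
    """TD3-style target (alpha=0) or SAC-style soft target (alpha>0).

    q1_next, q2_next: target critics at (s', a'), where a' comes from the
    target policy plus clipped noise (TD3) or is sampled from the current
    policy with log-probability logp_next (SAC).
    """
    soft_q = min(q1_next, q2_next) - alpha * logp_next  # pessimistic estimate
    return r + (0.0 if done else gamma * soft_q)


print(clipped_double_q_target(1.0, 5.0, 3.0, False, gamma=0.5))  # 2.5
```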

GAE implementation (multiple-choice, multiple correct answers)

PPO loss: how to compute logp in PyTorch (sum over the action dimensions only, without taking the mean)
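For a diagonal Gaussian policy the action dimensions are independent, so the joint density is a product and its log is a sum; averaging instead would scale the log-probability by 1/dim and distort the PPO ratio. In PyTorch this corresponds to something like `Normal(mean, std).log_prob(act).sum(dim=-1)`. A dependency-free sketch of the same computation (function name hypothetical):

```python
import math


def diag_gaussian_logp(action, mean, log_std):
    """log pi(a|s) for a diagonal Gaussian: sum of per-dimension log-densities."""
    logp = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        std = math.exp(ls)
        # 1-D Normal log-density: -0.5*z^2 - log(std) - 0.5*log(2*pi)
        logp += -0.5 * ((a - mu) / std) ** 2 - ls - 0.5 * math.log(2 * math.pi)
    return logp


print(diag_gaussian_logp([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))  # equals -log(2*pi)
```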

PPO ratio equation

pi / pi_old (correct)
log pi / log pi_old (false)
exp(log pi) / exp(log pi_old) (numerically unstable: divides probabilities directly)
exp(log pi - log pi_old) (correct)
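The ratio is formed in log space and exponentiated once, then plugged into the clipped surrogate. A minimal sketch over a batch of per-sample log-probabilities (each already summed over action dimensions); the function name and the default `clip_eps=0.2` are assumptions:

```python
import math


def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """Negative PPO clipped surrogate, averaged over the batch (to be minimized)."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)  # pi_new / pi_old, computed in log space
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        total += min(ratio * a, clipped * a)
    return -total / len(adv)  # mean over samples, not over action dims


print(ppo_clip_loss([math.log(0.5)], [math.log(0.25)], [1.0]))  # about -1.2
```

With a positive advantage the ratio of 2.0 is clipped to 1.2; with a negative advantage the unclipped term is smaller, so the full penalty is kept.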

3 limitations

behavior cloning

imitation learning

Write down DAgger's 4 steps
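DAgger's four steps are: (1) train $\pi_\theta$ on the expert dataset $D$, (2) run $\pi_\theta$ to collect states $D_\pi$, (3) ask the expert to label $D_\pi$ with actions, (4) aggregate $D \leftarrow D \cup D_\pi$, and repeat. A toy sketch of that loop; the expert, the 1-nearest-neighbour learner and the 1-D dynamics are all hypothetical stand-ins:

```python
import random


def expert(s):
    """Hypothetical expert: push the state toward 0."""
    return -1 if s > 0 else 1


def train(dataset):
    """Hypothetical learner: 1-nearest-neighbour lookup over (state, action) pairs."""
    data = list(dataset)

    def policy(s):
        return min(data, key=lambda pair: abs(pair[0] - s))[1]
    return policy


def rollout(policy, rng, steps=20):
    """Run the policy in a toy system s' = s + 0.5*a + noise; return visited states."""
    s, states = rng.uniform(-2, 2), []
    for _ in range(steps):
        states.append(s)
        s = s + 0.5 * policy(s) + rng.gauss(0, 0.1)
    return states


def dagger(initial_states, iterations=5, seed=0):
    rng = random.Random(seed)
    dataset = [(s, expert(s)) for s in initial_states]  # expert demos D
    for _ in range(iterations):
        policy = train(dataset)                   # 1. train pi on D
        visited = rollout(policy, rng)            # 2. run pi to get D_pi
        labeled = [(s, expert(s)) for s in visited]  # 3. expert labels D_pi
        dataset += labeled                        # 4. aggregate D <- D U D_pi
    return train(dataset), dataset


policy, data = dagger([-1.0, 1.0])
print(len(data), policy(1.5), policy(-1.5))
```

Step 3 is where the expert must be queried on the learner's own states, which is the drawback noted under Imitation Learning.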

No offline RL questions in this midterm! Just review all the lecture slides.

Recommendations

//////