The exam comprises 25 sub-problems, including 8 multiple-choice questions, 10 short-answer questions, 4 simple calculation questions, and 3 proof questions.
He will ask how to interpret results and the effects of implementation choices, but we don't have to write code
- Time: 10:05-11:45am (100 minutes)
- Location: The exam will be held in the same classroom as usual (D504)
- Coverage: The midterm will encompass material covered in lectures until this Wednesday (April 17th) + HW1 + HW2 (so, no questions related to offline RL in this midterm)
- Question types ("rough" distribution): Multiple-choice questions (~50%), short writing questions (~30%), proof/derivation/calculation questions (~20%)
Imitation Learning
- DAgger: the drawback is that it needs an expert policy
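A minimal sketch of the DAgger loop, to show where the expert policy is needed on every iteration. The 1-D dynamics, the linear learner, and the names `expert_policy`, `fit_linear_policy`, and `rollout_states` are all hypothetical choices made for illustration, not the course's implementation.

```python
import random

def expert_policy(state):
    # Hypothetical expert for illustration: push the state toward zero.
    return -0.5 * state

def fit_linear_policy(states, actions):
    # Least-squares fit of a = weight * s on the aggregated dataset.
    numerator = sum(s * a for s, a in zip(states, actions))
    denominator = sum(s * s for s in states)
    return numerator / denominator

def rollout_states(weight, rng, horizon=20):
    # Run the learner's own policy and record the states it visits.
    state = rng.gauss(0.0, 1.0)
    visited = []
    for _ in range(horizon):
        visited.append(state)
        state = state + weight * state + rng.gauss(0.0, 0.1)
    return visited

def dagger(iterations=5, seed=0):
    rng = random.Random(seed)
    # Initial dataset D: states labeled by the expert (assumed given).
    states = [rng.gauss(0.0, 1.0) for _ in range(20)]
    actions = [expert_policy(s) for s in states]
    weight = 0.0
    for _ in range(iterations):
        weight = fit_linear_policy(states, actions)   # 1. train the policy on D
        visited = rollout_states(weight, rng)         # 2. run the policy to collect states
        labels = [expert_policy(s) for s in visited]  # 3. ask the expert to label them
        states += visited                             # 4. aggregate into D
        actions += labels
    return weight
```

Step 3 is the costly part: the expert has to be available to label states that the learner, not the expert, chose to visit.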
Policy Gradient Theorem*
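- Policy Gradient Baseline*: subtracting a baseline b keeps the gradient estimate unbiased if b and the action a are independent
expectation: unchanged by the baseline

variance: depends on the baseline, so a well-chosen b reduces it

You cannot simply use a state-action dependent baseline and still get unbiased policy gradient estimates.
The estimate stays unbiased → the expectation is the same even after subtracting the baseline, so the right-hand term is fixed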
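The two statements above can be written out as follows. This is a sketch in standard notation, which is an assumption because the original equations are missing from these notes: $\pi_\theta$ is the policy, $b(s)$ a baseline that depends only on the state, and $X$ the single-sample gradient estimate.

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1 = 0
$$

$$
\mathrm{Var}[X] = \mathbb{E}\left[X^2\right] - \left(\mathbb{E}[X]\right)^2,
\qquad X = \nabla_\theta \log \pi_\theta(a \mid s)\,\left(Q(s, a) - b(s)\right)
$$

Because the baseline does not change $\mathbb{E}[X]$, the right-hand term $\left(\mathbb{E}[X]\right)^2$ is fixed, and choosing $b$ only affects $\mathbb{E}[X^2]$. If $b$ depended on $a$, it could not be pulled out of the sum in the first line, which is why a state-action dependent baseline is not unbiased in general.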
Actor Critic
- PPO
- larger GAE → larger larger and then larger
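Since GAE implementation shows up in the exam, here is a minimal sketch of the usual backward recursion. The function name `compute_gae`, its argument layout, and the default `gamma` and `lam` values are assumptions for illustration, not the HW code.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return GAE advantages for one trajectory segment.

    rewards, values, and dones have length T; last_value bootstraps V(s_T).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Recursion: A_t = delta_t + gamma * lam * A_{t+1}, reset at episode ends
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value estimate.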

Value-Based Learning
- DQN: RL Target Network to prevent a moving target
- RL Target Network
- Double DQN
- Value based actor critic
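To make the difference between the DQN and Double DQN targets concrete, here is a small sketch for a single transition. The function names and the plain-list representation of Q-values are hypothetical. In a real agent, `next_q_online` and `next_q_target` would come from the online and target networks evaluated at the next state.

```python
def dqn_target(reward, done, next_q_target, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the next action.
    return reward + gamma * (1.0 - float(done)) * max(next_q_target)

def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it, which reduces overestimation.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * (1.0 - float(done)) * next_q_target[best_action]
```

When the two networks disagree about the best action, the Double DQN target is never larger than the DQN target, because the evaluated value cannot exceed the maximum over the target network's values.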
Actual exam
Multiple-choice questions with multiple answers
- Some code questions came up, especially multiple-choice completion: filling in the HW code, the loss part
- Soft actor-critic (SAC), TD3 to prevent overestimation of q
- TD3+BC for TD3
- for SAC
- Double Q learning (DDQN) for DQN
- GAE implementation multiple answers
- PPO loss: how to compute logp in PyTorch (sum only, without mean)
- PPO ratio equation
- pi_new / pi_old (correct)
- log pi_new / log pi_old (false)
- exp(log pi_new) / exp(log pi_old): numerically unstable, since it divides probabilities directly
- exp(log pi_new - log pi_old) (correct)
- 3 limitations
- behavior cloning
- imitation learning
- write down DAgger's 4 steps
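For the PPO logp and ratio items in the list above, a stdlib-only sketch may help. It assumes a diagonal Gaussian policy and reads "sum only, without mean" as summing the per-dimension log-probabilities of one action vector. In PyTorch the analogous call would be `torch.distributions.Normal(mean, std).log_prob(action).sum(-1)`. The names `gaussian_logp` and `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions, not the HW code.

```python
import math

def gaussian_logp(action, mean, std):
    # Joint log-prob of one action vector under a diagonal Gaussian:
    # sum the per-dimension log-densities (sum, not mean).
    total = 0.0
    for a, m, s in zip(action, mean, std):
        total += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Clipped surrogate loss, averaged over the batch of samples.
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Ratio computed in log space: exp(log pi_new - log pi_old)
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

Subtracting log-probabilities before exponentiating avoids dividing two very small probabilities, which is the numerical issue noted in the ratio item.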
No offline RL questions in this midterm! Just go through all the lecture slides.
Seonglae Cho