Offline RL, model-based RL, reward learning, skill discovery, hierarchical RL, and HW3-5
Be sure to review the assignment PDFs and code
Proofs are not on the exam
Offline Learning
- An offline dataset is given and the policy is trained only on that data; offline RL reuses previously collected data
- The data is collected by some unknown policy $\pi_\beta$, called the behavior policy, or by a mixture of policies
- Offline RL can find a better path by stitching together segments of different trajectories, while BC only mimics the actions without optimizing.
- Offline RL needs to handle unseen actions safely while still doing better than the data
- Can we simply use off-policy algorithms? Q overestimation matters more because the data is static, so an overestimated action is never tried and corrected. (smaller dataset → larger overestimation)
- Where to use offline RL? Pre-training for robots… The goal is to improve the policy beyond the behavior policy, which is a trade-off between policy improvement and distribution shift.
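The overestimation point above can be illustrated with a small simulation. This is a minimal sketch, not taken from the lecture: it assumes five actions whose true value is 0 and zero-mean Gaussian noise on each Q estimate, and the function name is hypothetical. Taking the max over noisy estimates is biased upward, and with static data that bias is never corrected by trying the action.

```python
import random

def max_of_noisy_estimates(true_q, noise_std, rng):
    """Return max_a of Q estimates corrupted by zero-mean Gaussian noise (illustrative helper)."""
    return max(q + rng.gauss(0.0, noise_std) for q in true_q)

rng = random.Random(0)
true_q = [0.0] * 5   # assumption: every action is truly worth 0
trials = 2000
avg_max = sum(max_of_noisy_estimates(true_q, 1.0, rng) for _ in range(trials)) / trials
print(f"true max = 0.0, average estimated max = {avg_max:.2f}")
```

The average estimated max is clearly positive even though no action is better than any other.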
TD3+BC (Explicitly constrain to the data)
The solution is to keep actions close to the behavior policy, since the Q-function is unreliable on OOD actions (staying close to the unknown $\pi_\beta$ = fitting the data).
However, the constraint is too pessimistic (too conservative) when $\pi_\beta$ is a random policy, and it is not pessimistic enough when the behavior policy is good.
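A minimal sketch of a TD3+BC-style actor loss for 1-D actions, written with plain Python lists so it runs without any library. The function name, the argument layout, and the choice to normalize by the mean absolute Q of the batch are assumptions for illustration, not the homework's or the paper's exact code.

```python
def td3_bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """Scalar TD3+BC-style actor loss for a batch of 1-D actions (illustrative).

    q_values[i]       : Q(s_i, pi(s_i)) from the critic
    policy_actions[i] : pi(s_i)
    data_actions[i]   : a_i stored in the offline dataset
    """
    n = len(q_values)
    # Normalize the Q term so that alpha is comparable across reward scales.
    lam = alpha / (sum(abs(q) for q in q_values) / n + 1e-8)
    q_term = sum(q_values) / n
    # Behavior cloning term: squared distance to the dataset action.
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, data_actions)) / n
    return -lam * q_term + bc_term

loss_on_data = td3_bc_actor_loss([2.0, 2.0], [0.1, -0.3], [0.1, -0.3])
loss_off_data = td3_bc_actor_loss([2.0, 2.0], [1.1, 0.7], [0.1, -0.3])
```

With identical Q-values, the policy that deviates from the dataset actions pays an extra penalty, which is the explicit constraint to the data.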
CQL (Implicitly constrain to data by penalizing Q-values)
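Instead, we train $Q$ so that OOD actions never have high values.

It can be shown that $\hat{Q}^\pi \le Q^\pi$ for large enough $\alpha$ (this might be too pessimistic).

It is no longer guaranteed that $\hat{Q}^\pi(s, a) \le Q^\pi(s, a)$ for all $(s, a)$, but it is guaranteed that $\mathbb{E}_{\pi(a \mid s)}[\hat{Q}^\pi(s, a)] \le \mathbb{E}_{\pi(a \mid s)}[Q^\pi(s, a)]$ for all $s \in D$.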
The final loss contains a regularizer $R(\mu)$ so that $\mu$ can be found in closed form. With a max-entropy regularizer, the optimal $\mu(a \mid s)$ is proportional to $\exp(Q(s, a))$. In a discrete action space we can compute $\log \sum_a \exp Q(s, a)$ exactly, and in a continuous action space we sample actions to approximate it.
Finally, we update the Q-function using $L_{CQL}$ on the dataset $D$ and update the policy as in SAC (simply replacing the critic loss with $L_{CQL}$). In practice, tuning $\alpha$ is crucial for performance, together with a few tricks such as BC pre-training and the way the log-sum-exp is estimated.
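For reference, the discrete-action CQL loss with the max-entropy regularizer can be written as below. This is a sketch in the commonly used notation, assuming $D$ is the offline dataset, $\alpha$ the penalty weight and $\pi$ the current policy. It may not match the lecture's symbols exactly, but it agrees term by term with the Homework 5 code (`cql_alpha * (logsumexp - q_t_values)` plus the DQN loss).

```latex
L_{\text{CQL}}(Q) =
  \alpha \, \mathbb{E}_{s \sim D}\Big[ \log \sum_{a} \exp Q(s, a)
      - \mathbb{E}_{a \sim D}\big[ Q(s, a) \big] \Big]
  + \mathbb{E}_{(s, a, s') \sim D}\Big[ \big( Q(s, a)
      - ( r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi}[ Q(s', a') ] ) \big)^2 \Big]
```

The first term pushes down the Q-values of all actions (soft maximum) and pushes up the Q-values of dataset actions; the second term is the usual TD error.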
Filtered Behavioral Cloning
Can we avoid estimating values on OOD actions altogether by leveraging rewards in behavioral cloning (behavioral cloning with rewards)? Imitating only the good trajectories is Filtered Behavioral Cloning (K% BC): rank the trajectories by return and keep only the top K% of the data before imitating.
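The filtering step can be sketched as follows. The trajectory format (a list of `(state, action, reward)` tuples) and the function name are assumptions made for this example.

```python
def filter_top_k_percent(trajectories, k_percent):
    """Keep the top k_percent of trajectories ranked by total return (illustrative helper).

    Each trajectory is a list of (state, action, reward) tuples.
    """
    ranked = sorted(trajectories, key=lambda traj: sum(r for _, _, r in traj), reverse=True)
    n_keep = max(1, int(len(ranked) * k_percent / 100))
    return ranked[:n_keep]

trajectories = [
    [("s0", "left", 0.0), ("s1", "left", 1.0)],    # return 1.0
    [("s0", "right", 2.0), ("s2", "right", 3.0)],  # return 5.0
    [("s0", "left", 0.0), ("s1", "right", 0.5)],   # return 0.5
    [("s0", "right", 2.0), ("s2", "left", 1.0)],   # return 3.0
]
kept = filter_top_k_percent(trajectories, k_percent=50)
# Plain behavioral cloning is then run on the surviving (state, action) pairs.
bc_dataset = [(s, a) for traj in kept for s, a, _ in traj]
```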
AWR (Advantage-Weighted Regression)
Filtered BC treats all transitions in a trajectory equally and uses the reward only in a binary way (use or not use). AWR instead weights each transition by how good its action is, measured with the advantage function.

This avoids training on any OOD actions, but the advantage is estimated for the behavior policy, i.e. for a weaker policy than the one being learned.
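A minimal sketch of the advantage weighting, assuming weights of the form exp(A / temperature) with clipping for numerical stability. The function names, the `temperature` parameter name and the clip value are illustrative choices, not values from the lecture.

```python
import math

def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Per-transition weights exp(A / temperature), clipped for stability (illustrative)."""
    return [min(math.exp(a / temperature), max_weight) for a in advantages]

def awr_loss(log_probs, advantages, temperature=1.0):
    """Weighted negative log-likelihood of the dataset actions."""
    weights = awr_weights(advantages, temperature)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)

weights = awr_weights([-1.0, 0.0, 1.0])
```

Only dataset actions appear in the loss, so no OOD action is ever evaluated. Actions with higher advantage simply count more in the imitation objective.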
Implicit Q Learning
SARSA-style, but it uses only the good actions via an expectile loss.
A SARSA-style objective learns Q-values without querying OOD actions. Expectile regression biases the fit toward the best samples, so the prediction tends to match the higher targets. Roughly, $V(s)$ fits the upper $\tau$ expectile of $Q(s, a)$ over the dataset actions.
- Prediction is larger than target → small loss → prediction stays large
- Prediction is smaller than target → larger loss → prediction becomes larger

- Two hyperparameters (the expectile $\tau$ and the AWR temperature) compared to one ($\alpha$) for CQL, which makes it harder to tune.
- Once converged, it extracts the policy using AWR (works very well in practice)
- Decoupling actor & critic training -> computationally fast
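The asymmetric loss described in the bullets above can be sketched in a few lines. This is a toy scalar version, assuming the usual expectile form `|tau - 1(target < prediction)| * (target - prediction)^2`; the helper `fit_expectile` and the sample values are made up for illustration.

```python
def expectile_loss(prediction, target, tau=0.9):
    """Asymmetric squared error: under-prediction is weighted by tau, over-prediction by 1 - tau."""
    diff = target - prediction
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2

def fit_expectile(targets, tau=0.9, lr=0.05, steps=2000):
    """Gradient descent on the mean expectile loss for one scalar prediction (illustrative)."""
    v = sum(targets) / len(targets)
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            diff = t - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff
        v -= lr * grad / len(targets)
    return v

q_samples = [0.0, 1.0, 2.0, 3.0, 10.0]
v_mean = fit_expectile(q_samples, tau=0.5)   # symmetric loss recovers the mean
v_high = fit_expectile(q_samples, tau=0.9)   # pulled toward the largest targets
```

With `tau=0.5` the fit is the ordinary mean; raising `tau` moves the prediction toward the best samples without ever evaluating an action outside the data.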
Model based RL
What if we know the model? We can pick the best policy from imagined rollouts, generating as many trajectories as we like.
The model supports planning at inference time and trajectory data generation, using the dynamics $p(s' \mid s, a)$ and the reward $r(s, a)$.
Models are immensely useful if they are easy to learn, and they can be trained without reward labels (self-supervised). A model is somewhat task-agnostic and can sometimes be transferred across rewards.
Model-based optimization (gradient based & sampling based)
- CEM iteration (sampling-based)

- Augmenting data with model-simulated rollouts. Sampling-based optimization doesn’t scale to high dimensions, in terms of both horizon and action space.
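A minimal sketch of the CEM iteration on a toy 1-D objective. The objective, sample counts and function name are assumptions for illustration; in model-based RL the decision variable would be an action sequence scored by model rollouts.

```python
import random
import statistics

def cem_maximize(objective, init_mean=0.0, init_std=5.0,
                 n_samples=100, n_elite=10, n_iters=20, seed=0):
    """Cross-entropy method for a 1-D decision variable (illustrative)."""
    rng = random.Random(seed)
    mean, std = init_mean, init_std
    for _ in range(n_iters):
        # 1. Sample candidates from the current Gaussian.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # 2. Keep the elites with the highest objective value.
        elites = sorted(samples, key=objective, reverse=True)[:n_elite]
        # 3. Refit the Gaussian to the elites.
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites)
    return mean

# Toy objective with its maximum at x = 3.
best = cem_maximize(lambda x: -(x - 3.0) ** 2)
```

The number of samples needed to cover the space grows quickly with the dimension of the action sequence, which is the scaling issue noted above.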
Model-based Planning (gradient based & sampling based)
- Open-loop system vs. closed-loop system. More planning → better performance
- Brute-force tree search planning has exponential complexity.
- AlphaGo
- MCTS approximates tree search with sampling (estimate a value (expected return) with several simulations of entire games starting from the current state)
- Tree Policy: Selection → Expansion
- Default Policy: Run a simulation using quick policy
- Update action values Q and visit counts of all traversed edges
- Efficient MCTS searches only plausible states, with reduced breadth (actions) and depth (result)
- New tree policy to reduce breadth with policy network


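The Q term encourages exploitation while the N (visit count) term encourages exploration.
AlphaZero does not use MCTS rollouts, only simulations evaluated by the network. No human data and no domain knowledge, only self-play. In other words, it does not play out to the end of the game, which keeps a single consistent policy.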
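The trade-off between the Q term and the visit count can be sketched with a PUCT-style selection rule. The exact constants and the form of the bonus differ between AlphaGo variants, so the formula, `c_puct`, and the numbers below are assumptions for illustration.

```python
import math

def puct_score(q, prior, visits, parent_visits, c_puct=1.0):
    """Exploitation term Q plus a prior-weighted bonus that decays with the visit count N."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

def select_action(stats, c_puct=1.0):
    """stats maps action -> (Q, prior P from the policy network, visit count N)."""
    parent_visits = sum(n for _, _, n in stats.values())
    return max(stats, key=lambda a: puct_score(*stats[a], parent_visits, c_puct))

stats = {
    "a": (0.5, 0.2, 30),   # well explored, decent value
    "b": (0.0, 0.7, 0),    # never visited, high prior
    "c": (0.1, 0.1, 5),
}
first_choice = select_action(stats)

# After "b" has been visited many times without paying off, the bonus shrinks.
stats["b"] = (0.0, 0.7, 100)
later_choice = select_action(stats)
```

Early on the unvisited high-prior action is selected; once its visit count grows, selection falls back to the action with the best Q.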
Model learning
Model-Predictive Control
CEM Iteration or random shooting can be used in step 3

It replans to correct for model errors, but it is only practical for short-horizon problems since it is computationally expensive and not accurate for long plans.
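A minimal sketch of MPC with random shooting. The 1-D point dynamics, the quadratic reward, the goal at 5.0 and all function names are hypothetical stand-ins for a learned model; only the plan, execute first action, replan loop is the point.

```python
import random

def dynamics(state, action):
    """Hypothetical model: a 1-D point that moves by the action."""
    return state + action

def reward(state, goal=5.0):
    return -(state - goal) ** 2

def plan_random_shooting(state, horizon, n_candidates, rng):
    """Return the first action of the best random action sequence under the model."""
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

rng = random.Random(0)
state = 0.0
for _ in range(20):
    action = plan_random_shooting(state, horizon=5, n_candidates=500, rng=rng)
    state = dynamics(state, action)   # execute only the first action, then replan
```

Each environment step costs `n_candidates * horizon` model evaluations, which is why this is only practical for short horizons.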
TD-MPC
Planning over a short horizon with a terminal value function yields a local plan that maximizes the global value: MPC gives temporally local optimal solutions, and the value function approximates the globally optimal solution.
Estimating Q-targets via planning is very slow, hence we use a policy instead.

TD-MPC = latent dynamics + CEM-MPC with TD-learning. In other words, it uses CEM-MPC for inference, plans in a (task-oriented) latent space, and is trained to maximize the return estimated by the Q-value. In some sense, TD-MPC combines model-based planning (CEM-MPC) with model-free policy optimization (reparameterized policy gradient) and TD-learning (Q-learning).
TD-MPC solves challenging continuous control problems.
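The idea of scoring a short plan with a terminal value can be sketched as below. The function name and the numbers are made up; in TD-MPC the rewards and the terminal value would come from learned latent models rather than fixed lists.

```python
def n_step_value(rewards, terminal_value, gamma=0.99):
    """Score of a planned trajectory: discounted rewards plus a bootstrapped terminal value."""
    horizon = len(rewards)
    ret = sum(gamma ** t * r for t, r in enumerate(rewards))
    return ret + gamma ** horizon * terminal_value

# A plan that collects small rewards now but ends in a worthless state ...
greedy_plan = n_step_value([1.0, 1.0, 1.0], terminal_value=0.0, gamma=0.9)
# ... versus a plan with no short-term reward that ends in a valuable state.
patient_plan = n_step_value([0.0, 0.0, 0.0], terminal_value=10.0, gamma=0.9)
```

Without the terminal value the planner would prefer the first plan; with it, the short-horizon planner can still choose the plan that is better in the long run.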
Dreamer
- Dyna (Online Q-learning with a model) style model-based RL

This was the first time model-based RL outperformed model-free RL on pixels (DreamerV1).
DreamerV2 targets discrete environments, DreamerV3 targets all environments, and DayDreamer targets physical robot learning.
Representation learning mostly relies on observation reconstruction, which means it works well with high-dimensional, multi-modal observations but requires high GPU memory usage and has slow updates due to the large world model.


Reward model
Imitation learning mimics the actions of an expert without reasoning about outcomes or dynamics. The expert might have different degrees of freedom, and it might not be possible to provide demonstrations.
Alternatives: a goal classifier trained with success examples, or a general reward classifier trained with demonstration trajectories.
Goal classifier
The RL algorithm will seek out states that the classifier thinks are good. Can we prevent the RL algorithm from exploiting the classifier’s weaknesses? We update the classifier during RL, using policy data as negatives.

Generator will match data distribution at convergence.

- Behavior Cloning: action | state
- GAIL: state action pair
- GAIfO: state only without expert action demo
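A minimal sketch of turning a discriminator output into a reward. The form `-log(1 - D)` is one common convention (others use `log D`), and the function name and `eps` clamp are assumptions for this example. For GAIL the discriminator input would be a state-action pair, for GAIfO a state or state transition.

```python
import math

def discriminator_reward(expert_prob, eps=1e-8):
    """Reward -log(1 - D), where D is the classifier's probability that the input is expert data."""
    return -math.log(max(1.0 - expert_prob, eps))

r_expert_like = discriminator_reward(0.9)
r_policy_like = discriminator_reward(0.1)
```

The reward is high where the classifier believes the sample came from the expert, so the classifier must keep being retrained on fresh policy data to stay hard to exploit.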
Pairwise preferences are easy to provide without goals or demos.

Critique is easier than generation, but we still need supervision.
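Learning a reward from pairwise preferences is commonly sketched with a Bradley-Terry model over summed predicted rewards. The function names and the segment format (lists of per-step predicted rewards) are assumptions for this example.

```python
import math

def preference_probability(rewards_a, rewards_b):
    """P(segment A is preferred over B) under a Bradley-Terry model on summed predicted rewards."""
    return 1.0 / (1.0 + math.exp(sum(rewards_b) - sum(rewards_a)))

def preference_loss(rewards_a, rewards_b, a_preferred):
    """Cross-entropy between the predicted preference and the human label."""
    p = preference_probability(rewards_a, rewards_b)
    return -math.log(p if a_preferred else 1.0 - p)

p_equal = preference_probability([1.0, 0.0], [0.0, 1.0])
p_clear = preference_probability([2.0, 1.0], [0.0, 0.0])
```

Minimizing `preference_loss` over labeled pairs trains the reward model, which is then used as the reward for RL.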
The RL agent learns skills (options) without environment reward.
Action entropy is not the same as state entropy, and MaxEnt policies are stochastic but not controllable. A diversity-promoting reward function can be obtained through a discriminator.
Goal of skill policy = minimize the discriminator's classification loss = maximize $\log q(z \mid s)$.

Mutual information between skills and states can be maximized by maximizing that reward definition.
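A minimal sketch of a discriminator-based skill reward in the DIAYN style, assuming discrete skills with a uniform prior. The function name and the probabilities are made up for illustration.

```python
import math

def diayn_reward(discriminator_probs, skill, num_skills):
    """log q(z | s) - log p(z) with a uniform prior p(z) over discrete skills."""
    return math.log(discriminator_probs[skill]) - math.log(1.0 / num_skills)

# The discriminator cannot tell the skills apart in this state: no reward.
r_uninformative = diayn_reward([0.25, 0.25, 0.25, 0.25], skill=0, num_skills=4)
# The state is clearly recognizable as belonging to skill 0: positive reward for skill 0.
r_distinct = diayn_reward([0.7, 0.1, 0.1, 0.1], skill=0, num_skills=4)
r_wrong_skill = diayn_reward([0.7, 0.1, 0.1, 0.1], skill=1, num_skills=4)
```

A skill is rewarded for visiting states from which it can be identified, which is a lower bound on the mutual information between skills and states.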

Maximizing MI does NOT encourage skills to cover diverse states; it only focuses on easy skills. Taking changes in state (distance) into account could help incentivize challenging behaviors. However, small (but diverse) state changes alone can also maximize MI.
Euclidean distance-maximizing skill discovery may not learn static skills. Any distance can be used based on domain knowledge.

In the max objective, the state difference is multiplied by z, so both direction and magnitude are taken into account: move as far as possible along direction z (align the two).
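The inner product between the state change and the skill vector can be sketched as below. This version uses raw states; a learned representation of the state would replace them in practice. The function name and the 2-D example are assumptions.

```python
def directional_reward(state, next_state, z):
    """Inner product (s' - s) . z: large when the state moves far along direction z."""
    return sum((ns - s) * zi for s, ns, zi in zip(state, next_state, z))

z = [1.0, 0.0]                                            # skill: "move along +x"
r_aligned = directional_reward([0.0, 0.0], [2.0, 0.0], z)
r_small = directional_reward([0.0, 0.0], [0.5, 0.0], z)
r_orthogonal = directional_reward([0.0, 0.0], [0.0, 2.0], z)
r_opposite = directional_reward([0.0, 0.0], [-2.0, 0.0], z)
```

Both the direction and the size of the movement matter: a long move along z scores highest, a move orthogonal to z scores zero, and a move against z is penalized.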

The Euclidean distance does not necessarily correspond to the behaviors of our interests.
Controllability-aware skill discovery
Learn which states are easy to control and which are hard to control, and give more reward for changing hard-to-control states.

The high-level policy makes high-level decisions while the low-level policy makes low-level decisions (many design choices).
HRL helps solve long-horizon complex tasks through temporally extended exploration and simplified credit assignment.
Skill chaining
To chain more skills, we need to enlarge the initiation set while keeping the termination set small.
- T-STAR
- Transition Policy
Skill dynamics model (learn skill prior - task policy)
A skill dynamics model improves sample efficiency for long-horizon tasks
- SPiRL
- SkiMo
Homework
Homework 3 DQN
```python
# Epsilon-greedy action selection
if np.random.random() < epsilon:
    action = torch.tensor([np.random.randint(self.num_actions)])
else:
    with torch.no_grad():
        q_values = self.critic(observation)
        action = q_values.argmax(dim=1)
```
```python
# Compute target values
with torch.no_grad():
    # HINT: first, compute q(s',a') values for all actions with the target network
    # (we will later find the maximum values among all possible actions)
    next_qa_values = self.target_critic(next_obs)

    if self.use_double_q:
        # HINT: find the best action by evaluating q(s',a') with the current critic
        next_action = self.critic(next_obs).argmax(dim=1, keepdim=True)
    else:
        # HINT: find the next action to choose (hint: argmax) among `next_qa_values`
        next_action = next_qa_values.argmax(dim=1, keepdim=True)

    # HINT: find the next q values among `next_qa_values` by specifying with `next_action`
    # squeeze(1) drops only the action dimension, so a batch of size 1 keeps its batch dimension
    next_q_values = next_qa_values.gather(1, next_action).squeeze(1)
    target_values = reward + self.discount * (~done) * next_q_values

qa_values = self.critic(obs)
q_values = qa_values.gather(1, action.unsqueeze(1)).squeeze(1)
assert q_values.shape == target_values.shape

# HINT: compute loss with q_values and target_values
loss = self.critic_loss(q_values, target_values.detach())

self.critic_optimizer.zero_grad()
loss.backward()
self.critic_optimizer.step()
self.lr_scheduler.step()
```
- compute loss
- optimizer zero grad
- loss backward
- step optimizer
- step scheduler
Homework 4
```python
if self.target_critic_backup_type == "doubleq":
    # Each critic uses the other critic's next-state Q-value as its target
    indices = torch.arange(num_critic_networks).to(ptu.device)
    double_q_values = torch.where(indices.unsqueeze(1) % 2 == 0, next_qs[1], next_qs[0])
    return double_q_values
elif self.target_critic_backup_type == "min":
    # Clipped double-Q: every critic uses the minimum over the ensemble
    min_qs, _ = torch.min(next_qs, dim=0, keepdim=True)
    return min_qs.expand_as(next_qs)
```
```python
def entropy(self, action_distribution: torch.distributions.Distribution):
    # Single-sample estimate of the entropy: -log pi(a|s) with a ~ pi
    action = action_distribution.rsample()
    entropy = -action_distribution.log_prob(action)
    assert entropy.shape == action.shape[:-1]
    return entropy
```
```python
next_action_entropy = self.entropy(next_action_distribution)
next_qs = next_qs.clone() + self.temperature * next_action_entropy.unsqueeze(0)

# TODO(student): Compute the target Q-value
target_values: torch.Tensor = reward + self.discount * ~done * next_qs
assert target_values.shape == (self.num_critic_networks, batch_size)

# TODO(student): Predict Q-values using `self.critic`
q_values = self.critic(obs, action)
assert q_values.shape == (self.num_critic_networks, batch_size), q_values.shape
loss: torch.Tensor = self.critic_loss(q_values, target_values)

self.critic_optimizer.zero_grad()
loss.backward()
self.critic_optimizer.step()
```
```python
obs_expanded = obs.unsqueeze(0).expand(self.num_actor_samples, *obs.shape).reshape(-1, *self.observation_shape)
action_expanded = action.unsqueeze(0).reshape(-1, self.action_dim)
q_values = self.critic(obs_expanded, action_expanded).view(
    self.num_critic_networks, self.num_actor_samples, batch_size
)
assert q_values.shape == (
    self.num_critic_networks,
    self.num_actor_samples,
    batch_size,
), q_values.shape

# Our best guess of the Q-values is the mean of the ensemble
q_values = torch.mean(q_values, axis=0)

# Do REINFORCE (without baseline)
# TODO(student): Calculate log-probs
log_probs = action_distribution.log_prob(action)
assert log_probs.shape == q_values.shape

# TODO(student): Compute policy gradient using log-probs and Q-values
loss = -(log_probs * q_values).mean()
```
```python
def actor_loss_reparametrize(self, obs: torch.Tensor):
    batch_size = obs.shape[0]

    # Sample from the actor
    action_distribution: torch.distributions.Distribution = self.actor(obs)

    # TODO(student): Sample actions
    # Note: Think about whether to use .rsample() or .sample() here...
    action = action_distribution.rsample()

    # TODO(student): Compute Q-values for the sampled state-action pair
    q_values = self.critic(obs, action).mean(0)

    # TODO(student): Compute the actor loss using Q-values
    loss = -q_values.mean()
```
Homework 5 CQL
```python
def dqn_loss(self, ob_no, ac_na, next_ob_no, reward_n, terminal_n):
    # Q-values of the actions actually taken in the dataset
    qa_t_values = self.q_net(ob_no)
    q_t_values = torch.gather(qa_t_values, 1, ac_na.unsqueeze(1)).squeeze(1)

    # Double-Q target: pick the action with q_net, evaluate it with q_net_target
    qa_tp1_values = self.q_net_target(next_ob_no)
    next_actions = self.q_net(next_ob_no).argmax(dim=1)
    q_tp1 = torch.gather(qa_tp1_values, 1, next_actions.unsqueeze(1)).squeeze(1)

    target = reward_n + self.gamma * q_tp1 * (1 - terminal_n)
    target = target.detach()
    loss = self.loss(q_t_values, target)
    return loss, qa_t_values, q_t_values
```
```python
# TODO: CQL Implementation (Equation 1)
# Hint #1: Obtain dqn_loss, qa_t_values, q_t_values using self.dqn_loss (the first summation in Equation 1)
# Hint #2: Compute q_t_logsumexp in cql regularizer using torch.logsumexp
# Hint #3: Compute cql_loss using q_t_logsumexp and q_t_values (the second summation in Equation 1)
# Hint #4: Finally, compute loss using dqn_loss and cql_loss
dqn_loss, qa_t_values, q_t_values = self.dqn_loss(ob_no, ac_na, next_ob_no, reward_n, terminal_n)
q_t_logsumexp = torch.logsumexp(qa_t_values, dim=1)
cql_loss = self.cql_alpha * (q_t_logsumexp - q_t_values).mean()
loss = dqn_loss + cql_loss
```
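q_t_values: Q-values of the actions actually taken in the dataset D
qa_t_values: Q-values of all possible actions a (fed to logsumexp)
- We want to push up the Q-value of the action actually taken; it is subtracted in the loss because the loss is minimized, so the sign is reversed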
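The sign convention can be checked on a single state with plain Python. This is a toy sketch with made-up Q-values; `logsumexp` and `cql_regularizer` are illustrative helpers, not the homework's functions.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_regularizer(q_all_actions, data_action):
    """logsumexp_a Q(s, a) - Q(s, a_data) for a single state."""
    return logsumexp(q_all_actions) - q_all_actions[data_action]

# The dataset action (index 0) already has the highest Q-value: small penalty.
penalty_in_data = cql_regularizer([5.0, 0.0, 0.0], data_action=0)
# An action outside the data (index 1) has the highest Q-value: large penalty.
penalty_ood = cql_regularizer([0.0, 5.0, 0.0], data_action=0)
```

The regularizer is never negative, and minimizing it lowers the Q-values that dominate the logsumexp while raising the Q-value of the dataset action.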
Exam
- Losses, entropy, etc.
- Problems when using skill learning; GAIL
- IQL ex~: where it is applied
- CQL OOD query term
- Limitations of skill dynamics, and skill chaining with 2 skills vs. 3 skills
Seonglae Cho