YSU RL Final

Offline RL, model-based RL, reward learning, skill discovery, hierarchical RL, and HW3-5

Be sure to review the homework PDFs and code.

Proofs will not be on the exam.

Offline Learning

Off-policy,
On-policy

An offline dataset is given and the policy is trained only on it; offline RL reuses previously collected data.

The data is collected by some unknown policy $\pi_\beta$, called the behavior policy, or by a mixture of policies.

Offline RL finds the optimal path by combining segments, while BC only mimics the actions without optimizing.

Offline RL needs to handle unseen actions safely while still doing better than the data.

Can we simply use off-policy algorithms?
Q overestimation matters more because the data is static and cannot be corrected by new interaction (smaller dataset → larger overestimation).

Where to use offline RL? Pre-training for robots… We use offline RL to improve the policy beyond the behavior policy, which requires balancing policy improvement against distribution shift.

TD3+BC (Explicitly constrain to the data)

The solution is to keep the policy's actions close to the behavior policy, since the Q-function is unreliable on OOD actions (keeping close to the unknown $\pi_\beta$ = fitting to the data).

However, this is too pessimistic when $\pi_\beta$ is a random policy, and not pessimistic enough when the behavior policy is good (overall, too conservative).
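A minimal sketch of how a TD3+BC-style actor loss combines the two goals above (maximize Q, stay close to the data). It uses plain Python floats and 1-D actions for readability. The function name, the `alpha=2.5` default, and the normalization by the mean absolute Q-value are assumptions made for illustration, not taken from these notes.

```python
def td3_bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """TD3+BC-style actor loss for a batch of 1-D actions.

    q_values[i]       : critic value Q(s_i, pi(s_i)) of the policy's action
    policy_actions[i] : pi(s_i)
    data_actions[i]   : action stored in the dataset for s_i
    alpha             : assumed trade-off weight (larger = more RL, less BC)
    """
    n = len(q_values)
    # Normalize the Q term by the average |Q| so alpha does not depend on reward scale.
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / (mean_abs_q + 1e-8)
    rl_term = -lam * sum(q_values) / n  # push the policy toward high Q
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, data_actions)) / n  # stay near data
    return rl_term + bc_term
```

When the policy's actions match the dataset actions the BC term vanishes and only the Q term remains; any deviation from the data is paid for quadratically.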

CQL (Implicitly constrain to data by penalizing Q-values)

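Therefore, we train the Q-function so that OOD actions never have high values.

One can show that $\hat{Q}^\pi \le Q^\pi$ for large enough $\alpha$ (but this might be too pessimistic).

With an additional term that pushes up Q-values on dataset actions, it is no longer guaranteed that $\hat{Q}^\pi(s,a) \le Q^\pi(s,a)$ for all $(s,a)$, but it is guaranteed that $\mathbb{E}_{\pi(a \mid s)}[\hat{Q}^\pi(s,a)] \le \mathbb{E}_{\pi(a \mid s)}[Q^\pi(s,a)]$ for all $s \in \mathcal{D}$.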

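The final loss contains a regularizer $R(\mu)$ so that the maximizing distribution $\mu$ has a closed-form solution. With a max-entropy regularizer, the optimal $\mu(a \mid s)$ is proportional to $\exp(Q(s,a))$. In a discrete action space we can compute $\log \sum_a \exp(Q(s,a))$ exactly; in a continuous action space we sample actions to approximate it.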

Finally, we update the Q-function using the CQL loss on the dataset $\mathcal{D}$ and update the policy as in SAC (simply replacing the standard critic loss with the CQL loss). In practice, tuning $\alpha$ is crucial for performance, along with a few hacks such as BC pre-training and estimating the log-sum-exp.
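For reference, the objective with the max-entropy regularizer can be written out as below. This is a sketch in commonly used notation; the symbols $\bar{Q}$ (target network) and $\alpha$ (penalty weight) are assumptions about the notation, since the original equations are not reproduced in these notes.

```latex
\min_{Q} \;\; \alpha \, \mathbb{E}_{s \sim \mathcal{D}} \Big[ \log \sum_{a} \exp Q(s,a) \;-\; \mathbb{E}_{a \sim \mathcal{D}} \big[ Q(s,a) \big] \Big]
\;+\; \tfrac{1}{2} \, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \Big[ \big( Q(s,a) - ( r + \gamma \, \mathbb{E}_{a' \sim \pi}[ \bar{Q}(s',a') ] ) \big)^{2} \Big]
```

The first bracket pushes down Q-values over all actions (log-sum-exp) and pushes up Q-values of dataset actions; the second term is the usual Bellman error.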

Filtered Behavioral Cloning

Can we avoid estimating values on OOD actions altogether by leveraging rewards in behavioral cloning (behavioral cloning with rewards)? Imitating only the good trajectories is Filtered Behavioral Cloning (K% BC): rank the trajectories, keep the top K% of the data, then imitate.
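The filtering step can be sketched in a few lines. The assumptions here are that trajectories are ranked by undiscounted return and that each trajectory is a list of `(state, action, reward)` tuples; both function names are hypothetical.

```python
def filter_top_k_percent(trajectories, k_percent):
    """Keep the top k_percent of trajectories, ranked by undiscounted return.

    Each trajectory is a list of (state, action, reward) tuples.
    """
    if not trajectories:
        return []
    ranked = sorted(
        trajectories,
        key=lambda traj: sum(reward for _, _, reward in traj),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * k_percent / 100))
    return ranked[:n_keep]


def to_bc_pairs(trajectories):
    """Flatten the kept trajectories into (state, action) pairs for supervised BC."""
    return [(s, a) for traj in trajectories for s, a, _ in traj]
```

The resulting `(state, action)` pairs are then used for ordinary behavioral cloning, so no Q-value is ever queried.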

AWR (Advantage-Weighted Regression)

Filtered BC treats all transitions in a trajectory equally and uses the reward only in a binary way (use or do not use). AWR weights each transition by how good its action is, as measured by the advantage function.

This avoids training on any OOD actions, but it assumes a weaker target policy.
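A minimal sketch of the advantage weighting, assuming the common exponential form exp(A / beta) with weight clipping. The temperature `beta`, the clipping value `max_weight`, and the function names are illustrative assumptions.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Per-transition weights exp(A / beta), clipped at max_weight for stability."""
    log_cap = math.log(max_weight)
    return [math.exp(min(a / beta, log_cap)) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0):
    """Weighted negative log-likelihood of the dataset actions.

    log_probs[i] is log pi(a_i | s_i) for a dataset pair (s_i, a_i).
    """
    weights = awr_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

Only dataset actions appear in the loss, so the policy is never trained on OOD actions; transitions with higher advantage simply count more.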

Implicit Q Learning

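SARSA-style, but uses only good actions by means of an expectile loss.

A SARSA-style objective can learn Q-values without evaluating OOD actions: $L(\theta) = \mathbb{E}_{(s,a,s',a') \sim \mathcal{D}}\big[(r(s,a) + \gamma Q_{\hat{\theta}}(s',a') - Q_\theta(s,a))^2\big]$

Expectile regression emphasizes the best samples: the prediction tends toward the higher targets. Roughly, it fits the top few percent of the target values when $\tau$ is close to 1.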

Prediction is larger than target → small loss → prediction stays large

Prediction is smaller than target → larger loss → prediction becomes larger
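The asymmetry described above can be checked numerically with a scalar expectile fit. This is a sketch in plain Python: the weight is `tau` when the target is above the prediction and `1 - tau` otherwise, and `fit_expectile` is a hypothetical helper that minimizes the loss by gradient descent.

```python
def expectile_loss(prediction, targets, tau):
    """Mean asymmetric squared error: weight tau if target > prediction, else 1 - tau."""
    total = 0.0
    for t in targets:
        diff = t - prediction
        weight = tau if diff > 0 else 1.0 - tau
        total += weight * diff ** 2
    return total / len(targets)


def fit_expectile(targets, tau, lr=0.1, steps=2000):
    """Find the scalar prediction that minimizes the expectile loss."""
    v = sum(targets) / len(targets)  # start from the mean (the tau = 0.5 solution)
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            diff = t - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff
        v -= lr * grad / len(targets)
    return v
```

With `tau = 0.5` the fit is the ordinary mean; as `tau` approaches 1 the fit moves toward the largest targets, which is how the value function approximates a max over dataset actions.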

IQL has two hyperparameters ($\tau$ and $\beta$), compared to CQL, which has one ($\alpha$), so IQL is harder to tune.

Once converged, it extracts the policy $\pi$ using AWR (works very well in practice).

Decoupling actor & critic training -> computationally fast

Model based RL

What if we know the model? We can generate as many imaginary rollouts as we like and pick the best one.

The model is used both for planning at inference time and for generating trajectory data, using the learned dynamics and reward models.

Models are immensely useful if easy to learn and can be trained without reward labels (self-supervised). Model is somewhat task-agnostic and can sometimes be transferred across rewards.

Model-based optimization (gradient based & sampling based)

CEM (cross-entropy method) iteration: sampling-based

Augmenting data with model-simulated roll-outs. Sampling based optimization doesn’t scale to high-dimensions in terms of both horizon and action space.
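A minimal sketch of the CEM iteration on a toy 1-D problem: sample candidates from a Gaussian, keep the elites, refit the Gaussian, and repeat. In model-based RL the variable would be a whole action sequence scored by model rollouts; the scalar setup, the function name, and all default values here are assumptions for illustration.

```python
import random
import statistics


def cem_maximize(objective, mean=0.0, std=2.0, iterations=20,
                 population=100, elite_frac=0.1, seed=0):
    """Cross-entropy method for a scalar variable."""
    rng = random.Random(seed)
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # 1. Sample candidates from the current sampling distribution.
        samples = [rng.gauss(mean, std) for _ in range(population)]
        # 2. Keep the best-scoring candidates (the elites).
        elites = sorted(samples, key=objective, reverse=True)[:n_elite]
        # 3. Refit the sampling distribution to the elites.
        mean = statistics.mean(elites)
        std = statistics.stdev(elites) + 1e-6
    return mean
```

For example, `cem_maximize(lambda x: -(x - 3.0) ** 2)` moves the sampling mean toward 3. The number of samples needed grows quickly with dimension, which is the scaling problem noted above.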

Model-based Planning (gradient based & sampling based)

Open-loop system
Closed-loop system. More planning → better performance

Brute-force tree search planning has exponential complexity.

AlphaGo

Monte Carlo tree search (MCTS) approximates tree search with sampling (it estimates a value (expected return) from several simulations of entire games starting from the current state)
Tree Policy: Selection → Expansion
Default Policy: Run a simulation using quick policy
Update action values Q and visit counts of all traversed edges

Efficient MCTS searches only plausible states, with reduced breadth (actions) and depth (result)
New tree policy to reduce breadth with policy network

The Q term encourages exploitation, while the N term encourages exploration.

AlphaZero runs MCTS simulations without rollouts. It uses no human data and no domain knowledge, only self-play. In other words, it does not simulate to the end of the game, which keeps everything consistent with a single policy.
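The balance between the Q term and the visit count N can be sketched with a PUCT-style selection rule. The exact form below (prior times the square root of total visits, divided by 1 + N) and the constant `c_puct` are assumptions based on the commonly cited AlphaGo-style rule, not a formula from these notes.

```python
import math


def puct_scores(q_values, priors, visit_counts, c_puct=1.0):
    """Score each action as Q plus a prior-weighted bonus that shrinks with visits."""
    total_visits = sum(visit_counts)
    scores = []
    for q, p, n in zip(q_values, priors, visit_counts):
        bonus = c_puct * p * math.sqrt(total_visits) / (1 + n)
        scores.append(q + bonus)
    return scores


def select_action(q_values, priors, visit_counts, c_puct=1.0):
    """Pick the index of the action with the highest PUCT score."""
    scores = puct_scores(q_values, priors, visit_counts, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal Q-values the less-visited action wins, and with equal visit counts and priors the higher-Q action wins. The policy-network prior reduces breadth by giving unlikely moves a small bonus.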

Model learning

Model-Predictive Control

CEM Iteration or random shooting can be used in step 3

It replans at every step to correct for model errors, but it is only practical for short-horizon problems because long plans are computationally expensive and inaccurate.
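A minimal sketch of the replanning loop with random shooting on a toy 1-D system. The assumptions are that the model is known and equal to the real environment, that actions lie in [-1, 1], and that the reward depends only on the next state; all names and defaults are hypothetical.

```python
import random


def plan_random_shooting(state, model, reward_fn, horizon, n_candidates, rng):
    """Return the first action of the best randomly sampled action sequence."""
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:  # imagined rollout through the model
            s = model(s, a)
            total += reward_fn(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action


def run_mpc(state, model, reward_fn, steps=30, horizon=3, n_candidates=300, seed=0):
    """Plan, execute only the first action, observe the new state, and replan."""
    rng = random.Random(seed)
    for _ in range(steps):
        action = plan_random_shooting(state, model, reward_fn, horizon, n_candidates, rng)
        state = model(state, action)  # the "real" step; here it equals the model
    return state
```

For example, `run_mpc(0.0, lambda s, a: s + a, lambda s: -(s - 5.0) ** 2)` drives the state toward 5. The cost per step is `n_candidates * horizon` model calls, which is why long horizons become expensive.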

TD-MPC

Planning short horizon with terminal value function can find local planning that maximizes global value, since MPC yields temporally locally optimal solutions and the value function approximates the globally optimal solution.

Estimating Q-targets via planning is very slow, hence we use policy instead.

TD-MPC = latent dynamics + CEM-MPC with TD-learning. In other words, it uses CEM-MPC for inference, plans in a (task-oriented) latent space, and trains to maximize the return estimated by the Q-value. In some sense, TD-MPC combines model-based planning (CEM-MPC) with model-free policy optimization (REPARAMETRIZE) and TD-learning (Q-learning).
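The short-horizon plan with a terminal value can be written as a single objective. This is a sketch; the symbols $z$ (latent state), $d$ (latent dynamics), $r$ (reward model), $Q$ (terminal value), and $H$ (planning horizon) are notational assumptions.

```latex
a_{t:t+H}^{*} \;=\; \arg\max_{a_{t:t+H}} \;
\mathbb{E} \Big[ \sum_{i=t}^{t+H-1} \gamma^{\,i-t} \, r(z_i, a_i) \;+\; \gamma^{H} \, Q(z_{t+H}, a_{t+H}) \Big],
\qquad z_{i+1} = d(z_i, a_i)
```

The sum covers the locally planned part, and the $Q$ term stands in for everything beyond the horizon.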

TD-MPC solves challenging continuous control problems.

Dreamer

Dyna (Online Q-learning with a model) style model-based RL

The first time model-based RL outperformed model-free RL on pixels (DreamerV1).

DreamerV2 on discrete envs, DreamerV3 on all envs, and DayDreamer for physical robot learning.

Representation learning relies mostly on observation reconstruction, which means it works well with high-dimensional, multi-modal observations but requires high GPU memory usage and has slow updates due to the large world models.

**The value network and policy network predict from the latent state (latent imagination)**

Reward model

Imitation learning mimics the actions of an expert without reasoning about outcomes or dynamics. The expert might have different degrees of freedom, and it may not be possible to provide demonstrations.

Goal classifier with success examples and general reward classifier with demonstration trajectories.

Goal classifier

The RL algorithm will seek out states that the classifier thinks are good. Can we prevent the RL algorithm from exploiting the classifier’s weaknesses? We update the classifier during RL, using policy data as negative.

Generator will match data distribution at convergence.

Behavior Cloning: action | state

GAIL: state action pair

GAIfO: state only without expert action demo

Pairwise preferences are easy to provide without goals or demos.

Critique is easier than generation! but we still need supervision.

An RL agent learns skills (options) without environment reward.

Maximum Entropy Objective

Action entropy is not the same as state entropy and MaxEnt policies are stochastic, but not controllable. Diversity-promoting reward function could be obtained through discriminator.

Goal of the skill policy = minimize the conditional entropy $H(z \mid s)$ = maximize the discriminator reward $\log q_\phi(z \mid s)$.

Mutual information between skills and states can be maximized by maximizing that reward definition.
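A minimal sketch of a discriminator-based diversity reward, assuming a discrete set of skills with a uniform prior and a discriminator that outputs a probability for each skill given the current state. The function name and the subtraction of the log prior are assumptions in the style of DIAYN-like methods.

```python
import math


def diversity_reward(discriminator_probs, skill, num_skills):
    """r(s, z) = log q(z | s) - log p(z), with a uniform prior p(z) = 1 / num_skills.

    discriminator_probs[k] is the discriminator's probability that the current
    state was produced by skill k.
    """
    log_q = math.log(discriminator_probs[skill])
    log_p = math.log(1.0 / num_skills)
    return log_q - log_p
```

The reward is positive when the discriminator identifies the active skill better than chance, zero at chance level, and negative otherwise, so each skill is pushed toward states that distinguish it from the others.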

Maximizing MI does NOT encourage skills to cover diverse states; the skills tend to focus on easy behaviors. Taking changes in states (distance) into account could help incentivize challenging behaviors. However, small (but diverse) state changes alone can also maximize MI.

Euclidean distance-maximizing skill discovery may not learn static skills. Any distance can be used based on domain knowledge.

In the max objective, the state difference is multiplied by z, which shows that both direction and magnitude are taken into account: the agent should move as far as it can along direction z (align the two).
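A minimal sketch of this directional reward, assuming the difference is taken directly in state space (methods of this family typically use a learned representation of the state instead). The function name is hypothetical.

```python
def directional_reward(state, next_state, z):
    """Inner product of the state change with the skill vector z."""
    return sum((ns - s) * zi for s, ns, zi in zip(state, next_state, z))
```

Moving along z gives a positive reward that grows with the distance travelled, moving orthogonally gives zero, and moving against z gives a negative reward.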

The Euclidean distance does not necessarily correspond to the behaviors of our interests.

Controllability-aware skill discovery

Learn which states are easy to control and which are hard to control, and give more reward for changing hard-to-control states.

High-level policy makes high-level decision while Low-level policy makes low-level decision (Many design choices)

HRL helps solve long-horizon complex tasks with temporally extended exploration and simplified credit assignment.

Skill chaining

To chain more skills, we need to enlarge the initiation set while keeping the termination set small.

T-STAR

Transition Policy

Skill dynamics model (learn skill prior - task policy)

A skill dynamics model improves sample efficiency for long-horizon tasks.

SPiRL

SkiMo

Homework

Homework 3 DQN


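        if np.random.random() < epsilon:
            # Explore: pick a uniformly random action
            action = torch.tensor([np.random.randint(self.num_actions)])
        else:
            # Exploit: pick the greedy action under the current critic
            with torch.no_grad():
                q_values = self.critic(observation)
                action = q_values.argmax(dim=1)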


        # Compute target values
        with torch.no_grad():
            # HINT: first, compute q(s',a') values for all actions with the target network
            # (we will later find the maximum values among all possible actions)
            next_qa_values = self.target_critic(next_obs)
            if self.use_double_q:
                # HINT: find the best action by evaluating q(s',a') with the current critic
                next_action = self.critic(next_obs).argmax(dim=1, keepdim=True)
            else:
                # HINT: find the next action to choose (hint: argmax) among `next_qa_values`
                next_action = next_qa_values.argmax(dim=1, keepdim=True)
            # HINT: find the next q values among `next_qa_values` by specifying with `next_action`
            # squeeze(1) removes only the action dimension, so a batch of size 1 keeps its shape
            next_q_values = next_qa_values.gather(1, next_action).squeeze(1)
            target_values = reward + self.discount * (~done) * next_q_values
        qa_values = self.critic(obs)
        q_values = qa_values.gather(1, action.unsqueeze(1)).squeeze(1)
        assert q_values.shape == target_values.shape
        # HINT: compute loss with q_values and target_values
        loss = self.critic_loss(q_values, target_values.detach())
        self.critic_optimizer.zero_grad()
        loss.backward()
        self.critic_optimizer.step()
        self.lr_scheduler.step()

Update order: compute loss → optimizer zero grad → loss backward → step optimizer → step scheduler

Homework 4


        if self.target_critic_backup_type == "doubleq":
            indices = torch.arange(num_critic_networks).to(ptu.device)
            double_q_values = torch.where(indices.unsqueeze(1) % 2 == 0, next_qs[1], next_qs[0])
            return double_q_values
        elif self.target_critic_backup_type == "min":
            min_qs, _ = torch.min(next_qs, dim=0, keepdim=True)
            return min_qs.expand_as(next_qs)


    def entropy(self, action_distribution: torch.distributions.Distribution):
        action = action_distribution.rsample()
        entropy = -action_distribution.log_prob(action)

        assert entropy.shape == action.shape[:-1]
        return entropy


next_action_entropy = self.entropy(next_action_distribution)
next_qs = next_qs.clone() + self.temperature * next_action_entropy.unsqueeze(0)

            # TODO(student): Compute the target Q-value
            target_values: torch.Tensor = reward + self.discount * ~done * next_qs
            assert target_values.shape == (
                self.num_critic_networks,
                batch_size
            )

        # TODO(student): Predict Q-values using `self.critic`
        q_values = self.critic(obs, action)
        assert q_values.shape == (self.num_critic_networks, batch_size), q_values.shape
        loss: torch.Tensor = self.critic_loss(q_values, target_values)
        self.critic_optimizer.zero_grad()
        loss.backward()
        self.critic_optimizer.step()


            obs_expanded = obs.unsqueeze(0).expand(self.num_actor_samples, *obs.shape).reshape(-1, *self.observation_shape)
            action_expanded = action.unsqueeze(0).reshape(-1, self.action_dim)
            q_values = self.critic(obs_expanded, action_expanded).view(self.num_critic_networks, self.num_actor_samples, batch_size)
            assert q_values.shape == (
                self.num_critic_networks,
                self.num_actor_samples,
                batch_size,
            ), q_values.shape
            
            # Our best guess of the Q-values is the mean of the ensemble
            q_values = torch.mean(q_values, axis=0)

        # Do REINFORCE (without baseline)
        # TODO(student): Calculate log-probs
        log_probs = action_distribution.log_prob(action)
        assert log_probs.shape == q_values.shape

        # TODO(student): Compute policy gradient using log-probs and Q-values
        loss = -(log_probs * q_values).mean()


    def actor_loss_reparametrize(self, obs: torch.Tensor):
        batch_size = obs.shape[0]

        # Sample from the actor
        action_distribution: torch.distributions.Distribution = self.actor(obs)

        # TODO(student): Sample actions
        # Note: Think about whether to use .rsample() or .sample() here...
        action = action_distribution.rsample()

        # TODO(student): Compute Q-values for the sampled state-action pair
        q_values = self.critic(obs, action).mean(0)

        # TODO(student): Compute the actor loss using Q-values
        loss = -q_values.mean()

Homework 5 CQL


    def dqn_loss(self, ob_no, ac_na, next_ob_no, reward_n, terminal_n):
        qa_t_values = self.q_net(ob_no)
        q_t_values = torch.gather(qa_t_values, 1, ac_na.unsqueeze(1)).squeeze(1)
        qa_tp1_values = self.q_net_target(next_ob_no)

        next_actions = self.q_net(next_ob_no).argmax(dim=1)
        q_tp1 = torch.gather(qa_tp1_values, 1, next_actions.unsqueeze(1)).squeeze(1)

        target = reward_n + self.gamma * q_tp1 * (1 - terminal_n)
        target = target.detach()
        loss = self.loss(q_t_values, target)

        return loss, qa_t_values, q_t_values


        # TODO: CQL Implementation (Equation 1)
        # Hint #1: Obtain dqn_loss, qa_t_values, q_t_values using self.dqn_loss (the first summation in Equation 1)
        # Hint #2: Compute q_t_logsumexp in cql regularizer using torch.logsumexp
        # Hint #3: Compute cql_loss using q_t_logsumexp and q_t_values (the second summation in Equation 1)
        # Hint #4: Finally, compute loss using dqn_loss and cql_loss
        dqn_loss, qa_t_values, q_t_values = self.dqn_loss(ob_no, ac_na, next_ob_no, reward_n, terminal_n)
        q_t_logsumexp = torch.logsumexp(qa_t_values, dim=1)
        cql_loss =  self.cql_alpha * (q_t_logsumexp - q_t_values).mean()
        loss = dqn_loss + cql_loss

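q_t_values: Q-values of the actions actually taken (from the dataset D)

qa_t_values: Q-values of all possible actions a (used in the logsumexp)

The Q-value of the action actually taken is subtracted in the loss because the effect is reversed: minimizing the loss pushes that value up, while the logsumexp term pushes the overestimated Q-values of all actions down.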

Exam

Losses, entropy, etc.

Problems when using skill learning; GAIL

IQL: where the expectile (ex~) is applied

CQL: OOD query term

Limitations of skill dynamics, and skill chaining with 2 skills vs. 3 skills