YSU RL HW2

Training begins with the script `run_hw2.py`. This script contains the function `run_training_loop`, where training, evaluation, and logging happen. The agent, also defined in `run_hw2.py`, is an instance of `PGAgent` in `aai4160/agents/pg_agent.py`. The `PGAgent` class has two main networks as its variables:

actor network: `MLPPolicyPG` in `aai4160/networks/policies.py`, takes observations as input and outputs actions.

critic network: `ValueCritic` in `aai4160/networks/critics.py`, also takes observations as input but outputs value estimates. You will need to implement this in Section 5.

Each training iteration begins by sampling trajectories with `sample_trajectories()`, defined in `infrastructure/utils.py`, and then updates the agent via `PGAgent.update()` in `agents/pg_agent.py`. In `update()`, `PGAgent` first processes rewards into Q-values (`PGAgent._calculate_q_vals()`) and then estimates advantages (`PGAgent._estimate_advantage()`). The actor and the critic are then updated via `MLPPolicyPG.update()` in `aai4160/networks/policies.py` and `ValueCritic.update()` in `aai4160/networks/critics.py`, respectively. If the `--use_ppo` flag is set, the `MLPPolicyPG.ppo_update()` method is used instead of the standard `MLPPolicyPG.update()` method for the actor. You will work on the `MLPPolicyPG.ppo_update()` method in Section

When calculating the log probability of an action under `torch.distributions.Normal`, sum over the last dimension, because `log_prob` returns one value per action dimension. This differs from the discrete case, which uses `torch.distributions.Categorical` and already returns one value per sample.

Section 4

[Clarification] There are some TODOs in `run_hw2.py` that are already implemented. You can skip these TODOs in `run_hw2.py`. However, those parts are important to the flow of training, so please take a look and understand what they are doing.

[Clarification] Even when there is no `self.critic` in `PGAgent`, the code requires a return value from `PGAgent._estimate_advantage`. In this case, you can simply set the return value to `q_values`.

[HINT] When you are implementing Policy Gradients (Section 4), you need to calculate the log probability of the action under the policy's distribution. A `torch.distributions.Distribution` object has its own `log_prob` method, so please use it. This will make the implementation much more concise.

[HINT] The output distribution of the actor for continuous actions should be `torch.distributions.Normal`. In this case, the return value of `torch.distributions.Distribution.log_prob` has shape `(N, d_actions)`, where `N` is the batch size and `d_actions` is the number of dimensions of an action. Therefore, to calculate the log probability of an action with `d_actions` dimensions, you need to sum the `log_prob` values over the last dimension. Note that the resulting log probability should have shape `(N,)`.
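As a rough illustration of the shape bookkeeping in this hint, the sketch below recomputes the per-dimension Gaussian log-density in NumPy and then sums over the last axis. It is a stand-in for what `Normal(mean, std).log_prob(actions).sum(dim=-1)` would do in PyTorch; the helper name `normal_log_prob` and the sizes `N = 4`, `d_actions = 3` are assumptions made only for this example.

```python
import numpy as np

def normal_log_prob(x, mean, std):
    """Element-wise log-density of independent Gaussians (same shape as x)."""
    var = std ** 2
    return -((x - mean) ** 2) / (2 * var) - np.log(std) - 0.5 * np.log(2 * np.pi)

N, d_actions = 4, 3  # hypothetical batch size and action dimension
rng = np.random.default_rng(0)
mean = rng.normal(size=(N, d_actions))
std = np.full((N, d_actions), 0.5)
actions = rng.normal(size=(N, d_actions))

per_dim = normal_log_prob(actions, mean, std)  # shape (N, d_actions)
log_prob = per_dim.sum(axis=-1)                # shape (N,), one value per action
```

Summing is valid here because the action dimensions are treated as independent, so the joint log probability is the sum of the per-dimension log probabilities.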

[HINT] You may encounter a division-by-zero error when implementing advantage normalization (Section 4). To prevent this, it is common practice to add an `epsilon` term to the denominator, i.e. `advantage = (advantage - mean) / (std + epsilon)`, where `epsilon` is a very small value, somewhere between 1e-9 and 1e-5.
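A minimal sketch of this normalization, assuming the advantages arrive as a flat NumPy array. The function name `normalize_advantages` and the default `eps=1e-8` are illustrative choices, not names from the homework code.

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to (roughly) unit std; eps guards std == 0."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

normalized = normalize_advantages([1.0, 2.0, 3.0])
# All-equal advantages have std 0; without eps this would be 0 / 0.
constant = normalize_advantages([5.0, 5.0, 5.0])
```

With the `eps` term, the all-equal case simply maps to zeros instead of producing NaNs.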

[Clarification] In the code, `batch_size` (number of transitions used for updating the agent) can be larger than `args.batch_size`. You do not need to truncate the arrays to have the length `args.batch_size`. Please use all the transitions collected in the iteration, by simply concatenating the arrays. (www.classum.com/main/course/108910/community/51)
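The concatenation described here could look roughly like the sketch below. The dictionary keys and the toy trajectories are hypothetical; `requested_batch_size` stands in for `args.batch_size`.

```python
import numpy as np

# Two hypothetical trajectories of lengths 3 and 4 (7 transitions in total).
trajs = [
    {"observation": np.zeros((3, 2)), "reward": np.array([1.0, 1.0, 1.0])},
    {"observation": np.ones((4, 2)), "reward": np.array([0.5, 0.5, 0.5, 0.5])},
]
requested_batch_size = 5  # stands in for args.batch_size

# Keep every collected transition; do not truncate to the requested size.
obs = np.concatenate([t["observation"] for t in trajs], axis=0)
rewards = np.concatenate([t["reward"] for t in trajs], axis=0)
batch_size = obs.shape[0]  # larger than requested_batch_size here
```

Whole episodes are kept, so the actual number of transitions can overshoot the requested size.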

Section 5

[HINT] In Section 5, you are asked to implement the Neural Network Baseline, or the critic, to reduce the variance of the gradient estimates. As briefly stated in Section 2.2.3 (Baseline), the critic will be trained to regress sample Q-value estimates. So the critic should minimize this objective:

$$L(\phi) = \frac{1}{T} \sum_{t=0}^{T-1} \left( V_\phi^\pi(s_t) - y_t \right)^2, \qquad \text{where } y_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$$

In other words, the critic will be trained to minimize its prediction error against the reward-to-go estimates. For the implementation, consider calculating sample-wise losses for every sample in the batch and then taking the mean.
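To make the objective concrete, here is a small NumPy sketch of the reward-to-go targets and the mean squared error for a single trajectory. The function names `reward_to_go` and `critic_loss`, and the toy numbers, are assumptions for illustration; in the homework the predictions would come from `ValueCritic` and the loss would be computed on tensors.

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """y_t = sum_{k=0}^{T-t-1} gamma^k * r_{t+k} for one trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    y = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # y_t = r_t + gamma * y_{t+1}
        y[t] = running
    return y

def critic_loss(value_preds, targets):
    """Sample-wise squared errors, then the mean over the batch."""
    per_sample = (np.asarray(value_preds) - np.asarray(targets)) ** 2
    return per_sample.mean()

y = reward_to_go([1.0, 2.0, 3.0], gamma=0.5)  # targets for t = 0, 1, 2
loss = critic_loss([2.0, 3.0, 3.0], y)        # hypothetical critic predictions
```

The backward recursion gives the same targets as the double sum in the objective, but in a single pass over the trajectory.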

Section 6

[Clarification] For the GAE experiments in Section 6, the results vary a lot across seeds. Therefore, we will not apply a strict grading criterion regarding performance. Just make sure that your best run has some consecutive interval where performance is close to a score of 180. We provide an example experiment result: if you have correctly implemented the GAE part, you should get results similar to the figure below.

[HINT] Here is a simple test case to check if your GAE calculation (in `PGAgent._estimate_advantage`) is correct.


import numpy as np

# agent.gamma = 0.99
# agent.gae_lambda = 0.95
rewards = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
terminals = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # suppose that these are the values predicted by the critic.

# the estimated advantage should be: 
# [ 4.773285  2.97    0.     2.06144925 0.1185  -3.    ]
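For reference, one way to reproduce the numbers in this test case is the standalone sketch below. It is not the homework's `PGAgent._estimate_advantage`; the function name `gae_advantages` is hypothetical, and it assumes that `terminals[t] == 1` marks the last step of a trajectory and that the value after a terminal step is zero.

```python
import numpy as np

def gae_advantages(rewards, values, terminals, gamma, gae_lambda):
    """Generalized Advantage Estimation over a batch of concatenated trajectories."""
    T = len(rewards)
    values = np.append(values, 0.0)  # dummy entry so values[t + 1] always exists
    advantages = np.zeros(T + 1)
    for t in reversed(range(T)):
        not_done = 1.0 - terminals[t]  # 0 at the last step of a trajectory
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}, reset at terminals
        advantages[t] = delta + gamma * gae_lambda * not_done * advantages[t + 1]
    return advantages[:-1]

rewards = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
terminals = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
adv = gae_advantages(rewards, values, terminals, gamma=0.99, gae_lambda=0.95)
```

The `not_done` mask is what keeps the advantage of one trajectory from leaking into the previous one when trajectories are concatenated.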

PPO

Check that:

`old_logp` has the correct log probability values,

the probability ratio has been correctly calculated from `old_logp` and `logp` (note that these are logs of the probabilities, so you need to take the exp),

the PPO clip loss has been calculated exactly as stated.
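As a sanity check for the ratio and the clipping, the sketch below uses the standard clipped surrogate from the PPO paper, on the assumption that the homework's stated loss has the same form. The function name `ppo_clip_loss` and the parameter `clip_eps` are hypothetical, and NumPy is used only for illustration.

```python
import numpy as np

def ppo_clip_loss(logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = np.exp(logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

old_logp = np.log(np.array([0.5, 0.5]))
logp = np.log(np.array([0.75, 0.25]))  # probability ratios 1.5 and 0.5
advantages = np.array([1.0, -1.0])
loss = ppo_clip_loss(logp, old_logp, advantages)
```

In this toy batch both ratios fall outside `[0.8, 1.2]`, so the minimum selects the clipped term for both samples.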

Response

If only a single episode was collected during evaluation, the standard deviation of the evaluation return (Eval_StdReturn) will be 0 and the mean (Eval_AverageReturn) will be the return of that episode. Note that you can collect more trajectories for evaluation by increasing --eval_batch_size on the command line.

You have to use GAE advantages for PPO, as in the original implementation. The training arguments for PPO already have --gae_lambda 0.97, so if you have correctly finished previous sections, GAE would be automatically used.

It is normal to have a negative average return at the beginning, since HalfCheetah-v4 can have negative rewards. (if you are curious, please refer to this docs for the details of the reward:

About the `Actor Loss`: you cannot expect the loss to go down even if the implementation is correct. This is because the loss here is different from that of standard supervised learning. Note that the loss is not equal to the expected return; the only property we rely on is that the gradient of the loss equals the negative gradient of the expected return. So it is generally hard to interpret the actor loss here, and it is normal to see plots like that.

Here, `batch_size` is simply the number of transitions used for updating the agent, and can be bigger than `args.batch_size` (because the episodes can have more transitions than `min_timesteps_per_batch` as you mentioned). You do not need to truncate the arrays to use only `args.batch_size` transitions. The homework was meant to use all the transitions collected in the iteration, so please concatenate the transitions from all the collected trajectories even if the length exceeds `args.batch_size`.

They all sound like we need to use the Q-value directly as the ideal state value function for training the critic (value function). (Or maybe I misunderstood it.) If so, is this a standard approach? I am confused about why we should use the batch Q-values for training the critic. (Or maybe we could use the Bellman equation?) I think there may be a high possibility of overfitting.

If you take the expectation of those reward-to-go estimates under the distribution of the actions from the policy, you are essentially getting the state value function.

If we train the critic to minimize the mean squared error, it should converge to output the mean of the reward-to-go estimates (=sample q value estimates). And the mean (=expectation) corresponds to the value function.
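A tiny numerical illustration of this point, under the simplifying assumption that the "critic" is a single scalar prediction for one state and the targets are noisy reward-to-go samples from that state (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical noisy reward-to-go samples observed from the same state.
targets = 10.0 + rng.normal(scale=3.0, size=1000)

v = 0.0   # scalar stand-in for the critic's prediction at that state
lr = 0.1
for _ in range(500):
    grad = 2.0 * np.mean(v - targets)  # d/dv of mean((v - targets)^2)
    v -= lr * grad

sample_mean = targets.mean()  # gradient descent on the MSE drives v to this
```

Each individual target is noisy, but the minimizer of the squared error is their average, which is why regressing on single-sample returns still recovers the value function in expectation.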

This relates to the Bellman equation from the lecture slides: the expectation of the Q-function under the policy equals the value function. Please see the equation below.
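The relation in question is presumably the standard identity between the value function and the Q-function. It is written here in the notation of the critic objective as an assumption, since the exact form used on the lecture slide may differ:

```latex
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\left[ Q^{\pi}(s_t, a_t) \right],
\qquad
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t \right]
```

Under this reading, the reward-to-go $y_t$ is a single-sample estimate of $Q^{\pi}(s_t, a_t)$, and averaging it over actions drawn from the policy gives $V^{\pi}(s_t)$.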

Using sample estimates can have high variance, but the estimate is unbiased and correct.

For the overfitting issue: since the target value (`q_values`, which is just the sum of rewards) comes from a single sample, it is noisy even though it is unbiased, so I also think it could be a problem. But by updating with new samples in each iteration and also using an entropy term for exploration, we could avoid getting stuck in local minima. (Correct me if I am wrong!)

Also, for the Bellman approach, since the target is also a function of the parameters, we should treat the target as constant with respect to the parameters (maybe with `torch.no_grad`; not sure about this).

https://www.classum.com/main/course/108910/community/41
