Commit 42a1fa1

Author: Abdelrahman Ogail (committed)
Solution to problem berkeleydeeprlcourse#5
1 parent 2919a91 commit 42a1fa1

3 files changed (+42 -20 lines)


hw2/README.md

Lines changed: 30 additions & 9 deletions
@@ -1,13 +1,14 @@
 # CS294-112 HW 2: Policy Gradient
 
 Dependencies:
-* Python **3.5**
-* Numpy version **1.14.5**
-* TensorFlow version **1.10.5**
-* MuJoCo version **1.50** and mujoco-py **1.50.1.56**
-* OpenAI Gym version **0.10.5**
-* seaborn
-* Box2D==**2.3.2**
+
+- Python **3.5**
+- Numpy version **1.14.5**
+- TensorFlow version **1.10.5**
+- MuJoCo version **1.50** and mujoco-py **1.50.1.56**
+- OpenAI Gym version **0.10.5**
+- seaborn
+- Box2D==**2.3.2**
 
 Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.
 
@@ -16,11 +17,15 @@ The only file that you need to look at is `train_pg_f18.py`, which you will impl
 See the [HW2 PDF](http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf) for further instructions.
 
 # Answers to Homework Experiments
+
 ## Problem 4 (CartPole)
+
 ### Summary
+
 The benchmark consisted of multiple experiments varying [reward-to-go vs. Monte Carlo rewards], [advantage normalization vs. no normalization], and [large vs. small batch size]. Each experiment ran for 100 iterations, and each configuration was repeated 3 times to gauge variance. Below are general observations:
-- Convergence: using reward to go resulted into faster convergence than monte carlo reward
-- Variance: the following parameters helped reducing the variance: increasing batch size and advantage normalization
+
+- Convergence: using reward-to-go resulted in faster convergence than the Monte Carlo reward
+- Variance: increasing the batch size and normalizing the advantage both helped reduce the variance
 
 ### Plots

@@ -29,11 +34,27 @@ The benchmark included running multiple experiments with tuning parameters like
 ![](fig/sb_CartPole-v0.png)
 
 ### Answers
+
 Q1- Which gradient estimator has better performance without advantage-centering—the trajectory-centric one, or the one using reward-to-go?
+
 > The reward-to-go estimator is better because it has lower variance.
 
 Q2- Did advantage centering help?
+
 > Yes, it helped reduce the variance and sped up convergence a bit.
 
 Q3- Did the batch size make an impact?
+
 > Yes, it did: larger batch sizes result in lower variance and low bias.
+
+## Problem 5
+
+### Summary
+
+The command below was used to generate the figure:
+
+```bash
+python3 train_pg_f18.py InvertedPendulum-v2 -n 100 -b 5000 -e 5 -rtg --exp_name hc_b5000_r0.0111 --learning_rate 1e-2 --n_layers 2 --size 16
+```
+
+![](fig/InvertedPendulum-v2.png)
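The Problem 4 summary above contrasts reward-to-go with full-trajectory (Monte Carlo) returns and mentions advantage normalization, and the Problem 5 command enables reward-to-go via `-rtg`. As a reference, here is a minimal numpy sketch of the two return estimators and of advantage normalization; it is illustrative only (not part of this commit), and the function names are made up:

```python
import numpy as np

def full_trajectory_returns(rewards, gamma=1.0):
    # Monte Carlo estimator: every timestep is credited with the whole trajectory's return.
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go(rewards, gamma=1.0):
    # Reward-to-go estimator: timestep t is credited only with rewards from t onward,
    # dropping terms the action at t cannot influence, which lowers variance.
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def normalize_advantages(adv, eps=1e-8):
    # Advantage normalization: rescale to zero mean and unit standard deviation per batch.
    return (adv - np.mean(adv)) / (np.std(adv) + eps)

rews = [1.0, 1.0, 1.0, 0.0, 1.0]
print(full_trajectory_returns(rews))  # [4. 4. 4. 4. 4.]
print(reward_to_go(rews))             # [4. 3. 2. 1. 1.]
print(normalize_advantages(reward_to_go(rews)))
```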

hw2/fig/InvertedPendulum-v2.png

52.3 KB (binary file)

hw2/train_pg_f18.py

Lines changed: 12 additions & 11 deletions
@@ -207,9 +207,9 @@ def sample_action(self, policy_parameters):
         else:
             sy_mean, sy_logstd = policy_parameters
             # YOUR_CODE_HERE
-            sy_sampled_ac = sy_mean + tf.multipy(tf.math.exp(sy_logstd),
-                                                 tf.random_normal(shape=sy_mean.shape))
-            assert sy_sampled_ac.shape.as_list() == [sy_mean.shape.as_list()]
+            sy_sampled_ac = sy_mean + \
+                tf.math.multiply(tf.math.exp(sy_logstd), tf.random_normal(shape=sy_logstd.shape))
+            assert sy_sampled_ac.shape.as_list() == sy_mean.shape.as_list()
         return sy_sampled_ac
 
     #========================================================================================#
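The fixed lines above draw a continuous action by reparameterization, a = mean + exp(logstd) * z with z ~ N(0, I). A standalone numpy sketch of the same computation, for reference only (not part of the commit):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_diag_gaussian(mean, logstd):
    # One draw from N(mean, diag(exp(logstd))^2) via a = mean + exp(logstd) * z, z ~ N(0, I).
    z = rng.standard_normal(np.shape(mean))
    return mean + np.exp(logstd) * z

mean = np.array([0.5, -1.0])
logstd = np.array([-0.7, 0.0])  # per-dimension std = exp(logstd), roughly [0.50, 1.00]
print(sample_diag_gaussian(mean, logstd))
```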
@@ -250,9 +250,11 @@ def get_log_prob(self, policy_parameters, sy_ac_na):
             # initialize a single self.ac_dim-variate Gaussian.
             mvn = tf.contrib.distributions.MultivariateNormalDiag(loc=sy_mean,
                                                                   scale_diag=tf.math.exp(sy_logstd))
-            sy_logprob_n = mvn.log_prob(sy_ac_na)
-
-            assert sy_logprob_n.shape.as_list() == sy_mean.shape.as_list()
+            # CORRECTION: negate the log probability so that minimizing the loss below
+            # performs gradient ascent on E[log pi(a|s) * advantage].
+            sy_logprob_n = -mvn.log_prob(sy_ac_na)
+            assert sy_logprob_n.shape.as_list() == [sy_mean.shape.as_list()[0]]
+            self.sy_logprob_n = sy_logprob_n
         return sy_logprob_n
 
     def build_computation_graph(self):
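For intuition, the log probability returned by `MultivariateNormalDiag` factorizes over action dimensions; a numpy sketch of the same quantity, illustrative only:

```python
import numpy as np

def diag_gaussian_log_prob(ac, mean, logstd):
    # log N(ac; mean, diag(exp(logstd))^2) = sum over dimensions of 1-D Gaussian log densities.
    std = np.exp(logstd)
    return np.sum(-0.5 * ((ac - mean) / std) ** 2 - logstd - 0.5 * np.log(2.0 * np.pi), axis=-1)

mean = np.zeros((1, 2))
logstd = np.zeros(2)           # unit standard deviation in both dimensions
ac = np.array([[1.0, -1.0]])
print(diag_gaussian_log_prob(ac, mean, logstd))  # approx. [-2.8379] for a 2-D standard normal
```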
@@ -294,7 +296,6 @@ def build_computation_graph(self):
         # Loss Function and Training Operation
         #========================================================================================#
         # YOUR CODE HERE
-        # EXPERIMENT use * instead of tf.multiply operator
         self.loss = tf.reduce_mean(self.sy_logprob_n * self.sy_adv_n)
         self.update_op = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
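Because `sy_logprob_n` already carries a minus sign (previous hunk), the loss above equals mean(-log pi(a|s) * advantage), so minimizing it with Adam is gradient ascent on the usual policy-gradient objective. A toy numpy check of the sign convention, illustrative only:

```python
import numpy as np

logprob = np.array([-1.2, -0.3, -2.0])  # log pi(a_t | s_t) for three sampled steps
adv = np.array([0.5, -1.0, 2.0])        # advantage estimates for the same steps

sy_logprob_n = -logprob                 # what get_log_prob returns after the correction
loss = np.mean(sy_logprob_n * adv)      # pseudo-loss; its gradient is the negated policy gradient
print(loss)                             # approx. 1.433; lowering it raises log-prob where adv > 0
```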

@@ -350,11 +351,11 @@ def sample_trajectory(self, env, animate_this_episode):
             #====================================================================================#
             # ----------PROBLEM 3----------
             #====================================================================================#
-            ac = self.sess.run(self.sy_sampled_ac, feed_dict={
-                self.sy_ob_no: ob[None]}) # YOUR CODE HERE
+            # YOUR CODE HERE
+            ac = self.sess.run(self.sy_sampled_ac, feed_dict={self.sy_ob_no: ob[None]})
             ac = ac[0]
             acs.append(ac)
-            ob, rew, done, _ = env.step(ac.squeeze())
+            ob, rew, done, _ = env.step(ac)
             rewards.append(rew)
             steps += 1
             if done or steps > self.max_path_length:
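The sampling code above feeds `ob[None]` to add a leading batch dimension and then takes `ac[0]` so that `env.step` receives an unbatched action. A quick numpy illustration of the shapes involved; the 1-D action shape is an assumption matching an environment like InvertedPendulum-v2:

```python
import numpy as np

ob = np.array([0.1, -0.2, 0.3, 0.05])  # one observation, shape (4,)
batched_ob = ob[None]                   # shape (1, 4): a batch of one, as the policy graph expects
print(batched_ob.shape)                 # (1, 4)

ac_batch = np.array([[0.7]])            # the sampled action comes back batched, shape (1, 1)
ac = ac_batch[0]                        # shape (1,): a single action vector for env.step
print(ac.shape)                         # (1,)
```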
@@ -564,7 +565,7 @@ def update_parameters(self, ob_no, ac_na, q_n, adv_n):
         # YOUR_CODE_HERE
         _, loss, summary = self.sess.run([self.update_op, self.loss, self.merged],
                                          feed_dict={self.sy_ob_no: ob_no,
-                                                     self.sy_ac_na: ac_na.squeeze(),
+                                                     self.sy_ac_na: ac_na,
                                                      self.sy_adv_n: adv_n})
 
         # write logs at every iteration
