- Python **3.5**
- Numpy version **1.14.5**
- TensorFlow version **1.10.5**
- MuJoCo version **1.50** and mujoco-py **1.50.1.56**
- OpenAI Gym version **0.10.5**
- seaborn
- Box2D==**2.3.2**
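
As a quick sanity check, a minimal Python snippet along these lines can confirm what is installed against the versions listed above (module names only; this is a sketch, not part of the homework code, and it stays compatible with Python 3.5):

```python
import importlib

# Packages to verify against the versions listed above.
PACKAGES = ["numpy", "tensorflow", "gym", "mujoco_py", "seaborn", "Box2D"]

for name in PACKAGES:
    try:
        module = importlib.import_module(name)
        # Most of these expose __version__; fall back to "unknown" otherwise.
        print("{}: {}".format(name, getattr(module, "__version__", "unknown")))
    except ImportError as err:
        print("{}: NOT INSTALLED ({})".format(name, err))
```
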
Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.
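
One way to do the replacement is sketched below, assuming you run it from the directory that contains the provided `lunar_lander.py`; adjust the source path if your copy lives elsewhere:

```python
import os
import shutil

import gym

# Find where the installed gym package lives and point at its Box2D lander.
gym_dir = os.path.dirname(gym.__file__)
target = os.path.join(gym_dir, "envs", "box2d", "lunar_lander.py")

# Overwrite it with the patched lunar_lander.py provided with this homework.
shutil.copy("lunar_lander.py", target)
print("Replaced {}".format(target))
```
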
The only file that you need to look at is `train_pg_f18.py`, which you will implement.

See the [HW2 PDF](http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf) for further instructions.
# Answers to Homework Experiments
## Problem 4 (CartPole)
### Summary
The benchmark consisted of multiple experiments sweeping the following settings: [reward-to-go vs. Monte Carlo rewards], [advantage normalization vs. no advantage normalization], and [large vs. small batch size]. Each experiment ran for 100 iterations, and each configuration was repeated 3 times to also get a sense of the variance. The general observations, with a short sketch of the estimators after the list:

- Convergence: using reward-to-go resulted in faster convergence than the Monte Carlo reward.
- Variance: increasing the batch size and normalizing the advantages both helped reduce the variance.
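
For concreteness, here is a minimal NumPy sketch of the three ingredients being compared; the function names are illustrative and are not taken from the homework code:

```python
import numpy as np

def full_trajectory_return(rewards, gamma=1.0):
    """Monte Carlo ("trajectory-centric") estimator: every timestep is
    weighted by the same total discounted return of the whole trajectory."""
    total = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go(rewards, gamma=1.0):
    """Reward-to-go estimator: each timestep is weighted only by the
    discounted rewards from that timestep onward, which lowers variance."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def normalize_advantages(adv, eps=1e-8):
    """Advantage normalization: rescale to zero mean and unit standard
    deviation before taking the policy gradient step."""
    return (adv - np.mean(adv)) / (np.std(adv) + eps)

rews = np.array([1.0, 1.0, 1.0])
print(reward_to_go(rews))            # [3. 2. 1.]
print(full_trajectory_return(rews))  # [3. 3. 3.]
```
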
### Plots
### Answers
Q1- Which gradient estimator has better performance without advantage-centering—the trajectory-centric one, or the one using reward-to-go?
> The reward-to-go estimator is better because it has lower variance.
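
In standard policy-gradient notation (a reference sketch, not taken from the homework handout), the two estimators differ only in which rewards multiply each log-probability term. Trajectory-centric (full return):

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t'=1}^{T} r(s_{i,t'},a_{i,t'})\right)$$

Reward-to-go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'})\right)$$

Dropping the rewards earned before time $t$ removes terms whose expectation is zero but whose variance is not, which is why reward-to-go gives the lower-variance estimate.
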
Q2- Did advantage centering help?
> Yes, it helped reduce the variance and sped up convergence a bit.

Q3- Did the batch size make an impact?
> Yes, it did: larger batch sizes resulted in lower variance and low bias.