- Python **3.5**
- Numpy version **1.14.5**
- TensorFlow version **1.10.5**
- MuJoCo version **1.50** and mujoco-py **1.50.1.56**
- OpenAI Gym version **0.10.5**
- seaborn
- Box2D==**2.3.2**
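
As a quick sanity check, a minimal Python snippet along these lines can confirm what is installed against the versions listed above (module names only; this is a sketch, not part of the homework code, and it stays compatible with Python 3.5):

```python
import importlib

# Packages to verify against the versions listed above.
PACKAGES = ["numpy", "tensorflow", "gym", "mujoco_py", "seaborn", "Box2D"]

for name in PACKAGES:
    try:
        module = importlib.import_module(name)
        # Most of these expose __version__; fall back to "unknown" otherwise.
        print("{}: {}".format(name, getattr(module, "__version__", "unknown")))
    except ImportError as err:
        print("{}: NOT INSTALLED ({})".format(name, err))
```
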
Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.
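
One way to do the replacement is sketched below, assuming you run it from the directory that contains the provided `lunar_lander.py`; adjust the source path if your copy lives elsewhere:

```python
import os
import shutil

import gym

# Find where the installed gym package lives and point at its Box2D lander.
gym_dir = os.path.dirname(gym.__file__)
target = os.path.join(gym_dir, "envs", "box2d", "lunar_lander.py")

# Overwrite it with the patched lunar_lander.py provided with this homework.
shutil.copy("lunar_lander.py", target)
print("Replaced {}".format(target))
```
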
The only file that you need to look at is `train_pg_f18.py`, which you will implement.

See the [HW2 PDF](http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf) for further instructions.
# Answers to Homework Experiments
## Problem 4 (CartPole)
### Summary
The benchmark consisted of multiple experiments sweeping the following settings: [reward-to-go vs. Monte Carlo rewards], [advantage normalization vs. no advantage normalization], and [large vs. small batch size]. Each experiment ran for 100 iterations, and each configuration was repeated 3 times to also get a sense of the variance. The general observations, with a short sketch of the estimators after the list:

- Convergence: using reward-to-go resulted in faster convergence than the Monte Carlo reward.
- Variance: increasing the batch size and normalizing the advantages both helped reduce the variance.
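
For concreteness, here is a minimal NumPy sketch of the three ingredients being compared; the function names are illustrative and are not taken from the homework code:

```python
import numpy as np

def full_trajectory_return(rewards, gamma=1.0):
    """Monte Carlo ("trajectory-centric") estimator: every timestep is
    weighted by the same total discounted return of the whole trajectory."""
    total = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go(rewards, gamma=1.0):
    """Reward-to-go estimator: each timestep is weighted only by the
    discounted rewards from that timestep onward, which lowers variance."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def normalize_advantages(adv, eps=1e-8):
    """Advantage normalization: rescale to zero mean and unit standard
    deviation before taking the policy gradient step."""
    return (adv - np.mean(adv)) / (np.std(adv) + eps)

rews = np.array([1.0, 1.0, 1.0])
print(reward_to_go(rews))            # [3. 2. 1.]
print(full_trajectory_return(rews))  # [3. 3. 3.]
```
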
### Plots
### Answers
Q1- Which gradient estimator has better performance without advantage-centering—the trajectory-centric one, or the one using reward-to-go?
> The reward-to-go estimator is better because it has lower variance.
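
In standard policy-gradient notation (a reference sketch, not taken from the homework handout), the two estimators differ only in which rewards multiply each log-probability term. Trajectory-centric (full return):

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t'=1}^{T} r(s_{i,t'},a_{i,t'})\right)$$

Reward-to-go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'})\right)$$

Dropping the rewards earned before time $t$ removes terms whose expectation is zero but whose variance is not, which is why reward-to-go gives the lower-variance estimate.
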
Q2- Did advantage centering help?
> Yes, it helped reduce the variance and sped up convergence a bit.

Q3- Did the batch size make an impact?
> Yes, it did: larger batch sizes resulted in lower variance and low bias.