Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.
The only file that you need to look at is `train_pg_f18.py`, which you will implement.
See the [HW2 PDF](http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf) for further instructions.
# Answers to Homework Experiments
## Problem 4 (CartPole)
### Summary
The benchmark ran multiple experiments sweeping over the following choices: reward-to-go vs. Monte Carlo (full-trajectory) rewards, advantage normalization vs. no advantage normalization, and large vs. small batch size. Each experiment ran for 100 iterations, and every configuration was repeated 3 times to gauge variance. General observations:
- Convergence: using reward-to-go led to faster convergence than the Monte Carlo (full-trajectory) reward.
- Variance: increasing the batch size and normalizing the advantages both helped reduce variance (see the sketch below).
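
To make the comparison above concrete, here is a minimal sketch of the two return estimators (illustrative code with made-up function names, not the graded `train_pg_f18.py` implementation). The Monte Carlo estimator assigns the full trajectory return to every timestep, while reward-to-go sums only the rewards from the current timestep onward, which is the source of its lower variance.

```python
import numpy as np

def full_trajectory_returns(rewards, gamma=1.0):
    """Monte Carlo estimator: every timestep is weighted by the same total discounted return."""
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go_returns(rewards, gamma=1.0):
    """Reward-to-go estimator: each timestep sums only the (discounted) rewards from that point on."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 4-step trajectory with reward 1 at every step (undiscounted)
rewards = [1.0, 1.0, 1.0, 1.0]
print(full_trajectory_returns(rewards))  # [4. 4. 4. 4.]
print(reward_to_go_returns(rewards))     # [4. 3. 2. 1.]
```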
### Plots


### Answers
Q1- Which gradient estimator has better performance without advantage-centering—the trajectory-centric one, or the one using reward-to-go?
> The reward-to-go estimator performs better because it has lower variance.

Q2- Did advantage centering help?
> Yes, it helped reduce the variance and sped up convergence a bit.
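
For reference, advantage centering/normalization here means standardizing the batch of advantages to zero mean and unit standard deviation before forming the policy-gradient loss. A minimal sketch with illustrative names (not the graded `train_pg_f18.py` implementation):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Center and rescale advantages to zero mean and (roughly) unit standard deviation.

    Keeping the advantage scale consistent across batches stabilizes the
    gradient magnitude, which is why it tends to reduce update variance.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

adv = np.array([2.0, -1.0, 0.5, 3.5])
print(normalize_advantages(adv))  # mean ~ 0, std ~ 1
```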
Q3- Did the batch size make an impact?
> Yes, larger batch sizes resulted in lower variance while keeping bias low.