# RMSProp
:label:`sec_rmsprop`
One of the key issues in :numref:`sec_adagrad` is that the learning rate decreases at a predefined schedule of effectively $\mathcal{O}(t^{-\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise adaptivity of Adagrad is highly desirable as a preconditioner.

:cite:`Tieleman.Hinton.2012` proposed the RMSProp algorithm as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the squares of the gradient $\mathbf{g}_t$ into a state vector $\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$. As a result $\mathbf{s}_t$ keeps on growing without bound due to the lack of normalization, essentially linearly as the algorithm converges.

One way of fixing this problem would be to use $\mathbf{s}_t / t$. For reasonable distributions of $\mathbf{g}_t$ this will converge. Unfortunately it might take a very long time until the limit behavior starts to matter since the procedure remembers the full trajectory of values. An alternative is to use a leaky average in the same way we used in the momentum method, i.e., $\mathbf{s}_t \leftarrow \gamma \mathbf{s}_{t-1} + (1-\gamma) \mathbf{g}_t^2$ for some parameter $\gamma > 0$. Keeping all other parts unchanged yields RMSProp.
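To make the difference concrete, here is a small framework-free sketch (the constant squared gradient `g2` is a made-up simplification for illustration, not part of the book's code): with a fixed gradient magnitude, Adagrad's state grows linearly in $t$, while the leaky average stays bounded.

```{.python .input}
#@tab all
# Illustrative only: compare Adagrad's accumulator with a leaky average
# for a constant squared gradient g2 (a simplifying assumption).
g2, gamma = 1.0, 0.9
s_adagrad, s_leaky = 0.0, 0.0
for t in range(100):
    s_adagrad += g2                               # s_t = s_{t-1} + g_t^2
    s_leaky = gamma * s_leaky + (1 - gamma) * g2  # leaky average of g_t^2
print(s_adagrad)  # 100.0: grows without bound in t
print(s_leaky)    # ~1.0: bounded by the largest g_t^2
```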
## The Algorithm

Let's write out the equations in detail.
$$\begin{aligned}
    \mathbf{s}_t & \leftarrow \gamma \mathbf{s}_{t-1} + (1 - \gamma) \mathbf{g}_t^2, \\
    \mathbf{x}_t & \leftarrow \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
\end{aligned}$$
The constant $\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero or overly large step sizes. Given this expansion we are now free to control the learning rate $\eta$ independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages we can apply the same reasoning as previously applied in the case of the momentum method. Expanding the definition of $\mathbf{s}_t$ yields
$$\begin{aligned}
\mathbf{s}_t & = (1 - \gamma) \mathbf{g}_t^2 + \gamma \mathbf{s}_{t-1} \\
& = (1 - \gamma) \left(\mathbf{g}_t^2 + \gamma \mathbf{g}_{t-1}^2 + \gamma^2 \mathbf{g}_{t-2}^2 + \ldots \right).
\end{aligned}$$
As before in :numref:`sec_momentum` we use $1 + \gamma + \gamma^2 + \ldots = \frac{1}{1-\gamma}$. Hence the sum of weights is normalized to $1$ with a half-life time of an observation of $\gamma^{-1}$. Let's visualize the weights for the past 40 time steps for various choices of $\gamma$.
```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()
```
```{.python .input}
#@tab pytorch
from d2l import torch as d2l
import torch
import math
```
```{.python .input}
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
import math
```
```{.python .input}
#@tab all
d2l.set_figsize()
gammas = [0.95, 0.9, 0.8, 0.7]
for gamma in gammas:
    # Weight of the observation i steps in the past: (1-gamma) * gamma^i
    x = d2l.numpy(d2l.arange(40))
    d2l.plt.plot(x, (1-gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();
```
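We can also check the normalization numerically. Truncating the series after $T$ terms gives $(1-\gamma)\sum_{i=0}^{T-1} \gamma^i = 1 - \gamma^T$, which approaches $1$ as $T$ grows. The snippet below is an illustrative check, not part of the training code.

```{.python .input}
#@tab all
# The truncated weight sum equals 1 - gamma^T and hence tends to 1.
gamma, T = 0.9, 40
weight_sum = sum((1 - gamma) * gamma ** i for i in range(T))
print(weight_sum)      # ~0.9852
print(1 - gamma ** T)  # identical up to floating point error
```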
## Implementation from Scratch

As before we use the quadratic function $f(\mathbf{x})=0.1x_1^2+2x_2^2$ to observe the trajectory of RMSProp. Recall that in :numref:`sec_adagrad`, when we used Adagrad with a learning rate of 0.4, the variables moved only very slowly in the later stages of the algorithm since the learning rate decreased too quickly. Since $\eta$ is controlled separately this does not happen with RMSProp.
```{.python .input}
#@tab all
def rmsprop_2d(x1, x2, s1, s2):
    # Partial derivatives of f_2d with respect to x1 and x2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta, gamma = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))
```
Next, we implement RMSProp to be used in a deep network. This is equally straightforward.
```{.python .input}
#@tab mxnet,pytorch
def init_rmsprop_states(feature_dim):
    s_w = d2l.zeros((feature_dim, 1))
    s_b = d2l.zeros(1)
    return (s_w, s_b)
```
```{.python .input}
#@tab tensorflow
def init_rmsprop_states(feature_dim):
    s_w = tf.Variable(d2l.zeros((feature_dim, 1)))
    s_b = tf.Variable(d2l.zeros(1))
    return (s_w, s_b)
```
```{.python .input}
def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        s[:] = gamma * s + (1 - gamma) * np.square(p.grad)
        p[:] -= hyperparams['lr'] * p.grad / np.sqrt(s + eps)
```
```{.python .input}
#@tab pytorch
def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()
```
```{.python .input}
#@tab tensorflow
def rmsprop(params, grads, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s, g in zip(params, states, grads):
        s[:].assign(gamma * s + (1 - gamma) * tf.math.square(g))
        p[:].assign(p - hyperparams['lr'] * g / tf.math.sqrt(s + eps))
```
We set the initial learning rate to 0.01 and the weighting term $\gamma$ to 0.9. That is, $\mathbf{s}$ aggregates on average over the past $1/(1-\gamma) = 10$ observations of the squared gradient.
```{.python .input}
#@tab all
data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),
               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);
```
## Concise Implementation

Since RMSProp is a rather popular algorithm it is also available in the `Trainer` instance. All we need to do is instantiate it using an algorithm named `rmsprop`, assigning $\gamma$ to the parameter `gamma1`.
```{.python .input}
d2l.train_concise_ch11('rmsprop', {'learning_rate': 0.01, 'gamma1': 0.9},
                       data_iter)
```
```{.python .input}
#@tab pytorch
trainer = torch.optim.RMSprop
d2l.train_concise_ch11(trainer, {'lr': 0.01, 'alpha': 0.9},
                       data_iter)
```
```{.python .input}
#@tab tensorflow
trainer = tf.keras.optimizers.RMSprop
d2l.train_concise_ch11(trainer, {'learning_rate': 0.01, 'rho': 0.9},
                       data_iter)
```
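For reference, the built-in optimizer can also be used directly, outside the d2l training helpers. Below is a minimal PyTorch sketch; the toy model and random data are placeholders for illustration, not part of the book's pipeline. Note that PyTorch names the weighting term `alpha` while Keras names it `rho`.

```{.python .input}
#@tab pytorch
# Minimal standalone usage sketch (toy model and random data are made up).
net = torch.nn.Linear(5, 1)
optimizer = torch.optim.RMSprop(net.parameters(), lr=0.01, alpha=0.9)
X, y = torch.randn(10, 5), torch.randn(10, 1)
loss = torch.nn.functional.mse_loss(net(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # applies the per-coordinate RMSProp update
```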
## Summary

- RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.
- RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.
- The learning rate needs to be scheduled by the experimenter in practice.
- The coefficient $\gamma$ determines how long the history is when adjusting the per-coordinate scale.
## Exercises

1. What happens experimentally if we set $\gamma = 1$? Why?
2. Rotate the optimization problem to minimize $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What happens to the convergence?
3. Try out what happens to RMSProp on a real machine learning problem, such as training on Fashion-MNIST. Experiment with different choices for adjusting the learning rate.
4. Would you want to adjust $\gamma$ as optimization progresses? How sensitive is RMSProp to this?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: