# Momentum
:label:`sec_momentum`
In :numref:`sec_sgd` we reviewed what happens when performing stochastic gradient descent, i.e., when performing optimization where only a noisy variant of the gradient is available. In particular, we noticed that for noisy gradients we need to be extra cautious when choosing the learning rate in the face of noise. If we decrease it too rapidly, convergence stalls. If we are too lenient, we fail to converge to a good enough solution since noise keeps driving us away from optimality.
In this section, we will explore more effective optimization algorithms, especially for certain types of optimization problems that are common in practice.
In the previous section we discussed minibatch SGD as a means of accelerating computation. It also had the nice side-effect that averaging gradients reduces variance. The minibatch stochastic gradient can be calculated by:
$$\mathbf{g}_{t, t-1} = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}_{t-1}) = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{h}_{i, t-1}.$$
To keep the notation simple, here we used $\mathbf{h}_{i, t-1} = \partial_{\mathbf{w}} f(\mathbf{x}_i, \mathbf{w}_{t-1})$ as the stochastic gradient for sample $i$ using the weights updated at time $t-1$. It would be nice if we could benefit from the effect of variance reduction even beyond averaging gradients on a minibatch. One option to accomplish this task is to replace the gradient computation by a "leaky average":
$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}$$
for some $\beta \in (0, 1)$. This effectively replaces the instantaneous gradient by one that has been averaged over multiple *past* gradients. $\mathbf{v}$ is called *momentum*. It accumulates past gradients, similar to how a heavy ball rolling down the objective function landscape integrates over past forces. To see what is happening in more detail let's expand $\mathbf{v}_t$ recursively into
$$\begin{aligned}
\mathbf{v}_t = \beta^2 \mathbf{v}_{t-2} + \beta \mathbf{g}_{t-1, t-2} + \mathbf{g}_{t, t-1}
= \ldots = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}.
\end{aligned}$$
Large $\beta$ amounts to a long-range average, whereas small $\beta$ amounts to only a slight correction relative to a gradient method. The new gradient replacement no longer points into the direction of steepest descent on a particular instance but rather in the direction of a weighted average of past gradients. This allows us to realize most of the benefits of averaging over a batch without the cost of actually computing the gradients on it.
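To make the leaky average concrete, here is a minimal sketch (plain NumPy, with a made-up gradient sequence; `grads` and `beta` are chosen purely for illustration) verifying that unrolling the recursion $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}$ reproduces the explicit weighted sum above.

```python
import numpy as np

beta = 0.9
grads = np.random.randn(10)  # hypothetical sequence of (scalar) gradients

# Recursive form: v_t = beta * v_{t-1} + g_t, starting from v_0 = 0
v = 0.0
for g in grads:
    v = beta * v + g

# Explicit form: sum over tau of beta^tau * g_{t - tau}
weights = beta ** np.arange(len(grads) - 1, -1, -1)  # oldest gets beta^(t-1)
v_explicit = (weights * grads).sum()

print(np.allclose(v, v_explicit))  # True
```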
The above reasoning formed the basis for what is now known as accelerated gradient methods, such as gradients with momentum. They enjoy the additional benefit of being much more effective in cases where the optimization problem is ill-conditioned (i.e., where there are some directions where progress is much slower than in others, resembling a narrow canyon). Furthermore, they allow us to average over subsequent gradients to obtain more stable directions of descent. Indeed, the aspect of acceleration even for noise-free convex problems is one of the key reasons why momentum works and why it works so well.
As one would expect, due to its efficacy momentum is a well-studied subject in optimization for deep learning and beyond. See e.g., the beautiful expository article by :cite:`Goh.2017` for an in-depth analysis and interactive animation. It was proposed by :cite:`Polyak.1964`. :cite:`Nesterov.2018` has a detailed theoretical discussion in the context of convex optimization. Momentum in deep learning has been known to be beneficial for a long time. See e.g., the discussion by :cite:`Sutskever.Martens.Dahl.ea.2013` for details.
To get a better understanding of the geometric properties of the momentum method we revisit gradient descent, albeit with a significantly less pleasant objective function. Recall that in :numref:`sec_gd` we used $f(\mathbf{x}) = x_1^2 + 2 x_2^2$, i.e., a moderately distorted ellipsoid objective. We distort this function further by stretching it out in the $x_1$ direction via

$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2.$$

As before, $f$ has its minimum at $(0, 0)$. This function is *very* flat in the direction of $x_1$. Let's see what happens when we perform gradient descent as before on this new function. We pick a learning rate of $0.4$.
```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()

eta = 0.4

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
```
```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch

eta = 0.4

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
```
```{.python .input}
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf

eta = 0.4

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
```
By construction, the gradient in the $x_2$ direction is *much* higher and changes much more rapidly than in the horizontal $x_1$ direction. Thus we are stuck between two undesirable choices: if we pick a small learning rate we ensure that the solution does not diverge in the $x_2$ direction, but we are saddled with slow convergence in the $x_1$ direction. Conversely, with a large learning rate we progress rapidly in the $x_1$ direction but diverge in $x_2$. The example below illustrates what happens even after a slight increase in learning rate from $0.4$ to $0.6$. Convergence in the $x_1$ direction improves but the overall solution quality is much worse.
```{.python .input}
#@tab all
eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
```
The momentum method allows us to solve the gradient descent problem described above. Looking at the optimization trace above we might intuit that averaging gradients over the past would work well. After all, in the $x_1$ direction this will aggregate well-aligned gradients, thus increasing the distance we cover with every step. Conversely, in the $x_2$ direction where gradients oscillate, an aggregate gradient will reduce step size due to oscillations that cancel each other out. Using $\mathbf{v}_t$ instead of the gradient $\mathbf{g}_t$ yields the following update equations:
$$\begin{aligned}
\mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\
\mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t.
\end{aligned}$$
Note that for $\beta = 0$ we recover regular gradient descent. Before delving deeper into the mathematical properties let's have a quick look at how the algorithm behaves in practice.
```{.python .input}
#@tab all
def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2

eta, beta = 0.6, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
```
As we can see, even with the same learning rate that we used before, momentum still converges well. Let's see what happens when we decrease the momentum parameter. Halving it to $\beta = 0.25$ leads to a trajectory that barely converges at all. Nonetheless, it is a lot better than without momentum (when the solution diverges).
```{.python .input}
#@tab all
eta, beta = 0.6, 0.25
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
```
Note that we can combine momentum with stochastic gradient descent and in particular, minibatch stochastic gradient descent. The only change is that in that case we replace the gradients $\mathbf{g}_{t, t-1}$ with $\mathbf{g}_t$. Last, for convenience we initialize $\mathbf{v}_0 = 0$ at time $t=0$.
Recall that $\mathbf{v}_t = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}$. In the limit the terms add up to $\sum_{\tau=0}^\infty \beta^\tau = \frac{1}{1-\beta}$. In other words, rather than taking a step of size $\eta$ in gradient descent or stochastic gradient descent, we take a step of size $\frac{\eta}{1-\beta}$ while at the same time dealing with a potentially much better behaved descent direction. These are two benefits in one. To illustrate how the weighting behaves for different choices of $\beta$, consider the diagram below.
```{.python .input}
#@tab all
d2l.set_figsize()
betas = [0.95, 0.9, 0.6, 0]
x = d2l.numpy(d2l.arange(40))
for beta in betas:
    d2l.plt.plot(x, beta ** x, label=f'beta = {beta:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();
```
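The effective step size of $\frac{\eta}{1-\beta}$ can also be checked directly: if the gradient were held constant at $g$, the velocity would converge to $g / (1-\beta)$. A minimal sketch (plain Python; the values of `beta` and `g` are made up for illustration):

```python
beta, g = 0.9, 1.0  # hypothetical momentum parameter and constant gradient
v = 0.0
for _ in range(200):  # iterate v = beta * v + g until it settles
    v = beta * v + g

print(v, 1 / (1 - beta))  # both print (approximately) 10.0
```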
Let's see how momentum works in practice, i.e., when used within the context of a proper optimizer. For this we need a somewhat more scalable implementation.
Compared with (minibatch) stochastic gradient descent, the momentum method needs to maintain a set of auxiliary variables, i.e., velocity. It has the same shape as the gradients (and variables of the optimization problem). In the implementation below we call these variables `states`.
```{.python .input}
#@tab mxnet,pytorch
def init_momentum_states(feature_dim):
    v_w = d2l.zeros((feature_dim, 1))
    v_b = d2l.zeros(1)
    return (v_w, v_b)
```
```{.python .input}
#@tab tensorflow
def init_momentum_states(feature_dim):
    v_w = tf.Variable(d2l.zeros((feature_dim, 1)))
    v_b = tf.Variable(d2l.zeros(1))
    return (v_w, v_b)
```
```{.python .input}
def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        v[:] = hyperparams['momentum'] * v + p.grad
        p[:] -= hyperparams['lr'] * v
```
```{.python .input}
#@tab pytorch
def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        with torch.no_grad():
            v[:] = hyperparams['momentum'] * v + p.grad
            p[:] -= hyperparams['lr'] * v
        p.grad.data.zero_()
```
```{.python .input}
#@tab tensorflow
def sgd_momentum(params, grads, states, hyperparams):
    for p, v, g in zip(params, states, grads):
        v[:].assign(hyperparams['momentum'] * v + g)
        p[:].assign(p - hyperparams['lr'] * v)
```
Let's see how this works in practice.
```{.python .input}
#@tab all
def train_momentum(lr, momentum, num_epochs=2):
    d2l.train_ch11(sgd_momentum, init_momentum_states(feature_dim),
                   {'lr': lr, 'momentum': momentum}, data_iter,
                   feature_dim, num_epochs)

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
train_momentum(0.02, 0.5)
```
When we increase the momentum hyperparameter `momentum` to 0.9, it amounts to a significantly larger effective sample size of $\frac{1}{1 - 0.9} = 10$. We reduce the learning rate slightly to $0.01$ to keep matters under control.
```{.python .input}
#@tab all
train_momentum(0.01, 0.9)
```
Reducing the learning rate further addresses any issue of non-smooth optimization problems. Setting it to $0.005$ yields good convergence properties.
```{.python .input}
#@tab all
train_momentum(0.005, 0.9)
```
There is very little to do in Gluon since the standard `sgd` solver already has momentum built in. Setting matching parameters yields a very similar trajectory.
```{.python .input}
d2l.train_concise_ch11('sgd', {'learning_rate': 0.005, 'momentum': 0.9},
                       data_iter)
```
```{.python .input}
#@tab pytorch
trainer = torch.optim.SGD
d2l.train_concise_ch11(trainer, {'lr': 0.005, 'momentum': 0.9}, data_iter)
```
```{.python .input}
#@tab tensorflow
trainer = tf.keras.optimizers.SGD
d2l.train_concise_ch11(trainer, {'learning_rate': 0.005, 'momentum': 0.9},
                       data_iter)
```
So far the 2D example of $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ seemed rather contrived. We will now see that this is actually quite representative of the types of problem one might encounter, at least in the case of minimizing convex quadratic objective functions.
Consider the function

$$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b.$$

This is a general quadratic function. For positive definite matrices $\mathbf{Q} \succ 0$, i.e., for matrices with positive eigenvalues, this has a minimizer at $\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}$ with minimum value $b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Hence we can rewrite $h$ as

$$h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}.$$

The gradient is given by $\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) = \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$, i.e., the distance between $\mathbf{x}$ and the minimizer, multiplied by $\mathbf{Q}$. Consequently also the momentum is a linear combination of terms $\mathbf{Q} (\mathbf{x}_t - \mathbf{x}^*)$.

Since $\mathbf{Q}$ is positive definite it can be decomposed into its eigensystem via $\mathbf{Q} = \mathbf{O}^\top \boldsymbol{\Lambda} \mathbf{O}$ for an orthogonal (rotation) matrix $\mathbf{O}$ and a diagonal matrix $\boldsymbol{\Lambda}$ of positive eigenvalues. This allows us to perform a change of variables from $\mathbf{x}$ to $\mathbf{z} := \mathbf{O} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})$ to obtain a much simplified expression:

$$h(\mathbf{z}) = \frac{1}{2} \mathbf{z}^\top \boldsymbol{\Lambda} \mathbf{z} + b'.$$

Here $b' = b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Since $\mathbf{O}$ is only an orthogonal matrix this does not perturb the gradients in a meaningful way. Expressed in terms of $\mathbf{z}$, gradient descent with learning rate $\eta$ becomes
$$\mathbf{z}_t = \mathbf{z}_{t-1} - \eta \boldsymbol{\Lambda} \mathbf{z}_{t-1} = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1}.$$
The important fact in this expression is that gradient descent *does not mix* between different eigenspaces. That is, when expressed in terms of the eigensystem of $\mathbf{Q}$, the optimization problem proceeds in a coordinate-wise manner. This also holds for momentum:
$$\begin{aligned}
\mathbf{v}_t & = \beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}, \\
\mathbf{z}_t & = \mathbf{z}_{t-1} - \eta \left(\beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}\right) \\
& = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1} - \eta \beta \mathbf{v}_{t-1}.
\end{aligned}$$
In doing this we just proved the following theorem: Gradient Descent with and without momentum for a convex quadratic function decomposes into coordinate-wise optimization in the direction of the eigenvectors of the quadratic matrix.
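This decomposition is easy to check numerically. The sketch below (plain NumPy; the matrix $\mathbf{Q}$, vector $\mathbf{c}$, and starting point are made up for illustration) runs gradient descent on $h$ and verifies that in the rotated coordinates $\mathbf{z} = \mathbf{O}(\mathbf{x} + \mathbf{Q}^{-1}\mathbf{c})$ each coordinate is simply scaled by $(1 - \eta \lambda_i)$ at every step.

```python
import numpy as np

# Hypothetical positive definite Q and offset c for illustration
Q = np.array([[2.0, 1.0],
              [1.0, 3.0]])
c = np.array([1.0, -1.0])
eta = 0.1

# Eigendecomposition Q = O^T Lambda O (rows of O are eigenvectors of Q)
lam, vecs = np.linalg.eigh(Q)
O = vecs.T
x_star = -np.linalg.solve(Q, c)  # minimizer x* = -Q^{-1} c

x = np.array([3.0, -2.0])
for _ in range(5):
    z = O @ (x - x_star)
    x = x - eta * (Q @ x + c)  # one gradient descent step on h
    z_new = O @ (x - x_star)
    # Each coordinate of z shrinks independently by (1 - eta * lambda_i)
    print(np.allclose(z_new, (1 - eta * lam) * z))  # True
```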
Given the above result, let's see what happens when we minimize the function $f(x) = \frac{\lambda}{2} x^2$ for some $\lambda > 0$. For gradient descent we have

$$x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t.$$

Whenever $|1 - \eta \lambda| < 1$ this optimization converges at an exponential rate since after $t$ steps we have $x_t = (1 - \eta \lambda)^t x_0$. This shows how the rate of convergence improves initially as we increase the learning rate $\eta$, until $\eta \lambda = 1$. Beyond that the iterates oscillate in sign, and for $\eta \lambda > 2$ the optimization problem diverges.
```{.python .input}
#@tab all
lambdas = [0.1, 1, 10, 19]
eta = 0.1
d2l.set_figsize((6, 4))
t = d2l.numpy(d2l.arange(20))
for lam in lambdas:
    d2l.plt.plot(t, (1 - eta * lam) ** t, label=f'lambda = {lam:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();
```
To analyze convergence in the case of momentum we begin by rewriting the update equations in terms of two scalars: one for $x$ and one for the momentum $v$. This yields:

$$\begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} =
\begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix}
\begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}
= \mathbf{R}(\beta, \eta, \lambda)
\begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}.$$

We used $\mathbf{R}$ to denote the $2 \times 2$ matrix governing convergence behavior. After $t$ steps the initial choice $[v_0, x_0]$ becomes $\mathbf{R}(\beta, \eta, \lambda)^t [v_0, x_0]$. Hence, it is up to the eigenvalues of $\mathbf{R}$ to determine the speed of convergence. See :cite:`Goh.2017` for a great animation and :cite:`Flammarion.Bach.2015` for a detailed analysis. One can show that momentum converges whenever $0 < \eta \lambda < 2 + 2 \beta$. This is a larger range of feasible parameters than the $0 < \eta \lambda < 2$ of gradient descent. It also suggests that in general large values of $\beta$ are desirable. Further details require a fair amount of technical detail and we suggest that the interested reader consult the original publications.
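The condition $0 < \eta \lambda < 2 + 2\beta$ can be probed numerically by checking when the spectral radius of $\mathbf{R}$ crosses $1$. A minimal sketch (plain NumPy; the parameter values are made up for illustration):

```python
import numpy as np

def spectral_radius(beta, eta, lam):
    # Largest eigenvalue magnitude of the 2x2 update matrix R(beta, eta, lambda)
    R = np.array([[beta, lam],
                  [-eta * beta, 1 - eta * lam]])
    return np.abs(np.linalg.eigvals(R)).max()

lam = 1.0
for beta in [0.0, 0.5, 0.9]:
    bound = 2 + 2 * beta  # predicted stability boundary for eta * lambda
    for eta in [bound - 0.1, bound + 0.1]:
        rho = spectral_radius(beta, eta, lam)
        print(f'beta={beta}, eta*lam={eta * lam:.1f}, rho={rho:.3f}, '
              f'converges={rho < 1}')
```

Just inside the boundary the spectral radius stays below $1$ (convergence); just outside it exceeds $1$ (divergence), matching the claim above.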
## Summary

- Momentum replaces gradients with a leaky average over past gradients. This accelerates convergence significantly.
- It is desirable for both noise-free gradient descent and (noisy) stochastic gradient descent.
- Momentum prevents stalling of the optimization process that is much more likely to occur for stochastic gradient descent.
- The effective number of gradients is given by $\frac{1}{1-\beta}$ due to exponentiated downweighting of past data.
- In the case of convex quadratic problems this can be analyzed explicitly in detail.
- Implementation is quite straightforward but it requires us to store an additional state vector (momentum $\mathbf{v}$).
## Exercises

- Use other combinations of momentum hyperparameters and learning rates and observe and analyze the different experimental results.
- Try out gradient descent and momentum for a quadratic problem where you have multiple eigenvalues, i.e., $f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2$, e.g., $\lambda_i = 2^{-i}$. Plot how the values of $x$ decrease for the initialization $x_i = 1$.
- Derive the minimum value and minimizer for $h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b$.
- What changes when we perform stochastic gradient descent with momentum? What happens when we use minibatch stochastic gradient descent with momentum? Experiment with the parameters.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: