:label:sec_adagrad
Let's begin by considering learning problems with features that occur infrequently.
Imagine that we are training a language model. To get good accuracy we typically want to decrease the learning rate as we keep on training, usually at a rate of $\mathcal{O}(t^{-\frac{1}{2}})$ or slower.
Parameters associated with infrequent features only receive meaningful updates whenever these features occur. Given a decreasing learning rate we might end up in a situation where the parameters for common features converge rather quickly to their optimal values, whereas for infrequent features we are still short of observing them sufficiently frequently before their optimal values can be determined. In other words, the learning rate either decreases too slowly for frequent features or too quickly for infrequent ones.
A possible hack to redress this issue would be to count the number of times we see a particular feature and to use this as a clock for adjusting learning rates. That is, rather than choosing a learning rate of the form $\eta = \frac{\eta_0}{\sqrt{t + c}}$ we could use $\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$, where $s(i, t)$ counts the number of nonzero entries for feature $i$ that we have observed up to time $t$. This is quite easy to implement at no meaningful overhead. However, it fails whenever we do not have true sparsity but rather just data where the gradients are often very small and only rarely large, since it is unclear where to draw the line between an observed and an unobserved feature.
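As a rough sketch (purely illustrative, with made-up feature counts and constants that are not part of the original text), such a count-based schedule could look like this:

eta0, c = 0.1, 1.0
counts = {0: 1000, 1: 3}  # hypothetical: feature 0 is frequent, feature 1 is rare
for i, s in counts.items():
    # eta_i = eta0 / sqrt(s(i, t) + c): rarely seen features keep a large learning rate
    eta_i = eta0 / (s + c) ** 0.5
    print(f'feature {i}: per-feature learning rate {eta_i:.4f}')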
Adagrad by :cite:Duchi.Hazan.Singer.2011 addresses this by replacing the rather crude counter $s(i, t)$ with an aggregate of the squares of previously observed gradients. In particular, it uses $s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$ as a means to adjust the learning rate. This has two benefits: we no longer need to decide when a gradient is large enough to count as an observation, and the adjustment scales automatically with the magnitude of the gradients. Coordinates that routinely correspond to large gradients are scaled down significantly, whereas others with small gradients receive a much gentler treatment.
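In code, the change relative to the counter is tiny. The following is a minimal sketch of the accumulation step only (my own illustration with a fabricated gradient stream, not the book's implementation; the full per-parameter update appears later in this section):

import numpy as np

s = np.zeros(2)  # one accumulator per coordinate
# Hypothetical gradient stream: coordinate 1 is active only in the last step
grads = [np.array([0.5, 0.0]), np.array([0.4, 0.0]), np.array([0.6, 0.5])]
eta0, c = 0.1, 1e-6
for g in grads:
    s += g ** 2                  # s(i, t+1) = s(i, t) + (partial_i f)^2
    eta = eta0 / np.sqrt(s + c)  # per-coordinate learning rates
print(eta)                       # the rarely active coordinate retains the larger learning rate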
Convex optimization problems are good for analyzing the characteristics of algorithms. After all, for most nonconvex problems it is difficult to derive meaningful theoretical guarantees, but intuition and insight often carry over. Let's look at the problem of minimizing $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{c}^\top \mathbf{x} + b$ for a symmetric positive definite matrix $\mathbf{Q}$.
As we saw in :numref:sec_momentum, it is possible to rewrite this problem in terms of its eigendecomposition $\mathbf{Q} = \mathbf{U}^\top \boldsymbol{\Lambda} \mathbf{U}$ to arrive at a much simplified problem where each coordinate can be solved individually:

$$f(\mathbf{x}) = \bar{f}(\bar{\mathbf{x}}) = \frac{1}{2} \bar{\mathbf{x}}^\top \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}}^\top \bar{\mathbf{x}} + b.$$

Here we used $\bar{\mathbf{x}} = \mathbf{U} \mathbf{x}$ and consequently $\bar{\mathbf{c}} = \mathbf{U} \mathbf{c}$. The modified problem has as its minimizer $\bar{\mathbf{x}} = -\boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}}$ and minimum value $-\frac{1}{2} \bar{\mathbf{c}}^\top \boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}} + b$. This is much easier to compute since $\boldsymbol{\Lambda}$ is a diagonal matrix containing the eigenvalues of $\mathbf{Q}$.
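As a quick numerical sanity check (an illustrative sketch with arbitrary values for $\mathbf{Q}$ and $\mathbf{c}$, not part of the original text), we can solve the decoupled problem coordinate-wise and verify that rotating back recovers the minimizer of $f$:

import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, -1.0])
lam, V = np.linalg.eigh(Q)     # columns of V are eigenvectors; V.T plays the role of U above
c_bar = V.T @ c                # coefficients in the rotated coordinate system
x_bar_star = -c_bar / lam      # coordinate-wise minimizer of the decoupled problem
x_star = V @ x_bar_star        # rotate back to the original coordinates
print(np.allclose(x_star, -np.linalg.solve(Q, c)))  # True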
If we perturb $\mathbf{c}$ slightly we would hope to find only slight changes in the minimizer of $f$. Unfortunately this is not the case. While slight changes in $\mathbf{c}$ lead to equally slight changes in $\bar{\mathbf{c}}$, this does not hold for the minimizer of $f$ (and of $\bar{f}$ respectively). Whenever the eigenvalues $\boldsymbol{\Lambda}_i$ are large we will see only small changes in $\bar{x}_i$ and in the minimum of $\bar{f}$. Conversely, for small $\boldsymbol{\Lambda}_i$ changes in $\bar{x}_i$ can be dramatic.
If the condition number $\kappa = \frac{\boldsymbol{\Lambda}_1}{\boldsymbol{\Lambda}_d}$, i.e., the ratio between the largest and the smallest eigenvalue, is large, it is difficult to solve the optimization problem accurately. In theory we could simply "fix" the problem by rescaling the space such that all eigenvalues become $1$, but computing the eigenvalues and eigenvectors needed for this is in general much more expensive than solving the original problem.
While computing eigenvalues exactly might be expensive, guessing them and computing them even somewhat approximately may already be a lot better than not doing anything at all. In particular, we could use the diagonal entries of $\mathbf{Q}$ and rescale it accordingly. In this case we have $\tilde{\mathbf{Q}}_{ij} = \mathbf{Q}_{ij} / \sqrt{\mathbf{Q}_{ii} \mathbf{Q}_{jj}}$ and specifically $\tilde{\mathbf{Q}}_{ii} = 1$ for all $i$. In most cases this simplifies the condition number considerably.
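A small sketch (with illustrative numbers of my own choosing) shows how much this diagonal rescaling can improve the condition number:

import numpy as np

Q = np.array([[100.5, 0.5], [0.5, 1.5]])           # ill-conditioned, symmetric positive definite
d = 1.0 / np.sqrt(np.diag(Q))
Q_tilde = d[:, None] * Q * d[None, :]              # Q~_ij = Q_ij / sqrt(Q_ii * Q_jj)
print(np.linalg.cond(Q), np.linalg.cond(Q_tilde))  # roughly 67 vs. close to 1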
Unfortunately we face yet another problem: in deep learning we typically do not even have access to the second derivative of the objective function. For $\mathbf{x} \in \mathbb{R}^d$ the second derivative even on a minibatch may require $\mathcal{O}(d^2)$ space and work to compute, making it practically infeasible. The ingenious idea of Adagrad is to use a proxy for that elusive diagonal of the Hessian that is both relatively cheap to compute and effective: the magnitude of the gradient itself.
In order to see why this works, let's look at $\bar{f}(\bar{\mathbf{x}})$. We have that

$$\partial_{\bar{\mathbf{x}}} \bar{f}(\bar{\mathbf{x}}) = \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}} = \boldsymbol{\Lambda} \left(\bar{\mathbf{x}} - \bar{\mathbf{x}}_0\right),$$

where $\bar{\mathbf{x}}_0$ is the minimizer of $\bar{f}$. Hence the magnitude of the gradient depends both on $\boldsymbol{\Lambda}$ and on the distance from optimality. Since Adagrad is a stochastic gradient descent algorithm, we will see gradients with nonzero variance even at optimality, so we can use the variance of the gradients as a cheap proxy for the scale of the Hessian. A thorough analysis is beyond the scope of this section; we refer the reader to :cite:Duchi.Hazan.Singer.2011 for details.
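The following sketch (my own toy example on a decoupled quadratic with two very different curvatures, not from the original text) illustrates the effect: dividing by the accumulated gradient magnitudes makes both coordinates progress at essentially the same pace, independent of their curvature.

import numpy as np

lam = np.array([10.0, 0.1])   # very different curvatures per coordinate
x = np.array([1.0, 1.0])      # both coordinates start equally far from the optimum at 0
s, eta, eps = np.zeros(2), 0.5, 1e-6
for _ in range(50):
    g = lam * x               # gradient of 0.5 * lam_i * x_i^2, coordinate-wise
    s += g ** 2
    x -= eta / np.sqrt(s + eps) * g
print(x)                      # both coordinates have made nearly identical progress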
Let's formalize the discussion from above. We use the variable $\mathbf{s}_t$ to accumulate past gradient variance as follows.
$$\begin{aligned}
    \mathbf{g}_t & = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w})), \\
    \mathbf{s}_t & = \mathbf{s}_{t-1} + \mathbf{g}_t^2, \\
    \mathbf{w}_t & = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t.
\end{aligned}$$
Here the operations are applied coordinate-wise. That is, $\mathbf{v}^2$ has entries $v_i^2$, $\frac{1}{\sqrt{\mathbf{v}}}$ has entries $\frac{1}{\sqrt{v_i}}$, and $\mathbf{u} \cdot \mathbf{v}$ has entries $u_i v_i$. As before, $\eta$ is the learning rate and $\epsilon$ is an additive constant that ensures that we do not divide by $0$. Last, we initialize $\mathbf{s}_0 = \mathbf{0}$.
Just like in the case of momentum we need to keep track of an auxiliary variable, in this case to allow for an individual learning rate per coordinate. This does not increase the cost of Adagrad significantly relative to SGD, simply since the main cost is typically to compute $l(y_t, f(\mathbf{x}_t, \mathbf{w}))$ and its derivative.
Note that accumulating squared gradients in $\mathbf{s}_t$ means that $\mathbf{s}_t$ grows essentially at a linear rate (somewhat more slowly in practice, since the gradients initially diminish). This leads to an $\mathcal{O}(t^{-\frac{1}{2}})$ effective learning rate, albeit adjusted on a per-coordinate basis. For convex problems this is perfectly adequate. In deep learning, though, we might want to decrease the learning rate rather more slowly; this led to a number of Adagrad variants that we will discuss in subsequent sections.
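To make the decay rate concrete, here is a tiny sketch (with a fabricated, roughly constant gradient magnitude) showing that the effective step size $\eta / \sqrt{\mathbf{s}_t}$ falls off like $t^{-\frac{1}{2}}$:

import math

eta, s = 0.4, 0.0
for t in range(1, 1001):
    g = 1.0                           # pretend the gradient magnitude stays around 1
    s += g ** 2                       # s_t then grows linearly in t
    if t in (1, 10, 100, 1000):
        print(t, eta / math.sqrt(s))  # approximately eta / sqrt(t)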
For now, let's see how Adagrad behaves in a quadratic convex problem. We implement it for the same test function $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ and the same learning rate as previously, i.e., $\eta = 0.4$. As we can see, the iterative trajectory of the independent variable is smoother. However, due to the cumulative effect of $\mathbf{s}_t$, the learning rate continuously decays, so the independent variable does not move as much during later stages of iteration.
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import math
import torch
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import math
import tensorflow as tf
#@tab all
def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    # Gradient of f(x1, x2) = 0.1 * x1 ** 2 + 2 * x2 ** 2
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate squared gradients per coordinate
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Update with the per-coordinate learning rate eta / sqrt(s + eps)
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
As we increase the learning rate to $2$ we see much better behavior. This already indicates that the decrease in the learning rate might be rather aggressive, even in the noise-free case, and we need to ensure that parameters converge appropriately.
#@tab all
eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
Just like the momentum method, Adagrad needs to maintain a state variable of the same shape as the parameters.
def init_adagrad_states(feature_dim):
    s_w = d2l.zeros((feature_dim, 1))
    s_b = d2l.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        s[:] += np.square(p.grad)
        p[:] -= hyperparams['lr'] * p.grad / np.sqrt(s + eps)
#@tab pytorch
def init_adagrad_states(feature_dim):
    s_w = d2l.zeros((feature_dim, 1))
    s_b = d2l.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] += torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()
#@tab tensorflow
def init_adagrad_states(feature_dim):
    s_w = tf.Variable(d2l.zeros((feature_dim, 1)))
    s_b = tf.Variable(d2l.zeros(1))
    return (s_w, s_b)

def adagrad(params, grads, states, hyperparams):
    eps = 1e-6
    for p, s, g in zip(params, states, grads):
        s[:].assign(s + tf.math.square(g))
        p[:].assign(p - hyperparams['lr'] * g / tf.math.sqrt(s + eps))
Compared to the experiment in :numref:sec_minibatch_sgd, we use a larger learning rate to train the model.
#@tab all
data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adagrad, init_adagrad_states(feature_dim),
               {'lr': 0.1}, data_iter, feature_dim);
Using the `Trainer` instance of the algorithm `adagrad`, we can invoke the Adagrad algorithm in Gluon.
d2l.train_concise_ch11('adagrad', {'learning_rate': 0.1}, data_iter)
#@tab pytorch
trainer = torch.optim.Adagrad
d2l.train_concise_ch11(trainer, {'lr': 0.1}, data_iter)
#@tab tensorflow
trainer = tf.keras.optimizers.Adagrad
d2l.train_concise_ch11(trainer, {'learning_rate' : 0.1}, data_iter)
- Adagrad decreases the learning rate dynamically on a per-coordinate basis.
- It uses the magnitude of the gradient as a means of adjusting how quickly progress is achieved: coordinates with large gradients are compensated with a smaller learning rate.
- Computing the exact second derivative is typically infeasible in deep learning problems due to memory and computational constraints. The gradient can be a useful proxy.
- If the optimization problem has a rather uneven structure Adagrad can help mitigate the distortion.
- Adagrad is particularly effective for sparse features where the learning rate needs to decrease more slowly for infrequently occurring terms.
- On deep learning problems Adagrad can sometimes be too aggressive in reducing learning rates. We will discuss strategies for mitigating this in the context of :numref:sec_adam.
- Prove that for an orthogonal matrix $\mathbf{U}$ and a vector $\mathbf{c}$ the following holds: $\|\mathbf{c} - \boldsymbol{\delta}\|_2 = \|\mathbf{U} \mathbf{c} - \mathbf{U} \boldsymbol{\delta}\|_2$. Why does this mean that the magnitude of perturbations does not change after an orthogonal change of variables?
- Try out Adagrad for $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ and also for the objective function rotated by 45 degrees, i.e., $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. Does it behave differently?
- Prove Gerschgorin's circle theorem, which states that eigenvalues $\lambda_i$ of a matrix $\mathbf{M}$ satisfy $|\lambda_i - \mathbf{M}_{jj}| \leq \sum_{k \neq j} |\mathbf{M}_{jk}|$ for at least one choice of $j$.
- What does Gerschgorin's theorem tell us about the eigenvalues of the diagonally preconditioned matrix $\mathrm{diag}^{-\frac{1}{2}}(\mathbf{M}) \mathbf{M} \mathrm{diag}^{-\frac{1}{2}}(\mathbf{M})$?
- Try out Adagrad for a proper deep network, such as :numref:sec_lenet, when applied to Fashion-MNIST.
- How would you need to modify Adagrad to achieve a less aggressive decay in learning rate?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: