:label:sec_adadelta
Adadelta is yet another variant of AdaGrad (:numref:sec_adagrad). The main difference lies in the fact that it decreases the amount by which the learning rate is adaptive to coordinates. Moreover, it is traditionally referred to as not having a learning rate, since it uses the amount of change itself as calibration for future change. The algorithm was proposed in :cite:Zeiler.2012. It is fairly straightforward, given the discussion of previous algorithms so far.

In a nutshell, Adadelta uses two state variables: $\mathbf{s}_t$ to store a leaky average of the second moment of the gradient, and $\Delta\mathbf{x}_t$ to store a leaky average of the second moment of the change of the parameters in the model itself.
Here are the technical details of Adadelta. Given the parameter du jour is $\rho$, we obtain the following leaky updates similarly to :numref:sec_rmsprop:
$$\begin{aligned}
    \mathbf{s}_t & = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2.
\end{aligned}$$
The difference to :numref:sec_rmsprop is that we perform updates with the rescaled gradient $\mathbf{g}_t'$, i.e.,
$$\begin{aligned}
    \mathbf{x}_t & = \mathbf{x}_{t-1} - \mathbf{g}_t'.
\end{aligned}$$
So what is the rescaled gradient $\mathbf{g}_t'$? We can compute it as follows:
$$\begin{aligned}
    \mathbf{g}_t' & = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t,
\end{aligned}$$
where $\Delta \mathbf{x}_{t-1}$ is the leaky average of the squared rescaled gradients $\mathbf{g}_t'$. We initialize $\Delta \mathbf{x}_{0}$ to be $0$ and update it at each step with $\mathbf{g}_t'$, i.e.,
$$\begin{aligned}
    \Delta \mathbf{x}_t & = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2,
\end{aligned}$$
and $\epsilon$ (a small value such as $10^{-5}$) is added to maintain numerical stability.
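Before turning to the full implementation, here is a minimal standalone NumPy sketch (not part of the book's code; the quadratic toy objective, the initial value, and the number of steps are arbitrary choices for illustration) that traces the recurrences above on the one-dimensional objective $f(x) = \frac{1}{2} x^2$, whose gradient at $x$ is simply $x$:

```python
import numpy as np

# Hypothetical toy problem: minimize f(x) = 0.5 * x**2, whose gradient at x is x.
# Note that no learning rate appears anywhere below.
rho, eps = 0.9, 1e-5
x = 1.0              # the parameter being optimized
s, delta = 0.0, 0.0  # the two state variables, s_t and Delta x_t
for _ in range(50):
    g = x                                                  # gradient of f at x
    s = rho * s + (1 - rho) * g ** 2                       # leaky average of g_t^2
    g_prime = np.sqrt(delta + eps) / np.sqrt(s + eps) * g  # rescaled gradient g_t'
    x -= g_prime                                           # x_t = x_{t-1} - g_t'
    delta = rho * delta + (1 - rho) * g_prime ** 2         # leaky average of (g_t')^2
print(x)  # x has moved from 1.0 toward the minimum at 0; progress is slow at
          # first because Delta x_0 = 0 keeps the early steps on the order of sqrt(eps)
```

Note how the step size is bootstrapped from $\epsilon$ and then grows as $\Delta\mathbf{x}_t$ accumulates squared updates; at no point does a learning rate have to be specified.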
Adadelta needs to maintain two state variables for each variable, $\mathbf{s}_t$ and $\Delta\mathbf{x}_t$. This yields the following implementation.
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()
def init_adadelta_states(feature_dim):
    s_w, s_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    delta_w, delta_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        # In-place updates via [:]
        s[:] = rho * s + (1 - rho) * np.square(p.grad)
        g = (np.sqrt(delta + eps) / np.sqrt(s + eps)) * p.grad
        p[:] -= g
        delta[:] = rho * delta + (1 - rho) * g * g
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch
def init_adadelta_states(feature_dim):
    s_w, s_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    delta_w, delta_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.data.zero_()
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf
def init_adadelta_states(feature_dim):
    s_w = tf.Variable(d2l.zeros((feature_dim, 1)))
    s_b = tf.Variable(d2l.zeros(1))
    delta_w = tf.Variable(d2l.zeros((feature_dim, 1)))
    delta_b = tf.Variable(d2l.zeros(1))
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, grads, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta), grad in zip(params, states, grads):
        s[:].assign(rho * s + (1 - rho) * tf.math.square(grad))
        g = (tf.math.sqrt(delta + eps) / tf.math.sqrt(s + eps)) * grad
        p[:].assign(p - g)
        delta[:].assign(rho * delta + (1 - rho) * g * g)
Choosing $\rho = 0.9$ means that the leaky averages effectively extend over roughly the last 10 updates. This tends to work quite well in practice. We get the following behavior.
#@tab all
data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);
For a concise implementation we simply use the adadelta
algorithm from the Trainer
class. This yields the following one-liner for a much more compact invocation.
d2l.train_concise_ch11('adadelta', {'rho': 0.9}, data_iter)
#@tab pytorch
trainer = torch.optim.Adadelta
d2l.train_concise_ch11(trainer, {'rho': 0.9}, data_iter)
#@tab tensorflow
# Adadelta does not converge at the default learning rate,
# but it does converge at learning_rate = 5.0
trainer = tf.keras.optimizers.Adadelta
d2l.train_concise_ch11(trainer, {'learning_rate':5.0, 'rho': 0.9}, data_iter)
Summary:

- Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parameters itself to adapt the learning rate (see the sketch after this list).
- Adadelta requires two state variables to store the second moments of gradient and the change in parameters.
- Adadelta uses leaky averages to keep a running estimate of the appropriate statistics.
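One consequence of this self-calibration is that rescaling all gradients by a constant leaves the Adadelta updates essentially unchanged, whereas plain SGD updates would scale by the same constant unless the learning rate were retuned. The following standalone NumPy sketch (not from the book; the gradient stream and the factor of 1000 are arbitrary choices) feeds the same sequence of scalar gradients through the recurrences of this section twice, once as-is and once multiplied by 1000; the tiny remaining differences stem from the $\epsilon$ terms alone.

```python
import numpy as np

def adadelta_scalar_updates(grads, rho=0.9, eps=1e-5):
    """Run the Adadelta recurrences on a stream of scalar gradients and
    return the sequence of rescaled updates g'_t."""
    s, delta, updates = 0.0, 0.0, []
    for g in grads:
        s = rho * s + (1 - rho) * g ** 2
        g_prime = np.sqrt(delta + eps) / np.sqrt(s + eps) * g
        delta = rho * delta + (1 - rho) * g_prime ** 2
        updates.append(g_prime)
    return np.array(updates)

grads = np.cos(np.arange(1, 31))   # an arbitrary stream of scalar gradients
diff = adadelta_scalar_updates(grads) - adadelta_scalar_updates(1000 * grads)
print(np.abs(diff).max())          # tiny: the updates are essentially unchanged
```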
Exercises:

- Adjust the value of $\rho$. What happens?
- Show how to implement the algorithm without the use of $\mathbf{g}_t'$. Why might this be a good idea?
- Is Adadelta really learning rate free? Could you find optimization problems that break Adadelta?
- Compare Adadelta to Adagrad and RMSProp to discuss their convergence behavior.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: