diff --git a/Ch12_Optimization_Algorithms/Adadelta.ipynb b/Ch12_Optimization_Algorithms/Adadelta.ipynb new file mode 100644 index 00000000..3f966c44 --- /dev/null +++ b/Ch12_Optimization_Algorithms/Adadelta.ipynb @@ -0,0 +1,170 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Adadelta" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In addition to RMSProp, Adadelta is another common optimization algorithm that\n", + "helps improve the chances of finding useful solutions at later stages of\n", + "iteration, which is difficult to do when using the Adagrad algorithm for the\n", + "same purpose :cite:`Zeiler.2012`. Notably, the Adadelta algorithm has no learning rate\n", + "hyperparameter." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Algorithm\n", + "\n", + "Like RMSProp, the Adadelta algorithm uses the variable $\\boldsymbol{s}_t$, an exponentially weighted moving average (EWMA) of the squared elements of the mini-batch stochastic gradient $\\boldsymbol{g}_t$. At time step 0, all of its elements are initialized to 0.\n", + "Given the hyperparameter $0 \\leq \\rho < 1$ (the counterpart of $\\gamma$ in RMSProp), at time step $t>0$, we compute $\\boldsymbol{s}_t$ in the same way as RMSProp:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "$$\\boldsymbol{s}_t \\leftarrow \\rho \\boldsymbol{s}_{t-1} + (1 - \\rho) \\boldsymbol{g}_t \\odot \\boldsymbol{g}_t. $$\n", + "\n", + "Unlike RMSProp, Adadelta maintains an additional state variable, $\\Delta\\boldsymbol{x}_t$, the elements of which are also initialized to 0 at time step 0. 
We use $\\Delta\\boldsymbol{x}_{t-1}$ to compute the variation of the independent variable:\n", + "\n", + "$$ \\boldsymbol{g}_t' \\leftarrow \\sqrt{\\frac{\\Delta\\boldsymbol{x}_{t-1} + \\epsilon}{\\boldsymbol{s}_t + \\epsilon}} \\odot \\boldsymbol{g}_t. $$\n", + "\n", + "Here, $\\epsilon$ is a small constant, such as $10^{-5}$, added to maintain numerical stability. Next, we update the independent variable:\n", + "\n", + "$$\\boldsymbol{x}_t \\leftarrow \\boldsymbol{x}_{t-1} - \\boldsymbol{g}'_t. $$\n", + "\n", + "Finally, we use $\\Delta\\boldsymbol{x}_t$ to record the EWMA of the squared elements of $\\boldsymbol{g}'_t$, the variation of the independent variable:\n", + "\n", + "$$\\Delta\\boldsymbol{x}_t \\leftarrow \\rho \\Delta\\boldsymbol{x}_{t-1} + (1 - \\rho) \\boldsymbol{g}'_t \\odot \\boldsymbol{g}'_t. $$\n", + "\n", + "As we can see, if the impact of $\\epsilon$ is not considered, Adadelta differs from RMSProp in that it replaces the learning rate hyperparameter $\\eta$ with $\\sqrt{\\Delta\\boldsymbol{x}_{t-1}}$.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementation from Scratch\n", + "\n", + "Adadelta needs to maintain two state variables for each independent variable, $\\boldsymbol{s}_t$ and $\\Delta\\boldsymbol{x}_t$. We implement Adadelta using the formulas from the algorithm above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "import d2l\n", + "from d2l import load_array\n", + "\n", + "def init_adadelta_states(feature_dim):\n", + "    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)\n", + "    delta_w, delta_b = torch.zeros((feature_dim, 1)), torch.zeros(1)\n", + "    return ((s_w, delta_w), (s_b, delta_b))\n", + "\n", + "def adadelta(params, states, hyperparams):\n", + "    rho, eps = hyperparams['rho'], 1e-5\n", + "    for p, (s, delta) in zip(params, states):\n", + "        with torch.no_grad():\n", + "            # EWMA of the squared gradient (p.grad, not the parameter p)\n", + "            s[:] = rho * s + (1 - rho) * p.grad * p.grad\n", + "            # Rescaled gradient: the variation of the independent variable\n", + "            g = ((delta + eps).sqrt() / (s + eps).sqrt()) * p.grad\n", + "            p[:] -= g\n", + "            # EWMA of the squared variation\n", + "            delta[:] = rho * delta + (1 - rho) * g * g\n", + "        p.grad.zero_()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, we train the model with the hyperparameter $\\rho=0.9$." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "loss: 0.259, 0.075 sec/epoch\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "data_iter, feature_dim = d2l.get_data_ch10(batch_size=10)\n", + "d2l.train_ch10(torch.optim.Adadelta, {'rho': 0.9}, data_iter, feature_dim);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "* Adadelta has no learning rate hyperparameter, it uses an EWMA on the squares of elements in the variation of the independent variable to replace the learning rate.\n", + "\n", + "## Exercises\n", + "\n", + "* Adjust the value of $\\rho$ and observe the experimental results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}