In this section, we will discuss the relationship between optimization and deep learning as well as the challenges of using optimization in deep learning. For a deep learning problem, we will usually define a loss function first. Once we have the loss function, we can use an optimization algorithm in attempt to minimize the loss. In optimization, a loss function is often referred to as the objective function of the optimization problem. By tradition and convention most optimization algorithms are concerned with minimization. If we ever need to maximize an objective there is a simple solution: just flip the sign on the objective.
Although optimization provides a way to minimize the loss function for deep
learning, in essence, the goals of optimization and deep learning are
fundamentally different.
The former is primarily concerned with minimizing an
objective whereas the latter is concerned with finding a suitable model, given a
finite amount of data.
In :numref:sec_model_selection
,
we discussed the difference between these two goals in detail.
For instance,
training error and generalization error generally differ: since the objective
function of the optimization algorithm is usually a loss function based on the
training dataset, the goal of optimization is to reduce the training error.
However, the goal of deep learning (or more broadly, statistical inference) is to
reduce the generalization error.
To accomplish the latter we need to pay
attention to overfitting in addition to using the optimization algorithm to
reduce the training error.
%matplotlib inline
from d2l import mxnet as d2l
from mpl_toolkits import mplot3d
from mxnet import np, npx
npx.set_np()
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import numpy as np
from mpl_toolkits import mplot3d
import torch
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import numpy as np
from mpl_toolkits import mplot3d
import tensorflow as tf
To illustrate the aforementioned different goals,
let us consider
the empirical risk and the risk.
As described
in :numref:subsec_empirical-risk-and-risk
,
the empirical risk
is an average loss
on the training dataset
while the risk is the expected loss
on the entire population of data.
Below we define two functions:
the risk function f
and the empirical risk function g
.
Suppose that we have only a finite amount of training data.
As a result, here g
is less smooth than f
.
#@tab all
def f(x):
return x * d2l.cos(np.pi * x)
def g(x):
return f(x) + 0.2 * d2l.cos(5 * np.pi * x)
The graph below illustrates that the minimum of the empirical risk on a training dataset may be at a different location from the minimum of the risk (generalization error).
#@tab all
def annotate(text, xy, xytext): #@save
d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
arrowprops=dict(arrowstyle='->'))
x = d2l.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))
In this chapter, we are going to focus specifically on the performance of optimization algorithms in minimizing the objective function, rather than a
model's generalization error.
In :numref:sec_linear_regression
we distinguished between analytical solutions and numerical solutions in
optimization problems.
In deep learning, most objective functions are
complicated and do not have analytical solutions. Instead, we must use numerical
optimization algorithms.
The optimization algorithms in this chapter
all fall into this
category.
There are many challenges in deep learning optimization. Some of the most vexing ones are local minima, saddle points, and vanishing gradients. Let us have a look at them.
For any objective function
For example, given the function
we can approximate the local minimum and global minimum of this function.
#@tab all
x = d2l.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))
The objective function of deep learning models usually has many local optima. When the numerical solution of an optimization problem is near the local optimum, the numerical solution obtained by the final iteration may only minimize the objective function locally, rather than globally, as the gradient of the objective function's solutions approaches or becomes zero. Only some degree of noise might knock the parameter out of the local minimum. In fact, this is one of the beneficial properties of minibatch stochastic gradient descent where the natural variation of gradients over minibatches is able to dislodge the parameters from local minima.
Besides local minima, saddle points are another reason for gradients to vanish. A saddle point is any location where all gradients of a function vanish but which is neither a global nor a local minimum.
Consider the function
#@tab all
x = d2l.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))
Saddle points in higher dimensions are even more insidious, as the example below shows. Consider the function
#@tab all
x, y = d2l.meshgrid(
d2l.linspace(-1.0, 1.0, 101), d2l.linspace(-1.0, 1.0, 101))
z = x**2 - y**2
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');
We assume that the input of a function is a
- When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all positive, we have a local minimum for the function.
- When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all negative, we have a local maximum for the function.
- When the eigenvalues of the function's Hessian matrix at the zero-gradient position are negative and positive, we have a saddle point for the function.
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative is quite high. This makes saddle points more likely than local minima. We will discuss some exceptions to this situation in the next section when introducing convexity. In short, convex functions are those where the eigenvalues of the Hessian are never negative. Sadly, though, most deep learning problems do not fall into this category. Nonetheless it is a great tool to study optimization algorithms.
Probably the most insidious problem to encounter is the vanishing gradient.
Recall our commonly-used activation functions and their derivatives in :numref:subsec_activation-functions
.
For instance, assume that we want to minimize the function
#@tab all
x = d2l.arange(-2.0, 5.0, 0.01)
d2l.plot(x, [d2l.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust range of algorithms that perform well and that are easy to use even for beginners. Furthermore, it is not really necessary to find the best solution. Local optima or even approximate solutions thereof are still very useful.
- Minimizing the training error does not guarantee that we find the best set of parameters to minimize the generalization error.
- The optimization problems may have many local minima.
- The problem may have even more saddle points, as generally the problems are not convex.
- Vanishing gradients can cause optimization to stall. Often a reparameterization of the problem helps. Good initialization of the parameters can be beneficial, too.
- Consider a simple MLP with a single hidden layer of, say,
$d$ dimensions in the hidden layer and a single output. Show that for any local minimum there are at least$d!$ equivalent solutions that behave identically. - Assume that we have a symmetric random matrix
$\mathbf{M}$ where the entries$M_{ij} = M_{ji}$ are each drawn from some probability distribution$p_{ij}$ . Furthermore assume that$p_{ij}(x) = p_{ij}(-x)$ , i.e., that the distribution is symmetric (see e.g., :cite:Wigner.1958
for details).- Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvector
$\mathbf{v}$ the probability that the associated eigenvalue$\lambda$ satisfies$P(\lambda > 0) = P(\lambda < 0)$ . - Why does the above not imply
$P(\lambda > 0) = 0.5$ ?
- Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvector
- What other challenges involved in deep learning optimization can you think of?
- Assume that you want to balance a (real) ball on a (real) saddle.
- Why is this hard?
- Can you exploit this effect also for optimization algorithms?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: