add opt first 4 sections

muc-education · Jul 26, 2021 · 32483d7
1 parent 888e837
commit 32483d7

Showing 10 changed files with 2,161 additions and 83 deletions.
diff --git a/chapter_optimization/convexity.md b/chapter_optimization/convexity.md
diff --git a/chapter_optimization/convexity_origin.md b/chapter_optimization/convexity_origin.md
diff --git a/chapter_optimization/gd.md b/chapter_optimization/gd.md
diff --git a/chapter_optimization/gd_origin.md b/chapter_optimization/gd_origin.md
diff --git a/chapter_optimization/index.md b/chapter_optimization/index.md
@@ -1,18 +1,11 @@
 # Optimization Algorithms
 :label:`chap_optimization`
 
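-So far, if you have read this book in order, you have learned to use many optimization algorithms to train deep learning models.
-They are the tools that allow us to keep updating model parameters and minimizing the value of the loss function.
-Indeed, many people are happy to treat optimization as a "black box device": with some knowledge of the "magic" of deep learning optimization, they can minimize an objective function based on simple settings.
+If you have read this book in sequence up to this point, you have already used a number of optimization algorithms to train deep learning models. These tools allowed us to keep updating model parameters and to minimize the value of the loss function, as evaluated on the training set. Indeed, anyone content with treating optimization as a black-box device for minimizing objective functions in a simple setting may be satisfied knowing that there exists an array of incantations of such a procedure (with names such as "SGD" and "Adam").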
 
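-However, optimization algorithms are important for deep learning, so learning some deeper knowledge helps us optimize better.
-On the one hand, training a complex deep learning model can take hours, days, or even weeks, and the performance of the optimization algorithm directly affects the model's training efficiency.
-On the other hand, understanding the principles of different optimization algorithms and the role of their hyperparameters allows us to tune the hyperparameters in a targeted manner and improve the performance of deep learning models.
+To do well, however, deeper knowledge is required. Optimization algorithms are very important for deep learning. On the one hand, training a complex deep learning model can take hours, days, or even weeks. The performance of the optimization algorithm directly affects the model's training efficiency. On the other hand, understanding the principles of different optimization algorithms and the role of their hyperparameters will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep learning models.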
 
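-In this chapter, we explore common deep learning optimization algorithms in depth.
-In deep learning, almost all optimization problems are *nonconvex*.
-Nonetheless, designing and analyzing algorithms in the context of *convex problems* has proven to be very instructive.
-For this reason, this chapter includes a primer on convex optimization and a proof for a very simple stochastic gradient descent algorithm on a convex objective function.
+In this chapter, we explore common deep learning optimization algorithms in depth. Almost all optimization problems arising in deep learning are *nonconvex*. Nonetheless, designing and analyzing algorithms in the context of *convex* problems is very instructive. It is for this reason that this chapter includes a primer on convex optimization and a proof for a very simple stochastic gradient descent algorithm on a convex objective function.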
 
 ```toc
 :maxdepth: 2
@@ -28,4 +21,4 @@ rmsprop
 adadelta
 adam
 lr-scheduler
-```
+```
diff --git a/chapter_optimization/index_origin.md b/chapter_optimization/index_origin.md
@@ -0,0 +1,34 @@
+# Optimization Algorithms
+:label:`chap_optimization`
+
+If you read the book in sequence up to this point, you have already used a number of optimization algorithms to train deep learning models.
+They were the tools that allowed us to continue updating model parameters and to minimize the value of the loss function, as evaluated on the training set. Indeed, anyone content with treating optimization as a black box device to minimize objective functions in a simple setting might well content oneself with the knowledge that there exists an array of incantations of such a procedure (with names such as "SGD" and "Adam").
+
+To do well, however, some deeper knowledge is required.
+Optimization algorithms are important for deep learning.
+On one hand, training a complex deep learning model can take hours, days, or even weeks.
+The performance of the optimization algorithm directly affects the model's training efficiency.
+On the other hand, understanding the principles of different optimization algorithms and the role of their hyperparameters
+will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep learning models.
+
+In this chapter, we explore common deep learning optimization algorithms in depth.
+Almost all optimization problems arising in deep learning are *nonconvex*.
+Nonetheless, the design and analysis of algorithms in the context of *convex* problems have proven to be very instructive.
+It is for that reason that this chapter includes a primer on convex optimization and the proof for a very simple stochastic gradient descent algorithm on a convex objective function.
+
+```toc
+:maxdepth: 2
+
+optimization-intro
+convexity
+gd
+sgd
+minibatch-sgd
+momentum
+adagrad
+rmsprop
+adadelta
+adam
+lr-scheduler
+```
+
diff --git a/chapter_optimization/optimization-intro.md b/chapter_optimization/optimization-intro.md
diff --git a/chapter_optimization/optimization-intro_origin.md b/chapter_optimization/optimization-intro_origin.md
@@ -0,0 +1,231 @@
+# Optimization and Deep Learning
+
+In this section, we will discuss the relationship between optimization and deep learning as well as the challenges of using optimization in deep learning.
+For a deep learning problem, we will usually define a *loss function* first. Once we have the loss function, we can use an optimization algorithm in an attempt to minimize the loss.
+In optimization, a loss function is often referred to as the *objective function* of the optimization problem. By tradition and convention most optimization algorithms are concerned with *minimization*. If we ever need to maximize an objective there is a simple solution: just flip the sign on the objective.
+
+## Goal of Optimization
+
+Although optimization provides a way to minimize the loss function for deep
+learning, in essence, the goals of optimization and deep learning are
+fundamentally different.
+The former is primarily concerned with minimizing an
+objective whereas the latter is concerned with finding a suitable model, given a
+finite amount of data.
+In :numref:`sec_model_selection`,
+we discussed the difference between these two goals in detail.
+For instance,
+training error and generalization error generally differ: since the objective
+function of the optimization algorithm is usually a loss function based on the
+training dataset, the goal of optimization is to reduce the training error.
+However, the goal of deep learning (or more broadly, statistical inference) is to
+reduce the generalization error.
+To accomplish the latter we need to pay
+attention to overfitting in addition to using the optimization algorithm to
+reduce the training error.
+
+```{.python .input}
+%matplotlib inline
+from d2l import mxnet as d2l
+from mpl_toolkits import mplot3d
+from mxnet import np, npx
+npx.set_np()
+```
+
+```{.python .input}
+#@tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import numpy as np
+from mpl_toolkits import mplot3d
+import torch
+```
+
+```{.python .input}
+#@tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import numpy as np
+from mpl_toolkits import mplot3d
+import tensorflow as tf
+```
+
+To illustrate the aforementioned different goals,
+let us consider 
+the empirical risk and the risk. 
+As described
+in :numref:`subsec_empirical-risk-and-risk`,
+the empirical risk
+is an average loss
+on the training dataset
+while the risk is the expected loss 
+on the entire population of data.
+Below we define two functions:
+the risk function `f`
+and the empirical risk function `g`.
+Suppose that we have only a finite amount of training data.
+As a result, here `g` is less smooth than `f`.
+
+```{.python .input}
+#@tab all
+def f(x):
+    return x * d2l.cos(np.pi * x)
+
+def g(x):
+    return f(x) + 0.2 * d2l.cos(5 * np.pi * x)
+```
+
+The graph below illustrates that the minimum of the empirical risk on a training dataset may be at a different location from the minimum of the risk (generalization error).
+
+```{.python .input}
+#@tab all
+def annotate(text, xy, xytext):  #@save
+    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
+                           arrowprops=dict(arrowstyle='->'))
+
+x = d2l.arange(0.5, 1.5, 0.01)
+d2l.set_figsize((4.5, 2.5))
+d2l.plot(x, [f(x), g(x)], 'x', 'risk')
+annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
+annotate('min of risk', (1.1, -1.05), (0.95, -0.5))
+```
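As a rough numerical check of the plot above, the following sketch locates both minimizers by grid search on the same interval. It re-implements `f` and `g` with the standard `math` module so that it runs without `d2l`. The helper `argmin_on_grid` and the grid resolution are illustrative assumptions, not part of the book's code.

```python
import math

def f(x):
    # Risk: same formula as `f` above, written with the standard library.
    return x * math.cos(math.pi * x)

def g(x):
    # Empirical risk: the risk plus a higher-frequency perturbation.
    return f(x) + 0.2 * math.cos(5 * math.pi * x)

def argmin_on_grid(func, lo, hi, n=1000):
    # Hypothetical helper: evaluate func on n + 1 evenly spaced points
    # in [lo, hi] and return the point with the smallest value.
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return min(xs, key=func)

x_risk = argmin_on_grid(f, 0.5, 1.5)
x_emp = argmin_on_grid(g, 0.5, 1.5)
print(f'min of risk at x = {x_risk:.3f}, risk = {f(x_risk):.3f}')
print(f'min of empirical risk at x = {x_emp:.3f}, risk there = {f(x_emp):.3f}')
```

Under these assumptions, the minimizer of `g` lies to the left of the minimizer of `f`, so minimizing the empirical risk exactly yields a slightly higher risk than the best achievable one.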
+
+## Optimization Challenges in Deep Learning
+
+In this chapter, we are going to focus specifically on the performance of optimization algorithms in minimizing the objective function, rather than a
+model's generalization error. 
+In :numref:`sec_linear_regression`
+we distinguished between analytical solutions and numerical solutions in
+optimization problems. 
+In deep learning, most objective functions are
+complicated and do not have analytical solutions. Instead, we must use numerical
+optimization algorithms. 
+The optimization algorithms in this chapter
+all fall into this
+category.
+
+There are many challenges in deep learning optimization. Some of the most vexing ones are local minima, saddle points, and vanishing gradients. 
+Let us have a look at them.
+
+
+### Local Minima
+
+For any objective function $f(x)$,
+if the value of $f(x)$ at $x$ is smaller than the values of $f(x)$ at any other points in the vicinity of $x$, then $f(x)$ is a local minimum.
+If the value of $f(x)$ at $x$ is the minimum of the objective function over the entire domain,
+then $f(x)$ is the global minimum.
+
+For example, given the function
+
+$$f(x) = x \cdot \text{cos}(\pi x) \text{ for } -1.0 \leq x \leq 2.0,$$
+
+we can approximate the local minimum and global minimum of this function.
+
+```{.python .input}
+#@tab all
+x = d2l.arange(-1.0, 2.0, 0.01)
+d2l.plot(x, [f(x), ], 'x', 'f(x)')
+annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
+annotate('global minimum', (1.1, -0.95), (0.6, 0.8))
+```
+
+The objective function of deep learning models usually has many local optima. 
+When the numerical solution of an optimization problem is near a local optimum, the solution obtained by the final iteration may minimize the objective function only *locally*, rather than *globally*, because the gradient of the objective function approaches or becomes zero there.
+Only some degree of noise might knock the parameter out of the local minimum. In fact, this is one of the beneficial properties of
+minibatch stochastic gradient descent where the natural variation of gradients over minibatches is able to dislodge the parameters from local minima.
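To make the trapping effect concrete, here is a minimal sketch of plain (noise-free) gradient descent on $f(x) = x \cdot \cos(\pi x)$, started from two different points. The starting points, the learning rate `lr`, and the number of steps are illustrative assumptions. Gradient descent itself is only introduced later in the chapter, so treat this as a preview rather than as the book's implementation.

```python
import math

def f(x):
    return x * math.cos(math.pi * x)

def f_grad(x):
    # Derivative of x * cos(pi * x) by the product rule.
    return math.cos(math.pi * x) - math.pi * x * math.sin(math.pi * x)

def gradient_descent(x, lr=0.05, steps=200):
    # Plain gradient descent without any noise; lr and steps are
    # illustrative values chosen so that the iteration converges.
    for _ in range(steps):
        x -= lr * f_grad(x)
    return x

x_left = gradient_descent(-0.5)   # starts in the basin of the local minimum
x_right = gradient_descent(0.8)   # starts in the basin of the global minimum
print(f'start -0.5 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}')
print(f'start  0.8 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}')
```

Both runs end at a point with vanishing gradient, but only the second one reaches the lower of the two minima. Without noise, the first run has no way of leaving its basin.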
+
+
+### Saddle Points
+
+Besides local minima, saddle points are another reason for gradients to vanish. A *saddle point* is any location where all gradients of a function vanish but which is neither a local minimum nor a local maximum (and hence not a global one either).
+Consider the function $f(x) = x^3$. Its first and second derivatives vanish at $x=0$. Optimization might stall at this point, even though it is not a minimum.
+
+```{.python .input}
+#@tab all
+x = d2l.arange(-2.0, 2.0, 0.01)
+d2l.plot(x, [x**3], 'x', 'f(x)')
+annotate('saddle point', (0, -0.2), (-0.52, -5.0))
+```
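The stalling behavior on $f(x) = x^3$ can be sketched with a few lines of plain gradient descent. The starting point, learning rate, and step count below are illustrative assumptions.

```python
def gradient_descent_cubic(x, lr=0.1, steps=1000):
    # Minimize f(x) = x**3 with plain gradient descent; f'(x) = 3 * x**2.
    for _ in range(steps):
        x -= lr * 3 * x**2
    return x

x_final = gradient_descent_cubic(1.0)
print(f'x after 1000 steps: {x_final:.5f}, gradient: {3 * x_final**2:.2e}')
```

With these settings the iterate creeps toward $x = 0$ from the right and never crosses it, even though $f$ keeps decreasing for $x < 0$: the gradient shrinks quadratically, so each step is smaller than the last.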
+
+Saddle points in higher dimensions are even more insidious, as the example below shows. Consider the function $f(x, y) = x^2 - y^2$. It has its saddle point at $(0, 0)$. This is a maximum with respect to $y$ and a minimum with respect to $x$. Moreover, it *looks* like a saddle, which is where this mathematical property got its name.
+
+```{.python .input}
+#@tab all
+x, y = d2l.meshgrid(
+    d2l.linspace(-1.0, 1.0, 101), d2l.linspace(-1.0, 1.0, 101))
+z = x**2 - y**2
+
+ax = d2l.plt.figure().add_subplot(111, projection='3d')
+ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
+ax.plot([0], [0], [0], 'rx')
+ticks = [-1, 0, 1]
+d2l.plt.xticks(ticks)
+d2l.plt.yticks(ticks)
+ax.set_zticks(ticks)
+d2l.plt.xlabel('x')
+d2l.plt.ylabel('y');
+```
+
+We assume that the input of a function is a $k$-dimensional vector and its
+output is a scalar, so its Hessian matrix will have $k$ eigenvalues
+(refer to the [online appendix on eigendecompositions](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/eigendecomposition.html)).
+At a position where the gradient of the function is zero,
+the function could have a local minimum, a local maximum, or a saddle point:
+
+* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all positive, we have a local minimum for the function.
+* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all negative, we have a local maximum for the function.
+* When the eigenvalues of the function's Hessian matrix at the zero-gradient position include both negative and positive values, we have a saddle point for the function.
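The three cases above can be sketched for a function of two variables, where the eigenvalues of the symmetric Hessian $\begin{bmatrix} a & b \\ b & c \end{bmatrix}$ have a closed form. The function names and the tolerance `tol` are hypothetical. The fourth outcome, `'inconclusive'`, is not part of the list above: it covers the degenerate case in which some eigenvalue is zero and the test cannot decide, as for $f(x) = x^3$ at $x = 0$.

```python
import math

def eigenvalues_2x2(a, b, c):
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]] in closed form.
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify(a, b, c, tol=1e-12):
    # Second-derivative test at a zero-gradient point; `tol` is an
    # illustrative threshold for treating an eigenvalue as zero.
    lo, hi = eigenvalues_2x2(a, b, c)
    if lo > tol:
        return 'local minimum'   # all eigenvalues positive
    if hi < -tol:
        return 'local maximum'   # all eigenvalues negative
    if lo < -tol and hi > tol:
        return 'saddle point'    # mixed signs
    return 'inconclusive'        # some eigenvalue is (numerically) zero

# f(x, y) = x**2 - y**2 has the constant Hessian [[2, 0], [0, -2]].
print(classify(2, 0, -2))  # saddle point
# f(x, y) = x**2 + y**2 has the constant Hessian [[2, 0], [0, 2]].
print(classify(2, 0, 2))   # local minimum
```

For $k > 2$ a numerical eigenvalue routine would replace the closed form, but the sign logic stays the same.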
+
+For high-dimensional problems the likelihood that at least *some* of the eigenvalues are negative is quite high. This makes saddle points more likely than local minima. We will discuss some exceptions to this situation in the next section when introducing convexity. In short, convex functions are those where the eigenvalues of the Hessian are never negative. Sadly, though, most deep learning problems do not fall into this category. Nonetheless, convexity is a great tool for studying optimization algorithms.
+
+### Vanishing Gradients
+
+Probably the most insidious problem to encounter is the vanishing gradient.
+Recall our commonly-used activation functions and their derivatives in :numref:`subsec_activation-functions`.
+For instance, assume that we want to minimize the function $f(x) = \tanh(x)$ and we happen to get started at $x = 4$. As we can see, the gradient of $f$ is close to zero.
+More specifically, $f'(x) = 1 - \tanh^2(x)$ and thus $f'(4) \approx 0.0013$.
+Consequently, optimization will get stuck for a long time before we make progress. This turns out to be one of the reasons that training deep learning models was quite tricky prior to the introduction of the ReLU activation function.
+
+```{.python .input}
+#@tab all
+x = d2l.arange(-2.0, 5.0, 0.01)
+d2l.plot(x, [d2l.tanh(x)], 'x', 'f(x)')
+annotate('vanishing gradient', (4, 1), (2, 0.0))
+```
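A short sketch shows how little progress plain gradient descent makes from $x = 4$. The learning rate and the step count are illustrative assumptions.

```python
import math

def tanh_grad(x):
    # Derivative of tanh(x): 1 - tanh(x)**2.
    return 1 - math.tanh(x) ** 2

x = 4.0
lr = 0.1  # illustrative learning rate
print(f"f'(4) = {tanh_grad(x):.5f}")
for _ in range(100):
    x -= lr * tanh_grad(x)
print(f'x after 100 steps: {x:.4f}')
```

Because each update is proportional to a gradient of roughly $10^{-3}$, the iterate has barely moved away from its starting point after 100 steps.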
+
+As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust range of algorithms that perform well and that are easy to use even for beginners. Furthermore, it is not really necessary to find *the* best solution. Local optima or even approximate solutions thereof are still very useful.
+
+## Summary
+
+* Minimizing the training error does *not* guarantee that we find the best set of parameters to minimize the generalization error.
+* The optimization problems may have many local minima.
+* The problem may have even more saddle points, as generally the problems are not convex.
+* Vanishing gradients can cause optimization to stall. Often a reparameterization of the problem helps. Good initialization of the parameters can be beneficial, too.
+
+
+## Exercises
+
+1. Consider a simple MLP with a single hidden layer of, say, $d$ dimensions and a single output. Show that for any local minimum there are at least $d!$ equivalent solutions that behave identically.
+1. Assume that we have a symmetric random matrix $\mathbf{M}$ where the entries
+   $M_{ij} = M_{ji}$ are each drawn from some probability distribution
+   $p_{ij}$. Furthermore assume that $p_{ij}(x) = p_{ij}(-x)$, i.e., that the
+   distribution is symmetric (see e.g., :cite:`Wigner.1958` for details).
+    1. Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvector $\mathbf{v}$ the associated eigenvalue $\lambda$ satisfies $P(\lambda > 0) = P(\lambda < 0)$.
+    1. Why does the above *not* imply $P(\lambda > 0) = 0.5$?
+1. What other challenges involved in deep learning optimization can you think of?
+1. Assume that you want to balance a (real) ball on a (real) saddle.
+    1. Why is this hard?
+    1. Can you exploit this effect also for optimization algorithms?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/349)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/487)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/489)
+:end_tab: