diff --git a/chapter_optimization/convexity.md b/chapter_optimization/convexity.md new file mode 100644 index 000000000..d71661fdc --- /dev/null +++ b/chapter_optimization/convexity.md @@ -0,0 +1,268 @@ +# Conexity +:label:`sec_convexity` + +Conexity 在优化算法的设计中起着至关重要的作用。这在很大程度上是因为在这种情况下分析和测试算法要容易得多。换句话说,如果算法即使在凸设置中也表现不佳,那么通常我们不应该希望看到很好的结果。此外,尽管深度学习中的优化问题通常是非凸出的,但它们往往表现出接近局部最小值的凸问题的一些特性。这可能会导致令人兴奋的新优化变体,例如 :cite:`Izmailov.Podoprikhin.Garipov.ea.2018`。 + +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +from mpl_toolkits import mplot3d +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import numpy as np +from mpl_toolkits import mplot3d +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import numpy as np +from mpl_toolkits import mplot3d +import tensorflow as tf +``` + +## 定义 + +在凸分析之前,我们需要定义 * 凸集 * 和 * 凸函数 *。它们导致了通常应用于机器学习的数学工具。 + +### 凸集 + +套装是凸度的基础。简而言之,如果对于任何 $a, b \in \mathcal{X}$,连接 $a$ 和 $b$ 的线段也是 $\mathcal{X}$,则矢量空间中的一组 $\mathcal{X}$ 为 * 凸 X *。从数学角度来说,这意味着对于所有 $\lambda \in [0, 1]$ 我们都有 + +$$\lambda a + (1-\lambda) b \in \mathcal{X} \text{ whenever } a, b \in \mathcal{X}.$$ + +这听起来有点抽象。考虑 :numref:`fig_pacman`。第一组不是凸面的,因为存在不包含在其中的线段。另外两套没有遇到这样的问题。 + +![The first set is nonconvex and the other two are convex.](../img/pacman.svg) +:label:`fig_pacman` + +除非你能用它们做点什么,否则自己的定义并不是特别有用。在这种情况下,我们可以查看 :numref:`fig_convex_intersect` 所示的十字路口。假设 $\mathcal{X}$ 和 $\mathcal{Y}$ 是凸集。然后 $\mathcal{X} \cap \mathcal{Y}$ 也是凸起来的。要看到这一点,请考虑任何 $a, b \in \mathcal{X} \cap \mathcal{Y}$。由于 $\mathcal{X}$ 和 $\mathcal{Y}$ 是凸的,所以连接 $a$ 和 $b$ 的线段都包含在 $\mathcal{X}$ 和 $\mathcal{Y}$ 中。鉴于这一点,它们也需要包含在 $\mathcal{X} \cap \mathcal{Y}$ 中,从而证明我们的定理。 + +![The intersection between two convex sets is convex.](../img/convex-intersect.svg) +:label:`fig_convex_intersect` + +我们可以很少努力加强这一结果:鉴于凸集 $\mathcal{X}_i$,它们的交叉点 $\cap_{i} \mathcal{X}_i$ 是凸的。要看看相反的情况是不正确的,请考虑两套不相交的 $\mathcal{X} \cap \mathcal{Y} = \emptyset$。现在选择 $a \in \mathcal{X}$ 和 $b \in \mathcal{Y}$。:numref:`fig_nonconvex` 中连接 $a$ 和 $a$ 和 $b$ 的线段需要包含一些既不在 $\mathcal{X}$ 中也不是 $\mathcal{Y}$ 中的部分,因为我们假设为 $\mathcal{X} \cap \mathcal{Y} = \emptyset$。因此,直线段也不在 $\mathcal{X} \cup \mathcal{Y}$ 中,从而证明了一般来说凸集的并集不需要凸起。 + +![The union of two convex sets need not be convex.](../img/nonconvex.svg) +:label:`fig_nonconvex` + +通常,深度学习中的问题是在凸集上定义的。例如,$\mathbb{R}^d$ 是一组 $d$ 维矢量的实数,是一个凸集(毕竟,$\mathbb{R}^d$ 中任意两个点之间的线保留在 $\mathbb{R}^d$ 中)。在某些情况下,我们使用有限长度的变量,例如 $\{\mathbf{x} | \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\| \leq r\}$ 定义的半径为 $r$ 的球。 + +### 凸函数 + +现在我们已经有了凸集,我们可以引入 * 凸函数 * $f$。给定凸集 $\mathcal{X}$,函数 $f: \mathcal{X} \to \mathbb{R}$ 是 * 凸面 * 如果对于所有 $x, x' \in \mathcal{X}$ 和所有 $\lambda \in [0, 1]$ 我们都有 + +$$\lambda f(x) + (1-\lambda) f(x') \geq f(\lambda x + (1-\lambda) x').$$ + +为了说明这一点,让我们绘制一些函数并检查哪些功能符合要求。下面我们定义了一些函数,包括凸和非凸。 + +```{.python .input} +#@tab all +f = lambda x: 0.5 * x**2 # Convex +g = lambda x: d2l.cos(np.pi * x) # Nonconvex +h = lambda x: d2l.exp(0.5 * x) # Convex + +x, segment = d2l.arange(-2, 2, 0.01), d2l.tensor([-1.5, 1]) +d2l.use_svg_display() +_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3)) +for ax, func in zip(axes, [f, g, h]): + d2l.plot([x, segment], [func(x), func(segment)], axes=ax) +``` + +正如预期的那样,余弦函数是 * nonconvex*,而抛物线和指数函数是。请注意,要使条件有意义,需要 $\mathcal{X}$ 是凸集的要求。否则,$f(\lambda x + (1-\lambda) x')$ 的结果可能没有很好的界定。 + +### Jensen 的不平等 + +鉴于凸函数 $f$,最有用的数学工具之一是 * Jensen 的不平等性 *。这相当于对凸度定义的概括: + 
+$$\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right) \text{ and } E_X[f(X)] \geq f\left(E_X[X]\right),$$ +:eqlabel:`eq_jensens-inequality` + +其中 $\alpha_i$ 是非负实数,因此 $\sum_i \alpha_i = 1$ 和 $X$ 是一个随机变量。换句话说,对凸函数的期望不低于期望的凸函数,后者通常是一个更简单的表达式。为了证明第一个不平等,我们一次将凸度的定义应用于总和中的一个术语。 + +延森不平等的常见应用之一是用一个更简单的表达来限制一个更复杂的表达方式。例如,它的应用可以是部分观察到的随机变量的对数可能性。也就是说,我们使用 + +$$E_{Y \sim P(Y)}[-\log P(X \mid Y)] \geq -\log P(X),$$ + +自 $\int P(Y) P(X \mid Y) dY = P(X)$ 以来。这可以在变分方法中使用。这里 $Y$ 通常是未观察到的随机变量,$P(Y)$ 是对它可能如何分布的最佳猜测,$P(X)$ 是集成了 $Y$ 的分布。例如,在群集中,$Y$ 可能是集群标签,$P(X \mid Y)$ 是应用集群标签时的生成模型。 + +## 属性 + +凸函数有许多有用的属性。我们在下面介绍一些常用的。 + +### 本地 Minima 是全球最小值 + +首先,凸函数的局部最小值也是全局最小值。我们可以通过矛盾来证明这一点,如下所示。 + +考虑在凸集 $\mathcal{X}$ 上定义的凸函数 $f$。假设 $x^{\ast} \in \mathcal{X}$ 是局部最低值:存在一个小的正值 $p$,所以对于 $x \in \mathcal{X}$ 满足 $0 < |x - x^{\ast}| \leq p$,我们有 $f(x^{\ast}) < f(x)$。 + +假设本地最低位 $x^{\ast}$ 不是 $f$ 的全球最低值:存在 $x' \in \mathcal{X}$,其中 $f(x') < f(x^{\ast})$。还存在着 $\lambda \in [0, 1)$,例如 $\lambda = 1 - \frac{p}{|x^{\ast} - x'|}$,所以 $0 < |\lambda x^{\ast} + (1-\lambda) x' - x^{\ast}| \leq p$。 + +但是,根据凸函数的定义,我们有 + +$$\begin{aligned} + f(\lambda x^{\ast} + (1-\lambda) x') &\leq \lambda f(x^{\ast}) + (1-\lambda) f(x') \\ + &< \lambda f(x^{\ast}) + (1-\lambda) f(x^{\ast}) \\ + &= f(x^{\ast}), +\end{aligned}$$ + +这与我们关于 $x^{\ast}$ 是当地最低限度的说法相矛盾.因此,不存在 $x' \in \mathcal{X}$,其中 $f(x') < f(x^{\ast})$。当地最低值 $x^{\ast}$ 也是全球最低水平。 + +例如,凸函数 $f(x) = (x-1)^2$ 的局部最小值为 $x=1$,这也是全局最小值。 + +```{.python .input} +#@tab all +f = lambda x: (x - 1) ** 2 +d2l.set_figsize() +d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)') +``` + +凸函数的局部最小值也是全局最小值这一事实非常方便。这意味着,如果我们尽量减少功能,我们就不能 “卡住”。但是请注意,这并不意味着不能有一个以上的全局最低值,或者甚至可能存在一个。例如,函数 $f(x) = \mathrm{max}(|x|-1, 0)$ 在时间间隔 $[-1, 1]$ 内获得了最小值。相反,函数 $f(x) = \exp(x)$ 在 $\mathbb{R}$ 上没有达到最低值:对于 $x \to -\infty$,它渐近到 $0$,但没有 $x$,其中 $x$,其中 $f(x) = 0$。 + +### 下面的凸函数集是凸 + +我们可以通过凸函数的 * 下面的集合 * 来方便地定义凸集。具体来说,给定在凸集 $\mathcal{X}$ 上定义的凸函数 $f$,下面的任何一组 + +$$\mathcal{S}_b := \{x | x \in \mathcal{X} \text{ and } f(x) \leq b\}$$ + +是凸的。 + +让我们快速证明这一点。回想一下,对于任何 $x, x' \in \mathcal{S}_b$,我们都需要展示 $\lambda x + (1-\lambda) x' \in \mathcal{S}_b$ 只要 $\lambda \in [0, 1]$。自 $f(x) \leq b$ 和 $f(x') \leq b$ 以来,根据凸度的定义,我们有 + +$$f(\lambda x + (1-\lambda) x') \leq \lambda f(x) + (1-\lambda) f(x') \leq b.$$ + +### 凸度和第二衍生品 + +只要函数 $f: \mathbb{R}^n \rightarrow \mathbb{R}$ 的第二个导数存在,就很容易检查 $f$ 是否凸。我们所需要做的就是检查 $f$ 的黑森州是否为正半定性:$\nabla^2f \succeq 0$,即,表示黑森州矩阵 $\nabla^2f$ 乘 $\mathbf{H}$,$\mathbf{x}^\top \mathbf{H} \mathbf{x} \geq 0$ 表示所有 $\mathbf{x} \in \mathbb{R}^n$。例如,函数 $f(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2$ 自 $\nabla^2 f = \mathbf{1}$ 以来就是凸的,也就是说,它的黑森语是一个身份矩阵。 + +从形式上来说,两次可分的一维函数 $f: \mathbb{R} \rightarrow \mathbb{R}$ 如果而且只有在其第二个导数 $f'' \geq 0$ 时是凸的。对于任何两次可分化的多维函数 $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$,如果而且仅当黑森州 $\nabla^2f \succeq 0$ 时,它是凸的。 + +首先,我们需要证明一维的情况。为了看到 $f$ 的凸度意味着 $f'' \geq 0$ 我们使用了这样一个事实: + +$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x).$$ + +由于第二个衍生物是由有限差异的限制给出的,因此 + +$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) + f(x - \epsilon) - 2f(x)}{\epsilon^2} \geq 0.$$ + +为了看到 $f'' \geq 0$ 意味着 $f$ 是凸的,我们使用的事实是 $f'' \geq 0$ 意味着 $f'$ 是一个单调的非递减函数。让 $a < x < b$ 成为 $\mathbb{R}$ 中的三点,其中 $x = (1-\lambda)a + \lambda b$ 和 $\lambda \in (0, 1)$。根据平均值定理,存在 $\alpha \in [a, x]$ 和 $\beta \in [x, b]$ 这样 + +$$f'(\alpha) = \frac{f(x) - f(a)}{x-a} \text{ and } f'(\beta) = \frac{f(b) - f(x)}{b-x}.$$ + +因此,通过单调性 $f'(\beta) \geq f'(\alpha)$ + 
+$$\frac{x-a}{b-a}f(b) + \frac{b-x}{b-a}f(a) \geq f(x).$$ + +自 $x = (1-\lambda)a + \lambda b$ 以来,我们有 + +$$\lambda f(b) + (1-\lambda)f(a) \geq f((1-\lambda)a + \lambda b),$$ + +从而证明了凸度。 + +其次,在证明多维情况之前,我们需要一个词语:$f: \mathbb{R}^n \rightarrow \mathbb{R}$ 是凸的,如果且只有在所有 $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ + +$$g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1-z) \mathbf{y}) \text{ where } z \in [0,1]$$ + +是凸的。 + +为了证明 $f$ 的凸度意味着 $g$ 是凸的,我们可以证明对于所有 $a, b, \lambda \in [0, 1]$(因此 $0 \leq \lambda a + (1-\lambda) b \leq 1$) + +$$\begin{aligned} &g(\lambda a + (1-\lambda) b)\\ +=&f\left(\left(\lambda a + (1-\lambda) b\right)\mathbf{x} + \left(1-\lambda a - (1-\lambda) b\right)\mathbf{y} \right)\\ +=&f\left(\lambda \left(a \mathbf{x} + (1-a) \mathbf{y}\right) + (1-\lambda) \left(b \mathbf{x} + (1-b) \mathbf{y}\right) \right)\\ +\leq& \lambda f\left(a \mathbf{x} + (1-a) \mathbf{y}\right) + (1-\lambda) f\left(b \mathbf{x} + (1-b) \mathbf{y}\right) \\ +=& \lambda g(a) + (1-\lambda) g(b). +\end{aligned}$$ + +为了证明情况,我们可以证明对于所有 $\lambda \in [0, 1]$ + +$$\begin{aligned} &f(\lambda \mathbf{x} + (1-\lambda) \mathbf{y})\\ +=&g(\lambda \cdot 1 + (1-\lambda) \cdot 0)\\ +\leq& \lambda g(1) + (1-\lambda) g(0) \\ +=& \lambda f(\mathbf{x}) + (1-\lambda) g(\mathbf{y}). +\end{aligned}$$ + +最后,使用上述词语和一维案例的结果,可以按如下方式证明多维情况。如果而且仅当所有 $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ $g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1-z) \mathbf{y})$ $g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1-z) \mathbf{y})$(其中 $z \in [0,1]$)都是凸的情况下,多维函数 $f: \mathbb{R}^n \rightarrow \mathbb{R}$ 是凸的。根据一维情况,只有在 $g'' = (\mathbf{x} - \mathbf{y})^\top \mathbf{H}(\mathbf{x} - \mathbf{y}) \geq 0$ ($\mathbf{H} \stackrel{\mathrm{def}}{=} \nabla^2f$) 对于所有 $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$($\mathbf{H} \stackrel{\mathrm{def}}{=} \nabla^2f$)的情况下,这相当于 $\mathbf{H} \succeq 0$,根据正半定矩阵的定义,这相当于 $\mathbf{H} \succeq 0$。 + +## 限制 + +凸优化的一个不错的特性是它使我们能够有效地处理约束条件。也就是说,它使我们能够解决 * 受限的优化 * 形式的问题: + +$$\begin{aligned} \mathop{\mathrm{minimize~}}_{\mathbf{x}} & f(\mathbf{x}) \\ + \text{ subject to } & c_i(\mathbf{x}) \leq 0 \text{ for all } i \in \{1, \ldots, n\}, +\end{aligned}$$ + +其中 $f$ 是目标,函数 $c_i$ 是约束函数。看看这确实考虑了 $c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1$ 的情况。在这种情况下,参数 $\mathbf{x}$ 受限于单位球。如果第二个约束是 $c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b$,那么这对应于放在半空间上的所有 $\mathbf{x}$。同时满足这两个限制条件等于选择一个球的一片。 + +### 拉格朗日 + +一般来说,解决受限的优化问题很困难。解决这个问题的一种方法来自于具有相当简单的直觉的物理学。想象一下盒子里面有一个球。球将滚到最低的地方,重力将与盒子侧面可以强加于球的力量平衡。简而言之,目标函数(即重力)的梯度将被约束函数的梯度所抵消(由于墙壁 “向后推”,球需要留在盒子内)。请注意,一些限制可能不活跃:球未触及的墙壁将无法对球施加任何力量。 + +跳过 * 拉格朗日 * $L$ 的推导,上述推理可以通过以下鞍点优化问题来表达: + +$$L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^n \alpha_i c_i(\mathbf{x}) \text{ where } \alpha_i \geq 0.$$ + +这里的变量 $\alpha_i$ ($i=1,\ldots,n$) 是所谓的 * 拉格朗日乘数 *,可以确保约束条件得到正确实施。选择它们足够大,以确保 $c_i(\mathbf{x}) \leq 0$ 适用于所有 $i$。例如,对于任何 $\mathbf{x}$,其中 $c_i(\mathbf{x}) < 0$ 当然,我们最终会选择 $\alpha_i = 0$。此外,这是一个鞍点优化问题,人们希望与所有 $\alpha_i$ 相比 * 最大化 * $L$,同时 * 最小化 * 相对于 $\mathbf{x}$。有丰富的文献解释了如何到达函数 $L(\mathbf{x}, \alpha_1, \ldots, \alpha_n)$。就我们的目的而言,只要知道 $L$ 的鞍点就足够了,是最好地解决最初的约束优化问题的地方。 + +### 处罚 + +至少 * 近似 * 满足受限优化问题的一种方法是调整拉格朗日 $L$。我们只需在目标功能 $f(x)$ 中添加 $\alpha_i c_i(\mathbf{x})$,而不是满足 $c_i(\mathbf{x}) \leq 0$。这可以确保限制条件不会受到太严重的违反。 + +事实上,我们一直在使用这个技巧。考虑 :numref:`sec_weight_decay` 中的体重衰减。在其中,我们将 $\frac{\lambda}{2} \|\mathbf{w}\|^2$ 添加到目标函数中,以确保 $\mathbf{w}$ 不会变得太大。从受限的优化角度来看,我们可以看出,这将确保部分半径为 $r$ 的 $\|\mathbf{w}\|^2 - r^2 \leq 0$。调整 $\lambda$ 的值可以让我们改变 $\mathbf{w}$ 的大小。 + 
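+To make the penalty trick concrete, here is a minimal sketch (plain NumPy, with a toy quadratic loss and a made-up target `w_star` chosen purely for illustration; it is not part of the book's code base). Gradient descent is run on the loss plus the weight-decay penalty $\frac{\lambda}{2} \|\mathbf{w}\|^2$; increasing $\lambda$ shrinks the norm of the solution, which is exactly the constraint-like effect described above.
+
+```python
+import numpy as np
+
+def penalized_descent(lam, eta=0.1, steps=100):
+    """Minimize 0.5*||w - w_star||^2 + (lam/2)*||w||^2 by gradient descent."""
+    w_star = np.array([3.0, -2.0])     # hypothetical minimizer of the unpenalized loss
+    w = np.zeros(2)
+    for _ in range(steps):
+        grad = (w - w_star) + lam * w  # gradient of the loss plus the penalty term
+        w -= eta * grad
+    return w
+
+for lam in [0.0, 0.5, 2.0]:
+    w = penalized_descent(lam)
+    print(f'lambda={lam}: w={w}, norm={np.linalg.norm(w):.3f}')
+```
+
+For this toy objective the solution is $\mathbf{w}^*/(1+\lambda)$, so the norm decays as $\lambda$ grows: the penalty approximately enforces a ball constraint whose radius shrinks as $\lambda$ increases.
+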
+一般来说,增加罚款是确保大致满足约束条件的好方法。实际上,事实证明,这比确切的满意度要强劲得多。此外,对于非凸问题,许多在凸情况下使精确方法如此吸引力的属性(例如,最佳性)已不再存在。 + +### 预测 + +满足制约因素的另一种策略是预测。再次,我们之前遇到了它们,例如,在 :numref:`sec_rnn_scratch` 中处理渐变剪切时。在那里,我们确保渐变的长度以 $\theta$ 为限。 + +$$\mathbf{g} \leftarrow \mathbf{g} \cdot \mathrm{min}(1, \theta/\|\mathbf{g}\|).$$ + +事实证明,这是在 $\theta$ 半径 $\theta$ 的球上的 * 投影 * $\mathbf{g}$。更一般地说,在凸集上的投影 $\mathcal{X}$ 被定义为 + +$$\mathrm{Proj}_\mathcal{X}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\|,$$ + +这是 $\mathcal{X}$ 至 $\mathbf{x}$ 中的最接近点。 + +![Convex Projections.](../img/projections.svg) +:label:`fig_projections` + +预测的数学定义可能听起来有点抽象。:numref:`fig_projections` 更清楚地解释了这一点。里面我们有两个凸集,一个圆圈和一个钻石。在预测期间,两个集中的点(黄色)保持不变。两个集之外的积分(黑色)将投影到集合内部的点(红色),这些点与原始积分(黑色)相近。虽然对于 $L_2$ 球而言,这使方向保持不变,但一般情况并不一定如此,就像钻石的情况可以看出的那样。 + +凸投影的用途之一是计算稀疏权重矢量。在这种情况下,我们将重量矢量投射到 $L_1$ 球上,这是 :numref:`fig_projections` 中钻石表壳的通用版本。 + +## 摘要 + +在深度学习的背景下,凸函数的主要目的是激励优化算法并帮助我们详细了解它们。在下面我们将看到如何相应地得出梯度下降和随机梯度下降。 + +* 凸集的交叉点是凸的。工会不是。 +* 对凸函数的期望不低于期望的凸函数(詹森的不平等性)。 +* 如果而且只有当其 Hessian(第二衍生物的矩阵)为正半定值时,两次可分化的函数才是凸的。 +* 凸约束可以通过拉格朗日添加。在实践中,我们可以简单地将它们添加到客观功能中加一个惩罚。 +* 投影映射到最接近原始点的凸集中的点。 + +## 练习 + +1. 假设我们想通过绘制集合中点之间的所有线并检查线是否包含来验证集合的凸度。 + 1. 证明只检查边界上的点就足够了。 + 1. 证明只检查集合的顶点就足够了。 +1. 用 $\mathcal{B}_p[r] \stackrel{\mathrm{def}}{=} \{\mathbf{x} | \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_p \leq r\}$ 表示使用 $p$ 标准的半径 $r$ 的球。证明所有 $p \geq 1$ 对于所有 $p \geq 1$ 来说,$\mathcal{B}_p[r]$ 都是凸的。 +1. 给定凸函数 $f$ 和 $g$,表明 $\mathrm{max}(f, g)$ 也是凸的。证明 $\mathrm{min}(f, g)$ 不是凸起的。 +1. 证明 softmax 函数的标准化是凸的。更具体地说,证明了 $f(x) = \log \sum_i \exp(x_i)$ 的凸度。 +1. 证明线性子空间,即 $\mathcal{X} = \{\mathbf{x} | \mathbf{W} \mathbf{x} = \mathbf{b}\}$,是凸集。 +1. 证明,对于 $\mathbf{b} = \mathbf{0}$ 的线性子空间,对于某些矩阵 $\mathbf{M}$,投影 $\mathrm{Proj}_\mathcal{X}$ 可以写为 $\mathbf{M} \mathbf{x}$。 +1. 显示,对于两次可分的凸函数 $f$,我们可以为大约 $\xi \in [0, \epsilon]$ 写 $f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{1}{2} \epsilon^2 f''(x + \xi)$。 +1. 给定向量 $\mathbf{w} \in \mathbb{R}^d$ 和 $\|\mathbf{w}\|_1 > 1$,计算 $L_1$ 单位球上的投影。 + 1. 作为中间步骤,写出受惩的目标 $\|\mathbf{w} - \mathbf{w}'\|^2 + \lambda \|\mathbf{w}'\|_1$ 并计算给定 $\lambda > 0$ 的解决方案。 + 1. 你能找到 $\lambda$ 的 “正确” 值没有经过很多试验和错误吗? +1. 鉴于凸集 $\mathcal{X}$ 和两个向量 $\mathbf{x}$ 和 $\mathbf{y}$,证明预测永远不会增加距离,即 $\|\mathbf{x} - \mathbf{y}\| \geq \|\mathrm{Proj}_\mathcal{X}(\mathbf{x}) - \mathrm{Proj}_\mathcal{X}(\mathbf{y})\|$。 + +[Discussions](https://discuss.d2l.ai/t/350) diff --git a/chapter_optimization/convexity_origin.md b/chapter_optimization/convexity_origin.md new file mode 100644 index 000000000..a892381ff --- /dev/null +++ b/chapter_optimization/convexity_origin.md @@ -0,0 +1,377 @@ +# Convexity +:label:`sec_convexity` + +Convexity plays a vital role in the design of optimization algorithms. +This is largely due to the fact that it is much easier to analyze and test algorithms in such a context. +In other words, +if the algorithm performs poorly even in the convex setting, +typically we should not hope to see great results otherwise. +Furthermore, even though the optimization problems in deep learning are generally nonconvex, they often exhibit some properties of convex ones near local minima. This can lead to exciting new optimization variants such as :cite:`Izmailov.Podoprikhin.Garipov.ea.2018`. 
+ +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +from mpl_toolkits import mplot3d +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import numpy as np +from mpl_toolkits import mplot3d +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import numpy as np +from mpl_toolkits import mplot3d +import tensorflow as tf +``` + +## Definitions + +Before convex analysis, +we need to define *convex sets* and *convex functions*. +They lead to mathematical tools that are commonly applied to machine learning. + + +### Convex Sets + +Sets are the basis of convexity. Simply put, a set $\mathcal{X}$ in a vector space is *convex* if for any $a, b \in \mathcal{X}$ the line segment connecting $a$ and $b$ is also in $\mathcal{X}$. In mathematical terms this means that for all $\lambda \in [0, 1]$ we have + +$$\lambda a + (1-\lambda) b \in \mathcal{X} \text{ whenever } a, b \in \mathcal{X}.$$ + +This sounds a bit abstract. Consider :numref:`fig_pacman`. The first set is not convex since there exist line segments that are not contained in it. +The other two sets suffer no such problem. + +![The first set is nonconvex and the other two are convex.](../img/pacman.svg) +:label:`fig_pacman` + +Definitions on their own are not particularly useful unless you can do something with them. +In this case we can look at intersections as shown in :numref:`fig_convex_intersect`. +Assume that $\mathcal{X}$ and $\mathcal{Y}$ are convex sets. Then $\mathcal{X} \cap \mathcal{Y}$ is also convex. To see this, consider any $a, b \in \mathcal{X} \cap \mathcal{Y}$. Since $\mathcal{X}$ and $\mathcal{Y}$ are convex, the line segments connecting $a$ and $b$ are contained in both $\mathcal{X}$ and $\mathcal{Y}$. Given that, they also need to be contained in $\mathcal{X} \cap \mathcal{Y}$, thus proving our theorem. + +![The intersection between two convex sets is convex.](../img/convex-intersect.svg) +:label:`fig_convex_intersect` + +We can strengthen this result with little effort: given convex sets $\mathcal{X}_i$, their intersection $\cap_{i} \mathcal{X}_i$ is convex. +To see that the converse is not true, consider two disjoint sets $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Now pick $a \in \mathcal{X}$ and $b \in \mathcal{Y}$. The line segment in :numref:`fig_nonconvex` connecting $a$ and $b$ needs to contain some part that is neither in $\mathcal{X}$ nor in $\mathcal{Y}$, since we assumed that $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Hence the line segment is not in $\mathcal{X} \cup \mathcal{Y}$ either, thus proving that in general unions of convex sets need not be convex. + +![The union of two convex sets need not be convex.](../img/nonconvex.svg) +:label:`fig_nonconvex` + +Typically the problems in deep learning are defined on convex sets. For instance, $\mathbb{R}^d$, +the set of $d$-dimensional vectors of real numbers, +is a convex set (after all, the line between any two points in $\mathbb{R}^d$ remains in $\mathbb{R}^d$). In some cases we work with variables of bounded length, such as balls of radius $r$ as defined by $\{\mathbf{x} | \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\| \leq r\}$. + +### Convex Functions + +Now that we have convex sets we can introduce *convex functions* $f$. 
+Given a convex set $\mathcal{X}$, a function $f: \mathcal{X} \to \mathbb{R}$ is *convex* if for all $x, x' \in \mathcal{X}$ and for all $\lambda \in [0, 1]$ we have + +$$\lambda f(x) + (1-\lambda) f(x') \geq f(\lambda x + (1-\lambda) x').$$ + +To illustrate this let us plot a few functions and check which ones satisfy the requirement. +Below we define a few functions, both convex and nonconvex. + +```{.python .input} +#@tab all +f = lambda x: 0.5 * x**2 # Convex +g = lambda x: d2l.cos(np.pi * x) # Nonconvex +h = lambda x: d2l.exp(0.5 * x) # Convex + +x, segment = d2l.arange(-2, 2, 0.01), d2l.tensor([-1.5, 1]) +d2l.use_svg_display() +_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3)) +for ax, func in zip(axes, [f, g, h]): + d2l.plot([x, segment], [func(x), func(segment)], axes=ax) +``` + +As expected, the cosine function is *nonconvex*, whereas the parabola and the exponential function are. Note that the requirement that $\mathcal{X}$ is a convex set is necessary for the condition to make sense. Otherwise the outcome of $f(\lambda x + (1-\lambda) x')$ might not be well defined. + + +### Jensen's Inequality + +Given a convex function $f$, +one of the most useful mathematical tools +is *Jensen's inequality*. +It amounts to a generalization of the definition of convexity: + +$$\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right) \text{ and } E_X[f(X)] \geq f\left(E_X[X]\right),$$ +:eqlabel:`eq_jensens-inequality` + +where $\alpha_i$ are nonnegative real numbers such that $\sum_i \alpha_i = 1$ and $X$ is a random variable. +In other words, the expectation of a convex function is no less than the convex function of an expectation, where the latter is usually a simpler expression. +To prove the first inequality we repeatedly apply the definition of convexity to one term in the sum at a time. + + +One of the common applications of Jensen's inequality is +to bound a more complicated expression by a simpler one. +For example, +its application can be +with regard to the log-likelihood of partially observed random variables. That is, we use + +$$E_{Y \sim P(Y)}[-\log P(X \mid Y)] \geq -\log P(X),$$ + +since $\int P(Y) P(X \mid Y) dY = P(X)$. +This can be used in variational methods. Here $Y$ is typically the unobserved random variable, $P(Y)$ is the best guess of how it might be distributed, and $P(X)$ is the distribution with $Y$ integrated out. For instance, in clustering $Y$ might be the cluster labels and $P(X \mid Y)$ is the generative model when applying cluster labels. + + + +## Properties + +Convex functions have many useful properties. We describe a few commonly-used ones below. + + +### Local Minima Are Global Minima + +First and foremost, the local minima of convex functions are also the global minima. +We can prove it by contradiction as follows. + +Consider a convex function $f$ defined on a convex set $\mathcal{X}$. +Suppose that $x^{\ast} \in \mathcal{X}$ is a local minimum: +there exists a small positive value $p$ so that for $x \in \mathcal{X}$ that satisfies $0 < |x - x^{\ast}| \leq p$ we have $f(x^{\ast}) < f(x)$. + +Assume that the local minimum $x^{\ast}$ +is not the global minumum of $f$: +there exists $x' \in \mathcal{X}$ for which $f(x') < f(x^{\ast})$. +There also exists +$\lambda \in [0, 1)$ such as $\lambda = 1 - \frac{p}{|x^{\ast} - x'|}$ +so that +$0 < |\lambda x^{\ast} + (1-\lambda) x' - x^{\ast}| \leq p$. 
+ +However, +according to the definition of convex functions, we have + +$$\begin{aligned} + f(\lambda x^{\ast} + (1-\lambda) x') &\leq \lambda f(x^{\ast}) + (1-\lambda) f(x') \\ + &< \lambda f(x^{\ast}) + (1-\lambda) f(x^{\ast}) \\ + &= f(x^{\ast}), +\end{aligned}$$ + +which contradicts with our statement that $x^{\ast}$ is a local minimum. +Therefore, there does not exist $x' \in \mathcal{X}$ for which $f(x') < f(x^{\ast})$. The local minimum $x^{\ast}$ is also the global minimum. + +For instance, the convex function $f(x) = (x-1)^2$ has a local minimum at $x=1$, which is also the global minimum. + +```{.python .input} +#@tab all +f = lambda x: (x - 1) ** 2 +d2l.set_figsize() +d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)') +``` + +The fact that the local minima for convex functions are also the global minima is very convenient. +It means that if we minimize functions we cannot "get stuck". +Note, though, that this does not mean that there cannot be more than one global minimum or that there might even exist one. For instance, the function $f(x) = \mathrm{max}(|x|-1, 0)$ attains its minimum value over the interval $[-1, 1]$. Conversely, the function $f(x) = \exp(x)$ does not attain a minimum value on $\mathbb{R}$: for $x \to -\infty$ it asymptotes to $0$, but there is no $x$ for which $f(x) = 0$. + +### Below Sets of Convex Functions Are Convex + +We can conveniently +define convex sets +via *below sets* of convex functions. +Concretely, +given a convex function $f$ defined on a convex set $\mathcal{X}$, +any below set + +$$\mathcal{S}_b := \{x | x \in \mathcal{X} \text{ and } f(x) \leq b\}$$ + +is convex. + +Let us prove this quickly. Recall that for any $x, x' \in \mathcal{S}_b$ we need to show that $\lambda x + (1-\lambda) x' \in \mathcal{S}_b$ as long as $\lambda \in [0, 1]$. +Since $f(x) \leq b$ and $f(x') \leq b$, +by the definition of convexity we have + +$$f(\lambda x + (1-\lambda) x') \leq \lambda f(x) + (1-\lambda) f(x') \leq b.$$ + + +### Convexity and Second Derivatives + +Whenever the second derivative of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ exists it is very easy to check whether $f$ is convex. +All we need to do is check whether the Hessian of $f$ is positive semidefinite: $\nabla^2f \succeq 0$, i.e., +denoting the Hessian matrix $\nabla^2f$ by $\mathbf{H}$, +$\mathbf{x}^\top \mathbf{H} \mathbf{x} \geq 0$ +for all $\mathbf{x} \in \mathbb{R}^n$. +For instance, the function $f(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2$ is convex since $\nabla^2 f = \mathbf{1}$, i.e., its Hessian is an identity matrix. + + +Formally, a twice-differentiable one-dimensional function $f: \mathbb{R} \rightarrow \mathbb{R}$ is convex +if and only if its second derivative $f'' \geq 0$. For any twice-differentiable multi-dimensional function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, +it is convex if and only if its Hessian $\nabla^2f \succeq 0$. + +First, we need to prove the one-dimensional case. +To see that +convexity of $f$ implies +$f'' \geq 0$ we use the fact that + +$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x).$$ + +Since the second derivative is given by the limit over finite differences it follows that + +$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) + f(x - \epsilon) - 2f(x)}{\epsilon^2} \geq 0.$$ + +To see that +$f'' \geq 0$ implies that $f$ is convex +we use the fact that $f'' \geq 0$ implies that $f'$ is a monotonically nondecreasing function. 
Let $a < x < b$ be three points in $\mathbb{R}$, +where $x = (1-\lambda)a + \lambda b$ and $\lambda \in (0, 1)$. +According to the mean value theorem, +there exist $\alpha \in [a, x]$ and $\beta \in [x, b]$ +such that + +$$f'(\alpha) = \frac{f(x) - f(a)}{x-a} \text{ and } f'(\beta) = \frac{f(b) - f(x)}{b-x}.$$ + + +By monotonicity $f'(\beta) \geq f'(\alpha)$, hence + +$$\frac{x-a}{b-a}f(b) + \frac{b-x}{b-a}f(a) \geq f(x).$$ + +Since $x = (1-\lambda)a + \lambda b$, +we have + +$$\lambda f(b) + (1-\lambda)f(a) \geq f((1-\lambda)a + \lambda b),$$ + +thus proving convexity. + +Second, we need a lemma before +proving the multi-dimensional case: +$f: \mathbb{R}^n \rightarrow \mathbb{R}$ +is convex if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ + +$$g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1-z) \mathbf{y}) \text{ where } z \in [0,1]$$ + +is convex. + +To prove that convexity of $f$ implies that $g$ is convex, +we can show that for all $a, b, \lambda \in [0, 1]$ (thus +$0 \leq \lambda a + (1-\lambda) b \leq 1$) + +$$\begin{aligned} &g(\lambda a + (1-\lambda) b)\\ +=&f\left(\left(\lambda a + (1-\lambda) b\right)\mathbf{x} + \left(1-\lambda a - (1-\lambda) b\right)\mathbf{y} \right)\\ +=&f\left(\lambda \left(a \mathbf{x} + (1-a) \mathbf{y}\right) + (1-\lambda) \left(b \mathbf{x} + (1-b) \mathbf{y}\right) \right)\\ +\leq& \lambda f\left(a \mathbf{x} + (1-a) \mathbf{y}\right) + (1-\lambda) f\left(b \mathbf{x} + (1-b) \mathbf{y}\right) \\ +=& \lambda g(a) + (1-\lambda) g(b). +\end{aligned}$$ + +To prove the converse, +we can show that for +all $\lambda \in [0, 1]$ + +$$\begin{aligned} &f(\lambda \mathbf{x} + (1-\lambda) \mathbf{y})\\ +=&g(\lambda \cdot 1 + (1-\lambda) \cdot 0)\\ +\leq& \lambda g(1) + (1-\lambda) g(0) \\ +=& \lambda f(\mathbf{x}) + (1-\lambda) g(\mathbf{y}). +\end{aligned}$$ + + +Finally, +using the lemma above and the result of the one-dimensional case, +the multi-dimensional case +can be proven as follows. +A multi-dimensional function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex +if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ $g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1-z) \mathbf{y})$, where $z \in [0,1]$, +is convex. +According to the one-dimensional case, +this holds if and only if +$g'' = (\mathbf{x} - \mathbf{y})^\top \mathbf{H}(\mathbf{x} - \mathbf{y}) \geq 0$ ($\mathbf{H} \stackrel{\mathrm{def}}{=} \nabla^2f$) +for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, +which is equivalent to $\mathbf{H} \succeq 0$ +per the definition of positive semidefinite matrices. + + +## Constraints + +One of the nice properties of convex optimization is that it allows us to handle constraints efficiently. That is, it allows us to solve *constrained optimization* problems of the form: + +$$\begin{aligned} \mathop{\mathrm{minimize~}}_{\mathbf{x}} & f(\mathbf{x}) \\ + \text{ subject to } & c_i(\mathbf{x}) \leq 0 \text{ for all } i \in \{1, \ldots, n\}, +\end{aligned}$$ + +where $f$ is the objective and the functions $c_i$ are constraint functions. To see what this does consider the case where $c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1$. In this case the parameters $\mathbf{x}$ are constrained to the unit ball. If a second constraint is $c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b$, then this corresponds to all $\mathbf{x}$ lying on a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of a ball. + +### Lagrangian + +In general, solving a constrained optimization problem is difficult. 
One way of addressing it stems from physics with a rather simple intuition. Imagine a ball inside a box. The ball will roll to the place that is lowest and the forces of gravity will be balanced out with the forces that the sides of the box can impose on the ball. In short, the gradient of the objective function (i.e., gravity) will be offset by the gradient of the constraint function (the ball need to remain inside the box by virtue of the walls "pushing back"). +Note that some constraints may not be active: +the walls that are not touched by the ball +will not be able to exert any force on the ball. + + +Skipping over the derivation of the *Lagrangian* $L$, +the above reasoning +can be expressed via the following saddle point optimization problem: + +$$L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^n \alpha_i c_i(\mathbf{x}) \text{ where } \alpha_i \geq 0.$$ + +Here the variables $\alpha_i$ ($i=1,\ldots,n$) are the so-called *Lagrange multipliers* that ensure that constraints are properly enforced. They are chosen just large enough to ensure that $c_i(\mathbf{x}) \leq 0$ for all $i$. For instance, for any $\mathbf{x}$ where $c_i(\mathbf{x}) < 0$ naturally, we'd end up picking $\alpha_i = 0$. Moreover, this is a saddle point optimization problem where one wants to *maximize* $L$ with respect to all $\alpha_i$ and simultaneously *minimize* it with respect to $\mathbf{x}$. There is a rich body of literature explaining how to arrive at the function $L(\mathbf{x}, \alpha_1, \ldots, \alpha_n)$. For our purposes it is sufficient to know that the saddle point of $L$ is where the original constrained optimization problem is solved optimally. + +### Penalties + +One way of satisfying constrained optimization problems at least *approximately* is to adapt the Lagrangian $L$. +Rather than satisfying $c_i(\mathbf{x}) \leq 0$ we simply add $\alpha_i c_i(\mathbf{x})$ to the objective function $f(x)$. This ensures that the constraints will not be violated too badly. + +In fact, we have been using this trick all along. Consider weight decay in :numref:`sec_weight_decay`. In it we add $\frac{\lambda}{2} \|\mathbf{w}\|^2$ to the objective function to ensure that $\mathbf{w}$ does not grow too large. From the constrained optimization point of view we can see that this will ensure that $\|\mathbf{w}\|^2 - r^2 \leq 0$ for some radius $r$. Adjusting the value of $\lambda$ allows us to vary the size of $\mathbf{w}$. + +In general, adding penalties is a good way of ensuring approximate constraint satisfaction. In practice this turns out to be much more robust than exact satisfaction. Furthermore, for nonconvex problems many of the properties that make the exact approach so appealing in the convex case (e.g., optimality) no longer hold. + +### Projections + +An alternative strategy for satisfying constraints is projections. Again, we encountered them before, e.g., when dealing with gradient clipping in :numref:`sec_rnn_scratch`. There we ensured that a gradient has length bounded by $\theta$ via + +$$\mathbf{g} \leftarrow \mathbf{g} \cdot \mathrm{min}(1, \theta/\|\mathbf{g}\|).$$ + +This turns out to be a *projection* of $\mathbf{g}$ onto the ball of radius $\theta$. More generally, a projection on a convex set $\mathcal{X}$ is defined as + +$$\mathrm{Proj}_\mathcal{X}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\|,$$ + +which is the closest point in $\mathcal{X}$ to $\mathbf{x}$. 
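+For the special case of the $L_2$ ball the projection has the closed form used in the clipping rule above. The following minimal sketch (plain NumPy with illustrative values; for a general convex set the argmin usually requires a solver) shows that points inside the ball are left unchanged while points outside are scaled back to the boundary.
+
+```python
+import numpy as np
+
+def proj_l2_ball(x, theta):
+    """Project x onto the L2 ball of radius theta: x * min(1, theta / ||x||)."""
+    norm = np.linalg.norm(x)
+    return x if norm <= theta else x * (theta / norm)
+
+theta = 1.0
+inside = np.array([0.3, -0.4])   # norm 0.5 <= theta: left unchanged
+outside = np.array([3.0, 4.0])   # norm 5 > theta: scaled to the boundary
+
+print(proj_l2_ball(inside, theta))   # [ 0.3 -0.4]
+print(proj_l2_ball(outside, theta))  # [0.6 0.8]
+```
+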
+ +![Convex Projections.](../img/projections.svg) +:label:`fig_projections` + +The mathematical definition of projections may sound a bit abstract. :numref:`fig_projections` explains it somewhat more clearly. In it we have two convex sets, a circle and a diamond. +Points inside both sets (yellow) remain unchanged during projections. +Points outside both sets (black) are projected to +the points inside the sets (red) that are closet to the original points (black). +While for $L_2$ balls this leaves the direction unchanged, this need not be the case in general, as can be seen in the case of the diamond. + + +One of the uses for convex projections is to compute sparse weight vectors. In this case we project weight vectors onto an $L_1$ ball, +which is a generalized version of the diamond case in :numref:`fig_projections`. + + +## Summary + +In the context of deep learning the main purpose of convex functions is to motivate optimization algorithms and help us understand them in detail. In the following we will see how gradient descent and stochastic gradient descent can be derived accordingly. + + +* Intersections of convex sets are convex. Unions are not. +* The expectation of a convex function is no less than the convex function of an expectation (Jensen's inequality). +* A twice-differentiable function is convex if and only if its Hessian (a matrix of second derivatives) is positive semidefinite. +* Convex constraints can be added via the Lagrangian. In practice we may simply add them with a penalty to the objective function. +* Projections map to points in the convex set closest to the original points. + +## Exercises + +1. Assume that we want to verify convexity of a set by drawing all lines between points within the set and checking whether the lines are contained. + 1. Prove that it is sufficient to check only the points on the boundary. + 1. Prove that it is sufficient to check only the vertices of the set. +1. Denote by $\mathcal{B}_p[r] \stackrel{\mathrm{def}}{=} \{\mathbf{x} | \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_p \leq r\}$ the ball of radius $r$ using the $p$-norm. Prove that $\mathcal{B}_p[r]$ is convex for all $p \geq 1$. +1. Given convex functions $f$ and $g$, show that $\mathrm{max}(f, g)$ is convex, too. Prove that $\mathrm{min}(f, g)$ is not convex. +1. Prove that the normalization of the softmax function is convex. More specifically prove the convexity of + $f(x) = \log \sum_i \exp(x_i)$. +1. Prove that linear subspaces, i.e., $\mathcal{X} = \{\mathbf{x} | \mathbf{W} \mathbf{x} = \mathbf{b}\}$, are convex sets. +1. Prove that in the case of linear subspaces with $\mathbf{b} = \mathbf{0}$ the projection $\mathrm{Proj}_\mathcal{X}$ can be written as $\mathbf{M} \mathbf{x}$ for some matrix $\mathbf{M}$. +1. Show that for twice-differentiable convex functions $f$ we can write $f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{1}{2} \epsilon^2 f''(x + \xi)$ for some $\xi \in [0, \epsilon]$. +1. Given a vector $\mathbf{w} \in \mathbb{R}^d$ with $\|\mathbf{w}\|_1 > 1$ compute the projection on the $L_1$ unit ball. + 1. As an intermediate step write out the penalized objective $\|\mathbf{w} - \mathbf{w}'\|^2 + \lambda \|\mathbf{w}'\|_1$ and compute the solution for a given $\lambda > 0$. + 1. Can you find the "right" value of $\lambda$ without a lot of trial and error? +1. 
Given a convex set $\mathcal{X}$ and two vectors $\mathbf{x}$ and $\mathbf{y}$, prove that projections never increase distances, i.e., $\|\mathbf{x} - \mathbf{y}\| \geq \|\mathrm{Proj}_\mathcal{X}(\mathbf{x}) - \mathrm{Proj}_\mathcal{X}(\mathbf{y})\|$. + + +[Discussions](https://discuss.d2l.ai/t/350) diff --git a/chapter_optimization/gd.md b/chapter_optimization/gd.md new file mode 100644 index 000000000..1c8534bea --- /dev/null +++ b/chapter_optimization/gd.md @@ -0,0 +1,323 @@ +# 渐变下降 +:label:`sec_gd` + +在本节中,我们将介绍 * 梯度下降 * 的基本概念。尽管很少在深度学习中直接使用,但了解梯度下降是了解随机梯度下降算法的关键。例如,由于学习率过高,优化问题可能会分歧。这种现象已经可以从梯度下降中看出来。同样,预处理是梯度下降的常见技术,可以继续使用更高级的算法。让我们从一个简单的特殊情况开始。 + +## 一维梯度下降 + +一个维度的梯度下降是一个很好的例子,可以解释为什么梯度下降算法可能会降低目标函数的值。考虑一些持续差异的实值函数 $f: \mathbb{R} \rightarrow \mathbb{R}$。使用泰勒扩张我们获得 + +$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2).$$ +:eqlabel:`gd-taylor` + +也就是说,一阶近似 $f(x+\epsilon)$ 是由函数值 $f(x)$ 和第一个导数 $f'(x)$ 给出的,为 $x$。假设对于小 $\epsilon$ 而言,朝负梯度方向移动将减少 $f$ 并非不合理。为了简单起见,我们选择固定的步长 $\eta > 0$ 然后选择 $\epsilon = -\eta f'(x)$。把这个插入上面的泰勒扩张我们得到了 + +$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x)).$$ +:eqlabel:`gd-taylor-2` + +如果衍生品 $f'(x) \neq 0$ 没有消失,我们会从 $\eta f'^2(x)>0$ 开始取得进展。此外,我们总是可以选择足够小的 $\eta$ 以使高阶条款变得无关紧要。因此我们到达 + +$$f(x - \eta f'(x)) \lessapprox f(x).$$ + +这意味着,如果我们使用 + +$$x \leftarrow x - \eta f'(x)$$ + +为了迭代 $x$,函数 $f(x)$ 的值可能会下降。因此,在梯度下降中,我们首先选择初始值 $x$ 和一个常数 $\eta > 0$,然后使用它们连续迭代 $x$ 直到达到停止条件,例如,当梯度 $|f'(x)|$ 的幅度足够小或迭代次数已达到一定价值。 + +为简单起见,我们选择目标函数 $f(x)=x^2$ 来说明如何实现梯度下降。尽管我们知道 $x=0$ 是最小化 $f(x)$ 的解决方案,但我们仍然使用这个简单的函数来观察 $x$ 如何变化。 + +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import numpy as np +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import numpy as np +import tensorflow as tf +``` + +```{.python .input} +#@tab all +def f(x): # Objective function + return x ** 2 + +def f_grad(x): # Gradient (derivative) of the objective function + return 2 * x +``` + +接下来,我们使用 $x=10$ 作为初始值,并假设 $\eta=0.2$。使用梯度下降对 $x$ 进行 10 次迭代,我们可以看到,最终,$x$ 的值接近最佳解决方案。 + +```{.python .input} +#@tab all +def gd(eta, f_grad): + x = 10.0 + results = [x] + for i in range(10): + x -= eta * f_grad(x) + results.append(float(x)) + print(f'epoch 10, x: {x:f}') + return results + +results = gd(0.2, f_grad) +``` + +如下所示,优化超过 $x$ 的进展情况。 + +```{.python .input} +#@tab all +def show_trace(results, f): + n = max(abs(min(results)), abs(max(results))) + f_line = d2l.arange(-n, n, 0.01) + d2l.set_figsize() + d2l.plot([f_line, results], [[f(x) for x in f_line], [ + f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o']) + +show_trace(results, f) +``` + +### 学习率 +:label:`subsec_gd-learningrate` + +学习率 $\eta$ 可以由算法设计师设置。如果我们使用的学习率太低,将导致 $x$ 的更新速度非常缓慢,需要更多的迭代才能获得更好的解决方案。要显示在这种情况下会发生什么,请考虑 $\eta = 0.05$ 的同一优化问题的进展情况。正如我们所看到的那样,即使在 10 个步骤之后,我们还远离最佳解决方案。 + +```{.python .input} +#@tab all +show_trace(gd(0.05, f_grad), f) +``` + +相反,如果我们使用过高的学习率,$\left|\eta f'(x)\right|$ 对于一阶泰勒扩张公式来说可能太大了。也就是说,:eqref:`gd-taylor-2` 中的术语 $\mathcal{O}(\eta^2 f'^2(x))$ 可能会变得重要。在这种情况下,我们无法保证 $x$ 的迭代能够降低 $f(x)$ 的值。例如,当我们将学习率设置为 $\eta=1.1$ 时,$x$ 超出了最佳解决方案 $x=0$ 并逐渐发散。 + +```{.python .input} +#@tab all +show_trace(gd(1.1, f_grad), f) +``` + +### 本地迷你 + +为了说明非凸函数会发生什么情况,请考虑 $f(x) = x \cdot \cos(cx)$ 对于某个常数 $c$ 的情况。这个函数有无限多个本地最小值。根据我们对学习率的选择以及问题的条件有多好,我们最终可能会找到许多解决方案之一。下面的例子说明了(不现实的)高学习率如何导致较差的本地最低水平。 + 
+```{.python .input} +#@tab all +c = d2l.tensor(0.15 * np.pi) + +def f(x): # Objective function + return x * d2l.cos(c * x) + +def f_grad(x): # Gradient of the objective function + return d2l.cos(c * x) - c * x * d2l.sin(c * x) + +show_trace(gd(2, f_grad), f) +``` + +## 多变量渐变下降 + +现在我们对单变量案有了更好的直觉,让我们来考虑 $\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$ 的情况。也就是说,目标函数 $f: \mathbb{R}^d \to \mathbb{R}$ 将向量映射为标量。相应地,它的渐变也是多变量的。它是由 $d$ 部分衍生品组成的向量: + +$$\nabla f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top.$$ + +梯度中的每个部分衍生元素 $\partial f(\mathbf{x})/\partial x_i$ 表示相对于输入 $x_i$ 的 $f$ 的变化率为 $\mathbf{x}$,为 $\mathbf{x}$。和以前一样,在单变量的情况下,我们可以使用相应的泰勒近似值作为多变量函数来了解我们应该做什么。特别是,我们有 + +$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \mathbf{\boldsymbol{\epsilon}}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2).$$ +:eqlabel:`gd-multi-taylor` + +换句话说,$\boldsymbol{\epsilon}$ 中的二阶术语,最陡的下降方向是由负梯度 $-\nabla f(\mathbf{x})$ 给出的。选择合适的学习率 $\eta > 0$ 可以产生原型的梯度下降算法: + +$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}).$$ + +要了解算法在实践中的行为,让我们构建一个目标函数 $f(\mathbf{x})=x_1^2+2x_2^2$,其中二维矢量 $\mathbf{x} = [x_1, x_2]^\top$ 作为输入,标量作为输出。梯度由 $\nabla f(\mathbf{x}) = [2x_1, 4x_2]^\top$ 给出。我们将通过从初始位置 $[-5, -2]$ 的梯度下降观察 $\mathbf{x}$ 的轨迹。 + +首先,我们还需要两个辅助函数。第一个使用更新函数并将其应用于初始值 20 次。第二个助手可视化了 $\mathbf{x}$ 的轨迹。 + +```{.python .input} +#@tab all +def train_2d(trainer, steps=20, f_grad=None): #@save + """Optimize a 2D objective function with a customized trainer.""" + # `s1` and `s2` are internal state variables that will be used later + x1, x2, s1, s2 = -5, -2, 0, 0 + results = [(x1, x2)] + for i in range(steps): + if f_grad: + x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad) + else: + x1, x2, s1, s2 = trainer(x1, x2, s1, s2) + results.append((x1, x2)) + print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}') + return results + +def show_trace_2d(f, results): #@save + """Show the trace of 2D variables during optimization.""" + d2l.set_figsize() + d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e') + x1, x2 = d2l.meshgrid(d2l.arange(-5.5, 1.0, 0.1), + d2l.arange(-3.0, 1.0, 0.1)) + d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4') + d2l.plt.xlabel('x1') + d2l.plt.ylabel('x2') +``` + +接下来,我们观察学习率 $\eta = 0.1$ 的优化变量 $\mathbf{x}$ 的轨迹。我们可以看到,经过 20 个步骤,$\mathbf{x}$ 的价值接近 $[0, 0]$ 的最低水平。进展情况相当不错,尽管相当缓慢。 + +```{.python .input} +#@tab all +def f_2d(x1, x2): # Objective function + return x1 ** 2 + 2 * x2 ** 2 + +def f_2d_grad(x1, x2): # Gradient of the objective function + return (2 * x1, 4 * x2) + +def gd_2d(x1, x2, s1, s2, f_grad): + g1, g2 = f_grad(x1, x2) + return (x1 - eta * g1, x2 - eta * g2, 0, 0) + +eta = 0.1 +show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad)) +``` + +## 自适应方法 + +正如我们在 :numref:`subsec_gd-learningrate` 中看到的那样,获得 $\eta$ “恰到好处” 的学习率是棘手的。如果我们选择太小,我们就没有什么进展。如果我们选择太大,解决方案就会振荡,在最坏的情况下,它甚至可能会分歧。如果我们可以自动确定 $\eta$ 或者根本不必选择学习率,该怎么办?二阶方法不仅看目标函数的价值和梯度,而且还查看其 *curvature* 在这种情况下可以有所帮助。虽然这些方法由于计算成本不能直接应用于深度学习,但它们为如何设计模仿下面概述的算法的许多理想属性的高级优化算法提供了有用的直觉。 + +### 牛顿的方法 + +回顾泰勒对某些职能 $f: \mathbb{R}^d \rightarrow \mathbb{R}$ 的扩张,在第一个任期之后没有必要停止。事实上,我们可以把它写成 + +$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3).$$ +:eqlabel:`gd-hot-taylor` + +为了避免繁琐的符号,我们将 $\mathbf{H} \stackrel{\mathrm{def}}{=} 
\nabla^2 f(\mathbf{x})$ 定义为 $f$ 的黑森语,这是一个 $d \times d$ 矩阵。对于小型 $d$ 和简单的问题,$\mathbf{H}$ 很容易计算。另一方面,对于深度神经网络而言,由于存储 $\mathcal{O}(d^2)$ 条条目的成本,$\mathbf{H}$ 可能太大。此外,通过反向传播进行计算可能太昂贵。现在让我们忽略这些考虑因素,看看我们会得到什么算法。 + +毕竟,最低的 $f$ 满足 $\nabla f = 0$。遵循 :numref:`subsec_calculus-grad` 中的微积分规则,采取 :eqref:`gd-hot-taylor` 的衍生品,对 $\boldsymbol{\epsilon}$ 的衍生品,忽略了我们得出的高阶条款 + +$$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0 \text{ and hence } +\boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x}).$$ + +也就是说,作为优化问题的一部分,我们需要反转黑森州 $\mathbf{H}$。 + +作为一个简单的例子,对于 $f(x) = \frac{1}{2} x^2$,我们有 $\nabla f(x) = x$ 和 $\mathbf{H} = 1$。因此,对于任何 $x$,我们获得了 $\epsilon = -x$。换句话说,*Single* 步骤足以完美收敛而无需进行任何调整!唉,我们在这里有点幸运:泰勒的扩张自 $f(x+\epsilon)= \frac{1}{2} x^2 + \epsilon x + \frac{1}{2} \epsilon^2$ 以来就是准确的。 + +让我们看看其他问题会发生什么。给定一些常数 $c$ 的凸双余弦函数 $f(x) = \cosh(cx)$,我们可以看到,经过几次迭代后达到了 $x=0$ 的全局最低值。 + +```{.python .input} +#@tab all +c = d2l.tensor(0.5) + +def f(x): # Objective function + return d2l.cosh(c * x) + +def f_grad(x): # Gradient of the objective function + return c * d2l.sinh(c * x) + +def f_hess(x): # Hessian of the objective function + return c**2 * d2l.cosh(c * x) + +def newton(eta=1): + x = 10.0 + results = [x] + for i in range(10): + x -= eta * f_grad(x) / f_hess(x) + results.append(float(x)) + print('epoch 10, x:', x) + return results + +show_trace(newton(), f) +``` + +现在让我们考虑一个 * 非凸 * 函数,例如 $f(x) = x \cos(c x)$ 对于某些常数 $c$。毕竟,请注意,在牛顿的方法中,我们最终被黑森人划分。这意味着,如果第二个衍生品为 * 负 * 我们可能会走向 * 增加 * 值 $f$ 的方向。这是算法的一个致命缺陷。让我们看看实际中会发生什么。 + +```{.python .input} +#@tab all +c = d2l.tensor(0.15 * np.pi) + +def f(x): # Objective function + return x * d2l.cos(c * x) + +def f_grad(x): # Gradient of the objective function + return d2l.cos(c * x) - c * x * d2l.sin(c * x) + +def f_hess(x): # Hessian of the objective function + return - 2 * c * d2l.sin(c * x) - x * c**2 * d2l.cos(c * x) + +show_trace(newton(), f) +``` + +这出现了极大的错误。我们怎么能修复它?一种方法是通过取代其绝对值来 “修复” 黑森人。另一种策略是恢复学习率。这似乎破坏了目的,但并非完全。拥有二阶信息可以让我们在曲率较大时保持谨慎态度,并在客观功能更平坦的情况下采取更长的步骤。比如 $\eta = 0.5$,让我们来看看这是如何在稍低的学习率下工作的。正如我们所看到的那样,我们有一个非常有效的算法。 + +```{.python .input} +#@tab all +show_trace(newton(0.5), f) +``` + +### 收敛性分析 + +我们只分析牛顿方法的收敛率为一些凸和三倍可差分目标函数 $f$,其中第二个导数为非零值,即 $f'' > 0$。多变量证明是下面一维论点的直接延伸,省略了,因为它在直觉方面没有太大帮助。 + +用 $x^{(k)}$ 表示 $k^\mathrm{th}$ 迭代时 $x$ 的值,让 $e^{(k)} \stackrel{\mathrm{def}}{=} x^{(k)} - x^*$ 成为 $k^\mathrm{th}$ 迭代时与最优性的距离。通过泰勒扩张我们有条件 $f'(x^*) = 0$ 可以写成 + +$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \frac{1}{2} (e^{(k)})^2 f'''(\xi^{(k)}),$$ + +这支持了大约 $\xi^{(k)} \in [x^{(k)} - e^{(k)}, x^{(k)}]$。将上述扩张除以 $f''(x^{(k)})$ 收益率 + +$$e^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} = \frac{1}{2} (e^{(k)})^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}.$$ + +回想一下,我们有更新 $x^{(k+1)} = x^{(k)} - f'(x^{(k)}) / f''(x^{(k)})$。插入这个更新方程式,并且考虑双方的绝对价值,我们有 + +$$\left|e^{(k+1)}\right| = \frac{1}{2}(e^{(k)})^2 \frac{\left|f'''(\xi^{(k)})\right|}{f''(x^{(k)})}.$$ + +因此,每当我们处于 $\left|f'''(\xi^{(k)})\right| / (2f''(x^{(k)})) \leq c$ 的边界区域时,我们都会出现二次递减的误差 + +$$\left|e^{(k+1)}\right| \leq c (e^{(k)})^2.$$ + +顺便说一句,优化研究人员称之为 * 线性 * 收敛,而像 $\left|e^{(k+1)}\right| \leq \alpha \left|e^{(k)}\right|$ 这样的条件将被称为 * 恒定 * 收敛率。请注意,此分析附带了一些注意事项。首先,我们实际上没有太多的保证,我们何时能够到达迅速趋同的区域。相反,我们只知道一旦我们达到这一目标,趋同将非常快。其次,这项分析要求 $f$ 在高阶衍生品之前表现良好。归结为确保 $f$ 在如何改变其价值方面没有任何 “令人惊讶的” 属性。 + +### 预处理 + +毫不奇怪,计算和存储完整的 Hessian 是非常昂贵的。因此,寻找替代办法是可取的。改善问题的一种方法是 * 先决条件 *。它避免了完整计算黑森语,但只计算 * 对角 * 条目。这会导致更新表单的算法 + +$$\mathbf{x} \leftarrow \mathbf{x} - \eta \mathrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x}).$$ + 
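+As a quick illustration of this update, consider the sketch below (plain NumPy with an assumed, badly scaled toy quadratic; not part of the book's code base). The two coordinates live on very different scales, so plain gradient descent with a learning rate small enough to keep the steep coordinate stable barely moves the flat one, whereas dividing each gradient coordinate by the corresponding diagonal Hessian entry rescales both at once.
+
+```python
+import numpy as np
+
+a = np.array([0.1, 10.0])   # toy objective f(x) = 0.5 * (a[0]*x1^2 + a[1]*x2^2)
+grad = lambda x: a * x      # gradient of f
+hess_diag = a               # diagonal entries of the Hessian of f
+
+x_gd = np.array([5.0, 5.0])
+x_pre = np.array([5.0, 5.0])
+for _ in range(10):
+    x_gd = x_gd - 0.1 * grad(x_gd)                   # plain gradient descent
+    x_pre = x_pre - 1.0 * grad(x_pre) / hess_diag    # diagonally preconditioned step
+
+print('gradient descent:', x_gd)    # the flat coordinate x1 has barely moved
+print('preconditioned:  ', x_pre)   # both coordinates reach the minimum
+```
+
+For this quadratic the diagonal of the Hessian is exact, so a single preconditioned step with $\eta=1$ lands on the minimum; in general, preconditioning only rescales the coordinates and the learning rate still has to be chosen with care.
+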
+尽管这不如完整的牛顿方法那么好,但它仍然比不使用它好得多。要了解为什么这可能是个好主意,请考虑一个变量表示以毫米为单位的高度,另一个变量表示高度(以千米为单位)。假设两种自然尺度都以米为单位,那么我们在参数化方面存在严重的不匹配。幸运的是,使用预处理消除了这一点使用梯度下降进行有效预处理等于为每个变量(矢量 $\mathbf{x}$ 的坐标)选择不同的学习率。正如我们稍后将看到的那样,预处理推动了随机梯度下降优化算法的一些创新。 + +### 使用线搜索进行渐变下降 + +梯度下降的关键问题之一是,我们可能会超过目标或进展不足。问题的一个简单解决方法是将行搜索结合梯度下降结合使用。也就是说,我们使用 $\nabla f(\mathbf{x})$ 给出的方向,然后对学习率 $\eta$ 最小化 $f(\mathbf{x} - \eta \nabla f(\mathbf{x}))$ 进行二进制搜索。 + +该算法迅速收敛(有关分析和证明,请参见 :cite:`Boyd.Vandenberghe.2004`)。但是,为了深度学习的目的,这并不是那么可行,因为行搜索的每一步都要求我们评估整个数据集的目标函数。这太昂贵了,难以完成。 + +## 摘要 + +* 学习率很重要。太大而且我们分歧,太小了,我们没有取得进展。 +* 渐变下降可能会陷入局部最小值。 +* 在高维度上,调整学习率很复杂。 +* 预处理可以帮助调整比例。 +* 牛顿的方法一旦开始在凸出的问题中正常工作,就会快得多。 +* 小心使用牛顿的方法而不对非凸问题进行任何调整。 + +## 练习 + +1. 尝试不同的学习率和客观函数来实现梯度下降。 +1. 在 $[a, b]$ 的时间间隔内实施行搜索以最大限度地减少凸函数。 + 1. 你是否需要衍生品进行二进制搜索,即决定是选择 $[a, (a+b)/2]$ 还是 $[(a+b)/2, b]$。 + 1. 算法的收敛速度有多快? + 1. 实施算法并将其应用到最小化 $\log (\exp(x) + \exp(-2x -3))$。 +1. 设计 $\mathbb{R}^2$ 上定义的客观函数,其中梯度下降速度非常缓慢。提示:不同的缩放不同的坐标。 +1. 使用预处理实现牛顿方法的轻量级版本: + 1. 使用对角线 Hessian 作为预调器。 + 1. 使用该值的绝对值,而不是实际(可能有符号)值。 + 1. 将此应用于上述问题。 +1. 将上述算法应用于许多客观函数(凸与否)。如果你将坐标旋转 $45$ 度会发生什么? + +[Discussions](https://discuss.d2l.ai/t/351) diff --git a/chapter_optimization/gd_origin.md b/chapter_optimization/gd_origin.md new file mode 100644 index 000000000..c80b415e7 --- /dev/null +++ b/chapter_optimization/gd_origin.md @@ -0,0 +1,348 @@ +# Gradient Descent +:label:`sec_gd` + +In this section we are going to introduce the basic concepts underlying *gradient descent*. +Although it is rarely used directly in deep learning, an understanding of gradient descent is key to understanding stochastic gradient descent algorithms. +For instance, the optimization problem might diverge due to an overly large learning rate. This phenomenon can already be seen in gradient descent. Likewise, preconditioning is a common technique in gradient descent and carries over to more advanced algorithms. +Let us start with a simple special case. + + +## One-Dimensional Gradient Descent + +Gradient descent in one dimension is an excellent example to explain why the gradient descent algorithm may reduce the value of the objective function. Consider some continuously differentiable real-valued function $f: \mathbb{R} \rightarrow \mathbb{R}$. Using a Taylor expansion we obtain + +$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2).$$ +:eqlabel:`gd-taylor` + +That is, in first-order approximation $f(x+\epsilon)$ is given by the function value $f(x)$ and the first derivative $f'(x)$ at $x$. It is not unreasonable to assume that for small $\epsilon$ moving in the direction of the negative gradient will decrease $f$. To keep things simple we pick a fixed step size $\eta > 0$ and choose $\epsilon = -\eta f'(x)$. Plugging this into the Taylor expansion above we get + +$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x)).$$ +:eqlabel:`gd-taylor-2` + +If the derivative $f'(x) \neq 0$ does not vanish we make progress since $\eta f'^2(x)>0$. Moreover, we can always choose $\eta$ small enough for the higher-order terms to become irrelevant. Hence we arrive at + +$$f(x - \eta f'(x)) \lessapprox f(x).$$ + +This means that, if we use + +$$x \leftarrow x - \eta f'(x)$$ + +to iterate $x$, the value of function $f(x)$ might decline. 
Therefore, in gradient descent we first choose an initial value $x$ and a constant $\eta > 0$ and then use them to continuously iterate $x$ until the stop condition is reached, for example, when the magnitude of the gradient $|f'(x)|$ is small enough or the number of iterations has reached a certain value. + +For simplicity we choose the objective function $f(x)=x^2$ to illustrate how to implement gradient descent. Although we know that $x=0$ is the solution to minimize $f(x)$, we still use this simple function to observe how $x$ changes. + +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import numpy as np +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import numpy as np +import tensorflow as tf +``` + +```{.python .input} +#@tab all +def f(x): # Objective function + return x ** 2 + +def f_grad(x): # Gradient (derivative) of the objective function + return 2 * x +``` + +Next, we use $x=10$ as the initial value and assume $\eta=0.2$. Using gradient descent to iterate $x$ for 10 times we can see that, eventually, the value of $x$ approaches the optimal solution. + +```{.python .input} +#@tab all +def gd(eta, f_grad): + x = 10.0 + results = [x] + for i in range(10): + x -= eta * f_grad(x) + results.append(float(x)) + print(f'epoch 10, x: {x:f}') + return results + +results = gd(0.2, f_grad) +``` + +The progress of optimizing over $x$ can be plotted as follows. + +```{.python .input} +#@tab all +def show_trace(results, f): + n = max(abs(min(results)), abs(max(results))) + f_line = d2l.arange(-n, n, 0.01) + d2l.set_figsize() + d2l.plot([f_line, results], [[f(x) for x in f_line], [ + f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o']) + +show_trace(results, f) +``` + +### Learning Rate +:label:`subsec_gd-learningrate` + +The learning rate $\eta$ can be set by the algorithm designer. If we use a learning rate that is too small, it will cause $x$ to update very slowly, requiring more iterations to get a better solution. To show what happens in such a case, consider the progress in the same optimization problem for $\eta = 0.05$. As we can see, even after 10 steps we are still very far from the optimal solution. + +```{.python .input} +#@tab all +show_trace(gd(0.05, f_grad), f) +``` + +Conversely, if we use an excessively high learning rate, $\left|\eta f'(x)\right|$ might be too large for the first-order Taylor expansion formula. That is, the term $\mathcal{O}(\eta^2 f'^2(x))$ in :eqref:`gd-taylor-2` might become significant. In this case, we cannot guarantee that the iteration of $x$ will be able to lower the value of $f(x)$. For example, when we set the learning rate to $\eta=1.1$, $x$ overshoots the optimal solution $x=0$ and gradually diverges. + +```{.python .input} +#@tab all +show_trace(gd(1.1, f_grad), f) +``` + +### Local Minima + +To illustrate what happens for nonconvex functions consider the case of $f(x) = x \cdot \cos(cx)$ for some constant $c$. This function has infinitely many local minima. Depending on our choice of the learning rate and depending on how well conditioned the problem is, we may end up with one of many solutions. The example below illustrates how an (unrealistically) high learning rate will lead to a poor local minimum. 
+ +```{.python .input} +#@tab all +c = d2l.tensor(0.15 * np.pi) + +def f(x): # Objective function + return x * d2l.cos(c * x) + +def f_grad(x): # Gradient of the objective function + return d2l.cos(c * x) - c * x * d2l.sin(c * x) + +show_trace(gd(2, f_grad), f) +``` + +## Multivariate Gradient Descent + +Now that we have a better intuition of the univariate case, let us consider the situation where $\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$. That is, the objective function $f: \mathbb{R}^d \to \mathbb{R}$ maps vectors into scalars. Correspondingly its gradient is multivariate, too. It is a vector consisting of $d$ partial derivatives: + +$$\nabla f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top.$$ + +Each partial derivative element $\partial f(\mathbf{x})/\partial x_i$ in the gradient indicates the rate of change of $f$ at $\mathbf{x}$ with respect to the input $x_i$. As before in the univariate case we can use the corresponding Taylor approximation for multivariate functions to get some idea of what we should do. In particular, we have that + +$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \mathbf{\boldsymbol{\epsilon}}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2).$$ +:eqlabel:`gd-multi-taylor` + +In other words, up to second-order terms in $\boldsymbol{\epsilon}$ the direction of steepest descent is given by the negative gradient $-\nabla f(\mathbf{x})$. Choosing a suitable learning rate $\eta > 0$ yields the prototypical gradient descent algorithm: + +$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}).$$ + +To see how the algorithm behaves in practice let us construct an objective function $f(\mathbf{x})=x_1^2+2x_2^2$ with a two-dimensional vector $\mathbf{x} = [x_1, x_2]^\top$ as input and a scalar as output. The gradient is given by $\nabla f(\mathbf{x}) = [2x_1, 4x_2]^\top$. We will observe the trajectory of $\mathbf{x}$ by gradient descent from the initial position $[-5, -2]$. + +To begin with, we need two more helper functions. The first uses an update function and applies it 20 times to the initial value. The second helper visualizes the trajectory of $\mathbf{x}$. + +```{.python .input} +#@tab all +def train_2d(trainer, steps=20, f_grad=None): #@save + """Optimize a 2D objective function with a customized trainer.""" + # `s1` and `s2` are internal state variables that will be used later + x1, x2, s1, s2 = -5, -2, 0, 0 + results = [(x1, x2)] + for i in range(steps): + if f_grad: + x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad) + else: + x1, x2, s1, s2 = trainer(x1, x2, s1, s2) + results.append((x1, x2)) + print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}') + return results + +def show_trace_2d(f, results): #@save + """Show the trace of 2D variables during optimization.""" + d2l.set_figsize() + d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e') + x1, x2 = d2l.meshgrid(d2l.arange(-5.5, 1.0, 0.1), + d2l.arange(-3.0, 1.0, 0.1)) + d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4') + d2l.plt.xlabel('x1') + d2l.plt.ylabel('x2') +``` + +Next, we observe the trajectory of the optimization variable $\mathbf{x}$ for learning rate $\eta = 0.1$. We can see that after 20 steps the value of $\mathbf{x}$ approaches its minimum at $[0, 0]$. Progress is fairly well-behaved albeit rather slow. 
+ +```{.python .input} +#@tab all +def f_2d(x1, x2): # Objective function + return x1 ** 2 + 2 * x2 ** 2 + +def f_2d_grad(x1, x2): # Gradient of the objective function + return (2 * x1, 4 * x2) + +def gd_2d(x1, x2, s1, s2, f_grad): + g1, g2 = f_grad(x1, x2) + return (x1 - eta * g1, x2 - eta * g2, 0, 0) + +eta = 0.1 +show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad)) +``` + +## Adaptive Methods + +As we could see in :numref:`subsec_gd-learningrate`, getting the learning rate $\eta$ "just right" is tricky. If we pick it too small, we make little progress. If we pick it too large, the solution oscillates and in the worst case it might even diverge. What if we could determine $\eta$ automatically or get rid of having to select a learning rate at all? +Second-order methods that look not only at the value and gradient of the objective function +but also at its *curvature* can help in this case. While these methods cannot be applied to deep learning directly due to the computational cost, they provide useful intuition into how to design advanced optimization algorithms that mimic many of the desirable properties of the algorithms outlined below. + + +### Newton's Method + +Reviewing the Taylor expansion of some function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ there is no need to stop after the first term. In fact, we can write it as + +$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3).$$ +:eqlabel:`gd-hot-taylor` + +To avoid cumbersome notation we define $\mathbf{H} \stackrel{\mathrm{def}}{=} \nabla^2 f(\mathbf{x})$ to be the Hessian of $f$, which is a $d \times d$ matrix. For small $d$ and simple problems $\mathbf{H}$ is easy to compute. For deep neural networks, on the other hand, $\mathbf{H}$ may be prohibitively large, due to the cost of storing $\mathcal{O}(d^2)$ entries. Furthermore it may be too expensive to compute via backpropagation. For now let us ignore such considerations and look at what algorithm we would get. + +After all, the minimum of $f$ satisfies $\nabla f = 0$. +Following calculus rules in :numref:`subsec_calculus-grad`, +by taking derivatives of :eqref:`gd-hot-taylor` with regard to $\boldsymbol{\epsilon}$ and ignoring higher-order terms we arrive at + +$$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0 \text{ and hence } +\boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x}).$$ + +That is, we need to invert the Hessian $\mathbf{H}$ as part of the optimization problem. + +As a simple example, for $f(x) = \frac{1}{2} x^2$ we have $\nabla f(x) = x$ and $\mathbf{H} = 1$. Hence for any $x$ we obtain $\epsilon = -x$. In other words, a *single* step is sufficient to converge perfectly without the need for any adjustment! Alas, we got a bit lucky here: the Taylor expansion was exact since $f(x+\epsilon)= \frac{1}{2} x^2 + \epsilon x + \frac{1}{2} \epsilon^2$. + +Let us see what happens in other problems. +Given a convex hyperbolic cosine function $f(x) = \cosh(cx)$ for some constant $c$, we can see that +the global minimum at $x=0$ is reached +after a few iterations. 
+
+```{.python .input}
+#@tab all
+c = d2l.tensor(0.5)
+
+def f(x):  # Objective function
+    return d2l.cosh(c * x)
+
+def f_grad(x):  # Gradient of the objective function
+    return c * d2l.sinh(c * x)
+
+def f_hess(x):  # Hessian of the objective function
+    return c**2 * d2l.cosh(c * x)
+
+def newton(eta=1):
+    x = 10.0
+    results = [x]
+    for i in range(10):
+        x -= eta * f_grad(x) / f_hess(x)
+        results.append(float(x))
+    print('epoch 10, x:', x)
+    return results
+
+show_trace(newton(), f)
+```
+
+Now let us consider a *nonconvex* function, such as $f(x) = x \cos(c x)$ for some constant $c$. After all, note that in Newton's method we end up dividing by the Hessian. This means that if the second derivative is *negative* we may walk into the direction of *increasing* the value of $f$.
+That is a fatal flaw of the algorithm.
+Let us see what happens in practice.
+
+```{.python .input}
+#@tab all
+c = d2l.tensor(0.15 * np.pi)
+
+def f(x):  # Objective function
+    return x * d2l.cos(c * x)
+
+def f_grad(x):  # Gradient of the objective function
+    return d2l.cos(c * x) - c * x * d2l.sin(c * x)
+
+def f_hess(x):  # Hessian of the objective function
+    return - 2 * c * d2l.sin(c * x) - x * c**2 * d2l.cos(c * x)
+
+show_trace(newton(), f)
+```
+
+This went spectacularly wrong. How can we fix it? One way would be to "fix" the Hessian by taking its absolute value instead. Another strategy is to bring back the learning rate. This seems to defeat the purpose, but not quite. Having second-order information allows us to be cautious whenever the curvature is large and to take longer steps whenever the objective function is flatter.
+Let us see how this works with a slightly smaller learning rate, say $\eta = 0.5$. As we can see, we have quite an efficient algorithm.
+
+```{.python .input}
+#@tab all
+show_trace(newton(0.5), f)
+```
+
+### Convergence Analysis
+
+We only analyze the convergence rate of Newton's method for some convex and three times differentiable objective function $f$, where the second derivative is nonzero, i.e., $f'' > 0$. The multivariate proof is a straightforward extension of the one-dimensional argument below and omitted since it does not help us much in terms of intuition.
+
+Denote by $x^{(k)}$ the value of $x$ at the $k^\mathrm{th}$ iteration and let $e^{(k)} \stackrel{\mathrm{def}}{=} x^{(k)} - x^*$ be the distance from optimality at the $k^\mathrm{th}$ iteration. By Taylor expansion we have that the condition $f'(x^*) = 0$ can be written as
+
+$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \frac{1}{2} (e^{(k)})^2 f'''(\xi^{(k)}),$$
+
+which holds for some $\xi^{(k)} \in [x^{(k)} - e^{(k)}, x^{(k)}]$. Dividing the above expansion by $f''(x^{(k)})$ yields
+
+$$e^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} = \frac{1}{2} (e^{(k)})^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}.$$
+
+Recall that we have the update $x^{(k+1)} = x^{(k)} - f'(x^{(k)}) / f''(x^{(k)})$.
+Plugging in this update equation and taking the absolute value of both sides, we have
+
+$$\left|e^{(k+1)}\right| = \frac{1}{2}(e^{(k)})^2 \frac{\left|f'''(\xi^{(k)})\right|}{f''(x^{(k)})}.$$
+
+Consequently, whenever we are in a region where $\left|f'''(\xi^{(k)})\right| / (2f''(x^{(k)})) \leq c$, we have a quadratically decreasing error
+
+$$\left|e^{(k+1)}\right| \leq c (e^{(k)})^2.$$
+
+
+As an aside, this is what optimization researchers call *quadratic* convergence: the contraction factor $c \left|e^{(k)}\right|$ shrinks together with the error itself, whereas a condition such as $\left|e^{(k+1)}\right| \leq \alpha \left|e^{(k)}\right|$ for some fixed $\alpha < 1$ would be called a *linear* rate of convergence.
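+
+This error recursion is easy to check numerically. The sketch below is a minimal, self-contained illustration with plain Python floats rather than the d2l tensors used elsewhere in this section; the convex function $f(x) = x^2 + e^x$ is a hypothetical example chosen so that $f'''$ does not vanish at the minimizer. It prints the ratio $\left|e^{(k+1)}\right|/(e^{(k)})^2$, which should settle near the constant $\left|f'''(x^*)\right|/(2 f''(x^*))$.
+
+```{.python .input}
+#@tab all
+import math
+
+def fp(x):    # First derivative of f(x) = x**2 + exp(x)
+    return 2 * x + math.exp(x)
+
+def fpp(x):   # Second derivative
+    return 2 + math.exp(x)
+
+def fppp(x):  # Third derivative
+    return math.exp(x)
+
+xk, iterates = 2.0, [2.0]
+for _ in range(8):  # Newton updates x <- x - f'(x) / f''(x)
+    xk -= fp(xk) / fpp(xk)
+    iterates.append(xk)
+
+x_star = iterates[-1]  # Use the last iterate as a proxy for the minimizer
+print(f"limiting ratio: {fppp(x_star) / (2 * fpp(x_star)):.4f}")
+for k in range(4):
+    e_k, e_k1 = iterates[k] - x_star, iterates[k + 1] - x_star
+    print(f"k={k}, |e(k+1)| / e(k)^2 = {abs(e_k1) / e_k**2:.4f}")
+```
+
+On the hyperbolic cosine example above the observed decay is even faster, since $f'''(0) = 0$ there.
+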
+Note that this analysis comes with a number of caveats. +First, we do not really have much of a guarantee when we will reach the region of rapid convergence. Instead, we only know that once we reach it, convergence will be very quick. Second, this analysis requires that $f$ is well-behaved up to higher-order derivatives. It comes down to ensuring that $f$ does not have any "surprising" properties in terms of how it might change its values. + + + +### Preconditioning + +Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus desirable to find alternatives. One way to improve matters is *preconditioning*. It avoids computing the Hessian in its entirety but only computes the *diagonal* entries. This leads to update algorithms of the form + +$$\mathbf{x} \leftarrow \mathbf{x} - \eta \mathrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x}).$$ + + +While this is not quite as good as the full Newton's method, it is still much better than not using it. +To see why this might be a good idea consider a situation where one variable denotes height in millimeters and the other one denotes height in kilometers. Assuming that for both the natural scale is in meters, we have a terrible mismatch in parameterizations. Fortunately, using preconditioning removes this. Effectively preconditioning with gradient descent amounts to selecting a different learning rate for each variable (coordinate of vector $\mathbf{x}$). +As we will see later, preconditioning drives some of the innovation in stochastic gradient descent optimization algorithms. + + +### Gradient Descent with Line Search + +One of the key problems in gradient descent is that we might overshoot the goal or make insufficient progress. A simple fix for the problem is to use line search in conjunction with gradient descent. That is, we use the direction given by $\nabla f(\mathbf{x})$ and then perform binary search as to which learning rate $\eta$ minimizes $f(\mathbf{x} - \eta \nabla f(\mathbf{x}))$. + +This algorithm converges rapidly (for an analysis and proof see e.g., :cite:`Boyd.Vandenberghe.2004`). However, for the purpose of deep learning this is not quite so feasible, since each step of the line search would require us to evaluate the objective function on the entire dataset. This is way too costly to accomplish. + +## Summary + +* Learning rates matter. Too large and we diverge, too small and we do not make progress. +* Gradient descent can get stuck in local minima. +* In high dimensions adjusting the learning rate is complicated. +* Preconditioning can help with scale adjustment. +* Newton's method is a lot faster once it has started working properly in convex problems. +* Beware of using Newton's method without any adjustments for nonconvex problems. + +## Exercises + +1. Experiment with different learning rates and objective functions for gradient descent. +1. Implement line search to minimize a convex function in the interval $[a, b]$. + 1. Do you need derivatives for binary search, i.e., to decide whether to pick $[a, (a+b)/2]$ or $[(a+b)/2, b]$. + 1. How rapid is the rate of convergence for the algorithm? + 1. Implement the algorithm and apply it to minimizing $\log (\exp(x) + \exp(-2x -3))$. +1. Design an objective function defined on $\mathbb{R}^2$ where gradient descent is exceedingly slow. Hint: scale different coordinates differently. +1. Implement the lightweight version of Newton's method using preconditioning: + 1. Use diagonal Hessian as preconditioner. + 1. 
Use the absolute values of that rather than the actual (possibly signed) values. + 1. Apply this to the problem above. +1. Apply the algorithm above to a number of objective functions (convex or not). What happens if you rotate coordinates by $45$ degrees? + +[Discussions](https://discuss.d2l.ai/t/351) diff --git a/chapter_optimization/index.md b/chapter_optimization/index.md index e0833c653..91852e96c 100644 --- a/chapter_optimization/index.md +++ b/chapter_optimization/index.md @@ -1,18 +1,11 @@ # 优化算法 :label:`chap_optimization` -到目前为止,如果你按顺序阅读本书,你已经学会使用许多优化算法来训练深度学习模型。 -它们是允许我们继续更新模型参数和最小化损失函数值的工具。 -的确,很多人都愿意将优化视为“黑盒设备”,拥有一些使用深度学习优化“魔法”的知识,就能够基于简单的设置实现目标函数的最小化。 +如果您在此之前按顺序阅读这本书,则已经使用了许多优化算法来训练深度学习模型。这些工具使我们能够继续更新模型参数并最大限度地减少损失函数的价值,正如培训集评估的那样。事实上,任何人满意将优化视为黑盒装置,以便在简单的环境中最大限度地减少客观功能,都可能会知道存在着一系列此类程序的咒语(名称如 “SGD” 和 “亚当”)。 -然而,优化算法对于深度学习是很重要的,因此学习一些更深层次的知识可以更好地优化。 -一方面,训练一个复杂的深度学习模型可能需要数小时、数天甚至数周的时间,而优化算法的性能将直接影响模型的训练效率。 -另一方面,了解不同优化算法的原理及其超参数的作用,可以有针对性地调整超参数,提高深度学习模型的性能。 +但是,为了做得好,还需要更深入的知识。优化算法对于深度学习非常重要。一方面,训练复杂的深度学习模型可能需要数小时、几天甚至数周。优化算法的性能直接影响模型的训练效率。另一方面,了解不同优化算法的原则及其超参数的作用将使我们能够以有针对性的方式调整超参数,以提高深度学习模型的性能。 -在本章中,我们将深入探讨常见的深度学习优化算法。 -在深度学习中,几乎所有的优化问题都是 *非凸的*(nonconvex)。 -尽管如此,在 *凸问题* 的背景下设计和分析算法已经被证明是非常有益的。 -基于这个原因,本章包括了关于凸优化的入门,和一个非常简单的随机梯度下降算法在凸目标函数上的证明。 +在本章中,我们深入探讨常见的深度学习优化算法。深度学习中出现的几乎所有优化问题都是 * nonconvex*。尽管如此,在 *CONVex* 问题背景下设计和分析算法是非常有启发性的。正是出于这个原因,本章包括了凸优化的入门,以及凸目标函数上非常简单的随机梯度下降算法的证明。 ```toc :maxdepth: 2 @@ -28,4 +21,4 @@ rmsprop adadelta adam lr-scheduler -``` \ No newline at end of file +``` diff --git a/chapter_optimization/index_origin.md b/chapter_optimization/index_origin.md new file mode 100644 index 000000000..dd252c0c6 --- /dev/null +++ b/chapter_optimization/index_origin.md @@ -0,0 +1,34 @@ +# Optimization Algorithms +:label:`chap_optimization` + +If you read the book in sequence up to this point you already used a number of optimization algorithms to train deep learning models. +They were the tools that allowed us to continue updating model parameters and to minimize the value of the loss function, as evaluated on the training set. Indeed, anyone content with treating optimization as a black box device to minimize objective functions in a simple setting might well content oneself with the knowledge that there exists an array of incantations of such a procedure (with names such as "SGD" and "Adam"). + +To do well, however, some deeper knowledge is required. +Optimization algorithms are important for deep learning. +On one hand, training a complex deep learning model can take hours, days, or even weeks. +The performance of the optimization algorithm directly affects the model's training efficiency. +On the other hand, understanding the principles of different optimization algorithms and the role of their hyperparameters +will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep learning models. + +In this chapter, we explore common deep learning optimization algorithms in depth. +Almost all optimization problems arising in deep learning are *nonconvex*. +Nonetheless, the design and analysis of algorithms in the context of *convex* problems have proven to be very instructive. +It is for that reason that this chapter includes a primer on convex optimization and the proof for a very simple stochastic gradient descent algorithm on a convex objective function. 
+ +```toc +:maxdepth: 2 + +optimization-intro +convexity +gd +sgd +minibatch-sgd +momentum +adagrad +rmsprop +adadelta +adam +lr-scheduler +``` + diff --git a/chapter_optimization/optimization-intro.md b/chapter_optimization/optimization-intro.md index f533dbb57..868588fe0 100644 --- a/chapter_optimization/optimization-intro.md +++ b/chapter_optimization/optimization-intro.md @@ -1,18 +1,10 @@ - +# 优化和深度学习 -# 最优化与深度学习 +在本节中,我们将讨论优化与深度学习之间的关系以及在深度学习中使用优化的挑战。对于深度学习问题,我们通常会先定义 * 损失函数 *。一旦我们有了损失函数,我们就可以使用优化算法来尽量减少损失。在优化中,损失函数通常被称为优化问题的 * 目标函数 *。按照传统和惯则,大多数优化算法都关注的是 * 最小化 *。如果我们需要最大限度地实现目标,那么有一个简单的解决方案:只需翻转目标上的标志即可。 -在本节中,我们将讨论最优化与深度学习之间的关系,以及在深度学习中使用最优化所面临的挑战。 -对于深度学习问题,通常先定义一个 *损失函数*(loss function)。一旦有了损失函数,我们就可以使用一个最优化算法来尝试最小化损失。在最优化中,损失函数通常被称为最优化问题的 *目标函数*(objective function)。根据传统和惯例,大多数最优化算法都与 *最小化*(minimization)有关。如果我们需要最大化一个目标函数,有一个简单的解决方案:只要翻转目标函数前面的符号即可。 +## 优化的目标 -## 优化目标 - -虽然最优化为深度学习提供了一种最小化损失函数的方法,但从本质上讲,最优化和深度学习的目标是完全不同的。 -前者主要关注最小化目标函数,而后者则关注在给定有限数据量的情况下找到合适的模型。 -在 :numref:`sec_model_selection` 中,我们详细讨论了这两个目标之间的差异。 -例如,通常情况下训练误差和泛化误差是不同的:因为最优化算法的目标函数一般是基于训练数据集的损失函数,所以最优化的目标是减少训练误差。 -然而,深度学习(或者广义上说,统计推断)的目标是减少泛化误差。 -为了实现后者的目标,除了使用最优化算法来减少训练误差外,还需要注意过拟合问题。 +尽管优化提供了一种最大限度地减少深度学习损失功能的方法,但实质上,优化和深度学习的目标是根本不同的。前者主要关注的是尽量减少一个目标,而鉴于数据量有限,后者则关注寻找合适的模型。在 :numref:`sec_model_selection` 中,我们详细讨论了这两个目标之间的区别。例如,训练错误和泛化错误通常不同:由于优化算法的客观函数通常是基于训练数据集的损失函数,因此优化的目标是减少训练错误。但是,深度学习(或更广义地说,统计推断)的目标是减少概括错误。为了完成后者,除了使用优化算法来减少训练错误之外,我们还需要注意过度拟合。 ```{.python .input} %matplotlib inline @@ -40,10 +32,7 @@ from mpl_toolkits import mplot3d import tensorflow as tf ``` -为了说明上述的不同目标,让我们考虑经验风险和风险。 -如 :numref:`subsec_empirical-risk-and-risk` 所描述的,经验风险是训练数据集上的平均损失,而风险是全体数据的期望损失。 -接下来我们定义两个函数,风险函数 $f$ 和经验风险函数 $g$。 -假设我们拥有的训练数据的数量是有限的,因此函数 $g$ 不如 $f$ 平滑。 +为了说明上述不同的目标,让我们考虑经验风险和风险。如 :numref:`subsec_empirical-risk-and-risk` 所述,经验风险是训练数据集的平均损失,而风险则是整个数据群的预期损失。下面我们定义了两个函数:风险函数 `f` 和经验风险函数 `g`。假设我们只有有限量的训练数据。因此,这里的 `g` 不如 `f` 平滑。 ```{.python .input} #@tab all @@ -54,7 +43,7 @@ def g(x): return f(x) + 0.2 * d2l.cos(5 * np.pi * x) ``` -下图说明了,在训练数据集上,经验风险的最小值可能与风险(泛化误差)的最小值不在相同的位置。 +下图说明,训练数据集的最低经验风险可能与最低风险(概括错误)不同。 ```{.python .input} #@tab all @@ -69,26 +58,21 @@ annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1)) annotate('min of risk', (1.1, -1.05), (0.95, -0.5)) ``` -## 深度学习中的最优化挑战 +## 深度学习中的优化挑战 -在本章中,我们将特别关注最优化算法在最小化目标函数方面的性能,而不是模型的泛化误差。 -在 :numref:`sec_linear_regression` 中,我们对比了最优化问题的解析解和数值解。 -在深度学习中,大多数目标函数是复杂的、没有解析解的。 -因此,我们必须使用本章所描述的数值最优化算法来代替解析算法。 +在本章中,我们将特别关注优化算法在最小化目标函数方面的性能,而不是模型的泛化错误。在 :numref:`sec_linear_regression` 中,我们区分了优化问题中的分析解和数值解。在深度学习中,大多数客观的功能都很复杂,没有分析解决方案。相反,我们必须使用数值优化算法。本章中的优化算法都属于此类别。 -深度学习的最优化面临许多挑战,其中最令人烦恼的是局部极小值、鞍点和消失梯度。 -下面我们将具体了解这些挑战。 +深度学习优化存在许多挑战。其中一些最令人恼人的是局部最小值、鞍点和消失的渐变。让我们来看看它们。 -### 局部最小值 +### 本地迷你 -对于目标函数 $f(x)$,如果 $x$ 处的 $f(x)$ 值小于 $x$ 附近任何其他点的 $f(x)$ 值,则 $f(x)$ 可以是 *局部最小值*(local minimum)。 -如果在 $x$ 处 $f(x)$ 的值是目标函数在整个域上的最小值,则 $f(x)$ 是 *全局最小值*(global minimum)。 +对于任何客观函数 $f(x)$,如果 $f(x)$ 的值 $f(x)$ 在 $x$ 附近的任何其他点小于 $f(x)$ 的值,那么 $f(x)$ 在 $x$ 附近的任何其他点的值小于 $f(x)$,那么 $f(x)$ 可能是局部最低值。如果 $f(x)$ 的值为 $f(x)$,为整个域的目标函数的最小值,那么 $f(x)$ 是全局最小值。 -例如,给定函数 +例如,给定函数 $$f(x) = x \cdot \text{cos}(\pi x) \text{ for } -1.0 \leq x \leq 2.0,$$ -我们可以近似这个函数的局部最小值和全局最小值。 +我们可以接近该函数的局部最小值和全局最小值。 ```{.python .input} #@tab all @@ -98,16 +82,11 @@ annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0)) annotate('global minimum', (1.1, -0.95), (0.6, 0.8)) ``` -通常深度学习模型的目标函数具有许多局部最优解。 -当最优化问题的数值解接近局部最优解时,会导致求解目标函数的梯度趋于或者变为零,此时通过最终迭代得到的数值解只可能使目标函数 *局部最小化*(locally),而不是 *全局最小化*(globally)。 -只有一定程度的噪声才能使参数脱离局部极小值。 
-事实上,小批量随机梯度下降的一个有利性质就是基于小批量上的梯度的自然变化能够强行将参数从局部极小值中移出。 +深度学习模型的客观功能通常有许多局部最佳值。当优化问题的数值解近于局部最佳值时,最终迭代获得的数值解可能只能最小化目标函数 * 本地 *,而不是随着目标函数解的梯度接近或变为零而不是 * 全局 *。只有一定程度的噪音可能会使参数从当地的最低值中排除出来。事实上,这是迷你批随机梯度下降的有益特性之一,在这种情况下,迷你匹配的渐变的自然变化能够从局部最小值中移除参数。 -### 鞍点 +### 鞍积分 -除了局部极小值,鞍点是梯度消失的另一个原因。*鞍点*(saddle point)也是函数的所有梯度都消失的位置,但这个位置既不是全局最小值也不是局部最小值。 -考虑函数 $f(x) = x^3$,它的一阶导数和二阶导数在 $x=0$ 处消失。 -即使 $x$ 不是最小值,优化也可能在这个点上停止。 +除了局部最小值之外,鞍点也是梯度消失的另一个原因。* 鞍点 * 是指函数的所有渐变都消失但既不是全局也不是局部最小值的任何位置。考虑这个函数 $f(x) = x^3$。它的第一个和第二个衍生品消失了 $x=0$。这时优化可能会停顿,尽管它不是最低限度。 ```{.python .input} #@tab all @@ -116,9 +95,7 @@ d2l.plot(x, [x**3], 'x', 'f(x)') annotate('saddle point', (0, -0.2), (-0.52, -5.0)) ``` -如下例所示,更高维度中的鞍点将更加隐蔽。考虑函数 $f(x, y) = x^2 - y^2$, -它的鞍点在 $(0, 0)$,这是 $y$ 的最大值,$x$ 的最小值。 -而且,它看起来像一个马鞍,这也就是这个数学性质命名的原因。 +如下例所示,较高尺寸的鞍点甚至更加阴险。考虑这个函数 $f(x, y) = x^2 - y^2$。它的鞍点为 $(0, 0)$。这是相对于 $y$ 的最高值,最低为 $x$。此外,它 * 看起来像马鞍,这就是这个数学属性的名字的地方。 ```{.python .input} #@tab all @@ -137,24 +114,17 @@ d2l.plt.xlabel('x') d2l.plt.ylabel('y'); ``` -我们假设一个函数的输入是一个 $k$ 维向量,其输出是一个标量,因此它的 Hessian 矩阵将有 $k$ 个特征值(参见 :numref:`sec_geometry-linear-algebraic-ops`)。 -函数的解可以是局部最小值、局部最大值或者鞍点,解所在位置的函数梯度为零: +我们假设函数的输入是 $k$ 维矢量,其输出是标量,因此其黑森州矩阵将有 $k$ 特征值(参考 [online appendix on eigendecompositions](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/eigendecomposition.html))。函数的解决方案可以是局部最小值、局部最大值或函数梯度为零的位置的鞍点: -* 当函数的 Hessian 矩阵在零梯度位置的特征值都为正时,我们得到了函数的局部极小值。 -* 当函数的 Hessian 矩阵在零梯度位置的特征值都为负时,我们得到了函数的局部极大值。 -* 当函数的 Hessian 矩阵在零梯度位置的特征值有负有正时,我们得到了函数的鞍点。 +* 当函数在零梯度位置处的 Hessian 矩阵的特征值全部为正值时,我们有该函数的局部最小值。 +* 当函数在零梯度位置处的 Hessian 矩阵的特征值全部为负值时,我们有该函数的局部最大值。 +* 当函数在零梯度位置处的 Hessian 矩阵的特征值为负值和正值时,我们对函数有一个鞍点。 -对于高维问题,至少某些特征值为负的可能性是相当高的,因此得到函数的鞍点比局部极小值的可能性更高。 -在下一节介绍凸性时,我们将讨论这种形势下的一些例外情况。简而言之,*凸函数* 是那些 Hessian 函数的特征值从不为负的函数。 -遗憾的是,大多数的深度学习问题都不属于这类函数。然而,它仍然是研究优化算法一个伟大的工具。 +对于高维度问题,至少 * 部分 * 特征值为负的可能性相当高。这使得马鞍点比本地最小值更有可能。介绍凸体时,我们将在下一节中讨论这种情况的一些例外情况。简而言之,凸函数是黑森人的特征值永远不是负值的函数。但是,可悲的是,大多数深度学习问题并不属于这个类别。尽管如此,这是研究优化算法的好工具。 -### 梯度消失 +### 消失渐变 -回忆一下我们常用的激活函数及它们的导数 :numref:`subsec_activation-functions`,*梯度消失*(vanishing gradients)可能是会遇到的最隐蔽的问题。 -举个例子,假设我们想从 $x = 4$ 开始最小化函数 $f(x) = \tanh(x)$。 -如我们所见,$f$ 的梯度接近于零,更具体地说就是$f'(x) = 1 - \tanh^2(x)$ 和 $f'(4) = 0.0013$。 -结果,在我们取得进展之前,优化将被困在那个位置很长一段时间。 -这就是为什么深度学习模型的训练在引入 ReLU 激活函数之前相当棘手的原因之一。 +可能遇到的最阴险的问题是渐变消失。回想一下我们在 :numref:`subsec_activation-functions` 中常用的激活函数及其衍生品。例如,假设我们想尽量减少函数 $f(x) = \tanh(x)$,然后我们恰好从 $x = 4$ 开始。正如我们所看到的那样,$f$ 的梯度接近零。更具体地说,$f'(x) = 1 - \tanh^2(x)$,因此是 $f'(4) = 0.0013$。因此,在我们取得进展之前,优化将会停滞很长一段时间。事实证明,这是在引入 RELU 激活功能之前训练深度学习模型相当棘手的原因之一。 ```{.python .input} #@tab all @@ -163,28 +133,25 @@ d2l.plot(x, [d2l.tanh(x)], 'x', 'f(x)') annotate('vanishing gradient', (4, 1), (2, 0.0)) ``` -正如我们所看到的,深度学习的优化充满了挑战。 -幸运的是,存在一个强大的、表现良好的、即使对于初学者也易于使用的算法范围。 -此外,没有必要找到最佳解决方案,因为局部最优解甚至近似解仍然是非常有用的。 - +正如我们所看到的那样,深度学习的优化充满挑战。幸运的是,有一系列强大的算法表现良好,即使对于初学者也很容易使用。此外,没有必要找到 * 最佳解决方案。本地最佳甚至其近似的解决方案仍然非常有用。 -## 小结 +## 摘要 -* 最小化训练误差并不能保证我们找到一组最佳的参数来最小化泛化误差。 -* 最优化问题可能存在许多局部极小值。 -* 因为通常情况下机器学习问题都不是凸性的,所以优化问题可能有许多鞍点。 -* 梯度消失会导致优化停滞。通常问题的重新参数化会有所帮助。良好的参数初始化也可能是有益的。 +* 尽量减少训练错误并不能 * 保证我们找到最佳的参数集来最大限度地减少泛化错误。 +* 优化问题可能有许多局部最低限度。 +* 问题可能有更多的马鞍点,因为通常问题不是凸起的。 +* 渐变消失可能会导致优化停滞。重新参数化问题通常会有所帮助。对参数进行良好的初始化也可能是有益的。 ## 练习 -1. 考虑一个简单的多层感知机,其有一个 $d$ 维的隐藏层和一个输出。证明对于任何局部最小值至少有 $d!$ 个行为相同的等价解。 -1. 
假设我们有一个对称随机矩阵 $\mathbf{M}$,其中元素 $M_{ij} = M_{ji}$ ,并且每个元素都是基于某个概率分布 $p_{ij}$ 提取出来的。此外,假设 $p_{ij}(x) = p_{ij}(-x)$,即分布是对称的(详见 :cite:`Wigner.1958` )。 - * 证明了特征值上的分布也是对称的。即对于任何特征向量 $\mathbf{v}$,相关特征值 $\lambda$ 的概率满足 $P(\lambda > 0) = P(\lambda < 0)$ 。 - * 为什么上面的证明 *没有* 隐含 $P(\lambda > 0) = 0.5$? -1. 在深度学习优化过程中,你还能想到哪些挑战? -1. 假设你想在一个(真实的)马鞍上平衡一个(真实的)球。 - * 为什么这么难? - * 你能利用这种效果优化算法吗? +1. 考虑一个简单的 MLP,隐藏层中有一个隐藏层(例如,)$d$ 维度和单个输出。表明对于任何本地最低限度来说,至少有 $d!$ 行为相同的等效解决方案。 +1. 假设我们有一个对称随机矩阵 $\mathbf{M}$,其中条目 $M_{ij} = M_{ji}$ 各自从某种概率分布 $p_{ij}$ 中提取。此外,假设 $p_{ij}(x) = p_{ij}(-x)$,即分布是对称的(详情请参见 :cite:`Wigner.1958`)。 + 1. 证明特征值的分布也是对称的。也就是说,对于任何特征向量 $\mathbf{v}$,关联的特征值 $\lambda$ 满足 $P(\lambda > 0) = P(\lambda < 0)$ 的概率为 $P(\lambda > 0) = P(\lambda < 0)$。 + 1. 为什么以上 * 不 * 暗示 $P(\lambda > 0) = 0.5$? +1. 你能想到深度学习优化还涉及哪些其他挑战? +1. 假设你想在(真实的)鞍上平衡一个(真实的)球。 + 1. 为什么这很难? + 1. 你能也利用这种效果进行优化算法吗? :begin_tab:`mxnet` [Discussions](https://discuss.d2l.ai/t/349) @@ -197,4 +164,3 @@ annotate('vanishing gradient', (4, 1), (2, 0.0)) :begin_tab:`tensorflow` [Discussions](https://discuss.d2l.ai/t/489) :end_tab: - diff --git a/chapter_optimization/optimization-intro_origin.md b/chapter_optimization/optimization-intro_origin.md new file mode 100644 index 000000000..dec017f80 --- /dev/null +++ b/chapter_optimization/optimization-intro_origin.md @@ -0,0 +1,231 @@ +# Optimization and Deep Learning + +In this section, we will discuss the relationship between optimization and deep learning as well as the challenges of using optimization in deep learning. +For a deep learning problem, we will usually define a *loss function* first. Once we have the loss function, we can use an optimization algorithm in attempt to minimize the loss. +In optimization, a loss function is often referred to as the *objective function* of the optimization problem. By tradition and convention most optimization algorithms are concerned with *minimization*. If we ever need to maximize an objective there is a simple solution: just flip the sign on the objective. + +## Goal of Optimization + +Although optimization provides a way to minimize the loss function for deep +learning, in essence, the goals of optimization and deep learning are +fundamentally different. +The former is primarily concerned with minimizing an +objective whereas the latter is concerned with finding a suitable model, given a +finite amount of data. +In :numref:`sec_model_selection`, +we discussed the difference between these two goals in detail. +For instance, +training error and generalization error generally differ: since the objective +function of the optimization algorithm is usually a loss function based on the +training dataset, the goal of optimization is to reduce the training error. +However, the goal of deep learning (or more broadly, statistical inference) is to +reduce the generalization error. +To accomplish the latter we need to pay +attention to overfitting in addition to using the optimization algorithm to +reduce the training error. 
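+
+The gap between these two minimizers can be made concrete with a tiny simulation. The sketch below uses only the Python standard library; the Gaussian data, the squared loss $l(\theta, x) = (\theta - x)^2$, and the sample sizes are illustrative choices rather than part of the setup used later in this section. For this loss the risk $E[(\theta - x)^2]$ is minimized by the population mean, while the empirical risk is minimized by the sample mean, and with few training examples the two can differ noticeably.
+
+```{.python .input}
+#@tab all
+import random
+
+random.seed(0)
+mu = 0.5  # Population mean: the minimizer of the risk E[(theta - x)^2]
+
+for n in [10, 100, 10000]:
+    # A training set of n examples drawn from a Gaussian with mean mu
+    sample = [random.gauss(mu, 1.0) for _ in range(n)]
+    # The empirical risk (average squared loss) is minimized by the sample mean
+    theta_hat = sum(sample) / n
+    print(f'n={n:>6}, empirical risk minimizer {theta_hat:.3f}, risk minimizer {mu}')
+```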
+ +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +from mpl_toolkits import mplot3d +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import numpy as np +from mpl_toolkits import mplot3d +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import numpy as np +from mpl_toolkits import mplot3d +import tensorflow as tf +``` + +To illustrate the aforementioned different goals, +let us consider +the empirical risk and the risk. +As described +in :numref:`subsec_empirical-risk-and-risk`, +the empirical risk +is an average loss +on the training dataset +while the risk is the expected loss +on the entire population of data. +Below we define two functions: +the risk function `f` +and the empirical risk function `g`. +Suppose that we have only a finite amount of training data. +As a result, here `g` is less smooth than `f`. + +```{.python .input} +#@tab all +def f(x): + return x * d2l.cos(np.pi * x) + +def g(x): + return f(x) + 0.2 * d2l.cos(5 * np.pi * x) +``` + +The graph below illustrates that the minimum of the empirical risk on a training dataset may be at a different location from the minimum of the risk (generalization error). + +```{.python .input} +#@tab all +def annotate(text, xy, xytext): #@save + d2l.plt.gca().annotate(text, xy=xy, xytext=xytext, + arrowprops=dict(arrowstyle='->')) + +x = d2l.arange(0.5, 1.5, 0.01) +d2l.set_figsize((4.5, 2.5)) +d2l.plot(x, [f(x), g(x)], 'x', 'risk') +annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1)) +annotate('min of risk', (1.1, -1.05), (0.95, -0.5)) +``` + +## Optimization Challenges in Deep Learning + +In this chapter, we are going to focus specifically on the performance of optimization algorithms in minimizing the objective function, rather than a +model's generalization error. +In :numref:`sec_linear_regression` +we distinguished between analytical solutions and numerical solutions in +optimization problems. +In deep learning, most objective functions are +complicated and do not have analytical solutions. Instead, we must use numerical +optimization algorithms. +The optimization algorithms in this chapter +all fall into this +category. + +There are many challenges in deep learning optimization. Some of the most vexing ones are local minima, saddle points, and vanishing gradients. +Let us have a look at them. + + +### Local Minima + +For any objective function $f(x)$, +if the value of $f(x)$ at $x$ is smaller than the values of $f(x)$ at any other points in the vicinity of $x$, then $f(x)$ could be a local minimum. +If the value of $f(x)$ at $x$ is the minimum of the objective function over the entire domain, +then $f(x)$ is the global minimum. + +For example, given the function + +$$f(x) = x \cdot \text{cos}(\pi x) \text{ for } -1.0 \leq x \leq 2.0,$$ + +we can approximate the local minimum and global minimum of this function. + +```{.python .input} +#@tab all +x = d2l.arange(-1.0, 2.0, 0.01) +d2l.plot(x, [f(x), ], 'x', 'f(x)') +annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0)) +annotate('global minimum', (1.1, -0.95), (0.6, 0.8)) +``` + +The objective function of deep learning models usually has many local optima. 
+When the numerical solution of an optimization problem is near the local optimum, the numerical solution obtained by the final iteration may only minimize the objective function *locally*, rather than *globally*, as the gradient of the objective function's solutions approaches or becomes zero. +Only some degree of noise might knock the parameter out of the local minimum. In fact, this is one of the beneficial properties of +minibatch stochastic gradient descent where the natural variation of gradients over minibatches is able to dislodge the parameters from local minima. + + +### Saddle Points + +Besides local minima, saddle points are another reason for gradients to vanish. A *saddle point* is any location where all gradients of a function vanish but which is neither a global nor a local minimum. +Consider the function $f(x) = x^3$. Its first and second derivative vanish for $x=0$. Optimization might stall at this point, even though it is not a minimum. + +```{.python .input} +#@tab all +x = d2l.arange(-2.0, 2.0, 0.01) +d2l.plot(x, [x**3], 'x', 'f(x)') +annotate('saddle point', (0, -0.2), (-0.52, -5.0)) +``` + +Saddle points in higher dimensions are even more insidious, as the example below shows. Consider the function $f(x, y) = x^2 - y^2$. It has its saddle point at $(0, 0)$. This is a maximum with respect to $y$ and a minimum with respect to $x$. Moreover, it *looks* like a saddle, which is where this mathematical property got its name. + +```{.python .input} +#@tab all +x, y = d2l.meshgrid( + d2l.linspace(-1.0, 1.0, 101), d2l.linspace(-1.0, 1.0, 101)) +z = x**2 - y**2 + +ax = d2l.plt.figure().add_subplot(111, projection='3d') +ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10}) +ax.plot([0], [0], [0], 'rx') +ticks = [-1, 0, 1] +d2l.plt.xticks(ticks) +d2l.plt.yticks(ticks) +ax.set_zticks(ticks) +d2l.plt.xlabel('x') +d2l.plt.ylabel('y'); +``` + +We assume that the input of a function is a $k$-dimensional vector and its +output is a scalar, so its Hessian matrix will have $k$ eigenvalues +(refer to the [online appendix on eigendecompositions](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/eigendecomposition.html)). +The solution of the +function could be a local minimum, a local maximum, or a saddle point at a +position where the function gradient is zero: + +* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all positive, we have a local minimum for the function. +* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all negative, we have a local maximum for the function. +* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are negative and positive, we have a saddle point for the function. + +For high-dimensional problems the likelihood that at least *some* of the eigenvalues are negative is quite high. This makes saddle points more likely than local minima. We will discuss some exceptions to this situation in the next section when introducing convexity. In short, convex functions are those where the eigenvalues of the Hessian are never negative. Sadly, though, most deep learning problems do not fall into this category. Nonetheless it is a great tool to study optimization algorithms. + +### Vanishing Gradients + +Probably the most insidious problem to encounter is the vanishing gradient. +Recall our commonly-used activation functions and their derivatives in :numref:`subsec_activation-functions`. 
+For instance, assume that we want to minimize the function $f(x) = \tanh(x)$ and we happen to get started at $x = 4$. As we can see, the gradient of $f$ is close to nil. +More specifically, $f'(x) = 1 - \tanh^2(x)$ and thus $f'(4) = 0.0013$. +Consequently, optimization will get stuck for a long time before we make progress. This turns out to be one of the reasons that training deep learning models was quite tricky prior to the introduction of the ReLU activation function. + +```{.python .input} +#@tab all +x = d2l.arange(-2.0, 5.0, 0.01) +d2l.plot(x, [d2l.tanh(x)], 'x', 'f(x)') +annotate('vanishing gradient', (4, 1), (2, 0.0)) +``` + +As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust range of algorithms that perform well and that are easy to use even for beginners. Furthermore, it is not really necessary to find *the* best solution. Local optima or even approximate solutions thereof are still very useful. + +## Summary + +* Minimizing the training error does *not* guarantee that we find the best set of parameters to minimize the generalization error. +* The optimization problems may have many local minima. +* The problem may have even more saddle points, as generally the problems are not convex. +* Vanishing gradients can cause optimization to stall. Often a reparameterization of the problem helps. Good initialization of the parameters can be beneficial, too. + + +## Exercises + +1. Consider a simple MLP with a single hidden layer of, say, $d$ dimensions in the hidden layer and a single output. Show that for any local minimum there are at least $d!$ equivalent solutions that behave identically. +1. Assume that we have a symmetric random matrix $\mathbf{M}$ where the entries + $M_{ij} = M_{ji}$ are each drawn from some probability distribution + $p_{ij}$. Furthermore assume that $p_{ij}(x) = p_{ij}(-x)$, i.e., that the + distribution is symmetric (see e.g., :cite:`Wigner.1958` for details). + 1. Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvector $\mathbf{v}$ the probability that the associated eigenvalue $\lambda$ satisfies $P(\lambda > 0) = P(\lambda < 0)$. + 1. Why does the above *not* imply $P(\lambda > 0) = 0.5$? +1. What other challenges involved in deep learning optimization can you think of? +1. Assume that you want to balance a (real) ball on a (real) saddle. + 1. Why is this hard? + 1. Can you exploit this effect also for optimization algorithms? 
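+
+As a quick numerical companion to the eigenvalue criterion for zero-gradient points discussed above, the following minimal sketch classifies two critical points by the signs of their Hessian eigenvalues. NumPy and the two hand-coded Hessians are illustrative assumptions; in practice the Hessian would come from the model itself.
+
+```{.python .input}
+#@tab all
+import numpy as np
+
+def classify_critical_point(hessian):
+    # Classify a zero-gradient point by the signs of the Hessian eigenvalues.
+    # (A zero eigenvalue would make the test inconclusive; not handled here.)
+    eigvals = np.linalg.eigvalsh(hessian)
+    if (eigvals > 0).all():
+        kind = 'local minimum'
+    elif (eigvals < 0).all():
+        kind = 'local maximum'
+    else:
+        kind = 'saddle point'
+    print(f'eigenvalues {eigvals} -> {kind}')
+
+# Hessian of f(x, y) = x**2 - y**2 at its critical point (0, 0)
+classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]]))
+# Hessian of f(x, y) = x**2 + 2 * y**2 at its critical point (0, 0)
+classify_critical_point(np.array([[2.0, 0.0], [0.0, 4.0]]))
+```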
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/349) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/487) +:end_tab: + +:begin_tab:`tensorflow` +[Discussions](https://discuss.d2l.ai/t/489) +:end_tab: diff --git a/chapter_optimization/sgd.md b/chapter_optimization/sgd.md new file mode 100644 index 000000000..22d8b5429 --- /dev/null +++ b/chapter_optimization/sgd.md @@ -0,0 +1,249 @@ +# 随机梯度下降 +:label:`sec_sgd` + +但是,在前面的章节中,我们一直在训练过程中使用随机梯度下降,但没有解释它为什么起作用。为了澄清这一点,我们刚在 :numref:`sec_gd` 中描述了梯度下降的基本原则。在本节中,我们继续讨论 +*更详细地说明随机梯度下降 *。 + +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +import math +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import math +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import math +import tensorflow as tf +``` + +## 随机渐变更新 + +在深度学习中,目标函数通常是训练数据集中每个示例的损失函数的平均值。给定 $n$ 个示例的训练数据集,我们假设 $f_i(\mathbf{x})$ 是与指数 $i$ 的训练示例相比的损失函数,其中 $\mathbf{x}$ 是参数矢量。然后我们到达目标功能 + +$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$ + +$\mathbf{x}$ 的目标函数的梯度计算为 + +$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$ + +如果使用梯度下降,则每次独立变量迭代的计算成本为 $\mathcal{O}(n)$,随 $n$ 线性增长。因此,当训练数据集较大时,每次迭代的梯度下降成本将更高。 + +随机梯度下降 (SGD) 可降低每次迭代时的计算成本。在随机梯度下降的每次迭代中,我们随机统一采样一个指数 $i\in\{1,\ldots, n\}$ 以获取数据示例,并计算渐变 $\nabla f_i(\mathbf{x})$ 以更新 $\mathbf{x}$: + +$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}),$$ + +其中 $\eta$ 是学习率。我们可以看到,每次迭代的计算成本从梯度下降的 $\mathcal{O}(n)$ 降至常数 $\mathcal{O}(1)$。此外,我们要强调,随机梯度 $\nabla f_i(\mathbf{x})$ 是对完整梯度 $\nabla f(\mathbf{x})$ 的公正估计,因为 + +$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$$ + +这意味着,平均而言,随机梯度是对梯度的良好估计值。 + +现在,我们将把它与梯度下降进行比较,方法是向渐变添平均值 0 和方差 1 的随机噪声,以模拟随机渐变下降。 + +```{.python .input} +#@tab all +def f(x1, x2): # Objective function + return x1 ** 2 + 2 * x2 ** 2 + +def f_grad(x1, x2): # Gradient of the objective function + return 2 * x1, 4 * x2 +``` + +```{.python .input} +#@tab mxnet, pytorch +def sgd(x1, x2, s1, s2, f_grad): + g1, g2 = f_grad(x1, x2) + # Simulate noisy gradient + g1 += d2l.normal(0.0, 1, (1,)) + g2 += d2l.normal(0.0, 1, (1,)) + eta_t = eta * lr() + return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0) +``` + +```{.python .input} +#@tab tensorflow +def sgd(x1, x2, s1, s2, f_grad): + g1, g2 = f_grad(x1, x2) + # Simulate noisy gradient + g1 += d2l.normal([1], 0.0, 1) + g2 += d2l.normal([1], 0.0, 1) + eta_t = eta * lr() + return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0) +``` + +```{.python .input} +#@tab all +def constant_lr(): + return 1 + +eta = 0.1 +lr = constant_lr # Constant learning rate +d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad)) +``` + +正如我们所看到的,随机梯度下降中变量的轨迹比我们在 :numref:`sec_gd` 中观察到的梯度下降中观察到的轨迹嘈杂得多。这是由于梯度的随机性质。也就是说,即使我们接近最低值,我们仍然受到通过 $\eta \nabla f_i(\mathbf{x})$ 的瞬间梯度所注入的不确定性的影响。即使经过 50 个步骤,质量仍然不那么好。更糟糕的是,经过额外的步骤,它不会改善(我们鼓励你尝试更多的步骤来确认这一点)。这给我们留下了唯一的选择:改变学习率 $\eta$。但是,如果我们选择太小,我们一开始就不会取得任何有意义的进展。另一方面,如果我们选择太大,我们将无法获得上文所述的好解决方案。解决这些相互冲突的目标的唯一方法是随着优化的进展动态 * 降低学习率 *。 + +这也是在 `sgd` 步长函数中添加学习率函数 `lr` 的原因。在上面的示例中,任何学习率调度功能都处于休眠状态,因为我们将关联的 `lr` 函数设置为恒定。 + +## 动态学习率 + +用时间相关的学习率 $\eta(t)$ 取代 $\eta$ 增加了控制优化算法收敛的复杂性。特别是,我们需要弄清 $\eta$ 应该有多快衰减。如果速度太快,我们将过早停止优化。如果我们减少速度太慢,我们会在优化上浪费太多时间。以下是随着时间推移调整 $\eta$ 时使用的一些基本策略(稍后我们将讨论更高级的策略): + +$$ +\begin{aligned} + \eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1} && 
\text{piecewise constant} \\ + \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \text{exponential decay} \\ + \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial decay} +\end{aligned} +$$ + +在第一个 * 分段常数 * 场景中,我们会降低学习率,例如,每当优化进度停顿时。这是训练深度网络的常见策略。或者,我们可以通过 * 指数衰减 * 来更积极地减少它。不幸的是,这往往会导致算法收敛之前过早停止。一个受欢迎的选择是 * 多项式衰变 * 与 $\alpha = 0.5$。在凸优化的情况下,有许多证据表明这种速率表现良好。 + +让我们看看指数衰减在实践中是什么样子。 + +```{.python .input} +#@tab all +def exponential_lr(): + # Global variable that is defined outside this function and updated inside + global t + t += 1 + return math.exp(-0.1 * t) + +t = 1 +lr = exponential_lr +d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad)) +``` + +正如预期的那样,参数的差异大大减少。但是,这是以未能融合到最佳解决方案 $\mathbf{x} = (0, 0)$ 为代价的。即使经过 1000 个迭代步骤,我们仍然离最佳解决方案很远。事实上,该算法根本无法收敛。另一方面,如果我们使用多项式衰减,其中学习率下降,步数的逆平方根,那么仅在 50 个步骤之后,收敛就会更好。 + +```{.python .input} +#@tab all +def polynomial_lr(): + # Global variable that is defined outside this function and updated inside + global t + t += 1 + return (1 + 0.1 * t) ** (-0.5) + +t = 1 +lr = polynomial_lr +d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad)) +``` + +关于如何设置学习率,还有更多的选择。例如,我们可以从较小的利率开始,然后迅速上涨,然后再次降低,尽管速度更慢。我们甚至可以在较小和更大的学习率之间交替。这样的时间表有各种各样。现在,让我们专注于可以进行全面理论分析的学习率时间表,即凸环境下的学习率。对于一般的非凸问题,很难获得有意义的收敛保证,因为总的来说,最大限度地减少非线性非凸问题是 NP 困难的。有关调查,例如,请参阅 Tibshirani 2015 年的优秀 [讲义笔记](https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/26-nonconvex.pdf)。 + +## 凸目标的收敛性分析 + +以下对凸目标函数的随机梯度下降的收敛性分析是可选的,主要用于传达对问题的更多直觉。我们只限于最简单的证明之一 :cite:`Nesterov.Vial.2000`。存在着明显更先进的证明技术,例如,当客观功能表现特别好时。 + +假设所有 $\boldsymbol{\xi}$ 的目标函数 $f(\boldsymbol{\xi}, \mathbf{x})$ 在 $\mathbf{x}$ 中都是凸的。更具体地说,我们考虑随机梯度下降更新: + +$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}),$$ + +其中 $f(\boldsymbol{\xi}_t, \mathbf{x})$ 是培训实例 $f(\boldsymbol{\xi}_t, \mathbf{x})$ 的客观功能:$\boldsymbol{\xi}_t$ 从第 $t$ 步的某些分布中摘取,$\mathbf{x}$ 是模型参数。表示通过 + +$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})]$$ + +预期风险和 $R^*$ 相对于 $\mathbf{x}$ 的最低风险。最后让 $\mathbf{x}^*$ 成为最小化器(我们假设它存在于定义 $\mathbf{x}$ 的域中)。在这种情况下,我们可以跟踪当前参数 $\mathbf{x}_t$ 当时 $\mathbf{x}_t$ 和风险最小化器 $\mathbf{x}^*$ 之间的距离,看看它是否随着时间的推移而改善: + +$$\begin{aligned} &\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \\ =& \|\mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) - \mathbf{x}^*\|^2 \\ =& \|\mathbf{x}_{t} - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 - 2 \eta_t \left\langle \mathbf{x}_t - \mathbf{x}^*, \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\right\rangle. 
\end{aligned}$$ +:eqlabel:`eq_sgd-xt+1-xstar` + +我们假设 $L_2$ 随机梯度 $\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})$ 的标准受到一定的 $L$ 的限制,因此我们有这个 + +$$\eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 \leq \eta_t^2 L^2.$$ +:eqlabel:`eq_sgd-L` + +我们最感兴趣的是 $\mathbf{x}_t$ 和 $\mathbf{x}^*$ 之间的距离如何变化 * 预期 *。事实上,对于任何具体的步骤序列,距离可能会增加,这取决于我们遇到的 $\boldsymbol{\xi}_t$。因此我们需要绑定点积。因为对于任何凸函数 $f$,它认为所有 $\mathbf{x}$ 和 $\mathbf{y}$ 的 $f(\mathbf{y}) \geq f(\mathbf{x}) + \langle f'(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$ 和 $\mathbf{y}$,按凸度我们有 + +$$f(\boldsymbol{\xi}_t, \mathbf{x}^*) \geq f(\boldsymbol{\xi}_t, \mathbf{x}_t) + \left\langle \mathbf{x}^* - \mathbf{x}_t, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t) \right\rangle.$$ +:eqlabel:`eq_sgd-f-xi-xstar` + +将不等式 :eqref:`eq_sgd-L` 和 :eqref:`eq_sgd-f-xi-xstar` 插入 :eqref:`eq_sgd-xt+1-xstar` 我们在时间 $t+1$ 时获得参数之间距离的边界,如下所示: + +$$\|\mathbf{x}_{t} - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t (f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)) - \eta_t^2 L^2.$$ +:eqlabel:`eqref_sgd-xt-diff` + +这意味着,只要当前亏损和最佳损失之间的差异超过 $\eta_t L^2/2$,我们就会取得进展。由于这种差异必然会收敛到零,因此学习率 $\eta_t$ 也需要 * 消失 *。 + +接下来,我们的预期超过 :eqref:`eqref_sgd-xt-diff`。这会产生 + +$$E\left[\|\mathbf{x}_{t} - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{x}_t)] - R^*] - \eta_t^2 L^2.$$ + +最后一步是对 $t \in \{1, \ldots, T\}$ 的不平等现象进行总结。自从总和望远镜以及通过掉低期我们获得的 + +$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \left (\sum_{t=1}^T \eta_t \right) [E[R(\mathbf{x}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2.$$ +:eqlabel:`eq_sgd-x1-xstar` + +请注意,我们利用了 $\mathbf{x}_1$ 给出了,因此预期可以下降。最后定义 + +$$\bar{\mathbf{x}} \stackrel{\mathrm{def}}{=} \frac{\sum_{t=1}^T \eta_t \mathbf{x}_t}{\sum_{t=1}^T \eta_t}.$$ + +自 + +$$E\left(\frac{\sum_{t=1}^T \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^T \eta_t}\right) = \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t} = E[R(\mathbf{x}_t)],$$ + +根据延森的不平等性(设定为 $i=t$,$i=t$,$\alpha_i = \eta_t/\sum_{t=1}^T \eta_t$)和 $R$ 的凸度为 $R$,因此, + +$$\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^T \eta_t E\left[R(\bar{\mathbf{x}})\right].$$ + +将其插入不平等性 :eqref:`eq_sgd-x1-xstar` 收益了限制 + +$$ +\left[E[\bar{\mathbf{x}}]\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t}, +$$ + +其中 $r^2 \stackrel{\mathrm{def}}{=} \|\mathbf{x}_1 - \mathbf{x}^*\|^2$ 受初始选择参数与最终结果之间的距离的约束。简而言之,收敛速度取决于随机梯度标准的限制方式($L$)以及初始参数值与最优性($r$)的距离($r$)。请注意,约束是按 $\bar{\mathbf{x}}$ 而不是 $\mathbf{x}_T$ 而不是 $\mathbf{x}_T$。情况就是这样,因为 $\bar{\mathbf{x}}$ 是优化路径的平滑版本。只要知道 $r, L$ 和 $T$,我们就可以选择学习率 $\eta = r/(L \sqrt{T})$。这个收益率为上限 $rL/\sqrt{T}$。也就是说,我们将汇率 $\mathcal{O}(1/\sqrt{T})$ 收敛到最佳解决方案。 + +## 随机梯度和有限样本 + +到目前为止,在谈论随机梯度下降时,我们玩得有点快而松散。我们假设我们从 $x_i$ 中绘制实例 $x_i$,通常使用来自某些发行版 $p(x, y)$ 的标签 $y_i$,我们用它来以某种方式更新模型参数。特别是,对于有限的样本数量,我们只是认为,某些函数 $\delta_{x_i}$ 和 $\delta_{y_i}$ 的离散分布 $p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ 和 $\delta_{y_i}$ 允许我们在其上执行随机梯度下降。 + +但是,这不是我们真正做的。在当前部分的玩具示例中,我们只是将噪音添加到其他非随机梯度上,也就是说,我们假装了对 $(x_i, y_i)$。事实证明,这是合理的(请参阅练习进行详细讨论)。更令人不安的是,在以前的所有讨论中,我们显然没有这样做。相反,我们遍历了所有实例 * 恰好一次 *。要了解为什么这更可取,请考虑反之,即我们从离散分布 * 中抽取 $n$ 个观测值 * 并带替换 *。随机选择一个元素 $i$ 的概率是 $1/n$。因此选择它 * 至少 * 一次就是 + +$$P(\mathrm{choose~} i) = 1 - P(\mathrm{omit~} i) = 1 - (1-1/n)^n \approx 1-e^{-1} \approx 0.63.$$ + +类似的推理表明,挑选一些样本(即训练示例)* 恰好一次 * 的概率是由 + +$${n \choose 1} \frac{1}{n} \left(1-\frac{1}{n}\right)^{n-1} = \frac{n}{n-1} \left(1-\frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37.$$ + +这导致与采样 * 
不替换 * 相比,差异增加并降低数据效率。因此,在实践中我们执行后者(这是本书中的默认选择)。最后一点注意,重复穿过训练数据集会以 * 不同的 * 随机顺序遍历它。 + +## 摘要 + +* 对于凸出的问题,我们可以证明,对于广泛的学习率选择,随机梯度下降将收敛到最佳解决方案。 +* 对于深度学习而言,情况通常并非如此。但是,对凸问题的分析使我们能够深入了解如何进行优化,即逐步降低学习率,尽管不是太快。 +* 如果学习率太小或太大,就会出现问题。实际上,通常只有经过多次实验后才能找到合适的学习率。 +* 当训练数据集中有更多示例时,计算渐变下降的每个迭代的成本更高,因此在这些情况下,首选随机梯度下降。 +* 随机梯度下降的最佳性保证在非凸情况下一般不可用,因为需要检查的局部最小值数可能是指数级的。 + +## 练习 + +1. 尝试不同的学习速率计划以实现随机梯度下降和不同迭代次数。特别是,根据迭代次数的函数来绘制与最佳解 $(0, 0)$ 的距离。 +1. 证明对于函数 $f(x_1, x_2) = x_1^2 + 2 x_2^2$ 而言,向梯度添加正常噪声等同于最小化损耗函数 $f(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$,其中 $\mathbf{x}$ 是从正态分布中提取的。 +1. 比较随机梯度下降的收敛性,当您从 $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ 采样时使用替换方法进行采样时以及在不替换的情况下进行样品时 +1. 如果某些渐变(或者更确切地说与之相关的某些坐标)始终比所有其他渐变都大,你将如何更改随机渐变下降求解器? +1. 假设是 $f(x) = x^2 (1 + \sin x)$。$f$ 有多少本地最小值?你能改变 $f$ 以尽量减少它需要评估所有本地最小值的方式吗? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/352) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/497) +:end_tab: + +:begin_tab:`tensorflow` +[Discussions](https://discuss.d2l.ai/t/1067) +:end_tab: diff --git a/chapter_optimization/sgd_origin.md b/chapter_optimization/sgd_origin.md new file mode 100644 index 000000000..2a77c1dee --- /dev/null +++ b/chapter_optimization/sgd_origin.md @@ -0,0 +1,289 @@ +# Stochastic Gradient Descent +:label:`sec_sgd` + +In earlier chapters we kept using stochastic gradient descent in our training procedure, however, without explaining why it works. +To shed some light on it, +we just described the basic principles of gradient descent +in :numref:`sec_gd`. +In this section, we go on to discuss +*stochastic gradient descent* in greater detail. + +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +import math +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import math +import torch +``` + +```{.python .input} +#@tab tensorflow +%matplotlib inline +from d2l import tensorflow as d2l +import math +import tensorflow as tf +``` + +## Stochastic Gradient Updates + +In deep learning, the objective function is usually the average of the loss functions for each example in the training dataset. +Given a training dataset of $n$ examples, +we assume that $f_i(\mathbf{x})$ is the loss function +with respect to the training example of index $i$, +where $\mathbf{x}$ is the parameter vector. +Then we arrive at the objective function + +$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$ + +The gradient of the objective function at $\mathbf{x}$ is computed as + +$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$ + +If gradient descent is used, the computational cost for each independent variable iteration is $\mathcal{O}(n)$, which grows linearly with $n$. Therefore, when the training dataset is larger, the cost of gradient descent for each iteration will be higher. + +Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each iteration of stochastic gradient descent, we uniformly sample an index $i\in\{1,\ldots, n\}$ for data examples at random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$: + +$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}),$$ + +where $\eta$ is the learning rate. We can see that the computational cost for each iteration drops from $\mathcal{O}(n)$ of the gradient descent to the constant $\mathcal{O}(1)$. 
Moreover, we want to emphasize that the stochastic gradient $\nabla f_i(\mathbf{x})$ is an unbiased estimate of the full gradient $\nabla f(\mathbf{x})$ because + +$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$$ + +This means that, on average, the stochastic gradient is a good estimate of the gradient. + +Now, we will compare it with gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient to simulate a stochastic gradient descent. + +```{.python .input} +#@tab all +def f(x1, x2): # Objective function + return x1 ** 2 + 2 * x2 ** 2 + +def f_grad(x1, x2): # Gradient of the objective function + return 2 * x1, 4 * x2 +``` + +```{.python .input} +#@tab mxnet, pytorch +def sgd(x1, x2, s1, s2, f_grad): + g1, g2 = f_grad(x1, x2) + # Simulate noisy gradient + g1 += d2l.normal(0.0, 1, (1,)) + g2 += d2l.normal(0.0, 1, (1,)) + eta_t = eta * lr() + return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0) +``` + +```{.python .input} +#@tab tensorflow +def sgd(x1, x2, s1, s2, f_grad): + g1, g2 = f_grad(x1, x2) + # Simulate noisy gradient + g1 += d2l.normal([1], 0.0, 1) + g2 += d2l.normal([1], 0.0, 1) + eta_t = eta * lr() + return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0) +``` + +```{.python .input} +#@tab all +def constant_lr(): + return 1 + +eta = 0.1 +lr = constant_lr # Constant learning rate +d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad)) +``` + +As we can see, the trajectory of the variables in the stochastic gradient descent is much more noisy than the one we observed in gradient descent in :numref:`sec_gd`. This is due to the stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still subject to the uncertainty injected by the instantaneous gradient via $\eta \nabla f_i(\mathbf{x})$. Even after 50 steps the quality is still not so good. Even worse, it will not improve after additional steps (we encourage you to experiment with a larger number of steps to confirm this). This leaves us with the only alternative: change the learning rate $\eta$. However, if we pick this too small, we will not make any meaningful progress initially. On the other hand, if we pick it too large, we will not get a good solution, as seen above. The only way to resolve these conflicting goals is to reduce the learning rate *dynamically* as optimization progresses. + +This is also the reason for adding a learning rate function `lr` into the `sgd` step function. In the example above any functionality for learning rate scheduling lies dormant as we set the associated `lr` function to be constant. + +## Dynamic Learning Rate + +Replacing $\eta$ with a time-dependent learning rate $\eta(t)$ adds to the complexity of controlling convergence of an optimization algorithm. In particular, we need to figure out how rapidly $\eta$ should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization. 
+The following are a few basic strategies that are used in adjusting $\eta$ over time (we will discuss more advanced strategies later):
+
+$$
+\begin{aligned}
+    \eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1} && \text{piecewise constant} \\
+    \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \text{exponential decay} \\
+    \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial decay}
+\end{aligned}
+$$
+
+In the first *piecewise constant* scenario we decrease the learning rate, e.g., whenever progress in optimization stalls. This is a common strategy for training deep networks. Alternatively we could decrease it much more aggressively by an *exponential decay*. Unfortunately this often leads to premature stopping before the algorithm has converged. A popular choice is *polynomial decay* with $\alpha = 0.5$. In the case of convex optimization there are a number of proofs that show that this rate is well behaved.
+
+Let us see what the exponential decay looks like in practice.
+
+```{.python .input}
+#@tab all
+def exponential_lr():
+    # Global variable that is defined outside this function and updated inside
+    global t
+    t += 1
+    return math.exp(-0.1 * t)
+
+t = 1
+lr = exponential_lr
+d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
+```
+
+As expected, the variance in the parameters is significantly reduced. However, this comes at the expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$. Even after 1000 iteration steps we are still very far away from the optimal solution. Indeed, the algorithm fails to converge at all. On the other hand, if we use a polynomial decay where the learning rate decays with the inverse square root of the number of steps, convergence gets better after only 50 steps.
+
+```{.python .input}
+#@tab all
+def polynomial_lr():
+    # Global variable that is defined outside this function and updated inside
+    global t
+    t += 1
+    return (1 + 0.1 * t) ** (-0.5)
+
+t = 1
+lr = polynomial_lr
+d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
+```
+
+There exist many more choices for how to set the learning rate. For instance, we could start with a small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We could even alternate between smaller and larger learning rates. There exists a large variety of such schedules. For now let us focus on learning rate schedules for which a comprehensive theoretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex problems it is very difficult to obtain meaningful convergence guarantees, since in general minimizing nonlinear nonconvex problems is NP hard. For a survey see e.g., the excellent [lecture notes](https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/26-nonconvex.pdf) of Tibshirani 2015.
+
+
+## Convergence Analysis for Convex Objectives
+
+The following convergence analysis of stochastic gradient descent for convex objective functions
+is optional and primarily serves to convey more intuition about the problem.
+We limit ourselves to one of the simplest proofs :cite:`Nesterov.Vial.2000`.
+Significantly more advanced proof techniques exist, e.g., whenever the objective function is particularly well behaved.
+
+
+Suppose that the objective function $f(\boldsymbol{\xi}, \mathbf{x})$ is convex in $\mathbf{x}$
+for all $\boldsymbol{\xi}$.
+More concretely, +we consider the stochastic gradient descent update: + +$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}),$$ + +where $f(\boldsymbol{\xi}_t, \mathbf{x})$ +is the objective function +with respect to the training example $\boldsymbol{\xi}_t$ +drawn from some distribution +at step $t$ and $\mathbf{x}$ is the model parameter. +Denote by + +$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})]$$ + +the expected risk and by $R^*$ its minimum with regard to $\mathbf{x}$. Last let $\mathbf{x}^*$ be the minimizer (we assume that it exists within the domain where $\mathbf{x}$ is defined). In this case we can track the distance between the current parameter $\mathbf{x}_t$ at time $t$ and the risk minimizer $\mathbf{x}^*$ and see whether it improves over time: + +$$\begin{aligned} &\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \\ =& \|\mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) - \mathbf{x}^*\|^2 \\ =& \|\mathbf{x}_{t} - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 - 2 \eta_t \left\langle \mathbf{x}_t - \mathbf{x}^*, \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\right\rangle. \end{aligned}$$ +:eqlabel:`eq_sgd-xt+1-xstar` + +We assume that the $L_2$ norm of stochastic gradient $\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})$ is bounded by some constant $L$, hence we have that + +$$\eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 \leq \eta_t^2 L^2.$$ +:eqlabel:`eq_sgd-L` + + +We are mostly interested in how the distance between $\mathbf{x}_t$ and $\mathbf{x}^*$ changes *in expectation*. In fact, for any specific sequence of steps the distance might well increase, depending on whichever $\boldsymbol{\xi}_t$ we encounter. Hence we need to bound the dot product. +Since for any convex function $f$ it holds that +$f(\mathbf{y}) \geq f(\mathbf{x}) + \langle f'(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$ +for all $\mathbf{x}$ and $\mathbf{y}$, +by convexity we have + +$$f(\boldsymbol{\xi}_t, \mathbf{x}^*) \geq f(\boldsymbol{\xi}_t, \mathbf{x}_t) + \left\langle \mathbf{x}^* - \mathbf{x}_t, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t) \right\rangle.$$ +:eqlabel:`eq_sgd-f-xi-xstar` + +Plugging both inequalities :eqref:`eq_sgd-L` and :eqref:`eq_sgd-f-xi-xstar` into :eqref:`eq_sgd-xt+1-xstar` we obtain a bound on the distance between parameters at time $t+1$ as follows: + +$$\|\mathbf{x}_{t} - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t (f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)) - \eta_t^2 L^2.$$ +:eqlabel:`eqref_sgd-xt-diff` + +This means that we make progress as long as the difference between current loss and the optimal loss outweighs $\eta_t L^2/2$. Since this difference is bound to converge to zero it follows that the learning rate $\eta_t$ also needs to *vanish*. + +Next we take expectations over :eqref:`eqref_sgd-xt-diff`. This yields + +$$E\left[\|\mathbf{x}_{t} - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{x}_t)] - R^*] - \eta_t^2 L^2.$$ + +The last step involves summing over the inequalities for $t \in \{1, \ldots, T\}$. 
Since the sum telescopes and by dropping the lower term we obtain + +$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \left (\sum_{t=1}^T \eta_t \right) [E[R(\mathbf{x}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2.$$ +:eqlabel:`eq_sgd-x1-xstar` + +Note that we exploited that $\mathbf{x}_1$ is given and thus the expectation can be dropped. Last define + +$$\bar{\mathbf{x}} \stackrel{\mathrm{def}}{=} \frac{\sum_{t=1}^T \eta_t \mathbf{x}_t}{\sum_{t=1}^T \eta_t}.$$ + +Since + +$$E\left(\frac{\sum_{t=1}^T \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^T \eta_t}\right) = \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t} = E[R(\mathbf{x}_t)],$$ + +by Jensen's inequality (setting $i=t$, $\alpha_i = \eta_t/\sum_{t=1}^T \eta_t$ in :eqref:`eq_jensens-inequality`) and convexity of $R$ it follows that $E[R(\mathbf{x}_t)] \geq E[R(\bar{\mathbf{x}})]$, thus + +$$\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^T \eta_t E\left[R(\bar{\mathbf{x}})\right].$$ + +Plugging this into the inequality :eqref:`eq_sgd-x1-xstar` yields the bound + +$$ +\left[E[\bar{\mathbf{x}}]\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t}, +$$ + +where $r^2 \stackrel{\mathrm{def}}{=} \|\mathbf{x}_1 - \mathbf{x}^*\|^2$ is a bound on the distance between the initial choice of parameters and the final outcome. In short, the speed of convergence depends on how +the norm of stochastic gradient is bounded ($L$) and how far away from optimality the initial parameter value is ($r$). Note that the bound is in terms of $\bar{\mathbf{x}}$ rather than $\mathbf{x}_T$. This is the case since $\bar{\mathbf{x}}$ is a smoothed version of the optimization path. +Whenever $r, L$, and $T$ are known we can pick the learning rate $\eta = r/(L \sqrt{T})$. This yields as upper bound $rL/\sqrt{T}$. That is, we converge with rate $\mathcal{O}(1/\sqrt{T})$ to the optimal solution. + + + + + +## Stochastic Gradients and Finite Samples + +So far we have played a bit fast and loose when it comes to talking about stochastic gradient descent. We posited that we draw instances $x_i$, typically with labels $y_i$ from some distribution $p(x, y)$ and that we use this to update the model parameters in some manner. In particular, for a finite sample size we simply argued that the discrete distribution $p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ +for some functions $\delta_{x_i}$ and $\delta_{y_i}$ +allows us to perform stochastic gradient descent over it. + +However, this is not really what we did. In the toy examples in the current section we simply added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs $(x_i, y_i)$. It turns out that this is justified here (see the exercises for a detailed discussion). More troubling is that in all previous discussions we clearly did not do this. Instead we iterated over all instances *exactly once*. To see why this is preferable consider the converse, namely that we are sampling $n$ observations from the discrete distribution *with replacement*. The probability of choosing an element $i$ at random is $1/n$. 
Thus to choose it *at least* once is + +$$P(\mathrm{choose~} i) = 1 - P(\mathrm{omit~} i) = 1 - (1-1/n)^n \approx 1-e^{-1} \approx 0.63.$$ + +A similar reasoning shows that the probability of picking some sample (i.e., training example) *exactly once* is given by + +$${n \choose 1} \frac{1}{n} \left(1-\frac{1}{n}\right)^{n-1} = \frac{n}{n-1} \left(1-\frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37.$$ + +This leads to an increased variance and decreased data efficiency relative to sampling *without replacement*. Hence, in practice we perform the latter (and this is the default choice throughout this book). Last note that repeated passes through the training dataset traverse it in a *different* random order. + + +## Summary + +* For convex problems we can prove that for a wide choice of learning rates stochastic gradient descent will converge to the optimal solution. +* For deep learning this is generally not the case. However, the analysis of convex problems gives us useful insight into how to approach optimization, namely to reduce the learning rate progressively, albeit not too quickly. +* Problems occur when the learning rate is too small or too large. In practice a suitable learning rate is often found only after multiple experiments. +* When there are more examples in the training dataset, it costs more to compute each iteration for gradient descent, so stochastic gradient descent is preferred in these cases. +* Optimality guarantees for stochastic gradient descent are in general not available in nonconvex cases since the number of local minima that require checking might well be exponential. + + + + +## Exercises + +1. Experiment with different learning rate schedules for stochastic gradient descent and with different numbers of iterations. In particular, plot the distance from the optimal solution $(0, 0)$ as a function of the number of iterations. +1. Prove that for the function $f(x_1, x_2) = x_1^2 + 2 x_2^2$ adding normal noise to the gradient is equivalent to minimizing a loss function $f(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$ where $\mathbf{x}$ is drawn from a normal distribution. +1. Compare convergence of stochastic gradient descent when you sample from $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with replacement and when you sample without replacement. +1. How would you change the stochastic gradient descent solver if some gradient (or rather some coordinate associated with it) was consistently larger than all the other gradients? +1. Assume that $f(x) = x^2 (1 + \sin x)$. How many local minima does $f$ have? Can you change $f$ in such a way that to minimize it one needs to evaluate all the local minima? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/352) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/497) +:end_tab: + +:begin_tab:`tensorflow` +[Discussions](https://discuss.d2l.ai/t/1067) +:end_tab:
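+
+The two probabilities derived above for sampling $n$ items with replacement (roughly $0.63$ for drawing a given example at least once and roughly $0.37$ for drawing it exactly once) are easy to confirm with a short Monte Carlo sketch. The sample size and the number of trials below are arbitrary illustrative choices.
+
+```{.python .input}
+#@tab all
+import random
+from collections import Counter
+
+random.seed(0)
+n, trials = 1000, 1000
+at_least_once, exactly_once = 0, 0
+for _ in range(trials):
+    # Draw n indices with replacement from {0, ..., n-1} and count how often
+    # index 0 shows up (any fixed index behaves the same way by symmetry)
+    counts = Counter(random.randrange(n) for _ in range(n))
+    at_least_once += counts[0] >= 1
+    exactly_once += counts[0] == 1
+
+print(f'estimated P(at least once) = {at_least_once / trials:.2f} (about 1 - 1/e)')
+print(f'estimated P(exactly once)  = {exactly_once / trials:.2f} (about 1/e)')
+```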