From 5975e44cddddaea3d984040a82f9bb790579379a Mon Sep 17 00:00:00 2001
From: invariantor
Date: Fri, 24 Apr 2020 02:45:52 -0400
Subject: [PATCH] fix math display in ExpLR1

---
 _posts/2020-04-24-ExpLR1.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/_posts/2020-04-24-ExpLR1.md b/_posts/2020-04-24-ExpLR1.md
index 4c717ee..6a5bf63 100644
--- a/_posts/2020-04-24-ExpLR1.md
+++ b/_posts/2020-04-24-ExpLR1.md
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: Exponential Learning Rate Schedules for Deep Learning (Part 1)
-date: 2020-04-24 02:00:00
+date: 2020-04-23 17:00:00
 author: Zhiyuan Li and Sanjeev Arora
 visible: True
 ---
@@ -58,19 +58,18 @@ At first sight such a claim may seem difficult (if not impossible) to prove give
 
 The formal proof holds for any training loss satisfying what we call *Scale Invariance*:
 
-$$ L (c\cdot \theta) = L(\theta), \quad \forall \theta, \forall c >0.$$
+$$ L (c\cdot \pmb{\theta}) = L(\pmb{\theta}), \quad \forall \pmb{\theta}, \forall c >0.$$
 
 BN and other normalization schemes result in a Scale-Invariant Loss for the popular deep architectures (Convnet, Resnet, DenseNet etc.) if the output layer --where normally no normalization is used-- is fixed throughout training. Empirically, [Hoffer et al. (2018b)](https://openreview.net/forum?id=S1Dh8Tg0-) found that randomly fixing the output layer at the start does not harm the final accuracy. (Appendix C of our paper demonstrates scale invariance for various architectures; it is somewhat nontrivial.)
 
-For batch ${\mathcal{B}}=\{x_ i\}_ {i=1}^B$, network parameter ${\pmb{\theta}}$, we denote the network by $f_ {\pmb{\theta}}$ and the loss function at iteration $t$ by $L_ t(f_ {\theta}) = L(f_ {\theta}, {\mathcal{B}}_ t)$ . We also use $L_ t({\theta})$ for convenience. We say the network $f_ {\pmb{\theta}}$ is *scale invariant* if $\forall c>0$, $f_ {c{\pmb{\theta}}} = f_ {\pmb{\theta}}$, which implies the loss $L_ t$ is also scale invariant, i.e., $L_ t(c{\pmb{\theta}}_ t)=L_ t({\pmb{\theta}}_ t)$, $\forall c>0$. A key source of intuition is the following lemma provable via chain rule:
+For batch ${\mathcal{B}} = \\{ x_ i \\} _ {i=1}^B$, network parameter ${\pmb{\theta}}$, we denote the network by $f_ {\pmb{\theta}}$ and the loss function at iteration $t$ by $L_ t(f_ {\pmb{\theta}}) = L(f_ {\pmb{\theta}}, {\mathcal{B}}_ t)$. We also use $L_ t({\pmb{\theta}})$ for convenience. We say the network $f_ {\pmb{\theta}}$ is *scale invariant* if $\forall c>0$, $f_ {c{\pmb{\theta}}} = f_ {\pmb{\theta}}$, which implies the loss $L_ t$ is also scale invariant, i.e., $L_ t(c{\pmb{\theta}}_ t)=L_ t({\pmb{\theta}}_ t)$, $\forall c>0$. A key source of intuition is the following lemma, provable via the chain rule:
 
->**Lemma 1**. A scale-invariant loss L satisfies
->
->(1). $\langle\nabla_ {{\pmb{\theta}}}L,{\pmb{\theta}}\rangle=0$;
->(2). $\left.\nabla_ {{\pmb{\theta}}}L \right|_ {{\pmb{\theta}} = {\pmb{\theta}}_ 0} = c \left.\nabla_ {{\pmb{\theta}}}L\right|_ {{\pmb{\theta}} = c{\pmb{\theta}}_ 0}$, for any $c>0$.
+>**Lemma 1**. A scale-invariant loss $L$ satisfies
+>(1). $\langle\nabla_ {\pmb{\theta}} L, {\pmb{\theta}} \rangle=0$ ;
+>(2). $\left.\nabla_ {\pmb{\theta}} L \right|_ {\pmb{\theta} = \pmb{\theta}_ 0} = c \left.\nabla_ {\pmb{\theta}} L\right|_ {\pmb{\theta} = c\pmb{\theta}_ 0}$, for any $c>0$.
 
- The first property immediately implies that $\|{\pmb{\theta}}_ t\|$ is monotone increasing for SGD if WD is turned off by Pythagoren Theorem. And based on this, [our previous work](https://arxiv.org/pdf/1812.03981.pdf) with Kaifeng Lyu shows that GD with any fixed learning rate can reach $\varepsilon$ approximate stationary point for scale invariant objectives in $O(1/\varepsilon^2)$.
+ The first property, together with the Pythagorean theorem, immediately implies that $\|{\pmb{\theta}}_ t\|$ is monotone increasing for SGD if WD is turned off. Based on this, [our previous work](https://arxiv.org/pdf/1812.03981.pdf) with Kaifeng Lyu shows that GD with any fixed learning rate can reach an $\varepsilon$-approximate stationary point for scale-invariant objectives in $O(1/\varepsilon^2)$ iterations.
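
For readers who want to sanity-check the scale-invariance definition and Lemma 1 from the patched text, here is a minimal numerical sketch. It is our illustration, not part of the patch or the original post: the toy loss, the NumPy setup, and the finite-difference gradient are all illustrative choices that merely mimic a scale-invariant network (a linear predictor divided by the norm of its parameters). It checks $L(c\pmb{\theta}) = L(\pmb{\theta})$, property (1) $\langle\nabla_ {\pmb{\theta}} L, {\pmb{\theta}}\rangle = 0$, and property (2) $\nabla L|_ {\pmb{\theta}_ 0} = c\,\nabla L|_ {c\pmb{\theta}_ 0}$.

```python
# Minimal numerical sketch of Scale Invariance and Lemma 1.
# Illustrative only: the toy "normalized" linear model below is our own choice,
# not the batch-normalized architectures discussed in the post.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))            # a toy batch B = {x_i}
y = rng.integers(0, 2, size=32) * 2 - 1  # labels in {-1, +1}

def loss(theta):
    """Logistic loss of a normalized linear predictor; it depends only on the
    direction of theta, so L(c * theta) = L(theta) for every c > 0."""
    logits = x @ (theta / np.linalg.norm(theta))
    return np.mean(np.log1p(np.exp(-y * logits)))

def grad(theta, eps=1e-6):
    """Central finite-difference estimate of the gradient of `loss`."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta = rng.normal(size=10)
c = 3.0

print(loss(theta), loss(c * theta))       # scale invariance: the two values agree
print(np.dot(grad(theta), theta))         # Lemma 1 (1): inner product is ~0
print(np.allclose(grad(theta), c * grad(c * theta), atol=1e-6))  # Lemma 1 (2)
```

With this toy loss, the two loss values coincide, the printed inner product sits at numerical-noise level, and the scaling check passes, matching both parts of the lemma.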