From 5975e44cddddaea3d984040a82f9bb790579379a Mon Sep 17 00:00:00 2001
From: invariantor
Date: Fri, 24 Apr 2020 02:45:52 -0400
Subject: [PATCH] fix math display in ExpLR1

---
 _posts/2020-04-24-ExpLR1.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/_posts/2020-04-24-ExpLR1.md b/_posts/2020-04-24-ExpLR1.md
index 4c717ee..6a5bf63 100644
--- a/_posts/2020-04-24-ExpLR1.md
+++ b/_posts/2020-04-24-ExpLR1.md
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: Exponential Learning Rate Schedules for Deep Learning (Part 1)
-date: 2020-04-24 02:00:00
+date: 2020-04-23 17:00:00
 author: Zhiyuan Li and Sanjeev Arora
 visible: True
 ---
@@ -58,19 +58,18 @@ At first sight such a claim may seem difficult (if not impossible) to prove give
 
 The formal proof holds for any training loss satisfying what we call *Scale Invariance*:
 
-$$ L (c\cdot \theta) = L(\theta), \quad \forall \theta, \forall c >0.$$
+$$ L (c\cdot \pmb{\theta}) = L(\pmb{\theta}), \quad \forall \pmb{\theta}, \forall c >0.$$
 
 BN and other normalization schemes result in a Scale-Invariant Loss for the popular deep architectures (Convnet, Resnet, DenseNet etc.) if the output layer --where normally no normalization is used-- is fixed throughout training. Empirically, [Hoffer et al. (2018b)](https://openreview.net/forum?id=S1Dh8Tg0-) found that randomly fixing the output layer at the start does not harm the final accuracy. (Appendix C of our paper demonstrates scale invariance for various architectures; it is somewhat nontrivial.)
 
-For batch ${\mathcal{B}}=\{x_ i\}_ {i=1}^B$, network parameter ${\pmb{\theta}}$, we denote the network by $f_ {\pmb{\theta}}$ and the loss function at iteration $t$ by $L_ t(f_ {\theta}) = L(f_ {\theta}, {\mathcal{B}}_ t)$ . We also use $L_ t({\theta})$ for convenience. We say the network $f_ {\pmb{\theta}}$ is *scale invariant* if $\forall c>0$, $f_ {c{\pmb{\theta}}} = f_ {\pmb{\theta}}$, which implies the loss $L_ t$ is also scale invariant, i.e., $L_ t(c{\pmb{\theta}}_ t)=L_ t({\pmb{\theta}}_ t)$, $\forall c>0$. A key source of intuition is the following lemma provable via chain rule:
+For batch ${\mathcal{B}} = \\{ x_ i \\} _ {i=1}^B$, network parameter ${\pmb{\theta}}$, we denote the network by $f_ {\pmb{\theta}}$ and the loss function at iteration $t$ by $L_ t(f_ {\pmb{\theta}}) = L(f_ {\pmb{\theta}}, {\mathcal{B}}_ t)$. We also use $L_ t({\pmb{\theta}})$ for convenience. We say the network $f_ {\pmb{\theta}}$ is *scale invariant* if $\forall c>0$, $f_ {c{\pmb{\theta}}} = f_ {\pmb{\theta}}$, which implies the loss $L_ t$ is also scale invariant, i.e., $L_ t(c{\pmb{\theta}}_ t)=L_ t({\pmb{\theta}}_ t)$, $\forall c>0$. A key source of intuition is the following lemma, provable via the chain rule:
 
->**Lemma 1**. A scale-invariant loss L satisfies
->
->(1). $\langle\nabla_ {{\pmb{\theta}}}L,{\pmb{\theta}}\rangle=0$;
->(2). $\left.\nabla_ {{\pmb{\theta}}}L \right|_ {{\pmb{\theta}} = {\pmb{\theta}}_ 0} = c \left.\nabla_ {{\pmb{\theta}}}L\right|_ {{\pmb{\theta}} = c{\pmb{\theta}}_ 0}$, for any $c>0$.
+>**Lemma 1**. A scale-invariant loss $L$ satisfies
+>(1). $\langle\nabla_ {\pmb{\theta}} L, {\pmb{\theta}} \rangle=0$ ;
+>(2). $\left.\nabla_ {\pmb{\theta}} L \right|_ {\pmb{\theta} = \pmb{\theta}_ 0} = c \left.\nabla_ {\pmb{\theta}} L\right|_ {\pmb{\theta} = c\pmb{\theta}_ 0}$, for any $c>0$.
 
- The first property immediately implies that $\|{\pmb{\theta}}_ t\|$ is monotone increasing for SGD if WD is turned off by Pythagoren Theorem. And based on this, [our previous work](https://arxiv.org/pdf/1812.03981.pdf) with Kaifeng Lyu shows that GD with any fixed learning rate can reach $\varepsilon$ approximate stationary point for scale invariant objectives in $O(1/\varepsilon^2)$.
+ The first property, together with the Pythagorean theorem, immediately implies that $\|{\pmb{\theta}}_ t\|$ is monotone increasing for SGD if WD is turned off. Based on this, [our previous work](https://arxiv.org/pdf/1812.03981.pdf) with Kaifeng Lyu shows that GD with any fixed learning rate can reach an $\varepsilon$-approximate stationary point for scale-invariant objectives in $O(1/\varepsilon^2)$ iterations.
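
For readers who want to sanity-check the scale-invariance definition and Lemma 1 from the patched text, here is a minimal numerical sketch. It is our illustration, not part of the patch or the original post: the toy loss, the NumPy setup, and the finite-difference gradient are all illustrative choices that merely mimic a scale-invariant network (a linear predictor divided by the norm of its parameters). It checks $L(c\pmb{\theta}) = L(\pmb{\theta})$, property (1) $\langle\nabla_ {\pmb{\theta}} L, {\pmb{\theta}}\rangle = 0$, and property (2) $\nabla L|_ {\pmb{\theta}_ 0} = c\,\nabla L|_ {c\pmb{\theta}_ 0}$.

```python
# Minimal numerical sketch of Scale Invariance and Lemma 1.
# Illustrative only: the toy "normalized" linear model below is our own choice,
# not the batch-normalized architectures discussed in the post.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))            # a toy batch B = {x_i}
y = rng.integers(0, 2, size=32) * 2 - 1  # labels in {-1, +1}

def loss(theta):
    """Logistic loss of a normalized linear predictor; it depends only on the
    direction of theta, so L(c * theta) = L(theta) for every c > 0."""
    logits = x @ (theta / np.linalg.norm(theta))
    return np.mean(np.log1p(np.exp(-y * logits)))

def grad(theta, eps=1e-6):
    """Central finite-difference estimate of the gradient of `loss`."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta = rng.normal(size=10)
c = 3.0

print(loss(theta), loss(c * theta))       # scale invariance: the two values agree
print(np.dot(grad(theta), theta))         # Lemma 1 (1): inner product is ~0
print(np.allclose(grad(theta), c * grad(c * theta), atol=1e-6))  # Lemma 1 (2)
```

With this toy loss, the two loss values coincide, the printed inner product sits at numerical-noise level, and the scaling check passes, matching both parts of the lemma.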