first complete draft. need to work on evaluation section.
akashgit committed Apr 5, 2020
1 parent fad7be1 commit 18351cd
Showing 8 changed files with 81 additions and 29 deletions.
7 changes: 7 additions & 0 deletions ICLR_slides.code-workspace
@@ -0,0 +1,7 @@
{
  "folders": [
    {
      "path": "."
    }
  ]
}
Binary file added celeba.png
Binary file added cifar10.png
103 changes: 74 additions & 29 deletions iclr2020.md
@@ -5,67 +5,112 @@ marp: true
# **G**enerative **Ra**tio **M**atching (GRAM)

### Goal

A *stable* learning algorithm for *implicit* deep generative models with *high*-dimensional data
- MMD networks are stable but perform poorly on high-dimensional data
- Adversarial generative methods (GANs, MMD-GANs, etc.) can scale up to high-dimensional (image) data but are not stable in general

### Key ideas

1. Learn a low-dimensional subspace projection in which the density ratio between the data and the generator distributions is close to the density ratio in the original space
2. Train the generator via the MMD loss in this projected space

---

## Learning a Low-Dimensional Subspace Projection $f_\theta(x)$

We'd like to learn a parameterized transformation $f_\theta(x)$ by minimising the squared difference between the density ratios in the original and projected spaces:

$$
\begin{aligned}
D(\theta)
&= \int q_x(x) \left( \frac{p_x(x)}{q_x(x)} - \frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))} \right)^2 dx \\
&= C - 2 \int p_x(x) \frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))} dx + \int q_x(x) \left( \frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))} \right)^2 dx \\
&= C - 2 \int \bar{p}(f_\theta(x)) \frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))} df_\theta(x) + \int \bar{q}(f_\theta(x)) \left( \frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))} \right)^2 df_\theta(x) \\
&= C' - \left( \int \bar{q}(f_\theta(x)) \left( \frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))} \right)^2 df_\theta(x) - 1 \right)\\
&= C' - \mathrm{PD}(\bar{q}, \bar{p}), \quad \text{with } C' = C - 1
\end{aligned}
$$
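
Here $\bar{p}$ and $\bar{q}$ denote the densities of $f_\theta(x)$ under $p_x$ and $q_x$; making that assumption explicit, the change of variables in the third line is just, for any function $g$,

$$
\int p_x(x)\, g(f_\theta(x))\, dx = \int \bar{p}(y)\, g(y)\, dy .
$$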

We can *minimise* the squared ratio difference by *maximising* the Pearson divergence (PD) in the low-dimensional space :heart:

---

## Pearson Divergence Maximisation

- We carry out a Monte Carlo approximation,
$$
\mathrm{PD}(\bar{q}, \bar{p}) \approx \frac{1}{N} \sum_{i=1}^N \left( \frac{\bar{p}(f_\theta(x^q_i))}{\bar{q}(f_\theta(x^q_i))} \right)^2 - 1
$$
where $x^q_i \sim q_x$.
- For this to work, we need an estimator of the density ratio.
<!-- - We only need density ratios $\frac{\bar{p}(f_\theta(x))}{\bar{q}(f_\theta(x))}$ for a set of samples from $q$ during MC. -->
- Use an MMD-based density ratio estimator (Sugiyama et al., 2012), which has an analytical solution under the fixed-design setup: $\hat{r}_q = \mathbf{K}^{-1}_{q,q} \mathbf{K}_{q,p}\mathbf{1}$ (see the sketch at the end of this slide).
- $\mathbf{K}_{q,q}$ and $\mathbf{K}_{q,p}$ are Gram matrices defined by $[\mathbf{K}_{q,q}]_{i,j} = k(f_\theta(x^q_i),f_\theta(x^q_j))$ and $[\mathbf{K}_{q,p}]_{i,j} = k(f_\theta(x^q_i),f_\theta(x^p_j)).$
<!-- - Train the generator via the MMD loss -->
<!-- - Shared Gram matrix between density ratio estimation and generator training
- Simultaneous training of the transform function and the generator -->
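
To make the fixed-design estimator concrete, here is a minimal NumPy sketch — an illustration, not the paper's released code — assuming a Gaussian kernel and adding a small ridge term for invertibility; `fq` and `fp` stand for the projected batches $f_\theta(x^q)$ and $f_\theta(x^p)$:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix [k(a_i, b_j)]_{ij} for a Gaussian kernel of bandwidth sigma."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def pd_estimate(fq, fp, sigma=1.0, eps=1e-6):
    """MC estimate of PD from projected batches fq = f_theta(x^q), fp = f_theta(x^p)."""
    K_qq = gaussian_gram(fq, fq, sigma)
    K_qp = gaussian_gram(fq, fp, sigma)
    # Fixed-design analytical solution from the slide: r_q = K_qq^{-1} K_qp 1
    # (small ridge eps added for numerical stability)
    r_q = np.linalg.solve(K_qq + eps * np.eye(len(fq)), K_qp @ np.ones(len(fp)))
    return (r_q ** 2).mean() - 1.0
```

In the method itself this estimate is *maximised* with respect to $\theta$, so an autodiff framework would stand in for NumPy; the snippet only illustrates the algebra.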

---

## Density ratio estimation via (infinite) moment matching

**Maximum mean discrepancy**

$$
\textrm{MMD}_{\mathcal{F}}(p,q) = \sup_{f\in\mathcal{F}} \left(\mathbb{E}_p \lbrack f(x) \rbrack - \mathbb{E}_q \lbrack f(x) \rbrack \right)
$$

Gretton et al. (2012) show that it is sufficient to choose $\mathcal{F}$ to be a unit ball in a reproducing kernel Hilbert space $\mathcal{R}$ with a characteristic kernel $k$.

<!-- $$
\hat{\textmd{MMD}}^2_\mathcal{R}(p,q) =
\frac{1}{N^2}\sum_{i=1}^N\sum_{i'=1}^N k(x_i,x_{i'})
- \frac{2}{NM}\sum_{i=1}^N\sum_{j=1}^M k(x_i, y_j)
+ \frac{1}{M^2}\sum_{j=1}^M\sum_{j'=1}^M k(y_j,y_{j'})
$$ -->
- Using this definition of MMD, the density ratio estimator $r(x)$ can be derived as the solution to
$$
\min_{r\in\mathcal{R}} \bigg \Vert \int k(x, \cdot)\, p(x)\, dx - \int k(x, \cdot)\, r(x)\, q(x)\, dx \bigg \Vert_{\mathcal{R}}^2.
$$
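
To see why this has the closed form quoted on the previous slide (a sketch, with the $N/M$ batch-size factor absorbed): plugging in the empirical mean embeddings and restricting $r$ to its values $\mathbf{r}$ at the $q$-samples makes the objective quadratic in $\mathbf{r}$,

$$
\hat{\mathbf{r}}_q
= \arg\min_{\mathbf{r}} \left( \mathbf{r}^\top \mathbf{K}_{q,q}\, \mathbf{r} - 2\, \mathbf{r}^\top \mathbf{K}_{q,p} \mathbf{1} \right)
= \mathbf{K}_{q,q}^{-1} \mathbf{K}_{q,p} \mathbf{1}.
$$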

---

## Generator Training

- The generator $G_\gamma$ is trained by minimising the empirical estimator of the MMD in the projected space,

$$
\begin{aligned}
\min_\gamma \Bigg[&\frac{1}{N^2}\sum_{i=1}^N\sum_{i'=1}^N k(f_\theta(x_i),f_\theta(x_{i'}))
- \frac{2}{NM}\sum_{i=1}^N\sum_{j=1}^M k(f_\theta(x_i), f_\theta(G_\gamma(z_j)))\\
&\quad + \frac{1}{M^2}\sum_{j=1}^M\sum_{j'=1}^M k(f_\theta(G_\gamma(z_j)),f_\theta(G_\gamma(z_{j'}))) \Bigg ]
\end{aligned}
$$

with respect to its parameters $\gamma$.
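
To show how the two objectives interact, here is a hypothetical PyTorch-style sketch of one alternating training step (all names are placeholders, not the authors' code); the Gram matrices computed here could be shared between the ratio estimate and the MMD loss:

```python
import torch

def gram(A, B, sigma=1.0):
    # Gaussian-kernel Gram matrix [k(a_i, b_j)]
    return torch.exp(-torch.cdist(A, B) ** 2 / (2 * sigma ** 2))

def gram_step(f_theta, G_gamma, opt_f, opt_g, x, z, sigma=1.0, eps=1e-6):
    """One alternating GRAM-style update (illustrative sketch only)."""
    # 1) Update the projection f_theta by gradient ascent on the PD estimate.
    fp, fq = f_theta(x), f_theta(G_gamma(z).detach())  # p = data, q = generator
    K_qq = gram(fq, fq, sigma)
    K_qp = gram(fq, fp, sigma)
    r_q = torch.linalg.solve(K_qq + eps * torch.eye(len(fq)),
                             K_qp @ torch.ones(len(fp)))  # K_qq^{-1} K_qp 1
    pd = (r_q ** 2).mean() - 1.0
    opt_f.zero_grad()
    (-pd).backward()   # ascent on PD = descent on -PD
    opt_f.step()

    # 2) Update the generator G_gamma by minimising MMD^2 in the projected space.
    fp, fq = f_theta(x).detach(), f_theta(G_gamma(z))
    mmd2 = (gram(fp, fp, sigma).mean()
            - 2.0 * gram(fp, fq, sigma).mean()
            + gram(fq, fq, sigma).mean())
    opt_g.zero_grad()
    mmd2.backward()
    opt_g.step()
```

The alternating form is for clarity; simultaneous updates with shared Gram matrices, as the bullet points above describe, follow the same pattern.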

---
## Evaluation

**Synthetic Dataset**
![Synthetic dataset](syn_0.png)

---
**Synthetic Dataset**

![Synthetic dataset](syn_1.png)

---

![Synthetic dataset results](syn.png)

---

**CIFAR10 and CelebA**

**Quantitative Results**
![Quantitative results on CIFAR10 and CelebA](table_image.png)

---
**Qualitative Results**: Random Samples

![CIFAR10 samples](cifar10.png)![CelebA samples](celeba.png)
Binary file added syn.png
Binary file added syn_0.png
Binary file added syn_1.png
Binary file added table_image.png
