Issue/873 embedded laplace #874

380 changes: 380 additions & 0 deletions src/functions-reference/embedded_laplace.qmd
@@ -0,0 +1,380 @@
---
pagetitle: Embedded Laplace Approximation
---

# Embedded Laplace Approximation

The embedded Laplace approximation can be used to approximate certain
marginal and conditional distributions that arise in latent Gaussian models.
A latent Gaussian model has the following hierarchical structure:
$$
\phi \sim p(\phi), \\
\theta \sim \text{MultiNormal}(0, K(\phi)), \\
y \sim p(y \mid \theta, \phi),
$$
where $K(\phi)$ denotes the prior covariance matrix parameterized by $\phi$.
To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either
use a standard method, such as Markov chain Monte Carlo, or we can follow
a two-step procedure:

1. sample from the *marginal posterior* $p(\phi \mid y)$,
2. sample from the *conditional posterior* $p(\theta \mid y, \phi)$.

In practice, neither the marginal posterior nor the conditional posterior
is available in closed form, and so they must be approximated.
The marginal posterior can be written as $p(\phi \mid y) \propto p(y \mid \phi) p(\phi)$,
where $p(y \mid \phi) = \int p(y \mid \phi, \theta) p(\theta) \, \text{d}\theta$
is called the marginal likelihood. The Laplace method approximates
$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution and the
resulting Gaussian integral can be evaluated analytically to obtain an
approximation to the log marginal likelihood
$\log \hat p(y \mid \phi) \approx \log p(y \mid \phi)$.

Combining this marginal likelihood with the prior in the `model`
block, we can then sample from the marginal posterior $p(\phi \mid y)$
using one of Stan's algorithms. The marginal posterior is lower
dimensional and typically has a geometry that is easier to sample from,
leading to more efficient inference. On the other hand, each marginal
likelihood evaluation is more costly, so the overall change in efficiency
depends on the application.

To obtain posterior draws for $\theta$, we sample from the normal
approximation to $p(\theta \mid y, \phi)$ in `generated quantities`.
The process of iteratively sampling from $p(\phi \mid y)$ (say, with MCMC) and
then $p(\theta \mid y, \phi)$ produces samples from the joint posterior
$p(\theta, \phi \mid y)$.

The Laplace approximation is especially useful if $p(\theta)$ is
multivariate normal and $p(y \mid \phi, \theta)$ is
log-concave. Stan's embedded Laplace approximation is restricted to
a multivariate normal prior $p(\theta)$ and ... likelihood
$p(y \mid \phi, \theta)$.


## Specifying the likelihood function

The first step in using the embedded Laplace approximation is to write a
function in the `functions` block that returns the log joint likelihood
$\log p(y \mid \theta, \phi)$. There are a few constraints on this function:

* The function return type must be `real`

* The first argument must be the latent Gaussian variable $\theta$ and must
have type `vector`.

* The operations in the function must support higher-order automatic
differentiation (AD). Most functions in Stan support higher-order AD.
The exceptions are functions with specialized reverse-mode AD implementations:
the higher-order functions (algebraic solvers, differential equation
solvers, and integrators) and the suite of hidden Markov model (HMM) functions.

The signature of the function is
```
real ll_function(vector theta, ...)
```
There are no type restrictions on the variadic arguments `...`, and each
argument may be passed as data or as a parameter. As always, users should
pass arguments as parameters only when necessary, in order to speed up
differentiation. In general, we recommend marking data-only arguments with
the keyword `data`, for example,
```
real ll_function(vector theta, data vector x, ...)
```
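
For illustration, a log joint likelihood for grouped count data with a log
link might look as follows. This is only a sketch: the argument names `y`
and `y_index` are hypothetical and stand in for whatever data the model
requires.
```
functions {
  // Hypothetical log joint likelihood: Poisson counts y with a log link,
  // where y_index[i] gives the group (and hence latent variable) for y[i].
  real ll_function(vector theta, data array[] int y, data array[] int y_index) {
    return poisson_log_lpmf(y | theta[y_index]);
  }
}
```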

## Specifying the covariance function

We next need to specify a function that returns the prior covariance matrix
$K$ as a function of the hyperparameters $\phi$.
The only restriction is that this function returns a matrix of size
$n \times n$, where $n$ is the size of $\theta$. The signature is:
```
matrix K_function(...)
```
There are no type restrictions on the variadic arguments. The variable $\phi$
is implicitly defined as the collection of all non-data arguments passed to
`ll_function` (excluding $\theta$) and `K_function`.
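
As an illustration, a squared exponential covariance over one-dimensional
covariates `x` with hyperparameters `alpha` and `rho` (hypothetical names,
not part of the interface) could be written as follows; the jitter added to
the diagonal is a common numerical safeguard.
```
functions {
  // Hypothetical prior covariance: squared exponential kernel over covariates x,
  // with marginal standard deviation alpha and length scale rho.
  matrix K_function(data array[] real x, real alpha, real rho) {
    int n = size(x);
    matrix[n, n] K = gp_exp_quad_cov(x, alpha, rho);
    for (i in 1:n) {
      K[i, i] += 1e-8;  // jitter for numerical stability
    }
    return K;
  }
}
```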


## Approximating the log marginal likelihood $\log p(y \mid \phi)$

In the `model` block, we increment `target` with `laplace_marginal`, a function
that approximates the log marginal likelihood $\log p(y \mid \phi)$.
This function takes in the
user-specified likelihood and prior covariance functions, as well as their arguments.
These arguments must be passed as tuples, which can be constructed on the fly
using parentheses.
We also need to pass an argument $\theta_0$ which serves as an initial guess for
the optimization problem that underlies the Laplace approximation,
$$
\underset{\theta}{\text{argmax}} \ \log p(\theta \mid y, \phi).
$$
The size of $\theta_0$ must be consistent with the size of the $\theta$ argument
passed to `ll_function`.

The signature of the function is:
```
target += laplace_marginal(function ll_function, tuple (...), vector theta_0,
function K_function, tuple (...));
```
The tuple `(...)` after `ll_function` contains the arguments that get passed
to `ll_function` *excluding $\theta$*. Likewise, the tuple `(...)` after
`K_function` contains the arguments that get passed to `K_function`.
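
Putting the pieces together, a `model` block using the hypothetical
`ll_function` and `K_function` sketched above might read as follows, with
`theta_0` (for example, a vector of zeros) declared in `transformed data`
and `alpha`, `rho` declared in `parameters`.
```
model {
  // hypothetical priors on the hyperparameters phi = (alpha, rho)
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  // approximate log marginal likelihood log p(y | phi)
  target += laplace_marginal(ll_function, (y, y_index), theta_0,
                             K_function, (x, alpha, rho));
}
```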

It is also possible to specify control parameters, which can help improve the
optimization that underlies the Laplace approximation. Specifically:

* `tol`: the tolerance $\epsilon$ of the optimizer. Specifically, the optimizer
stops when $||\nabla \log p(\theta \mid y, \phi)|| \le \epsilon$. By default,
the value is $\epsilon = 10^{-6}$.

* `max_num_steps`: the maximum number of steps taken by the optimizer before
it gives up (in which case the Metropolis proposal gets rejected). The default
is 100 steps.

* `hessian_block_size`: the size of the blocks, assuming the Hessian
$\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ is block diagonal.
The structure of the Hessian is determined by the dependence structure of $y$
on $\theta$. By default, the Hessian is treated as diagonal
(`hessian_block_size=1`). If the Hessian is not block diagonal, then set
`hessian_block_size=n`, where `n` is the size of $\theta$.

* `solver`: choice of Newton solver. The optimizer used to compute the
Laplace approximation performs one of three matrix decompositions to compute a
Newton step. The problem determines which decomposition is numerically stable.
By default (`solver=1`), the solver uses a Cholesky decomposition of the
negative Hessian, $- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
If `solver=2`, the solver uses a Cholesky decomposition of the covariance
matrix $K(\phi)$.
If the Cholesky decomposition can be computed for neither the negative
Hessian nor the covariance matrix, use `solver=3`, which uses a more expensive
but less specialized approach.

* `max_steps_linesearch`: maximum number of steps in the linesearch. The
linesearch method tries to ensure that the Newton step leads to a decrease in
the objective function. If the Newton step does not improve the objective function,
the step is repeatedly halved until the objective function decreases or the
maximum number of steps in the linesearch is reached. By default,
`max_steps_linesearch=0`, meaning no linesearch is performed.

With these arguments at hand, we can call `laplace_marginal_tol` with the
following signature:
```
target += laplace_marginal_tol(function ll_function, tuple (...), vector theta_0,
function K_function, tuple (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);
```
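
For example, with the same hypothetical setup as above, an equivalent call
that passes the default control values explicitly might be:
```
model {
  target += laplace_marginal_tol(ll_function, (y, y_index), theta_0,
                                 K_function, (x, alpha, rho),
                                 1e-6,  // tol
                                 100,   // maximum number of optimizer steps
                                 1,     // hessian_block_size
                                 1,     // solver
                                 0);    // max_steps_linesearch
}
```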

## Sample from the approximate conditional $\hat{p}(\theta \mid y, \phi)$

In `generated quantities`, it is possible to sample from the Laplace
approximation of $p(\theta \mid \phi, y)$ using `laplace_latent_rng`.
The signature for `laplace_latent_rng` follows closely
the signature for `laplace_marginal`:
```
vector theta =
laplace_latent_rng(function ll_function, tuple (...), vector theta_0,
function K_function, tuple (...));
```
Once again, it is possible to specify control parameters:
```
vector theta =
laplace_latent_tol_rng(function ll_function, tuple (...), vector theta_0,
function K_function, tuple (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);
```
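
Continuing the hypothetical example above, the `generated quantities` block
could recover draws of the latent Gaussian variable as follows, with `n` the
size of $\theta$:
```
generated quantities {
  // one draw per iteration from the approximate conditional p(theta | y, phi)
  vector[n] theta = laplace_latent_rng(ll_function, (y, y_index), theta_0,
                                       K_function, (x, alpha, rho));
}
```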

## Built-in Laplace marginal likelihood functions

Stan supports certain built-in Laplace marginal likelihood functions.
This selection is currently
narrow and expected to grow. The built-in functions exist for the user's
convenience but are not more computationally efficient than specifying log
likelihoods in the `functions` block.

### Poisson with log link

Given count data, with each observed count $y_i$ associated with a group
$g(i)$ and a corresponding latent variable $\theta_{g(i)}$, and a Poisson model,
the likelihood is
$$
p(y \mid \theta, \phi) = \prod_i\text{Poisson} (y_i \mid \exp(\theta_{g(i)})).
$$
The arguments required to compute this likelihood are:

* `y`: an array of counts.
* `y_index`: an array whose $i^\text{th}$ element indicates the group to
which the $i^\text{th}$ observation belongs.

The signatures for the embedded Laplace approximation function with a Poisson
likelihood are
```
real laplace_marginal_poisson_log_lpmf(array[] int y | array[] int y_index,
vector theta0, function K_function, (...));

real laplace_marginal_tol_poisson_log_lpmf(array[] int y | array[] int y_index,
vector theta0, function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);

vector laplace_latent_poisson_log_rng(array[] int y, array[] int y_index,
vector theta0, function K_function, (...));

vector laplace_latent_tol_poisson_log_rng(array[] int y, array[] int y_index,
vector theta0, function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);
```
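
As a usage sketch, reusing the hypothetical `K_function` and its arguments
from the earlier sections, the built-in Poisson marginal can be added to the
target as:
```
model {
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  target += laplace_marginal_poisson_log_lpmf(y | y_index, theta_0,
                                              K_function, (x, alpha, rho));
}
```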


A similar built-in likelihood lets users specify an offset $x_i \in \mathbb R^+$
to the rate parameter of the Poisson. The likelihood is then,
$$
p(y \mid \theta, \phi) = \prod_i\text{Poisson} (y_i \mid \exp(\theta_{g(i)}) x_i).
$$
The signatures for this function are:
```
real laplace_marginal_poisson2_log_lpmf(array[] int y | array[] int y_index,
vector x, vector theta0,
function K_function, (...));

real laplace_marginal_tol_poisson2_log_lpmf(array[] int y | array[] int y_index,
vector x, vector theta0,
function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);

vector laplace_latent_poisson2_log_rng(array[] int y, array[] int y_index,
vector x, vector theta0,
function K_function, (...));

vector laplace_latent_tol_poisson2_log_rng(array[] int y, array[] int y_index,
vector x, vector theta0,
function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);
```


### Negative Binomial with log link

The negative binomial distribution generalizes the Poisson distribution by
introducing the dispersion parameter $\eta$. The corresponding likelihood is then
$$
p(y \mid \theta, \phi) = \prod_i\text{NegBinomial2} (y_i \mid \exp(\theta_{g(i)}), \eta).
$$
Here we use the alternative parameterization implemented in Stan, meaning that
$$
\mathbb E(y_i) = \exp (\theta_{g(i)}), \\
\text{Var}(y_i) = \mathbb E(y_i) + \frac{(\mathbb E(y_i))^2}{\eta}.
$$
The arguments for the likelihood function are:

* `y`: the observed counts
* `y_index`: an array whose $i^\text{th}$ element indicates the group to
which the $i^\text{th}$ observation belongs.
* `eta`: the overdispersion parameter.

The function signatures for the embedded Laplace approximation with a negative
binomial likelihood are
```
real laplace_marginal_neg_binomial_2_log_lpmf(array[] int y |
array[] int y_index, real eta, vector theta0,
function K_function, (...));

real laplace_marginal_tol_neg_binomial_2_log_lpmf(array[] int y |
array[] int y_index, real eta, vector theta0,
function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);

vector laplace_latent_neg_binomial_2_log_rng(array[] int y,
array[] int y_index, real eta, vector theta0,
function K_function, (...));

vector laplace_latent_tol_neg_binomial_2_log_rng(array[] int y,
array[] int y_index, real eta, vector theta0,
function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);
```
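
A hypothetical usage sketch, with the dispersion `eta` declared as a positive
parameter and given an illustrative prior, and the other arguments as in the
earlier sketches:
```
parameters {
  real<lower=0> eta;  // overdispersion parameter
  // ... other hyperparameters (e.g., alpha, rho)
}
model {
  eta ~ gamma(2, 0.1);  // hypothetical prior on the dispersion
  target += laplace_marginal_neg_binomial_2_log_lpmf(y | y_index, eta, theta_0,
                                                     K_function, (x, alpha, rho));
}
```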

### Bernoulli with logit link

Given binary outcomes $y_i \in \{0, 1\}$ and a Bernoulli model, the likelihood is
$$
p(y \mid \theta, \phi) = \prod_i\text{Bernoulli} (y_i \mid \text{logit}^{-1}(\theta_{g(i)})).
$$
The arguments of the likelihood function are:

* `y`: the observed binary outcomes
* `y_index`: an array whose $i^\text{th}$ element indicates the group to
which the $i^\text{th}$ observation belongs.

The function signatures for the embedded Laplace approximation with a Bernoulli likelihood are
```
real laplace_marginal_bernoulli_logit_lpmf(array[] int y |
array[] int y_index, vector theta0,
function K_function, (...));

real laplace_marginal_tol_bernoulli_logit_lpmf(array[] int y |
array[] int y_index, vector theta0,
function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);

vector laplace_latent_bernoulli_logit_rng(array[] int y,
array[] int y_index, vector theta0,
function K_function, (...));

vector laplace_latent_tol_bernoulli_logit_rng(array[] int y,
array[] int y_index, vector theta0,
function K_function, (...),
real tol, int max_steps, int hessian_block_size,
int solver, int max_steps_linesearch);
```

<!-- ## Draw approximate samples for out-of-sample latent variables. -->

<!-- In many applications, it is of interest to draw latent variables for -->
<!-- in-sample and out-of-sample predictions. We respectively denote these latent -->
<!-- variables $\theta$ and $\theta^*$. In a latent Gaussian model, -->
<!-- $(\theta, \theta^*)$ jointly follow a prior multivariate normal distribution: -->
<!-- $$ -->
<!-- \theta, \theta^* \sim \text{MultiNormal}(0, {\bf K}(\phi)), -->
<!-- $$ -->
<!-- where $\bf K$ designates the joint covariance matrix over $\theta, \theta^*$. -->

<!-- We can break $\bf K$ into three components, -->
<!-- $$ -->
<!-- {\bf K} = \begin{bmatrix} -->
<!-- K & \\ -->
<!-- K^* & K^{**} -->
<!-- \end{bmatrix}, -->
<!-- $$ -->
<!-- where $K$ is the prior covariance matrix for $\theta$, $K^{**}$ the prior -->
<!-- covariance matrix for $\theta^*$, and $K^*$ the covariance matrix between -->
<!-- $\theta$ and $\theta^*$. -->

<!-- Stan supports the case where $\theta$ is associated with an in-sample -->
<!-- covariate $X$ and $\theta^*$ with an out-of-sample covariate $X^*$. -->
<!-- Furthermore, the covariance function is written in such a way that -->
<!-- $$ -->
<!-- K = f(..., X, X), \ \ K^{**} = f(..., X^*, X^*), \ \ K^* = f(..., X, X^*), -->
<!-- $$ -->
<!-- as is typically the case in Gaussian process models. -->





<!-- The -->
<!-- function `laplace_latent_rng` produces samples from the Laplace approximation -->
<!-- and admits nearly the same arguments as `laplace_marginal`. A key difference -->
<!-- is that -->
<!-- ``` -->
<!-- vector laplace_latent_rng(function ll_function, tupple (...), vector theta_0, -->
<!-- function K_function, tupple (...)); -->
<!-- ``` -->