1 Reminders
This section is dedicated to a comprehensive overview of the prerequisites to the inference of a Poisson log-normal model with auto-regressive Gaussian latent vectors. The theoretical context is that of generalized linear models (GLM) with latent variables, and graphical models.
1.1 Generalised linear models & latent variables
1.1.1 Exponential families
Exponential families form a very classical, but broad, class of statistical models. They are at the core of Generalised Linear Models, and their inference is usually carried out through Maximum Likelihood Estimation (MLE). The concept of exponential families is credited to E. J. G. Pitman, G. Darmois, and B. O. Koopman in 1935–1936.
Definition 1.1 (Exponential families (Murphy 2023), Section 2.4) Let \(\mu\) be a \(\sigma\)-finite measure on a measurable space \(\mathcal{Y}\subseteq \mathbb{R}^p\). Consider a family \(P_\theta, \theta \in \mathbb{R}^d\) of probability measures on \(\mathcal{Y}\). This family \(P_\theta, \theta \in \mathbb{R}^d\) is an exponential family if its densities with respect to \(\mu\) can be written in the following way: \[ p_\theta (y) = \frac{1}{Z (\theta)} h (y) \exp \left[ \langle \theta, T (y) \rangle \right] = h (y) \exp \left[ \langle \theta, T (y) \rangle - A (\theta) \right] \] where:
- the function \(Z: \mathbb{R}^d \rightarrow \mathbb{R}_+^*\) (resp. \(A: \theta \in \mathbb{R}^d \mapsto \log Z (\theta)\)) is called the partition function (resp. log-partition function);
- the function \(T: \mathcal{Y}\rightarrow \mathbb{R}^d\) is the sufficient statistic;
- the function \(h: \mathcal{Y}\rightarrow \mathbb{R}\) is the base measure, often we will have \(h=1\).
Example 1.1 Many statistical models using usual distributions are exponential families. Among them:
- Bernoulli family
- Binomial \(\mathcal{B}(n,p)\) family with known \(n\)
- Multivariate Gaussian family
- Exponential family
- Poisson family
- Geometric family
but there are also non-examples:
- Uniform distributions with unknown bounds
- Cauchy family (hence logistic)
- Hyper-geometric family
- Student family
- Most families of mixtures.
The intuition is that we must be able to separate the parameter from the variable by factorisation.
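For instance, for the Bernoulli \(\mathcal{B}(\pi)\) family this factorisation is explicit, with the log-odds as natural parameter: \[ p_\pi (y) = \pi^y (1-\pi)^{1-y} = \exp \left[ y \log \frac{\pi}{1-\pi} + \log (1 - \pi) \right], \quad y \in \{0, 1\}, \] so that \(\theta = \log \frac{\pi}{1-\pi}\), \(T(y) = y\), \(h = 1\) and \(A(\theta) = \log \left( 1 + \mathrm{e}\,^{\theta} \right)\).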
Proposition 1.1 (Log-partition function and cumulants) Let \(Y: \Omega \rightarrow \mathbb{R}^p\) be a random variable in an exponential family. Then: \[ \begin{aligned} \nabla A (\theta) &= \mathbb{E}_\theta [ T (Y) ] \\ \nabla^2 A (\theta) &= \mathrm{Cov}\,_\theta [ T(Y) ] \end{aligned} \]
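As an illustration of Proposition 1.1, the Poisson \(\mathcal{P}(\lambda)\) family in natural form has \(\theta = \log \lambda\), \(T(y) = y\), \(h(y) = 1/y!\) and \(A(\theta) = \mathrm{e}\,^\theta\), so that \[ A' (\theta) = \mathrm{e}\,^\theta = \lambda = \mathbb{E}_\theta [Y] \quad \text{and} \quad A'' (\theta) = \mathrm{e}\,^\theta = \lambda = \mathrm{Var}\,_\theta [Y], \] which recovers the usual mean and variance of the Poisson distribution.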
1.1.2 Generalised linear models
The exponential families can be used to generalise the classical linear regression model with normally distributed error, conditionally on the covariates.
Let \(X: \Omega \rightarrow \mathcal{X}\subseteq \mathbb{R}^d\) and \(Y: \Omega \rightarrow \mathcal{Y}\subseteq \mathbb{R}^p\) be random variables. A generalised linear model consists of three elements:
- An unknown parameter matrix \(\mathbf{B}= (\beta_1 | \dots | \beta_p) \in \mathbb{R}^{d \times p}\);
- A dispersion parameter \(\sigma\);
- An over-dispersed natural exponential family modelling the distribution of \(Y\) conditionally on \(X\), whose parameter is the linear predictor \(\theta = \mathbf{B}^\top X\) and sufficient statistic is the identity.
The notion of generalised linear model was first investigated by Wedderburn and Nelder (1972). For more details on the matter, see Dobson and Barnett (2018).
Definition 1.2 (GLM (Murphy 2023), Chapter 15) A generalised linear model or GLM is the combination of two variables \(X: \Omega \rightarrow \mathcal{X}\subseteq \mathbb{R}^d\) and \(Y: \Omega \rightarrow \mathcal{Y}\subseteq \mathbb{R}^p\) with the following conditional density for \(Y|X\), parametrised by \(\mathbf{B}\in \mathbb{R}^{d \times p}, \sigma > 0\): \[ p_{\mathbf{B}, \sigma} ( y \mid x ) = h (y, \sigma^2 ) \exp \left[ \frac{\langle \mathbf{B}^\top x, y \rangle - A ( \mathbf{B}^\top x )}{\sigma^2} \right] \]
By Proposition 1.1, we have the following link between the log-partition function and the conditional expectation \(\mathbb{E}[ Y | X ]\): \[ \mathbb{E}[ Y | X ] = \nabla A ( \mathbf{B}^\top X ) \]
The link function \(g\) such that \(g( \mathbb{E}[ Y | X ] ) = \mathbf{B}^\top X\) verifies: \[ g^{-1} = \nabla A \]
Example 1.2 (Poisson regression) If we have \(Y \in \mathbb{N}\) we can use: \[ p_\beta (y | x) = e^{- \mu} \frac{\mu^y}{y!} \text{ where } \mu = e^{\langle x, \beta \rangle} \] Then \(A ( \langle x, \beta \rangle ) = \mu = e^{ \langle x, \beta \rangle }\) and \(h(y) = \frac{1}{y!}\).
1.1.3 Latent variables & Expectation-Maximisation (EM)
When the observed data is complete, the estimation of a GLM can be carried out through simple MLE (Murphy 2023, sec. 15.1.3): the calculations are explicit and the optimisation often relies on the Iteratively Reweighted Least Squares (IRLS) algorithm or Stochastic Gradient Descent.
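As an illustration, here is a minimal sketch of a univariate Poisson regression fitted by IRLS with the base-R glm function; the data and coefficient values are simulated for the example only.
Code
# Illustrative univariate Poisson regression fitted by IRLS (simulated data)
set.seed(1)
x_sim <- rnorm(500)
beta_sim <- c(0.5, 1.2)                                      # hypothetical coefficients
y_sim <- rpois(500, exp(beta_sim[1] + beta_sim[2] * x_sim))  # log link
glm_fit <- glm(y_sim ~ x_sim, family = poisson())            # MLE via IRLS
coef(glm_fit)                                                # should be close to beta_sim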
However, for the Poisson log-normal model we need incomplete-data models, i.e. models with hidden variables. These have been studied extensively in Robin (2018). In addition to the observed response variable \(Y\), we assume that there is a hidden variable \(Z\).
Then, the calculation of the MLE is not explicit anymore, but in some cases the classical Expectation-Maximisation algorithm (Dempster, Laird, and Rubin 1977) solves this issue.
The EM algorithm is an iterative procedure that can be described as follows:
(Initialisation) Choose \(\widehat\theta^{(0)} \in \Theta\).
(E-step) For each \(\theta \in \Theta\), compute the conditional expectation of the complete log-likelihood conditionally on the observed data \(Y\) with the current estimate \(\widehat\theta^{(h)}\) of the parameters:
\[ Q (\theta, \widehat\theta^{(h)}) := \mathbb{E}_{\,\widehat\theta^{(h)}} [ \ell (\theta; Y, Z) | Y ] \]
(M-step) Find, if it exists, a new estimator \(\widehat\theta^{(h+1)} \in \Theta\) of \(\theta\) that maximises said expectation:
\[ \widehat\theta^{(h+1)} \in \underset{\theta \in \Theta}{\mathrm{argmax}\,} Q (\theta, \widehat{\theta}^{(h)}) \]
This way we construct a family \((\widehat{\theta}^{(h)})_h\) of estimators of \(\theta\).
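To make the two steps concrete, here is a minimal sketch of EM on a toy model (a univariate two-component Gaussian mixture with equal weights and unit variances, not the PLN model), for which both steps are available in closed form; all names and values are chosen for the example.
Code
# Toy EM: Y ~ 0.5 N(mu1, 1) + 0.5 N(mu2, 1), unknown means, hidden component label Z
set.seed(2)
y_mix <- c(rnorm(300, -2), rnorm(300, 2))
mu_hat <- c(-0.5, 0.5)                              # initialisation
for (h in 1:50) {
  # E-step: posterior probability that each observation belongs to component 2
  d1 <- dnorm(y_mix, mu_hat[1], 1)
  d2 <- dnorm(y_mix, mu_hat[2], 1)
  tau <- d2 / (d1 + d2)
  # M-step: weighted means maximise Q(theta, theta_hat)
  mu_hat <- c(sum((1 - tau) * y_mix) / sum(1 - tau),
              sum(tau * y_mix) / sum(tau))
}
mu_hat                                              # close to c(-2, 2)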
1.2 Poisson log-normal model
In this section, we define the Poisson log-normal distribution and model. We will see that the Expectation-Maximisation algorithm is unavailable, since we are unable to produce a closed form of the moments of \(p_\theta ( Z | Y )\). This comes from the fact that PLN is not a GLM, but rather a GLMM (Generalised linear mixed model).
1.2.1 Definition
Definition 1.3 (PLN distribution (Aitchison and Ho 1989), Section 2) For \(\mu \in \mathbb{R}^p\) and a \(p \times p\)-positive-definite matrix \(\boldsymbol{\Sigma}\), the Poisson log-normal distribution, denoted \(\mathrm{PLN}( \mu, \boldsymbol{\Sigma})\) is a discrete distribution on \(\mathbb{N}^p\) whose pmf is:
\[ p(y) = \int_{\mathbb{R}^p} \prod_{j=1}^{p} f \left( y_j; \exp (z_j) \right) \, g \left( z; \mu, \boldsymbol{\Sigma}\right) \, dz \]
where \(f ( \cdot; \lambda )\) is the pmf of the Poisson \(\mathcal{P}( \lambda )\) and \(g \left( \cdot; \mu, \boldsymbol{\Sigma}\right)\) is the pdf of the Gaussian \(\mathcal{N}_p \left( \mu, \boldsymbol{\Sigma}\right)\) distribution.
Remark.
It is the distribution of \(Y\) when \(Z \sim \mathcal{N}_p (0, \boldsymbol{\Sigma})\): \[ Y_j | Z \overset{\perp}{\sim} \mathcal{P}\left( \exp (\mu_j + Z_j) \right), 1 \leq j \leq p \]
As a multivariate regression model, with \(\mu = \mathbf{B}^\top X\), it is not a GLM but Generalised Linear Mixed Model because of the latent variable \(Z\).
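To make the hierarchical construction explicit, here is a minimal hand-rolled sampler following Definition 1.3 (the function name rpln_manual is chosen for this sketch; it ignores offsets, unlike the rPLN function of the PLNmodels package used below):
Code
# Hand-rolled PLN sampler: Gaussian latent layer, then conditionally independent Poisson counts
rpln_manual <- function(n, mu, Sigma) {
  Z <- MASS::mvrnorm(n, mu = rep(0, length(mu)), Sigma = Sigma)  # latent layer, Z ~ N(0, Sigma)
  lambda <- exp(sweep(Z, 2, mu, "+"))                            # intensities exp(mu_j + Z_j)
  matrix(rpois(length(lambda), lambda), nrow = n)                # Y_j | Z ~ P(lambda_j)
}
Y_toy <- rpln_manual(10, mu = rep(log(20), 2), Sigma = diag(2))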
1.2.1.1 Comparing Poisson and PLN pmfs
Code
library(PLNmodels) # provides rPLN()
library(dplyr)
library(tidyr)
library(ggplot2)

# Define parameters
n <- 1e4
p <- 6
mu <- rep(log(20), p)
Sigma <- diag(c(1, 2, 1.3, 3, 1, 0.4))
Sigma[1, 2] <- 1
Sigma[2, 1] <- 1
depths <- rowSums(exp(rep(1, n) %o% diag(Sigma) / 2 + mu)) # Null offsets

# Compute empirical mass function
epmf <- function(i, values) {
  sum(values == i) / n
}

# Generate PLN and Poisson data
pln <- data.frame(rPLN(n, mu, Sigma, depths))

mode <- which.max(sapply(
  1:max(pln$Y1),
  function(x) epmf(x, matrix(pln$Y1))
))
Y0 <- data.frame(Y0 = rpois(n, lambda = mode))

y <- data.frame(pln |> select(Y1, Y2), Y0)

quot <- function(i) {
  x <- epmf(i, Y0)
  y <- epmf(i, matrix(pln$Y1))
  if (y > 0) {
    dnorm(log(mode), mu[1], Sigma[1, 1]) * x / y
  } else {
    0
  }
}

# Compare pmf between PLN and Poisson
p0 <- ggplot(y |> select(Y1, Y0) |> gather()) +
  stat_count(aes(x = value, y = after_stat(prop), fill = key),
    position = position_dodge()
  ) +
  xlim(-0.5, 30.5)
p0
Remark. There is clearly an over-dispersion phenomenon, compared to the Poisson distribution.
1.2.1.2 Example of multivariate PLN data
Code
library(viridis) # scale_color_viridis()
library(ggExtra) # ggMarginal()

# 2D kernel density estimate at each data point (alternative to geom_hex)
get_density <- function(x, y, ...) {
  dens <- MASS::kde2d(x, y, ...)
  ix <- findInterval(x, dens$x)
  iy <- findInterval(y, dens$y)
  ii <- cbind(ix, iy)
  return(dens$z[ii])
}

# Multivariate plot with marginal pmfs
p <- ggplot(
  y |>
    group_by(Y1, Y2) |>
    reframe(Y0, freq = n() / n) |>
    ungroup(),
  aes(x = Y2, y = Y1)
) +
  geom_point(aes(color = freq)) +
  scale_color_viridis() +
  theme(legend.position = "left") +
  coord_fixed(ratio = 1) +
  xlim(0, 100) +
  ylim(0, 75)
p1 <- ggMarginal(p, type = "histogram")
p1
1.2.1.3 Remarks
Remark. When modelling real data, the regression layer is corrected by the sampling effort through an offset term (for instance the ratio of the sample to the whole population).
Why is the Poisson log-normal useful for abundance tables (e.g Joint Species Distribution Models)?
Poisson: at each site, we count occurrences that happen rarely and independently (for instance the number of animals appearing at a particular site):
The name "law of rare events" may be misleading because the total count of success events in a Poisson process need not be small if the parameter \(np\) is not small. For example, the number of telephone calls to a busy switchboard in one hour follows a Poisson distribution with the events appearing frequent to the operator, but they are rare from the point of view of one average member of the population who is very unlikely to make a call to that switchboard in that hour.
Multivariate: At each site, we want to count multiple species simultaneously (abundance vector) while accounting for their correlations: we need a multivariate distribution whose marginals are Poisson distributions and whose covariance matrix is arbitrary, which is not easy to design (without forcing non-negative covariances).
Log-normal: \(\log\) is the link function, and the latent layer is normal (Gaussian) because it keeps the calculations tractable.
1.2.2 Properties
There are simple expressions for the first two cumulants of the PLN distribution:
- Expectations: \[ \mathbb{E} Y_{ij} = \mathbb{E}\, \mathbb{E}[ Y_{ij} | Z_{ij} ] = \mathrm{e}\,^{ \mu_{ij} + \sigma_{jj}/2 } \]
- Variances: \[ \mathrm{Var}\,Y_{ij} = \mathbb{E}\, \mathrm{Var}\,( Y_{ij} | Z_{ij} ) + \mathrm{Var}\,( \mathbb{E}[ Y_{ij} | Z_{ij} ] ) = \mathbb{E}Y_{ij} + \mathbb{E}[ Y_{ij} ]^2 \, \left( \mathrm{e}\,^{\sigma_{jj}} - 1 \right) \]
They can also be written in matrix notation: \[ \mathbb{E}[Y] = \exp \left( \mu + \frac{1}{2} \mathrm{diag}\,\boldsymbol{\Sigma}\right) \] where \(\mathrm{diag}\,\boldsymbol{\Sigma}\in \mathbb{R}^p\) denotes the vector of diagonal entries of \(\boldsymbol{\Sigma}\) and \(\exp\) is taken entry-wise. \[ \mathrm{Var}\,(Y) = \Delta_Y + \Delta_Y \left( \exp \boldsymbol{\Sigma}- J \right) \Delta_Y \] where \(\Delta_Y = \mathrm{diag}\,\mathbb{E}[Y]\) is the \(p \times p\)-diagonal matrix whose diagonal is the vector \(\mathbb{E}[Y]\) and \(J\) is the \(p \times p\) all-ones matrix.
We notice that, if the components of \(Z\) are independent (\(\boldsymbol{\Sigma}\) diagonal), then the components of \(Y\) are uncorrelated.
These formulas also prove the over-dispersion phenomenon noticed earlier: \(\mathrm{Var}\,Y_{ij} > \mathbb{E}Y_{ij}\).
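These formulas are easy to check by simulation; here is a quick sketch with illustrative parameters, sampling the two layers directly as in Definition 1.3:
Code
# Empirical check of the PLN mean and variance formulas (illustrative parameters)
set.seed(3)
mu_chk <- c(1, 2)
Sigma_chk <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
Z_chk <- MASS::mvrnorm(1e5, mu = mu_chk, Sigma = Sigma_chk)
Y_chk <- matrix(rpois(length(Z_chk), exp(Z_chk)), ncol = 2)
EY <- exp(mu_chk + diag(Sigma_chk) / 2)
colMeans(Y_chk); EY                                           # empirical vs theoretical means
apply(Y_chk, 2, var); EY + EY^2 * (exp(diag(Sigma_chk)) - 1)  # empirical vs theoretical variances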
1.3 Generalised linear mixed models
1.3.1 Definition
A generalised linear mixed model is a type of hierarchical model. It generalises the GLM approach by adding a random effect \(Z\) to the fixed effect \(\mathbf{B}^\top X\).
Definition 1.4 (GLMM) A Generalised linear mixed model is the combination of three variables \(X\), \(Y\) and \(Z \sim \mathcal{N}_p (0, \boldsymbol{\Sigma})\) such that \(X\) and \(Z\) are independent and, conditionally on \(X\) and \(Z\), \(Y\) has its distribution in an exponential family of the form: \[ p_\beta ( y \mid x, z ) = h (y) \exp \left[ \frac{ \langle \mathbf{B}^\top x + z, y \rangle - A ( \mathbf{B}^\top x + z )}{\sigma^2} \right] \]
1.3.2 Variational Expectation-Maximisation (VEM)
The EM algorithm is actually a special case of the Minorisation-Maximisation (MM) algorithm, a general iterative procedure in which the E-step is replaced by a minimisation, over the set of all possible distributions for \(Z|Y\), of the Kullback-Leibler divergence to the true conditional distribution.
If we restrict this set to some smaller parametric set \(\mathcal{Q}= \left\{Q_\psi, \psi \in \Psi \subseteq \mathbb{R}^N \right\}\) of variational distributions to approximate the conditional distribution of \(Z|Y\), we get the variational EM or VEM algorithm. This amounts to lower-bounding the log-likelihood by the evidence lower bound (ELBO):
\[ J (\theta, q; Y) = \mathbb{E}_q [ \log p_\theta ( Y, Z ) - \log q ( Z ) ] \]
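This is indeed a lower bound: for any distribution \(q\) on the latent space, the log-likelihood decomposes as \[ \log p_\theta (Y) = J (\theta, q; Y) + \mathrm{KL} \left( q (Z) \,\|\, p_\theta ( Z \mid Y ) \right) \geq J (\theta, q; Y), \] with equality if and only if \(q = p_\theta ( \cdot \mid Y )\), which is exactly the choice made by the E-step of the classical EM algorithm.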
The VEM algorithm is described as follows:
(Initialisation) Choose \((\widehat{\theta}^{(0)}, \psi^{(0)})\).
(VE-step) Maximise the ELBO over the variational parameters, given the current estimates \((\widehat{\theta}^{(h)}, \psi^{(h)})\) of the parameters:
\[ \psi^{(h+1)} \in \underset{\psi \in \Psi}{\mathrm{argmax}\,} J(\widehat{\theta}^{(h)}, q_\psi; Y) \]
(M-step) Find the new estimated parameters \(\widehat{\theta}^{(h+1)}\) that maximise said ELBO:
\[ \widehat\theta^{(h+1)} \in \underset{\theta \in \Theta}{\mathrm{argmax}\,} J (\theta, q_{\psi^{(h+1)}}; Y) \]
If we choose the parametric set of Gaussian distributions as the set of variational distributions: \[ \mathcal{Q}= \left\{ \left. Q_\psi = Q_{\psi_1} \otimes \dots \otimes Q_{\psi_n} \right| Q_{\psi_i} = \mathcal{N}_p ( m_i, \mathbf{S}_i ), \psi_i = (m_i, \mathbf{S}_i) \in \mathbb{R}^p \times \mathbb{R}^{p \times p}, 1 \leq i \leq n \right\} \] then the parameters \(m_i\) (resp. \(\mathbf{S}_i\)) can be seen as estimators of the conditional expectations \(\mathbb{E}[ Z_i | Y_i]\) (resp. conditional variances \(\mathrm{Var}\,[ Z_i | Y_i ]\)).
1.3.2.1 Example
Let us show, using the PLNmodels package, an implementation of the VEM algorithm applied to data generated with the rPLN function:
Code
row.names(pln) <- 1:n
offsets <- matrix(1, n, 1)
row.names(offsets) <- 1:n
colnames(offsets) <- "offsets"
data <- prepare_data(pln, offsets, offsets)

fit <- PLN(Abundance ~ 1, data)
#> 
#> Initialization...
#> Adjusting a full covariance PLN model with nlopt optimizer
#> Post-treatments...
#> DONE!
fit$print()
#> A multivariate Poisson Lognormal fit with full covariance model.
#> ==================================================================
#> nb_param loglik BIC ICL
#> 27 -269519.1 -269643.5 -261312.9
#> ==================================================================
#> * Useful fields
#> $model_par, $latent, $latent_pos, $var_par, $optim_par
#> $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
#> * Useful S3 methods
#> print(), coef(), sigma(), vcov(), fitted()
#> predict(), predict_cond(), standard_error()
fit$model_par
#> $B
#> Y1 Y2 Y3 Y4 Y5 Y6
#> (Intercept) 2.978341 3.001901 3.028961 3.058163 3.004381 2.986443
#>
#> $Sigma
#> Y1 Y2 Y3 Y4 Y5
#> Y1 1.010688885 0.9977050040 -0.0039137053 -0.0162024094 -0.003823876
#> Y2 0.997705004 1.9956929975 -0.0147551093 0.0009999992 0.005746798
#> Y3 -0.003913705 -0.0147551093 1.3004061007 -0.0077305504 -0.009989727
#> Y4 -0.016202409 0.0009999992 -0.0077305504 2.9106765576 0.013224834
#> Y5 -0.003823876 0.0057467977 -0.0099897271 0.0132248339 0.975630382
#> Y6 -0.002438751 -0.0011835911 0.0007579438 0.0101080757 0.001634181
#> Y6
#> Y1 -0.0024387514
#> Y2 -0.0011835911
#> Y3 0.0007579438
#> Y4 0.0101080757
#> Y5 0.0016341809
#> Y6 0.3985317563
#>
#> $Omega
#> Y1 Y2 Y3 Y4 Y5
#> Y1 1.954080094 -0.976977171 -0.005041975 0.011109458 0.013196693
#> Y2 -0.976977171 0.989587381 0.008182928 -0.005703538 -0.009492254
#> Y3 -0.005041975 0.008182928 0.769140698 0.001981930 0.007783198
#> Y4 0.011109458 -0.005703538 0.001981930 0.343682501 -0.004546752
#> Y5 0.013196693 -0.009492254 0.007783198 -0.004546752 1.025234078
#> Y6 0.008729880 -0.002871480 -0.001551517 -0.008651000 -0.004050894
#> Y6
#> Y1 0.008729880
#> Y2 -0.002871480
#> Y3 -0.001551517
#> Y4 -0.008651000
#> Y5 -0.004050894
#> Y6 2.509494203
#>
#> $Theta
#> (Intercept)
#> Y1 2.978341
#> Y2 3.001901
#> Y3 3.028961
#> Y4 3.058163
#> Y5 3.004381
#> Y6 2.986443
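The variational parameters \(m_i\) and \(\mathbf{S}_i\) discussed above are stored in the var_par field listed among the useful fields of the fit; since their exact layout may depend on the PLNmodels version, a safe way to inspect them is simply:
Code
# Inspect the estimated variational parameters (posterior means and variances of the latent layer)
str(fit$var_par, max.level = 1)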
Code
# Numerical check of Definition 1.3 in dimension p = 2 (mu = 0, Sigma = I_2):
# evaluate the PLN pmf by numerical integration over the latent Gaussian vector.
from scipy import integrate
from scipy.stats import multivariate_normal, poisson
import numpy as np

reals = [-np.inf, np.inf]

# Standard bivariate Gaussian density of the latent vector (the g of Definition 1.3)
norm = lambda z1, z2: multivariate_normal.pdf([z1, z2], mean=np.array([0, 0]))
g = norm

pois = lambda y, mu: poisson.pmf(y, mu)

y1, y2 = 0, 0  # example count values (unused below: p() takes its point as argument)

def p(args):
    # Integrand: product of Poisson pmfs conditionally on z, times the Gaussian density
    h = lambda z1, z2: pois(args[0], np.exp(z1)) * pois(args[1], np.exp(z2)) * g(z1, z2)
    return integrate.nquad(h, [reals, reals], opts={'epsabs': 0.01})

# All points of {0, 1}^2
indices = np.indices([2, 2]).transpose(1, 2, 0).reshape(-1, 2)

# Total PLN mass on {0, 1}^2 (nquad returns a (value, error) pair, keep the value)
np.sum([p(x)[0] for x in indices])

# Sanity check: the Gaussian density integrates to 1
integrate.nquad(g, [reals, reals])
1.3.3 Convergence of the \(\mathrm{VEM}\) estimator
It was shown by the PLN team that, as an M-estimator and under some classical assumptions:
- (A1) Assume that the parameter space \(\Theta\) of \(\theta\) is compact;
- (A2) Assume that the variational parameter space \(\boldsymbol{\Psi}\) of the variational parameters \(\psi\) is bounded;
then the VEM estimator is:
- consistent, but towards an unknown parameter \(\bar{\theta}\) which might differ from the true parameter \(\theta^{\star}\);
- although simulations suggest it is nearly unbiased with respect to the true parameter \(\theta^{\star}\).
Theorem 1.1 (Consistency of \(\hat{\theta}\)) Under assumptions \((A1) - (A2)\), assume that the map \(\theta \mapsto \mathbb{E}[ \sup_{\psi \in \Psi} J(\theta, q_\psi; Y) ]\) attains a finite global maximum at \(\bar{\theta}\) (which can be different from the true parameter \(\theta^{\star}\)). Then \(\hat{\theta} \rightarrow \bar{\theta}\) under \(P_{\theta^{\star}}\).
Another general approach can be found in Gunawardana and Byrne (2005) where they define the notion of Generalised Alternating Minimisation procedures.
1.4 Auto-regressive (AR) multivariate Gaussian processes
In this subsection, we define the stationary multivariate auto-regressive distribution. The notion dates back to Udny Yule and Gilbert Walker, who designed the AR process of order \(q\) around 1930.
Definition 1.5 (\(\mathrm{AR}(q)\)) \[ Z_i = \sum_{k=1}^q \varphi_k Z_{i-k} + \varepsilon_i \] where the \(\varepsilon_i, 1 \leq i \leq n\) are i.i.d. Gaussian variables.
We use the multivariate generalisation in dimension \(p\).
Definition 1.6 (\(\mathrm{AR}_p (q)\)) \[ Z_i = \sum_{k=1}^q \boldsymbol{\Phi}_k Z_{i-k} + \varepsilon_i \] where the \(\varepsilon_i, 1 \leq i \leq n\) are i.i.d. multivariate Gaussian variables in \(\mathbb{R}^p\).
We restrict to the particular case of order \(q=1\) where we can easily define a stationarity condition so that every vector has the same first two moments.
Definition 1.7 (\(\mathrm{SAR}(\mu, \boldsymbol{\Sigma}, \boldsymbol{\Phi})\)) If the parameters \(\boldsymbol{\Sigma}\) and \(\boldsymbol{\Phi}\) verify the stationarity condition: \[ \boldsymbol{\Sigma}_\varepsilon = \boldsymbol{\Sigma}- \boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top \succ 0\] then the random vectors \(Z_1, \dots, Z_n: \Omega \rightarrow \mathbb{R}^p\) are said to follow the \(\mathrm{SAR}(\mu, \boldsymbol{\Sigma}, \boldsymbol{\Phi})\) distribution if they follow an \(\mathrm{AR}_p (1)\) recursion: \[ Z_1 \sim \mathcal{N}_p (\mu, \boldsymbol{\Sigma}) \quad \text{and} \quad Z_i = \boldsymbol{\Phi}Z_{i-1} + \varepsilon_i \] where \(\mu_\varepsilon = \mu - \boldsymbol{\Phi}\mu\) and the \(\varepsilon_i \sim \mathcal{N}_p ( \mu_\varepsilon, \boldsymbol{\Sigma}_\varepsilon ), 1 \leq i \leq n\) are i.i.d.
Remark.
- We have \(Z_i \sim \mathcal{N}_p (\mu, \boldsymbol{\Sigma}), 1 \leq i \leq n\).
- If \(\boldsymbol{\Phi}= \mathbf0\), \(Z_i = \varepsilon_i, 1 \leq i \leq n\) are i.i.d.
- If \(\boldsymbol{\Phi}= I_p\) (a degenerate limit in which \(\boldsymbol{\Sigma}_\varepsilon = 0\)), all the vectors are equal: \(Z_1 = \dots = Z_n\).
One can easily compute the full covariance matrix: \[ \begin{pmatrix} \boldsymbol{\Sigma}& \boldsymbol{\Sigma}\boldsymbol{\Phi}^\top & \cdots & \boldsymbol{\Sigma}\left( \boldsymbol{\Phi}^{n-1} \right)^\top \\ \boldsymbol{\Phi}\boldsymbol{\Sigma}& \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \boldsymbol{\Sigma}\boldsymbol{\Phi}^\top \\ \boldsymbol{\Phi}^{n-1} \boldsymbol{\Sigma}& \cdots & \boldsymbol{\Phi}\boldsymbol{\Sigma}& \boldsymbol{\Sigma} \end{pmatrix} \] from which we deduce the existence and uniqueness of the stationary auto-regressive distribution \(\mathrm{SAR}(\mu, \boldsymbol{\Sigma}, \boldsymbol{\Phi})\), since the joint vector \((Z_1, \dots, Z_n)\) is Gaussian and therefore characterised by its first two moments.
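As an illustration, here is a minimal simulation sketch of the \(\mathrm{SAR}\) recursion with arbitrary illustrative parameters (all names are chosen for the example); the stationarity condition is checked numerically before sampling.
Code
# Simulate a SAR(mu, Sigma, Phi) trajectory following Definition 1.7 (illustrative parameters)
set.seed(4)
n_sar <- 1000
mu_sar <- c(0, 1)
Sigma_sar <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
Phi_sar <- diag(c(0.6, 0.4))
Sigma_eps <- Sigma_sar - Phi_sar %*% Sigma_sar %*% t(Phi_sar)
stopifnot(all(eigen(Sigma_eps)$values > 0))      # stationarity condition
mu_eps <- drop(mu_sar - Phi_sar %*% mu_sar)
Z_sar <- matrix(NA, n_sar, 2)
Z_sar[1, ] <- MASS::mvrnorm(1, mu_sar, Sigma_sar)
for (i in 2:n_sar) {
  Z_sar[i, ] <- Phi_sar %*% Z_sar[i - 1, ] + MASS::mvrnorm(1, mu_eps, Sigma_eps)
}
colMeans(Z_sar)                                  # time average, close to mu
cov(Z_sar)                                       # close to Sigma (marginal covariance)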
1.5 \(\mathrm{KL}\)-divergence between two Gaussian multivariate distributions
Proposition 1.2 We denote by \(\| \cdot \|_{\boldsymbol{\Sigma}}\) the norm associated with a positive-definite symmetric matrix \(\boldsymbol{\Sigma}\):
\[ \| x \|_{\boldsymbol{\Sigma}}^2 = \langle \boldsymbol{\Sigma}^{-1} x, x \rangle \]
Then:
\[ \mathrm{KL} \left( \mathcal{N}_p ( \mu_0, \boldsymbol{\Sigma}_0 ) \| \mathcal{N}_p ( \mu_1, \boldsymbol{\Sigma}_1 ) \right) = \frac{1}{2} \left[ \| \mu_0 - \mu_1 \|_{\boldsymbol{\Sigma}_1}^2 - \log \left| \boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_0 \right| + \operatorname{Tr} \left( \boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_0 \right) - p \right] \]
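A direct R transcription of this formula, as a small helper written for illustration (kl_gaussian is not a library function):
Code
# KL divergence between N_p(mu0, Sigma0) and N_p(mu1, Sigma1), as in Proposition 1.2
kl_gaussian <- function(mu0, Sigma0, mu1, Sigma1) {
  p <- length(mu0)
  M <- solve(Sigma1, Sigma0)                               # Sigma1^{-1} Sigma0
  0.5 * (drop(t(mu0 - mu1) %*% solve(Sigma1, mu0 - mu1))   # || mu0 - mu1 ||^2_{Sigma1}
         - log(det(M)) + sum(diag(M)) - p)
}
kl_gaussian(c(0, 0), diag(2), c(1, 0), 2 * diag(2))        # sanity check: non-negative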