To better understand the PLN distribution, and to assess the impact of the VEM algorithm on the inference of the model, a naive idea is to estimate the parameters from the expectation and covariance matrix of the counts (method of moments).
We can solve this system for the unknown parameters: \[ \boldsymbol{\Sigma}= \log \left( J + \Delta_Y^{-1} \left( \boldsymbol{\Sigma}_Y - \Delta_Y \right) \Delta_Y^{-1} \right) \]\[ \mu = \log \mu_Y - \frac{1}{2} \mathrm{diag}\,\boldsymbol{\Sigma}\] where the logarithm is applied entry-wise, and then estimate \(\boldsymbol{\Sigma}_Y\) and \(\mu_Y\) by their empirical counterparts.
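A minimal R sketch of these moment estimators (taking \(J\) to be the all-ones matrix and \(\Delta_Y = \mathrm{diag}\,\mu_Y\); the function name is illustrative):

```r
# Method-of-moments estimators for (mu, Sigma) from a count matrix Y (n x p).
# Requires the empirical moments to satisfy the over-dispersion condition, so
# that the argument of the entry-wise logarithm stays positive.
pln_moment_estimators <- function(Y) {
  mu_Y    <- colMeans(Y)                  # empirical mean of the counts
  Sigma_Y <- cov(Y)                       # empirical covariance of the counts
  Delta_Y <- diag(mu_Y)
  J       <- matrix(1, ncol(Y), ncol(Y))  # all-ones matrix
  Sigma   <- log(J + solve(Delta_Y) %*% (Sigma_Y - Delta_Y) %*% solve(Delta_Y))
  mu      <- log(mu_Y) - diag(Sigma) / 2
  list(mu = mu, Sigma = Sigma)
}
```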
2.1.1 Theoretical performance analysis
Two problems arise:
The data must empirically satisfy the over-dispersion hypothesis for the logarithm to be well defined.
It seems hard to control the bias introduced by estimating \(\Delta_Y^{-1}\). Using a Taylor series expansion, one can show that it is impossible to estimate the inverse of a Poisson parameter without bias.
The moment estimators perform poorly compared to the VEM estimators, especially as there is no guarantee that the empirical moments satisfy the over-dispersion condition.
To obtain better results, we could try removing the zeroes from the counts, estimating the Poisson parameters, and then adding back the appropriate number of zeroes.
However, the moment estimators could still provide an alternative initialisation for the VEM algorithm.
2.2 The new model
The latent layer is now given an autoregressive structure: \[ (Z_1, \dots, Z_n) \sim \mathrm{SAR}(\mu, \boldsymbol{\Sigma}, \boldsymbol{\Phi}) \] while the counts remain conditionally independent: \[ Y_{ij} \mid Z_i \overset{\perp}{\sim} \mathcal{P}( \exp Z_{ij}), \quad 1 \leq j \leq p, \; 1 \leq i \leq n \]
If \(\boldsymbol{\Phi}= \mathbf0\), then we retrieve the classical PLN model.
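For intuition, a minimal simulation sketch follows; the first-order recursion below is only one possible reading of \(\mathrm{SAR}(\mu, \boldsymbol{\Sigma}, \boldsymbol{\Phi})\), assumed here for illustration:

```r
# Illustrative simulation of a latent-AR PLN model, assuming the recursion
# Z_i - mu = Phi (Z_{i-1} - mu) + eps_i with eps_i ~ N(0, Sigma).
simulate_pln_ar <- function(n, mu, Sigma, Phi) {
  p <- length(mu)
  Z <- matrix(0, n, p)
  Z[1, ] <- MASS::mvrnorm(1, mu, Sigma)              # initial latent state
  for (i in 2:n) {
    eps    <- MASS::mvrnorm(1, rep(0, p), Sigma)
    Z[i, ] <- mu + Phi %*% (Z[i - 1, ] - mu) + eps   # autoregressive latent layer
  }
  Y <- matrix(rpois(n * p, lambda = exp(Z)), n, p)   # conditionally independent counts
  list(Y = Y, Z = Z)
}
```

With \(\boldsymbol{\Phi}= \mathbf0\) the latent draws are i.i.d. \(\mathcal{N}(\mu, \boldsymbol{\Sigma})\), which is consistent with retrieving the classical PLN model.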
2.3 Computing the modified ELBO
Let us take the same approach as for non-AR PLN and choose the set \(\mathcal{Q}\) of Gaussian variational distributions. In other words, we hope to obtain accurate results under the simplifying assumption that the variational variables \(Z_i \sim Q_{\psi_i}\), \(1 \leq i \leq n\), are mutually independent.
If this fails, there are other approaches, like choosing an AR structure for the variational variables.
Let us denote \(\theta = (\mu, \boldsymbol{\Sigma}, \boldsymbol{\Phi})\) and \(\psi = (m_1, \dots, m_n, \mathbf{S}_1, \dots, \mathbf{S}_n)\) such that \(Q_\psi \in \mathcal{Q}\).
Also, to be more general, we place ourselves in the regression setting, with \(\mathbf{B}= (\beta_1 | \dots | \beta_p) \in \mathbb{R}^{d \times p}\) and \(\mu_i= \mathbf{B}^\top X_i, 1 \leq i \leq n\) (for the non-conditional version, it suffices to take \(\mu_1 = \dots = \mu_n = \mu\)). Then: \[
\begin{aligned}
J ( \theta, \psi; \mathbf{Y})
&= \mathbb{E}_\psi [ \ell_\theta ( \mathbf{Y}| \mathbf{Z}) + \ell_\theta (\mathbf{Z}) - \ell_\psi (\mathbf{Z}) | \mathbf{Y}] \\
&= \sum_{i=1}^n \Big( \mathbb{E}_{\psi_i} [ \ell ( Y_i \mid Z_i ) ] + \mathbb{E}_{\psi_{i-1}, \psi_i} [ \ell_\theta ( Z_i \mid Z_{i-1}) ] - \mathbb{E}_{\psi_i} [ \ell_{\psi_i} (Z_i) ] \Big) \\
J ( \theta, \psi; \mathbf{Y}) &= \sum_{i=1}^{n} J_i ( \theta, \psi_{i-1}, \psi_i; Y_i)
\end{aligned}
\]
Remark. Fortunately, when we force \(\boldsymbol{\Phi}=0\), we retrieve the non-AR ELBO.
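As a point of comparison for the case \(\boldsymbol{\Phi}=0\), here is a sketch of the per-sample term of the classical PLN ELBO, written with diagonal variational covariances \(\mathbf{S}_i = \mathrm{diag}(s_i)\) (names and parametrisation are illustrative, not the package's internal API):

```r
# Per-sample term of the classical (non-AR) PLN ELBO, with a diagonal
# variational covariance S_i = diag(s_i).
elbo_term_pln <- function(y, m, s, mu, Sigma) {
  Omega <- solve(Sigma)                                  # precision matrix
  quad  <- drop(t(m - mu) %*% Omega %*% (m - mu))
  sum(y * m - exp(m + s / 2) - lgamma(y + 1)) +          # E_q[ log p(y | Z) ]
    0.5 * (sum(log(s)) - as.numeric(determinant(Sigma)$modulus) -
             quad - sum(diag(Omega) * s) + length(y))    # E_q[ log p(Z) ] + entropy of q
}
```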
2.4 Implementation
The PLNmodels package is written in R, with two backends, one in torch, the other in C++ using nlopt.
2.4.1 Initialisation
There are multiple possibilities. Either we start with a VE-step and initialise \(\beta\), \(\boldsymbol{\Sigma}\) and \(\boldsymbol{\Phi}\) by:
the method of moments
their estimation through non-AR VEM
a non-zero random value (in the case of \(\boldsymbol{\Phi}\))
or we start with an M-step and initialise \(m_1, \dots, m_n, \mathbf{S}_1, \dots, \mathbf{S}_n\) and \(\beta\) using a Poisson regression, which is the current choice for the initialisation of the non-AR VEM algorithm (a sketch is given below).
For \(\boldsymbol{\Phi}\), we could also use AR estimation techniques, after estimating \(\mathbf{Z}\) with non-AR VEM or Poisson regression.
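A minimal sketch of the Poisson-regression initialisation mentioned above (the function and starting values are illustrative; the package's actual initialisation may differ):

```r
# Illustrative initialisation: one Poisson GLM per column of Y, giving starting
# values for the regression coefficients and the variational parameters.
init_poisson_regression <- function(Y, X) {
  B <- vapply(seq_len(ncol(Y)), function(j) {
    coef(glm(Y[, j] ~ X - 1, family = poisson()))        # column-wise Poisson GLM
  }, numeric(ncol(X)))                                   # d x p coefficient matrix
  M <- X %*% B                                           # initial variational means m_i
  S <- matrix(0.1, nrow(Y), ncol(Y))                     # small positive variational variances
  list(B = B, M = M, S = S)
}
```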
2.4.2 M-step
Since the classical PLN objective admits explicit expressions for the zeroes of its gradient, we may hope the same holds for PLN-AR; however, the AR parameter \(\boldsymbol{\Phi}\) complicates the differentiation. Explicit expressions are useful because they provide the closed-form updates for the C++ backend and a profiled ELBO, which speeds up the torch backend.
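For reference, in the classical PLN case the zero of the gradient in \(\boldsymbol{\Sigma}\) is explicit (a standard computation, recalled here in this report's notation, with \(\mu_i = \mathbf{B}^\top X_i\)): \[ \widehat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^{n} \left[ (m_i - \mu_i)(m_i - \mu_i)^\top + \mathbf{S}_i \right] \] Substituting this value back into the objective is what yields a profiled ELBO.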
Let us denote \(\widetilde{\mathbf{S}} = \left( s_1 | \dots | s_n \right)\) the matrix whose columns are the diagonals of \(\mathbf{S}_1, \dots, \mathbf{S}_n\).
In dimension \(1\), the conditional distribution of \(Z\) given \(Y = y\) is approximately Gaussian around \(\log y\). Using a Taylor expansion, we obtain expressions for its expectation and variance in terms of the Lambert \(W\) function:
\(\mu (y) = y - W(e^y) = \log y - \frac{\log y}y + o(1)\)
\(\sigma^2 (y) = (1 + W(e^y))^{-1}\)
In dimension \(p>1\), it is hard to compute the parameters of this Gaussian approximation.
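A small numerical check of the one-dimensional expressions above; lambertW0 (principal branch) comes from the CRAN package lamW, chosen here only for illustration:

```r
library(lamW)

cond_mean <- function(y) y - lambertW0(exp(y))          # mu(y) = y - W(e^y)
cond_var  <- function(y) 1 / (1 + lambertW0(exp(y)))    # sigma^2(y) = (1 + W(e^y))^{-1}

y <- c(1, 5, 20)
cbind(mu         = cond_mean(y),
      asymptotic = log(y) - log(y) / y,                 # log y - (log y) / y
      sigma2     = cond_var(y))
```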