Figure 1: Daily average temperature per station in Finland on several days of 2024. Total size: 190 stations \(\times\) 35 years \(\times\) 365 days > 2 million observations. Source: Weather observations by the Finnish Meteorological Institute.
Figure 2: Yearly trend of daily average temperature between 1990 and 2024 for the stations in 3 regions of Finland. Computed in 2 steps: (1) maximum penalised likelihood density estimation (Silverman 1982) per station and per year; the smooth densities are compositional splines and belong to the Bayes Hilbert space \(B^2([a,b])\) (Van Den Boogaart, Egozcue, and Pawlowsky-Glahn 2014) \(\rightarrow\) 190 \(\times\) 35 functional observations. (2) Regression of the density function on year per station \(\rightarrow\) 190 functional trends of temperature change.
What if \(X\) is a functional (or distributional) object?
Idea 1: back to a coordinate vector
Find a way to send the functional object \(X\) to some space \(\mathbb R^p\), using either
its \(y\)-values on a discretisation of the \(x\)-axis; or
the coordinates of an approximation in a basis of a space \(E\) of functions. How to choose the space? Usually some finite-dimensional subspace of \(L^2(a,b)\). How to choose the approximation? This process is called smoothing. Note that the first option is actually a particular case of the second, for a certain choice of basis.
Apply ICS to the coordinate vector that was obtained.
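As an illustration, here is a minimal sketch of Idea 1 in R, assuming curves observed on a common grid `tgrid` and stored as rows of a matrix `ymat` (both names hypothetical), using the `fda` package for smoothing and the `ICS` package for the multivariate step:

```r
library(fda)  # B-spline smoothing
library(ICS)  # multivariate invariant coordinate selection

# Option (a), discretisation: each curve is represented directly by
# its vector of y-values, i.e. a row of ymat.
# Option (b), smoothing: project each curve onto a B-spline basis and
# keep the coordinate vectors instead.
basis <- create.bspline.basis(rangeval = range(tgrid), nbasis = 11)
coefs <- t(smooth.basis(tgrid, t(ymat), basis)$fd$coefs)  # n x 11

# Apply multivariate ICS to the coordinate vectors.
res <- ics(coefs, S1 = cov, S2 = cov4)
```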
Problems:
How to choose the basis? The output of ICS (eigenelements, ICs) might depend on it, but why should it?
How to link the output of ICS to the original functional object \(X\)? In particular, the eigenvectors might be distorted if the chosen basis is not orthonormal.
Idea 2
Still find an approximation of \(X\) in a finite-dimensional functional space \(E\).
But define ICS directly on the space \(E\) in a coordinate-free manner:
Dependence on choice of units \(\rightarrow\) Affine invariance.
Dependence on choice of basis \(\rightarrow\) Coordinate freedom.
| Multivariate ICS | Coordinate-free ICS |
|---|---|
| the space \(\mathbb R^p\) | a \(p\)-dimensional Euclidean space \(E\) |
| a random vector \(X \in \mathbb R^p\) | a random object \(X \in E\) |
| two scatter matrices \(S_1[X]\) and \(S_2[X]\) | two scatter operators (linear maps) \(S_1[X]\) and \(S_2[X]\) |
| a matrix of eigenvectors \(H\) | an eigenbasis \(H = (h_1, \dots, h_p)\) of \(E\) |
| eigenvalues \((\lambda_1, \dots, \lambda_p)\) | unchanged |
| invariant coordinates | unchanged |
Observations from a random object in a Euclidean space
Theory
Let \((E, \langle \cdot, \cdot \rangle)\) be a \(p\)-dimensional Euclidean space (an abstract vector space equipped with an inner product). Why a Euclidean space? Because defining scatter operators requires an inner product, in order to have notions of symmetry and positive-definiteness.
Let \(X\) be an \(E\)-valued random object, from which we observe a data set \(D = (x_1, \dots, x_n)\).
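In computations, the inner product of \(E\) is carried by the Gram matrix of whatever basis is used to store the data. A minimal sketch (the names `inner`, `gram` are hypothetical):

```r
# Inner product on a p-dimensional Euclidean space E, expressed on
# coordinate vectors: <x, y>_E = [x]' G [y], where G is the symmetric
# positive-definite Gram matrix of the chosen basis of E.
inner  <- function(x, y, gram) as.numeric(t(x) %*% gram %*% y)
norm_E <- function(x, gram) sqrt(inner(x, x, gram))
```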
Finland application
Code
load("data/finland/fint.RData")fint_trend |>plot_funs(tday) +ylim(NA, 0.1) +labs(x ="Temperature (deg. Celsius)", y ="Density")
Figure 3: The data set \(D\) of trends of temperature change in Finland in an 11-dimensional space \(E\) of compositional splines.
Two scatter operators
Theory
We need to find two scatter operators \(S_1[X]\) and \(S_2[X]\) that are symmetric positive-definite linear operators on \(E\).
The \(w\)-weighted covariance operator is defined by \[
\langle \operatorname{Cov}_w[X] x, y \rangle = \mathbb E \left[ w^2\!\left( \left\| \operatorname{Cov}[X]^{-1/2} (X - \mathbb EX) \right\| \right) \langle X - \mathbb EX, x \rangle \, \langle X - \mathbb EX, y \rangle \right]
\] where \((x,y) \in E^2\) and \(w : \mathbb R^+ \rightarrow \mathbb R\) is a measurable function.
It is an affine equivariant scatter operator and its definition is completely coordinate-free: it exists on every Euclidean space.
Example: \(S_1 = \operatorname{Cov}\) for \(w = 1\) and \(S_2 = \operatorname{Cov}_4\) for \(w = \operatorname{id}\).
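For intuition, a minimal sketch of the empirical version in R, assuming the rows of a (hypothetical) matrix `Z` hold the coordinates of the observations in an orthonormal basis of \(E\), so that \(\langle \cdot, \cdot \rangle\) reduces to the usual dot product:

```r
# Empirical w-weighted covariance of coordinate vectors (rows of Z).
cov_w <- function(Z, w = function(r) 1) {
  Zc <- sweep(Z, 2, colMeans(Z))               # centre the data
  S  <- crossprod(Zc) / nrow(Zc)               # empirical Cov[X]
  r  <- sqrt(rowSums((Zc %*% solve(S)) * Zc))  # Mahalanobis norms
  crossprod(Zc * w(r)) / nrow(Zc)              # reweighted covariance
}

S1 <- cov_w(Z)                 # w = 1  gives Cov
S2 <- cov_w(Z, w = identity)   # w = id gives Cov_4
```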
Figure 4: Empirical estimates of the scatter operators \(\operatorname{Cov} [X]\) and \(\operatorname{Cov}_4 [X]\) for the data set on trends of temperature change in Finland.
Coordinate-free ICS
Setting: \((E, \langle \cdot, \cdot \rangle)\) is a Euclidean space of dimension \(p\). Examples: \(\mathbb R^p\), square matrices, functions (polynomials, splines).
Figure 6: Mean trend of temperature change, shifted by each ICS eigendensity.
Proposition 1 (Reconstruction formula). \(X = \mathbb EX + \sum_{i=1}^p z_i h^*_i\), where \(\color{#009e73}{H^*} = (h^*_i)_{1 \leq i \leq p} = (\color{#0072b2}{S_1}[X] h_i)_{1 \leq i \leq p}\) is the dual basis of \(\color{#009e73}{H}\).
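In coordinates with respect to an orthonormal basis of \(E\), Proposition 1 can be checked numerically; a sketch, where `S1`, `H`, `mu` and `z` are hypothetical names for the first scatter, the eigenbasis (as columns), the mean, and the invariant coordinates of one observation:

```r
# Dual basis H* = (S1 h_i); its columns pair with those of H so that
# <h_i, h*_j> = delta_ij.
Hstar <- S1 %*% H

# Reconstruction formula: x = mu + sum_i z_i h*_i.
x_rec <- mu + as.numeric(Hstar %*% z)
```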
Figure 7: First two invariant coordinates coloured by region.
Figure 8: Last two invariant coordinates coloured by region.
A link between ICS problems
Proposition 2 Let \((E, \langle \cdot, \cdot \rangle_E) \overset{\varphi}{\rightarrow} (F, \langle \cdot, \cdot \rangle_F)\) be an isometry between two Euclidean spaces of dimension \(p\) and \(w_1, w_2 : \mathbb R^+ \rightarrow \mathbb R\) two measurable functions. For any integrable \(E\)-valued random object \(X\), the equality \[
\operatorname{Cov}_{w_\ell}^F [\varphi(X)] = \varphi \circ \operatorname{Cov}_{w_\ell}^E [X] \circ \varphi^{-1}
\] holds for \(\ell \in \{ 1, 2 \}\), as well as the equivalence between the following assertions, for any basis \(H = (h_1, \dots, h_p)\) of \(E\), and any finite non-increasing real sequence \(\Lambda = (\lambda_1 \geq \ldots \geq \lambda_p)\):
\((H, \Lambda)\) solves \(\operatorname{ICS} (X, \operatorname{Cov}_{w_1}^E, \operatorname{Cov}_{w_2}^E)\) in the space \(E\).
\((\varphi(H), \Lambda)\) solves \(\operatorname{ICS} (\varphi(X), \operatorname{Cov}_{w_1}^F, \operatorname{Cov}_{w_2}^F)\) in the space \(F\).
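In practice, Proposition 2 is what lets us solve coordinate-free ICS with standard multivariate software: map the data to \(\mathbb R^p\) through the isometry \(\varphi_B : x \mapsto G_B^{1/2}[x]_B\) (see the proof of Proposition 2 below), run ICS there, and map the eigenbasis back. A sketch, with hypothetical names `C` (basis coordinates of the data, in rows), `G` (Gram matrix of the basis \(B\)) and `h` (an eigenvector found in \(\mathbb R^p\)):

```r
# Symmetric square root of the Gram matrix G.
eig   <- eigen(G, symmetric = TRUE)
Ghalf <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

# Rows of Z are the images phi_B(x_i) in R^p.
Z <- C %*% Ghalf

# Solve the ICS problem in R^p ...
res <- ICS::ics(Z, S1 = cov, S2 = cov4)

# ... then map an eigenvector h back to B-coordinates in E.
h_B <- solve(Ghalf, h)
```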
Select \(k\) components, using the scree plot, D’Agostino normality tests, etc. Then compute the ICS distances: \[\sum_{j=1}^k |z_j|^2\]
Choose a cut-off \(\tau\), based on Monte Carlo simulations to compute quantiles (Archimbaud, Nordhausen, and Ruiz-Gazen 2018b). This step can be avoided by drawing ROC curves instead.
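A minimal sketch of these two steps in R, assuming `Z` holds the invariant coordinates (columns ordered by decreasing eigenvalue) and `tau` comes from the Monte Carlo step:

```r
# ICS distances from the first k invariant coordinates.
k <- 2
ics_dist <- rowSums(Z[, 1:k, drop = FALSE]^2)

# Flag observations beyond the cut-off tau.
outlying <- ics_dist > tau
```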
Finland application
Code
```r
tday_out <- dda::ICS_outlier(fint_trend$tday, index = 1:2)

fint_trend <- fint_trend |>
  mutate(outlying = tday_out$outliers == 1)

fint_trend |>
  filter(outlying) |>
  plot_funs(tday, color = region) +
  ylim(NA, 0.1) +
  geom_function(
    fun = function(x) eval(c(mean(fint_trend$tday)), x),
    color = "black", linetype = "dotted"
  ) +
  labs(x = "Temperature (deg. Celsius)", y = "Density") +
  guides(color = "none")

ggplot(finregions) +
  geom_sf(alpha = 0) +
  geom_sf(data = filter(fint_trend, outlying), aes(color = region)) +
  scale_x_continuous(breaks = seq(from = -180, to = 180, by = 4)) +
  labs(color = "Region")
```
Figure 9: Outlying trends in temperature change in the Finland data set.
Figure 10: Outlying stations on a map of Finland.
Comparison with other methods
Data generating process with outlier contamination
Figure 11: Three schemes to generate density data with outliers.
Tyler, D. E., Critchley, F., Dümbgen, L., and Oja, H. 2009. “Invariant Co-ordinate Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3): 549–92.
Van Den Boogaart, K. G., Egozcue, J. J., and Pawlowsky-Glahn, V. 2014. “Bayes Hilbert Spaces.” Australian & New Zealand Journal of Statistics 56 (2): 171–94.
Let us decompose \(S_1[X]^{-1} (X - \mathbb EX)\) over the basis \(H\), which is orthonormal in \((E, \langle \cdot, S_1[X] \cdot \rangle)\):
\[\begin{aligned}
S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_i \rangle h_i \\
&= \sum_{i=1}^p \langle X - \mathbb EX, h_i \rangle h_i \\
S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_i h_i
\end{aligned}\]
The dual basis \(H^*\) of \(H\) is the one that satisfies \(\langle h_i, h^*_j \rangle = \delta_{ij}\) for all \(1 \leq i,j \leq p\) and we know from the definition of ICS that this holds for \((S_1[X] h_i)_{1 \leq i \leq p}\).
Proof. In ICS, the first step is to centre the data, so without loss of generality \(\mathbb E[X] = 0\). \(E\) is isometric to \(\mathbb R^p\) via the linear isomorphism \[\phi_B: x \mapsto G_B^{1/2} [x]_B,\] where \([x]_B\) denotes the coordinates of \(x\) in a basis \(B\) of \(E\) and \(G_B\) is the Gram matrix of \(B\). Then for any \((x,y) \in E^2\): \[
\begin{aligned}
\langle \operatorname{Cov}_w [X] x, y \rangle_E
&= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\
&= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\
&= \langle \operatorname{Cov}_w (G_B^{1/2} [X]_B) G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\
&= \langle \operatorname{Cov}_w (G_B [X]_B) [x]_B, [y]_B \rangle \\
&= \langle \operatorname{Cov}_w ([X]_B) G_B [x]_B, G_B [y]_B \rangle
\end{aligned}
\]
Bayes space and CB-splines
Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).
\(P = \frac{1}{\lambda(I)} \lambda\) is the uniform probability measure over \(I = [a,b]\).
\(L^2_0(P) = \{ f \in L^2(P) \mid \int_I f \,\mathrm dP = 0 \}\) is a separable Hilbert space.
Centred log-ratio transform: \(\operatorname{clr}: p \in \mathbb R^I \mapsto \log(p) - \int_I \log(p) \,\mathrm dP\) (when it exists); see the grid-based sketch after this list.
Bayes space: \(B^2(P) = \{ \operatorname{clr}^{-1}(\{f\}) : f \in L^2_0(P) \}\) inherits the separable Hilbert space structure of \(L^2_0(P)\). Each equivalence class contains a density function defined almost everywhere.
B-spline basis: \(B = (B_i)_{-q \leq i \leq k}\) of \(\mathcal S_{q+1}^\Delta(I) \subset L^2(P)\), the subspace of polynomial splines on \(I\) of order \(q+1\) with knots \(\Delta\).
ZB-spline basis: \(Z = (Z_i)_{-q+1 \leq i \leq k-1} = \left(\frac{\mathrm d B_i}{\mathrm dx}\right)_{-q+1 \leq i \leq k-1}\) of \(\mathcal Z_q^\Delta(I) = \mathcal S_q^\Delta(I) \cap L^2_0(P)\).
CB-spline basis: \(C = (C_i)_{-q+1 \leq i \leq k-1} = (\operatorname{clr}^{-1}(Z_i))_{-q+1 \leq i \leq k-1}\) of \(\mathcal C_q^\Delta(I)\), a Euclidean subspace of \(B^2(P)\) of dimension \(p = k+q-1\).
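To make the clr transform concrete, here is a minimal sketch on a grid, assuming densities are evaluated on an equally spaced grid over \(I\) so that integrals with respect to the uniform \(P\) reduce to means (the names `clr_grid`, `clr_inv_grid` are hypothetical):

```r
# Centred log-ratio transform of a positive density p evaluated on an
# equally spaced grid: clr(p) = log(p) - integral of log(p) dP, where
# the integral with respect to the uniform P is just the mean.
clr_grid <- function(p) log(p) - mean(log(p))

# Inverse: exponentiate and renormalise to the representative density
# that integrates to 1 over I (dx = grid spacing).
clr_inv_grid <- function(f, dx) {
  p <- exp(f)
  p / sum(p * dx)
}
```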