Dimension Reduction on Distributional Data

55ᵉ Journées de Statistique, Université de Bordeaux

Camille Mondon

camille.mondon@tse-fr.eu

Anne Ruiz-Gazen

anne.ruiz-gazen@tse-fr.eu

Christine Thomas-Agnan

christine.thomas@tse-fr.eu

May 30, 2024

Outline

Introduction and distributional data
Coordinate-free ICS (Invariant Coordinate Selection)
Implementation
Application

Introduction

Aim: studying climate in Vietnam

Source: CPC Global Unified Temperature (NOAA PSL 1979); see also Trinh, Simioni, and Thomas-Agnan (2023).

Code

library(dda)
library(tidyverse)
library(stars)

load("../../data/cpc/vnt_cpc_pro_stars.RData")
t0 <- as.Date("2016-01-01")
vnt_cpc_pro_stars |>
  filter(time %in% c(t0, t0 + 121, t0 + 244)) |>
  select(tmax) |>
  plot()

Figure 1: Daily maximum temperatures per province in Vietnam on several days of 2016.

Distributional data analysis

Maximum Penalised Likelihood density estimation (Silverman 1982)
The smooth densities are compositional splines and belong to the Bayes Hilbert space \(B^2 ([a,b])\) (Van Den Boogaart, Egozcue, and Pawlowsky-Glahn 2014).

Code

data(vietnam_temperature)
selected_years <- 2013:2016

vnt <- vietnam_temperature |>
  filter(year %in% selected_years) |>
  filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
  left_join(vietnam_regions, by = c("region" = "code")) |>
  mutate(region = name) |>
  arrange(region, province) |>
  select(!c(t_min, name)) |>
  mutate(t_max = as_dd(t_max, lambda = 1, nbasis = 10, norder = 4))

class(vnt$t_max) <- c("ddl", "fdl", "list")

vnt |> plot_funs(t_max) +
  labs(x = "Daily max. temperature (°C)", y = "Density")

Figure 2: Yearly densities of daily max. temperature of some Vietnamese provinces between 2013 and 2016.

Coordinate-free ICS

Setting: \((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\)
Examples: \(\mathbb R^p\), square matrices, functions (polynomials, splines).
Aim: generalising Tyler et al. (2009), where \(E=\mathbb R^p\)
See also (Li et al. 2021)

Example: ICS in \(E = \mathbb R^2\)

Generalised PCA: two ellipses are better than one

Code

plot_with_ellipses <- function(data, blue = FALSE, ...) {
  plot(
    data,
    asp = 1,
    ...
  )
  lines(
    ellipse::ellipse(cov(data), centre = colMeans(data), level = sqrt(3 / 4)),
    type = "l",
    lty = 2,
    col = "#0072b2"
  )

  arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
  e <- eigen(cov(data))
  arrows_stop <- arrows_start - t(e$vectors) * 2 *
    e$values**0.5
  if (blue) {
    arrows(arrows_start[, 1], arrows_start[, 2],
      arrows_stop[, 1], arrows_stop[, 2],
      col = "#0072b2"
    )
  }

  points(t(colMeans(data)), col = "#0072b2", pch = 3)

  lines(
    ellipse::ellipse(ICS::cov4(data),
      centre = colMeans(data),
      level = sqrt(3 / 4)
    ),
    type = "l",
    lty = 2,
    col = "#d55e00"
  )

  arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
  e <- eigen(ICS::cov4(data))
  arrows_stop <- arrows_start - t(e$vectors) * 2 *
    e$values**0.5
  arrows(arrows_start[, 1], arrows_start[, 2],
    arrows_stop[, 1], arrows_stop[, 2],
    col = "#d55e00"
  )
}

data(starsCYG, package = "robustbase")
x <- starsCYG
plot_with_ellipses(x, blue = TRUE)

Figure 3: The two starsCYG variables and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\). The \(\color{#0072b2}{blue}\) arrows are the main axes of the Principal Component Analysis.

Step 1: Sphering using \(\color{#0072b2}{\operatorname{Cov}}\)

Code

library(whitening)

y <- whiten(as.matrix(x), center = TRUE)
colnames(y) <- c("W.1", "W.2")
plot_with_ellipses(y)

Figure 4: The two starsCYG variables after whitening and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\).

Step 2: Rotating to diagonalise \(\color{#d55e00}{\operatorname{Cov}_4}\)

Code

icsobj <- ICS(x, center = TRUE)
z <- icsobj$scores
plot_with_ellipses(z)

Figure 5: The two starsCYG variables after rotating to diagonalise \(\color{#d55e00}{\operatorname{Cov}_4}\) and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\). We obtain the invariant coordinates (IC).
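
The two steps above (sphering, then rotating) can be sketched in base R without the whitening or ICS packages. This is only an illustrative sketch on simulated data, not the starsCYG analysis: the helper `cov4_sketch` is a hypothetical name, and its 1 / (p + 2) scaling is an assumed convention for \(\operatorname{Cov}_4\).

```r
# Two-step ICS sketch in base R on simulated data (not the starsCYG data).
set.seed(1)
n_main <- 180
n_out <- 20
x <- rbind(
  matrix(rnorm(2 * n_main), ncol = 2),
  matrix(rnorm(2 * n_out, mean = 4), ncol = 2) # a small shifted group
)

# Hypothetical helper: fourth-moment scatter, with an assumed 1 / (p + 2) scaling.
cov4_sketch <- function(x) {
  xc <- sweep(x, 2, colMeans(x))
  d2 <- mahalanobis(x, colMeans(x), cov(x)) # squared Mahalanobis distances
  crossprod(xc * sqrt(d2)) / (nrow(x) * (ncol(x) + 2))
}

# Step 1: sphering with Cov^{-1/2}
e1 <- eigen(cov(x), symmetric = TRUE)
cov_inv_sqrt <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)
y <- sweep(x, 2, colMeans(x)) %*% cov_inv_sqrt

# Step 2: rotating to diagonalise Cov_4 of the sphered data
e2 <- eigen(cov4_sketch(y), symmetric = TRUE)
z <- y %*% e2$vectors

# After both steps, Cov is the identity and Cov_4 is diagonal (up to rounding).
round(cov(z), 10)
round(cov4_sketch(z), 10)
```

The rotation in step 2 leaves the covariance equal to the identity, which is why both scatters end up diagonal in the same coordinates.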

Back to the original data

Code

plot_with_ellipses(x, blue = TRUE)
arrows_start <- matrix(colMeans(x), 2, 2, byrow = TRUE)
e <- list(vectors = icsobj$W, values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * sign(diag(e$vectors)) * 0.5
arrows(arrows_start[, 1], arrows_start[, 2],
  arrows_stop[, 1], arrows_stop[, 2],
  col = "#009e73"
)

e <- list(vectors = solve(t(icsobj$W)), values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * 2 * sign(diag(e$vectors))
arrows(arrows_start[, 1], arrows_start[, 2],
  arrows_stop[, 1], arrows_stop[, 2],
  col = "#009e73",
  lty = "dotted"
)

Figure 6: The two original starsCYG variables and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\). In \(\color{#009e73}{green}\) is the basis given by ICS.

Invariant Coordinate Selection

Formalising the problem

Definition 1 (ICS problem)

\(\operatorname{ICS} (X, \color{#0072b2}{S_1},\color{#d55e00}{S_2})\): in \((E, \langle \cdot, \color{#0072b2}{S_1}[X] \cdot \rangle)\) Euclidean space of dimension \(p\), find an orthonormal basis \(\color{#009e73}{H} = (h_1, \dots, h_p)\) diagonalising the symmetric operator \(\color{#0072b2}{S_1}[X]^{-1} \color{#d55e00}{S_2}[X]\), i.e.:

\[\left\{ \begin{aligned} \langle \color{#0072b2}{S_1} [X] h_i, h_j \rangle &= \delta_{ij} \\ \langle \color{#d55e00}{S_2}[X] h_i, h_j \rangle &= \lambda_i \delta_{ij} \end{aligned} \right. \text{ for all }1 \leq i,j \leq p\] where \(\lambda_1 \geq \dots \geq \lambda_p \geq 0\) are called the generalised kurtosis values.
\(z_i = \langle X - \mathbb EX, h_i \rangle, 1 \leq i \leq p\) are called invariant coordinates.

Proposition 1 (Reconstruction formula) \(X = \mathbb EX + \sum_{i=1}^p z_i h^*_i\) where \(\color{#009e73}{H^*} = (h^*_i)_{1 \leq i \leq p} = (\color{#0072b2}{S_1} [X] h_i)_{1 \leq i \leq p}\) is the dual basis of \(\color{#009e73}{H}\).
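
Definition 1 and Proposition 1 can be checked numerically in \(E = \mathbb R^p\) with empirical scatters. The sketch below is a hedged base-R illustration on simulated data: it takes \(S_1 = \operatorname{Cov}\) and \(S_2 = \operatorname{Cov}_4\) (with an assumed 1 / (p + 2) scaling), solves the problem through \(S_1^{-1/2}\), and rebuilds the data from the invariant coordinates and the dual basis. All variable names are illustrative.

```r
set.seed(2)
n <- 100
p <- 3
mix <- matrix(c(2, 0, 0, 1, 1, 0, 0.5, 0.3, 1), p, p)
x <- matrix(rexp(n * p), n, p) %*% mix # rows are observations
xc <- sweep(x, 2, colMeans(x))

# S1 = Cov and S2 = Cov_4 (assumed 1 / (p + 2) scaling)
s1 <- cov(x)
d2 <- mahalanobis(x, colMeans(x), s1)
s2 <- crossprod(xc * sqrt(d2)) / (n * (p + 2))

# Diagonalise S1^{-1/2} S2 S1^{-1/2}, then map back: h_i = S1^{-1/2} u_i
e1 <- eigen(s1, symmetric = TRUE)
s1_inv_sqrt <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)
e2 <- eigen(s1_inv_sqrt %*% s2 %*% s1_inv_sqrt, symmetric = TRUE)
h <- s1_inv_sqrt %*% e2$vectors # columns are h_1, ..., h_p
lambda <- e2$values             # generalised kurtosis values, decreasing

# Invariant coordinates z_i = <X - EX, h_i> and dual basis h*_i = S1 h_i
z <- xc %*% h
h_star <- s1 %*% h

# Reconstruction formula: X = EX + sum_i z_i h*_i
x_rec <- sweep(z %*% t(h_star), 2, colMeans(x), "+")

max(abs(t(h) %*% s1 %*% h - diag(p)))      # close to 0
max(abs(t(h) %*% s2 %*% h - diag(lambda))) # close to 0
max(abs(x_rec - x))                        # close to 0
```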

Weighted covariance operators

The \(w\)-weighted covariance operator \(\operatorname{Cov}_w\) is defined for a random variable \(X \in E\) by: \[\forall (x,y) \in E^2, \langle \operatorname{Cov}_w [X] x, y \rangle = \mathbb E [ w^2(X) \langle X - \mathbb EX, x \rangle \langle X - \mathbb EX, y \rangle ] \] where \(w(X)\) is a random weight.
Examples:
- \(w(X) = 1\) defines \(\color{#0072b2}{\operatorname{Cov}}\).
- \(w(X) = \| \color{#0072b2}{\operatorname{Cov}} [X]^{-1/2} (X - \mathbb EX) \|\) defines \(\color{#d55e00}{\operatorname{Cov}_4}\).
If \(w\) is affine-invariant: \[\forall A \in \mathcal{GL} (E), \forall b \in E, w(AX+b) = w(X)\] then \(\operatorname{Cov}_w\) is an affine-equivariant scatter operator: \[\forall A \in \mathcal{GL} (E), \forall b \in E, \operatorname{Cov}_w [AX+b] = A \operatorname{Cov}_w [X] A^*\]
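
The affine equivariance stated above can be checked empirically in \(E = \mathbb R^3\). The sketch below is a minimal base-R illustration; `cov_w`, `w_one` and `w_maha` are hypothetical helper names, expectations are replaced by sample means, and no extra normalising constant is applied to the Mahalanobis-weighted scatter.

```r
# Hypothetical helper: empirical w-weighted covariance (rows of x are observations).
cov_w <- function(x, w) {
  xc <- sweep(x, 2, colMeans(x))
  crossprod(xc * w(x)) / nrow(x) # sample mean of w^2 (x - mean)(x - mean)^T
}

# Two affine-invariant weights: constant, and Mahalanobis norm
w_one <- function(x) rep(1, nrow(x))
w_maha <- function(x) sqrt(mahalanobis(x, colMeans(x), cov(x)))

set.seed(4)
n <- 100
x <- matrix(rexp(n * 3), n, 3)
a <- matrix(c(2, 1, 0, 0, 1, 3, 1, 0, 1), 3, 3) # invertible (determinant 5)
b <- c(1, -2, 0.5)
x_affine <- sweep(x %*% t(a), 2, b, "+")        # rows are A x_i + b

# Affine equivariance: Cov_w[AX + b] = A Cov_w[X] A^T
lhs <- cov_w(x_affine, w_maha)
rhs <- a %*% cov_w(x, w_maha) %*% t(a)
max(abs(lhs - rhs)) # close to 0
```

With `w_one`, the helper returns the covariance with denominator n, that is `cov(x) * (n - 1) / n`.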

Implementation

Previous work

ICS

On multivariate data: \(E = \mathbb R^p\), with R packages:
- ICS
- ICSOutlier, ICSShiny: application to outlier detection
- ICSClust: application to clustering
On compositional data: \(E = \mathcal S^p\) (Ruiz-Gazen et al. 2023)
On multivariate functional data: \(\prod_i E_i, E_i \subseteq \mathcal L^2 ([a,b])\) (Archimbaud et al. 2022)

Distributional data

Density estimation: fda::density.fd computes the B-spline coordinates of approximate solutions to the Maximum Penalised Likelihood problem.
CB- and ZB-spline bases (Machalová et al. 2021).
Principal Component Analysis (Hron et al. 2016)

From \(E\) to \(\mathbb R^p\)

Let

\(B\) be any basis of \(E\);
\(G_B = (\langle b_i, b_j \rangle)_{1 \leq i,j \leq p}\) its Gram matrix;
\([\cdot]_B: E \rightarrow \mathbb R^p\) the coordinates in basis \(B\).

Proposition 2 For any basis \(H\) of \(E\), these are equivalent:


1. \(H\) solves \(\operatorname{ICS} (X, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(E\)
2. \(\color{#d55e00}{G_B^{1/2}} [H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B^{1/2}} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
3. \([H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
4. \(\color{#d55e00}{G_B} [H]_B\) solves \(\operatorname{ICS} ([X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)

Proof

Proof. In ICS, the first step is to centre the data, so without loss of generality \(\mathbb E [X] = 0\).
\(E\) is isometric to \(\mathbb R^p\) via the linear isomorphism: \[\phi_B: x \mapsto G_B^{1/2} [x]_B\] Then for any \((x,y) \in E^2\): \[ \begin{aligned} \langle \operatorname{Cov}_w [X] x, y \rangle_E &= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\ &= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\ &= \langle \operatorname{Cov}_w [G_B^{1/2} [X]_B] G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\ &= \langle \operatorname{Cov}_w [G_B [X]_B] [x]_B, [y]_B \rangle \\ &= \langle \operatorname{Cov}_w [[X]_B] G_B [x]_B, G_B [y]_B \rangle \end{aligned} \] The second equality uses the isometry \(\phi_B\) and assumes that the weight \(w\) is affine-invariant, so that it does not depend on the coordinate representation of \(X\). The last two equalities use the affine equivariance of \(\operatorname{Cov}_w\) together with the symmetry of \(G_B^{1/2}\) and \(G_B\).
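
Proposition 2 can be illustrated numerically. The sketch below is a hedged base-R example under explicit assumptions: \(E\) is taken to be the polynomials of degree at most 2 on \([0,1]\) with the \(L^2\) inner product, \(B\) is the monomial basis (so \(G_B\) is a 3 by 3 Hilbert matrix), the coordinates \([X]_B\) are simulated, and `ics_sketch` is a hypothetical helper solving ICS with \(\operatorname{Cov}\) and \(\operatorname{Cov}_4\) over \(\mathbb R^p\).

```r
# Hypothetical helper: ICS(Y, Cov, Cov_4) over R^p, rows of y are observations.
ics_sketch <- function(y) {
  n <- nrow(y)
  p <- ncol(y)
  yc <- sweep(y, 2, colMeans(y))
  s1 <- cov(y)
  d2 <- mahalanobis(y, colMeans(y), s1)
  s2 <- crossprod(yc * sqrt(d2)) / (n * (p + 2)) # assumed Cov_4 scaling
  e1 <- eigen(s1, symmetric = TRUE)
  s1_inv_sqrt <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)
  e2 <- eigen(s1_inv_sqrt %*% s2 %*% s1_inv_sqrt, symmetric = TRUE)
  h <- s1_inv_sqrt %*% e2$vectors
  list(h = h, lambda = e2$values, scores = yc %*% h)
}

set.seed(3)
n <- 150
p <- 3
# Gram matrix of the monomials 1, t, t^2 for the L^2([0, 1]) inner product
g <- outer(1:p, 1:p, function(i, j) 1 / (i + j - 1))
eg <- eigen(g, symmetric = TRUE)
g_sqrt <- eg$vectors %*% diag(sqrt(eg$values)) %*% t(eg$vectors)

coords <- matrix(rt(n * p, df = 5), n, p) # simulated rows [X]_B

fit_iso <- ics_sketch(coords %*% g_sqrt)  # data G^{1/2} [X]_B, basis G^{1/2} [H]_B
fit_gram <- ics_sketch(coords %*% g)      # data G [X]_B,       basis [H]_B
fit_raw <- ics_sketch(coords)             # data [X]_B,         basis G [H]_B

# Same generalised kurtosis values and same invariant coordinates (up to sign)
max(abs(fit_iso$lambda - fit_raw$lambda))
max(abs(abs(fit_iso$scores) - abs(fit_raw$scores)))
# The bases are related through the Gram matrix (up to sign)
max(abs(abs(g %*% fit_gram$h) - abs(fit_raw$h)))
```

Comparing absolute values is a shortcut for the sign indeterminacy of each \(h_i\); it assumes the generalised kurtosis values are distinct.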

Application

Temperature distributions in Vietnam

Dimension reduction

Code

library(dda)

t_max_ics <- ICS(vnt$t_max)

vnt |>
  cbind(as_tibble(t_max_ics$scores)) |>
  ggplot(aes(IC.1, IC.2,
    label = seq_len(nrow(vnt)), color = region, shape = NA
  )) +
  geom_point() +
  geom_text() +
  coord_fixed() +
  guides(shape = "none")

vnt |> plot_funs(t_max, color = region) +
  facet_wrap(vars(year)) +
  labs(x = "Max. temperature (°C)", y = "Density")

Figure 7: First two invariant coordinates coloured by region.

Figure 8: Density curves coloured by region.

Perspectives

Cross-validate the smoothing parameters
Continue developing the dda package and submit to CRAN
Generalise to Hilbert spaces

Thank you for listening!

Temperature distribution data: reducing the dimension with ICS
Pre-processing: ICS needs an inner product on a finite-dimensional space containing all the estimated densities
Coordinate-free ICS: defining the Invariant Coordinates without using a ‘canonical’ basis (change the inner product rather than transform the dataset)
Implementation: using coordinates in a CB-spline basis
Any questions?

References

Archimbaud A., Boulfani F., Gendre X., Nordhausen K., Ruiz-Gazen A. and Virta J. 2022. “ICS for multivariate functional anomaly detection with applications to predictive maintenance and quality control.” Econometrics and Statistics, March.

Hron K., Menafoglio A., Templ M., Hrůzová K. and Filzmoser P. 2016. “Simplicial principal component analysis for density functions in Bayes spaces.” Computational Statistics & Data Analysis 94 (February): 330–50.

Li B., Van Bever G., Oja H., Sabolová R. and Critchley F. 2021. “Functional independent component analysis: An extension of fourth-order blind identification.”

Machalová J., Talská R., Hron K. and Gába A. 2021. “Compositional splines for representation of density functions.” Computational Statistics 36 (2): 1031–64.

NOAA PSL. 1979. “CPC Global Unified Temperature.”

Ruiz-Gazen A., Thomas-Agnan C., Laurent T. and Mondon C. 2023. “Detecting Outliers in Compositional Data Using Invariant Coordinate Selection.” In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler, edited by M. Yi and K. Nordhausen, 197–224. Cham: Springer International Publishing.

Silverman B. W. 1982. “On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method.” The Annals of Statistics 10 (3): 795–810.

Trinh H. T., Simioni M. and Thomas-Agnan C. 2023. “Discrete and Smooth Scalar-on-Density Compositional Regression for Assessing the Impact of Climate Change on Rice Yield in Vietnam.” TSE Working Papers. Toulouse School of Economics (TSE).

Tyler D. E., Critchley F., Dümbgen L. and Oja H. 2009. “Invariant co-ordinate selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3): 549–92.

Van Den Boogaart K. G., Egozcue J. J. and Pawlowsky-Glahn V. 2014. “Bayes Hilbert Spaces.” Australian & New Zealand Journal of Statistics 56 (2): 171–94.

Appendix

Scatter operators

\((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\).
\(X \in \mathcal M (\Omega, E)\) random variable over \(E\).
\(\mathcal S^{+} (E)\) space of non-negative symmetric operators over \(E\).
\(\mathcal {GS}^{+} (E)\) space of positive symmetric operators over \(E\).
\(S: \mathcal A \subseteq \mathcal M (\Omega, E) \rightarrow \mathcal S^{+} (E)\) is a scatter operator over \(\mathcal A\) if it is invariant under equality in distribution: \[\forall (X,Y) \in \mathcal A^2, X \sim Y \Rightarrow S[X] = S[Y] \]
A scatter operator \(S\) is affine equivariant if: \[\forall A \in \mathcal{GL} (E), \forall b \in E, S[AX+b] = A S[X] A^*\]

Proof of Proposition 1

Proof. Let us decompose \(S_1[X]^{-1} (X - \mathbb EX)\) over the basis \(H\), which is orthonormal in \((E, \langle \cdot, S_1[X] \cdot \rangle)\): \[\begin{aligned} S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_i \rangle h_i \\ &= \sum_{i=1}^p \langle X - \mathbb EX, h_i \rangle h_i \\ S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_i h_i \end{aligned}\] where the second equality uses the symmetry of \(S_1[X]\). Applying \(S_1[X]\) to both sides gives \(X - \mathbb EX = \sum_{i=1}^p z_i S_1[X] h_i\). The dual basis \(H^*\) of \(H\) is the one that satisfies \(\langle h_i, h^*_j \rangle = \delta_{ij}\) for all \(1 \leq i,j \leq p\), and we know from the definition of ICS that this holds for \((S_1[X] h_i)_{1 \leq i \leq p}\), hence \(h^*_i = S_1[X] h_i\).

Compositional ICS on histogram data

Code

library(tidyverse)
library(dda)
library(compositions)

data(vietnam_temperature)
data(vietnam_regions)
selected_years <- 2013:2016

hist_zeroreplace <- function(x) {
  N <- sum(x$counts)
  n <- length(x$counts)
  dl <- rep(1 / N, n) # Detection limit
  setNames(c(compositions::zeroreplace(x$density, dl)), x$mids)
}

vnth <- vietnam_temperature |>
  filter(year %in% selected_years) |>
  filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
  left_join(vietnam_regions, by = c("region" = "code")) |>
  mutate(region = name) |>
  arrange(region, province) |>
  select(!c(t_min, name)) |>
  mutate(t_max = lapply(t_max, function(x) {
    hist(x,
      seq(min(unlist(t_max)), max(unlist(t_max)), length.out = 12),
      plot = FALSE
    )
  })) |>
  mutate(t_max = lapply(t_max, hist_zeroreplace)) |>
  unnest_wider(t_max, names_sep = ".")

t_max_ilr_ics <- vnth |>
  select(starts_with("t_max")) |>
  acomp() |>
  ilr() |>
  ICS(center = TRUE)

vnth |>
  cbind(as_tibble(t_max_ilr_ics$scores)) |>
  ggplot(aes(IC.1, IC.2,
    label = seq_len(nrow(vnth)), color = region, shape = NA
  )) +
  geom_point() +
  geom_text() +
  coord_fixed() +
  guides(shape = "none")

Figure 9: First two invariant coordinates coloured by region.
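
The pipeline above maps histogram compositions to ilr coordinates before running ICS. As a hedged base-R sketch of why this is a natural pre-processing step, the code below builds ilr coordinates from a normalised Helmert contrast matrix and checks that they preserve the Aitchison inner product. The helpers `clr` and `ilr_sketch` are illustrative names, the two compositions are made up, and the basis used here is an assumption that need not coincide with the default basis of compositions::ilr.

```r
# Centred log-ratio transform of a composition with positive parts
clr <- function(x) log(x) - mean(log(x))

# Orthonormal basis of the clr hyperplane (columns sum to zero),
# obtained by normalising the Helmert contrasts.
d <- 4
v <- contr.helmert(d)
v <- sweep(v, 2, sqrt(colSums(v^2)), "/")

# Hypothetical helper: ilr coordinates with respect to the basis v
ilr_sketch <- function(x, v) drop(crossprod(v, clr(x)))

x1 <- c(0.1, 0.2, 0.3, 0.4) # made-up compositions
x2 <- c(0.4, 0.1, 0.4, 0.1)

aitchison_inner <- sum(clr(x1) * clr(x2))
ilr_inner <- sum(ilr_sketch(x1, v) * ilr_sketch(x2, v))

aitchison_inner - ilr_inner # close to 0: ilr is an isometry onto R^(d - 1)
```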

Bayes space and CB-splines

Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).

\(P = \frac1{\lambda(I)} \lambda\) uniform probability measure over \(I=[a,b]\).
\(L^2_0 (P) = \{ f \in L^2 (P) | \int_I f \mathrm dP = 0 \}\) is a separable Hilbert space.
Centred log-ratio transform \(\operatorname{clr}: p \in \mathbb R^I \mapsto \log(p) - \int_I \log(p) \mathrm dP\) (if it exists)
Bayes space \(B^2 (P) = \{ \operatorname{clr}^{-1} (\{f\}), f \in L^2_0 (P) \}\) inherits the separable Hilbert space structure of \(L^2_0 (P)\). Each equivalence class contains a density function defined almost everywhere.
B-spline basis \(B=(B_i)_{-q \leq i \leq k}\) of \(\mathcal S_{q+1}^\Delta (I) \subset L^2 (P)\), the subspace of polynomial splines on \(I\) of order \(q+1\) with knots \(\Delta\).
ZB-spline basis \(Z=(Z_i)_{-q+1 \leq i \leq k-1} = \left(\frac{\mathrm d B_i}{\mathrm dx}\right)_{-q+1 \leq i \leq k-1}\) of \(\mathcal Z_q^\Delta (I) = \mathcal S_q^\Delta (I) \cap L^2_0 (P)\)
CB-spline basis \(C = (C_i)_{-q+1 \leq i \leq k-1} = (\operatorname{clr}^{-1} (Z_i))_{-q+1 \leq i \leq k-1}\) of \(\mathcal C_q^\Delta (I)\), which is a Euclidean subspace of \(B^2 (P)\) of dimension \(p=k+q-1\).
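
The clr transform and its inverse can be sketched on a grid. This is a rough base-R illustration under stated assumptions: \(I = [0,1]\), integrals against \(P\) are approximated by means over the midpoints of a regular grid, the example density is a Beta(2, 3), and `clr_grid` and `clr_inv_grid` are hypothetical helper names.

```r
a <- 0
b <- 1
m <- 1000
t_mid <- a + (seq_len(m) - 0.5) * (b - a) / m # midpoints of a regular grid

dens <- dbeta(t_mid, 2, 3) # example density, positive on the midpoints

# clr(p) = log(p) - integral of log(p) dP, with P uniform on [a, b]
clr_grid <- function(p) log(p) - mean(log(p))

# Inverse: exponentiate and normalise to a density w.r.t. Lebesgue measure
clr_inv_grid <- function(f, width) {
  e <- exp(f)
  e / (mean(e) * width)
}

f <- clr_grid(dens)
dens_back <- clr_inv_grid(f, b - a)

mean(f)                          # close to 0: f lies in L^2_0(P)
max(abs(clr_grid(3 * dens) - f)) # close to 0: clr ignores the scale of p
max(abs(dens_back - dens))       # small, limited by the quadrature error
```

The scale invariance in the second check reflects that an element of the Bayes space is an equivalence class of proportional functions.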