Dimension Reduction on Distributional Data

55ᵉ Journées de Statistique, Université de Bordeaux

May 30, 2024

Outline

  1. Introduction and distributional data
  2. Coordinate-free ICS (Invariant Coordinate Selection)
  3. Implementation
  4. Application

Introduction

Aim: studying climate in Vietnam

Source: CPC Global Unified Temperature (NOAA PSL 1979), see also (Trinh, Simioni, and Thomas-Agnan 2023).

Code
library(dda)
library(tidyverse)
library(stars)

load("../../data/cpc/vnt_cpc_pro_stars.RData")
t0 <- as.Date("2016-01-01")
vnt_cpc_pro_stars |>
  filter(time %in% c(t0, t0 + 121, t0 + 244)) |>
  select(tmax) |>
  plot()

Figure 1: Daily maximum temperatures per province in Vietnam on several days of 2016.

Distributional data analysis

Code
data(vietnam_temperature)
selected_years <- 2013:2016

vnt <- vietnam_temperature |>
  filter(year %in% selected_years) |>
  filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
  left_join(vietnam_regions, by = c("region" = "code")) %>%
  mutate(region = name) |>
  arrange(region, province) |>
  select(!c(t_min, name)) |>
  mutate(t_max = as_dd(t_max, lambda = 1, nbasis = 10, norder = 4))

class(vnt$t_max) <- c("ddl", "fdl", "list")

vnt |> plot_funs(t_max) +
  labs(x = "Daily max. temperature (°C)", y = "Density")

Figure 2: Yearly densities of daily max. temperature of some Vietnamese provinces between 2013 and 2016.

Coordinate-free ICS

  • Setting: \((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\)
    Examples: \(\mathbb R^p\), square matrices, functions (polynomials, splines).
  • Aim: generalising (Tyler et al. 2009) where \(E=\mathbb R^p\)
  • See also (Li et al. 2021)

Example: ICS in \(E = \mathbb R^2\)

Generalised PCA: two ellipses are better than one

Code
plot_with_ellipses <- function(data, blue = FALSE, ...) {
  plot(
    data,
    asp = 1,
    ...
  )
  lines(
    ellipse::ellipse(cov(data), centre = colMeans(data), level = sqrt(3 / 4)),
    type = "l",
    lty = 2,
    col = "#0072b2"
  )

  arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
  e <- eigen(cov(data))
  arrows_stop <- arrows_start - t(e$vectors) * 2 *
    e$values**0.5
  if (blue) {
    arrows(arrows_start[, 1], arrows_start[, 2],
      arrows_stop[, 1], arrows_stop[, 2],
      col = "#0072b2"
    )
  }

  points(t(colMeans(data)), col = "#0072b2", pch = 3)

  lines(
    ellipse::ellipse(ICS::cov4(data),
      centre = colMeans(data),
      level = sqrt(3 / 4)
    ),
    type = "l",
    lty = 2,
    col = "#d55e00"
  )

  arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
  e <- eigen(ICS::cov4(data))
  arrows_stop <- arrows_start - t(e$vectors) * 2 *
    e$values**0.5
  arrows(arrows_start[, 1], arrows_start[, 2],
    arrows_stop[, 1], arrows_stop[, 2],
    col = "#d55e00"
  )
}

data(starsCYG, package = "robustbase")
x <- starsCYG
plot_with_ellipses(x, blue = TRUE)

Figure 3: The two starsCYG variables and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\). The \(\color{#0072b2}{blue}\) arrows are the main axes of the Principal Component Analysis.

Step 1: Sphering using \(\color{#0072b2}{\operatorname{Cov}}\)

Code
library(whitening)

y <- whiten(as.matrix(x), center = TRUE)
colnames(y) <- c("W.1", "W.2")
plot_with_ellipses(y)

Figure 4: The two starsCYG variables after whitening and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\).

Step 2: Rotating to diagonalise \(\color{#d55e00}{\operatorname{Cov}_4}\)

Code
icsobj <- ICS(x, center = TRUE)
z <- icsobj$scores
plot_with_ellipses(z)

Figure 5: The two starsCYG variables after rotating to diagonalise \(\color{#d55e00}{\operatorname{Cov}_4}\) and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\). We obtain the invariant coordinates (IC).

Back to the original data

Code
plot_with_ellipses(x, blue = TRUE)
arrows_start <- matrix(colMeans(x), 2, 2, byrow = TRUE)
e <- list(vectors = icsobj$W, values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * sign(diag(e$vectors)) * 0.5
arrows(arrows_start[, 1], arrows_start[, 2],
  arrows_stop[, 1], arrows_stop[, 2],
  col = "#009e73"
)

e <- list(vectors = solve(t(icsobj$W)), values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * 2 * sign(diag(e$vectors))
arrows(arrows_start[, 1], arrows_start[, 2],
  arrows_stop[, 1], arrows_stop[, 2],
  col = "#009e73",
  lty = "dotted"
)

Figure 6: The two original starsCYG variables and the ellipses associated with \(\color{#0072b2}{\operatorname{Cov}}\) and \(\color{#d55e00}{\operatorname{Cov}_4}\). In \(\color{#009e73}{green}\) is the basis given by ICS.

Invariant Coordinate Selection

Formalising the problem

Definition 1 (ICS problem)  

  • \(\operatorname{ICS} (X, \color{#0072b2}{S_1},\color{#d55e00}{S_2})\): in \((E, \langle \cdot, \color{#0072b2}{S_1}[X] \cdot \rangle)\) Euclidean space of dimension \(p\), find an orthonormal basis \(\color{#009e73}{H} = (h_1, \dots, h_p)\) diagonalising the symmetric operator \(\color{#0072b2}{S_1}[X]^{-1} \color{#d55e00}{S_2}[X]\), i.e:

    \[\left\{ \begin{aligned} \langle \color{#0072b2}{S_1} [X] h_i, h_j \rangle &= \delta_{ij} \\ \langle \color{#d55e00}{S_2}[X] h_i, h_j \rangle &= \lambda_i \delta_{ij} \end{aligned} \right. \text{ for all }1 \leq i,j \leq p\] where \(\lambda_1 \geq \dots \geq \lambda_p \geq 0\) are called generalised kurtosis.

  • \(z_i = \langle X - \mathbb EX, h_i \rangle, 1 \leq i \leq p\) are called invariant coordinates.

Proposition 1 (Reconstructing formula) \(X = \mathbb EX + \sum_{i=1}^p z_i h^*_i\) where \(\color{#009e73}{H^*} = (h^*_i)_{1 \leq i \leq p} = (\color{#0072b2}{S_1} [X] h_i)_{1 \leq i \leq p}\) is the dual basis of \(\color{#009e73}{H}\).

Weighted covariance operators

  • \(\operatorname{Cov}_w\) \(w\)-weighted covariance operator is defined for some r.v \(X \in E\) by: \[\forall (x,y) \in E^2, \langle \operatorname{Cov}_w [X] x, y \rangle = \mathbb E [ w^2(X) \langle X - \mathbb EX, x \rangle \langle X - \mathbb EX, y \rangle ] \] where \(w(X)\) is a random weight.
  • Examples:
    • \(w(X) = 1\) defines \(\color{#0072b2}{\operatorname{Cov}}\).
    • \(w(X) = \| \color{#0072b2}{\operatorname{Cov}} [X]^{-1/2} (X - \mathbb EX) \|\) defines \(\color{#d55e00}{\operatorname{Cov}_4}\).
  • If \(w\) is affine-invariant: \[\forall A \in \mathcal{GL} (E), \forall b \in E, w(AX+b) = w(X)\] then \(\operatorname{Cov}_w\) is an affine-equivariant scatter operator: \[\forall A \in \mathcal{GL} (E), \forall b \in E, \operatorname{Cov}_w [AX+b] = A \operatorname{Cov}_w [X] A^*\]

Implementation

Previous work

ICS

  • On multivariate data: \(E = \mathbb R^p\), with R packages:
    • ICS
    • ICSOutlier, ICSShiny: application to outlier detection
    • ICSClust: application to clustering
  • On compositional data: \(E = \mathcal S^p\) (Ruiz-Gazen et al. 2023)
  • On multivariate functional data: \(\prod_i E_i, E_i \subseteq \mathcal L^2 ([a,b])\) (Archimbaud et al. 2022)

Distributional data

  • Density estimation: fda::density.fd computes B-splines coordinates of approximated solutions to the Maximum Penalised Likelihood problem.
  • CB and ZB-splines basis (Machalová et al. 2021).
  • Principal Component Analysis (Hron et al. 2016)

From \(E\) to \(\mathbb R^p\)

Let

  • \(B\) be any basis of \(E\);
  • \(G_B = (\langle b_i, b_j \rangle)_{1 \leq i,j \leq p}\) its Gram matrix;
  • \([\cdot]_B: E \rightarrow \mathbb R^p\) the coordinates in basis \(B\).

Proposition 2 For any basis \(H\) of \(E\), these are equivalent:

0. \(H\) solves \(\operatorname{ICS} (X, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(E\)
1. \(\color{#d55e00}{G_B^{1/2}} [H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B^{1/2}} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
2. \([H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
3. \(\color{#d55e00}{G_B} [H]_B\) solves \(\operatorname{ICS} ([X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)

Proof

Proof. In ICS, the first step is to centre the data so wlog: \(\mathbb E [X] = 0\).
\(E\) is isometric to \(\mathbb R^p\) via the linear isomorphism: \[\phi_B: x \mapsto G_B^{1/2} [x]_B\] Then for any \((x,y) \in E^2\): \[ \begin{aligned} \langle \operatorname{Cov}_w [X] x, y \rangle_E &= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\ &= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\ &= \langle \operatorname{Cov}_w (G_B^{1/2} [X]_B) G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\ &= \langle \operatorname{Cov}_w (G_B [X]_B) [x]_B, [y]_B \rangle \\ &= \langle \operatorname{Cov}_w ([X]_B) G_B [x]_B, G_B [y]_B \rangle \end{aligned} \]

Application

Temperature distributions in Vietnam

Dimension reduction

Code
library(dda)

t_max_ics <- ICS(vnt$t_max)

vnt |>
  cbind(as_tibble(t_max_ics$scores)) |>
  ggplot(aes(IC.1, IC.2,
    label = seq_len(nrow(vnt)), color = region, shape = NA
  )) +
  geom_point() +
  geom_text() +
  coord_fixed() +
  guides(shape = "none")

vnt |> plot_funs(t_max, color = region) +
  facet_wrap(vars(year)) +
  labs(x = "Max. temperature (°C)", y = "Density")
Figure 7: First two invariant coordinates coloured by region.
Figure 8: Density curves coloured by region.

Perspectives

  • Cross-validate the smoothing parameters
  • Continue developing the dda package and submit to CRAN
  • Generalise to Hilbert spaces

Thank you for listening!

References

Archimbaud A., Boulfani F., Gendre X., Nordhausen K., Ruiz-Gazen A. and Virta J. 2022. ICS for multivariate functional anomaly detection with applications to predictive maintenance and quality control.” Econometrics and Statistics, March.
Hron K., Menafoglio A., Templ M., Hrůzová K. and Filzmoser P. 2016. Simplicial principal component analysis for density functions in Bayes spaces.” Computational Statistics & Data Analysis 94 (February) : 330–50.
Li B., Van Bever G., Oja H., Sabolová R. and Critchley F. 2021. Functional independent component analysis : An extension of fourth-order blind identification.”
Machalová J., Talská R., Hron K. and Gába A. 2021. Compositional splines for representation of density functions.” Computational Statistics 36 (2) : 1031–64.
NOAA PSL. 1979. CPC Global Unified Temperature.”
Ruiz-Gazen A., Thomas-Agnan C., Laurent T. and Mondon C. 2023. Detecting Outliers in Compositional Data Using Invariant Coordinate Selection.” In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler. edited by M. Yi and K. Nordhausen, 197–224. Cham : Springer International Publishing.
Silverman B. W. 1982. On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method.” The Annals of Statistics 10 (3) : 795–810.
Trinh H. T., Simioni M. and Thomas-Agnan C. 2023. Discrete and Smooth Scalar-on-Density Compositional Regression for Assessing the Impact of Climate Change on Rice Yield in Vietnam.” TSE Working Papers. Toulouse School of Economics (TSE).
Tyler D. E., Critchley F., Dümbgen L. and Oja H. 2009. Invariant co-ordinate selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3) : 549–92.
Van Den Boogaart K. G., Egozcue J. J. and Pawlowsky-Glahn V. 2014. Bayes Hilbert Spaces.” Australian & New Zealand Journal of Statistics 56 (2) : 171–94.

Appendix

Scatter operators

  • \((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\).

  • \(X \in \mathcal M (\Omega, E)\) random variable over \(E\).

  • \(\mathcal S^{+} (E)\) space of non-negative symmetric operators over \(E\).

  • \(\mathcal {GS}^{+} (E)\) space of positive symmetric operators over \(E\).

  • \(S: \mathcal A \subseteq \mathcal M (\Omega, E) \rightarrow \mathcal S^{+} (E)\) scatter operator over \(\mathcal A\) if:

    • Invariance by equality in distribution: \[\forall (X,Y) \in \mathcal A^2, X \sim Y \Rightarrow S[X] = S[Y] \]
    • It is affine equivariant if: \[\forall A \in \mathcal{GL} (E), \forall b \in E, S[AX+b] = A S[X] A^*\]

Proof of Proposition 1

Proof. Let us decompose \(S_1[X]^{-1} (X - \mathbb EX)\) over the basis \(H\), which is orthonormal in \((E, \langle \cdot, S_1[X] \cdot \rangle)\): \[\begin{aligned} S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_i \rangle h_i \\ &= \sum_{i=1}^p \langle X - \mathbb EX, h_i \rangle h_i \\ S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_i h_i \end{aligned}\] The dual basis \(H^*\) of \(H\) is the one that satisfies \(\langle h_i, h^*_j \rangle = \delta_{ij}\) for all \(1 \leq i,j \leq p\) and we know from the definition of ICS that this holds for \((S_1[X] h_i)_{1 \leq i \leq p}\).

Compositional ICS on histogram data

Code
library(tidyverse)
library(dda)
library(compositions)

data(vietnam_temperature)
data(vietnam_regions)
selected_years <- 2013:2016

hist_zeroreplace <- function(x) {
  N <- sum(x$counts)
  n <- length(x$counts)
  dl <- rep(1 / N, n) # Detection limit
  setNames(c(compositions::zeroreplace(x$density, dl)), x$mids)
}

vnth <- vietnam_temperature |>
  filter(year %in% selected_years) |>
  filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
  left_join(vietnam_regions, by = c("region" = "code")) %>%
  mutate(region = name) |>
  arrange(region, province) |>
  select(!c(t_min, name)) |>
  mutate(t_max = lapply(t_max, function(x) {
    hist(x,
      seq(min(unlist(t_max)), max(unlist(t_max)), length.out = 12),
      plot = FALSE
    )
  })) |>
  mutate(t_max = lapply(t_max, hist_zeroreplace)) |>
  unnest_wider(t_max, names_sep = ".")

t_max_ilr_ics <- vnth |>
  select(starts_with("t_max")) |>
  acomp() |>
  ilr() |>
  ICS(center = TRUE)

vnth |>
  cbind(as_tibble(t_max_ilr_ics$scores)) |>
  ggplot(aes(IC.1, IC.2,
    label = seq_len(nrow(vnth)), color = region, shape = NA
  )) +
  geom_point() +
  geom_text() +
  coord_fixed() +
  guides(shape = "none")

Figure 9: First two invariant coordinates coloured by region.

Bayes space and CB-splines

Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).

  • \(P = \frac1{\lambda(I)} \lambda\) uniform probability measure over \(I=[a,b]\).
  • \(L^2_0 (P) = \{ f \in L^2 (P) | \int_I f \mathrm dP = 0 \}\) is a separable Hilbert space.
  • Centred log-ratio transform \(\operatorname{clr}: p \in \mathbb R^I \mapsto \log(p) - \int_I \log(p) \mathrm dP\) (if it exists)
  • Bayes space \(B^2 (P) = \{ \operatorname{clr}^{-1} (\{f\}), f \in L^2_0 (I) \}\) inherits the separable Hilbert space structure of \(L^2_0 (P)\). Each equivalence class contains a density function defined almost everywhere.
  • B-splines basis \(B=(B_i)_{-q \leq i \leq k}\) of \(\mathcal S_{q+1}^\Delta (I) \subset L^2 (P)\), the subspace of polynomial splines on \(I\) of order \(q+1\) and with knots \(\Delta\).
  • ZB-splines basis \(Z=(Z_i)_{-q+1 \leq i \leq k-1} = \left(\frac{\mathrm d B_i}{\mathrm dx}\right)_{-q+1 \leq i \leq k-1}\) of \(\mathcal Z_q^\Delta (I) = \mathcal S_q^\Delta (I) \cap L^2_0 (P)\)
  • CB-splines basis \(C = (C_i)_{-q+1 \leq i \leq k-1} = (\operatorname{clr}^{-1} (Z_i))_{-q+1 \leq i \leq k-1}\) of \(\mathcal C_q^\Delta (I)\) which is a Euclidean subspace of \(B^2 (P)\) of dimension \(p=k+q-1\).