Coordinate-free ICS for distributional and functional data

Doctoral Workshop on Decision Mathematics and Statistics

June 19, 2025

Outline

  • Theoretical aim: Generalising ICS to distributional and functional data.
  • Case study aim: Detecting meteorological stations with an outlying evolution of their annual temperature distribution.
Figure 1: Workflow diagram of distributional outlier detection using ICS.

Preprocessing

From temperature data…

Code
library(dda)
library(sf)
library(tidyverse)
set.seed(9)

load("data/france/occitanie.RData")
load("data/france/ocdepts.RData")

occlim <- unnest(occlim)

selected_times <- sample(unique(occlim$time), 3)

occlim_sample <- occlim |>
  filter(time %in% selected_times)

ggplot(ocdepts) +
  geom_sf(alpha = 0) +
  geom_sf(data = occlim_sample, aes(color = tx)) +
  scale_color_viridis_c() +
  facet_wrap(vars(time)) +
  labs(color = "Temperature")
Figure 2: Daily maximum temperature per meteorological station in Occitanie between 1950 and 2025. 3 days chosen at random are extracted from the data set.
  • Total size: 300 stations \(\times\) 65 years \(\times\) 365 days > 4 million observations.

…through distributional time series…

Code
oct |>
  filter(name == "TOULOUSE-BLAGNAC") |>
  plot_funs(tx, color = year) +
  facet_wrap(vars(10 * year %/% 10)) +
  labs(x = "Temperature", y = "Density")
Figure 3: Annual maximum temperature densities for the TOULOUSE-BLAGNAC station, grouped by decade.

…to trend slope densities

Interpretation: using odds ratios defined by Maier et al. (2025).

Code
oct_slope <- oct_slope |>
  mutate(tx = as_dd(tx)) |>
  st_join(ocdepts, suffix = c("", "_dept")) |>
  rename(dept = name_dept) |>
  select(id, name, dept, tx)

c(mean(oct_slope$tx)) |>
  plot() +
  geom_hline(
    yintercept = 1 / diff(oct_slope$tx[[1]]$basis$rangeval),
    linetype = "dashed"
  ) +
  geom_vline(
    xintercept = 15,
    linetype = "dashed"
  ) +
  labs(x = "Temperature", y = "Density")
Figure 4: Mean trend slope density across the 300 stations of Occitanie.

Data set of trend slope densities

Case study

Code
oct_slope |>
  plot_funs(tx) +
  labs(x = "Temperature", y = "Density")
Figure 5: Data set \(D\) of trend slope densities of temperature change for the 300 stations in Occitanie.

Theory

Density data

  • 11-dimensional space \(\mathcal C_q^\Delta (a,b)\) of compositional splines.
  • Subspace of the Bayes Hilbert space \((B^2 (a,b), \oplus, \odot, \langle \cdot, \cdot \rangle)\) (Van Den Boogaart, Egozcue, and Pawlowsky-Glahn 2014).
  • Inner product \[\langle f, g \rangle_{B^2 (a,b)} = \frac1{2(b-a)} \int_a^b \int_a^b \frac{\ln f(x)}{\ln f(y)} \frac{\ln g(x)}{\ln g(y)} dxdy\]

General framework

  • \((E, \langle \cdot, \cdot \rangle)\): \(p\)-dimensional Euclidean space.
  • Why a Euclidean space? To define scatter operators, an inner product is required to have a notion of symmetry and positiveness.
  • Let \(X\) be an \(E\)-valued random object, from which we observe a sample \(D = (X_1, \dots, X_n)\).

Scatter operators

Case study

Code
library(ICS)

oct_eval <- oct_slope |>
  mutate(tx = as.fd(tx)) |>
  eval_funs(tx, n = 100) |>
  st_drop_geometry() |>
  select(x, y, id) |>
  pivot_wider(names_from = id, values_from = y) |>
  column_to_rownames("x") |>
  t()

oct_cov <- cov(oct_eval)
eps <- 1e-8 * matrix(rnorm(ncol(oct_eval) * nrow(oct_eval), sd = max(abs(oct_eval))),
  nrow = nrow(oct_eval)
)
oct_cov4 <- cov4(oct_eval + eps)

oct_slope_scatters <- crossing(t_1 = colnames(oct_eval), t_2 = colnames(oct_eval)) |>
  rowwise() |>
  mutate(Cov = oct_cov[t_1, t_2], Cov_4 = oct_cov4[t_1, t_2]) |>
  pivot_longer(c(Cov, Cov_4), names_to = "scatter", values_to = "z") |>
  mutate(t_1 = as.numeric(t_1), t_2 = as.numeric(t_2))

tx_ics <- dda::ICS(oct_slope$tx)
oct_basis <- oct_slope$tx[[1]]$basis

changemat <- to_zbsplines(basis = oct_basis, inv = TRUE)
gram <- t(changemat) %*% fda::inprod(oct_basis, oct_basis) %*% changemat
tx_cov_zb <- cov(t(to_zbsplines(c(oct_slope$tx))))
tx_cov4_zb <- cov4(t(to_zbsplines(c(oct_slope$tx))))

# changemat_inv <- to_zbsplines(basis = oct_basis)
# tx_cov_b <- t(changemat_inv) %*% tx_cov_zb %*% changemat_inv
# tx_cov4_b <- t(changemat_inv) %*% tx_cov4_zb %*% changemat_inv

rangeval <- oct_slope$tx[[1]]$basis$rangeval
tval <- seq(rangeval[1], rangeval[2], length.out = 100)

zbsp <- fd(to_zbsplines(inv = TRUE, coefs = diag(11), basis = oct_basis), oct_basis)
zbsp_tval <- eval.fd(tval, zbsp)
tx_cov_grid <- zbsp_tval %*% tx_cov_zb %*% t(zbsp_tval)
tx_cov4_grid <- zbsp_tval %*% tx_cov4_zb %*% t(zbsp_tval)
dimnames(tx_cov_grid) <- list(tval, tval)
dimnames(tx_cov4_grid) <- list(tval, tval)

oct_slope_scatters_2 <- reshape2::melt(tx_cov_grid) |>
  rename(t_1 = Var1, t_2 = Var2, Cov = value) |>
  mutate(Cov_4 = reshape2::melt(tx_cov4_grid)$value) |>
  pivot_longer(c(Cov, Cov_4), names_to = "scatter", values_to = "z")

ggplot(oct_slope_scatters_2) +
  stat_contour_filled(aes(t_1, t_2, z = z)) +
  facet_wrap(vars(scatter), nrow = 2) +
  labs(x = "Temperature", y = "Temperature", fill = "Value")
Figure 6: Empirical estimates of the scatter operators \(\color{#0072b2}{\operatorname{Cov}} [X]\) and \(\color{#d55e00}{\operatorname{Cov}_4} [X]\) for the data set of slope densities of temperature change in Occitanie.

Theory

  • \(w\)-weighted covariance operator \[\begin{multline*}\langle \operatorname{Cov}_w [X] x, y \rangle \\ = \mathbb E [ w^2(\| \operatorname{Cov}[X]^{-1/2} (X - \mathbb E X) \|) \\ \langle X - \mathbb EX, x \rangle \langle X - \mathbb EX, y \rangle ] \end{multline*}\] for \((x,y) \in E\).

  • Examples.

    1. \(w = 1\) defines \(\color{#0072b2}{\operatorname{Cov}}\);
    2. \(w = \operatorname{id}\) defines \(\color{#d55e00}{\operatorname{Cov}_4}\).
  • Affine equivariant and defined coordinate-freely: no reference to any basis, exist on every Euclidean space.

Coordinate-free ICS

ICS problem

Theory

Definition 1 (ICS problem) \(\operatorname{ICS} (X, \color{#0072b2}{S_1},\color{#d55e00}{S_2})\): joint diagonalisation of \(\color{#0072b2}{S_1}[X]\) and \(\color{#d55e00}{S_2}[X]\), i.e:

\[\left\{ \begin{aligned} \langle \color{#0072b2}{S_1} [X] h_j, h_{j'} \rangle &= \delta_{jj'} \\ \langle \color{#d55e00}{S_2}[X] h_j, h_{j'} \rangle &= \lambda_j \delta_{jj'} \end{aligned} \right. \] for all \(1 \leq j,j' \leq p\) where

  • \(\lambda_1 \geq \dots \geq \lambda_p \geq 0\) are called ICS eigenvalues.
  • \((h_1, \dots, h_p)\) is an ICS eigenbasis.

Solutions exist provided that \(\color{#0072b2}{S_1} [X]\) is invertible.

Case study

Code
library(dda)

tx_ics <- dda::ICS(oct_slope$tx, )

tx_ics_eigen <- tibble(
  index = factor(names(tx_ics$gen_kurtosis),
    levels = names(tx_ics$gen_kurtosis)
  ),
  gen_kurtosis = tx_ics$gen_kurtosis,
  H_dual = as.list(tx_ics$H_dual),
)

tx_ics_eigen |>
  ggplot(aes(index, gen_kurtosis, group = 1)) +
  geom_line(alpha = 0.5) +
  geom_point(aes(color = gen_kurtosis), size = 3) +
  labs(x = "", y = "ICS eigenvalues") +
  guides(color = "none")
Figure 7: Scree plot of ICS eigenvalues for the data set of slopes densities in Occitanie.
Code
tx_ics_eigen |>
  mutate(H_dual = as_dd(H_dual)) |>
  plot_funs(H_dual, color = gen_kurtosis) +
  facet_wrap(vars(index)) +
  geom_hline(
    yintercept = 1 / diff(oct_slope$tx[[1]]$basis$rangeval),
    linetype = "dotted"
  ) +
  scale_y_continuous(n.breaks = 3) +
  labs(x = "Temperature", y = "Density") +
  guides(color = "none")
Figure 8: ICS eigendensities for the data set of slope densities in Occitanie.

Invariant coordinates

Case study

Code
oct_slope |>
  cbind(as_tibble(tx_ics$scores)) |>
  ggplot(aes(IC.1, IC.2, color = dept)) +
  geom_point() +
  coord_fixed() +
  labs(color = "Department")
Figure 9: First two invariant coordinates coloured by department.

Theory

For \(1 \leq j \leq p\): \[z_j = \langle X - \mathbb EX, h_j \rangle\] is the \(i\)-th invariant coordinate of \(X\).

Reconstructing the curves

Case study

Code
h_dual_shift <- tx_ics$H_dual
h_dual_shift$coefs <- sweep(h_dual_shift$coefs, 1, -c(mean(oct_slope$tx))$coefs)
tx_ics_eigen |>
  mutate(H_dual_shift = as.list(as_dd(h_dual_shift))) |>
  plot_funs(H_dual_shift, color = gen_kurtosis) +
  geom_function(
    fun = function(x) eval(c(mean(oct_slope$tx)), x),
    color = "black", linetype = "dotted"
  ) +
  scale_y_continuous(n.breaks = 3) +
  facet_wrap(vars(index)) +
  labs(x = "Temperature", y = "Density") +
  guides(color = "none")
Figure 10: Mean slope density across Occitanie, shifted by each ICS dual eigendensity.

Theory

Proposition 1 (Reconstructing formula) \[X = \mathbb EX + \sum_{j=1}^p z_j h^*_j\] where \[\begin{aligned} H^* &= (h^*_j)_{1 \leq j \leq p} \\ &= (\color{#0072b2}{S_1} [X] h_j)_{1 \leq j \leq p} \end{aligned}\] is the dual basis of \(H=(h_j)_{1 \leq j \leq p}\).

Implementation

Let

  • \(B\) be any basis of \(E\);
  • \(G_B = (\langle b_i, b_j \rangle)_{1 \leq i,j \leq p}\) its Gram matrix;
  • \([\cdot]_B: E \rightarrow \mathbb R^p\) the coordinates in basis \(B\).

Corollary 1 (of Proposition 2) For any basis \(H\) of \(E\), these are equivalent:

0. \(H\) solves \(\operatorname{ICS} (X, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(E\)
1. \(\color{#d55e00}{G_B^{1/2}} [H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B^{1/2}} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
2. \([H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
3. \(\color{#d55e00}{G_B} [H]_B\) solves \(\operatorname{ICS} ([X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)

Outlier detection using ICS

Theory

Based on ICSOutlier (Archimbaud, Nordhausen, and Ruiz-Gazen 2018a).

  1. Compute the invariant coordinates.
  1. Select \(k\) components. Using the scree plot or D’Agostino normality tests, etc. Now compute the ICS distances: \(\sum_{j=1}^k |z_j|^2\).
  1. Choose a cut-off \(\tau\). Based on Monte Carlo simulations to compute quantiles (Archimbaud, Nordhausen, and Ruiz-Gazen 2018b).

Case study

Code
tx_out <- dda::ICS_outlier(oct_slope$tx, index = 1)

oct_slope <- oct_slope |>
  mutate(
    distance = tx_out$ics_distances,
    outlying = tx_out$outliers == 1
  )

oct_slope |>
  mutate(name = ifelse(outlying, name, NA)) |>
  ggplot(aes(id, distance, color = dept, label = name)) +
  geom_point() +
  geom_text() +
  scale_color_discrete(limits = function(x) ifelse(is.na(x), "", x)) +
  labs(color = "Outlying\nstations") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank()
  )
Figure 11: ICS distances for \(k=1\) selected component.
Code
oct_slope |>
  mutate(name = ifelse(outlying, name, NA)) |>
  ggplot(aes(id, distance, color = dept, label = name)) +
  geom_point() +
  geom_text() +
  scale_color_discrete(limits = function(x) ifelse(is.na(x), "", x)) +
  geom_hline(yintercept = tx_out$ics_dist_cutoff) +
  labs(color = "Outlying\nstations") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank()
  )
Figure 12: ICS distances for k=1 selected component, with threshold 5.5.

Results for the case study

Code
oct_slope |>
  filter(outlying) |>
  plot_funs(tx, color = name) +
  geom_hline(
    yintercept = 1 / diff(oct_slope$tx[[1]]$basis$rangeval),
    linetype = "dotted"
  ) +
  geom_function(
    fun = function(x) eval(mean(c(oct_slope$tx)), x),
    color = "black"
  ) +
  labs(x = "Temperature", y = "Density") +
  guides(color = "none")

ggplot(ocdepts) +
  geom_sf(alpha = 0) +
  geom_sf(data = filter(oct_slope, outlying), aes(color = name)) +
  geom_sf_text(data = filter(oct_slope, outlying), aes(color = name, label = name)) +
  labs(x = "", y = "", color = "Station")
Figure 13: Outlying stations for the data set of slope densities in Occitanie.
Figure 14: Outlying stations on a map of Occitanie.

Comparison with other methods

Monte Carlo simulations

  1. Generate density data with outliers according to 3 different outlier schemes.

  2. (Optionally) apply a transformation:

  • log quantile density
  • quantile function

then choose a metric:

  • \(L^2 (a,b)\)
  • \(B^2 (a,b)\)
  • Wasserstein.
  1. Compute scores for 4 different functional outlier detection methods adapted to density data:

ROC curves

Code
load("data/ICS_comparison.RData")

# Aggregate results (averaging over simulations)
comparison_avg <- comparison |>
  separate_wider_delim(Method, "_", names = c("Method", "Transf")) |>
  group_by(Scheme, Transf, Method, PP) |>
  summarize(
    n = n(),
    TPR_avg = mean(TPR),
    TPR_conf = qnorm(0.975) * sd(TPR) / sqrt(n()),
    FPR_avg = mean(FPR),
    FPR_conf = qnorm(0.975) * sd(FPR) / sqrt(n()),
    .groups = "drop"
  )

ggplot(comparison_avg, aes(FPR_avg, TPR_avg, color = Scheme, group = Scheme)) +
  geom_line() +
  geom_ribbon(aes(
    ymin = pmax(0, TPR_avg - TPR_conf),
    ymax = pmin(1, TPR_avg + TPR_conf),
    xmin = pmax(0, FPR_avg - FPR_conf),
    xmax = pmin(1, FPR_avg + FPR_conf),
    fill = Scheme
  ), alpha = 0.1, linetype = "dotted") +
  geom_function(fun = identity, linetype = "dotted") +
  facet_wrap(vars(paste(Method, Transf, sep = "_")), nrow = 2) +
  coord_fixed() +
  labs(
    x = "False Positive Rate (FPR)",
    y = "True Positive Rate (TPR)",
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Figure 15: Comparing ICS with other distributional outlier detection methods through simulated ROC curves for several outlier generating schemes.

Area under the curve (AUC)

Code
comparison_auc <- comparison_avg |>
  group_by(Scheme, Method, Transf) |>
  summarise(AUC = integrate(approxfun(FPR_avg, TPR_avg), 0, 1)$value)

comparison_auc_summary <- comparison_auc |>
  group_by(Method, Transf) |>
  summarise(AUC_avg = mean(AUC)) |>
  arrange(desc(AUC_avg))

ggplot(comparison_auc_summary, aes(reorder(paste(Method, Transf, sep = "\n"), -AUC_avg), AUC_avg, fill = Method)) +
  geom_col() +
  coord_cartesian(ylim = c(0.5, NA)) +
  labs(x = "")
Figure 16: Area under the ROC curve for each outlier detection method.

Thank you for listening!

For more details: (mondon_ics_2025?).

Any questions?

References

Archimbaud A., Nordhausen K. and Ruiz-Gazen A. 2018a. ICSOutlier: Unsupervised Outlier Detection for Low- Dimensional Contamination Structure.” The R Journal 10 (1) : 234–50.
———. 2018b. ICS for multivariate outlier detection with application to quality control.” Computational Statistics & Data Analysis 128 (December) : 184–99.
Machalová J., Talská R., Hron K. and Gába A. 2021. Compositional splines for representation of density functions.” Computational Statistics 36 (2) : 1031–64.
Maier E.-M., Stöcker A., Fitzenberger B. and Greven S. 2025. Additive density-on-scalar regression in Bayes Hilbert spaces with an application to gender economics.” The Annals of Applied Statistics 19 (1) : 680–700.
Murph A. C., Strait J. D., Moran K. R., Hyman J. D. and Stauffer P. H. 2024. Visualisation and outlier detection for probability density function ensembles.” Stat 13 (2) : e662.
Ojo O. T., Fernández Anta A., Lillo R. E. and Sguera C. 2022. Detecting and classifying outliers in big functional data.” Advances in Data Analysis and Classification 16 (3) : 725–60.
Sun Y. and Genton M. G. 2011. Functional Boxplots.” Journal of Computational and Graphical Statistics 20 (2) : 316–34.
Van Den Boogaart K. G., Egozcue J. J. and Pawlowsky-Glahn V. 2014. Bayes Hilbert Spaces.” Australian & New Zealand Journal of Statistics 56 (2) : 171–94.

Appendix

Scatter operators

  • \((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\).

  • \(X \in \mathcal M (\Omega, E)\) random variable over \(E\).

  • \(\mathcal S^{+} (E)\) space of non-negative symmetric operators over \(E\).

  • \(\mathcal {GS}^{+} (E)\) space of positive symmetric operators over \(E\).

  • \(S: \mathcal A \subseteq \mathcal M (\Omega, E) \rightarrow \mathcal S^{+} (E)\) scatter operator over \(\mathcal A\) if:

    • Invariance by equality in distribution: \[\forall (X,Y) \in \mathcal A^2, X \sim Y \Rightarrow S[X] = S[Y] \]
    • It is affine equivariant if: \[\forall A \in \mathcal{GL} (E), \forall b \in E, S[AX+b] = A S[X] A^*\]

Proof of Proposition 1

Proof.

Let us decompose \(S_1[X]^{-1} (X - \mathbb EX)\) over the basis \(H\), which is orthonormal in \((E, \langle \cdot, S_1[X] \cdot \rangle)\): \[\begin{aligned} S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_j \rangle h_j \\ &= \sum_{i=1}^p \langle X - \mathbb EX, h_j \rangle h_j \\ S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_j h_j \end{aligned}\]

The dual basis \(H^*\) of \(H\) is the one that satisfies \(\langle h_j, h^*_j \rangle = \delta_{ij}\) for all \(1 \leq i,j \leq p\) and we know from the definition of ICS that this holds for \((S_1[X] h_j)_{1 \leq j \leq p}\).

Proof of Corollary 1

Proof. In ICS, the first step is to centre the data so wlog: \(\mathbb E [X] = 0\).
\(E\) is isometric to \(\mathbb R^p\) via the linear isomorphism: \[\phi_B: x \mapsto G_B^{1/2} [x]_B\] Then for any \((x,y) \in E^2\): \[ \begin{aligned} \langle \operatorname{Cov}_w [X] x, y \rangle_E &= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\ &= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\ &= \langle \operatorname{Cov}_w (G_B^{1/2} [X]_B) G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\ &= \langle \operatorname{Cov}_w (G_B [X]_B) [x]_B, [y]_B \rangle \\ &= \langle \operatorname{Cov}_w ([X]_B) G_B [x]_B, G_B [y]_B \rangle \end{aligned} \]

Bayes space and CB-splines

Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).

  • \(P = \frac1{\lambda(I)} \lambda\) uniform probability measure over \(I=[a,b]\).
  • \(L^2_0 (P) = \{ f \in L^2 (P) | \int_I f \mathrm dP = 0 \}\) is a separable Hilbert space.
  • Centred log-ratio transform \(\operatorname{clr}: p \in \mathbb R^I \mapsto \log(p) - \int_I \log(p) \mathrm dP\) (if it exists)
  • Bayes space \(B^2 (P) = \{ \operatorname{clr}^{-1} (\{f\}), f \in L^2_0 (I) \}\) inherits the separable Hilbert space structure of \(L^2_0 (P)\). Each equivalence class contains a density function defined almost everywhere.
  • B-splines basis \(B=(B_i)_{-q \leq i \leq k}\) of \(\mathcal S_{q+1}^\Delta (I) \subset L^2 (P)\), the subspace of polynomial splines on \(I\) of order \(q+1\) and with knots \(\Delta\).
  • ZB-splines basis \(Z=(Z_i)_{-q+1 \leq i \leq k-1} = \left(\frac{\mathrm d B_i}{\mathrm dx}\right)_{-q+1 \leq i \leq k-1}\) of \(\mathcal Z_q^\Delta (I) = \mathcal S_q^\Delta (I) \cap L^2_0 (P)\)
  • CB-splines basis \(C = (C_i)_{-q+1 \leq i \leq k-1} = (\operatorname{clr}^{-1} (Z_i))_{-q+1 \leq i \leq k-1}\) of \(\mathcal C_q^\Delta (I)\) which is a Euclidean subspace of \(B^2 (P)\) of dimension \(p=k+q-1\).