Coordinate-free ICS for distributional and functional data

Doctoral Workshop on Decision Mathematics and Statistics

June 19, 2025

Outline

  • Theoretical aim: Generalising ICS to distributional and functional data.
  • Case study aim: Detecting meteorological stations with an outlying evolution of their annual temperature distribution.
Figure 1: Workflow diagram of distributional outlier detection using ICS.

Preprocessing

From temperature data…

Code
library(dda)
library(sf)
library(tidyverse)
set.seed(9)

load("data/france/occitanie.RData")
load("data/france/ocdepts.RData")

occlim <- unnest(occlim)

selected_times <- sample(unique(occlim$time), 3)

occlim_sample <- occlim |>
  filter(time %in% selected_times)

ggplot(ocdepts) +
  geom_sf(alpha = 0) +
  geom_sf(data = occlim_sample, aes(color = tx)) +
  scale_color_viridis_c() +
  facet_wrap(vars(time)) +
  labs(color = "Temperature")
Figure 2: Daily maximum temperature per meteorological station in Occitanie between 1950 and 2025. 3 days chosen at random are extracted from the data set.
  • Total size: 300 stations \(\times\) 65 years \(\times\) 365 days > 4 million observations.

…through distributional time series…

Code
oct |>
  filter(name == "TOULOUSE-BLAGNAC") |>
  plot_funs(tx, color = year) +
  facet_wrap(vars(10 * year %/% 10)) +
  labs(x = "Temperature", y = "Density")
Figure 3: Annual maximum temperature densities for the TOULOUSE-BLAGNAC station, grouped by decade.

…to trend slope densities

Interpretation: using odds ratios defined by Maier et al. (2025).

Code
oct_slope <- oct_slope |>
  mutate(tx = as_dd(tx)) |>
  st_join(ocdepts, suffix = c("", "_dept")) |>
  rename(dept = name_dept) |>
  select(id, name, dept, tx)

c(mean(oct_slope$tx)) |>
  plot() +
  geom_hline(
    yintercept = 1 / diff(oct_slope$tx[[1]]$basis$rangeval),
    linetype = "dashed"
  ) +
  geom_vline(
    xintercept = 15,
    linetype = "dashed"
  ) +
  labs(x = "Temperature", y = "Density")
Figure 4: Mean trend slope density across the 300 stations of Occitanie.

Data set of trend slope densities

Case study

Code
oct_slope |>
  plot_funs(tx) +
  labs(x = "Temperature", y = "Density")
Figure 5: Data set \(D\) of trend slope densities of temperature change for the 300 stations in Occitanie.

Theory

Density data

  • 11-dimensional space \(\mathcal C_q^\Delta (a,b)\) of compositional splines.
  • Subspace of the Bayes Hilbert space \((B^2 (a,b), \oplus, \odot, \langle \cdot, \cdot \rangle)\) (Van Den Boogaart, Egozcue, and Pawlowsky-Glahn 2014).
  • Inner product \[\langle f, g \rangle_{B^2 (a,b)} = \frac1{2(b-a)} \int_a^b \int_a^b \frac{\ln f(x)}{\ln f(y)} \frac{\ln g(x)}{\ln g(y)} dxdy\]

General framework

  • \((E, \langle \cdot, \cdot \rangle)\): \(p\)-dimensional Euclidean space.
  • Why a Euclidean space? To define scatter operators, an inner product is required to have a notion of symmetry and positiveness.
  • Let \(X\) be an \(E\)-valued random object, from which we observe a sample \(D = (X_1, \dots, X_n)\).

Scatter operators

Case study

Code
library(ICS)

oct_eval <- oct_slope |>
  mutate(tx = as.fd(tx)) |>
  eval_funs(tx, n = 100) |>
  st_drop_geometry() |>
  select(x, y, id) |>
  pivot_wider(names_from = id, values_from = y) |>
  column_to_rownames("x") |>
  t()

oct_cov <- cov(oct_eval)
eps <- 1e-8 * matrix(rnorm(ncol(oct_eval) * nrow(oct_eval), sd = max(abs(oct_eval))),
  nrow = nrow(oct_eval)
)
oct_cov4 <- cov4(oct_eval + eps)

oct_slope_scatters <- crossing(t_1 = colnames(oct_eval), t_2 = colnames(oct_eval)) |>
  rowwise() |>
  mutate(Cov = oct_cov[t_1, t_2], Cov_4 = oct_cov4[t_1, t_2]) |>
  pivot_longer(c(Cov, Cov_4), names_to = "scatter", values_to = "z") |>
  mutate(t_1 = as.numeric(t_1), t_2 = as.numeric(t_2))

tx_ics <- dda::ICS(oct_slope$tx)
oct_basis <- oct_slope$tx[[1]]$basis

changemat <- to_zbsplines(basis = oct_basis, inv = TRUE)
gram <- t(changemat) %*% fda::inprod(oct_basis, oct_basis) %*% changemat
tx_cov_zb <- cov(t(to_zbsplines(c(oct_slope$tx))))
tx_cov4_zb <- cov4(t(to_zbsplines(c(oct_slope$tx))))

# changemat_inv <- to_zbsplines(basis = oct_basis)
# tx_cov_b <- t(changemat_inv) %*% tx_cov_zb %*% changemat_inv
# tx_cov4_b <- t(changemat_inv) %*% tx_cov4_zb %*% changemat_inv

rangeval <- oct_slope$tx[[1]]$basis$rangeval
tval <- seq(rangeval[1], rangeval[2], length.out = 100)

zbsp <- fd(to_zbsplines(inv = TRUE, coefs = diag(11), basis = oct_basis), oct_basis)
zbsp_tval <- eval.fd(tval, zbsp)
tx_cov_grid <- zbsp_tval %*% tx_cov_zb %*% t(zbsp_tval)
tx_cov4_grid <- zbsp_tval %*% tx_cov4_zb %*% t(zbsp_tval)
dimnames(tx_cov_grid) <- list(tval, tval)
dimnames(tx_cov4_grid) <- list(tval, tval)

oct_slope_scatters_2 <- reshape2::melt(tx_cov_grid) |>
  rename(t_1 = Var1, t_2 = Var2, Cov = value) |>
  mutate(Cov_4 = reshape2::melt(tx_cov4_grid)$value) |>
  pivot_longer(c(Cov, Cov_4), names_to = "scatter", values_to = "z")

ggplot(oct_slope_scatters_2) +
  stat_contour_filled(aes(t_1, t_2, z = z)) +
  facet_wrap(vars(scatter), nrow = 2) +
  labs(x = "Temperature", y = "Temperature", fill = "Value")
Figure 6: Empirical estimates of the scatter operators \(\color{#0072b2}{\operatorname{Cov}} [X]\) and \(\color{#d55e00}{\operatorname{Cov}_4} [X]\) for the data set of slope densities of temperature change in Occitanie.

Theory

  • \(w\)-weighted covariance operator \[\begin{multline*}\langle \operatorname{Cov}_w [X] x, y \rangle \\ = \mathbb E [ w^2(\| \operatorname{Cov}[X]^{-1/2} (X - \mathbb E X) \|) \\ \langle X - \mathbb EX, x \rangle \langle X - \mathbb EX, y \rangle ] \end{multline*}\] for \((x,y) \in E\).

  • Examples.

    1. \(w = 1\) defines \(\color{#0072b2}{\operatorname{Cov}}\);
    2. \(w = \operatorname{id}\) defines \(\color{#d55e00}{\operatorname{Cov}_4}\).
  • Affine equivariant and defined coordinate-freely: no reference to any basis, exist on every Euclidean space.

Coordinate-free ICS

ICS problem

Theory

Definition 1 (ICS problem) \(\operatorname{ICS} (X, \color{#0072b2}{S_1},\color{#d55e00}{S_2})\): joint diagonalisation of \(\color{#0072b2}{S_1}[X]\) and \(\color{#d55e00}{S_2}[X]\), i.e:

\[\left\{ \begin{aligned} \langle \color{#0072b2}{S_1} [X] h_j, h_{j'} \rangle &= \delta_{jj'} \\ \langle \color{#d55e00}{S_2}[X] h_j, h_{j'} \rangle &= \lambda_j \delta_{jj'} \end{aligned} \right. \] for all \(1 \leq j,j' \leq p\) where

  • \(\lambda_1 \geq \dots \geq \lambda_p \geq 0\) are called ICS eigenvalues.
  • \((h_1, \dots, h_p)\) is an ICS eigenbasis.

Solutions exist provided that \(\color{#0072b2}{S_1} [X]\) is invertible.

Case study

Code
library(dda)

tx_ics <- dda::ICS(oct_slope$tx, )

tx_ics_eigen <- tibble(
  index = factor(names(tx_ics$gen_kurtosis),
    levels = names(tx_ics$gen_kurtosis)
  ),
  gen_kurtosis = tx_ics$gen_kurtosis,
  H_dual = as.list(tx_ics$H_dual),
)

tx_ics_eigen |>
  ggplot(aes(index, gen_kurtosis, group = 1)) +
  geom_line(alpha = 0.5) +
  geom_point(aes(color = gen_kurtosis), size = 3) +
  labs(x = "", y = "ICS eigenvalues") +
  guides(color = "none")
Figure 7: Scree plot of ICS eigenvalues for the data set of slopes densities in Occitanie.
Code
tx_ics_eigen |>
  mutate(H_dual = as_dd(H_dual)) |>
  plot_funs(H_dual, color = gen_kurtosis) +
  facet_wrap(vars(index)) +
  geom_hline(
    yintercept = 1 / diff(oct_slope$tx[[1]]$basis$rangeval),
    linetype = "dotted"
  ) +
  scale_y_continuous(n.breaks = 3) +
  labs(x = "Temperature", y = "Density") +
  guides(color = "none")
Figure 8: ICS eigendensities for the data set of slope densities in Occitanie.

Invariant coordinates

Case study

Code
oct_slope |>
  cbind(as_tibble(tx_ics$scores)) |>
  ggplot(aes(IC.1, IC.2, color = dept)) +
  geom_point() +
  coord_fixed() +
  labs(color = "Department")
Figure 9: First two invariant coordinates coloured by department.

Theory

For \(1 \leq j \leq p\): \[z_j = \langle X - \mathbb EX, h_j \rangle\] is the \(i\)-th invariant coordinate of \(X\).

Reconstructing the curves

Case study

Code
h_dual_shift <- tx_ics$H_dual
h_dual_shift$coefs <- sweep(h_dual_shift$coefs, 1, -c(mean(oct_slope$tx))$coefs)
tx_ics_eigen |>
  mutate(H_dual_shift = as.list(as_dd(h_dual_shift))) |>
  plot_funs(H_dual_shift, color = gen_kurtosis) +
  geom_function(
    fun = function(x) eval(c(mean(oct_slope$tx)), x),
    color = "black", linetype = "dotted"
  ) +
  scale_y_continuous(n.breaks = 3) +
  facet_wrap(vars(index)) +
  labs(x = "Temperature", y = "Density") +
  guides(color = "none")
Figure 10: Mean slope density across Occitanie, shifted by each ICS dual eigendensity.

Theory

Proposition 1 (Reconstructing formula) \[X = \mathbb EX + \sum_{j=1}^p z_j h^*_j\] where \[\begin{aligned} H^* &= (h^*_j)_{1 \leq j \leq p} \\ &= (\color{#0072b2}{S_1} [X] h_j)_{1 \leq j \leq p} \end{aligned}\] is the dual basis of \(H=(h_j)_{1 \leq j \leq p}\).

Implementation

Let

  • \(B\) be any basis of \(E\);
  • \(G_B = (\langle b_i, b_j \rangle)_{1 \leq i,j \leq p}\) its Gram matrix;
  • \([\cdot]_B: E \rightarrow \mathbb R^p\) the coordinates in basis \(B\).

Corollary 1 (of Proposition 2) For any basis \(H\) of \(E\), these are equivalent:

0. \(H\) solves \(\operatorname{ICS} (X, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(E\)
1. \(\color{#d55e00}{G_B^{1/2}} [H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B^{1/2}} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
2. \([H]_B\) solves \(\operatorname{ICS} (\color{#d55e00}{G_B} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)
3. \(\color{#d55e00}{G_B} [H]_B\) solves \(\operatorname{ICS} ([X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) over \(\mathbb R^p\)

Outlier detection using ICS

Theory

Based on ICSOutlier (Archimbaud, Nordhausen, and Ruiz-Gazen 2018a).

  1. Compute the invariant coordinates.
  1. Select \(k\) components. Using the scree plot or D’Agostino normality tests, etc. Now compute the ICS distances: \(\sum_{j=1}^k |z_j|^2\).
  1. Choose a cut-off \(\tau\). Based on Monte Carlo simulations to compute quantiles (Archimbaud, Nordhausen, and Ruiz-Gazen 2018b).

Case study

Code
tx_out <- dda::ICS_outlier(oct_slope$tx, index = 1)

oct_slope <- oct_slope |>
  mutate(
    distance = tx_out$ics_distances,
    outlying = tx_out$outliers == 1
  )

oct_slope |>
  mutate(name = ifelse(outlying, name, NA)) |>
  ggplot(aes(id, distance, color = dept, label = name)) +
  geom_point() +
  geom_text() +
  scale_color_discrete(limits = function(x) ifelse(is.na(x), "", x)) +
  labs(color = "Outlying\nstations") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank()
  )
Figure 11: ICS distances for \(k=1\) selected component.
Code
oct_slope |>
  mutate(name = ifelse(outlying, name, NA)) |>
  ggplot(aes(id, distance, color = dept, label = name)) +
  geom_point() +
  geom_text() +
  scale_color_discrete(limits = function(x) ifelse(is.na(x), "", x)) +
  geom_hline(yintercept = tx_out$ics_dist_cutoff) +
  labs(color = "Outlying\nstations") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank()
  )
Figure 12: ICS distances for k=1 selected component, with threshold 5.5.

Results for the case study

Code
oct_slope |>
  filter(outlying) |>
  plot_funs(tx, color = name) +
  geom_hline(
    yintercept = 1 / diff(oct_slope$tx[[1]]$basis$rangeval),
    linetype = "dotted"
  ) +
  geom_function(
    fun = function(x) eval(mean(c(oct_slope$tx)), x),
    color = "black"
  ) +
  labs(x = "Temperature", y = "Density") +
  guides(color = "none")

ggplot(ocdepts) +
  geom_sf(alpha = 0) +
  geom_sf(data = filter(oct_slope, outlying), aes(color = name)) +
  geom_sf_text(data = filter(oct_slope, outlying), aes(color = name, label = name)) +
  labs(x = "", y = "", color = "Station")
Figure 13: Outlying stations for the data set of slope densities in Occitanie.
Figure 14: Outlying stations on a map of Occitanie.

Comparison with other methods

Monte Carlo simulations

  1. Generate density data with outliers according to 3 different outlier schemes.

  2. (Optionally) apply a transformation:

  • log quantile density
  • quantile function

then choose a metric:

  • \(L^2 (a,b)\)
  • \(B^2 (a,b)\)
  • Wasserstein.
  1. Compute scores for 4 different functional outlier detection methods adapted to density data:

ROC curves

Code
load("data/ICS_comparison.RData")

# Aggregate results (averaging over simulations)
comparison_avg <- comparison |>
  separate_wider_delim(Method, "_", names = c("Method", "Transf")) |>
  group_by(Scheme, Transf, Method, PP) |>
  summarize(
    n = n(),
    TPR_avg = mean(TPR),
    TPR_conf = qnorm(0.975) * sd(TPR) / sqrt(n()),
    FPR_avg = mean(FPR),
    FPR_conf = qnorm(0.975) * sd(FPR) / sqrt(n()),
    .groups = "drop"
  )

ggplot(comparison_avg, aes(FPR_avg, TPR_avg, color = Scheme, group = Scheme)) +
  geom_line() +
  geom_ribbon(aes(
    ymin = pmax(0, TPR_avg - TPR_conf),
    ymax = pmin(1, TPR_avg + TPR_conf),
    xmin = pmax(0, FPR_avg - FPR_conf),
    xmax = pmin(1, FPR_avg + FPR_conf),
    fill = Scheme
  ), alpha = 0.1, linetype = "dotted") +
  geom_function(fun = identity, linetype = "dotted") +
  facet_wrap(vars(paste(Method, Transf, sep = "_")), nrow = 2) +
  coord_fixed() +
  labs(
    x = "False Positive Rate (FPR)",
    y = "True Positive Rate (TPR)",
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Figure 15: Comparing ICS with other distributional outlier detection methods through simulated ROC curves for several outlier generating schemes.

Area under the curve (AUC)

Code
comparison_auc <- comparison_avg |>
  group_by(Scheme, Method, Transf) |>
  summarise(AUC = integrate(approxfun(FPR_avg, TPR_avg), 0, 1)$value)

comparison_auc_summary <- comparison_auc |>
  group_by(Method, Transf) |>
  summarise(AUC_avg = mean(AUC)) |>
  arrange(desc(AUC_avg))

ggplot(comparison_auc_summary, aes(reorder(paste(Method, Transf, sep = "\n"), -AUC_avg), AUC_avg, fill = Method)) +
  geom_col() +
  coord_cartesian(ylim = c(0.5, NA)) +
  labs(x = "")
Figure 16: Area under the ROC curve for each outlier detection method.

Thank you for listening!

For more details: Mondon et al. (2025).

Any questions?

References

Archimbaud A., Nordhausen K. and Ruiz-Gazen A. 2018a. ICSOutlier: Unsupervised Outlier Detection for Low- Dimensional Contamination Structure.” The R Journal 10 (1) : 234–50.
———. 2018b. ICS for multivariate outlier detection with application to quality control.” Computational Statistics & Data Analysis 128 (December) : 184–99.
Machalová J., Talská R., Hron K. and Gába A. 2021. Compositional splines for representation of density functions.” Computational Statistics 36 (2) : 1031–64.
Maier E.-M., Stöcker A., Fitzenberger B. and Greven S. 2025. Additive density-on-scalar regression in Bayes Hilbert spaces with an application to gender economics.” The Annals of Applied Statistics 19 (1) : 680–700.
Mondon C., Trinh H. T., Ruiz-Gazen A. and Thomas-Agnan C. 2025. ICS for complex data with application to outlier detection for density data.” arXiv.
Murph A. C., Strait J. D., Moran K. R., Hyman J. D. and Stauffer P. H. 2024. Visualisation and outlier detection for probability density function ensembles.” Stat 13 (2) : e662.
Ojo O. T., Fernández Anta A., Lillo R. E. and Sguera C. 2022. Detecting and classifying outliers in big functional data.” Advances in Data Analysis and Classification 16 (3) : 725–60.
Sun Y. and Genton M. G. 2011. Functional Boxplots.” Journal of Computational and Graphical Statistics 20 (2) : 316–34.
Van Den Boogaart K. G., Egozcue J. J. and Pawlowsky-Glahn V. 2014. Bayes Hilbert Spaces.” Australian & New Zealand Journal of Statistics 56 (2) : 171–94.

Appendix

Scatter operators

  • \((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\).

  • \(X \in \mathcal M (\Omega, E)\) random variable over \(E\).

  • \(\mathcal S^{+} (E)\) space of non-negative symmetric operators over \(E\).

  • \(\mathcal {GS}^{+} (E)\) space of positive symmetric operators over \(E\).

  • \(S: \mathcal A \subseteq \mathcal M (\Omega, E) \rightarrow \mathcal S^{+} (E)\) scatter operator over \(\mathcal A\) if:

    • Invariance by equality in distribution: \[\forall (X,Y) \in \mathcal A^2, X \sim Y \Rightarrow S[X] = S[Y] \]
    • It is affine equivariant if: \[\forall A \in \mathcal{GL} (E), \forall b \in E, S[AX+b] = A S[X] A^*\]

Proof of Proposition 1

Proof.

Let us decompose \(S_1[X]^{-1} (X - \mathbb EX)\) over the basis \(H\), which is orthonormal in \((E, \langle \cdot, S_1[X] \cdot \rangle)\): \[\begin{aligned} S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_j \rangle h_j \\ &= \sum_{i=1}^p \langle X - \mathbb EX, h_j \rangle h_j \\ S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_j h_j \end{aligned}\]

The dual basis \(H^*\) of \(H\) is the one that satisfies \(\langle h_j, h^*_j \rangle = \delta_{ij}\) for all \(1 \leq i,j \leq p\) and we know from the definition of ICS that this holds for \((S_1[X] h_j)_{1 \leq j \leq p}\).

Proof of Corollary 1

Proof. In ICS, the first step is to centre the data so wlog: \(\mathbb E [X] = 0\).
\(E\) is isometric to \(\mathbb R^p\) via the linear isomorphism: \[\phi_B: x \mapsto G_B^{1/2} [x]_B\] Then for any \((x,y) \in E^2\): \[ \begin{aligned} \langle \operatorname{Cov}_w [X] x, y \rangle_E &= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\ &= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\ &= \langle \operatorname{Cov}_w (G_B^{1/2} [X]_B) G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\ &= \langle \operatorname{Cov}_w (G_B [X]_B) [x]_B, [y]_B \rangle \\ &= \langle \operatorname{Cov}_w ([X]_B) G_B [x]_B, G_B [y]_B \rangle \end{aligned} \]

Bayes space and CB-splines

Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).

  • \(P = \frac1{\lambda(I)} \lambda\) uniform probability measure over \(I=[a,b]\).
  • \(L^2_0 (P) = \{ f \in L^2 (P) | \int_I f \mathrm dP = 0 \}\) is a separable Hilbert space.
  • Centred log-ratio transform \(\operatorname{clr}: p \in \mathbb R^I \mapsto \log(p) - \int_I \log(p) \mathrm dP\) (if it exists)
  • Bayes space \(B^2 (P) = \{ \operatorname{clr}^{-1} (\{f\}), f \in L^2_0 (I) \}\) inherits the separable Hilbert space structure of \(L^2_0 (P)\). Each equivalence class contains a density function defined almost everywhere.
  • B-splines basis \(B=(B_i)_{-q \leq i \leq k}\) of \(\mathcal S_{q+1}^\Delta (I) \subset L^2 (P)\), the subspace of polynomial splines on \(I\) of order \(q+1\) and with knots \(\Delta\).
  • ZB-splines basis \(Z=(Z_i)_{-q+1 \leq i \leq k-1} = \left(\frac{\mathrm d B_i}{\mathrm dx}\right)_{-q+1 \leq i \leq k-1}\) of \(\mathcal Z_q^\Delta (I) = \mathcal S_q^\Delta (I) \cap L^2_0 (P)\)
  • CB-splines basis \(C = (C_i)_{-q+1 \leq i \leq k-1} = (\operatorname{clr}^{-1} (Z_i))_{-q+1 \leq i \leq k-1}\) of \(\mathcal C_q^\Delta (I)\) which is a Euclidean subspace of \(B^2 (P)\) of dimension \(p=k+q-1\).