55ᵉ Journées de Statistique, Université de Bordeaux
May 30, 2024
Source: CPC Global Unified Temperature (NOAA PSL 1979), see also (Trinh, Simioni, and Thomas-Agnan 2023).
data(vietnam_temperature)
selected_years <- 2013:2016
vnt <- vietnam_temperature |>
filter(year %in% selected_years) |>
filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
left_join(vietnam_regions, by = c("region" = "code")) %>%
mutate(region = name) |>
arrange(region, province) |>
select(!c(t_min, name)) |>
mutate(t_max = as_dd(t_max, lambda = 1, nbasis = 10, norder = 4))
class(vnt$t_max) <- c("ddl", "fdl", "list")
vnt |> plot_funs(t_max) +
labs(x = "Daily max. temperature (°C)", y = "Density")
plot_with_ellipses <- function(data, blue = FALSE, ...) {
plot(
data,
asp = 1,
...
)
lines(
ellipse::ellipse(cov(data), centre = colMeans(data), level = sqrt(3 / 4)),
type = "l",
lty = 2,
col = "#0072b2"
)
arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
e <- eigen(cov(data))
arrows_stop <- arrows_start - t(e$vectors) * 2 *
e$values**0.5
if (blue) {
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#0072b2"
)
}
points(t(colMeans(data)), col = "#0072b2", pch = 3)
lines(
ellipse::ellipse(ICS::cov4(data),
centre = colMeans(data),
level = sqrt(3 / 4)
),
type = "l",
lty = 2,
col = "#d55e00"
)
arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
e <- eigen(ICS::cov4(data))
arrows_stop <- arrows_start - t(e$vectors) * 2 *
e$values**0.5
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#d55e00"
)
}
data(starsCYG, package = "robustbase")
x <- starsCYG
plot_with_ellipses(x, blue = TRUE)
plot_with_ellipses(x, blue = TRUE)
arrows_start <- matrix(colMeans(x), 2, 2, byrow = TRUE)
e <- list(vectors = icsobj$W, values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * sign(diag(e$vectors)) * 0.5
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#009e73"
)
e <- list(vectors = solve(t(icsobj$W)), values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * 2 * sign(diag(e$vectors))
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#009e73",
lty = "dotted"
)
Definition 1 (ICS problem)
\(\operatorname{ICS} (X, \color{#0072b2}{S_1},\color{#d55e00}{S_2})\): in \((E, \langle \cdot, \color{#0072b2}{S_1}[X] \cdot \rangle)\) Euclidean space of dimension \(p\), find an orthonormal basis \(\color{#009e73}{H} = (h_1, \dots, h_p)\) diagonalising the symmetric operator \(\color{#0072b2}{S_1}[X]^{-1} \color{#d55e00}{S_2}[X]\), i.e:
\[\left\{ \begin{aligned} \langle \color{#0072b2}{S_1} [X] h_i, h_j \rangle &= \delta_{ij} \\ \langle \color{#d55e00}{S_2}[X] h_i, h_j \rangle &= \lambda_i \delta_{ij} \end{aligned} \right. \text{ for all }1 \leq i,j \leq p\] where \(\lambda_1 \geq \dots \geq \lambda_p \geq 0\) are called generalised kurtosis.
\(z_i = \langle X - \mathbb EX, h_i \rangle, 1 \leq i \leq p\) are called invariant coordinates.
Proposition 1 (Reconstructing formula) \(X = \mathbb EX + \sum_{i=1}^p z_i h^*_i\) where \(\color{#009e73}{H^*} = (h^*_i)_{1 \leq i \leq p} = (\color{#0072b2}{S_1} [X] h_i)_{1 \leq i \leq p}\) is the dual basis of \(\color{#009e73}{H}\).
R
packages:
ICS
ICSOutlier
, ICSShiny
: application to outlier detectionICSClust
: application to clusteringfda::density.fd
computes B-splines coordinates of approximated solutions to the Maximum Penalised Likelihood problem.Let
Proposition 2 For any basis \(H\) of \(E\), these are equivalent:
0. | \(H\) | solves | \(\operatorname{ICS} (X, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) | over | \(E\) |
1. | \(\color{#d55e00}{G_B^{1/2}} [H]_B\) | solves | \(\operatorname{ICS} (\color{#d55e00}{G_B^{1/2}} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) | over | \(\mathbb R^p\) |
2. | \([H]_B\) | solves | \(\operatorname{ICS} (\color{#d55e00}{G_B} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) | over | \(\mathbb R^p\) |
3. | \(\color{#d55e00}{G_B} [H]_B\) | solves | \(\operatorname{ICS} ([X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2})\) | over | \(\mathbb R^p\) |
Proof. In ICS, the first step is to centre the data so wlog: \(\mathbb E [X] = 0\).
\(E\) is isometric to \(\mathbb R^p\) via the linear isomorphism: \[\phi_B: x \mapsto G_B^{1/2} [x]_B\] Then for any \((x,y) \in E^2\): \[
\begin{aligned}
\langle \operatorname{Cov}_w [X] x, y \rangle_E
&= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\
&= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\
&= \langle \operatorname{Cov}_w (G_B^{1/2} [X]_B) G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\
&= \langle \operatorname{Cov}_w (G_B [X]_B) [x]_B, [y]_B \rangle \\
&= \langle \operatorname{Cov}_w ([X]_B) G_B [x]_B, G_B [y]_B \rangle
\end{aligned}
\]
library(dda)
t_max_ics <- ICS(vnt$t_max)
vnt |>
cbind(as_tibble(t_max_ics$scores)) |>
ggplot(aes(IC.1, IC.2,
label = seq_len(nrow(vnt)), color = region, shape = NA
)) +
geom_point() +
geom_text() +
coord_fixed() +
guides(shape = "none")
vnt |> plot_funs(t_max, color = region) +
facet_wrap(vars(year)) +
labs(x = "Max. temperature (°C)", y = "Density")
dda
package and submit to CRAN\((E, \langle \cdot, \cdot \rangle)\) Euclidean space of dimension \(p\).
\(X \in \mathcal M (\Omega, E)\) random variable over \(E\).
\(\mathcal S^{+} (E)\) space of non-negative symmetric operators over \(E\).
\(\mathcal {GS}^{+} (E)\) space of positive symmetric operators over \(E\).
\(S: \mathcal A \subseteq \mathcal M (\Omega, E) \rightarrow \mathcal S^{+} (E)\) scatter operator over \(\mathcal A\) if:
Proof. Let us decompose \(S_1[X]^{-1} (X - \mathbb EX)\) over the basis \(H\), which is orthonormal in \((E, \langle \cdot, S_1[X] \cdot \rangle)\): \[\begin{aligned} S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_i \rangle h_i \\ &= \sum_{i=1}^p \langle X - \mathbb EX, h_i \rangle h_i \\ S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_i h_i \end{aligned}\] The dual basis \(H^*\) of \(H\) is the one that satisfies \(\langle h_i, h^*_j \rangle = \delta_{ij}\) for all \(1 \leq i,j \leq p\) and we know from the definition of ICS that this holds for \((S_1[X] h_i)_{1 \leq i \leq p}\).
library(tidyverse)
library(dda)
library(compositions)
data(vietnam_temperature)
data(vietnam_regions)
selected_years <- 2013:2016
hist_zeroreplace <- function(x) {
N <- sum(x$counts)
n <- length(x$counts)
dl <- rep(1 / N, n) # Detection limit
setNames(c(compositions::zeroreplace(x$density, dl)), x$mids)
}
vnth <- vietnam_temperature |>
filter(year %in% selected_years) |>
filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
left_join(vietnam_regions, by = c("region" = "code")) %>%
mutate(region = name) |>
arrange(region, province) |>
select(!c(t_min, name)) |>
mutate(t_max = lapply(t_max, function(x) {
hist(x,
seq(min(unlist(t_max)), max(unlist(t_max)), length.out = 12),
plot = FALSE
)
})) |>
mutate(t_max = lapply(t_max, hist_zeroreplace)) |>
unnest_wider(t_max, names_sep = ".")
t_max_ilr_ics <- vnth |>
select(starts_with("t_max")) |>
acomp() |>
ilr() |>
ICS(center = TRUE)
vnth |>
cbind(as_tibble(t_max_ilr_ics$scores)) |>
ggplot(aes(IC.1, IC.2,
label = seq_len(nrow(vnth)), color = region, shape = NA
)) +
geom_point() +
geom_text() +
coord_fixed() +
guides(shape = "none")
Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).