55ᵉ Journées de Statistique, Université de Bordeaux
May 30, 2024
Source: CPC Global Unified Temperature (NOAA PSL 1979), see also (Trinh, Simioni, and Thomas-Agnan 2023).
Figure 1: Daily maximum temperatures per province in Vietnam on several days of 2016.
data(vietnam_temperature)
selected_years <- 2013:2016
vnt <- vietnam_temperature |>
filter(year %in% selected_years) |>
filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
left_join(vietnam_regions, by = c("region" = "code")) %>%
mutate(region = name) |>
arrange(region, province) |>
select(!c(t_min, name)) |>
mutate(t_max = as_dd(t_max, lambda = 1, nbasis = 10, norder = 4))
class(vnt$t_max) <- c("ddl", "fdl", "list")
vnt |> plot_funs(t_max) +
labs(x = "Daily max. temperature (°C)", y = "Density")
Figure 2: Yearly densities of daily max. temperature of some Vietnamese provinces between 2013 and 2016.
plot_with_ellipses <- function(data, blue = FALSE, ...) {
plot(
data,
asp = 1,
...
)
lines(
ellipse::ellipse(cov(data), centre = colMeans(data), level = sqrt(3 / 4)),
type = "l",
lty = 2,
col = "#0072b2"
)
arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
e <- eigen(cov(data))
arrows_stop <- arrows_start - t(e$vectors) * 2 *
e$values**0.5
if (blue) {
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#0072b2"
)
}
points(t(colMeans(data)), col = "#0072b2", pch = 3)
lines(
ellipse::ellipse(ICS::cov4(data),
centre = colMeans(data),
level = sqrt(3 / 4)
),
type = "l",
lty = 2,
col = "#d55e00"
)
arrows_start <- matrix(colMeans(data), 2, 2, byrow = TRUE)
e <- eigen(ICS::cov4(data))
arrows_stop <- arrows_start - t(e$vectors) * 2 *
e$values**0.5
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#d55e00"
)
}
data(starsCYG, package = "robustbase")
x <- starsCYG
plot_with_ellipses(x, blue = TRUE)
Figure 3: The two starsCYG
variables and the ellipses associated with Cov and Cov4. The blue arrows are the main axes of the Principal Component Analysis.
Figure 4: The two starsCYG
variables after whitening and the ellipses associated with Cov and Cov4.
Figure 5: The two starsCYG
variables after rotating to diagonalise Cov4 and the ellipses associated with Cov and Cov4. We obtain the invariant coordinates (IC).
plot_with_ellipses(x, blue = TRUE)
arrows_start <- matrix(colMeans(x), 2, 2, byrow = TRUE)
e <- list(vectors = icsobj$W, values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * sign(diag(e$vectors)) * 0.5
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#009e73"
)
e <- list(vectors = solve(t(icsobj$W)), values = icsobj$gen_kurtosis)
arrows_stop <- arrows_start - e$vectors * 2 * sign(diag(e$vectors))
arrows(arrows_start[, 1], arrows_start[, 2],
arrows_stop[, 1], arrows_stop[, 2],
col = "#009e73",
lty = "dotted"
)
Figure 6: The two original starsCYG
variables and the ellipses associated with Cov and Cov4. In green is the basis given by ICS.
Definition 1 (ICS problem)
ICS(X,S1,S2): in (E,⟨⋅,S1[X]⋅⟩) Euclidean space of dimension p, find an orthonormal basis H=(h1,…,hp) diagonalising the symmetric operator S1[X]−1S2[X], i.e:
{⟨S1[X]hi,hj⟩=δij⟨S2[X]hi,hj⟩=λiδij for all 1≤i,j≤p where λ1≥⋯≥λp≥0 are called generalised kurtosis.
zi=⟨X−EX,hi⟩,1≤i≤p are called invariant coordinates.
Proposition 1 (Reconstructing formula) X=EX+∑pi=1zih∗i where H∗=(h∗i)1≤i≤p=(S1[X]hi)1≤i≤p is the dual basis of H.
R
packages:
ICS
ICSOutlier
, ICSShiny
: application to outlier detectionICSClust
: application to clusteringfda::density.fd
computes B-splines coordinates of approximated solutions to the Maximum Penalised Likelihood problem.Let
Proposition 2 For any basis H of E, these are equivalent:
0. | H | solves | \operatorname{ICS} (X, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2}) | over | E |
1. | \color{#d55e00}{G_B^{1/2}} [H]_B | solves | \operatorname{ICS} (\color{#d55e00}{G_B^{1/2}} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2}) | over | \mathbb R^p |
2. | [H]_B | solves | \operatorname{ICS} (\color{#d55e00}{G_B} [X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2}) | over | \mathbb R^p |
3. | \color{#d55e00}{G_B} [H]_B | solves | \operatorname{ICS} ([X]_B, \operatorname{Cov}_{w_1}, \operatorname{Cov}_{w_2}) | over | \mathbb R^p |
Proof. In ICS, the first step is to centre the data so wlog: \mathbb E [X] = 0.
E is isometric to \mathbb R^p via the linear isomorphism: \phi_B: x \mapsto G_B^{1/2} [x]_B Then for any (x,y) \in E^2:
\begin{aligned}
\langle \operatorname{Cov}_w [X] x, y \rangle_E
&= \mathbb E [ w^2(X) \langle X, x \rangle_E \langle X, y \rangle_E ] \\
&= \mathbb E [ w^2([X]_B) \langle G_B^{1/2} [X]_B, G_B^{1/2} [x]_B \rangle \langle G_B^{1/2} [X]_B, G_B^{1/2} [y]_B \rangle ] \\
&= \langle \operatorname{Cov}_w (G_B^{1/2} [X]_B) G_B^{1/2} [x]_B, G_B^{1/2} [y]_B \rangle \\
&= \langle \operatorname{Cov}_w (G_B [X]_B) [x]_B, [y]_B \rangle \\
&= \langle \operatorname{Cov}_w ([X]_B) G_B [x]_B, G_B [y]_B \rangle
\end{aligned}
library(dda)
t_max_ics <- ICS(vnt$t_max)
vnt |>
cbind(as_tibble(t_max_ics$scores)) |>
ggplot(aes(IC.1, IC.2,
label = seq_len(nrow(vnt)), color = region, shape = NA
)) +
geom_point() +
geom_text() +
coord_fixed() +
guides(shape = "none")
vnt |> plot_funs(t_max, color = region) +
facet_wrap(vars(year)) +
labs(x = "Max. temperature (°C)", y = "Density")
dda
package and submit to CRAN(E, \langle \cdot, \cdot \rangle) Euclidean space of dimension p.
X \in \mathcal M (\Omega, E) random variable over E.
\mathcal S^{+} (E) space of non-negative symmetric operators over E.
\mathcal {GS}^{+} (E) space of positive symmetric operators over E.
S: \mathcal A \subseteq \mathcal M (\Omega, E) \rightarrow \mathcal S^{+} (E) scatter operator over \mathcal A if:
Proof. Let us decompose S_1[X]^{-1} (X - \mathbb EX) over the basis H, which is orthonormal in (E, \langle \cdot, S_1[X] \cdot \rangle): \begin{aligned} S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p \langle S_1[X]^{-1} (X - \mathbb EX), S_1[X] h_i \rangle h_i \\ &= \sum_{i=1}^p \langle X - \mathbb EX, h_i \rangle h_i \\ S_1[X]^{-1} (X - \mathbb EX) &= \sum_{i=1}^p z_i h_i \end{aligned} The dual basis H^* of H is the one that satisfies \langle h_i, h^*_j \rangle = \delta_{ij} for all 1 \leq i,j \leq p and we know from the definition of ICS that this holds for (S_1[X] h_i)_{1 \leq i \leq p}.
library(tidyverse)
library(dda)
library(compositions)
data(vietnam_temperature)
data(vietnam_regions)
selected_years <- 2013:2016
hist_zeroreplace <- function(x) {
N <- sum(x$counts)
n <- length(x$counts)
dl <- rep(1 / N, n) # Detection limit
setNames(c(compositions::zeroreplace(x$density, dl)), x$mids)
}
vnth <- vietnam_temperature |>
filter(year %in% selected_years) |>
filter(region %in% c("NMM", "RRD") | province %in% c("ANGIANG", "BACLIEU")) |>
left_join(vietnam_regions, by = c("region" = "code")) %>%
mutate(region = name) |>
arrange(region, province) |>
select(!c(t_min, name)) |>
mutate(t_max = lapply(t_max, function(x) {
hist(x,
seq(min(unlist(t_max)), max(unlist(t_max)), length.out = 12),
plot = FALSE
)
})) |>
mutate(t_max = lapply(t_max, hist_zeroreplace)) |>
unnest_wider(t_max, names_sep = ".")
t_max_ilr_ics <- vnth |>
select(starts_with("t_max")) |>
acomp() |>
ilr() |>
ICS(center = TRUE)
vnth |>
cbind(as_tibble(t_max_ilr_ics$scores)) |>
ggplot(aes(IC.1, IC.2,
label = seq_len(nrow(vnth)), color = region, shape = NA
)) +
geom_point() +
geom_text() +
coord_fixed() +
guides(shape = "none")
Figure 9: First two invariant coordinates coloured by region.
Source: Van Den Boogaart, Egozcue, and Pawlowsky-Glahn (2014) and Machalová et al. (2021).