1  Prerequisites

1.1 Multivariate outlier detection and ICS

A common problem in multivariate data analysis is that of detecting outliers in a contaminated data set. If we denote by \(\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\) a set of independent observations of a random variable \(\mathbf{X}\) on \(\mathbb{R}^D\), we need to model the contamination process leading from the original uncontaminated random variable \(\mathbf{X}_0\) to the contaminated \(\mathbf{X}\).

Code
source("../functions.R")
library(mixtools) # rmvnormmix(): sampler for mixtures of (conditionally independent) normals

.q <- 3       # number of outlier clusters
.eps_0 <- 0.9 # proportion of the uncontaminated component
.theta <- runif(.q, 0, pi / 3) # random directions of the outlier clusters

# One main component centred at the origin plus .q small clusters at distance 5
mixture <- data.frame(rmvnormmix(
  n = 150,
  lambda = c(.eps_0, rep((1 - .eps_0) / .q, .q)),
  mu = t(matrix(c(
    0, 0, 5 * rbind(cos(.theta), sin(.theta))
  ), 2, .q + 1)),
  sigma = t(matrix(c(1, 1, rep(
    0.2, 2 * .q
  )), 2, .q + 1))
))
names(mixture) <- paste(rep("X", 2), 1:2, sep = "_")

plot_data(
  mixture,
  width = 4,
  height = 3,
  grid = TRUE
)
Figure 1.1: \(n=150\) points generated from a Gaussian mixture model with \(D = 2\), \(q=3\), \(\left(\varepsilon_1, \varepsilon_2\right)= \left(0.05, 0.05\right)\).

There are many ways to model contaminated data; one of them is to describe the data generating process (DGP) as a convex combination (called a mixture) of the original distribution function with other distributions modelling the outliers:

\[F_{\,\mathbf{X}} = \left(1-\varepsilon\right) F_0 + \sum_{k=1}^{k=q} \varepsilon_k F_k\] where \(F_0 = F_{\,\mathbf{X}_0}\), \(\varepsilon = \sum_{k=1}^{k=q} \varepsilon_k\) is the theoretical proportion of outliers in the data and if \(1 \leq k \leq q\), \(F_k\) denotes the distribution function of the \(k\)th cloud of outliers.

Example 1.1 Figure 1.1 gives a scatter plot of \(n=150\) observations generated in \(D=2\) dimensions from a mixture of \(q=3\) Gaussian distributions with \(\varepsilon=0.1\), \(\varepsilon_1=0.05\) and \(\varepsilon_2=0.05\). The two groups of outliers are clearly visible on this plot. But this may not be the case when the dimension \(D\) is larger than 3.

1.1.1 Elliptical distributions

The outlier detection method studied in this report relies on the lack of elliptical symmetry of the contaminated distribution, so we have to properly define elliptical distributions, which are a generalization of normal distributions.

Definition 1.1 (Elliptical distributions) Let \(\mu\) be a vector of \(\mathbb{R}^D\) and \(\Sigma \in \mathbb{R}^{D \times D}\) a positive semi-definite symmetric matrix. A random variable \(\mathbf{X}\) is said to follow a \(\left(\mu, \Sigma\right)\)-elliptical distribution if there exists a random variable \(\mathbf{Z}\) on \(\mathbb{R}^D\) such that:

  1. \(\mathbf{X}= \Sigma^{\frac{1}{2}} \mathbf{Z}+ \mu\);

  2. \(\mathbf{Z}\) has a spherical distribution, meaning that for every orthogonal matrix \(\mathop{\mathrm{Q}}\) the equality of distribution functions \(\mathbf{Z}\sim \mathop{\mathrm{Q}}\, \mathbf{Z}\) holds.

Example 1.2  

  • The multivariate Gaussian distribution with mean \(\mu\) and covariance matrix \(\Sigma\) is the most well-known elliptical distribution.
    Figure 1.2 gives a scatter plot of \(n=150\) observations generated in two dimensions with parameters \(\mu = \left( \begin{array}{c} 5 \\ 1 \end{array}\right)\) and \(\Sigma = \left( \begin{array}{cc} 2 & 1 \\ 1 & 2 \end{array}\right)\). The ellipse corresponds to a quantile of order \(0.98\).

  • Other distributions such as the multivariate Student distributions also belong to the family of elliptical distributions.

Code
# rmvn(): multivariate normal sampler (assumed to be provided by one of the
# packages loaded via functions.R, e.g. mgcv or Rfast)
elliptical <- data.frame(rmvn(n = 150, c(5, 1), rbind(c(2, 1), c(1, 2))))
names(elliptical) <- paste(rep("X", 2), 1:2, sep = "_")

plot_data(elliptical,
  grid = TRUE,
  cov_ellipse = TRUE
)
Figure 1.2: \(n=150\) points generated from a normal distribution with \(\mu = \left( \begin{array}{c} 5 \\ 1 \end{array}\right)\), \(\Sigma = \left(\begin{array}{cc} 2 & 1 \\ 1 & 2 \end{array}\right)\).

Remark. For an elliptical distribution \(\mathbf{X}\), the vector parameter and the matrix parameter are uniquely defined by the distribution function of \(\mathbf{X}\), and in particular are functions of \(\mathbf{X}\) denoted respectively by \(\mu\left[\mathbf{X}\right]\) and \(\Sigma\left[\mathbf{X}\right]\).
This fact can be simply illustrated in the case of absolutely continuous elliptical distributions, since the level sets of the density function are the \(\Sigma\)-shaped ellipsoids centred at \(\mu\) (see the following result, cited in (Tyler et al. 2009)).

Proposition 1.1 An absolutely continuous \(\left(\mu, \Sigma\right)\)-elliptical random variable \(\mathbf{X}\) on \(\mathbb{R}^D\) has a density function verifying, for every \(\mathbf{x}\) in \(\mathbb{R}^D\): \[f_{\,\mathbf{X}}\left(\mathbf{x}\right) = C_D \det\left(\Sigma\right)^{-\frac{1}{2}} g\left(\left(\mathbf{x}- \mu\right)^\top \Sigma^{-1} \left(\mathbf{x}- \mu\right)\right)\] where \(g:\mathbb{R}_+ \rightarrow \mathbb{R}_+\) is a non-negative function and \(C_D \in \mathbb{R}_+^*\) is a constant.
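
For instance, in the Gaussian case the proposition holds with \(g\left(t\right) = e^{-\frac{t}{2}}\) and \(C_D = \left(2\pi\right)^{-\frac{D}{2}}\). The short sketch below (assuming the mvtnorm package for dmvnorm) checks this numerically on the parameters of Figure 1.2.

Code
# Numerical check of the elliptical form of the Gaussian density (sketch)
library(mvtnorm)
mu <- c(5, 1)
Sigma <- rbind(c(2, 1), c(1, 2))
x <- c(6, 0)
d2 <- mahalanobis(x, center = mu, cov = Sigma)     # (x - mu)' Sigma^{-1} (x - mu)
(2 * pi)^(-1) * det(Sigma)^(-1 / 2) * exp(-d2 / 2) # C_D det(Sigma)^(-1/2) g(d2), D = 2
dmvnorm(x, mean = mu, sigma = Sigma)               # should give the same value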

1.1.2 Location operators and scatter functionals

As elliptical distributions do not necessarily admit first and second moments, it is necessary to define generalized notions of expectation (location) and variance (scatter).

Definition 1.2 A location \(m\) is an operator from a subspace of random variables on \(\mathbb{R}^D\) to \(\mathbb{R}^D\) which verifies the following two properties of the usual expectation \(\mathbb{E}\), for all random variables \(\mathbf{X}\) and \(\mathbf{Y}\) in the considered subspace, every non-singular matrix \(\mathop{\mathrm{A}}\in \mathbb{R}^{D \times D}\) and every \(\mathbf{b}\in \mathbb{R}^D\):

  1. if \(\mathbf{X}\sim \mathbf{Y}\), \(m\left[\mathbf{X}\right] = m\left[\mathbf{Y}\right]\) (dependent on the distribution only);

  2. \(m\left[\mathop{\mathrm{A}}\mathbf{X}+ \mathbf{b}\right] = \mathop{\mathrm{A}}m\left[\mathbf{X}\right] + \mathbf{b}\) (affine equivariant).

A scatter \(S\) is a functional from a subspace of random variables on \(\mathbb{R}^D\) to the convex cone of positive-definite symmetric matrices of \(\mathbb{R}^{D \times D}\) which verifies the following two properties of the usual (co-)variance \(\mathbb{V}\), for all random variables \(\mathbf{X}\) and \(\mathbf{Y}\) in the considered subspace, every non-singular matrix \(\mathop{\mathrm{A}}\in \mathbb{R}^{D \times D}\) and every \(\mathbf{b}\in \mathbb{R}^D\):

  1. if \(\mathbf{X}\sim \mathbf{Y}\), \(S\left[\mathbf{X}\right] = S\left[\mathbf{Y}\right]\) (dependent on the distribution only);

  2. \(S\left[\mathop{\mathrm{A}}\mathbf{X}+ \mathbf{b}\right] = \mathop{\mathrm{A}}S\left[\mathbf{X}\right] \mathop{\mathrm{A}}^\top\) (affine equivariant).

Naturally, the subspace of elliptical distributions is stable under any affine transformation of \(\mathbb{R}^D\) and on this space, the vector parameter \(\mu\) defines a location operator while the matrix parameter \(\Sigma\) defines a scatter functional. There is a converse result (cited in (Tyler et al. 2009)):

Proposition 1.2 A location operator \(m\) and a scatter functional \(S\) defined on a subspace containing the elliptically distributed random variables on \(\mathbb{R}^D\), have to verify for every elliptically distributed random variable \(\mathbf{X}\): \[m\left[\mathbf{X}\right] = \mu\left[\mathbf{X}\right] \text{ and } S\left[\mathbf{X}\right] = \lambda \, \Sigma\left[\mathbf{X}\right]\] where \(\lambda \in \mathbb{R}_+\) is a constant depending on the distribution function of \(\mathbf{X}\).

Proof. Let \(\mathbf{X}\) be a \(\left(\mu, \Sigma\right)\)-elliptically distributed random variable on \(\mathbb{R}^D\), and \(\mathbf{Z}\) a random variable as in Definition 1.1. Then if \(\mathop{\mathrm{Q}}\) is an orthogonal matrix:

  • \(m\left[\mathbf{Z}\right] = m\left[\mathop{\mathrm{Q}}\mathbf{Z}\right] = \mathop{\mathrm{Q}}\, m\left[\mathbf{Z}\right]\) i.e. \(m\left[\mathbf{Z}\right] = 0\) (for \(\mathop{\mathrm{Q}}= -\mathop{\mathrm{I}}_D\)), so \(m\left[\mathbf{X}\right] = m\left[\Sigma^\frac{1}{2} \mathbf{Z}+ \mu\right] = \mu\).

  • \(S\left[\mathbf{Z}\right] = S\left[\mathop{\mathrm{Q}}\, \mathbf{Z}\right] = \mathop{\mathrm{Q}}S\left[\mathbf{Z}\right] \mathop{\mathrm{Q}}^\top\) i.e. \(S\left[\mathbf{Z}\right] \mathop{\mathrm{Q}}= \mathop{\mathrm{Q}}S\left[\mathbf{Z}\right]\) i.e. \(S\left[\mathbf{Z}\right] = \lambda \, \mathop{\mathrm{I}}_D\) for a \(\lambda \in \mathbb{R}_+\) (by showing for instance that \(S\left[\mathbf{Z}\right]\) must stabilize every line), so \(S\left[\mathbf{X}\right] = S\left[\Sigma^\frac{1}{2} \mathbf{Z}+ \mu\right] = \Sigma^\frac{1}{2} \left(\lambda \, \mathop{\mathrm{I}}_D\right) \left(\Sigma^\frac{1}{2}\right)^\top = \lambda \, \Sigma\).

Remark. Note that for a given elliptical distribution (generally the Gaussian distribution) and a given scatter functional, the constant \(\lambda\) can be calculated and used to rescale the scatter functional so that it equals the matrix parameter \(\Sigma\) (see the example below).

An important scatter functional, other than \(\mathbb{V}\), is the scatter matrix of fourth order moments: \[\mathbb{V}_4\left[\mathbf{X}\right] = \tfrac{1}{D+2} \mathbb{E}\left[\left(\mathbf{X}- \mathbb{E}\left[\mathbf{X}\right]\right) \left(\mathbf{X}- \mathbb{E}\left[\mathbf{X}\right]\right)^\top {\mathbb{V}\left[\mathbf{X}\right]}^{-1} \left(\mathbf{X}- \mathbb{E}\left[\mathbf{X}\right]\right) \left(\mathbf{X}- \mathbb{E}\left[\mathbf{X}\right]\right)^\top\right]\] which is defined (for instance by Archimbaud, Nordhausen, and Ruiz-Gazen (2018a)) for random variables admitting the first four moments, and will be particularly useful for the ICS outlier detection method.
Note that the constant \(D+2\) has been calculated in order to verify that \(\mathbb{V}_4\left[\mathbf{X}\right] = \Sigma\left[\mathbf{X}\right]\) for a Gaussian distribution.
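
To make the definition concrete, here is a minimal sketch of the sample version of \(\mathbb{V}_4\) computed directly from the formula (the helper cov4_manual is hypothetical), alongside the cov4 function of the ICS package which implements the same estimator; for a Gaussian sample such as elliptical above, both should be close to the sample covariance matrix.

Code
# Sample fourth-order scatter matrix from its definition (sketch)
library(ICS)

cov4_manual <- function(X) {
  X <- as.matrix(X)
  Xc <- sweep(X, 2, colMeans(X))            # centred data
  d2 <- mahalanobis(X, colMeans(X), cov(X)) # squared Mahalanobis distances
  crossprod(Xc * sqrt(d2)) / ((ncol(X) + 2) * nrow(X))
}

cov4_manual(elliptical)     # close to cov(elliptical), up to the n vs n - 1 convention
cov4(as.matrix(elliptical)) # ICS package version of the same estimator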

As usual in statistics, we consider a sample \(\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\) which is a set of independent and identically distributed random variables. We work conditionally on the observations \(\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\), which in that case amounts to treating them as constants. We will use estimated location operators \(\hat{m}\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\) and estimated scatter functionals \(\hat{S}\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\) which can be defined from the particular case of Definition 1.2 when the random variables \(\mathbf{X}\) and \(\mathbf{Y}\) are discrete (their probability distribution function is the empirical distribution function of the observations).

1.1.3 Whitening matrices and Mahalanobis distance

Code
plot_data(BDDSegXvol, col = "Blues")
Figure 1.3: BDDSegX data set (volumes), original.
Code
plot_data(whiten(BDDSegXvol), col = "Blues")
Figure 1.4: BDDSegX data set (volumes), whitened.

A first method to detect outliers is to center and whiten the data, which means applying a non-singular matrix \(\hat{\mathop{\mathrm{W}}}\) such that the transformed data set: \[\left(\begin{array}{c|c|c} \mathbf{Z}_1 & \dots & \mathbf{Z}_n \end{array}\right) = \hat{\mathop{\mathrm{W}}} \, \left(\begin{array}{c|c|c} \mathbf{Y}_1 & \dots & \mathbf{Y}_n \end{array}\right)\] (where \(\mathbf{Y}_j = \mathbf{X}_j - \hat{m}\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right), 1 \leq j \leq n\)) has an estimated location \(\hat{m}\left(\mathbf{Z}_1, \dots, \mathbf{Z}_n\right) = 0\) and an estimated scatter \(\hat{S}\left(\mathbf{Z}_1, \dots, \mathbf{Z}_n\right) = \mathop{\mathrm{I}}_D\) (standardized and decorrelated).
If \(\hat{\mathop{\mathrm{W}}}\) is a positive-definite matrix, applying this transformation corresponds to a change of inner product and of Euclidean norm, meaning for \(\mathbf{x}\) in \(\mathbb{R}^D\): \[\left\|\hat{\mathop{\mathrm{W}}} \mathbf{x}\right\| = \sqrt{\left(\hat{\mathop{\mathrm{W}}} \mathbf{x}\right)^\top \left(\hat{\mathop{\mathrm{W}}} \mathbf{x}\right)} = \sqrt{\mathbf{x}^\top \hat{\mathop{\mathrm{W}}}^2 \mathbf{x}}\]

The most common choice is \(\hat{\mathop{\mathrm{W}}} = {\hat{\mathop{\mathrm{S}}}}^{-\frac{1}{2}}\) (the inverse square root of a scatter functional evaluated on the observations) which corresponds to what we will call the ‘Mahalanobis transformation’. The distance associated to this transformation is called the ‘Mahalanobis distance’ (from the work of Mahalanobis (1936)).
This transformation can be applied to an absolutely continuous mixture model of elliptical distributions, for which the outliers will be the observations \(\mathbf{X}_j, 1 \leq j \leq n\) such that: \[\left\|\mathbf{Z}_j\right\| = \left\|{\hat{\mathop{\mathrm{S}}}}^{-\frac{1}{2}} \left(\mathbf{X}_j - \hat{m}\right)\right\| \geq Q\left(p\right)\] where \(Q\) is the quantile function of the uncontaminated elliptical distribution \(X_0\) (i.e. the generalized inverse of the cumulative distribution function associated with the function \(g\) defined in Proposition 1.1) and \(p \in \left(0,1\right)\) is the level of confidence required.
When second order moments exist, we can choose for \(\hat{\mathop{\mathrm{S}}}\) the sample (co-)variance matrix \(\hat{\Sigma}\) and for \(\hat{m}\) the sample mean. This allows an interpretation of the newly defined variables \(\left(\mathbf{Z}_1, \dots, \mathbf{Z}_n\right)\) by computing the (co-)variance matrix with the original variables \(\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\): \[\hat{\Sigma}_{\left(\mathbf{Z}_i,\mathbf{X}_j\right)} = \hat{\mathop{\mathrm{W}}}^{-1} = \hat{\Sigma}^{\frac{1}{2}} \, \text{ (since } \hat{\Sigma}\left(\mathbf{Z}_1, \dots, \mathbf{Z}_n\right) = \mathop{\mathrm{I}}_D \text{)}\] or the correlation matrix: \[\hat{\rho}_{\left(\mathbf{Z}_i,\mathbf{X}_j\right)} = \left( \frac{\left(\hat{\mathop{\mathrm{W}}}^{-1}\right)_{i,j}}{\sqrt{\sum_{1 \leq k \leq D} \left({\hat{\mathop{\mathrm{W}}}}^{-1}\right)_{k,j}^2}} \right)\]
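
As an illustration of this criterion, here is a minimal sketch in base R applied to the simulated mixture of Figure 1.1, assuming the uncontaminated part is Gaussian so that the squared Mahalanobis distance can be compared with a \(\chi^2_D\) quantile.

Code
# Mahalanobis-based outlier flagging (sketch): squared distances to the sample mean
# in the metric of the sample covariance matrix
X <- as.matrix(mixture)
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
outliers <- which(d2 >= qchisq(0.975, df = ncol(X))) # confidence level p = 0.975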

Example 1.3 Figure 1.3 and Figure 1.4 give matrix scatter plots for the BDDSegX automotive market example. The original volumes \(\mathtt{V_1}, \dots, \mathtt{V_5}\) are plotted on the upper graph and exhibit some correlations, while the whitened data \(\mathtt{Z_1}, \dots, \mathtt{Z_5}\) can be found on the lower graph and have a zero mean vector and a sample (co-)variance matrix equal to the identity.
Note that the dark blue corresponds to the beginning of the period (2003) and becomes lighter as the years go by in order to visualize also the temporal aspect of the data.

As the outliers are isolated due to their large deviation with respect to the estimated location, two problems emerge.

  1. The first problem is the impact of the outliers on the computation of the estimated location and scatter (and of the quantile function of the unknown uncontaminated distribution). This problem can be solved by using robust estimators (see (Tyler et al. 2009)) but is not considered further in the present report.

  2. The second problem is that this method implicitly assumes that outliers are the observations deviating most from the average. If outliers were instead defined according to another model, such as the elliptical mixture model, this method could produce false positives (observations that were part of the uncontaminated distribution are detected as outliers) or false negatives (observations that were part of the outlier clouds are not detected).

1.1.4 The Invariant Coordinate Selection method

The general idea of ICS is to exploit the differences between two scatter operators evaluated at \(\mathbf{X}\) to deduce a lack of ellipticity of the contaminated distribution, and then to find projection directions that help highlight the outliers.

Code
plot_data(
  list(elliptical, mixture),
  grid = TRUE,
  cov_ellipse = TRUE,
  cov4_ellipse = TRUE,
  width = 8,
  height = 3.5
)
Figure 1.5: Computation of \(\mathop{\mathrm{COV}}\) (blue) and \(\mathop{\mathrm{COV}}_4\) (red) ellipses for an elliptical (left) and a non-elliptical (right) distribution.

Figure 1.5 illustrates the fact that for a Gaussian distribution the two scatter matrices \(\mathbb{V}\) and \(\mathbb{V}_4\) are the same (left panel). But when considering a contaminated distribution, such as the one of Figure 1.1, the two scatter matrices differ as can be noticed on the right panel of Figure 1.5 where the ellipse associated with \(\mathbb{V}_4\) (in red) is elongated in the area where the outliers lie, compared to the ellipse associated with \(\mathbb{V}\) (in blue).
For distribution functions admitting the first four moments, the most common choice as the scatter pair is indeed \(\left(\mathbb{V}, \mathbb{V}_4\right)\), which corresponds to the Fourth Order Blind Identification (FOBI) algorithm known since 1989 in the Blind Source Separation literature (see (Cardoso 1989)).
Generally speaking, comparing two estimated scatter matrices: \[\left(\hat{\mathop{\mathrm{S}}}_1, \hat{\mathop{\mathrm{S}}}_2\right) = \left(\hat{S}_1\left(\mathbf{X}_1, \dots, \mathbf{X}_n\right), \hat{S}_2 \left(\mathbf{X}_1, \dots, \mathbf{X}_n\right)\right)\] is done by simultaneous diagonalization, which is closely related to the diagonalization of \({\hat{\mathop{\mathrm{S}}}_1}^{-1} \hat{\mathop{\mathrm{S}}}_2\) (as we will see at the end of this paragraph).
In order to do this, the natural idea is to symmetrize \({\hat{\mathop{\mathrm{S}}}_1}^{-1} \hat{\mathop{\mathrm{S}}}_2\) through conjugation by \({\hat{\mathop{\mathrm{S}}}_1}^{-\frac{1}{2}}\) which results in the similar positive-definite matrix \(\mathop{\mathrm{A}}= {\hat{\mathop{\mathrm{S}}}_1}^{-\frac{1}{2}} \hat{\mathop{\mathrm{S}}}_2 \, {\hat{\mathop{\mathrm{S}}}_1}^{-\frac{1}{2}}\) that can be diagonalized as usual: \[\mathop{\mathrm{Q}}^\top \mathop{\mathrm{A}}\, \mathop{\mathrm{Q}}= \Delta\] where \(\mathop{\mathrm{Q}}\) is an orthogonal matrix and \(\Delta\) is diagonal with decreasing eigenvalues \(\left(\rho_1, \dots, \rho_D\right)\), so that when defining the non-singular matrix \(\mathop{\mathrm{H}}= {\hat{\mathop{\mathrm{S}}}_1}^{-\frac{1}{2}} \mathop{\mathrm{Q}}\): \[\mathop{\mathrm{H}}^\top \hat{\mathop{\mathrm{S}}}_2 \mathop{\mathrm{H}}= \Delta \text{ and } \mathop{\mathrm{H}}^\top \hat{\mathop{\mathrm{S}}}_1 \mathop{\mathrm{H}}= \mathop{\mathrm{I}}_D\]

Remark. Since \(\mathop{\mathrm{H}}\) is not orthogonal, this is not a proper diagonalization of matrices, but must be understood in terms of bilinear forms associated to the matrices \(\hat{\mathop{\mathrm{S}}}_1\) and \({\hat{\mathop{\mathrm{S}}}_2}\): this simultaneous diagonalization can be reinterpreted as replacing the canonical inner product by that associated to \(\hat{\mathop{\mathrm{S}}}_1\) (for which \(\mathop{\mathrm{H}}\) is now orthogonal and \({\hat{\mathop{\mathrm{S}}}}_2\) is still positive-definite).

What is gained compared to the Mahalanobis transformation is that after performing the change of inner product according to a scatter operator, another scatter is diagonalized according to the new inner product.
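
To make the construction concrete, the following sketch carries out the whole simultaneous diagonalization in base R on the simulated mixture, with a hand-coded fourth-order scatter; the ics2 function used in the next paragraph wraps the same steps.

Code
# Simultaneous diagonalization of the scatter pair (COV, COV4) by hand (sketch)
X <- as.matrix(mixture)
Xc <- sweep(X, 2, colMeans(X))                             # centred data
S1 <- cov(X)                                               # first scatter
d2 <- mahalanobis(X, colMeans(X), S1)
S2 <- crossprod(Xc * sqrt(d2)) / ((ncol(X) + 2) * nrow(X)) # fourth-order scatter

e1 <- eigen(S1, symmetric = TRUE)
S1_inv_sqrt <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)
A <- S1_inv_sqrt %*% S2 %*% S1_inv_sqrt                    # symmetrized S1^{-1} S2
eA <- eigen(A, symmetric = TRUE)
rho <- eA$values                                           # decreasing eigenvalues
H <- S1_inv_sqrt %*% eA$vectors                            # H' S1 H = I, H' S2 H = Delta
Z <- Xc %*% H                                              # invariant coordinates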

1.1.4.1 Computing the Invariant Coordinates

Code
# ics2() from the ICS package compares two scatter matrices (by default the regular
# covariance and the scatter matrix of fourth-order moments, i.e. the FOBI pair)
.BDDSegXvol_ICS <- ics2(BDDSegXvol)

BDDSegXvol_ICS_scores <- .BDDSegXvol_ICS@Scores # invariant coordinates

plot_data(BDDSegXvol_ICS_scores, col = "Blues")
Figure 1.6: Invariant Coordinates of the BDDSegX data set (volumes), and the correlation matrix of the original data with the transformed data.

The change of \(\hat{\mathop{\mathrm{S}}}_1\)-orthogonal basis: \[\left(\begin{array}{c|c|c} \mathbf{Z}_1 & \dots & \mathbf{Z}_n \end{array}\right) = \mathop{\mathrm{H}}^\top \left(\begin{array}{c|c|c} \mathbf{X}_1 - \hat{m} & \dots & \mathbf{X}_n - \hat{m} \end{array}\right)\] is the first step of the method, and the coordinates of the \(\mathbf{Z}_j, 1 \leq j \leq n\) are called ‘Invariant Coordinates’ because, even though \(\mathop{\mathrm{H}}\) is not unique (as eigenvectors are not unique), once a way of computing one solution \(\mathop{\mathrm{H}}\) has been chosen, the transformed data set is affine invariant up to the signs of the \(\mathbf{Z}_j, 1 \leq j \leq n\).

1.1.4.2 Selecting the Principal Components

Code
# Keep the first two invariant coordinates; the @gKurt slot contains the generalized
# kurtosis values, i.e. the eigenvalues (rho_1, ..., rho_D)
BDDSegXvol_ICS_select <- BDDSegXvol_ICS_scores[c(1, 2)]
BDDSegX_vol_ICS_eigenvalues <- t(data.frame(.BDDSegXvol_ICS@gKurt))
colnames(BDDSegX_vol_ICS_eigenvalues) <- colnames(BDDSegXvol_ICS_scores)

plot_data(BDDSegXvol_ICS_select,
  grid = TRUE,
  col = "Blues"
)
Figure 1.7: Eigenvalues of \({\mathbb{V}\left[\mathbf{X}\right]}^{-1} \mathbb{V}_4\left[\mathbf{X}\right]\) and projection of the BDDSegX data set (volumes) on the first two invariant coordinates \(\mathtt{IC.1}\) and \(\mathtt{IC.2}\).

According to the values of the spectrum \(\left(\rho_1, \dots, \rho_D\right)\) of \({\hat{\mathop{\mathrm{S}}}_1}^{-1} \hat{\mathop{\mathrm{S}}}_2\), we determine the components in which to look for outliers.
There are several ways to proceed, but the general idea is that since the spectrum is a measure of kurtosis of the distribution in various directions, we must keep the directions where the eigenvalues are substantially different from 1 (which means the distribution departs from an elliptical distribution in these directions). Eigenvalues larger than 1 are usually associated with invariant coordinates that help to detect small clusters of outliers, while eigenvalues smaller than 1 are associated with large clusters of outliers (see Theorem 3 from (Tyler et al. 2009) for a theoretical justification of the method under a particular Gaussian mixture model).
The ICSOutlier R package by Archimbaud, Nordhausen, and Ruiz-Gazen (2018b) suggests using normality tests such as D’Agostino’s tests, which are based on estimation of the skewness and the kurtosis of the invariant components.
This is the step where the ICS method differs from the method presented in Section 1.1.3: without the projection onto a strict subspace, the next step would be identical to the Mahalanobis criterion for outlier detection.
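
For illustration, a possible call with the ICSOutlier package (a sketch, assuming the .BDDSegXvol_ICS object computed above) is given below.

Code
# D'Agostino normality tests on each invariant coordinate (sketch); the components
# whose test rejects normality are kept for the outlier detection step
library(ICSOutlier)
.BDDSegXvol_ICS_comp <- comp.norm.test(.BDDSegXvol_ICS, test = "agostino.test", level = 0.05)
.BDDSegXvol_ICS_comp$index # indices of the selected invariant coordinates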

1.1.4.3 Detecting the outliers

Code
# ics.outlier() from the ICSOutlier package selects the components and computes the
# ICS distances; the @ics.distances slot holds one distance per observation
.BDDSegXvol_ICS_outliers <- ics.outlier(.BDDSegXvol_ICS)@ics.distances

BDDSegXvol_ICS_distances <- data.frame(
  Date = as.Date(BDDSegX$Date),
  `ICS distances` = .BDDSegXvol_ICS_outliers,
  check.names = FALSE
)

plot_data(
  BDDSegXvol_ICS_distances,
  grid = TRUE,
  col = "Blues",
  asp = 0
)
Figure 1.8: Outliers in selected Invariant Coordinates of the BDDSegX data set (volumes).

Just as for the Mahalanobis distance, outliers can be detected by computing the ICS distances (i.e. the norms of the transformed observations).

Example 1.4 Let us briefly illustrate the ICS procedure on the data set BDDSegX. On Figure 1.6, we can see that some observations look quite far from the main bulk of the data on the first component \(\mathtt{IC.1}\) (see also the zoomed-in Figure 1.7), and this component corresponds to an eigenvalue larger than 1. Plotting the ICS distances calculated with \(\mathtt{IC.1}\) leads to Figure 1.8 where outliers are all identified around 2010. The correlations of \(\mathtt{IC.1}\) with the original volumes are given on Figure 1.6: \(\mathtt{IC.1}\) is positively correlated with the small volumes \(\mathtt{V_A}\) and \(\mathtt{V_B}\), meaning that because of the economic crisis of 2008, people have been more inclined to buy small cars.

1.2 Compositional data analysis

This section summarises the basics of compositional linear algebra and data analysis, as presented in (Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado 2015).

1.2.1 Simplex of compositions, structure and isometries

Definition 1.3 (Simplex \(\mathcal{S}^D\)) We define \[\mathcal{S}^D = \left\{\left(x_1, \dots, x_D\right)^\top \in {\mathbb{R}_+^*}^D, \, \sum_{1\leq i \leq D} x_i = 1\right\}\] as the simplex in dimension \(D\), and its elements are called compositions.

Code
# acomp() from the compositions package: the neutral composition (1/3, 1/3, 1/3)
.simplex <- acomp(data.frame(matrix(1, 1, 3)))
names(.simplex) <- c("X_1", "X_2", "X_3")
plot_data(.simplex,
  grid = TRUE,
  width = 3,
  height = 3
)
Figure 1.9: The triangle \(\mathcal{S}^3\).

It is equipped with the following structure, for \(\mathbf{x}\) and \(\mathbf{y}\) in \(\mathcal{S}^D\) and \(\alpha\) in \(\mathbb{R}\):

  • \(\mathbf{x}\oplus \mathbf{y}= \mathcal{C}\left(\left(x_1 y_1, \dots, x_D y_D\right)^\top\right)\) (Aitchison perturbation)

  • \(\alpha \odot \mathbf{x}= \mathcal{C}\left(\left({x_1}^\alpha, \dots, {x_D}^\alpha\right)^\top\right)\) (Aitchison powering)

  • \(\langle \mathbf{x},\mathbf{y} \rangle_a = \frac{1}{D} \sum_{1\leq i<j\leq D} \ln\left(\frac{x_i}{x_j}\right) \, \ln\left(\frac{y_i}{y_j}\right)\) (Aitchison inner product)

where \(\mathcal{C}\) denotes the closure operator, which maps a \(\mathbf{z}\in {\mathbb{R}_+^*}^D\) to the unique composition which is collinear to \(\mathbf{z}\).
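
On acomp objects from the compositions package (already used for Figure 1.9), these operations are available through the usual arithmetic operators; the short sketch below illustrates them, computing the inner product through the \(\mathop{\mathrm{clr}}\) transformation of Theorem 1.1 below.

Code
# Aitchison operations on acomp objects (compositions package)
library(compositions)
x <- acomp(c(1, 2, 3))
y <- acomp(c(2, 2, 1))
x + y                                  # perturbation of x by y
2 * x                                  # powering of x by 2
sum(unclass(clr(x)) * unclass(clr(y))) # Aitchison inner product of x and y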

Due to the interpretation of compositions’ coordinates as ratios of a whole, the classical metric is not the most relevant. That explains why we define these particular operations, and use a logarithmic transformation to measure distances.
The next theorem is fundamental because it provides two kinds of isometries between the simplex and Euclidean spaces of dimension \(D-1\) that we will be using extensively; it was proved in Chapter 4 of the book by Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015).

Theorem 1.1 The simplex \(\mathcal{S}^D\) with Aitchison’s structure is a Euclidean space of dimension \(D-1\); it is therefore isometric to any Euclidean space of dimension \(D-1\), in particular through the following two isometries, respectively the ‘centered logratio’ and the ‘isometric logratio’:

  • \(\mathop{\mathrm{clr}}: \begin{array}{rcl} \mathcal{S}^D & \to & \mathop{\mathrm{\mathcal{H}}}= \ker\left\{\mathbf{x}\mapsto \sum_{i=1}^{i=D} x_i\right\} \\ \mathbf{x}& \mapsto & \left(\ln\left(\frac{x_1}{g(\mathbf{x})}\right), \dots, \ln\left(\frac{x_D}{g(\mathbf{x})}\right)\right)^\top \end{array}\), where \(g: \mathbf{x}\mapsto \sqrt[D]{\prod_{i=1}^{i=D} x_i}\)

  • \(\mathop{\mathrm{ilr}}_\mathcal{B}: \begin{array}{rcl} \mathcal{S}^D & \to & \mathbb{R}^{D-1} \\ \mathbf{x}= \sum_{i=1}^{i=D-1} {x_i}^* \mathbf{e}_i & \mapsto & \mathbf{x}^* = \left({x_1}^*, \dots, {x_{D-1}}^*\right) \end{array}\), where \(\mathcal{B}= \left(\mathbf{e}_1,\dots, \mathbf{e}_{D-1}\right)\) is a given orthonormal basis of \(\mathcal{S}^D\).

The \(\mathop{\mathrm{clr}}\) isometry is canonical in the sense that it does not require constructing an orthonormal basis (the inner product on \(\mathcal{S}^D\) has been designed to match the \(\mathop{\mathrm{clr}}\) transformation), but its image is the hyperplane \(\mathop{\mathrm{\mathcal{H}}}\) of \(\mathbb{R}^D\) rather than a full Euclidean space \(\mathbb{R}^{D-1}\).
On the contrary, the \(\mathop{\mathrm{ilr}}_\mathcal{B}\) isometries are indeed bijective from \(\mathcal{S}^D\) to \(\mathbb{R}^{D-1}\), but they are not canonical since there is no natural choice for an orthonormal basis of \(\mathcal{S}^D\).
We must then verify that every construction we make using the \(\mathop{\mathrm{ilr}}_\mathcal{B}\) transformation is independent of the basis \(\mathcal{B}\) and also meaningful when considered directly on the simplex. The contrast matrices will be a helpful tool, especially to prove this kind of result:

Definition 1.4 For a basis \(\mathcal{B}= \left(\mathbf{e}_1, \dots, \mathbf{e}_{D-1}\right)\) of the simplex \(\mathcal{S}^D\), we define the contrast matrix \(\mathop{\mathrm{V}}_\mathcal{B}\) as the matrix associated to the linear mapping \(\mathop{\mathrm{clr}}\) from basis \(\mathcal{B}\) to the canonical basis of \(\mathbb{R}^D\), i.e. the matrix whose columns are the \(\mathop{\mathrm{clr}}\left(\mathbf{e}_i\right), 1 \leq i \leq D-1\).

Proposition 1.3 Let \(\mathop{\mathrm{V}}\) be the contrast matrix of an orthonormal basis \(\mathcal{B}\).

  • \(\mathop{\mathrm{V}}\) has dimensions \(D \times \left(D-1\right)\) and its rank is \(D-1\).

  • We have: \(\mathop{\mathrm{V}}^\top \mathop{\mathrm{V}}= \mathop{\mathrm{I}}_{D-1}\) and \(\mathop{\mathrm{V}}\mathop{\mathrm{V}}^\top = \mathop{\mathrm{I}}_D - \frac{1}{D} \mathbf{1}_D {\mathbf{1}_D}^\top\), where \(\mathbf{1}_D = \left(1, \dots, 1\right)^\top \in \mathbb{R}^D\).

  • For all \(\mathbf{x}\) in \(\mathcal{S}^D\), we have: \(\mathop{\mathrm{clr}}\left(\mathbf{x}\right) = \mathop{\mathrm{V}}\, \mathop{\mathrm{ilr}}\left(\mathbf{x}\right)\) and \(\mathop{\mathrm{ilr}}\left(\mathbf{x}\right) = \mathop{\mathrm{V}}^\top \mathop{\mathrm{clr}}\left(\mathbf{x}\right)\).
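
These identities can be checked numerically with the compositions package, whose ilrBase function returns the contrast matrix of its default orthonormal basis; a sketch for \(D = 3\):

Code
# Numerical check of Proposition 1.3 for D = 3 (sketch)
library(compositions)
V <- ilrBase(D = 3)                          # contrast matrix of the default ilr basis
x <- acomp(c(1, 2, 3))
ilr_x <- as.numeric(ilr(x))
clr_x <- as.numeric(clr(x))

t(V) %*% V                                   # equals I_{D-1}
V %*% t(V)                                   # equals I_D - (1/D) 1 1'
all.equal(as.numeric(V %*% ilr_x), clr_x)    # clr(x) = V ilr(x)
all.equal(as.numeric(t(V) %*% clr_x), ilr_x) # ilr(x) = V' clr(x)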

Note that there also exists another transformation called \(\mathop{\mathrm{alr}}\):

Definition 1.5 \[\mathop{\mathrm{alr}}: \begin{array}{rcl} \mathcal{S}^D & \to & \mathbb{R}^{D-1} \\ \mathbf{x}& \mapsto & \left(\ln\left(\frac{x_1}{x_D}\right), \dots, \ln\left(\frac{x_{D-1}}{x_D}\right)\right)^\top \end{array}\]

We define the \(\left(D-1\right) \times D\)-matrix \(\mathop{\mathrm{F}}\) and the \(D \times \left(D-1\right)\)-matrix \(\mathop{\mathrm{K}}\) by: \[\mathop{\mathrm{F}}= \left(\begin{array}{c|c} \mathop{\mathrm{I}}_{D-1} & -\mathbf{1}_{D-1} \end{array}\right) \text{ and } \mathop{\mathrm{K}}= \left( \begin{array}{c} \mathop{\mathrm{I}}_{D-1} - \frac{1}{D} \mathbf{1}_{D-1} {\mathbf{1}_{D-1}}^\top \\[6pt] -\frac{1}{D} {\mathbf{1}_{D-1}}^\top \end{array}\right)\]

Proposition 1.4  

  • We have: \(\mathop{\mathrm{F}}\mathop{\mathrm{K}}= \mathop{\mathrm{I}}_{D-1}\) and \(\mathop{\mathrm{K}}\mathop{\mathrm{F}}= \mathop{\mathrm{G}}_D = \mathop{\mathrm{I}}_{D} - \frac{1}{D} \mathbf{1}_{D} {\mathbf{1}_{D}}^\top\).

  • For all \(\mathbf{x}\) in \(\mathcal{S}^D\), we have: \(\mathop{\mathrm{alr}}\left(\mathbf{x}\right) = \mathop{\mathrm{F}}\, \mathop{\mathrm{clr}}\left(\mathbf{x}\right) \text{ and } \mathop{\mathrm{clr}}\left(\mathbf{x}\right) = \mathop{\mathrm{K}}\, \mathop{\mathrm{alr}}\left(\mathbf{x}\right)\).
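
A quick numerical check of these identities for \(D = 3\) in base R (F_mat and K_mat below are simply the matrices \(\mathop{\mathrm{F}}\) and \(\mathop{\mathrm{K}}\) written out, renamed to avoid masking R’s built-in F):

Code
# Numerical check of Proposition 1.4 for D = 3 (sketch)
D <- 3
F_mat <- cbind(diag(D - 1), -rep(1, D - 1))    # (D-1) x D
K_mat <- rbind(diag(D - 1) - matrix(1 / D, D - 1, D - 1),
               -rep(1 / D, D - 1))             # D x (D-1)
F_mat %*% K_mat                                # equals I_{D-1}
K_mat %*% F_mat                                # equals G_D = I_D - (1/D) 1 1'

x <- c(1, 2, 3) / 6                            # a composition of S^3
clr_x <- log(x) - mean(log(x))
alr_x <- log(x[-D] / x[D])
all.equal(as.numeric(F_mat %*% clr_x), alr_x)  # alr(x) = F clr(x)
all.equal(as.numeric(K_mat %*% alr_x), clr_x)  # clr(x) = K alr(x)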

The \(\mathop{\mathrm{alr}}\) transformation is a linear mapping, but it is neither an isometry nor symmetric in the coordinates. For these reasons, we will not be able to use it in this report. However, it is important to note that it is the easiest one to compute.

1.2.2 Matrix-vector product on the simplex

The Invariant Coordinate Selection method is based on a linear transformation of the data set, so for our purposes we need to study the basics of linear algebra on the simplex.
As usual, in order to get a matrix description of linear mappings on a vector space we first need to define a matrix-vector product.
But in the case of the simplex, this leads naturally to defining two matrix-vector products, since when adapting the classical product to simplex operations, we lose the commutative property of the multiplication. In this report, we will only use the right-side product, which is defined for a \(D \times D\)-matrix \(\mathop{\mathrm{A}}\) and a vector \(\mathbf{x}\in\mathcal{S}^D\) by: \[\mathop{\mathrm{A}}\boxdot\, \mathbf{x}= \mathcal{C}\left(\left(\prod_{j = 1}^{j = D} x_j^ {a_{1,j}}, \dots, \prod_{j = 1}^{j = D} x_j^ {a_{D,j}}\right)^\top\right)\]

The relevance of this new product is given by the following proposition, which is Proposition 11.3.9 in (Egozcue et al. 2011).

Proposition 1.5 For every \(\mathop{\mathrm{A}}\in \mathbb{R}^{D \times D}\) and \(\mathbf{x}\in \mathcal{S}^D\) : \[\mathop{\mathrm{clr}}\left(\mathop{\mathrm{A}}\boxdot \, \mathbf{x}\right) = \mathop{\mathrm{A}}\mathop{\mathrm{clr}}\left(\mathbf{x}\right)\]

This compatibility with the \(\mathop{\mathrm{clr}}\) isometry explains why the compositional matrix-vector product \(\boxdot\) will be useful to describe the algebra \(\mathcal{L}\left(\mathcal{S}^D\right)\) of linear mappings of the simplex \(\mathcal{S}^D\), provided that it is linear (which, as explained in (Egozcue et al. 2011) is not the case for every matrix \(\mathop{\mathrm{A}}\in \mathbb{R}^{D \times D}\)). We will come back to this in Section 2.1.
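
As a concrete sketch, the right product can be implemented directly from its definition and checked against Proposition 1.5 (the helper simplex_prod is hypothetical, and the matrix below is chosen with equal row and column sums, which is enough for the check to hold exactly here; see also the linearity caveat above):

Code
# Right matrix-vector product on the simplex from its definition (sketch)
library(compositions)

simplex_prod <- function(A, x) {
  x <- as.numeric(x)
  acomp(apply(A, 1, function(a) prod(x^a))) # closure of (prod_j x_j^(a_ij))_i
}

A <- rbind(c(2, 0, 1),
           c(0, 2, 1),
           c(1, 1, 1))
x <- acomp(c(1, 2, 3))

all.equal(as.numeric(clr(simplex_prod(A, x))),
          as.numeric(A %*% as.numeric(clr(x)))) # clr(A x) = A clr(x)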

1.2.3 Statistics on the simplex

Through an \(\mathop{\mathrm{ilr}}\) isometry, we can transport the Borel \(\sigma\)-field of \(\mathbb{R}^{D-1}\) and the Lebesgue measure to \(\mathcal{S}^D\), and thus construct random variables on the simplex (their distribution functions are defined as the push-forward measure of the distribution of \(\mathbf{X}^*\) on \(\mathbb{R}^{D-1}\)). Then we have to show that the constructed probability space and the random variables do not depend on the choice of isometry.

1.2.3.1 Centre & geometric mean

Definition 1.6 The centre of a random variable \(\mathbf{X}\) on \(\mathcal{S}^D\) is defined as: \[\mathbf{g}\left[\mathbf{X}\right] = \mathop{\mathrm{clr}}^{-1} \left(\mathbb{E}\left[\mathop{\mathrm{clr}}\left(\mathbf{X}\right)\right]\right) = \mathop{\mathrm{ilr}}_\mathcal{B}^{-1} \left(\mathbb{E}\left[\mathop{\mathrm{ilr}}_\mathcal{B}\left(\mathbf{X}\right)\right]\right)\] for any orthonormal basis \(\mathcal{B}\) of \(\mathcal{S}^D\), provided that it verifies \(\mathbb{E}\left[\left\|\mathop{\mathrm{clr}}\left(\mathbf{X}\right)\right\|\right] < \infty\).

A consistent estimator of \(\mathbf{g}\left[\mathbf{X}\right]\) is given by the geometric mean: \[\hat{\mathbf{g}} = \tfrac{1}{n} \odot \left(\mathbf{X}_1 \oplus \dots \oplus \mathbf{X}_n\right)\]
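
With the compositions package, the mean method for acomp objects should compute exactly this estimator; a small sketch on simulated data (the comp_data object is hypothetical):

Code
# The compositional mean is the closed column-wise geometric mean (sketch)
library(compositions)
comp_data <- acomp(matrix(runif(30, 1, 10), ncol = 3)) # 10 simulated compositions
mean(comp_data)                                        # centre estimate
acomp(exp(colMeans(log(unclass(comp_data)))))          # closed geometric means: same result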

1.2.3.2 Notions of variance on the simplex

In this subsection, we summarise the common notions of variance that can be defined on the simplex.
The first object (defined in particular in the R package compositions by van den Boogaart, Tolosana-Delgado, and Bren (2021)) will be the most useful when applying the ICS method to compositional data:

  • The (co-)variance \(\mathop{\mathrm{clr}}\)-matrix: \[\mathbb{V}\left[\mathbf{X}\right] = \mathbb{V}\left[\mathop{\mathrm{clr}}\left(\mathbf{X}\right)\right] = \mathop{\mathrm{V}}_\mathcal{B}\, \mathbb{V}\left[\mathop{\mathrm{ilr}}_\mathcal{B}\left(\mathbf{X}\right)\right] \, \mathop{\mathrm{V}}_\mathcal{B}^\top \text{ for any orthonormal basis } \mathcal{B}\]

The following three objects are defined by Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015). We considered using them when adapting ICS to compositional data, but due to their lack of interpretation as linear mappings we preferred the (co-)variance \(\mathop{\mathrm{clr}}\)-matrix.

  • Aitchison’s variation matrix: \[\mathop{\mathrm{T}}= \left(\mathbb{V}\left[\ln\left(\frac{X_i}{X_j}\right)\right]\right)_{1\leq i,j\leq D}\]

  • The variability according to a direction \(\mathbf{z}\in \mathcal{S}^D\): \[\mathop{\mathrm{var}}\left[\mathbf{X},\mathbf{z}\right] = \mathbb{E}\left[{\left\|\mathbf{X}\ominus \mathbf{z}\right\|_a}^2\right]\]

  • The total variance: \[\mathop{\mathrm{totvar}}\left[\mathbf{X}\right] = \min_{\mathbf{z}\in \mathcal{S}^D} \mathop{\mathrm{var}}\left[\mathbf{X},\mathbf{z}\right] = \mathbb{E}\left[{\left\|\mathbf{X}\ominus \mathbf{g}\left(\mathbf{X}\right)\right\|_a}^2\right]\]
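
These quantities are available in the compositions package; the short sketch below computes them on simulated data (the conventions on \(n\) versus \(n-1\) may differ slightly between functions).

Code
# clr-covariance matrix, variation matrix and total variance (sketch)
library(compositions)
comp_data <- acomp(matrix(runif(30, 1, 10), ncol = 3)) # 10 simulated compositions
clr_cov <- var(unclass(clr(comp_data)))                # (co-)variance clr-matrix
variation(comp_data)                                   # Aitchison's variation matrix
mvar(comp_data)                                        # total (metric) variance
sum(diag(clr_cov))                                     # trace of the clr-covariance: the total variance again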