Outlier detection using ICS for compositional data
A multivariate data set in which each variable can be interpreted as a part of a whole is called a compositional data set. Such data are best viewed as lying in the simplex, whose dimension is one less than the number of parts and which is equipped with its own Euclidean structure.
Adapting a method such as Invariant Coordinate Selection (ICS) to compositional data is therefore an interesting challenge: ICS has proved useful on classical multivariate data sets for revealing hidden structures such as outliers or groups, but it cannot be applied as such on the simplex. This internship report summarises the main aspects of compositional data analysis and algebra, with the aim of adapting the ICS method to outlier detection for compositional data.
Acknowledgments: I would like to express my gratitude to my internship supervisor, Prof. Anne Ruiz-Gazen, for putting great effort into providing guidance whenever needed. I am also very grateful to Prof. Christine Thomas-Agnan and Thibault Laurent, with whom I worked throughout this internship.
Keywords: Invariant Coordinate Selection, compositions, outliers, multivariate data analysis, kurtosis, automotive market
Introduction
Context & goals
This report summarises the progress made during my research internship at TSE-R from May to August 2021, under the supervision of Prof. Anne Ruiz-Gazen and in collaboration with Prof. Christine Thomas-Agnan and Thibault Laurent.
The aim of my internship was to lay the groundwork for writing a publication on the following topic: ‘Outlier detection using the Invariant Coordinate Selection method for compositional data’.
The work proceeded in three steps. The first was to learn the basics of compositional algebra and data analysis, which are Christine Thomas-Agnan’s research interests. The second was to understand the Invariant Coordinate Selection method, which relies on the concepts of scatter operators and kurtosis and was conceived during Anne Ruiz-Gazen’s thesis. The third was to adapt the latter to the former, and to develop a good sense of what could be carried over, what needed to be changed and what could not work when moving from general multivariate data to compositional data.
Application to the BDDSegX data set
Once the foundations were laid, the idea was to apply the newly conceived method to a real data set (for instance by designing an R package called ICSCoDa), in order to test the validity of the results that were found on this particular data set and to interpret this compositional data set thanks to the Invariant Coordinate Selection method.
The BDDSegX data set on the automotive market was chosen to illustrate the notions and compare the methods presented. This data set was generated by Christine Thomas-Agnan and Joanna Morais and contains (among other variables) the monthly evolution, from 2003 to 2015, of the volume of each of the five segments of a size-based car classification (from the A-segment, which groups city cars, to the E-segment, which groups executive cars), for a total of 152 observations.
It is relevant to see this data set as compositional data, provided of course that one is not interested in studying the variation of the total volume of the automotive market.
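To make the last remark concrete, the following minimal sketch shows what is kept and what is lost when segment volumes are turned into a composition. It assumes the usual closure operation (dividing each part by the total); the function name `closure` and the volumes are hypothetical and chosen for illustration only, they are not taken from BDDSegX.

```python
import numpy as np


def closure(volumes):
    """Rescale non-negative volumes so that the parts sum to 1 (shares of the whole)."""
    volumes = np.asarray(volumes, dtype=float)
    totals = volumes.sum(axis=-1, keepdims=True)
    return volumes / totals


# Hypothetical monthly volumes for segments A to E (made-up numbers, not from BDDSegX).
month_1 = np.array([100.0, 300.0, 250.0, 200.0, 150.0])
# Same market structure, but twice the total volume.
month_2 = 2.0 * month_1

shares_1 = closure(month_1)  # values: 0.1, 0.3, 0.25, 0.2, 0.15
shares_2 = closure(month_2)

# Both months yield the same composition: the total volume has been discarded.
same_composition = np.allclose(shares_1, shares_2)
```

Two months whose totals differ by a factor of two map to the same point of the simplex, which is why the compositional view is appropriate only when the variation of the total market volume is not of interest.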