Introduction to Functional Data Analysis

From a course by Prof. Camelia Goga

April 24, 2024

Plan for today

  1. Introduction
    A brief presentation of this type of data and the new challenges it raises
    Why functional data?
  2. Preprocessing
    Curve smoothing in the presence of noise
  3. Exploratory analysis
    Descriptive statistics of functional data and functional PCA
  4. Regression
    The functional linear model

Introduction

A brief history

  • Early works: Deville (1974) (an analysis of family planning after the war), Dauxois, Pousse, and Romain (1982), Ramsay (1982)

  • …and a revival: Ramsay and Silverman (1997), etc.

  • A wide range of applications:

    • Economics: Kneip and Utikal (2001)
    • Climatology: Besse, Cardot, and Stephenson (2000)
    • Remote sensing: Cardot, Faivre, and Goulard (2003)
    • Bioinformatics
    • Medicine
  • A CRAN Task View (FunctionalData)

Examples

Example 1: Daily maximum temperatures in 2016 in Vietnam

Code
library(dda)
library(tidyverse)
library(ggplot2)
data(vietnam_temperature)
vntemp <- vietnam_temperature |>
  filter(year == 2016) |>
  mutate(t_max = lapply(t_max, unclass), id = seq_len(n())) |>
  unnest_longer(t_max, indices_to = "day")

vntemp |> ggplot(aes(day, t_max, group = id, color = region)) +
  geom_point() +
  geom_line() +
  xlab("Day number") +
  ylab("Temperature (°C)")

Figure 1: There are more discretisation points (365 or 366) than observations (63 provinces).

Example 2: Physical growth of Peruvian newborns

Code
library(refund)
data(content)
ggplot(content, aes(agedays, height, color = as.factor(id))) +
  geom_point(show.legend = FALSE) +
  geom_line(show.legend = FALSE) +
  xlab("Age (days)") +
  ylab("Height (cm)")

Figure 2: The discretisation points are not equally spaced and different for each curve. Source: CONTENT study, Sixth Framework Programme of the European Union.

Example 3: Near-infrared reflectance spectra of gasoline samples

Code
library(refund)
data(gasoline)
wavelength <- as.numeric(gsub("\\D", "", colnames(gasoline$NIR)))
with(gasoline, matplot(wavelength, t(NIR),
  type = "b",
  xlab = "Wavelength (nm)", ylab = "log(1/reflectance)"
))

Figure 3: The x-axis is not time.

Mathematical and probabilistic modelling

  • The examples given above can be considered as realisations of a variable \(Y\) which will now be a function of \(t\in [0, T]:\) \[ Y=(Y(t))_{t\in [0, T]}; \]
  • We call functional random variable (random function), a variable \(Y\) defined on a probabilistic space \((\Omega, \mathcal K, \mathbb P)\) with values in a space of functions \(\mathcal F\) (vector space of infinite dimension): \[ Y:\Omega \longrightarrow \mathcal F \]
  • A realisation \(Y(\omega)=(Y(t,\omega))_{t\in [0, T]}\in \mathcal F,\) with \(Y(t,\omega)\in \mathbb{R},\) is called functional data.
  • For \(\omega\) fixed, \(Y(\cdot, \omega):[0, T]\longrightarrow \mathbb{R}\) is a trajectory of \(Y\) (a deterministic function);
  • For \(t\) fixed, \(Y(t, \cdot):\Omega \longrightarrow \mathbb{R}\) is a real random variable.

In practice

  • We observe a sample of random functions \(Y_1, \ldots, Y_n:\) \[ Y_i:\Omega\longrightarrow \mathcal F \] and we consider that \(Y_i, i=1, \ldots, n\) are iid.

  • We always observe the functions \(Y_i, i=1, \ldots, n\) in a finite (often large) number of discretisation points (which may be different from one curve to another): \[t_{i1}, \ldots, t_{iD_i}\subset [0, T]\] so the data is presented in vector form.
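As a toy illustration (hypothetical data, not from the course), curves observed on a common grid of discretisation points are naturally stored as an \(n\times D\) matrix, one row per curve:

Code
# Hypothetical example: n = 3 curves observed at the same D = 5 points
tgrid <- seq(0, 1, length.out = 5)
Y <- t(sapply(1:3, function(i) sin(2 * pi * tgrid + i) + rnorm(5, sd = 0.1)))
dim(Y) # 3 x 5: row i contains (Y_i(t_1), ..., Y_i(t_D))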

Why consider this functional aspect?

  • In practice, we observe the curves at discretised points, so we could consider that the objects \(Y_i\) are vectors and use the tools of multidimensional statistics.
  • However, this treatment will not take into account the “continuous” nature of the variables.
  • Multivariate statistics may prove ineffective at, or even incapable of, dealing with the following situations:
    1. the discretization points are different from one individual to another (e.g. newborn growth curves);
    2. the measurement points are not equidistant (in time for example);
    3. the number of discretization points is greater than the number of individuals in the sample (e.g. temperature curves);

To overcome these difficulties,

  • points 1 and 2 can be dealt with by non-parametric smoothing of the observed data: the curve is reconstructed from the discretised points.
  • point 3 poses problems for linear regression, ANOVA, etc.; functional statistical analysis can provide answers in this case.

Preprocessing: Smoothing of noisy data

Non-parametric curve smoothing

  • In functional statistics, the objects of interest are functions: \[ Y=(Y(t))_{t \in [0, T]} \]

  • Therefore, in theory, \(Y\) could be measured at any time \(t \in [0, T]\).

  • Nevertheless, in practice:

    • \(Y\) is observed at discretised times \(t_1, \ldots, t_D\) which may be equidistant (temperature data) or not (newborn growth data);
    • measurements \(Y(t_d)\) may be subject to measurement errors (newborn growth study);
  • To overcome these difficulties, the true curve \(Y=(Y(t))_{t\in [0,T]}\) can be “reconstructed” from the discretised data using non-parametric smoothing methods.

Example 1: Daily maximum temperatures in 2016 in Vietnam

Code
library(fda)
y <- filter(vietnam_temperature, year == 2016 & province == "HANOI") |>
  pull(t_max) |>
  unlist()
breaks <- seq(1, length(y), length.out = 10)
bsp <- create.bspline.basis(breaks = breaks)
yhat <- smooth.basis(seq_len(length(y)), y, bsp)
invisible(
  plot(yhat, xlab = "Day number", ylab = "Temperature (°C)")
)
points(seq_len(length(y)), y)

Figure 4: Smoothing of daily maximum temperature in 2016 in Hanoi.

Non-parametric smoothing of a single curve

  • The objective is to reconstruct the curve \(Y=(Y(t))_{t\in [0, T]}\) from the observed data: \[ (t_d, Y_d), \quad d=1, \ldots, D \] where \(t_d\) are the discretised time instants and \(Y_d=Y(t_d), d=1, \ldots, D\).
  • The measurement points \(t_d\) can be equidistant and fixed in time, or random, drawn from a continuous distribution supported on \([0, T];\)
  • The measurements \(Y_d\) are often observed with measurement error, so we pose a non-parametric model: \[ Y_d=f(t_d)+\varepsilon_d, \quad d=1, \ldots, D. \] The time \(t\) (with observed values \(t_d\)) plays the role of the predictor in a classical non-parametric regression model.
  • We can use non-parametric methods like

    1. Kernel regression
    2. Local polynomials
    3. Spline regression

    to give an estimate \(\widehat f(t_d)\) of \(f(t)\);

  • The reconstructed curve is \(\widehat Y(t)=\widehat f(t)\) for all \(t \in [0, T]\).
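As an illustration of these three approaches (synthetic data and base-R functions, not course code): `ksmooth()` performs kernel regression, `loess()` fits local polynomials and `smooth.spline()` fits a smoothing spline.

Code
# Reconstruct f(t) = sin(2*pi*t) from noisy, irregularly spaced observations
set.seed(1)
D <- 100
t_d <- sort(runif(D)) # irregular design points in [0, 1]
y_d <- sin(2 * pi * t_d) + rnorm(D, sd = 0.2)
plot(t_d, y_d, xlab = "t", ylab = "Y")
lines(ksmooth(t_d, y_d, kernel = "normal", bandwidth = 0.2), col = 2) # kernel
lines(t_d, fitted(loess(y_d ~ t_d, degree = 2, span = 0.3)), col = 3) # local poly
lines(smooth.spline(t_d, y_d), col = 4) # spline
legend("topright", c("kernel", "local polynomial", "spline"), col = 2:4, lty = 1)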

Non-parametric smoothing in a basis of spline functions

  • Let \(b_1(\cdot), \ldots, b_q(\cdot)\) be a basis of spline functions (\(B\)-splines or truncated polynomials) of dimension \(q=K+m\) with \(K\) equidistant interior knots and piecewise polynomials of degree \(m-1\);

  • We denote by \(\mathbf{b}^T(t)=(b_1(t), \ldots, b_q(t))\) the row vector of basis functions evaluated at a point \(t\in [0, T]\) and by \(\mathbf{B}\) the matrix containing the values of \(\mathbf{b}^T(t)\) at the discretisation points \(t_d, d=1, \ldots, D\): \[ \mathbf{B}=\left(\begin{array}{c} \mathbf{b}^T(t_1)\\ \mathbf{b}^T(t_2)\\ \vdots\\ \mathbf{b}^T(t_D) \end{array}\right)=\left(\begin{array}{ccc} b_1(t_1) & \ldots & b_q(t_1)\\ b_1(t_2) & \ldots & b_q(t_2)\\ \vdots & & \vdots\\ b_1(t_D) & \ldots & b_q(t_D) \end{array}\right) \]

  • The estimator \(\widehat f(t_d)\) of \(f(t_d)\) is given by \[\begin{aligned} \widehat f(t_d) &= \mathbf{b}^T(t_d)\boldsymbol{\widehat{\theta}}, \quad d=1, \ldots, D,\quad \mbox{with}\\ \boldsymbol{\widehat{\theta}} &= \underset{\boldsymbol{\theta}\in \mathbb R^q}{\operatorname{argmin}} \sum_{d=1}^D(Y_d-\mathbf{b}^T(t_d)\boldsymbol{\theta})^2=(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{Y} \end{aligned}\] where \(\mathbf{Y}=(Y_d)_{d=1}^D\).

  • In the end, we have the curve reconstructed by non-parametric regression using spline functions: \[ \widehat Y(t)=\widehat f(t)=\mathbf{b}^T(t)\boldsymbol{\widehat\theta}=\mathbf{a}^T(t)\mathbf{Y} \quad \text{for all}\quad t\in [0, T], \] where \(\mathbf{a}^T(t)=\mathbf{b}^T(t)(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T;\)

  • The number of knots \(K\) can be chosen by cross-validation or generalised cross-validation, and \(m\) is small (3 or 4, i.e. quadratic or cubic splines). The knots can be positioned at the quantiles of the instants \(t_1, \dots, t_D\).
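A minimal sketch of this least-squares construction, assuming `splines::bs()` for the \(B\)-spline basis and synthetic data; `B` plays the role of \(\mathbf{B}\) and `theta_hat` of \(\boldsymbol{\widehat\theta}\):

Code
library(splines)
set.seed(1)
D <- 100
t_d <- seq(0, 1, length.out = D)
y <- sin(2 * pi * t_d) + rnorm(D, sd = 0.2)
K <- 8; m <- 4 # K interior knots, cubic splines (degree m - 1 = 3)
knots <- quantile(t_d, probs = seq_len(K) / (K + 1)) # knots at quantiles
B <- bs(t_d, knots = knots, degree = m - 1, intercept = TRUE) # D x q, q = K + m
theta_hat <- solve(t(B) %*% B, t(B) %*% y) # (B^T B)^{-1} B^T Y
f_hat <- B %*% theta_hat # reconstructed curve at t_1, ..., t_D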

Smoothing the sample curves using \(B\)-splines regression

  • We have a sample of \(n\) curves, \(Y_1, \ldots, Y_n\) which are, for simplicity, measured at the same discretisation times \(t_1, \ldots, t_D;\)

  • Let \(Y_i(t_d)=Y_{id}\) for \(i=1, \ldots, n\) and \(d=1, \ldots, D\).

  • We pose \(n\) non-parametric models: \[\begin{aligned} Y_{1d} & = f_1(t_d)+\varepsilon_{1d}, \quad d=1, \ldots, D \\ Y_{2d} & = f_2(t_d)+\varepsilon_{2d}, \quad d=1, \ldots, D \\ &\vdots \\ Y_{nd} & = f_n(t_d)+\varepsilon_{nd}, \quad d=1, \ldots, D \end{aligned}\]

  • The objective is to estimate \(f_i, i=1, \ldots, n\);

  • Let \(b_1(\cdot), \ldots, b_q(\cdot)\) be a basis of spline functions of dimension \(q=K+m\), with \(K\) equidistant interior knots and piecewise polynomials of degree \(m-1;\)

  • This same basis is used to reconstruct all \(n\) curves;

  • Using the method described for smoothing a single curve, we obtain: \[\begin{aligned} \widehat Y_i(t)= \widehat f_i(t) &= \mathbf{b}^T(t)\widehat{\boldsymbol{\theta}}_i \\ &= \mathbf{b}^T(t)(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{Y}_i, \quad i=1, \ldots, n \end{aligned}\] where \(\mathbf{Y}_i=(Y_{id})_{d=1}^D;\)

  • The same method can be used in the case where the discretisation times are not the same for all the curves. In this situation, we consider the union of all the measurement instants and the knots are placed at the quantiles of this new set.
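Since the basis, and hence \(\mathbf{B}\), is common to all curves, the \(n\) least-squares fits reduce to a single matrix computation. A sketch reusing `t_d`, `D` and `B` from the previous block, with synthetic curves stored row-wise in an \(n \times D\) matrix `Y`:

Code
n <- 20
Y <- t(replicate(n, sin(2 * pi * t_d) + rnorm(D, sd = 0.2))) # n x D data matrix
Theta_hat <- solve(t(B) %*% B, t(B) %*% t(Y)) # q x n: column i contains theta_i
F_hat <- t(B %*% Theta_hat) # n x D matrix of smoothed curves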

The smoothing parameter

  • Let \(\delta\) be the smoothing parameter used in the reconstruction of the curves \(Y_i, i=1, \ldots, n\) (here, the number of interior knots \(K\)); this parameter can be chosen using the generalised cross-validation (GCV) criterion \[ \widehat Y_{id}=\widehat Y_{i}(t_d)=\mathbf{b}^T(t_d)\widehat{\boldsymbol{\theta}}_i, \quad \widehat{\mathbf{Y}}_i=\mathbf{S}(\delta)\mathbf{Y}_i \] \[ \mathbf{S}(\delta)=\mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T \] \[ GCV_i(\delta)=\sum_{d=1}^D\frac{(Y_{id}- \widehat Y_{id})^2}{(1-\operatorname{trace}(\mathbf{S}(\delta))/D)^2}, \quad i=1, \ldots, n \]
  • \(GCV(\delta)=\frac{1}{n}\sum_{i=1}^nGCV_i(\delta)\) is minimised over \(\delta\).
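A sketch of this criterion under the definitions above, reusing `t_d` and the matrix `Y` from the previous sketches and taking the number of interior knots \(K\) as the smoothing parameter \(\delta\):

Code
library(splines)
gcv <- function(K, t_d, Y) {
  knots <- quantile(t_d, probs = seq_len(K) / (K + 1))
  B <- bs(t_d, knots = knots, degree = 3, intercept = TRUE)
  S <- B %*% solve(t(B) %*% B, t(B)) # smoother matrix S(delta), D x D
  Yhat <- Y %*% S # fitted values; S is symmetric
  mean(rowSums((Y - Yhat)^2) / (1 - sum(diag(S)) / ncol(Y))^2)
}
sapply(4:12, gcv, t_d = t_d, Y = Y) # choose the K with the smallest GCV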

Descriptive statistics and exploration

Descriptive statistics of functional data

  • The variables \(Y_1, \ldots, Y_n\) are now functions of \(t\in [0, T]:\) \[ Y_i=(Y_i(t))_{t\in [0, T]}, \] and we consider that they are independent and identically distributed;

  • We consider that the \(Y_i, i=1, \ldots, n\) belong to the Hilbert space \(L^2[0, T]\), the space of square-integrable functions with the usual scalar product and the associated norm: \[ \langle f,g \rangle = \int_0^T f(t)g(t)\, \mathrm dt, \quad \|f \| =\sqrt{\int_0^T f^2(t)\, \mathrm dt} \]

  • The processes (functions) \(Y_i\) are defined on a probability space \((\Omega, \mathcal K, \mathbb P)\) with values in \(L^2[0, T]:\) \[ Y_i(\omega)=(Y_i(t,\omega))_{t\in [0, T]}, \quad i=1, \ldots, n \]
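In practice these integrals are approximated on the discretisation grid; a small sketch (illustrative, not course code) using the trapezoidal rule:

Code
# Trapezoidal approximation of <f, g> on a (possibly irregular) grid t
inner_l2 <- function(f, g, t) {
  v <- f * g
  sum(diff(t) * (head(v, -1) + tail(v, -1)) / 2)
}
t <- seq(0, 1, length.out = 1000)
inner_l2(sin(2 * pi * t), sin(2 * pi * t), t) # ~ 0.5, so ||sin(2*pi*.)|| ~ 0.71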

Expected function and covariance operator

  • Assume that \(\mathbb E(\|Y\|)=\int_{\Omega}\|Y\|\mathrm d\mathbb P <\infty,\) then we can define: \[ \mathbb E(Y)=\int Y(\omega) \mathrm d\mathbb P(\omega)=\mu\quad \mbox{the expected value of } Y \]

Remark. This is not the Lebesgue integral but rather the Bochner integral (its generalisation to Banach spaces).

  • If \(\mathbb E(\|Y\|^2)=\int_{\Omega}\|Y\|^2 \mathrm d\mathbb P <\infty,\) then we can define: \[ \Gamma=\mathbb E(\left(Y - \mu\right)\otimes \left(Y - \mu\right))\quad \mbox{the covariance operator of } Y \] where the tensor product of two elements \(a\) and \(b\) of \(L^2[0,T]\) is the rank 1 operator defined by \(a \otimes b(u) = \langle a, u \rangle b,\) for \(u\) in \(L^2[0,T]\).
  • Under additional conditions on the process \(Y\) (continuity and second order), we can then define the mean function \[ \mu(t)=\mathbb E(Y(t)), \quad t\in [0, T] \ \textbf{fixed}, \] and, for all \(s, t\), the covariance function: \[ \gamma(s,t)=\operatorname{cov} (Y(s), Y(t))=\mathbb E\left((Y(s)-\mu(s))(Y(t)-\mu(t))\right), \] with \((\Gamma f)(t)=\int_0^T\gamma(t,s)f(s)\, \mathrm ds\) (\(\gamma\) is the kernel of \(\Gamma\)).
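Example. For a standard Brownian motion \(W=(W(t))_{t\in [0, 1]}\) (a classical illustration, not taken from the course), \(\mu(t)=0\) and \(\gamma(s,t)=\operatorname{cov}(W(s), W(t))=\min(s,t)\), so that \((\Gamma f)(t)=\int_0^1 \min(s,t)f(s)\, \mathrm ds\).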

Empirical mean and empirical covariance operator

Let \(Y_i, i=1, \ldots, n\) be independent and identically distributed as \(Y\).

  • The mean (empirical) curve is given by \[ \mu_n=\frac{1}{n}\sum_{i=1}^nY_i, \quad \mu_n(t)=\frac{1}{n}\sum_{i=1}^nY_i(t). \]
  • The (empirical) covariance operator is given by:
    \[\begin{aligned} \Gamma_n &= \frac1n \sum_{k=1}^n \left(Y_k - \mu_n\right)\otimes \left(Y_k - \mu_n\right) = \frac1n \sum_{k=1}^n Y_k\otimes Y_k-\mu_n\otimes\mu_n \end{aligned}\]
  • The empirical mean \(\mu_n\) estimates \(\mu\) without bias and the corrected covariance \(\frac{n}{n-1}\Gamma_n\) estimates \(\Gamma\) without bias.
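Pointwise, the kernel of \(\Gamma_n\) is \[ \gamma_n(s,t)=\frac{1}{n}\sum_{i=1}^n \left(Y_i(s)-\mu_n(s)\right)\left(Y_i(t)-\mu_n(t)\right), \] which, evaluated on the discretisation grid, is (up to the factor \(n/(n-1)\)) what `cov()` computes in the example below.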

Example 3: Near-infrared reflectance spectra of gasoline samples

Code
library(fda.usc)
with(gasoline, matplot(wavelength, t(NIR),
  type = "b",
  xlab = "Wavelength (nm)", ylab = "log(1/reflectance)"
))
nir <- fdata(gasoline$NIR, argvals = wavelength)
nir_mean <- mean(nir)
lines(nir_mean, lwd = 2)

Figure 5: Empirical mean of the NIR curves, computed pointwise.

Code
library(plotly)
nir_cov <- cov(nir$data) # cov() uses the n - 1 denominator, i.e. the corrected estimator
plot_ly(x = wavelength, y = wavelength, z = nir_cov) |>
  layout(scene = list(
    xaxis = list(title = "Wavelength (nm)"),
    yaxis = list(title = "Wavelength (nm)"),
    zaxis = list(title = "Covariance")
  )) |>
  add_surface()

Figure 6: Empirical covariance function of the NIR curves, evaluated on the wavelength grid.