Introduction to Functional Data Analysis

From a course by Prof. Camelia Goga

April 24, 2024

Plan for today

  1. Introduction
    Brief presentation of this type of data and the new challenges
    Why functional data?
  2. Preprocessing
    Curve smoothing in the presence of noise
  3. Exploratory analysis
    Descriptive statistics of functional data and functional PCA
  4. Regression
    The functional linear model

Introduction

A brief history

  • Older works: Deville (1974) (an analysis of family planning after the war), Dauxois, Pousse, and Romain (1982), Ramsay (1982)

  • …and a revival: Ramsay and Silverman (1997), etc.

  • A wide range of applications:

    • Economics: Kneip and Utikal (2001)
    • Climatology: Besse, Cardot, and Stephenson (2000)
    • Remote sensing: Cardot, Faivre, and Goulard (2003)
    • Bioinformatics
    • Medicine
  • A CRAN Task View

Examples

Example 1: Daily maximum temperatures in 2016 in Vietnam

Code
library(dda)
library(tidyverse)
library(ggplot2)
data(vietnam_temperature)
vntemp <- vietnam_temperature |>
  filter(year == 2016) |>
  mutate(t_max = lapply(t_max, unclass), id = seq_len(n())) |>
  unnest_longer(t_max, indices_to = "day")

vntemp |> ggplot(aes(day, t_max, group = id, color = region)) +
  geom_point() +
  geom_line() +
  xlab("Day number") +
  ylab("Temperature (°C)")

Figure 1: There are more discretisation points (365 or 366) than observations (63 provinces).

Example 2: Physical growth of Peruvian newborns

Code
library(refund)
data(content)
ggplot(content, aes(agedays, height, color = as.factor(id))) +
  geom_point(show.legend = FALSE) +
  geom_line(show.legend = FALSE) +
  xlab("Age (days)") +
  ylab("Height (cm)")

Figure 2: The discretisation points are not equally spaced and different for each curve. Source: CONTENT study, Sixth Framework Programme of the European Union.

Example 3: Near-infrared reflectance spectra of gasoline samples

Code
library(refund)
data(gasoline)
wavelength <- as.numeric(gsub("\\D", "", colnames(gasoline$NIR)))
with(gasoline, matplot(wavelength, t(NIR),
  type = "b",
  xlab = "Wavelength (nm)", ylab = "log(1/reflectance)"
))

Figure 3: The x-axis is not time.

Mathematical and probabilistic modelling

  • The examples given above can be considered as realisations of a variable \(Y\) which will now be a function of \(t\in [0, T]:\) \[ Y=(Y(t))_{t\in [0, T]}; \]
  • We call functional random variable (random function) a variable \(Y\) defined on a probability space \((\Omega, \mathcal K, \mathbb P)\) with values in a space of functions \(\mathcal F\) (a vector space of infinite dimension): \[ Y:\Omega \longrightarrow \mathcal F \]
  • A realisation \(Y(\omega)=(Y(t,\omega))_{t\in [0, T]}\in \mathcal F,\) \(Y(t,\omega)\in \mathbb{R}\) will be called functional data.
  • For \(\omega\) fixed, \(Y(\cdot, \omega):[0, T]\longrightarrow \mathbb{R}\) is a trajectory of \(Y\) (a deterministic function);
  • For \(t\) fixed, \(Y(t, \cdot):\Omega \longrightarrow \mathbb{R}\) is a real random variable;

In practice

  • We observe a sample of random functions \(Y_1, \ldots, Y_n:\) \[ Y_i:\Omega\longrightarrow \mathcal F \] and we consider that \(Y_i, i=1, \ldots, n\) are iid.

  • We always observe the functions \(Y_i, i=1, \ldots, n\) in a finite (often large) number of discretisation points (which may be different from one curve to another): \[t_{i1}, \ldots, t_{iD_i}\subset [0, T]\] so the data is presented in vector form.

Why consider this functional aspect?

  • In practice, we observe the curves at discretised points, so we could consider that the objects \(Y_i\) are vectors and use the tools of multidimensional statistics.
  • However, this treatment will not take into account the “continuous” nature of the variables.
  • Multivariate statistics may prove ineffective in, or even incapable of, dealing with the following situations:
    1. the discretisation points are different from one individual to another (e.g. newborn growth curves);
    2. the measurement points are not equidistant (in time, for example);
    3. the number of discretisation points is greater than the number of individuals in the sample (e.g. temperature curves);

To overcome these difficulties,

  • points 1 and 2 can be dealt with by non-parametric smoothing of the observed data: the curve is reconstructed from the discretised points.
  • point 3 poses problems in the case of linear regression, ANOVA, etc., and functional statistical analysis can provide answers in this case.

Preprocessing: Smoothing of noisy data

Non-parametric curve smoothing

  • In functional statistics, the objects of interest are functions: \[ Y=(Y(t))_{t \in [0, T]} \]

  • Therefore, in theory, \(Y\) could be measured at any time \(t \in [0, T]\).

  • Nevertheless, in practice:

    • \(Y\) is observed at discretised times \(t_1, \ldots, t_D\) which may be equidistant (temperature data) or not (newborn growth data);
    • measurements \(Y(t_d)\) may be subject to measurement errors (newborn growth study);
  • To overcome these difficulties, the true curve \(Y=(Y(t))_{t\in [0,T]}\) can be “reconstructed” from the discretised data using non-parametric smoothing methods.

Example 1: Daily maximum temperatures in 2016 in Vietnam

Code
library(fda)
y <- filter(vietnam_temperature, year == 2016 & province == "HANOI") |>
  pull(t_max) |>
  unlist()
breaks <- seq(1, length(y), length.out = 10)
bsp <- create.bspline.basis(breaks = breaks)
yhat <- smooth.basis(seq_len(length(y)), y, bsp)
invisible(
  plot(yhat, xlab = "Day number", ylab = "Temperature (°C)")
)
points(seq_len(length(y)), y)

Figure 4: Smoothing of daily maximum temperature in 2016 in Hanoi.

Non-parametric smoothing of a single curve

  • The objective is to reconstruct the curve \(Y=(Y(t))_{t\in [0, T]}\) from the observed data: \[ (t_d, Y_d), \quad d=1, \ldots, D \] where \(t_d\) are the discretised time instants and \(Y_d=Y(t_d), d=1, \ldots, D\).
  • The measurement points \(t_d\) can be equidistant and fixed in time, or they can be random, drawn from a continuous distribution with support \([0, T];\)
  • The measurements \(Y_d\) are often observed with measurement errors, so the modelling approach is to pose a non-parametric model: \[ Y_d=f(t_d)+\varepsilon_d, \quad d=1, \ldots, D \] The time \(t\) (with values \(t_d\)) plays, in a way, the role of the predictor in a classical non-parametric regression model.
  • We can use non-parametric methods like

    1. Kernel regression
    2. Local polynomials
    3. Spline regression

    to give an estimate \(\widehat f(t_d)\) of \(f(t)\);

  • The reconstructed curve is \(\widehat Y(t)=\widehat f(t)\) for all \(t \in [0, T]\).

Non-parametric smoothing in a basis of spline functions

  • Let \(b_1(\cdot), \ldots, b_q(\cdot)\) be a basis of spline functions (\(B\)-splines or truncated polynomials) of dimension \(q=K+m\) with \(K\) equidistant interior knots and piecewise polynomials of degree \(m-1\);

  • We denote by \(\mathbf{b}^T(t)=(b_1(t), \ldots, b_q(t))\) the vector of basis functions evaluated at a point \(t\in [0, T]\) and by \(\mathbf{B}\) the matrix containing the values of \(\mathbf{b}^T(t)\) at the discretisation points \(t_d, d=1, \ldots, D\): \[ \mathbf{B}=\left(\begin{array}{c} \mathbf{b}^T(t_1)\\ \mathbf{b}^T(t_2)\\ \vdots\\ \mathbf{b}^T(t_D) \end{array}\right)=\left(\begin{array}{ccc} b_1(t_1) & \ldots & b_q(t_1)\\ b_1(t_2) & \ldots & b_q(t_2)\\ \vdots & & \vdots\\ b_1(t_D) & \ldots & b_q(t_D) \end{array}\right) \]

  • The estimator \(\widehat f(t_d)\) of \(f(t_d)\) is given by \[\begin{aligned} \widehat f(t_d) &= \mathbf{b}^T(t_d)\boldsymbol{\widehat{\theta}} \quad d=1, \ldots, D\quad \mbox{with}\\ \boldsymbol{\widehat{\theta}} &= \underset{\boldsymbol{\theta}\in \mathbb R^q}{\operatorname{argmin}} \sum_{d=1}^D(Y_d-\mathbf{b}^T(t_d)\boldsymbol{\theta})^2=(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{Y}_d \end{aligned}\] where \(\mathbf{Y}_d=(Y_d)_{d=1}^D\).

  • In the end, we have the curve reconstructed by non-parametric regression using spline functions: \[ \widehat Y(t)=\widehat f(t)=\mathbf{b}^T(t)\boldsymbol{\widehat\theta}=\mathbf{a}^T(t)\mathbf{Y}_d \quad \text{for all}\quad t\in [0, T]; \]

  • The number of knots \(K\) can be chosen using the cross-validation or generalised cross-validation criterion and \(m\) is small (3 or 4). The knots can be positioned at the quantiles of the instants \(t_1, \dots, t_D\); a short R sketch follows this list.
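
Below is a minimal sketch of this least-squares spline fit on simulated data; the times t, the noisy values y, the number of interior knots K and the cubic degree are illustrative assumptions, and splines::bs is used to build the matrix \(\mathbf{B}\).

Code
# Minimal sketch: least-squares smoothing of one noisy curve in a B-spline basis.
# The data (t, y), the number of knots K and the cubic degree are illustrative.
library(splines)
set.seed(1)
D <- 100
t <- sort(runif(D))                                       # discretisation points
y <- sin(2 * pi * t) + rnorm(D, sd = 0.2)                 # noisy observations Y_d
K <- 8
knots <- quantile(t, probs = seq_len(K) / (K + 1))        # interior knots at quantiles
B <- bs(t, knots = knots, degree = 3, intercept = TRUE)   # D x q matrix B, q = K + 4
theta_hat <- solve(crossprod(B), crossprod(B, y))         # (B'B)^{-1} B'Y
f_hat <- B %*% theta_hat                                  # fitted values f_hat(t_d)
plot(t, y, xlab = "t", ylab = "y")
lines(t, f_hat, lwd = 2)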

Smoothing the sample curves using \(B\)-splines regression

  • We have a sample of \(n\) curves, \(Y_1, \ldots, Y_n\) which are, for simplicity, measured at the same discretisation times \(t_1, \ldots, t_D;\)

  • Let \(Y_i(t_d)=Y_{id}\) for \(i=1, \ldots, n\) and \(d=1, \ldots, D\).

  • Consider the \(n\) non-parametric models: \[\begin{aligned} Y_{1d} & = f_1(t_d)+\varepsilon_{1d}, \quad d=1, \ldots, D \\ Y_{2d} & = f_2(t_d)+\varepsilon_{2d}, \quad d=1, \ldots, D \\ &\vdots \\ Y_{nd} & = f_n(t_d)+\varepsilon_{nd}, \quad d=1, \ldots, D \end{aligned}\]

  • The objective is to estimate \(f_i, i=1, \ldots, n\);

  • Let \(b_1(\cdot), \ldots, b_q(\cdot)\) be a basis of spline functions of dimension \(q=K+m\): \(K\) equidistant interior knots and piecewise polynomials of degree \(m-1;\)

  • This same basis is used to reconstruct all the \(n\) curves;

  • Using the method described for smoothing a single curve, we obtain: \[\begin{aligned} \widehat Y_i(t)= \widehat f_i(t) &= \mathbf{b}^T(t)\widehat{\boldsymbol{\theta}}_i \\ &= \mathbf{b}^T(t)(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{Y}_i, \quad i=1, \ldots, n \end{aligned}\] where \(\mathbf{Y}_i=(Y_{id})_{d=1}^D;\)

  • The same method can be used when the discretisation times are not the same for all the curves: in this situation, we consider the union of all the measurement instants and place the knots at the quantiles of this new set. A sketch using fda::smooth.basis follows.
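
Here is a small sketch, on simulated curves, of smoothing all the sample curves with one common B-spline basis via fda::smooth.basis; the grid, the number of curves and the basis dimension are assumptions made only for illustration.

Code
# Sketch: smoothing n simulated curves observed on a common grid with one
# shared B-spline basis (theta_i = (B'B)^{-1} B'Y_i for each curve).
library(fda)
set.seed(2)
D <- 60; n <- 20
t <- seq(0, 1, length.out = D)
Y <- replicate(n, sin(2 * pi * t) + rnorm(D, sd = 0.3))   # D x n matrix of curves
bsp <- create.bspline.basis(rangeval = c(0, 1), nbasis = 12, norder = 4)
fit <- smooth.basis(argvals = t, y = Y, fdParobj = bsp)
plot(fit$fd)                                              # reconstructed curves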

The smoothing parameter

  • Let \(\delta\) be the smoothing parameter used in the reconstruction of the curves \(Y_i, i=1, \ldots, n\); this parameter can be chosen using the generalised cross-validation criterion: \[ \widehat Y_{id}=\widehat Y_{i}(t_d)=\mathbf{b}^T(t_d)\widehat{\boldsymbol{\theta}}_i, \quad \mbox{i.e.}\quad \widehat{\mathbf{Y}}_i=\mathbf{S}(\delta)\mathbf{Y}_i \] \[ \mathbf{S}(\delta)=\mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T \] \[ GCV_i(\delta)=\sum_{d=1}^D\frac{(Y_{id}- \widehat Y_{id})^2}{(1-\operatorname{trace}(\mathbf{S}(\delta))/D)^2}, \quad i=1, \ldots, n \]
  • \(GCV(\delta)=\frac{1}{n}\sum_{i=1}^nGCV_i(\delta)\); a computational sketch follows.
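
A possible implementation of this criterion, reusing the simulated grid t and curve matrix Y from the previous sketch, where the smoothing parameter is taken to be the basis dimension (an assumption made for illustration):

Code
# Sketch: GCV for choosing the basis dimension (used here as the smoothing
# parameter), with t and Y as in the previous sketch.
gcv_basis <- function(t, Y, nbasis) {
  bsp <- create.bspline.basis(rangeval = range(t), nbasis = nbasis)
  B <- eval.basis(t, bsp)                        # D x q matrix of basis values
  S <- B %*% solve(crossprod(B), t(B))           # smoother matrix S(delta)
  Yhat <- S %*% Y                                # fitted values for every curve
  colSums((Y - Yhat)^2) / (1 - sum(diag(S)) / nrow(Y))^2   # GCV_i, i = 1, ..., n
}
dims <- 6:20
gcv <- sapply(dims, function(q) mean(gcv_basis(t, Y, q)))  # GCV(delta)
dims[which.min(gcv)]                                       # selected dimension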

Descriptive statistics and exploration

Descriptive statistics of functional data

  • The variables \(Y_1, \ldots, Y_n\) are now functions of \(t\in [0, T]:\) \[ Y_i=(Y_i(t))_{t\in [0, T]}, \] and we consider that they are independent and identically distributed;

  • We consider that the \(Y_i, i=1, \ldots, n\) belong to the Hilbert space \(L^2[0, T]\), the space of square-integrable functions with the usual scalar product and the associated norm: \[ \langle f,g \rangle = \int_0^Tf(t)g(t) \mathrm dt, \quad \|f \| =\sqrt{\int_0^Tf^2(t) \mathrm dt} \]

  • The processes (functions) \(Y_i\) are defined on a probability space \((\Omega, \mathcal K, \mathbb P)\) with values in \(L^2[0, T]:\) \[ Y_i(\omega)=(Y_i(t,\omega))_{t\in [0, T]}, \quad i=1, \ldots, n \]

Expected function and covariance operator

  • Assume that \(\mathbb E(\|Y\|)=\int_{\Omega}\|Y\|\mathrm d\mathbb P <\infty,\) then we can define: \[ \mathbb E(Y)=\int Y(\omega) \mathrm d\mathbb P(\omega)=\mu\quad \mbox{the expected value of } Y \]

Remark. This is not the Lebesgue integral but rather the Bochner integral (its generalisation to Banach spaces).

  • If \(\mathbb E(\|Y\|^2)=\int_{\Omega}\|Y\|^2 \mathrm d\mathbb P <\infty,\) then we can define: \[ \Gamma=\mathbb E(\left(Y - \mu\right)\otimes \left(Y - \mu\right))\quad \mbox{the covariance operator of } Y \] where the tensor product of two elements \(a\) and \(b\) of \(L^2[0,T]\) is the rank 1 operator defined by \(a \otimes b(u) = \langle a, u \rangle b,\) for \(u\) in \(L^2[0,T]\).
  • Under additional conditions on the process \(Y\) (continuous and of second order), we can then define the mean function \(\mathbb E(Y(t))\) and \[ \mathbb E(Y(t))=\mu(t), \quad t \quad \textbf{fixed} \] and for all \(t, s\) , the covariance function: \[ \gamma(s,t)=\operatorname{cov} (Y(s), Y(t))=\mathbb E\left((Y(s)-\mu(s))(Y(t)-\mu(t))\right) \] and \((\Gamma f)(t)=\int_0^T\gamma(t,s)f(s) \mathrm ds\) (\(\gamma\) is the kernel of \(\Gamma\)).

Empirical mean and empirical covariance operator

Let \(Y_i, i=1, \ldots, n\) be independent and identically distributed as \(Y\).

  • The mean (empirical) curve is given by \[ \mu_n=\frac{1}{n}\sum_{i=1}^nY_i, \quad \mu_n(t)=\frac{1}{n}\sum_{i=1}^nY_i(t). \]
  • The (empirical) covariance operator is given by:
    \[\begin{aligned} \Gamma_n &= \frac1n \sum_{k=1}^n \left(Y_k - \mu_n\right)\otimes \left(Y_k - \mu_n\right) = \frac1n \sum_{k=1}^n Y_k\otimes Y_k-\mu_n\otimes\mu_n \end{aligned}\]
  • The empirical mean \(\mu_n\) estimates \(\mu\) without bias and the corrected covariance \(\frac{n}{n-1}\Gamma_n\) estimates \(\Gamma\) without bias.

Example 3: Near-infrared reflectance spectra of gasoline samples

Code
library(fda.usc)
with(gasoline, matplot(wavelength, t(NIR),
  type = "b",
  xlab = "Wavelength (nm)", ylab = "log(1/reflectance)"
))
nir <- fdata(gasoline$NIR, argvals = wavelength)
nir_mean <- mean(nir)
lines(nir_mean, lwd = 2)

Figure 5: Empirical mean of the NIR curves, computed pointwise.

Code
library(plotly)
nir_cov <- cov(nir$data)
plot_ly(x = wavelength, y = wavelength, z = nir_cov) |>
  layout(scene = list(
    xaxis = list(title = "Wavelength (nm)"),
    yaxis = list(title = "Wavelength (nm)"),
    zaxis = list(title = "Covariance")
  )) |>
  add_surface()

Figure 6: Empirical covariance function of the NIR curves, computed pointwise.

Properties of the expected value

  1. Under the preceding conditions: \[ \mathbb E(\langle Y, f \rangle)=\langle \mathbb E(Y), f\rangle \quad \mbox{for all}\quad f\in L^2[0, T]; \] so the integral and the scalar product commute.

  2. “Pythagorean theorem”: \[ \mathbb E\|Y-\mu\|^2=\mathbb E\|Y\|^2-\|\mu\|^2 \]

Properties of the covariance operator

The \(\Gamma\) operator has the following properties (assuming that \(\mathbb E(\|Y\|^2)<\infty\)):

  1. \(\displaystyle \langle \Gamma f, g\rangle=\mathbb E\left(\langle Y-\mu, f\rangle\langle Y-\mu, g\rangle\right)\) for all \(f, g\in L^2[0, T];\)

  2. \(\Gamma=\mathbb E (Y\otimes Y)-\mu\otimes\mu\)

  3. it is a self-adjoint operator (\(\Gamma=\Gamma^*\), where \(\Gamma^*\) is the adjoint operator), because the kernel \(\gamma(\cdot, \cdot)\) is symmetric, and it is non-negative (\(\langle \Gamma u, u \rangle \geq 0\) for all \(u\));

  4. it has finite trace: \(\|\Gamma\|_{TR}=\sum_{j\geq 1}\langle (\Gamma^*\Gamma)^{1/2} e_j, e_j\rangle=\mathbb E\|Y-\mu\|^2<\infty\);

  5. it is a Hilbert-Schmidt operator: \[\|\Gamma\|^2_{HS}=\sum_{j\geq 1}\|\Gamma e_j\|^2<\infty,\]

where \(\{e_j\}_{j\geq 1}\) is any orthonormal basis of \(L^2[0, T]\).

Spectral decomposition of the covariance operator

  • Let \((\lambda_j, v_j)_{j\geq 1}\), with \(\lambda_j\in \mathbb R\) and \(v_j=(v_j(t))_{t\in [0, T]}\in L^2[0, T]\), denote the eigenvalues and associated eigenfunctions of \(\Gamma\): \[\begin{aligned} \Gamma v_j &= \lambda_j \ v_j\\ \Gamma v_j (t)& = \lambda_j \ v_j(t), \quad t \in [0,T], \quad j\geq 1 \end{aligned}\]

  • \(\Gamma\) is non-negative so \(\lambda_j\geq 0\) for all \(j\geq 1;\)

  • \(\Gamma\) is self-adjoint, so the eigenvectors associated with the distinct eigenvalues are linearly independent and orthogonal;

The spectral theorem

  • The operator \(\Gamma\) is compact and self-adjoint, so the set of non-zero eigenvalues is finite or is a sequence which converges to zero. Let \((\lambda_j)_{j\geq 1}\) be the ordered eigenvalues: \(\lambda_1\geq\lambda_2\geq\ldots>0\). The associated eigenfunctions \((v_j)_{j\geq 1}\) form an orthonormal basis of \(\overline{\operatorname{Im} (\Gamma)}\) in \(L^2[0,T]\), i.e. \(\langle v_j, v_{j'} \rangle = \delta_{jj'}\). The covariance then has the following decomposition: \[ \Gamma=\sum_{j\geq 1}\lambda_jv_j\otimes v_j \quad \mbox{and}\quad \Gamma f=\sum_{j\geq 1}\lambda_j \langle f,v_j\rangle v_j; \]

  • The covariance function \(\gamma\) associated with \(\Gamma\) has the following decomposition: \[ \gamma(s,t)=\sum_{j\geq 1}\lambda_jv_j(s) v_j(t) \quad \mbox{for any}\quad s,t\in [0, T]; \]

Consequences

  • \(Y-\mu=\sum_{j\geq 1}\langle Y-\mu, v_j\rangle v_j\) and \(\mathbb E(\langle Y-\mu, v_j\rangle v_j)=0;\)

  • The trace norm of \(\Gamma\) is given by \[ \|\Gamma\|_{TR}=\sum_{j\geq 1}\langle \Gamma v_j, v_j\rangle=\sum_{j\geq 1} \lambda_j<\infty \] and the Hilbert-Schmidt norm by \[ \|\Gamma\|^2_{HS}=\sum_{j\geq 1} \lambda^2_j<\infty. \]

  • \(\Gamma^{1/2}=\sum_{j\geq 1}\lambda_j^{1/2}v_j\otimes v_j\) and \(\Gamma^p=\sum_{j\geq 1}\lambda_j^pv_j\otimes v_j.\)

Functional Principal Component Analysis

  • The curves \(Y_k\) generate a space of very high dimension, at most \(n\);
  • We want to find a space of dimension \(q\), with \(q\) much smaller than \(n\), that best represents the deviations of the centred curves \(\tilde Y_k=Y_k-\mu;\)
  • Let \(\phi_1, \phi_2, \ldots, \phi_q\) be an orthonormal basis of this space of dimension \(q\); then the projection \(P_q(\tilde Y_k)\) of \(\tilde Y_k\) onto this space is given by: \[ P_q(\tilde Y_k ) = \sum_{j =1}^q \langle \tilde Y_k,\phi_j \rangle \phi_j . \]
  • Considering a quadratic risk, we would like to minimize: \[\begin{aligned} \min_{(\phi_j)_{1 \leq j \leq q}} R_q (\phi_1, \phi_2, \ldots, \phi_q) & = \min_{\phi_j, j=1, \ldots, q} \frac{1}{n} \sum_{k=1}^{n} \left\| \tilde Y_k - \sum_{j=1}^q \langle \tilde Y_k ,\phi_j \rangle \phi_j \right\|^2. \end{aligned}\]
  • The quadratic risk can be written: \[ R_q (\phi_1, \phi_2, \ldots, \phi_q) = \frac{1}{n} \sum_{k=1}^n \left\| Y_k - \mu \right\|^2 - \sum_{j=1}^q \langle \Gamma \phi_j, \phi_j \rangle \]
  • Thanks to the extremal properties of eigenvalues (Chatelin 1983), we obtain that the minimum of \(R_q (\phi_1, \phi_2, \ldots, \phi_q)\) is reached for \[ \phi_1=v_1, \ldots, \phi_q=v_q. \]

Therefore, the optimal subspace of dimension \(q\) (unique if \(\lambda_q > \lambda_{q+1}\)) is the space generated by the \(q\) eigenfunctions associated to the \(q\) largest eigenvalues of \(\Gamma\): \[ Y_k(t) = \mu(t) + \sum_{j =1}^q \langle Y_k - \mu,v_j \rangle v_j(t) + R_{q,k}(t), \quad t \in [0,T] \]

  • The space generated by \(v_1, \cdots, v_q\) gives a representation of the variation over time of the data around the mean.

  • The variance of the projection on \(v_j\) is given by: \[ \lambda_j = \frac{1}{n} \sum_{k=1}^n \langle Y_k - \mu,v_j \rangle^2 \ \] because \(\frac{1}{n} \sum_{k=1}^n \langle Y_k - \mu,v_j \rangle=0\) (see the computational sketch below).
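
As an illustration, here is a sketch of functional PCA computed from the eigendecomposition of the discretised empirical covariance; the grid t and the D x n matrix Y of simulated curves from the smoothing sketches above are reused, and the Riemann-sum approximation of the integrals is an assumption of the sketch. The same decomposition can also be obtained with fda::pca.fd on the smoothed curves.

Code
# Sketch: functional PCA via the eigendecomposition of the pointwise
# empirical covariance, for curves stored column-wise in the D x n matrix Y.
Yc <- sweep(Y, 1, rowMeans(Y))                  # centred curves Y_k - mu_n
delta <- diff(range(t)) / (length(t) - 1)       # grid spacing for the integrals
Gamma_n <- tcrossprod(Yc) / ncol(Yc) * delta    # discretised covariance operator
eig <- eigen(Gamma_n, symmetric = TRUE)
lambda <- eig$values                            # eigenvalues lambda_j
v <- eig$vectors / sqrt(delta)                  # eigenfunctions with unit L2 norm
scores <- delta * t(Yc) %*% v                   # scores <Y_k - mu_n, v_j>
matplot(t, v[, 1:3], type = "l", xlab = "t", ylab = "Eigenfunctions")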

The functional median

  • The notion of quantile does not generalise to \(\mathbb R^p\) because there is no natural total order relation on \(\mathbb R^p\); the same holds in a more general space like \(L^2[0, T];\)

  • Several definitions have been proposed of a concept that could play the role of a quantile, but none has the basic property of a quantile of order \(\alpha\) in \(\mathbb R\): having a proportion \(\alpha\) of the distribution below this value;

  • A definition for the median \[ m = \underset{y\in L^2[0,T]}{\operatorname{argmin}} \sum _{i=1}^n \|Y_i - y\|; \]

Properties

  • This definition is the natural generalisation of the definition of the median in \(\mathbb R\) as \[ \underset{a\in \mathbb R}{\operatorname{argmin}} \sum_{i=1}^n|Y_i-a| \]
  • If \(Y_i\neq m\) for all \(i=1, \ldots, n\), then the median is the unique solution of \[ \sum_{i=1}^n\frac{Y_i-m}{\|Y_i-m\|}=0. \] The solution is found by iterative methods (a Weiszfeld-type sketch follows this list).
  • This is a central indicator of the distribution of curves and has a robustness property: if a point \(Y_i\) is moved to infinity in the direction connecting \(Y_i\) to \(m\), then the median does not change.
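
A minimal sketch of such an iteration (a Weiszfeld-type fixed-point scheme), assuming the curves are stored column-wise in the discretised D x n matrix Y used above; the tolerance and the starting point are arbitrary choices:

Code
# Sketch: Weiszfeld-type iteration for the functional (geometric) median of
# curves stored column-wise in a D x n matrix Y.
func_median <- function(Y, tol = 1e-6, maxit = 100) {
  m <- rowMeans(Y)                               # start from the mean curve
  for (it in seq_len(maxit)) {
    d <- sqrt(colSums((Y - m)^2))                # ||Y_i - m|| for each curve
    w <- 1 / pmax(d, .Machine$double.eps)        # guard against division by zero
    m_new <- drop(Y %*% w) / sum(w)              # weighted average of the curves
    if (sqrt(sum((m_new - m)^2)) < tol) break
    m <- m_new
  }
  m
}
med <- func_median(Y)
plot(t, med, type = "l", lwd = 2, xlab = "t", ylab = "Median curve")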

Functional linear regression

Framework

  • New questions arise:
    1. How do you build a model to predict a functional variable or predict a real variable by a functional variable?
    2. How do you calculate the predictions in these cases and the measures of uncertainty (bias, variance, confidence interval)?
    3. What is the equivalent of \(R^2\) and what interpretation should be given?
  • Several situations are possible in this new functional framework:
    1. \(Y\) functional variable and \(X_j\) real variables (function-on-scalar);
    2. \(Y\) is real and the predictors are functional variables (scalar-on-function);
    3. \(Y\) and the predictors are functional variables (function-on-function);

Case 1: functional \(Y\) and real \(X_j\) predictors

  • Let \(X_1, \ldots, X_p\) be real variables and \(\mathbf{x}'_i=(1,X_{i1}, \ldots, X_{ip});\)

  • The data are \((\mathbf{x}'_i, Y_i(t))\) for \(i=1, \ldots, n;\)

  • We consider the following linear model (Faraway 1997) \[ M1:\quad Y_i(t)=\mathbf{x}'_i\boldsymbol{\beta}(t)+\varepsilon_{i}(t), \quad t\in [0,T] \] where \(\boldsymbol{\beta}(t)=(\beta_0(t), \beta_1(t), \ldots, \beta_p(t))'\) is the functional regression coefficient and the functional residuals \(\varepsilon_{i}(t)\) are centred: \[ \mathbb E(\varepsilon_{i}(t))=0 \] and uncorrelated across individuals, each with the same covariance function \(\gamma\): \[ \operatorname{cov} (\varepsilon_{i}(t),\varepsilon_{j}(t))=0, \quad i\neq j, t\in [0,T] \] \[ \operatorname{cov} (\varepsilon_{i}(t),\varepsilon_{i}(r))=\gamma(t,r), \quad t,r\in [0,T] \]

Estimation of the regression coefficient function \(\boldsymbol\beta\)

  • It is assumed that the matrix \(\mathbf{X}=(\mathbf{x}'_i)_{i=1}^n\) of size \((n,p+1)\) is of rank \(p+1;\)

  • The pointwise estimator of \(\boldsymbol{\beta}\) (for \(t\) fixed) is obtained by ordinary least squares: \[\begin{aligned} \widehat{\boldsymbol{\beta}}(t) & = \underset{\boldsymbol{\beta}(t)}{\operatorname{argmin}} \sum_{i=1}^n(Y_i(t)-\mathbf{x}'_i\boldsymbol{\beta}(t))^2= \underset{\boldsymbol{\beta}(t)}{\operatorname{argmin}} \|\mathbf{Y}(t)-\mathbf{X}\boldsymbol{\beta}(t)\|^2 \\ & = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}(t) \end{aligned}\] where \(\mathbf{Y}(t)=(Y_i(t))_{i=1}^n;\)

  • It can be shown that the estimator \(\widehat{\boldsymbol{\beta}}=(\widehat{\boldsymbol{\beta}}(t))_{t\in [0, T]}\) satisfies the global criterion: \[ \widehat{\boldsymbol{\beta}}= \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \int_{0}^T\|\mathbf{Y}(t)-\mathbf{X}\boldsymbol{\beta}(t)\|^2 \mathrm dt \] A pointwise computational sketch follows this list.
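
A minimal sketch of this pointwise OLS fit on simulated data (the sample size, the design matrix X and the constant true coefficients are illustrative assumptions; the curves are stored row-wise, one individual per row):

Code
# Sketch: pointwise OLS for the function-on-scalar model, with curves stored
# in an n x D matrix Y (rows = individuals) and design matrix X (n x (p+1)).
set.seed(4)
n <- 63; D <- 365; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))          # intercept + p real predictors
beta_true <- matrix(c(20, 5, -3), p + 1, D)        # hypothetical coefficient functions
Y <- X %*% beta_true + matrix(rnorm(n * D, sd = 2), n, D)
beta_hat <- solve(crossprod(X), crossprod(X, Y))   # (X'X)^{-1} X'Y(t_d), all t_d at once
Y_hat <- X %*% beta_hat                            # fitted curves
matplot(t(beta_hat), type = "l", xlab = "t", ylab = "Coefficient functions")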

Residual analysis

  • The predictions are \(\widehat{\mathbf{Y}}(t)=\mathbf{X}\widehat{\boldsymbol{\beta}}(t)=\mathbf{P}_X\mathbf{Y}(t),\) where \(\mathbf{P}_X=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the projection matrix onto the column space of \(\mathbf{X};\)

  • The estimated residuals are \(\widehat{\boldsymbol{\varepsilon}}(t)=\mathbf{Y}(t)-\widehat{\mathbf{Y}}(t)=\left(\mathbf{I}_n-\mathbf{P}_X\right)\mathbf{Y}(t):\) \[ \mathbb E(\widehat{\boldsymbol{\varepsilon}}(t))=0 \] \[ \operatorname{cov}(\widehat{\boldsymbol{\varepsilon}}(t), \widehat{\boldsymbol{\varepsilon}}(r))=\gamma(t,r)\left(\mathbf{I}_n-\mathbf{P}_X\right) \] \[ \mathbb E(\widehat{\boldsymbol{\varepsilon}}(t)'\widehat{\boldsymbol{\varepsilon}}(r))=(n-p-1)\gamma(t,r) \] so an unbiased estimator of \(\gamma(t,r)\) is given by \[ \widehat{\gamma}(t,r)=\frac{1}{n-p-1}\sum_{i=1}^n(Y_i(t)-\widehat Y_i(t))(Y_i(r)-\widehat Y_i(r)) \]

The \(R^2(t)\)

  • The error, regression and total sums of squares for a fixed \(t\) are given by: \[ SSE(t)=\sum_{i=1}^n(Y_i(t)-\widehat{Y}_i(t))^2, \quad SSR(t)=\sum_{i=1}^n(\widehat{Y}_i(t)-\mu(t))^2 \] \[ SST(t)=\sum_{i=1}^n(Y_i(t)-\mu(t))^2=SSE(t)+SSR(t) \]
  • For each fixed \(t\), we can define the \[ R^2(t)= \frac{SSR(t)}{SST(t)}, \quad t\in [0, T] \]
  • We can define a global measure of the predictive quality of the model (a computational sketch follows this list): \[\begin{aligned} R^2 &=\frac{\int_0^TSSR(t) \mathrm dt}{\int_0^TSST(t) \mathrm dt} \\ &=\int_0^Tw_n(t)R^2(t) \mathrm dt \\ &\simeq \frac{\sum_{d=1}^DSSR(t_d)}{\sum_{d=1}^DSST(t_d)}, \quad w_n(t)=\frac{SST(t)}{\int_0^TSST(t) \mathrm dt} \end{aligned}\]
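
Continuing the simulated fit from the sketch above, \(R^2(t)\) and the global \(R^2\) can be computed columnwise (here \(\mu(t)\) is taken as the empirical mean of the observed curves, an assumption of the sketch):

Code
# Sketch: pointwise and global R^2 for the fit of the previous sketch
# (columns of Y and Y_hat correspond to the grid points t_d).
mu_t <- colMeans(Y)                           # empirical mean curve
SSE <- colSums((Y - Y_hat)^2)
SST <- colSums(sweep(Y, 2, mu_t)^2)
SSR <- SST - SSE
R2_t <- SSR / SST                             # R^2(t_d)
R2_global <- sum(SSR) / sum(SST)              # global R^2
plot(R2_t, type = "l", xlab = "Day number", ylab = expression(R^2 * "(t)"))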

Case 2: scalar \(Y\) and functional \(X_j\) predictors

  • In this case the response is real and the predictor is a curve: the scalar-on-function linear model is \[ Y_i=\beta_0+\int_0^T X_i(t)\beta(t) \mathrm dt+\varepsilon_i, \quad i=1, \ldots, n, \] and the coefficient function \(\beta\) can be estimated with refund::pfr, as in the code below.

Implementation in R

Code
DTI1 <- DTI[DTI$visit == 1 & complete.cases(DTI), ]
par(mfrow = c(1, 2))
# Fit model with linear functional term for CCA
fit.lf <- pfr(pasat ~ lf(cca, k = 30, bs = "ps"), data = DTI1)
plot(fit.lf, ylab = expression(paste(beta(t))), xlab = "t")

Figure 7: Implementation of scalar-on-function regression with refund::pfr.

Case 3.1: functional \(Y\) and functional \(X_j\) predictors (concurrent models)

  • Let \(X_1, \ldots, X_p\) be functional variables, \(X_j=(X_j(t))_{t\in [0, T]};\) let \(\mathbf{x}'_i(t)=(1,X_{i1}(t), \ldots, X_{ip}(t));\)

  • The data are \((\mathbf{x}'_i(t), Y_i(t))\) for \(i=1, \ldots, n;\) \[ M1:\quad Y_i(t)=\mathbf{x}'_i(t)\boldsymbol{\beta}(t)+\varepsilon_{i}(t), \quad t\in [0,T] \] where \(\boldsymbol{\beta}(t)=(\beta_0(t), \beta_1(t), \ldots, \beta_p(t))'\) is the functional regression coefficient and the functional residuals \(\varepsilon_{i}(t)\) are centred: \[ \mathbb E(\varepsilon_{i}(t))=0 \] and uncorrelated across individuals, each with the same covariance function \(\gamma\): \[ \operatorname{cov} (\varepsilon_{i}(t),\varepsilon_{j}(t))=0, \quad i\neq j, t\in [0,T] \] \[ \operatorname{cov} (\varepsilon_{i}(t),\varepsilon_{i}(r))=\gamma(t,r), \quad t,r\in [0,T] \]

Estimation of the regression coefficient function \(\boldsymbol{\beta}\)

  • It is assumed that the matrix \(\mathbf{X}(t)=(\mathbf{x}'_i(t))_{i=1}^n\) of size \((n,p+1)\) is of rank \(p+1\) for each \(t\in [0, T];\)

  • The pointwise estimator of \(\boldsymbol{\beta}\) (for \(t\) fixed) is obtained by ordinary least squares: \[\begin{aligned} \widehat{\boldsymbol{\beta}}(t) & = \underset{\boldsymbol{\beta}(t)}{\operatorname{argmin}} \sum_{i=1}^n(Y_i(t)-\mathbf{x}'_i(t)\boldsymbol{\beta}(t))^2= \underset{\boldsymbol{\beta}(t)}{\operatorname{argmin}} \|\mathbf{Y}(t)-\mathbf{X}(t)\boldsymbol{\beta}(t)\|^2\\ & = (\mathbf{X}(t)'\mathbf{X}(t))^{-1}\mathbf{X}(t)'\mathbf{Y}(t) \end{aligned}\] where \(\mathbf{Y}(t)=(Y_i(t))_{i=1}^n;\)

  • It can be shown that the estimator \(\widehat{\boldsymbol{\beta}}=(\widehat{\boldsymbol{\beta}}(t))_{t\in [0, T]}\) satisfies the global criterion: \[ \widehat{\boldsymbol{\beta}}= \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \int_{0}^T\|\mathbf{Y}(t)-\mathbf{X}(t)\boldsymbol{\beta}(t)\|^2 \mathrm dt \] A pointwise computational sketch follows this list.
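
A small sketch of this pointwise fit for a concurrent model with one simulated functional predictor (the predictor X1, the true coefficient curves and the grid are illustrative assumptions):

Code
# Sketch: pointwise OLS for a concurrent model with one functional predictor,
# with Y and X1 stored as n x D matrices on a common grid.
set.seed(5)
n <- 50; D <- 100
grid <- seq(0, 1, length.out = D)
X1 <- matrix(rnorm(n * D), n, D)
beta0 <- sin(pi * grid); beta1 <- grid - 0.5       # hypothetical coefficient curves
Y <- outer(rep(1, n), beta0) + X1 * outer(rep(1, n), beta1) +
  matrix(rnorm(n * D, sd = 0.2), n, D)
beta_hat <- sapply(seq_len(D), function(d) {
  Xd <- cbind(1, X1[, d])                          # design matrix X(t_d)
  solve(crossprod(Xd), crossprod(Xd, Y[, d]))      # (X(t_d)'X(t_d))^{-1} X(t_d)'Y(t_d)
})
matplot(grid, t(beta_hat), type = "l", xlab = "t", ylab = "Coefficient functions")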

Bibliography

Besse P. C., Cardot H. and Stephenson D. B. 2000. “Autoregressive Forecasting of Some Functional Climatic Variations.” Scandinavian Journal of Statistics 27 (4): 673–87.
Cardot H., Faivre R. and Goulard M. 2003. “Functional approaches for predicting land use with the temporal evolution of coarse resolution remote sensing data.” Journal of Applied Statistics, December.
Chatelin F. 1983. Spectral Approximation of Linear Operators. Society for Industrial and Applied Mathematics.
Dauxois J., Pousse A. and Romain Y. 1982. “Asymptotic theory for the principal component analysis of a vector random function: Some applications to statistical inference.” Journal of Multivariate Analysis 12 (1): 136–54.
Deville J.-C. 1974. “Méthodes statistiques et numériques de l'analyse harmonique.” Annales de l'INSEE, no. 15: 3–101.
Faraway J. J. 1997. “Regression Analysis for a Functional Response.” Technometrics 39 (3): 254–61.
Kneip A. and Utikal K. J. 2001. “Inference for Density Families Using Functional Principal Component Analysis.” Journal of the American Statistical Association 96 (454): 519–42.
Ramsay J. O. 1982. “When the data are functions.” Psychometrika 47 (4): 379–96.
Ramsay J. O. and Silverman B. W. 1997. Functional Data Analysis. Springer Series in Statistics. New York, NY: Springer.