santaR Theoretical Background
Arnaud Wolfer
2019-10-03
santaR
is a Functional Data Analysis (FDA)
approach where each individual’s observations are condensed to a
continuous smooth function of time, which is then employed as a new
analytical unit in subsequent data analysis.
santaR
is designed to be automatically applicable to a large number of variables, while simultaneously accounting for:
- biological variability
- measurement error
- missing observations
- asynchronous sampling
- nonlinearity
- low number of time points (e.g. 4-10)
This vignette focusses on the theoretical background underlying
santaR
and will detail:
- Concept of Functional Data Analysis
- Main hypothesis
- Signal extraction using splines
- Group mean curve fitting
- Intra-class variability (confidence bands on the Group mean
curves)
- Detection of significantly altered time-trajectories
Concept of Functional Data Analysis
Traditional time-series methodologies consider each measurement (or observation) - for example a metabolic response value at a defined time-point - as the basic representative unit. In this context, a time trajectory is composed of a set of discrete observations taken at successive times. Functional data analysis is a framework proposed by Ramsay and Silverman which models each measured value continuously as a variable (e.g. time) evolves. Contrary to traditional time-series approaches, FDA considers a subject’s entire time-trajectory as a single data point. This continuous representation of the variable through time is then employed for further data analysis.
More precisely, FDA assumes that a variable’s time trajectory is the result of a smooth underlying function of time which must be estimated. Therefore, the first step of the analysis is to convert a subject’s successive observations into an estimate of the underlying function \(F(t)\) that can be evaluated at any desired time value.
Two main challenges preclude the exact characterisation of the true underlying function of time \(F(t)\). First, the limited number of discrete observations is confined to a finite time interval and does not cover all possible time values, while the true underlying smooth process is defined at every point in time. Secondly, the intrinsic limitations and variability of the analytical platforms mean that the discrete observations do not capture the true underlying function values, but a succession of noisy realisations.
A metabolite’s concentration over the duration of the experiment is
the real quantity of interest. As the true smooth process \(F(t)\) cannot be evaluated, due to the
collection and analytical process only providing noisy snapshots of its
value, an approximation of the underlying metabolite concentration \(f(t)\) must be employed. This approximation
can be computed by smoothing the multiple single measurements
available.
The approximation of the underlying function of time, for a single variable observed on a single subject, is generated by smoothing the observations \(y_i\) taken at times \(t_i\), such that:
\[y_{i}=f(t_{i})+\epsilon_{i}\]
Where \(f(t)\) is the smoothed
function that can be evaluated at any desired time \(t\) within the defined time domain, and
\(\epsilon_i\) is an error term
allowing for measurement error. By extension, if \(f(t)\) is defined as representing the true variable level across a population, \(\epsilon\) can allow for both individual-specific deviation and measurement error.
In order to parametrise and computationally manipulate the smooth approximation, the representation is projected onto a finite basis, which expresses the infinite-dimensional \(f(t)\) as a linear combination of a set of basis functions. A wide range of basis functions (e.g. polynomials, splines, wavelets, Fourier basis) are available, although the most commonly used are the Fourier basis for periodic data and splines for aperiodic trajectories.
Following smoothing, a time-trajectory is expressed as a linear combination of a set of basis functions, whose values can be computed at any desired time without revisiting the discrete observations. It is assumed that, if a satisfactory fit of the original data is obtained, both the new smooth approximation and the preceding noisy realisations should provide a similar representation of the underlying process.
In practice, by assuming and parametrising (explicitly or implicitly) the smoothness of the function, and by leveraging multiple measurements in time as well as multiple individual trajectories, the smoothed representation can implicitly handle noisy metabolic responses. Additionally, as functional representations do not rely on the input data after their generation, smoothing approaches can handle irregular sampling and missing observations, translating to trajectories that can be compared without requiring missing value imputation. As such, the FDA framework and the smoothness assumption provide the necessary tools for the time-dependent analysis of short and noisy metabolic time-trajectories.
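As an illustration of this smoothing step, the minimal sketch below (base R only, not the santaR implementation itself) fits a smoothing spline to one hypothetical subject’s noisy observations and evaluates the resulting function \(f(t)\) at arbitrary time points; the data values and the choice of 4 degrees of freedom are assumptions made purely for illustration.

```r
# Hypothetical observations for one subject: noisy realisations y_i = f(t_i) + e_i
t_obs <- c(0, 2, 5, 7, 10, 14)            # sparse, possibly asynchronous sampling times
y_obs <- c(1.0, 1.8, 3.1, 2.9, 2.2, 1.5)  # measured variable (e.g. metabolite level)

# Project the trajectory onto a spline basis; 'df' controls the smoothness
fit <- smooth.spline(t_obs, y_obs, df = 4)

# The fitted function can now be evaluated at any time within the observed domain,
# independently of the original sampling design
t_new <- seq(min(t_obs), max(t_obs), length.out = 100)
f_hat <- predict(fit, t_new)$y
```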
Main hypothesis
The working hypothesis is that analyte values are representative of
the underlying biological mechanism responsible for the observed
phenotype and evolve smoothly through time. The resulting measurements
are noisy realisations of these underlying smooth functions of time.
Based on previous work performed in Functional Data Analysis, short
time evolutions can be analysed by estimating each individual trajectory
as a smooth spline or curve. These smooth curves collapse the point-by-point information into a new observational unit that enables the calculation of group mean curves or individual curves, each representative of the true underlying function of time.
Further data analysis then results in the estimation of the
intra-class variability and the identification of time trajectories
significantly altered between classes.
Group mean curve fitting
Individual time-trajectories provide subject-specific approximations of the underlying function of time. If a homogeneous subset of the population (or group) can be devised, the replication of the experiment over this group can be leveraged to generate a group mean curve. This group mean curve provides an approximation of the underlying function of time that is potentially dissociated from inter-individual variability.
A group mean curve \(g(t)\), representing the underlying function of time for a subset of the population, is defined as the mean response across all \(n_k\) group subjects (where \(i=1,\dots,n_k\)): \[g(t)=\frac{1}{n_k} \sum_{i=1}^{n_k} f_i(t) = \overline{f_i(t)}\] (Eq. 3)
Following the framework of FDA, each individual trajectory is
employed as the new basic unit of information. The group mean curve
therefore employs the back-projection of each individual curve at every
time of interest, instead of relying on each discrete measurement. This way, the group mean curve fit is resilient to missing samples in a subject’s trajectory and to asynchronous measurements across individuals.
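A minimal sketch of this idea, under the same illustrative assumptions as before (base R, hypothetical data, 4 degrees of freedom per subject): each individual curve is fitted separately, evaluated on a common time grid, and the group mean curve is taken as the pointwise mean of the individual curves (Eq. 3).

```r
# Hypothetical group of subjects, each with its own (asynchronous, incomplete) sampling
group <- list(
  subj1 = list(t = c(0, 3, 7, 10, 14), y = c(1.1, 2.5, 3.0, 2.4, 1.6)),
  subj2 = list(t = c(0, 2, 5, 9, 14),  y = c(0.9, 2.0, 3.3, 2.6, 1.4)),
  subj3 = list(t = c(1, 4, 8, 12, 14), y = c(1.4, 2.9, 2.8, 1.9, 1.3))
)

t_grid <- seq(0, 14, length.out = 50)

# Fit each individual curve f_i(t), then evaluate it on the common grid
indiv_curves <- sapply(group, function(s) {
  predict(smooth.spline(s$t, s$y, df = 4), t_grid)$y
})

# Group mean curve g(t): pointwise mean of the individual curves (Eq. 3)
g_mean <- rowMeans(indiv_curves)
```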
Multiple group mean curve fits (Center and Right) based on measurements generated from a true smooth underlying function of time (Left), with added inter-individual variability and missing observations. Multiple individual trajectories (Left, dashed green lines) are generated from a true underlying function of time (Left, continuous green line), and discrete measurements are generated (dots). Group mean curves are computed on a subset of these measurements (Center and Right, continuous blue line, 5 effective degrees of freedom). When the group mean curve is generated solely from the measurements (Center), the true underlying function of time is not satisfactorily approximated (the fundamental unit of information is the observation). When the group mean curve (Right, continuous blue line) is generated from each individual curve (Right, dashed blue lines), a satisfactory approximation of the true underlying function of time is obtained (the fundamental unit of information is the functional representation of each individual).
Intra-class variability (confidence bands on the group mean
curves)
Confidence bands are employed to represent the uncertainty in a function’s estimate based on limited or noisy data. Pointwise confidence bands on the group mean curve can therefore provide a visualisation of the variance of the overall mean function through time. For each possible time, the confidence bands indicate the values of the mean curve which could have been produced by the provided individual trajectories, with at least a given probability.
As the underlying distribution of \(F(t)\) is unknown and inaccessible, confidence on \(g(t)\) must be estimated by sampling the best approximation of that distribution available (i.e. the measurements), which is achieved by bootstrapping the observations. The bootstrap assumes that sampling the distribution under a good estimate of the truth (the measurements) should be close to sampling the distribution under the true process. Once a bootstrapped distribution of mean curves is estimated, a pointwise confidence interval on \(g(t)\) can be calculated with the percentile method, by taking the \(\alpha/2\) and \(1-\alpha/2\) quantiles at each time-point of interest (where \(\alpha\) is the error quantile in percent).
As only limited assumptions on the fitted model can be established, empirical confidence bands on the group mean curve are calculated from a non-parametric bootstrapped distribution: individual curves are sampled with replacement and group mean curves are calculated. The distribution of group mean curves is then employed to estimate the pointwise confidence bands.
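Continuing the illustrative sketch above (the matrix indiv_curves of individual curves evaluated on t_grid is assumed), a non-parametric bootstrap of the group mean curve and a 95% pointwise percentile band could look as follows:

```r
set.seed(42)
B_boot <- 500
n_subj <- ncol(indiv_curves)

# Resample individual curves with replacement and recompute the group mean curve
boot_means <- replicate(B_boot, {
  idx <- sample(n_subj, n_subj, replace = TRUE)
  rowMeans(indiv_curves[, idx, drop = FALSE])
})

# Pointwise 95% confidence band via the percentile method (alpha = 0.05)
ci_lower <- apply(boot_means, 1, quantile, probs = 0.025)
ci_upper <- apply(boot_means, 1, quantile, probs = 0.975)
```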
Detection of significantly altered time-trajectories
With a continuous time representation of a metabolite’s concentration
accessible (i.e. the group mean curve), a metric quantifying a global
difference between such curves can be devised. This measure of
difference, independent of the original sampling of the data
(e.g. asynchronous and missing measurements), can then be employed to
determine whether the group mean curves are identical across the two
groups and calculate the significance of the result.
In order to identify time-trajectories that are significantly altered between two experimental groups \(G_1\) and \(G_2\), this global measure must be established. The measure must be able to detect a baseline difference when both curves present the same shape, but also a shape difference when the mean values across time are identical for both groups. We postulate that the area between two independently fitted group mean curves, \(g_1(t)\) and \(g_2(t)\) (Eq. 3), is a good measure of the degree to which the two group time-trajectories differ. This global distance measure is calculated as:
\[dist(g_1,g_2) = \int_{t_{min}}^{t_{max}}
|g_1(t)-g_2(t)| ~dt\] (Eq. 4)
Where:
- \(t_{min}\) and \(t_{max}\) are the lower and upper bounds of the time course.
The global distance measure is a quantification of evidence for
differential time evolution, and the larger the value, the more the
trajectories differ.
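A minimal sketch of this distance, assuming the two group mean curves have been evaluated on a common time grid (the function name and the trapezoidal approximation of the integral are illustrative choices, not the santaR internals):

```r
# Global distance between two group mean curves evaluated on a common grid (Eq. 4),
# approximated by the trapezoidal rule
global_distance <- function(g1, g2, t_grid) {
  d  <- abs(g1 - g2)
  dt <- diff(t_grid)
  sum(dt * (head(d, -1) + tail(d, -1)) / 2)
}
```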
Given the groups \(G_1\), \(G_2\), their respective group mean curves
\(g_1(t)\), \(g_2(t)\) and the observed global distance
measure \(dist(g_1,g_2)\), the question
of the significance of the difference between the two time trajectories
is formulated as the following hypothesis-testing problem:
- \(H_0\): \(g_1(t)\) and \(g_2(t)\) are identical.
- \(H_1\): The two curves are not the
same.
Under the null hypothesis \(H_0\), the global distance between \(g_1(t)\) and \(g_2(t)\) can be generated by randomly permuting the individuals between the two groups. Under the alternative hypothesis \(H_1\), the two curves are different, and very few random assignments of individuals to groups will generate group mean curves that differ more than the observed ones (i.e. a larger global distance).
In order to test the null hypothesis, the observed global distance
must be compared to the expected global distance under the null
hypothesis. The null distribution of the global distance measure is
inferred by repeated permutations of the individual class labels.
Individual trajectories are fitted using the chosen df and randomly assigned to one of two groups of size \(s_1\) and \(s_2\) (identical to the sizes of \(G_1\) and \(G_2\) respectively). The corresponding group mean curves \(g'_1(t)\) and \(g'_2(t)\) are computed, and a global distance \(dist(g'_1,g'_2)\) is calculated. This is repeated many times to form the null distribution of the global distance. From this distribution, a \(p\)-value is computed as the proportion of permuted global distances as extreme as, or more extreme than, the observed global distance (\(dist(g_1,g_2)\)). The \(p\)-value indicates the probability (if the null hypothesis were true) of a global difference as, or more, extreme than the observed one. If the \(p\)-value is below an agreed significance level, the null hypothesis is rejected.
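A sketch of this permutation scheme, reusing the illustrative global_distance() function above and assuming a matrix curves (one column per individual curve, evaluated on t_grid) and a vector labels of group memberships:

```r
permutation_pvalue <- function(curves, labels, t_grid, B = 1000) {
  groups <- unique(labels)
  group_mean <- function(lab, assignment) {
    rowMeans(curves[, assignment == lab, drop = FALSE])
  }

  # Observed global distance between the two group mean curves
  dist_obs <- global_distance(group_mean(groups[1], labels),
                              group_mean(groups[2], labels), t_grid)

  # Null distribution: randomly reassign individuals to groups of identical sizes
  dist_null <- replicate(B, {
    perm <- sample(labels)
    global_distance(group_mean(groups[1], perm),
                    group_mean(groups[2], perm), t_grid)
  })

  # Proportion of permuted distances as extreme as, or more extreme than, the observed one
  mean(dist_null >= dist_obs)
}
```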
Due to the stochastic nature of the permutations, the sampled null distribution depends on the random draw, resulting in possible variation of the \(p\)-value. This variation of the permuted \(p\)-value can be described by confidence intervals. The \(p\)-value is considered as the success probability parameter \(p\) in \(n\) independent Bernoulli trials, where \(n=B\), the total number of permutation rounds. Confidence intervals on this binomial distribution with parameters \(p\) and \(n\) can then be computed. As extreme \(p\)-values are possible (i.e. close to 0) and the number of permutation rounds can be small, the Wilson score interval is employed to determine the confidence intervals (Eq. 5):
\[p_\pm = \frac{1}{1+\frac{1}{B}z^2}\left
( p + \frac{1}{2B}z^2 \pm z\sqrt{\frac{1}{B} p (1 - p) +
\frac{1}{4B^2}z^2 } \right )\] (Eq. 5)
Where:
- \(p_\pm\) is the upper or lower
confidence interval.
- \(p\) is the calculated \(p\)-value.
- \(B\) the total number of
permutation rounds.
- \(z\) is the \(1-1/2\alpha\) quantile of the standard
normal distribution, with \(\alpha\)
the error quantile.
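A direct transcription of Eq. 5 as a small helper (the function name and the example values of p and B are illustrative):

```r
# Wilson score interval for a permutation p-value (Eq. 5)
wilson_interval <- function(p, B, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)                 # 1 - alpha/2 quantile of the standard normal
  centre <- (p + z^2 / (2 * B)) / (1 + z^2 / B)
  halfw  <- (z / (1 + z^2 / B)) * sqrt(p * (1 - p) / B + z^2 / (4 * B^2))
  c(lower = centre - halfw, upper = centre + halfw)
}

wilson_interval(p = 0.01, B = 1000)
```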
Lastly, if a large number of variables are investigated for significantly altered trajectories, significance results must be corrected for multiple hypothesis testing. Here we employ the well-known Benjamini & Hochberg false discovery rate approach by default, with other methods available as an option.
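For illustration, using base R (the p-value vector is an assumption; in practice it would contain one permutation p-value per variable):

```r
# Benjamini-Hochberg correction across all tested variables
p_values <- c(0.0005, 0.001, 0.03, 0.2, 0.8)   # assumed permutation p-values, one per variable
q_values <- p.adjust(p_values, method = "BH")
```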