santaR Theoretical Background
Arnaud Wolfer
2017-12-30
santaR is a Functional Data Analysis (FDA) approach where each individual’s observations are condensed to a continuous smooth function of time, which is then employed as a new analytical unit in subsequent data analysis.
santaR is designed to be automatically applicable to a high number of variables, while simultaneously accounting for:
- biological variability
- measurement error
- missing observations
- asynchronous sampling
- nonlinearity
- low number of time points (e.g. 4-10)
This vignette focuses on the theoretical background underlying santaR and will detail:
- Concept of Functional Data Analysis
- Main hypothesis
- Signal extraction using splines
- Group mean curve fitting
- Intra-class variability (confidence bands on the Group mean curves)
- Detection of significantly altered time-trajectories
Concept of Functional Data Analysis
Traditional time-series methodologies consider each measurement (or observation) - for example a metabolic response value at a defined time-point - as the basic representative unit. In this context, a time trajectory is composed of a set of discrete observations taken at successive times. Functional data analysis is a framework proposed by Ramsay and Silverman which continuously models each measured value as a variable (e.g. time) evolves. Contrary to traditional time-series approaches, FDA considers a subject’s entire time-trajectory as a single data point. This continuous representation of the variable through time is then employed for further data analysis.
More precisely, FDA assumes that a variable’s time trajectory is the result of a smooth underlying function of time which must be estimated. Therefore, the first step of the analysis is to convert a subject’s successive observations to the underlying function \(F(t)\), which can then be evaluated at any desired time value.
Two main challenges preclude the exact characterisation of the true underlying function of time \(F(t)\). First, the limited number of discrete observations is confined to a finite time interval and cannot cover all possible values, while the true underlying smooth process is defined over an infinite number of points. Secondly, intrinsic limitations and variabilities of the analytical platforms mean that the discrete observations do not capture the true underlying function values, but a succession of noisy realisations.
A metabolite’s concentration over the duration of the experiment is the real quantity of interest. As the true smooth process \(F(t)\) cannot be evaluated, due to the collection and analytical process only providing noisy snapshots of its value, an approximation of the underlying metabolite concentration \(f(t)\) must be employed. This approximation can be computed by smoothing the multiple single measurements available.
The approximation of the underlying function of time, for a single variable observed on a single subject, is generated by smoothing the \(y_i\) observations taken at times \(t_i\), such that:
\[y_{i}=f(t_{i})+\epsilon_{i}\]
Where \(f(t)\) is the smoothed function that can be evaluated at any desired time \(t\) within the defined time domain, and \(\epsilon_i\) is an error term allowing for measurement errors. By extension, if \(f(t)\) is defined as representing the true variable level across a population, \(\epsilon\) can allow for both individual specific deviation and measurement error.
In order to parametrise and computationally manipulate the smooth approximation, the representation is projected onto a finite basis, which expresses the infinite-dimensional \(f(t)\) as a linear combination of a set of basis functions. A wide range of basis functions (e.g. polynomials, splines, wavelets, Fourier bases) are available, although the most commonly used are Fourier bases for periodic data and splines for aperiodic trajectories.
Following smoothing, a time-trajectory is expressed as a linear combination of a set of basis functions, for which values can be computed at any desired time without reconsidering the discrete observations. It is assumed that, if a satisfactory fit of the original data is obtained, both the new smooth approximation and the preceding noisy realisations should provide a similar representation of the underlying process.
In practice, by assuming and parametrising (explicitly or implicitly) the smoothness of the function, and by leveraging multiple measurements in time as well as multiple individual trajectories, the smoothed representation can implicitly handle noisy metabolic responses. Additionally, as functional representations do not rely on the input data after their generation, smoothing approaches can handle irregular sampling and missing observations, meaning that trajectories can be compared without requiring missing value imputation. As such, the FDA framework and the smoothness assumption provide the necessary tools for the time-dependent analysis of short and noisy metabolic time-trajectories.
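As an illustration of this smoothing step, the following minimal sketch uses base R’s smooth.spline (not the santaR implementation) on hypothetical times and values: a spline is fitted to a single individual’s noisy observations and the resulting function is then evaluated at arbitrary time points.

```r
# Minimal sketch (base R, not the santaR implementation): smooth one individual's
# noisy observations with a spline and evaluate the fitted function anywhere.
set.seed(1)
time  <- c(0, 2, 5, 8, 12, 16, 21)                    # hypothetical sampling times
truth <- function(t) 5 + 2 * sin(t / 4)                # hypothetical true f(t)
y     <- truth(time) + rnorm(length(time), sd = 0.3)   # noisy realisations

# Project the observations onto a spline basis with a chosen number of
# effective degrees of freedom
fit <- smooth.spline(time, y, df = 5)

# The fitted function can now be evaluated at any time within the observed
# interval, independently of the original sampling grid
predict(fit, x = seq(0, 21, by = 0.5))$y
```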
Main hypothesis
The working hypothesis is that analyte values are representative of the underlying biological mechanism responsible for the observed phenotype and evolve smoothly through time. The resulting measurements are noisy realisations of these underlying smooth functions of time.
Based on previous work in Functional Data Analysis, short time evolutions can be analysed by estimating each individual trajectory as a smooth spline or curve. These smooth curves collapse the point-by-point information into a new observational unit that enables the calculation of a group mean curve or individual curves, each representative of the true underlying function of time.
Further data analysis then results in the estimation of the intra-class variability and the identification of time trajectories significantly altered between classes.
Group mean curve fitting
Individual time-trajectories provide subject-specific approximations of the underlying function of time. If a homogeneous subset of the population (or group) can be devised, the replication of the experiment over this group can be leveraged to generate a group mean curve. This group mean curve provides an approximation of the underlying function of time potentially dissociated from inter-individual variability.
A group mean curve \(g(t)\), representing the underlying function of time for a subset of the population, is defined as the mean response across all \(n_k\) group subjects (where \(i=1,\ldots,n_k\)): \[g(t)=\frac{1}{n_k} \sum_{i=1}^{n_k} f_i(t) = \overline{f_i(t)}\] (Eq. 3)
Following the framework of FDA, each individual trajectory is employed as the new basic unit of information. The group mean curve therefore employs the back-projection of each individual curve at every time of interest, instead of relying on each discrete measurement. This way the group mean curve fit is resilient to missing samples in a subject trajectory or asynchronous measurement across individuals.
Multiple group mean curve fits (Center and Right) based on measurements generated from a true smooth underlying function of time (Left) with added inter-individual variability and missing observations. Multiple individual trajectories (Left, dashed green lines) are generated from a true underlying function of time (Left, continuous green line). Discrete measurements are generated (dots). Group mean curves are computed on a subset of these measurements (Center and Right, continuous blue line, 5 effective degrees of freedom). When the group mean curve is generated solely based on measurements (Center), the true underlying function of time is not satisfactorily approximated (the fundamental unit of information is the observation). When the group mean curve (Right, continuous blue line) is generated from each individual curve (Right, dashed blue lines), a satisfactory approximation of the true underlying function of time can be obtained (the fundamental unit of information is the functional representation of each individual).
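A minimal sketch of this idea, using base R’s smooth.spline instead of the santaR implementation and simulated individuals (all names and values below are hypothetical), computes the group mean curve by back-projecting each individual curve onto a common time grid and averaging pointwise (Eq. 3).

```r
# Minimal sketch (base R): group mean curve as the pointwise average of the
# individual fitted curves, rather than an average of the raw observations.
set.seed(2)
truth <- function(t) 5 + 2 * sin(t / 4)
time  <- c(0, 2, 5, 8, 12, 16, 21)
grid  <- seq(0, 21, by = 0.25)

# Simulate a few individuals with an individual-specific offset and one missing sample each
individuals <- lapply(1:5, function(i) {
  keep <- sort(sample(seq_along(time), size = 6))
  t_i  <- time[keep]
  y_i  <- truth(t_i) + rnorm(1, sd = 0.5) + rnorm(length(t_i), sd = 0.3)
  smooth.spline(t_i, y_i, df = 5)            # individual curve f_i(t)
})

# Back-project every individual curve onto the common grid and average pointwise (Eq. 3)
curves     <- sapply(individuals, function(f) predict(f, x = grid)$y)
group_mean <- rowMeans(curves)               # g(t) evaluated on the grid
```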
Intra-class variability (confidence bands on the group mean curves)
Confidence bands are employed to represent the uncertainty in a function’s estimate based on limited or noisy data. Pointwise confidence bands on the group mean curve can therefore provide a visualisation of the variance of the overall mean function through time. The confidence bands indicate, for each possible time, the values of the mean curve which could have been produced by the provided individual trajectories, with at least a given probability.
As the underlying distribution of \(F(t)\) is unknown and inaccessible, confidence on \(g(t)\) must be estimated by sampling the approximate distribution available (i.e. the measurements), achieved by bootstrapping the observations. The bootstrap assumes that sampling the distribution under a good estimate of the truth (the measurements) should be close to sampling the distribution under the true process. Once a bootstrapped distribution of mean curves is estimated, a pointwise confidence interval on \(g(t)\) can be calculated using the percentile method, by taking the \(\alpha/2\) and \(1-\alpha/2\) quantiles at each time-point of interest (where \(\alpha\) is the error quantile in percent).
As limited assumptions on the fitted model can be established, empirical confidence bands on the group mean curve are calculated based on a non-parametric bootstrapped distribution. Individual curves are sampled with replacement and group mean curves calculated. The distribution of group mean curves is then employed to estimate the pointwise confidence bands.
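A minimal sketch of this bootstrap in base R, assuming a list of individual fitted curves that can be evaluated with predict (e.g. the smooth.spline fits from the previous sketch) and a common time grid:

```r
# Minimal sketch (base R): non-parametric bootstrap of the group mean curve.
# `fits` is a list of individual fitted curves, `grid` a vector of time points.
bootstrap_bands <- function(fits, grid, n_boot = 500, alpha = 0.05) {
  boot_means <- replicate(n_boot, {
    # resample individual curves with replacement and recompute the group mean curve
    resampled <- sample(fits, size = length(fits), replace = TRUE)
    rowMeans(sapply(resampled, function(f) predict(f, x = grid)$y))
  })
  # Percentile method: alpha/2 and 1-alpha/2 quantiles at each grid point
  list(lower = apply(boot_means, 1, quantile, probs = alpha / 2),
       upper = apply(boot_means, 1, quantile, probs = 1 - alpha / 2))
}

# Usage with the objects from the previous sketch
bands <- bootstrap_bands(individuals, grid)
```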
Detection of significantly altered time-trajectories
With a continuous time representation of a metabolite’s concentration accessible (i.e. the group mean curve), a metric quantifying a global difference between such curves can be devised. This measure of difference, independent of the original sampling of the data (e.g. asynchronous and missing measurements), can then be employed to determine whether the group mean curves are identical across the two groups and calculate the significance of the result.
In order to identify time-trajectories that are significantly altered between two experimental groups \(G_1\) and \(G_2\), this global measure must be established. The measure must be able to detect baseline differences when both curves present the same shape, but also shape differences when the mean values across time are identical for both groups. We postulate that the area between two independently fitted group mean curves \(g_1(t)\) and \(g_2(t)\) (Eq. 3) is a good measure of the degree to which the two group time-trajectories differ. This global distance measure is calculated as:
\[dist(g_1,g_2) = \int_{t_{min}}^{t_{max}} |g_1(t)-g_2(t)| ~dt\] (Eq. 4)
Where:
- \(t_{min}\) and \(t_{max}\) are the lower and upper bound of the time course.
The global distance measure is a quantification of evidence for differential time evolution, and the larger the value, the more the trajectories differ.
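A minimal sketch approximating Eq. 4 in base R, assuming g1 and g2 are fitted curve objects that can be evaluated with predict (e.g. smooth.spline fits); the integral is approximated with the trapezoidal rule on a fine grid.

```r
# Minimal sketch (base R): area between two group mean curves (Eq. 4),
# approximated by trapezoidal integration on a fine time grid.
global_distance <- function(g1, g2, t_min, t_max, n_points = 500) {
  grid <- seq(t_min, t_max, length.out = n_points)
  d    <- abs(predict(g1, x = grid)$y - predict(g2, x = grid)$y)
  # trapezoidal rule for the integral of |g1(t) - g2(t)| over [t_min, t_max]
  sum((d[-1] + d[-n_points]) / 2 * diff(grid))
}
```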
Given the groups \(G_1\), \(G_2\), their respective group mean curves \(g_1(t)\), \(g_2(t)\) and the observed global distance measure \(dist(g_1,g_2)\), the question of the significance of the difference between the two time trajectories is formulated as the following hypothesis-testing problem:
- \(H_0\): \(g_1(t)\) and \(g_2(t)\) are identical.
- \(H_1\): The two curves are not the same.
Under the null hypothesis \(H_0\), the observed global distance between \(g_1(t)\) and \(g_2(t)\) could equally be generated by randomly permuting the individuals between the two groups. Under the alternate hypothesis \(H_1\), the two curves are different, and very few random assignments of individuals to groups will generate group mean curves that differ more than the observed ones (i.e. a larger global distance).
In order to test the null hypothesis, the observed global distance must be compared to the expected global distance under the null hypothesis. The null distribution of the global distance measure is inferred by repeated permutations of the individual class labels. Individual trajectories are fitted using the chosen df and randomly assigned to one of two groups of size \(s_1\) and \(s_2\) (identical to the sizes of \(G_1\) and \(G_2\) respectively). The corresponding group mean curves \(g'_1(t)\) and \(g'_2(t)\) are computed, and a global distance \(dist(g'_1,g'_2)\) calculated. This is repeated many times to form the null distribution of the global distance. From this distribution a \(p\)-value can be computed as the proportion of global distances as extreme or more extreme than the observed global distance \(dist(g_1,g_2)\). The \(p\)-value indicates the probability, if the null hypothesis were true, of a global difference as, or more, extreme than the observed one. If the \(p\)-value is smaller than an agreed significance level, the null hypothesis is rejected.
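A minimal sketch of this permutation scheme in base R, assuming fits is a list of individual fitted curves, labels a two-level factor of group memberships, and reusing the grid-based group mean and trapezoidal distance from the previous sketches (all names are hypothetical, not the santaR interface):

```r
# Grid-based group mean curve (as in the earlier sketch)
group_mean_on_grid <- function(fits, grid) {
  rowMeans(sapply(fits, function(f) predict(f, x = grid)$y))
}

# Minimal sketch (base R): permutation null distribution of the global distance
permutation_pvalue <- function(fits, labels, t_min, t_max, B = 1000) {
  grid <- seq(t_min, t_max, length.out = 500)
  # Area between the two group mean curves for a given labelling (trapezoidal rule)
  area <- function(lab) {
    g1 <- group_mean_on_grid(fits[lab == levels(labels)[1]], grid)
    g2 <- group_mean_on_grid(fits[lab == levels(labels)[2]], grid)
    d  <- abs(g1 - g2)
    sum((d[-1] + d[-length(d)]) / 2 * diff(grid))
  }
  observed  <- area(labels)                        # dist(g1, g2)
  null_dist <- replicate(B, area(sample(labels)))  # permute the class labels
  # Proportion of permuted distances as extreme or more extreme than the observed one
  mean(null_dist >= observed)
}
```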
Due to the stochastic nature of the permutations, the null distribution sampled depends on the random draw, resulting in possible variation of the \(p\)-value. Variation of the permuted \(p\)-value can be described by confidence intervals. The \(p\)-value is considered as the success probability parameter \(p\) in \(n\) independent Bernoulli trials, where \(n=B\), the total number of permutation rounds. Confidence intervals on this binomial distribution of parameters \(p\) and \(n\) can then be computed. As extreme \(p\)-values are possible (i.e. close to 0) and the number of permutation rounds could be small, the Wilson score interval is employed to determine the confidence intervals (Eq. 5):
\[p_\pm = \frac{1}{1+\frac{1}{B}z^2}\left ( p + \frac{1}{2B}z^2 \pm z\sqrt{\frac{1}{B} p (1 - p) + \frac{1}{4B^2}z^2 } \right )\] (Eq. 5)
Where:
- \(p_\pm\) is the upper or lower bound of the confidence interval.
- \(p\) is the calculated \(p\)-value.
- \(B\) the total number of permutation rounds.
- \(z\) is the \(1-\alpha/2\) quantile of the standard normal distribution, with \(\alpha\) the error quantile.
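A minimal sketch of Eq. 5 in base R (function and argument names are hypothetical):

```r
# Minimal sketch (base R): Wilson score confidence interval on a permuted
# p-value (Eq. 5), with p the p-value, B the number of permutation rounds
# and alpha the error quantile.
wilson_interval <- function(p, B, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)
  centre <- p + z^2 / (2 * B)
  spread <- z * sqrt(p * (1 - p) / B + z^2 / (4 * B^2))
  c(lower = (centre - spread) / (1 + z^2 / B),
    upper = (centre + spread) / (1 + z^2 / B))
}

wilson_interval(p = 0.01, B = 1000)
```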
Lastly, if a large number of variables are investigated for significantly altered trajectories, significance results must be corrected for multiple hypothesis testing. Here we employ the well-known Benjamini & Hochberg false discovery rate approach as default, with other methods available as an option.
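In R, such a correction can be obtained with the base function p.adjust, for example (hypothetical p-values):

```r
# Benjamini & Hochberg false discovery rate correction of a vector of p-values
p_values <- c(0.001, 0.02, 0.04, 0.30, 0.65)   # hypothetical permuted p-values
p.adjust(p_values, method = "BH")
```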