Peak selection in metabolic profiles using functional data analysis
Author(s)
Doehring, Orlando
Type
Thesis
Abstract
In this thesis we describe sparse principal component analysis (PCA) methods and apply
them to the analysis of short multivariate time series in order to perform both dimensionality
reduction and variable selection. We take a functional data analysis (FDA) modelling
approach in which each time series is treated as a continuous smooth function of time or
curve.
These techniques have been applied to analyse time series data arising in the area
of metabonomics. Metabonomics studies chemical processes involving small molecule
metabolites in a cell. We use experimental data obtained from the COnsortium for MEtabonomic
Toxicology (COMET) project, which is formed by six pharmaceutical companies and
Imperial College London, UK. In the COMET project, repeated measurements of several
metabolites were collected over time from rats subjected to different drug
treatments. The aim of our study is to detect important metabolites by analysing the multivariate
time series.
Multivariate functional PCA is an exploratory technique to describe the observed time
series. In its standard form, PCA involves linear combinations of all variables (i.e. metabolite
peaks) and does not perform variable selection. In order to select a subset of important
metabolites we introduce sparsity into the model. We develop a novel functional Sparse
Grouped Principal Component Analysis (SGPCA) algorithm using ideas related to Least
Absolute Shrinkage and Selection Operator (LASSO), a regularized regression technique,
with grouped variables. This SGPCA algorithm detects a sparse linear combination of
metabolites which explains a large proportion of the variance. Apart from SGPCA, we also
propose two alternative approaches for metabolite selection. The first one is based on
thresholding the multivariate functional PCA solution, while the second method computes
the variance of each metabolite curve independently and then ranks these curves
in decreasing order of importance. To the best of our knowledge, this is the first application
of sparse functional PCA methods to the problem of modelling multivariate metabonomic
time series data and selecting a subset of metabolite peaks.
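
As an illustration of the second alternative approach (ranking metabolites by the variance of their curves), the following is a minimal sketch and not the thesis implementation. It assumes the curves are sampled on a common time grid and stored in an array of shape (subjects, metabolites, time points), and it assumes that "variance of a metabolite curve" means the across-subject variance summed over the time grid. The function name `rank_metabolites_by_variance` and the simulated data are hypothetical.

```python
import numpy as np


def rank_metabolites_by_variance(curves):
    """Rank metabolites by a crude integrated variance of their curves.

    curves: array of shape (n_subjects, n_metabolites, n_timepoints),
            an assumed layout chosen for this illustration.
    Returns (order, total_var), where order lists metabolite indices
    from highest to lowest variance.
    """
    curves = np.asarray(curves, dtype=float)
    # Sample variance across subjects at each (metabolite, time point).
    pointwise_var = curves.var(axis=0, ddof=1)
    # Sum over the time grid: a simple stand-in for integrating over time.
    total_var = pointwise_var.sum(axis=1)
    # Sort metabolite indices by decreasing variance.
    order = np.argsort(-total_var, kind="stable")
    return order, total_var


# Simulated example: five metabolites with clearly different noise scales.
rng = np.random.default_rng(0)
n_subjects, n_metabolites, n_timepoints = 20, 5, 8
scales = np.array([0.1, 2.0, 0.5, 3.0, 1.0])
noise = rng.normal(size=(n_subjects, n_metabolites, n_timepoints))
curves = noise * scales[None, :, None]

order, total_var = rank_metabolites_by_variance(curves)
print("metabolites ranked by variance:", order)
```

Each metabolite is scored independently here, so correlations between metabolites play no role in the ranking; this is the main contrast with the PCA-based selection methods described above.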
We present comprehensive experimental results using simulated data and COMET project
data for different multivariate and functional PCA variants from the literature and for SGPCA.
Simulation results show that the SGPCA algorithm recovers a high proportion
of truly important metabolite variables. Furthermore, in the case of SGPCA applied to the
COMET dataset we identify a small number of important metabolites independently for
two different treatment conditions. A comparison of selected metabolites in both treatment
conditions reveals that there is an overlap of over 75 percent.
Date Issued
2013-02
Date Awarded
2013-03
Copyright Statement
Attribution NoDerivatives 4.0 International Licence (CC BY-ND)
Advisor
Montana, Giovanni
Sponsor
Engineering and Physical Sciences Research Council (EPSRC)
Publisher Department
Mathematics
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)