Projection Based Models for High Dimensional Data
Author(s)
McWilliams, Brian Victor Parulian
Type
Thesis
Abstract
In recent years, many machine learning applications have arisen which deal with the
problem of finding patterns in high-dimensional data. Principal component analysis
(PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction
by estimating latent factors which minimise the reconstruction error between
the original data and its low-dimensional projection. We initially consider a situation
where influential observations exist within the dataset which have a large,
adverse effect on the estimated PCA model. We propose a measure of “predictive
influence” to detect these points based on the contribution of each point to the
leave-one-out reconstruction error of the model using an analytic PRedicted REsidual
Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA,
which minimises the predictive reconstruction error, to deal with the presence of
influential observations and outliers.
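
To make the idea of a leave-one-out reconstruction error concrete, the following is a minimal brute-force sketch in Python with NumPy. It refits PCA once per held-out observation, which is not the analytic PRESS statistic described in the abstract; the function name `loo_reconstruction_errors`, the choice of `k`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def loo_reconstruction_errors(X, k):
    """Brute-force leave-one-out PCA reconstruction error for each row of X.

    Illustrative only: PCA is refitted n times rather than using an
    analytic PRESS formula.
    """
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        X_rest = np.delete(X, i, axis=0)
        mean = X_rest.mean(axis=0)
        # Top-k principal directions of the data with row i held out.
        _, _, Vt = np.linalg.svd(X_rest - mean, full_matrices=False)
        V = Vt[:k].T
        # Squared residual of the held-out point after projection.
        x = X[i] - mean
        residual = x - V @ (V.T @ x)
        errors[i] = residual @ residual
    return errors

rng = np.random.default_rng(0)
# Synthetic data: 50 points close to a line in R^5 (assumed example).
direction = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
X = 5.0 * rng.normal(size=(50, 1)) * direction + 0.01 * rng.normal(size=(50, 5))
# Replace the first point with a gross outlier off the line.
X[0] = np.array([0.0, 10.0, 0.0, 0.0, 0.0])

errors = loo_reconstruction_errors(X, k=1)
```

In this toy setting the held-out outlier is reconstructed far worse than the remaining points, which is the kind of signal an influence measure based on leave-one-out error is meant to capture.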
In some applications there may be unobserved clusters in the data, for which
fitting PCA models to subsets of the data would provide a better fit. This is known
as the subspace clustering problem. We develop a novel algorithm for subspace
clustering which iteratively fits PCA models to subsets of the data and assigns observations
to clusters based on their predictive influence on the reconstruction error.
We study the convergence of the algorithm and compare its performance to a number
of subspace clustering methods on simulated data and in real applications from
computer vision involving clustering object trajectories in video sequences and images
of faces.
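
The alternating scheme described above (fit a PCA model per cluster, then reassign observations) can be sketched as follows. This is a simplified K-subspaces-style loop that reassigns points by their plain squared reconstruction residual, not by the predictive influence used in the thesis; all function names, the balanced random initialisation, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_subspace(X, k):
    """Return the mean and top-k principal directions of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def residuals(X, model):
    """Squared reconstruction error of every row of X under one PCA model."""
    mean, V = model
    Xc = X - mean
    R = Xc - (Xc @ V) @ V.T
    return np.sum(R * R, axis=1)

def cluster_subspaces(X, n_clusters, k, n_iter=50, seed=0):
    """Alternate between fitting one PCA model per cluster and reassigning."""
    n = X.shape[0]
    if n < n_clusters * (k + 1):
        raise ValueError("too few observations for this many clusters")
    rng = np.random.default_rng(seed)
    # Balanced random initial labels, so every cluster starts non-empty.
    labels = rng.permutation(np.arange(n) % n_clusters)
    models = [None] * n_clusters
    history = []
    for _ in range(n_iter):
        for c in range(n_clusters):
            members = X[labels == c]
            # Refit only when the cluster has enough points; otherwise
            # keep the model from the previous iteration.
            if len(members) > k:
                models[c] = fit_subspace(members, k)
        D = np.column_stack([residuals(X, m) for m in models])
        new_labels = D.argmin(axis=1)
        history.append(D.min(axis=1).sum())
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, history

rng = np.random.default_rng(0)
# Synthetic data: noisy points on two different lines in R^3.
line_a = np.outer(rng.normal(size=60), [1.0, 0.0, 0.0])
line_b = np.outer(rng.normal(size=60), [0.0, 1.0, 1.0])
X = np.vstack([line_a, line_b]) + 0.01 * rng.normal(size=(120, 3))

labels, history = cluster_subspaces(X, n_clusters=2, k=1)
```

Each refit and each reassignment can only lower the total residual, so `history` is non-increasing. Like other alternating schemes of this kind, the loop may stop at a local optimum that depends on the initial labels.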
We extend our predictive clustering framework to a setting where two high-dimensional
views of data have been obtained. Often, either clustering or predictive modelling alone
is performed between the views. Instead, we aim to recover
clusters which are maximally predictive between the views. In this setting two-block
partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality
reduction in both views by estimating latent factors that are highly predictive. We
fit TB-PLS models to subsets of data and assign points to clusters based on their
predictive influence under each model, which is evaluated using a PRESS statistic.
We compare our method to state-of-the-art algorithms in real applications in webpage
and document clustering and find that our approach to predictive clustering
yields superior results.
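
As a rough illustration of how latent factors can be estimated in both views at once, the sketch below computes the first pair of two-block PLS weight vectors as the leading singular vectors of the cross-product matrix of the centred views, a standard formulation that maximises the covariance between the projected views. It covers one factor pair only and is not the clustering method of the thesis; the function name `first_pls_factors` and the synthetic two-view data are assumptions.

```python
import numpy as np

def first_pls_factors(X, Y):
    """First pair of two-block PLS weights and scores via an SVD.

    The weights u and v are the leading left and right singular vectors
    of Xc^T Yc, where Xc and Yc are the column-centred views.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    # Latent scores: one-dimensional projection of each view.
    return u, v, Xc @ u, Yc @ v, S[0]

rng = np.random.default_rng(1)
# Synthetic two-view data driven by one shared latent signal z.
z = rng.normal(size=200)
X = np.outer(z, rng.normal(size=8)) + 0.1 * rng.normal(size=(200, 8))
Y = np.outer(z, rng.normal(size=6)) + 0.1 * rng.normal(size=(200, 6))

u, v, t, s, sigma = first_pls_factors(X, Y)
```

Here the inner product of the two score vectors `t` and `s` equals the leading singular value `sigma`, which is the quantity the weights maximise under unit-norm constraints.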
Finally, we propose a method for dynamically tracking multivariate data streams
based on PLS. Our method learns a linear regression function from multivariate
input and output streaming data in an incremental fashion while also performing
dimensionality reduction and variable selection. Moreover, the recursive regression
model is able to adapt to sudden changes in the data-generating mechanism and also
identifies the number of latent factors. We apply our method to the enhanced index
tracking problem in computational finance.
Date Issued
2011-09
Date Awarded
2012-04
Copyright Statement
Attribution NoDerivatives 4.0 International Licence (CC BY-ND)
Advisor
Montana, Giovanni
Gandy, Axel
Publisher Department
Mathematics
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)