Unsupervised analysis of mass spectral data
Author(s)
Inglese, Paolo
Type
Thesis or dissertation
Abstract
Over the last few decades, mass spectrometry imaging (MSI) has gained increasing interest as an analytical tool for the analysis of spatial molecular patterns in samples of interest. Because of its untargeted and high-throughput nature, MSI data often consists of hundreds to thousands of spectral peak images.
Statistical analysis of this type of data has been extensively used to investigate the relationship between the observed molecular patterns and the local properties of the sample. For example, supervised learning techniques can be employed to segment the tissue specimens into histologically relevant areas. In general, these regions must be identified by an expert histopathologist by visual inspection of the optical image of the tissue specimen; for instance, by employing haematoxylin and eosin (H&E) staining. Using this approach, lists of significantly up/down-regulated ions in the various tissue regions can be identified by univariate or multivariate statistical modelling.
Unfortunately, these methods cannot be used when there is no certainty about a direct relationship between histological characteristics and molecular patterns. This situation is typical of analyses where underlying models of the local molecular interactions are unknown, such as in cancerous tissue. In these cases, different data analysis tools must be employed.
Unsupervised data analysis provides a class of techniques and models that can identify data patterns without requiring external information common to objects of the same class. The purpose of these methods is to identify statistical properties common to subsets of the analysed data and to generate a partition based on these. In the case of MSI, the unsupervised methods usually employed fall into the category of clustering. Clustering methods partition the measured spectra according to the similarity between their ion patterns and assign the resulting labels to the corresponding pixels, thus providing a molecular-based segmentation of the sample. However, the lack of a ground truth makes clustering challenging, and various strategies are required to validate the quality of the observed results.
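As a rough illustration of the clustering-based segmentation described above, the following minimal sketch groups per-pixel spectra with plain k-means and maps the labels back onto the pixel grid. It is not the method used in the thesis: the choice of k-means, the farthest-point initialisation, the function names `kmeans_segment` and `labels_to_image`, and the toy intensities are all assumptions made for this example.

```python
from math import dist


def kmeans_segment(spectra, k, n_iter=50):
    """Cluster per-pixel spectra with plain k-means; return one label per pixel."""
    # Deterministic farthest-point initialisation (an illustrative assumption).
    centroids = [list(spectra[0])]
    while len(centroids) < k:
        farthest = max(spectra, key=lambda s: min(dist(s, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each spectrum goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(s, centroids[j])) for s in spectra
        ]
        if new_labels == labels:
            break  # assignments are stable, so the partition has converged
        labels = new_labels
        # Update step: each centroid becomes the mean spectrum of its members.
        for j in range(k):
            members = [s for s, lab in zip(spectra, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def labels_to_image(labels, n_rows, n_cols):
    """Reshape the flat label list (row-major pixel order) into a 2D label map."""
    return [labels[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Hypothetical 2 x 3 pixel image with three ion intensities per pixel.
spectra = [
    [10.0, 1.0, 0.5], [9.5, 1.2, 0.4], [0.8, 8.0, 7.5],
    [10.2, 0.9, 0.6], [1.0, 7.8, 8.1], [0.9, 8.2, 7.9],
]
segmentation = labels_to_image(kmeans_segment(spectra, k=2), n_rows=2, n_cols=3)
for row in segmentation:
    print(row)
```

In this toy case the number of clusters is fixed by hand. In practice there is no ground truth to confirm it, which is why the validation strategies mentioned above are needed.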
Furthermore, unprocessed MSI data often contains signals that result from the specific analytical technique used to extract the molecular content of the sample. For instance, MALDI data contains signals associated with the chemical matrix used to enhance ion desorption, while DESI data contains solvent-related signals. These sample-unrelated signals can interfere with the results of the unsupervised analysis.
For this reason, a series of filters, published in an R package, called SPatially aUTomatic deNoising for Ims toolKit (SPUTNIK), are presented. The filters aim to identify and remove spectral signals that are not likely to be sample-related. The results show that this approach not only significantly reduces the size of the data, but also improves the quality of the clustering results.
Subsequently, the benefit of dimensionality reduction (DR) techniques in determining the optimal number of clusters in large MSI data is investigated. It is shown that although standard linear methods, such as principal component analysis (PCA), cannot provide an accurate and comprehensive picture of the statistical properties of the data, deep-learning-based (highly non-linear) methods reveal the presence of groups of mutually similar spectra. Additionally, the benefit of using a 3-dimensional (3D) tissue specimen for generating robust unsupervised partitions of the data is presented.
Finally, the information contained in MSI data about the spatial localisation of the detected molecules is exploited for the identification of groups of highly co-localised ions. Using the hypothesis that groups of highly co-localised molecules can be an expression of local metabolism, differences in the ion co-localisation patterns between metastatic and non-metastatic colorectal cancer are identified.
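One simple way to picture ion co-localisation is as the spatial correlation between pairs of ion images. The sketch below is an illustration under that assumption only: the abstract does not state which co-localisation measure the thesis uses, and the Pearson correlation, the 0.9 threshold, the function names, and the ion names and intensities are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two flattened ion images of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def colocalised_pairs(ion_images, threshold=0.9):
    """Return ion pairs whose spatial correlation is at least the threshold."""
    names = sorted(ion_images)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(ion_images[first], ion_images[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical flattened ion images (one intensity per pixel, same pixel order).
ion_images = {
    "ion_A": [5.0, 4.0, 0.0, 0.0, 1.0, 5.0],
    "ion_B": [10.0, 8.0, 0.0, 1.0, 2.0, 9.0],
    "ion_C": [0.0, 1.0, 6.0, 5.0, 4.0, 0.0],
}
pairs = colocalised_pairs(ion_images)
print(pairs)
```

Here `ion_A` and `ion_B` share the same spatial distribution and are reported as a pair, while `ion_C` occupies the complementary region and is not. Comparing such pair lists between sample groups is one way differences in co-localisation patterns could be explored.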
Version
Open Access
Date Issued
2018-10
Date Awarded
2019-07
Copyright Statement
Creative Commons Attribution NonCommercial Licence
Advisor
Glen, Robert C
Nicholson, Jeremy K
Publisher Department
Department of Surgery & Cancer
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)