Machine learning applied to genotype and omics data integration for the study of loci-trait associations
Author(s)
Leal Ayala, Luis Guillermo
Type
Thesis or dissertation
Abstract
The genetic basis of human diseases remains elusive in spite of substantial research. A huge volume of genotyping data is being produced to understand the association between single nucleotide variants (SNVs) and traits, and to facilitate diagnosis and outcome assessment. Regression models used in Genome-Wide Association Studies (GWAS) mainly identify loci with large effect sizes, whereas GWAS meta-analyses are often needed to capture the weak-effect loci that contribute to the missing heritability. Strategies for data reduction and integration are therefore required to expand the landscape of disease-causative genes.
The main goal of this research is to develop a machine learning algorithm that merges genotype data with other omics data to enhance the prioritisation of disease-associated genes. This thesis details the development of a data fusion framework called Corrected Non-negative Matrix Tri-Factorisation (cNMTF) for prioritising SNVs and genes with biological and genetic relevance in categorical traits. cNMTF captures the interrelatedness between variant data, gene networks, patients' phenotypes and their ancestry.
cNMTF was evaluated in the prioritisation of genes and SNVs associated with lipid traits (low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol and triglycerides) in Finnish and American cohorts. Under the principle of guilt-by-association, the method was tuned to enrich the output with previously identified genes ($p=9\times10^{-14}$, hypergeometric test), while capturing novel candidate genes that could play a biological role in patients with extreme lipid levels. A total of 265 genes were prioritised, including novel pleiotropic genes and regulators of lipid metabolism.
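The enrichment figure above comes from a hypergeometric test. As a minimal sketch of how such a p-value can be computed, the Python function below evaluates the upper tail of the hypergeometric distribution. The function name, its parameters and the counts in the usage line are hypothetical illustrations; they are not the thesis's code or data, and the background gene set used in the thesis is not stated here.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, known: int,
                                selected: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    population: size of the background gene set (assumed, hypothetical)
    known:      previously identified genes within the background
    selected:   genes prioritised by the method
    overlap:    prioritised genes that are also previously identified
    """
    if not (0 <= known <= population and 0 <= selected <= population):
        raise ValueError("known and selected must lie between 0 and population")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")

    # The largest overlap that is possible at all.
    upper = min(known, selected)

    # Number of ways to draw `selected` genes from the background.
    total = comb(population, selected)

    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(known, k) * comb(population - known, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy counts only, chosen for illustration.
p_value = hypergeom_enrichment_pvalue(population=10, known=4, selected=3, overlap=2)
```

With SciPy installed, `scipy.stats.hypergeom.sf(overlap - 1, population, known, selected)` should return the same tail probability; the standard-library version is used here only to keep the sketch dependency-free.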
My algorithm represents an advancement in the integration of heterogeneous data to leverage association signals and complement GWAS results using networks. Its mathematical formulation could help in the development of further integrative approaches, and its results can complement state-of-the-art methods, mutually enhancing gene target discovery.
Version
Open Access
Date Issued
2021-03
Date Awarded
2021-10
Copyright Statement
Creative Commons Attribution NonCommercial Licence
Advisor
Sternberg, Michael
Sponsor
Imperial College London
Publisher Department
Life Sciences
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)