Gaussian Process in Computational Biology: Covariance Functions for Transcriptomics

Abstract

In machine learning, Gaussian process models are a widely used family of stochastic processes for modelling data observed over time, space or both. Gaussian process models are nonparametric, meaning that they are defined on an infinite-dimensional parameter space. The parameter space is then typically learnt as the set of all possible solutions for a given learning problem. Gaussian process distributions are distributions over functions, and the covariance function determines the properties of function samples drawn from the process. Once the decision to model with a Gaussian process has been made, the choice of covariance function is a central step in modelling.
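To illustrate how the covariance function shapes the functions drawn from a Gaussian process, the following is a minimal sketch and not code from the thesis. It assumes a squared-exponential (RBF) covariance and a zero mean; the helper names `rbf_kernel`, `sample_gp_prior` and `roughness`, and all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def sample_gp_prior(x, lengthscale, n_samples=3, jitter=1e-6, seed=0):
    """Draw function samples from a zero-mean GP prior evaluated at inputs x."""
    rng = np.random.default_rng(seed)
    # A small jitter on the diagonal keeps the Cholesky factorisation stable.
    K = rbf_kernel(x, x, lengthscale) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    # Each column of L @ z is one sample; transpose so rows are samples.
    return (L @ rng.standard_normal((len(x), n_samples))).T

def roughness(samples):
    """Mean absolute change between neighbouring inputs, averaged over samples."""
    return float(np.mean(np.abs(np.diff(samples, axis=1))))

x = np.linspace(0.0, 10.0, 50)
smooth = sample_gp_prior(x, lengthscale=3.0)   # long lengthscale: slowly varying
wiggly = sample_gp_prior(x, lengthscale=0.3)   # short lengthscale: rapidly varying
```

Only the lengthscale differs between the two calls, yet the samples in `wiggly` change far more between neighbouring inputs than those in `smooth`, which is the sense in which the covariance function determines the properties of the sampled functions.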

In molecular biology and genetics, a transcription factor is a protein that binds to specific DNA sequences and controls the flow of genetic information from DNA to mRNA. To develop models of cellular processes, quantitative estimation of the regulatory relationship between transcription factors and genes is a basic requirement. This estimation is difficult for several reasons: many transcription factors' activities and their own transcription levels are post-transcriptionally modified, and the expression levels of transcription factors are very often low and noisy. It is therefore useful to infer the activity of the transcription factors from the expression levels of their target genes. Here we developed a Gaussian process based nonparametric regression model to infer the exact transcription factor activities from a combination of mRNA expression levels and DNA-protein binding measurements.

Clustering of gene expression time series gives insight into which genes may be coregulated, allowing us to discern the activity of pathways in a given microarray experiment. Of particular interest is how a given group of genes varies with different conditions or genetic backgrounds. In this thesis, we developed a new clustering method that allows each cluster to be parametrized according to the behaviour of the genes across conditions, whether they are correlated or anti-correlated. By specifying the correlation between such genes, we gain more information within the cluster about how the genes interrelate. Our study shows the effectiveness of sharing information between replicates and different model conditions when modelling gene expression time series.
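One way to picture correlated versus anti-correlated behaviour across conditions is a coregionalized covariance, a term that appears in the thesis keywords. The sketch below is an assumption-laden illustration, not the thesis's clustering method: it assumes two conditions, an RBF covariance over time, and a 2x2 coregionalization matrix `B` whose off-diagonal entry `rho` sets the cross-condition correlation. The names `time_kernel`, `coregionalized_cov` and `sample_two_conditions` are hypothetical.

```python
import numpy as np

def time_kernel(t, lengthscale=2.0):
    """RBF covariance over time points t (assumed form, for illustration)."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * d ** 2 / lengthscale ** 2)

def coregionalized_cov(t, rho):
    """Joint covariance for two conditions: Kronecker product of B and K_t."""
    B = np.array([[1.0, rho], [rho, 1.0]])  # rho > 0 correlated, rho < 0 anti-correlated
    return np.kron(B, time_kernel(t))

def sample_two_conditions(t, rho, seed=0, jitter=1e-6):
    """Draw one joint sample and split it into the two conditions' time series."""
    K = coregionalized_cov(t, rho) + jitter * np.eye(2 * len(t))
    L = np.linalg.cholesky(K)
    f = L @ np.random.default_rng(seed).standard_normal(2 * len(t))
    return f[:len(t)], f[len(t):]

t = np.linspace(0.0, 10.0, 20)
cond_a_pos, cond_b_pos = sample_two_conditions(t, rho=0.9)    # tends to move together
cond_a_neg, cond_b_neg = sample_two_conditions(t, rho=-0.9)   # tends to move oppositely
```

In this construction the cross-condition block of the covariance is `rho` times the time covariance, so a single parameter per cluster would describe whether expression profiles in the two conditions rise and fall together or in opposition.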

Metadata

Supervisors:	Lawrence, Neil
Publicly visible additional information:	The final copy of the thesis was submitted in February 2018. The thesis was submitted to the examiner with minor corrections in September 2017, and no amendments have been made afterwards.
Keywords:	Gaussian Process, Kernel, Gene expression time series, Transcription factor activity, hierarchical clusters, Coregionalization, ALS
Awarding institution:	University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Identification Number/EthosID:	uk.bl.ethos.736571
Depositing User:	Mr Muhammad Arifur Rahman
Date Deposited:	19 Mar 2018 15:08
Last Modified:	12 Oct 2018 09:52
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:19460

Download

Filename: Muhammad_110121714_Final.pdf

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License
