Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer

Abstract

Clustering is widely used in ‘omics’ studies and is often tackled with standard methods such as hierarchical clustering or k-means, which are limited to a single data type. These methods are further limited by the need to select a cut-off point at a specific level of the dendrogram (a tree diagram) or to pre-define the number of clusters, respectively. The increasing need to integrate multiple data sets leads to a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data (for example, presence or absence of mutations, binding, motifs and/or epigenetic marks) and continuous data (for example, gene expression, protein abundance and/or metabolite levels).
In this work, we present a generic method based on a probabilistic model for clustering this mixture of data types, and illustrate its application to genetic regulation and the clustering of cancer samples. The method uses penalized maximum likelihood (ML) estimation of mixture model parameters, with an information criterion as the model selection objective function, and meta-heuristic searches for the optimum clusters. We tested the compatibility of several information criteria with our model-based joint clustering, including the well-known Akaike Information Criterion (AIC) and its empirically determined derivatives (AICλ), the Bayesian Information Criterion (BIC) and its derivative (CAIC), and the Hannan-Quinn Criterion (HQC). Experiments with simulated data show that AIC and AICλ (λ=2.5) work well with our method.
We show that the resulting clusters lead to useful hypotheses. In the case of genetic regulation, these concern the regulation of groups of genes by specific sets of transcription factors; in the case of cancer samples, they relate combinations of gene mutations to patterns of gene expression. The clusters have potential mechanistic significance and, in the latter case, are significantly linked to survival.
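
The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation or the FlexiCoClustering code. It assumes hard cluster assignments, independent Bernoulli parameters for each binary feature and independent Gaussian parameters for each continuous feature within a cluster, and textbook forms of the information criteria. In particular, the form AICλ = -2 ln L + λk is an assumption made here (the abstract does not define it), and the function names, the `EPS` floor and the toy data are hypothetical. The meta-heuristic search over candidate clusterings is not shown.

```python
import math
from collections import defaultdict

EPS = 1e-6  # assumed floor that avoids log(0) and zero variance


def joint_log_likelihood(binary, continuous, labels):
    """Log-likelihood of hard cluster labels under a joint model:
    Bernoulli per binary feature, Gaussian per continuous feature,
    with all parameters set to their ML estimates within each cluster."""
    n = len(labels)
    members = defaultdict(list)
    for i, lab in enumerate(labels):
        members[lab].append(i)

    loglik = 0.0
    for idx in members.values():
        m = len(idx)
        loglik += m * math.log(m / n)  # mixing-proportion term
        for j in range(len(binary[0])):
            p = sum(binary[i][j] for i in idx) / m
            p = min(max(p, EPS), 1 - EPS)
            for i in idx:
                loglik += math.log(p if binary[i][j] else 1 - p)
        for j in range(len(continuous[0])):
            mu = sum(continuous[i][j] for i in idx) / m
            var = max(sum((continuous[i][j] - mu) ** 2 for i in idx) / m, EPS)
            for i in idx:
                resid = continuous[i][j] - mu
                loglik += -0.5 * math.log(2 * math.pi * var) - resid ** 2 / (2 * var)

    k = len(members)
    # Per cluster: one p per binary feature, mean and variance per
    # continuous feature; plus k - 1 free mixing proportions.
    n_params = k * (len(binary[0]) + 2 * len(continuous[0])) + (k - 1)
    return loglik, n_params


def information_criteria(loglik, n_params, n, lam=2.5):
    """Penalized-likelihood scores (lower is better); requires n > 1."""
    base = -2.0 * loglik
    return {
        "AIC": base + 2 * n_params,
        "AIC_lambda": base + lam * n_params,  # assumed form of AICλ
        "BIC": base + n_params * math.log(n),
        "CAIC": base + n_params * (math.log(n) + 1),
        "HQC": base + 2 * n_params * math.log(math.log(n)),
    }


# Hypothetical toy data: six items, two binary and one continuous feature.
binary = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
continuous = [[0.9], [1.0], [1.1], [4.9], [5.0], [5.1]]

ll2, k2 = joint_log_likelihood(binary, continuous, [0, 0, 0, 1, 1, 1])
ll1, k1 = joint_log_likelihood(binary, continuous, [0, 0, 0, 0, 0, 0])
crit_two = information_criteria(ll2, k2, n=6)
crit_one = information_criteria(ll1, k1, n=6)
print(crit_two["AIC_lambda"] < crit_one["AIC_lambda"])
```

In this sketch a search procedure would propose candidate label vectors and keep the one with the lowest score under the chosen criterion, so the number of clusters does not have to be fixed in advance.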

Metadata

Supervisors:	Westhead, David Robert and Boyes, Joan
Related URLs:	Published article using PhD thesis findings (Publisher); FlexiCoClustering: a clustering software for mixed omics data types of continuous and binary nature (Research data)
Keywords:	clustering, mixture-model, next-generation sequencing, multi-omics, transcriptomics, genomics, ChIP-seq, RNA-seq, mutations, gene-expression, transcriptional regulatory networks, survival-analysis, cancer, AML, yeast
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Biological Sciences (Leeds) > Institute for Molecular and Cellular Biology (Leeds)
Identification Number/EthosID:	uk.bl.ethos.729461
Depositing User:	Miss Fatin Nurzahirah Zainul Abidin
Date Deposited:	05 Dec 2017 11:58
Last Modified:	25 Jul 2018 09:56
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:18883

Downloads

Final eThesis - complete (pdf)

Filename: PhD_Fatin_Zainul_Abidin_Thesis_2017.pdf

Description: pdf copy of Fatin N. Zainul Abidin thesis

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License


Supplementary Material

Filename: FlexiCoClusteringPackage-master.zip

Description: FlexiCoClustering software

Licence:
This work is licensed under a GNU GPL Licence


