Clustering Methodology for Bivariate Circular Data with Application to Protein Dihedral Angles

Abushilah, Samira Faisal Hathoot (2019) Clustering Methodology for Bivariate Circular Data with Application to Protein Dihedral Angles. PhD thesis, University of Leeds.

Abushilah_SFH_Mathematics_PhD_2019.pdf - Final eThesis - complete (pdf)
Restricted until 1 November 2024.

Abstract

This thesis focusses on the development of statistical methodologies that can deal with bivariate circular data in the context of protein bioinformatics. Circular data differ from traditional linear data, and statistical methods that handle the unique nature of this type of data are relatively new and still under development. Within circular data, we focus on the dihedral angles that describe the conformation of the protein backbone. There are many problems related to circular data, and in this research we focus on some of them. Although experimental biological techniques can determine the structure and function of proteins, such techniques are expensive and very time-consuming. Clustering of amino acids remains a challenging problem in protein bioinformatics; a good clustering can help to predict whether the substitution of one amino acid by another has an essential impact on the protein structure, and hence on its function. Various researchers have attempted to cluster amino acids using physical properties; we regard this as suboptimal when protein structure and function are the main interest. Therefore, we firstly propose a novel methodology to cluster groups of bivariate circular data, and this is used to cluster the 20 amino acids by considering the dissimilarity in the bivariate distributions of their dihedral angles. This dissimilarity can be expressed as the p-value of a permutation test for any pair of amino acids, and we use this to obtain our own clusters. This clustering is then compared to other amino acid classifications using similarity indices. The above-mentioned p-values are obtained by a permutation test, which is computationally expensive for large sample sizes. Consequently, we secondly consider two novel homogeneity tests and develop distributional results based on theoretical asymptotic considerations on the distribution of the newly proposed test statistic.
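As a rough illustration of how a permutation p-value for one pair of amino acids might be computed, the following Python sketch compares two samples of dihedral-angle pairs on the torus. It is a sketch under stated assumptions and not the thesis's own procedure: the function names (torus_dist, energy_stat, permutation_pvalue) are hypothetical, the energy-type statistic is assumed from the keyword list, and the toroidal distance used here is only one plausible choice.

```python
import numpy as np

def torus_dist(a, b):
    """Pairwise distances between rows of a (n, 2) and b (m, 2); angles in radians."""
    diff = np.abs(a[:, None, :] - b[None, :, :]) % (2 * np.pi)
    diff = np.minimum(diff, 2 * np.pi - diff)  # wrap each angular difference into [0, pi]
    return np.sqrt((diff ** 2).sum(axis=-1))

def energy_stat(d, idx_x, idx_y):
    """Energy-type two-sample statistic from a precomputed distance matrix d (assumed form)."""
    dxy = d[np.ix_(idx_x, idx_y)].mean()
    dxx = d[np.ix_(idx_x, idx_x)].mean()
    dyy = d[np.ix_(idx_y, idx_y)].mean()
    return 2 * dxy - dxx - dyy

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Permutation p-value for the hypothesis that x and y share one distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    d = torus_dist(pooled, pooled)  # computed once, reused for every permutation
    n = len(x)
    idx = np.arange(len(pooled))
    observed = energy_stat(d, idx[:n], idx[n:])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(idx)  # random relabelling of the pooled sample
        if energy_stat(d, perm[:n], perm[n:]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

The loop over permutations is what makes this approach slow for large samples: each of the n_perm relabellings re-averages blocks of an (n + m) by (n + m) distance matrix, and the matrix itself grows quadratically with the pooled sample size.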
The properties and distributions of our parametric tests are investigated and their performance is examined using simulated data (normal samples and von Mises samples). One of the tests is also applied to our real data, protein dihedral angles, for which clustering is carried out as before. It is also biologically important to know the properties of amino acids, because these characteristics affect the biological activity of a protein and its structure. In biochemistry, it is well known that the structure of some molecules, such as proteins, DNA and RNA, can be described in terms of conformational angles; for proteins, these can be the dihedral angles. Since each amino acid corresponds to a pair of dihedral angles, the pattern of the dihedral angle distribution across proteins is one way to determine amino acid characteristics. Therefore, we thirdly develop an approach to kernel density estimation on the torus, and this is used to estimate the distribution of the dihedral angles belonging to each amino acid across proteins. An initial step requires the choice of two smoothing parameters, which we investigate. Then, the estimated bivariate kernel density under the two smoothing parameters can be processed using mathematical morphology to partition the sample space of densities. By using this methodology, a researcher can divide bivariate circular data into groups without being given the number of clusters a priori.
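To make the idea of kernel density estimation on the torus with two smoothing parameters concrete, here is a minimal Python sketch. It assumes a product of two von Mises kernels, with concentrations kappa_phi and kappa_psi acting as the smoothing parameters. The abstract does not state which kernel the thesis uses, so the kernel choice, the function name torus_kde and the grid size are all illustrative assumptions.

```python
import numpy as np

def torus_kde(data, kappa_phi, kappa_psi, grid_size=72):
    """Evaluate a product von Mises kernel density estimate on a regular torus grid.

    data: array of shape (n, 2) holding (phi, psi) angle pairs in radians.
    Returns the 1-D grid and a (grid_size, grid_size) array with
    density[i, j] evaluated at (grid[i], grid[j]).
    """
    grid = np.linspace(-np.pi, np.pi, grid_size, endpoint=False)
    # Unnormalised von Mises kernels, one column per observation: shape (G, n)
    k_phi = np.exp(kappa_phi * np.cos(grid[:, None] - data[:, 0]))
    k_psi = np.exp(kappa_psi * np.cos(grid[:, None] - data[:, 1]))
    # Normalising constant of the product kernel, using the Bessel function I0
    norm = (2 * np.pi) ** 2 * np.i0(kappa_phi) * np.i0(kappa_psi)
    density = k_phi @ k_psi.T / (len(data) * norm)
    return grid, density
```

Larger concentrations give a rougher estimate with more local modes, and smaller ones a smoother estimate. Because the kernels are periodic, the estimate wraps correctly at the boundary of the angle range, which a standard planar kernel estimate would not do.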

Item Type: Thesis (PhD)
Keywords: Circular statistics, Protein dihedral angles, Permutation two-sample test, Energy statistic, Hierarchical clustering, Similarity indices, Kernel density estimation, Mathematical morphology, von Mises distribution
Academic Units: The University of Leeds > Faculty of Maths and Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds)
Depositing User: Dr Samira Faisal Hathoot Abushilah
Date Deposited: 01 Oct 2019 09:43
Last Modified: 01 Oct 2019 09:43
URI: http://etheses.whiterose.ac.uk/id/eprint/24730
