Vionanda, Dodi (2020) Classification Trees for high-dimensional highly-correlated data. PhD thesis, University of Leeds.
Abstract
This thesis focuses on building classification models to classify tumour subtypes of lung cancer using datasets of CNA estimates. As genomic datasets, these CNA estimates have many more variables than observations, together with blocks of correlated variables and several variables of very high variance. In terms of classification, a small number of variables act as strongly relevant predictors, while a much larger number act as weakly relevant predictors.

Applying Classification Trees to such data raises three issues. Firstly, when fitting Classification Trees to datasets with many more variables than observations, we only exploit the discriminatory information contained in a small number of variables, those that appear strongly relevant, and leave out a large number of others, including informative ones, even though some of these variables are important for prediction. Secondly, when the variables that carry the discriminative information have high variance, the resulting Classification Trees are less accurate. Finally, when the prediction error is estimated by cross-validation, the small sample size leads to unstable error rate estimates.

To address these issues we first use PCA as a dimensionality reduction method, applied prior to Classification Tree construction so as to reduce the data dimension while retaining the variation in the data. However, applying PCA does not improve the performance of Classification Trees in terms of the prediction error rate. PCA produces new variables as linear combinations of the original variables with maximum variance, so the resulting principal component scores may be dominated by variables of extremely high variance. In addition, PCA is strongly affected by the presence of many blocks of correlated variables.

We then apply ICA as a feature extraction method prior to Classification Tree construction. This does not improve the resulting Classification Trees either, for two reasons. Firstly, the whitening step applied before maximising non-Gaussianity in the estimation of the independent components greatly reduces the data dimension, just as PCA does, so we end up with results very similar to those obtained with PCA. Secondly, the ICA estimates are obtained by maximising the non-Gaussianity of the whitened variables; this maximisation targets non-Gaussianity alone and may not exploit the discriminative information contained in the dataset.
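As a rough sketch of the workflow just described (dimension reduction by PCA or ICA followed by a Classification Tree, with cross-validated error estimation), the following uses scikit-learn and simulated data in place of the CNA datasets; the sample size, number of components and choice of software are illustrative assumptions, not the choices made in the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Simulated stand-in for a CNA-style matrix: far more variables than observations
# (hypothetical data, not the Smooth CNA or DNACopy CNA datasets).
rng = np.random.default_rng(0)
n, p = 60, 2000
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)  # two tumour subtype labels (hypothetical)

# Feature extraction (PCA or ICA) followed by a Classification Tree.
pca_tree = Pipeline([("pca", PCA(n_components=10)),
                     ("tree", DecisionTreeClassifier(random_state=0))])
ica_tree = Pipeline([("ica", FastICA(n_components=10, random_state=0)),
                     ("tree", DecisionTreeClassifier(random_state=0))])

# Cross-validated error rate; with only n = 60 observations these estimates are unstable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("PCA + tree", pca_tree), ("ICA + tree", ica_tree)]:
    accuracy = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: estimated error rate = {1 - accuracy.mean():.3f}")
```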
Finally, we apply Random Forests to overcome the issues mentioned above. Random Forests select a random subset of variables for splitting each node of an individual tree, which allows weakly relevant predictors to be chosen as classifying variables. Apart from the number of variables considered at each split, there are two other tuning parameters: the number of trees to be grown and the number of observations in each terminal node of an individual tree. These hyperparameters should be tuned to obtain a Random Forest with a low error rate. For the number of observations in a terminal node we recommend using one observation. For the number of trees, our simulation studies indicate that at least 500 trees should be grown to obtain a stable error estimate. However, we cannot give a single recommendation for the number of variables considered at each split: for our datasets, this hyperparameter has to be set small for the Smooth CNA dataset, whereas for the DNACopy CNA dataset it should be set large to obtain Random Forests with a small error rate.

Comparing the models described above, we recommend Random Forests rather than Classification Trees, and we do not recommend PCA or ICA as dimension reduction methods. In terms of prediction error, Random Forests give the lowest error rate. In terms of the insight offered by the resulting classification model, Random Forests produce the more accurate classifiers: for the DNACopy dataset they identify the contribution of variables in both Chromosome 3 and Chromosome 10, whereas Classification Trees only recognise the contribution of those in Chromosome 10. In terms of computational time, however, Random Forests are expensive and take the longest time of all the models compared.

From a genetic point of view, the Classification Trees and Random Forests fitted to the Smooth CNA and DNACopy CNA datasets lead to different results. Classification Trees on the Smooth CNA dataset yield the genes SOX2 and PIK3CA as genetic markers, while Classification Trees on the DNACopy CNA dataset yield the genes KIF5B and RET. Random Forests on the Smooth CNA dataset give the same result as Classification Trees, but Random Forests on the DNACopy CNA dataset identify the genes KIF5B, RET, SOX2 and PIK3CA as genetic markers. The amplification of SOX2 and PIK3CA within loci 3q24 to 3q27.3 is common in squamous cell lung carcinoma, and the fusion between KIF5B at locus 10p11.22 and RET at locus 10q11.21 is common in lung adenocarcinoma.
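To make the tuning described above concrete, a hypothetical sketch follows, again with scikit-learn and simulated data rather than the thesis's software or CNA datasets. The three hyperparameters correspond here to n_estimators (number of trees, fixed at 500), min_samples_leaf (observations per terminal node, set to one) and max_features (the number of variables tried at each split, searched over a grid because no single value suits both datasets).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Simulated stand-in for a high-dimensional CNA-like matrix (hypothetical, not the thesis data).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2000))
y = rng.integers(0, 2, size=60)  # two tumour subtype labels (hypothetical)

rf = RandomForestClassifier(
    n_estimators=500,     # at least 500 trees for a stable error estimate
    min_samples_leaf=1,   # one observation per terminal node
    random_state=0,
)

# max_features plays the role of the number of variables per split; search a grid
# from very few variables up to a large fraction of them.
grid = GridSearchCV(
    rf,
    {"max_features": [0.005, 0.02, "sqrt", 0.2, 0.5]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print("selected max_features:", grid.best_params_["max_features"])
print("cross-validated error rate:", round(1 - grid.best_score_, 3))
```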
Metadata
| Supervisors: | Gusnanto, Arief and Voss, Jochen and Taylor, Charles C |
| --- | --- |
| Awarding institution: | University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Maths and Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds) |
| Depositing User: | Mr Dodi Vionanda |
| Date Deposited: | 24 Mar 2021 15:29 |
| Last Modified: | 24 Mar 2021 15:29 |
Download
Final eThesis - complete (pdf)
Embargoed until: 1 March 2026
Filename: Dodi Vionanda.pdf
