Vionanda, Dodi (2020) Classification Trees for high-dimensional highly-correlated data. PhD thesis, University of Leeds.
Abstract
This thesis focuses on building classification models to classify tumour subtypes of lung cancer using datasets of CNA estimates. As genomic datasets, these CNA estimates have many more variables than observations, together with blocks of correlated variables and several variables of very high variance. In terms of classification, a small number of variables act as strongly relevant predictors, while a much larger number act as weakly relevant predictors.

Applying Classification Trees to such data raises three issues. Firstly, when fitting Classification Trees to datasets with many more variables than observations, we only exploit the discriminatory information contained in a small number of variables, those that appear strongly relevant, and leave out a large number of others, including informative ones, even though some of these variables are important for prediction. Secondly, when the variables that carry the discriminative information have high variance, the resulting Classification Trees are less accurate. Finally, when the prediction error is estimated by cross-validation, the small sample size leads to unstable error rate estimates.

To address these issues we first use PCA as a dimensionality reduction method, applied prior to Classification Tree construction so as to reduce the data dimension while retaining the variation in the data. However, applying PCA does not improve the performance of Classification Trees in terms of the prediction error rate. PCA produces new variables as linear combinations of the original variables with maximum variance, so the resulting principal component scores may be dominated by variables of extremely high variance. In addition, PCA is strongly affected by the presence of many blocks of correlated variables.

We then apply ICA as a feature extraction method prior to Classification Tree construction. This does not improve the resulting Classification Trees either, for two reasons. Firstly, the whitening step applied before maximising non-Gaussianity in the estimation of the independent components greatly reduces the data dimension, just as PCA does, so we end up with results very similar to those obtained with PCA. Secondly, the ICA estimates are obtained by maximising the non-Gaussianity of the whitened variables; this maximisation targets non-Gaussianity alone and may not exploit the discriminative information contained in the dataset.
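As a rough sketch of the workflow just described (dimension reduction by PCA or ICA followed by a Classification Tree, with cross-validated error estimation), the following uses scikit-learn and simulated data in place of the CNA datasets; the sample size, number of components and choice of software are illustrative assumptions, not the choices made in the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Simulated stand-in for a CNA-style matrix: far more variables than observations
# (hypothetical data, not the Smooth CNA or DNACopy CNA datasets).
rng = np.random.default_rng(0)
n, p = 60, 2000
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)  # two tumour subtype labels (hypothetical)

# Feature extraction (PCA or ICA) followed by a Classification Tree.
pca_tree = Pipeline([("pca", PCA(n_components=10)),
                     ("tree", DecisionTreeClassifier(random_state=0))])
ica_tree = Pipeline([("ica", FastICA(n_components=10, random_state=0)),
                     ("tree", DecisionTreeClassifier(random_state=0))])

# Cross-validated error rate; with only n = 60 observations these estimates are unstable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("PCA + tree", pca_tree), ("ICA + tree", ica_tree)]:
    accuracy = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: estimated error rate = {1 - accuracy.mean():.3f}")
```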
Finally, we apply Random Forests to overcome the issues mentioned above. Random Forests select a random subset of variables for splitting each node of an individual tree, which allows weakly relevant predictors to be chosen as classifying variables. Apart from the number of variables considered at each split, there are two other tuning parameters: the number of trees to be grown and the number of observations in each terminal node of an individual tree. These hyperparameters should be tuned to obtain a Random Forest with a low error rate. For the number of observations in a terminal node we recommend using one observation. For the number of trees, our simulation studies indicate that at least 500 trees should be grown to obtain a stable error estimate. However, we cannot give a single recommendation for the number of variables considered at each split: for our datasets, this hyperparameter has to be set small for the Smooth CNA dataset, whereas for the DNACopy CNA dataset it should be set large to obtain Random Forests with a small error rate.

Comparing the models described above, we recommend Random Forests rather than Classification Trees, and we do not recommend PCA or ICA as dimension reduction methods. In terms of prediction error, Random Forests give the lowest error rate. In terms of the insight offered by the resulting classification model, Random Forests produce the more accurate classifiers: for the DNACopy dataset they identify the contribution of variables in both Chromosome 3 and Chromosome 10, whereas Classification Trees only recognise the contribution of those in Chromosome 10. In terms of computational time, however, Random Forests are expensive and take the longest time of all the models compared.

From a genetic point of view, the Classification Trees and Random Forests fitted to the Smooth CNA and DNACopy CNA datasets lead to different results. Classification Trees on the Smooth CNA dataset yield the genes SOX2 and PIK3CA as genetic markers, while Classification Trees on the DNACopy CNA dataset yield the genes KIF5B and RET. Random Forests on the Smooth CNA dataset give the same result as Classification Trees, but Random Forests on the DNACopy CNA dataset identify the genes KIF5B, RET, SOX2 and PIK3CA as genetic markers. The amplification of SOX2 and PIK3CA within loci 3q24 to 3q27.3 is common in squamous cell lung carcinoma, and the fusion between KIF5B at locus 10p11.22 and RET at locus 10q11.21 is common in lung adenocarcinoma.
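To make the tuning described above concrete, a hypothetical sketch follows, again with scikit-learn and simulated data rather than the thesis's software or CNA datasets. The three hyperparameters correspond here to n_estimators (number of trees, fixed at 500), min_samples_leaf (observations per terminal node, set to one) and max_features (the number of variables tried at each split, searched over a grid because no single value suits both datasets).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Simulated stand-in for a high-dimensional CNA-like matrix (hypothetical, not the thesis data).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2000))
y = rng.integers(0, 2, size=60)  # two tumour subtype labels (hypothetical)

rf = RandomForestClassifier(
    n_estimators=500,     # at least 500 trees for a stable error estimate
    min_samples_leaf=1,   # one observation per terminal node
    random_state=0,
)

# max_features plays the role of the number of variables per split; search a grid
# from very few variables up to a large fraction of them.
grid = GridSearchCV(
    rf,
    {"max_features": [0.005, 0.02, "sqrt", 0.2, 0.5]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print("selected max_features:", grid.best_params_["max_features"])
print("cross-validated error rate:", round(1 - grid.best_score_, 3))
```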
Metadata
| Supervisors: | Gusnanto, Arief and Voss, Jochen and Taylor, Charles C |
| --- | --- |
| Awarding institution: | University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Maths and Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds) |
| Depositing User: | Mr Dodi Vionanda |
| Date Deposited: | 24 Mar 2021 15:29 |
| Last Modified: | 24 Mar 2021 15:29 |
Download
Final eThesis - complete (pdf)
Embargoed until: 1 March 2026
Filename: Dodi Vionanda.pdf
