Ummi, Maharani Ahsani ORCID: https://orcid.org/0000-0001-9368-6678 (2023) Multiscale copy number alteration analysis using wavelets. PhD thesis, University of Leeds.
Abstract
The need for multiscale modelling arises because measured data rarely contain contributions at a single scale. For example, a typical signal from an experimental process may contain contributions from a variety of sources, such as noise and faults. These features usually occur with different localisation and at different locations in time and frequency. The same is true of copy number data from DNA sequencing. Identifying Copy Number Alteration (CNA) from a sample cell is difficult because of errors, different sizes of reads being recorded, infiltration from normal cells, and different sizes of test and normal genomes. A multiscale representation of the measurements therefore offers efficient feature extraction or noise removal from a typical process signal.
Wavelets are one of the powerful tools used to extract the multiscale characteristics of observed data. Wavelets are mathematical expansions that transform data from the time domain into layers at different frequency levels. In this thesis, wavelets are used, first, to segment the CNA data into regions of equal copy number and, secondly, to extract useful information from the original data for better prediction of tumour subtypes. For the first purpose, an approach called the TGUHm method is presented, which applies the tail-greedy unbalanced Haar (TGUH) wavelet transform to perform segmentation of CNA data. The 'unbalanced' characteristic of the TGUH approach has the advantage that the data length does not have to be a power of two, as it does in the traditional discrete Haar wavelet method. An additional benefit is that it addresses a problem that commonly arises in Haar wavelet estimation, where the estimator is more likely to detect jumps at dyadic locations, which might not be the actual locations of the jumps or drops in the true underlying CNA pattern.
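The power-of-two restriction and the preference for dyadic locations can be illustrated with a minimal sketch of the traditional (balanced) discrete Haar transform. This is not the TGUH transform or the TGUHm method from the thesis; the function names `haar_dwt` and `count_nonzero_details` and the toy signals are assumptions made for illustration only.

```python
import math

def haar_dwt(x):
    """Orthonormal balanced Haar transform of a list with power-of-two length.

    Returns (details, smooth): details[0] is the finest scale and smooth is
    the single remaining scaling coefficient.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("the balanced Haar transform needs a power-of-two length")
    details = []
    s = [float(v) for v in x]
    while len(s) > 1:
        pairs = list(zip(s[0::2], s[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        s = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, s[0]

def count_nonzero_details(x, tol=1e-12):
    """Number of detail coefficients needed to represent x exactly."""
    details, _ = haar_dwt(x)
    return sum(abs(d) > tol for level in details for d in level)

# Hypothetical piecewise-constant signals of length 16 with a single jump.
dyadic_jump = [0.0] * 8 + [1.0] * 8        # jump at the midpoint (dyadic)
off_dyadic_jump = [0.0] * 5 + [1.0] * 11   # jump after the fifth point

print(count_nonzero_details(dyadic_jump))      # 1
print(count_nonzero_details(off_dyadic_jump))  # 4
```

In this toy case a jump at a dyadic location is captured by a single detail coefficient, whereas a jump elsewhere is spread over one coefficient per scale, which is one way to see why thresholded balanced Haar estimators tend to favour dyadic change-points.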
The TGUHm method is then applied to the existing data-driven wavelet-Fisz methodology to deal with the heteroscedastic noise problem that is often found in CNA data. In practice, real CNA data deviate from the homoscedastic noise assumption and show some dependence of the variance on the mean value. The proposed method performs variance stabilisation to bring the problem into a homoscedastic model before applying a denoising procedure. The use of the unbalanced Haar wavelet also makes it possible to estimate short segments better than balanced Haar wavelet-based segmentation methods. Moreover, our simulation study indicates that the proposed methodology has substantial advantages in estimating both short and long altered segments in copy number data with heteroscedastic error variance.
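The idea of stabilising the variance at the wavelet level can be sketched with a simple Fisz-type transform. The sketch below is not the data-driven wavelet-Fisz method or its TGUH-based version described above: it uses the balanced Haar transform, requires a power-of-two length, and assumes the variance equals the mean (a Poisson-like model), all of which are simplifying assumptions. The function name `haar_fisz` is hypothetical.

```python
import math

def haar_fisz(x):
    """Fisz-type variance stabilisation for non-negative data.

    Assumes variance equal to the mean (Poisson-like); this assumption is
    made for the sketch only and is not estimated from the data.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two length")
    # Forward pass: local means and local half-differences at each scale.
    s = [float(v) for v in x]
    fisz_levels = []
    while len(s) > 1:
        pairs = list(zip(s[0::2], s[1::2]))
        means = [(a + b) / 2 for a, b in pairs]
        diffs = [(a - b) / 2 for a, b in pairs]
        # Fisz step: divide each difference by the square root of its local mean.
        fisz_levels.append([d / math.sqrt(m) if m > 0 else 0.0
                            for d, m in zip(diffs, means)])
        s = means
    # Inverse pass: rebuild the signal from the stabilised differences,
    # starting at the coarsest scale.
    for f in reversed(fisz_levels):
        s = [v for m, d in zip(s, f) for v in (m + d, m - d)]
    return s

counts = [3, 5, 10, 14, 2, 0, 7, 9]   # hypothetical count data
stabilised = haar_fisz(counts)
```

A denoising procedure that assumes constant noise variance could then be applied to `stabilised`; mapping the result back to the original scale needs the inverse transform, which is omitted here.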
For the second purpose, a wavelet-based classification framework is proposed, which employs the non-decimated Haar wavelet transform to extract localised differences and means of the original data at several scales. The wavelet transformation decomposes the original data into detail (localised difference) and scaling (localised mean) coefficients at different resolution levels. This makes it possible to discover hidden features or information that are difficult to find from the original data alone. Each resolution level corresponds to a different length of wavelet basis, and by considering which levels are most useful in a model, the length of the region responsible for the prediction can be identified.
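The decomposition into localised differences and localised means can be sketched as follows. This is a minimal illustration, assuming periodic boundary handling and an unnormalised (mean-based) scaling of the coefficients; the thesis may use different conventions, and the function name `nondecimated_haar` is hypothetical.

```python
def nondecimated_haar(x, levels):
    """Non-decimated Haar transform with periodic boundary handling.

    Returns a list of (detail, scaling) pairs, one per level, finest first.
    Every level keeps len(x) coefficients, so each coefficient stays aligned
    with a position in the original data.
    """
    n = len(x)
    scaling = [float(v) for v in x]
    out = []
    for j in range(levels):
        shift = 2 ** j  # distance between the two half-windows at this level
        shifted = [scaling[(t + shift) % n] for t in range(n)]
        # Detail: half the difference between the two half-window means.
        detail = [(a - b) / 2 for a, b in zip(scaling, shifted)]
        # Scaling: mean of the data over a window of length 2 ** (j + 1).
        scaling = [(a + b) / 2 for a, b in zip(scaling, shifted)]
        out.append((detail, scaling))
    return out

# Hypothetical copy-number-like profile with one step change.
profile = [0, 0, 0, 0, 1, 1, 1, 1]
coeffs = nondecimated_haar(profile, levels=2)
finest_detail, finest_scaling = coeffs[0]
```

Because no level is subsampled, the coefficients from every level could be stacked as candidate predictors in a classifier, and the level of a selected predictor indicates the window length (2, 4, 8, ... positions) over which the signal is informative.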
Metadata
Supervisors: | Gusnanto, Arief and Barber, Stuart |
---|---|
Keywords: | copy number alteration, wavelets analysis, change-points detection, piecewise-constant estimators, lung cancer, logistic regression |
Awarding institution: | University of Leeds |
Academic Units: | The University of Leeds > Faculty of Maths and Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds) |
Depositing User: | Ms Maharani Ahsani Ummi |
Date Deposited: | 11 Oct 2023 14:57 |
Last Modified: | 11 Oct 2023 14:57 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:33540 |
Download
Final eThesis - complete (pdf)
Filename: Ummi_MA_Mathematics_PhD_2023.pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial ShareAlike 4.0 International License