Yousefi, Fariba (2021) Gaussian Processes for Data Scarcity Challenges. PhD thesis, University of Sheffield.
Abstract
This thesis focuses on Gaussian process models specifically designed for scarce data problems. Data scarcity can be a weak spot for many machine learning algorithms. Nevertheless, it is common in a diverse set of applications such as medicine, quality assurance, and remote sensing. Supervised classification algorithms can require large amounts of labeled data, and fulfilling this requirement is not straightforward.
In medicine, breast cancer datasets typically contain few cancerous cells and many healthy ones. This imbalance makes it difficult for learning algorithms to learn the differences between cancerous and healthy cells. A similar imbalance exists in the quality assurance industry, in which the ratio of faulty to non-faulty cases is very low. In sensor networks, and in particular those which measure air pollution across cities, combining sensors of different qualities can help fill gaps in what is often a very data-scarce landscape.
For data-scarce scenarios, we present a probabilistic latent variable model that can cope with imbalanced data. By incorporating label information, we develop a kernel that captures the shared and private characteristics of the data separately. For cases where no labels are available, we propose an active-learning technique based on a Gaussian process classifier, with an oracle in the loop that annotates only the data about which the algorithm is uncertain. Finally, when disparate data types with different levels of granularity are available, we propose a transfer-learning approach. We show that jointly modeling data of various granularities helps improve the prediction of rare data.
The developed methods are demonstrated in experiments with real and synthetic data. The results presented in this thesis show that the developed methods improve prediction for scarce data problems with various granularities.
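As a rough illustration of the oracle-in-the-loop idea described in the abstract, the following is a minimal sketch and not the method developed in the thesis. It assumes a one-dimensional input, an RBF kernel with fixed hyperparameters, and a simplified classifier that regresses on ±1 labels with a Gaussian process and squashes the prediction through a probit link, standing in for a full Gaussian process classifier. The names `class_probability`, `active_learning`, `pool`, and `oracle` are hypothetical and chosen for illustration only.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel on scalars (fixed hyperparameters, an assumption)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def class_probability(X, y, x_star, noise=0.1):
    """Approximate P(label = +1 | x_star) from GP regression on +/-1 labels."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y))
    k_star = [rbf(a, x_star) for a in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_star, x_star) - sum(t * t for t in v), 0.0)
    z = mean / math.sqrt(1.0 + var)               # probit link with variance correction
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def active_learning(pool, oracle, seed_indices, budget):
    """Repeatedly ask the oracle to label the pool point the model is least sure about."""
    labeled = {i: oracle(pool[i]) for i in seed_indices}
    for _ in range(budget):
        X = [pool[i] for i in labeled]
        y = [labeled[i] for i in labeled]
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Most uncertain point: predicted probability closest to 0.5.
        query = min(candidates,
                    key=lambda i: abs(class_probability(X, y, pool[i]) - 0.5))
        labeled[query] = oracle(pool[query])
    return labeled

# Toy usage: 21 points on [-5, 5]; the hypothetical oracle labels x > 0.7 as +1.
pool = [i * 0.5 - 5.0 for i in range(21)]
oracle = lambda x: 1.0 if x > 0.7 else -1.0
labeled = active_learning(pool, oracle, seed_indices=[0, 20], budget=5)
```

In this toy setup only the queried points are ever labeled, so the annotation cost is the two seed points plus the query budget rather than the whole pool.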
Metadata
Supervisors: | Mauricio Alvarez and Neil Lawrence |
---|---|
Keywords: | Gaussian process, multi-task learning, multi-output GPs, imbalanced data, data scarcity |
Awarding institution: | University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Engineering (Sheffield) |
Identification Number/EthosID: | uk.bl.ethos.826859 |
Depositing User: | Ms Fariba Yousefi |
Date Deposited: | 28 Mar 2021 14:15 |
Last Modified: | 01 May 2021 09:54 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:28688 |
Download
Final eThesis - complete (pdf)
Filename: Yousefi_thesis.pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License