Hu, Peiyun (2023) Classification of high-dimensional mislabelled data and online algorithms for high-dimensional streaming data. PhD thesis, University of York.
Abstract
Motivated by extensive discussions and applications of big data, we delve into the realm of sparse data, specifically high-dimensional data characterised by a larger number of predictors than observations. The advantages and challenges associated with high-dimensional data have been thoroughly discussed (Donoho et al., 2000). Our research primarily focuses on addressing the challenges in two prevalent domains: classification with mislabelled data and online algorithms for streaming data. To overcome these challenges, we incorporate regularisation methods and utilise Sure Independence Screening (SIS) and Iterative Sure Independence Screening (ISIS) (Fan and Lv, 2008; Fan and Song, 2010). In Chapter 3, we introduce a two-step estimation method using resampling for classification with mislabelling, which is more cost-effective than conventional data cleansing. Simulations reveal that direct training on corrupted datasets leads classifiers such as Logistic Regression (LR) to perform akin to random guessing. Our method greatly improves the performance of the LR classifier, matching that of classifiers trained on perfectly labelled datasets. Notably, our method aligns closely with the performance of the Bayes classifier in diverse contexts. Real data analysis, using a deliberately mislabelled Framingham Heart Study dataset, shows that our classifier outperforms one trained on the raw mislabelled data and is comparable with one trained on correctly labelled data. In Chapter 4, we explore incremental algorithms for streaming data, focusing on Generalised Linear Models (GLMs). Our methods match offline techniques in both low- and high-dimensional analyses while being more computationally efficient. A highlight of our approach is that it avoids storing specific data, which saves resources and improves data security.
Analysing data from the National Automotive Sampling System Crashworthiness Data System showcases our method’s superiority in estimation accuracy, variable selection, and model interpretation. Our technique significantly outperforms those neglecting variable selection and aligns with conventional offline methods.
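For readers unfamiliar with the screening step named in the abstract, the following is a minimal sketch of plain SIS in the sense of Fan and Lv (2008): predictors are ranked by the magnitude of their marginal association with the response and only the top d are kept. It is an illustration under stated assumptions, not the thesis's implementation. The function name `sis_screen`, the synthetic data, and the default d = n / log(n) (one of the sizes suggested by Fan and Lv) are choices made here for the example.

```python
import numpy as np


def sis_screen(X, y, d=None):
    """Keep the d predictors with the largest absolute marginal association with y.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    n, p = X.shape
    if d is None:
        # Assumed default: d = n / log(n), one size suggested by Fan and Lv (2008).
        d = max(1, int(n / np.log(n)))
    d = min(d, p)

    # Standardise each column so the scores are comparable across predictors.
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    Xs = Xc / sd

    # Score is proportional to the absolute marginal correlation with y.
    yc = y - y.mean()
    score = np.abs(Xs.T @ yc) / n

    keep = np.argsort(score)[::-1][:d]
    return np.sort(keep), score


# Synthetic illustration: n = 100 observations, p = 1000 predictors,
# only the first three predictors carry signal.
rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -3.0, 3.0]
y = X @ beta + rng.standard_normal(n)

selected, score = sis_screen(X, y)
print(len(selected), "predictors retained out of", p)
```

After screening, a regularised model would typically be fitted on the retained columns only, for example `X[:, selected]`; the iterative variant (ISIS) repeats the screening on residuals and is not shown here.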
Metadata
Supervisors: | Zhang, Wenyang |
---|---|
Keywords: | Classification with mislabelled data; High-dimensional classification; Incremental algorithms; High-dimensional streaming data; Variable selections |
Awarding institution: | University of York |
Academic Units: | The University of York > Mathematics (York) |
Depositing User: | Ms. Peiyun Hu |
Date Deposited: | 22 Sep 2023 14:25 |
Last Modified: | 22 Sep 2024 00:05 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:33536 |
Download
Examined Thesis (PDF)
Filename: Hu PhD Thesis.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License