Partial Least Squares Regression for High Dimensional and Correlated Data

Abstract

This thesis investigates partial least squares (PLS) methodology for high-dimensional correlated data. Current developments in technology have enabled experiments to produce data characterised by, first, a number of variables that far exceeds the number of observations and, second, variables that are substantially correlated with one another. Such data are common in chemometrics, where absorbance levels of chemical samples are recorded across hundreds of wavelengths in the calibration of a near-infrared (NIR) spectrometer. They are also common in genomics, where copy number alterations (CNA) are recorded across thousands of genomic regions from cancer patients. PLS is a well-known method for the analysis of high-dimensional data, used as a regression method for chemometric data and as a classification method for genomic data. It handles these characteristics of the data by constructing latent variables, called components, to represent the original variables. However, the application of PLS to such analyses presents several challenges, and in this research we investigate several areas to address them. The first challenge is that there are three main PLS algorithms, with potentially different interpretations of the relevant quantities. We address this by consolidating the three algorithms and identifying the case in which they give the same estimates. The second challenge is the unusual negative shrinkage factors (or "filter factors") that PLS exhibits in model fitting. One of the main reasons PLS can deal with high-dimensional data is that its estimates are shrunk. Unlike ridge regression or principal component regression, whose shrinkage factors lie between zero and one, PLS can have shrinkage factors that exceed one or are even negative (hence "filter factors" is a more appropriate name than "shrinkage factors"). To our knowledge, there has been no previous meaningful investigation of negative filter factors (NFF) in PLS. In this research we present a novel result in which we identify the condition under which NFF occur and, to gain insight, investigate the characteristics of the data that are associated with NFF. Lastly, the main challenge in applying PLS is the interpretation of the weights associated with the predictors. With hundreds or thousands of predictors, every predictor variable receives a non-zero weight. However, we expect that only some predictor variables contribute to the association with the outcome variable. We therefore resort to sparse estimation of the predictor weights, in which some weights are estimated as zero and the others are non-zero. A (standard) lasso estimation has a weakness in dealing with correlated variables: it tends to select a single variable from within a correlation "block", with no clear basis for the choice. A novel approach is needed that takes into account the dependencies between predictor variables when estimating the weights. We propose a new method in which a new penalty function is introduced into the likelihood function associated with the estimation of the weights. The penalty function combines a lasso penalty, which imposes sparsity, with a penalty based on the Cauchy distribution with a smoother matrix, which takes into account dependencies between genomic regions. The results show that the estimates of the weights are sparse: many weights are estimated as zero, and the non-zero estimates are grouped and exhibit smoothness within groups. Interpretation in terms of genomic regions becomes easy, and the identification of important regions for each component can be done simultaneously with prediction in a single modelling framework. We also investigate the relation between PLS and graphical modelling, using the information in the weights to construct the graph, but this did not yield successful results.
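The contrast between the filter factors of ridge regression, principal component regression and PLS can be illustrated numerically. The following is a minimal sketch and is not taken from the thesis. It assumes a single response, centred data, and more observations than predictors so that the ordinary least squares (OLS) estimate exists. It uses the standard Krylov-subspace form of the PLS estimate, and defines the filter factor for eigenvector u_i of X'X as the ratio of the estimate's coordinate along u_i to the OLS coordinate. The function names, the simulated data and the ridge penalty `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def krylov_pls_coef(X, y, m):
    """PLS1 coefficients with m components, using the Krylov-subspace form."""
    A, b = X.T @ X, X.T @ y
    # Basis of span{b, Ab, ..., A^(m-1) b}; columns rescaled for stability.
    cols = [b / np.linalg.norm(b)]
    for _ in range(m - 1):
        v = A @ cols[-1]
        cols.append(v / np.linalg.norm(v))
    Q, _ = np.linalg.qr(np.column_stack(cols))
    # Least squares restricted to the subspace spanned by Q.
    return Q @ np.linalg.solve(Q.T @ A @ Q, Q.T @ b)


def filter_factors(X, y, beta):
    """Ratio of beta to the OLS estimate along each eigenvector of X'X."""
    lam, U = np.linalg.eigh(X.T @ X)  # eigenvalues in ascending order
    return lam * (U.T @ beta) / (U.T @ (X.T @ y))


# Simulated data with correlated predictors (illustrative assumption).
rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.normal(size=(n, 1)) + 0.5 * rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
y -= y.mean()

A, b = X.T @ X, X.T @ y
lam, U = np.linalg.eigh(A)
beta_ols = np.linalg.solve(A, b)

# Ridge regression with an arbitrary penalty k.
k = 5.0
beta_ridge = np.linalg.solve(A + k * np.eye(p), b)

# Principal component regression keeping the two leading components.
U2 = U[:, -2:]
beta_pcr = U2 @ ((U2.T @ b) / lam[-2:])

f_ridge = filter_factors(X, y, beta_ridge)
f_pcr = filter_factors(X, y, beta_pcr)
f_pls1 = filter_factors(X, y, krylov_pls_coef(X, y, 1))
f_pls2 = filter_factors(X, y, krylov_pls_coef(X, y, 2))
f_pls_full = filter_factors(X, y, krylov_pls_coef(X, y, p))

for name, f in [("ridge", f_ridge), ("PCR", f_pcr), ("PLS, 1 comp.", f_pls1),
                ("PLS, 2 comp.", f_pls2), ("PLS, p comp.", f_pls_full)]:
    print(f"{name:>13}: {np.round(f, 3)}")
```

In this sketch the ridge factors lie strictly between zero and one and the PCR factors are exactly zero or one, whereas the one-component PLS factor for the largest eigenvalue is at least one. Whether any PLS factor turns negative depends on the data and on the number of components, so the printed values should be inspected rather than assumed.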

Metadata

Supervisors:	Gusnanto, Arief and Taylor, Charles
Keywords:	PLS, shrinkage factors, filter factors, negative filter factors, smoothed sparse PLS, graphical modelling
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Maths and Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds)
Depositing User:	Mr Mohammed A. A. Alshahrani
Date Deposited:	18 Oct 2019 12:07
Last Modified:	02 Sep 2024 08:05
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:24587

Download

Final eThesis - complete (pdf)

Filename: thesis.pdf

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License

