# Feature Selection from Higher Order Correlations

Zhang, Zhihong (2012) Feature Selection from Higher Order Correlations. PhD thesis, University of York.

This thesis addresses problems in feature selection, focusing in particular on selecting features from higher order correlations. To this end, we present two supervised feature selection approaches, named \emph{Graph based Information-theoretic Feature Selection} and \emph{Hypergraph based Information-theoretic Feature Selection} respectively, which are capable of considering third or even higher order dependencies between the relevant features and of capturing the optimal size of the relevant feature subset. Furthermore, we develop two unsupervised feature selection methods which evaluate features jointly rather than individually, so that larger feature combinations are considered. The reason for this is that although an individual feature may have limited relevance to a particular class, when taken in combination with other features it may be strongly relevant to that class.

In Chapter $2$, we thoroughly review the literature on classifier-independent (filter-based) feature selection methods. One dominant direction of research in this area is exemplified by the so-called information-theoretic feature selection criteria, which measure the mutual dependence of two variables. Another influential direction is the graph-based feature selection methods, which select the features that best preserve the data similarity or a manifold structure derived from the entire feature set. We observe that most existing feature selection methods evaluate features individually or only consider pairwise feature interactions, and hence cannot handle redundant features. Another shortcoming of existing feature selection methods is that most of them select features greedily and do not provide a direct measure for judging whether or not to add further features; they therefore require the user to supply the number of selected features in advance.
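The mutual dependence measure underlying the information-theoretic criteria mentioned above can be sketched as follows. This is a minimal, generic estimator of mutual information for discrete observations, not code from the thesis:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# A feature that perfectly determines a balanced binary label carries
# I(X;Y) = H(Y) = 1 bit; an unrelated feature carries 0 bits.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

Criteria in this family rank each feature by its mutual information with the class labels, which is precisely the pairwise limitation the thesis sets out to overcome.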
However, in real applications it is hard to estimate the number of useful features before the feature selection process. This thesis addresses these weaknesses and fills a gap in the literature on selecting features from higher order correlations.

In Chapter $3$, we propose a graph-based information-theoretic approach to feature selection. There are three novel ingredients. First, by using mutual information (MI) as a pairwise feature similarity measure, we establish a novel feature graph framework that characterizes the informativeness between pairs of features. Secondly, we locate the relevant feature subset (RFS) in the feature graph by maximizing the features' average pairwise relevance. The RFS is expected to have little redundancy and very strong discriminating power. This strategy reduces the search space from the original feature set to the relatively smaller relevant feature subset, and thus enables efficient computation. Finally, based on the RFS, we evaluate the importance of unselected features using a new information-theoretic criterion referred to as the multidimensional interaction information (MII). The advantage of MII is that it goes beyond pairwise interactions and considers third or higher order feature interactions. As a result, we can evaluate features jointly, and thus avoid the redundancies arising in individual feature combinations. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard data-sets.

In Chapter $4$, we find that in some situations the graph representation of relational patterns can lead to a substantial loss of information, because in real-world problems objects and their features tend to exhibit multiple relationships rather than simple pairwise ones. This motivates us to establish a feature hypergraph (rather than a feature graph) to characterize the multiple relationships among features.
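The feature-graph construction step of Chapter $3$ can be illustrated with a simplified sketch: edges are weighted by pairwise MI between features, and a subset with high average pairwise weight is grown greedily. This is an illustrative toy, not the thesis's exact RFS optimization (the names `pairwise_mi` and `relevant_feature_subset` are ours):

```python
import math
from collections import Counter
from itertools import combinations

def pairwise_mi(a, b):
    """Mutual information (bits) between two discrete feature vectors."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def relevant_feature_subset(features, k):
    """Greedy sketch: grow a k-feature subset keeping average pairwise MI high."""
    m = len(features)
    # MI-weighted feature graph (adjacency matrix)
    W = [[pairwise_mi(features[i], features[j]) for j in range(m)] for i in range(m)]
    # seed with the heaviest edge, then add the feature most related to the subset
    i0, j0 = max(combinations(range(m), 2), key=lambda e: W[e[0]][e[1]])
    subset = [i0, j0]
    while len(subset) < k:
        best = max((i for i in range(m) if i not in subset),
                   key=lambda i: sum(W[i][j] for j in subset))
        subset.append(best)
    return sorted(subset)

f0 = [0, 0, 1, 1, 0, 1]
f1 = [0, 0, 1, 1, 0, 1]   # strongly related to f0
f2 = [0, 1, 0, 1, 1, 0]   # weakly related to the others
print(relevant_feature_subset([f0, f1, f2], 2))  # [0, 1]
```

In the thesis, this graph stage only narrows the search space; the actual joint evaluation of candidate features then uses MII rather than pairwise MI.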
We draw on recent work on hypergraph clustering to select the most informative feature subset (mIFS) from a set of objects using high-order (rather than pairwise) similarities. There are two novel ingredients. First, we use MII to measure the significance of different feature combinations with respect to the class labels; its advantage is that it incorporates third or higher order feature interactions. Secondly, we use hypergraph clustering to extract the most informative feature subset, which has both low redundancy and strong discriminating power, and whose size is determined automatically. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard data-sets.

In addition to the supervised feature selection methods, we present two novel unsupervised feature selection methods in Chapter $5$ and Chapter $6$. Specifically, in Chapter $5$ we propose a new two-step spectral regression technique for unsupervised feature selection. In the first step, we use kernel entropy component analysis (kECA) to transform the data into a lower-dimensional space so as to improve class separation. In the second step, we use $\ell_{1}$-norm regularization to select the features that best align with the data embedding resulting from kECA. The advantage of kECA is that the dimensionality-reducing transformation maximally preserves entropy estimates for the input data whilst also best preserving the cluster structure of the data. Using $\ell_{1}$-norm regularization, we cast feature discriminant analysis into a regression framework which accommodates the correlations among features. As a result, we can evaluate joint feature combinations rather than being confined to considering features individually. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard face data-sets.
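The second step of the Chapter $5$ pipeline, selecting features whose $\ell_{1}$-regularized regression weights against an embedding direction are nonzero, can be sketched with a generic coordinate-descent Lasso. This is a minimal illustration under our own assumptions (a synthetic 1-D embedding `y` standing in for a kECA component), not the thesis's implementation:

```python
import numpy as np

def lasso_select(X, y, lam, iters=200):
    """Coordinate-descent Lasso sketch: nonzero weights mark selected features.

    X: (n_samples, n_features) data matrix.
    y: a 1-D embedding direction (e.g. one kECA component).
    """
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # partial residual, excluding feature j
            rho = X[:, j] @ r
            # soft-thresholding drives weakly aligned features exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 2.0 * X[:, 1] - 1.5 * X[:, 3]            # embedding depends only on features 1 and 3
w = lasso_select(X, y, lam=10.0)
selected = np.flatnonzero(np.abs(w) > 1e-6)
print(selected)                               # nonzero weights land on features 1 and 3
```

Because the regression is solved jointly over all features, correlated feature combinations compete for the same weight mass, which is the sense in which the $\ell_{1}$ framework evaluates features jointly rather than individually.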
In Chapter $6$, by using MII as a higher order similarity measure, we establish a novel hypergraph framework for characterizing the multiple relationships within a set of samples (e.g. face samples under varying illumination conditions). Thus, the structural information latent in the data can be modeled more effectively. We then explore a strategy for selecting the discriminating feature subset on the basis of the hypergraph representation. The strategy is an unsupervised method which derives a hypergraph-embedding view of feature selection. We evaluate the strategy on a number of standard image datasets, and the results demonstrate the effectiveness of our feature selection method. We summarize the contributions of this thesis in Chapter $7$, analyze the developed methods, and give some suggestions for future work in feature selection.
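A hypergraph-embedding step of the kind Chapter $6$ builds on is commonly realized with the normalized hypergraph Laplacian of Zhou et al.; the thesis's exact formulation may differ, so the following is a standard-construction sketch rather than the author's method:

```python
import numpy as np

def hypergraph_embedding(H, w, dim):
    """Spectral embedding from a normalized hypergraph Laplacian (Zhou-style).

    H:   (n_vertices, n_edges) incidence matrix (1 if vertex belongs to edge).
    w:   (n_edges,) hyperedge weights (e.g. derived from MII scores).
    dim: number of embedding coordinates to return.
    """
    Dv = H @ w                               # weighted vertex degrees
    De = H.sum(axis=0)                       # hyperedge degrees (vertices per edge)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv))
    # Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2};  L = I - Theta
    Theta = Dv_inv_sqrt @ H @ np.diag(w / De) @ H.T @ Dv_inv_sqrt
    L = np.eye(len(Dv)) - Theta
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]                # skip the trivial (constant-degree) eigenvector

# Three vertices, two overlapping hyperedges {0,1} and {1,2}.
H = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
Y = hypergraph_embedding(H, np.array([1.0, 1.0]), dim=1)
print(Y.shape)  # (3, 1)
```

The embedding coordinates can then be treated as regression targets, which is one way a hypergraph representation yields an embedding view of unsupervised feature selection.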