Cai, Ziyun (2017) Feature Learning for RGB-D Data. PhD thesis, University of Sheffield.
Abstract
RGB-D data has turned out to be a very useful representation for solving fundamental computer vision problems. It takes the advantages of the color images that provide appearance information of an object and also the depth image that is immune to the variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high quality RGB-D data can be acquired easily. RGB-D image/video can facilitate a wide range of application areas, such as computer vision, robotics, construction and medical imaging. Furthermore, how to fuse RGB information and depth information is still a problem in computer vision. It is not enough to simply concatenate RGB data and depth data together. A new fusion method could better fuse RGB images and depth images. It still needs more powerful algorithms on this. In this thesis, to explore more advantages of RGB-D data, we use some popular RGB-D datasets for deep feature learning algorithms evaluation, hyper-parameter optimization, local multi-modal feature learning, RGB-D data fusion and recognizing RGB information from RGB-D images: i)With the success of Deep Neural Network in computer vision, deep features from fused RGB-D data can be proved to gain better results than RGB data only. However, different deep learning algorithms show different performance on different RGB-D datasets. Through large-scale experiments to comprehensively evaluate the performance of deep feature learning models for RGB-D image/ video classification, we obtain the conclusion that RGB-D fusion methods using CNNs always outperform other selected methods (DBNs, SDAE and LSTM). On the other side, since LSTM can learn from experience to classify, process and predict time series, it achieved better performances than DBN and SDAE in video classification tasks. ii) Hyper-parameter optimization can help researchers quickly choose an initial set of hyper-parameters for a new coming classification task, thus reducing the number of trials in terms of hyper-parameter space. We present a simple and efficient framework for improving the efficiency and accuracy of hyper-parameter optimization by considering the classification complexity of a particular dataset. We verify this framework on three real-world RGB-D datasets. After the analysis of experiments, we confirm that our framework can provide deeper insights into the relationship between dataset classification tasks and hyperparameters optimization, thus quickly choosing an accurate initial set of hyper-parameters for a new coming classification task. iii) We propose a new Convolutional Neural Networks (CNNs)-based local multi-modal feature learning framework for RGB-D scene classification. This method can effectively capture much of the local structure from the RGB-D scene images and automatically learn a fusion strategy for the object-level recognition step instead of simply training a classifier on top of features extracted from both modalities. Experiments are conducted on two popular datasets to thoroughly test the performance of our method, which show that our method with local multi-modal CNNs greatly outperforms state-of-the-art approaches. Our method has the potential to improve RGB-D scene understanding. Some extended evaluation shows that CNNs trained using a scene-centric dataset is able to achieve an improvement on scene benchmarks compared to a network trained using an object-centric dataset. iv) We propose a novel method for RGB-D data fusion. We project raw RGB-D data into a complex space and then jointly extract features from the fused RGB-D images. Besides three observations about the fusion methods, the experimental results also show that our method achieves competing performance against the classical SIFT. v) We propose a novel method called adaptive Visual-Depth Embedding (aVDE) which learns the compact shared latent space between two representations of labeled RGB and depth modalities in the source domain first. Then the shared latent space can help the transfer of the depth information to the unlabeled target dataset. At last, aVDE matches features and reweights instances jointly across the shared latent space and the projected target domain for an adaptive classifier. This method can utilize the additional depth information in the source domain and simultaneously reduce the domain mismatch between the source and target domains. On two real-world image datasets, the experimental results illustrate that the proposed method significantly outperforms the state-of-the-art methods.
Metadata
Supervisors: | Shao, Ling and Liu, Wei |
---|---|
Awarding institution: | University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Electronic and Electrical Engineering (Sheffield) |
Identification Number/EthosID: | uk.bl.ethos.725021 |
Depositing User: | Mr. Ziyun Cai |
Date Deposited: | 16 Oct 2017 08:18 |
Last Modified: | 12 Oct 2018 09:46 |
Download
thesis by Ziyun Cai
Filename: thesis by Ziyun Cai.pdf
Description: thesis by Ziyun Cai
Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License
Export
Statistics
You do not need to contact us to get a copy of this thesis. Please use the 'Download' link(s) above to get a copy.
You can contact us about this thesis. If you need to make a general enquiry, please see the Contact us page.