Cai, Ziyun (2017) Feature Learning for RGB-D Data. PhD thesis, University of Sheffield.
Abstract
RGB-D data has proved to be a very useful representation for solving fundamental computer vision problems. It combines the advantages of color images, which provide appearance information about an object, with those of depth images, which are immune to variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high-quality RGB-D data can be acquired easily. RGB-D images and videos can facilitate a wide range of application areas, such as computer vision, robotics, construction and medical imaging. However, how to fuse RGB information and depth information remains an open problem in computer vision: simply concatenating RGB data and depth data is not enough, and more powerful fusion algorithms are still needed.

In this thesis, to explore more of the advantages of RGB-D data, we use several popular RGB-D datasets for evaluating deep feature learning algorithms, hyper-parameter optimization, local multi-modal feature learning, RGB-D data fusion and recognizing RGB information from RGB-D images: i) With the success of deep neural networks in computer vision, deep features learned from fused RGB-D data have been shown to give better results than those from RGB data alone. However, different deep learning algorithms perform differently on different RGB-D datasets. Through large-scale experiments that comprehensively evaluate deep feature learning models for RGB-D image/video classification, we conclude that RGB-D fusion methods using CNNs consistently outperform the other selected methods (DBNs, SDAE and LSTM). On the other hand, since LSTM can learn from experience to classify, process and predict time series, it achieves better performance than DBNs and SDAE on video classification tasks. ii) Hyper-parameter optimization can help researchers quickly choose an initial set of hyper-parameters for a new classification task, thus reducing the number of trials over the hyper-parameter space. We present a simple and efficient framework that improves the efficiency and accuracy of hyper-parameter optimization by considering the classification complexity of a particular dataset, and we verify it on three real-world RGB-D datasets. The experimental analysis confirms that our framework provides deeper insights into the relationship between dataset classification tasks and hyper-parameter optimization, and can therefore quickly choose an accurate initial set of hyper-parameters for a new classification task. iii) We propose a new Convolutional Neural Network (CNN)-based local multi-modal feature learning framework for RGB-D scene classification. This method effectively captures much of the local structure in RGB-D scene images and automatically learns a fusion strategy at the object-level recognition step, instead of simply training a classifier on top of features extracted from both modalities (an illustrative sketch of such two-stream fusion follows the abstract). Experiments on two popular datasets thoroughly test the performance of our method and show that our local multi-modal CNNs greatly outperform state-of-the-art approaches; the method therefore has the potential to improve RGB-D scene understanding. An extended evaluation shows that a CNN trained on a scene-centric dataset achieves an improvement on scene benchmarks compared with a network trained on an object-centric dataset.
iv) We propose a novel method for RGB-D data fusion: raw RGB-D data are projected into a complex space, and features are then extracted jointly from the fused RGB-D images (a sketch of this projection also follows the abstract). Besides three observations about the fusion methods, the experimental results show that our method achieves competitive performance against classical SIFT. v) We propose a novel method called adaptive Visual-Depth Embedding (aVDE), which first learns a compact shared latent space between the representations of the labeled RGB and depth modalities in the source domain. This shared latent space then helps transfer the depth information to the unlabeled target dataset. Finally, aVDE matches features and reweights instances jointly across the shared latent space and the projected target domain to obtain an adaptive classifier. This method exploits the additional depth information in the source domain while simultaneously reducing the domain mismatch between the source and target domains. On two real-world image datasets, the experimental results show that the proposed method significantly outperforms state-of-the-art methods.
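The local multi-modal fusion idea in contribution (iii) can be illustrated with a minimal two-stream network: each modality has its own small convolutional stream, and a learned fusion layer mixes the two feature maps before the scene classifier, rather than simply stacking independently extracted features. This is a hedged sketch under assumed layer sizes and class counts, not the thesis' actual architecture.

```python
# Illustrative two-stream RGB-D fusion sketch (not the thesis' exact model).
import torch
import torch.nn as nn

class TwoStreamFusionNet(nn.Module):
    def __init__(self, num_classes=10):        # num_classes is a placeholder
        super().__init__()
        def stream(in_ch):
            # A small convolutional stream for one modality (RGB or depth).
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
        self.rgb_stream = stream(3)    # 3-channel color input
        self.depth_stream = stream(1)  # 1-channel depth input
        # Learned fusion: a 1x1 convolution that mixes the concatenated
        # modality features instead of leaving them merely stacked.
        self.fusion = nn.Conv2d(128, 64, kernel_size=1)
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        f = torch.relu(self.fusion(f))
        return self.classifier(f.flatten(1))

# Example usage with random tensors standing in for RGB-D scene crops.
net = TwoStreamFusionNet(num_classes=19)
logits = net(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 19])
```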
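One possible reading of the complex-space projection in contribution (iv) is to treat intensity as the real part and depth as the imaginary part of a single complex-valued image, so that features (here, a toy Fourier-magnitude descriptor) are extracted from both modalities jointly. The normalization and the spectral descriptor below are assumptions for illustration, not the thesis' formulation.

```python
# Illustrative complex-space RGB-D fusion sketch (an assumed reading, not the
# thesis' exact method).
import numpy as np

def fuse_rgbd_complex(rgb, depth):
    """rgb: HxWx3 uint8 image, depth: HxW depth map; returns an HxW complex image."""
    gray = rgb.astype(np.float64).mean(axis=2) / 255.0        # normalize intensity
    d = depth.astype(np.float64)
    d = (d - d.min()) / (np.ptp(d) + 1e-8)                    # normalize depth to [0, 1]
    return gray + 1j * d                                      # joint complex representation

def joint_spectral_feature(z, k=8):
    """Toy joint descriptor: magnitudes of the k x k low-frequency FFT coefficients."""
    spectrum = np.fft.fftshift(np.fft.fft2(z))
    h, w = spectrum.shape
    centre = spectrum[h // 2 - k // 2: h // 2 + k // 2,
                      w // 2 - k // 2: w // 2 + k // 2]
    return np.abs(centre).ravel()

# Example usage with synthetic data standing in for a real RGB-D frame.
rgb = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
depth = np.random.rand(120, 160).astype(np.float32)
feat = joint_spectral_feature(fuse_rgbd_complex(rgb, depth))
print(feat.shape)  # (64,)
```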
Metadata
| Supervisors: | Shao, Ling and Liu, Wei |
| --- | --- |
| Awarding institution: | University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Electronic and Electrical Engineering (Sheffield) |
| Identification Number/EthosID: | uk.bl.ethos.725021 |
| Depositing User: | Mr. Ziyun Cai |
| Date Deposited: | 16 Oct 2017 08:18 |
| Last Modified: | 12 Oct 2018 09:46 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:18370 |
Download
Filename: thesis by Ziyun Cai.pdf
Description: thesis by Ziyun Cai
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License