Qiu, Shenghao
ORCID: https://orcid.org/0000-0001-6345-0306
(2025)
Accelerating Deep Learning by Optimising Communication and Computation Efficiency.
PhD thesis, University of Leeds.
Abstract
Deep Neural Networks (DNNs) underpin many of today’s business and scientific innovations. However, realising the full potential of DNNs and making them more accessible requires tackling multiple challenges across the computing stack. These include improving the efficiency of distributed communication across computing devices and computation on a single computing node. This thesis presents new optimisations for DNN training across multiple distributed machines and on a single machine. Our techniques reduce DNN training times, making DNNs faster and more efficient.
First, we improve the communication efficiency of distributed training through multi-streamed gradient communication. Our approach lets a training worker participate in multiple gradient communication operations simultaneously, improving network bandwidth utilisation and reducing communication latency; it achieves up to 13.4x the training throughput of existing distributed training frameworks.
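The idea of multi-streamed gradient communication can be illustrated with a minimal, pure-Python sketch: the gradient is split into buckets, and each bucket is reduced with the peers on its own "stream" (here, a thread), so several all-reduce operations are in flight at once. This is not the thesis's implementation; `NUM_STREAMS`, `bucket_size`, and the in-memory `allreduce` stand-in are all hypothetical.

```python
# Sketch only: concurrent per-bucket all-reduce, assuming in-memory peers.
from concurrent.futures import ThreadPoolExecutor

NUM_STREAMS = 4  # hypothetical number of concurrent communication streams

def allreduce(bucket, peer_buckets):
    """Stand-in for a real all-reduce: element-wise sum with peers' buckets."""
    out = list(bucket)
    for peer in peer_buckets:
        for i, v in enumerate(peer):
            out[i] += v
    return out

def multi_stream_allreduce(gradient, peers, bucket_size):
    """Split `gradient` into buckets and all-reduce the buckets concurrently."""
    buckets = [gradient[i:i + bucket_size]
               for i in range(0, len(gradient), bucket_size)]
    peer_bucketed = [[p[i:i + bucket_size]
                      for i in range(0, len(p), bucket_size)] for p in peers]
    with ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        futures = [pool.submit(allreduce, b, [pb[j] for pb in peer_bucketed])
                   for j, b in enumerate(buckets)]
        reduced = [f.result() for f in futures]
    return [v for bucket in reduced for v in bucket]  # reassemble the gradient

# Example: two peers, an 8-element gradient, four buckets of two elements.
print(multi_stream_allreduce([1.0] * 8, [[2.0] * 8, [3.0] * 8], bucket_size=2))
# -> [6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0]
```

In a real framework the per-bucket work would be a network collective (e.g. a ring all-reduce) issued on separate communication streams, which is what lets bucket transfers overlap and keep the links busy.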
Next, we develop new techniques to improve DNN performance on a single machine, using Graph Neural Networks (GNNs) as a case study. While GNNs are a powerful tool for processing graph data, their scalability is hindered by the large memory footprint typically required to process real-life datasets. To mitigate this issue, we leverage two kernel-based memory-saving strategies: lossless sparse matrix storage and Tensor-train Decomposition (TTD), which trades accuracy for memory efficiency. Our goal is to accelerate the computation of these memory-saving techniques. For sparse matrix storage, we enhance GNN kernels with a machine-learning-based method that dynamically selects the optimal sparse storage format for the input data, accelerating sparse GNN kernels by up to 3x. To accelerate TTD-based GNNs, we optimise data reuse and kernel invocations and identify TTD configurations that improve training convergence. Our optimisations improve TTD-based GNN training throughput by up to 8.17x and reduce the GNN memory footprint by up to 5,474x without compromising model accuracy.
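Input-dependent sparse-format selection can be sketched as follows: extract simple features from the matrix, feed them to a predictor that picks a storage format, then dispatch the sparse matrix-vector product (SpMV) to the matching kernel. This pure-Python sketch is not the thesis's system; the feature set and the hand-set decision rule are hypothetical stand-ins for a trained classifier.

```python
# Sketch only: feature-based choice between CSR and COO for SpMV.

def features(dense):
    """Cheap features: density and variance of non-zeros per row."""
    row_nnz = [sum(1 for v in row if v != 0) for row in dense]
    density = sum(row_nnz) / (len(dense) * len(dense[0]))
    mean = sum(row_nnz) / len(row_nnz)
    variance = sum((r - mean) ** 2 for r in row_nnz) / len(row_nnz)
    return density, variance

def choose_format(dense):
    """Stand-in for a learned model: very sparse, irregular rows -> COO."""
    density, variance = features(dense)
    return "COO" if density < 0.1 and variance > 1.0 else "CSR"

def to_csr(dense):
    vals, cols, ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        ptr.append(len(vals))
    return vals, cols, ptr

def spmv_csr(csr, x):
    vals, cols, ptr = csr
    return [sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]

def spmv_coo(coo, x, nrows):
    y = [0.0] * nrows
    for i, j, v in coo:  # coo is a list of (row, col, value) triples
        y[i] += v * x[j]
    return y

def spmv(dense, x):
    """Dispatch SpMV to the kernel for the predicted best format."""
    fmt = choose_format(dense)
    if fmt == "CSR":
        return spmv_csr(to_csr(dense), x), fmt
    coo = [(i, j, v) for i, row in enumerate(dense)
           for j, v in enumerate(row) if v != 0]
    return spmv_coo(coo, x, len(dense)), fmt

result, fmt = spmv([[1, 0, 0], [0, 0, 2], [0, 3, 0]], [1, 1, 1])
print(result, fmt)  # -> [1, 2, 3] CSR
```

A production version would replace `choose_format` with a model trained offline on measured kernel timings across many matrices, which is what makes the selection adapt to unseen inputs.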
Metadata
| Field | Value |
|---|---|
| Supervisors | Wang, Zheng and Xu, Jie |
| Keywords | machine learning, deep neural network, graph neural network, distributed deep learning, model training, high-performance computing, optimisation |
| Awarding institution | University of Leeds |
| Academic units | The University of Leeds > Faculty of Engineering (Leeds) |
| Academic unit | School of Computer Science |
| Date deposited | 16 Jan 2026 10:15 |
| Last modified | 16 Jan 2026 10:15 |
| Open Archives Initiative ID (OAI ID) | oai:etheses.whiterose.ac.uk:37800 |
Download
Final eThesis - complete (pdf)
Filename: Qiu_SHQ_Computer Science_PhD_2025.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.