Buckley, Shaun Charles
ORCID: https://orcid.org/0009-0005-5164-7095
(2025)
Unsupervised Detection and Synthesis of Speech and Environmental Sounds Using Generative Networks.
PhD thesis, University of Leeds.
Abstract
The task of isolating specific sounds in a noisy environment is known as the “cocktail party problem”: a listener must filter out all the irrelevant sounds and speakers to focus on a specific individual, for either localisation or understanding. Humans do this in a variety of settings beyond speech, such as focusing on individual instruments or vocals in a piece of music. Nor is the ability unique to humans: most animals capable of hearing need to focus on specific environmental sounds for their own safety, whether that is localising the snap of a branch that warns of an encroaching predator or tracking the sounds of potential prey. Filtering out irrelevant sounds to focus on a specific source is a natural function for hearing creatures, but a challenging mechanism to replicate in machines. Great advances have been made in speech denoising and music source separation, but the holy grail of universal sound separation, which humans achieve effortlessly, remains out of reach. Whilst there have been attempts to produce models that can separate several arbitrary sources, most make assumptions about the number of sounds in a mixture and/or the number of classes in the dataset. Furthermore, these separation models are computationally expensive to train, often requiring multiple GPUs and large amounts of memory.

To address these problems, we first combine the efficiency of convolutional networks with the global receptive field of axial transformers to produce a model that separates arbitrary sounds from entangled audio mixtures, capturing long-term dependencies within the data whilst reducing memory requirements. We then utilise the structured latent space of unsupervised VAEs to perform semi-supervised labelling of multi-class input, such as audio mixtures, by measuring the divergence between the latent distributions of a given sample and a set of given classes. Finally, we propose a generative audio framework using infinite generative adversarial networks for performing audio synthesis, class detection and audio classification.
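The second contribution above hinges on comparing latent distributions. As a minimal sketch of that idea, assuming the usual diagonal-Gaussian VAE posterior, the snippet below computes the closed-form KL divergence between a sample's posterior and a set of per-class reference distributions; the encoder outputs, class statistics and threshold here are hypothetical illustrations, not the thesis's implementation.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians:
    0.5 * sum(logvar_p - logvar_q
              + (var_q + (mu_q - mu_p)^2) / var_p - 1)
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def detect_classes(sample_stats, class_stats, threshold=5.0):
    """Return every reference class whose latent distribution lies
    within `threshold` nats of the sample's posterior.

    sample_stats: (mu, logvar) from a hypothetical VAE encoder.
    class_stats:  {label: (mu, logvar)} aggregated per class.
    """
    mu_s, lv_s = sample_stats
    return [
        label
        for label, (mu_c, lv_c) in class_stats.items()
        if kl_diag_gaussians(mu_s, lv_s, mu_c, lv_c) < threshold
    ]
```

Read this way, a mixture containing, say, speech and rain would receive every class label whose divergence falls below the threshold, which is one way the abstract's semi-supervised multi-class labelling of audio mixtures can be realised.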
Metadata
| Field | Value |
|---|---|
| Supervisors: | Hogg, David and Wang, He |
| Keywords: | Audio Source Separation; Single-Channel Audio Separation; Audio Generation; Audio Synthesis; Audio Classification; Transformers; CNNs; Generative Adversarial Networks; GANs; VAEs; CRP; ACRP |
| Awarding institution: | University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Engineering (Leeds) |
| Academic unit: | School of Computer Science |
| Date Deposited: | 16 Jan 2026 11:42 |
| Last Modified: | 16 Jan 2026 11:42 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:37846 |
Download
Final eThesis - complete (pdf)
Filename: Buckley_SCB_Computing_PhD_2025.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License