Alghamdi, Mohammed Mesfer A ORCID: https://orcid.org/0000-0003-2696-9357 (2023) Video Synthesis of a Talking Head. PhD thesis, University of Leeds.
Abstract
The ability to synthesise a talking-head video from speech audio has many potential applications, such as video conferencing, video animation production and virtual assistants. Although there has been considerable prior work on this task, the quality of generated videos is typically limited in terms of overall realism and resolution. In this thesis, we propose a novel approach for synthesising a high-resolution talking-head video (1024×1024 pixels in our experiments) from speech audio and a single identity image.
The approach is built on top of a pre-trained StyleGAN image generator. We model trajectories in the latent space of the generator conditioned on speech utterances. To train this model, we use a dataset of talking-head videos whose frames are mapped into the latent space of the image generator by an image encoder that is also pre-trained. We train a recurrent neural network to map speech utterances to sequences of displacements in the latent space of the image generator. These displacements are applied to the latent code obtained by back-projecting a single identity frame, chosen from a target video in the training dataset, into the latent space.
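A minimal PyTorch sketch may make this pipeline concrete. The names and architectural details below (`AudioToLatentRNN`, a GRU over mel-spectrogram features, an 18-layer W+ latent code, `w_identity`) are illustrative assumptions, not the thesis's published interface.

```python
import torch
import torch.nn as nn

class AudioToLatentRNN(nn.Module):
    """Maps a speech-feature sequence to per-frame displacements in the
    latent space of a pre-trained (frozen) StyleGAN generator."""

    def __init__(self, audio_dim=80, hidden_dim=512, num_ws=18, latent_dim=512):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # One displacement vector per StyleGAN layer per video frame.
        self.head = nn.Linear(hidden_dim, num_ws * latent_dim)
        self.num_ws, self.latent_dim = num_ws, latent_dim

    def forward(self, audio_feats, w_identity):
        # audio_feats: (B, T, audio_dim), e.g. mel-spectrogram frames.
        # w_identity: (B, num_ws, latent_dim), the back-projected identity frame.
        h, _ = self.rnn(audio_feats)  # (B, T, hidden_dim)
        dw = self.head(h).view(h.size(0), h.size(1), self.num_ws, self.latent_dim)
        # Displacements are added to the fixed identity code, frame by frame.
        return w_identity.unsqueeze(1) + dw  # (B, T, num_ws, latent_dim)
```

Each per-frame code would then be decoded by the frozen StyleGAN synthesis network into an output frame, so the recurrent model only has to learn motion in latent space rather than in pixels.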
The thesis begins by reporting on an experimental evaluation of existing GAN inversion methods that map video frames into the latent space of a pre-trained StyleGAN image generator. We apply one such inversion method to train an unconditional video generator that requires an identity image and a random seed for the dynamical process that generates a trajectory through the latent space of the image generator.
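The unconditional generator can be sketched in the same spirit: a random seed initialises a recurrent dynamical process that unrolls a trajectory through latent space, with no audio input. As before, all names and architectural choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LatentTrajectorySampler(nn.Module):
    """Unrolls a latent-space trajectory from a random seed (no audio)."""

    def __init__(self, noise_dim=128, hidden_dim=512, num_ws=18, latent_dim=512):
        super().__init__()
        self.init = nn.Linear(noise_dim, hidden_dim)   # seed -> initial state
        self.cell = nn.GRUCell(noise_dim, hidden_dim)  # dynamical process
        self.head = nn.Linear(hidden_dim, num_ws * latent_dim)
        self.noise_dim, self.num_ws, self.latent_dim = noise_dim, num_ws, latent_dim

    @torch.no_grad()
    def sample(self, w_identity, num_frames, seed=0):
        # w_identity: (num_ws, latent_dim), the inverted identity image;
        # the seed alone determines the whole trajectory.
        g = torch.Generator().manual_seed(seed)
        z = torch.randn(1, self.noise_dim, generator=g)
        h = torch.tanh(self.init(z))
        ws = []
        for _ in range(num_frames):
            h = self.cell(torch.zeros(1, self.noise_dim), h)
            dw = self.head(h).view(self.num_ws, self.latent_dim)
            ws.append(w_identity + dw)
        # Each code would be decoded by the frozen StyleGAN synthesis network.
        return torch.stack(ws)  # (num_frames, num_ws, latent_dim)
```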
We evaluate our method for talking-head synthesis from speech audio with standard measures and show that it significantly outperforms recent state-of-the-art methods on commonly used audio-visual talking-head datasets (GRID and TCD-TIMIT). We perform the evaluation with two versions of StyleGAN: one trained on video frames depicting talking heads and the other on faces with static expressions (i.e., not talking). The quality of the results is shown to be better when using the StyleGAN pre-trained on talking heads. However, the range of possible identities is narrower due to the much smaller set of identities in the talking-head dataset.
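The abstract does not name the measures; SSIM and PSNR are common frame-level choices in the talking-head literature, so the snippet below computes those two as an illustrative stand-in rather than the thesis's exact protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_metrics(generated, reference):
    """Per-frame SSIM and PSNR for uint8 RGB frames of equal shape (H, W, 3)."""
    ssim = structural_similarity(generated, reference, channel_axis=-1)
    psnr = peak_signal_noise_ratio(reference, generated)
    return ssim, psnr

# Averaging over an aligned clip of generated and ground-truth frames:
# scores = [frame_metrics(g, r) for g, r in zip(gen_frames, ref_frames)]
# mean_ssim, mean_psnr = np.mean(scores, axis=0)
```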
Metadata
| Supervisors: | Hogg, David and Bulpitt, Andrew |
|---|---|
| Keywords: | talking head, video synthesis |
| Awarding institution: | University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds) |
| Depositing User: | Mr Mohammed Mesfer A Alghamdi |
| Date Deposited: | 29 Jan 2024 16:38 |
| Last Modified: | 29 Jan 2024 16:38 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:34144 |
Download
Final eThesis - complete (pdf)
Filename: Alghamdi_MM_Computing_PhD_2023.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.