Representing Automatically Generated Topics

Abstract

Topic models are widely used in natural language processing (NLP). Ensuring that their output is interpretable is an important area of research, with applications such as enhancing exploratory search interfaces. Conventionally, topics are represented by their most probable words. However, these representations are often difficult for humans to interpret. Evaluating representations presents further challenges: ideally, humans would judge the quality of the topics, but this is not always feasible in practice. This thesis addresses these limitations of topic model output in three ways.

First, it proposes and explores a range of alternative representations of topics obtained by re-ranking topic words. Re-ranking adjusts the weights of the words and aims to identify the informative words in each topic. This approach is a straightforward remedy, as topics tend to contain "noisy" words. Additionally, two approaches to evaluating the topics are proposed: (1) an automatic approach based on a document retrieval task; and (2) a crowdsourcing task. Both approaches demonstrate that re-ranking words improves topic interpretability. In addition, two alternative visual forms of the topic are explored, and a simple word list proves more useful than a word cloud.
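As an illustration of the re-ranking idea, the following minimal sketch scores each word by its probability in a topic divided by its total probability across all topics, so that words common to every topic drop down the list. The scoring formula, the function name `rerank_topic_words`, and the toy distributions are assumptions made for illustration; the abstract does not state which re-ranking methods the thesis uses.

```python
def rerank_topic_words(topics, top_n=3):
    """Re-rank words by p(w|t) / sum over topics t' of p(w|t') (an assumed scheme)."""
    # Total probability mass each word receives across all topics.
    totals = {}
    for dist in topics:
        for word, prob in dist.items():
            totals[word] = totals.get(word, 0.0) + prob

    reranked = []
    for dist in topics:
        scores = {w: p / totals[w] for w, p in dist.items()}
        # Highest score first; ties fall back to the original probability.
        ranked = sorted(dist, key=lambda w: (-scores[w], -dist[w], w))
        reranked.append(ranked[:top_n])
    return reranked


# Hypothetical toy topics: "said" and "year" are noisy words shared by both.
topics = [
    {"said": 0.30, "goal": 0.25, "match": 0.20, "league": 0.15, "year": 0.10},
    {"said": 0.30, "vote": 0.25, "party": 0.20, "election": 0.15, "year": 0.10},
]

# Conventional representation: the most probable words of each topic.
original_top = [sorted(t, key=lambda w: (-t[w], w))[:3] for t in topics]
reranked_top = rerank_topic_words(topics)

print(original_top)
print(reranked_top)
```

In this toy example the shared word "said" heads both conventional lists but is absent from the re-ranked ones, which keep only the topic-specific words.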

Second, the thesis introduces a new approach to assigning short descriptive labels to topics. Labelling topics is an important task that aims to improve access to large document collections. Previous work on the automatic assignment of labels to topics has relied on a two-stage approach: (1) retrieve candidate labels from a large pool; and then (2) re-rank the candidate labels. However, these approaches can only assign labels from a restricted set that may not include any suitable ones. The new approach uses a sequence-to-sequence neural model to generate labels and therefore does not have this limitation. In addition, two new synthetic datasets of pairs of topics and labels are created to train the models.
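The limitation of the two-stage approach can be seen in a small sketch. Here, retrieval keeps pool labels that share a word with the topic, and re-ranking orders them by word overlap. Both scoring rules, the function names, and the label pool are hypothetical simplifications and are not the methods of the previous work the abstract refers to.

```python
def retrieve_candidates(topic_words, label_pool):
    """Stage 1: keep pool labels that share at least one word with the topic."""
    topic = set(topic_words)
    return [label for label in label_pool if topic & set(label.lower().split())]


def rerank_candidates(topic_words, candidates):
    """Stage 2: order candidates by how many topic words they contain."""
    topic = set(topic_words)

    def overlap(label):
        return len(topic & set(label.lower().split()))

    return sorted(candidates, key=lambda label: (-overlap(label), label))


# Hypothetical fixed pool of candidate labels.
label_pool = ["Football league", "General election", "Stock market"]

sports_topic = ["goal", "match", "league", "football"]
labels = rerank_candidates(sports_topic, retrieve_candidates(sports_topic, label_pool))

# A topic with no matching entry in the pool receives no label at all.
cooking_topic = ["recipe", "oven", "flour"]
no_labels = rerank_candidates(cooking_topic, retrieve_candidates(cooking_topic, label_pool))

print(labels)
print(no_labels)
```

The cooking topic gets an empty result because the pool contains nothing suitable, which is the gap a generative model avoids by producing the label text itself.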

Third, this thesis conducts an empirical study of the proposed labelling approaches and performs quantitative and qualitative analyses of the generated labels. The labels are evaluated against gold labels rated by humans and against the topics themselves. The proposed approaches generate appropriate labels that are coherent and relevant to the topics.

Metadata

Supervisors:	Stevenson, Mark and Aletras, Nikolaos
Related URLs:	Code in GitHub (Project)
Keywords:	topic models, lda, interpretability, re-ranking, topic labelling, information retrieval, neural networks
Awarding institution:	University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Identification Number/EthosID:	uk.bl.ethos.837191
Depositing User:	Areej Nasser A Alokaili
Date Deposited:	07 Sep 2021 15:38
Last Modified:	01 Oct 2021 09:53
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:29399

