ZHAO, ZHIXUE ORCID: https://orcid.org/0000-0002-3060-269X (2022) Using Pre-trained Language Models for Toxic Comment Classification. PhD thesis, University of Sheffield.
Abstract
Toxic comment classification is a core natural language processing (NLP) task for combating toxic comments online. It follows the supervised learning paradigm, which requires labelled data for training, and a large amount of high-quality training data empirically benefits model performance. Transferring a pre-trained language model (PLM) to a downstream model allows the downstream model to benefit from additional data without requiring new labelled data. Despite the growing body of research on PLMs in NLP tasks, there remains a fundamental lack of understanding of how to apply PLMs to toxic comment classification. This work addresses this gap from three perspectives.
First, we investigate different transfer strategies for toxic comment classification tasks, highlighting the importance of transfer efficiency: achieving comparable model performance while keeping computational requirements reasonable. To this end, we explore continued pre-training in-domain, which further pre-trains a PLM on an in-domain corpus, and compare different PLMs and different settings for this continued pre-training.
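At the core of continued pre-training in-domain is the same masked-language-model (MLM) objective used in the original pre-training, applied to an in-domain corpus. The following is a minimal sketch of BERT-style MLM corruption (the 80/10/10 masking scheme); the tokens, vocabulary, and probabilities are illustrative stand-ins, not the thesis's actual setup.

```python
import random

MASK = "[MASK]"
VOCAB = ["toxic", "comment", "people", "online", "data", "model"]  # toy vocabulary

def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """Apply BERT-style MLM corruption: select ~mask_prob of positions;
    of those, 80% become [MASK], 10% become a random vocabulary token,
    and 10% stay unchanged. Returns (corrupted, labels), where labels
    holds the original token at selected positions and None elsewhere."""
    rng = rng or random.Random(0)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token (the remaining 10%)
    return corrupted, labels

toks = "toxic comment classification uses labelled data".split()
corrupted, labels = mlm_mask(toks, mask_prob=0.3)
print(corrupted)
print(labels)
```

In practice this corruption is applied on the fly to batches of in-domain text, and the PLM is trained to recover the original tokens before being fine-tuned on the labelled classification data.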
Second, we investigate the limitations of PLMs for toxic comment classification. Taking the most popular PLM, BERT, as the representative model for our study, we focus on identity term bias, i.e. prediction bias towards comments containing identity terms such as "Muslim" and "Black". To investigate this bias, we conduct both quantitative and qualitative analyses and study the model's explanations. We also propose a hypothesis linking identity term bias to the subjectivity of comments.
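One common way to quantify identity term bias is a template-based probe: score otherwise-neutral sentences filled with identity terms versus a neutral term, and measure the gap. The sketch below uses a toy keyword classifier as a stand-in for a trained BERT model; the templates, terms, and `bias_gap` metric are illustrative assumptions, not the thesis's evaluation protocol.

```python
def toxicity_score(text):
    """Toy stand-in for a trained toxicity classifier: scores texts by
    the fraction of tokens in a small 'toxic' lexicon. A biased model
    would, in effect, also assign high scores to identity terms."""
    toxic_words = {"hate", "stupid"}
    toks = text.lower().split()
    return sum(t in toxic_words for t in toks) / max(len(toks), 1)

TEMPLATES = ["I am a {} person", "{} people wrote this comment"]
IDENTITY_TERMS = ["muslim", "black", "gay"]
NEUTRAL_TERM = "tall"

def bias_gap(score_fn):
    """Average score difference between identity-term and neutral-term
    fillings of the same non-toxic templates; near 0 means the model
    shows no identity term bias on these probes."""
    gaps = []
    for tpl in TEMPLATES:
        base = score_fn(tpl.format(NEUTRAL_TERM))
        for term in IDENTITY_TERMS:
            gaps.append(score_fn(tpl.format(term)) - base)
    return sum(gaps) / len(gaps)

print(bias_gap(toxicity_score))  # 0.0 for this unbiased toy classifier
```

A biased classifier would yield a positive gap: non-toxic sentences are pushed towards the toxic class merely because an identity term is present.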
Third, building on this hypothesis, we propose a novel BERT-based model to mitigate identity term bias. Unlike previous methods, which try to suppress the model's attention to identity terms, our method injects the subjectivity of a comment into the model together with a signal indicating the presence of identity terms. It shows consistent improvements across a range of toxic comment classification tasks.
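One simple way to inject such auxiliary signals is to append them to the sentence representation before the classification head. The sketch below is a heavily simplified illustration of that general idea, not the thesis's actual architecture; the embedding values and feature names are hypothetical.

```python
def augment_features(cls_embedding, subjectivity, has_identity_term):
    """Concatenate auxiliary signals onto the sentence embedding so a
    classification head can condition on them explicitly, rather than
    inferring toxicity from the identity term alone."""
    return list(cls_embedding) + [subjectivity, 1.0 if has_identity_term else 0.0]

emb = [0.1, -0.4, 0.7]  # stand-in for a BERT [CLS] sentence vector
feat = augment_features(emb, subjectivity=0.8, has_identity_term=True)
print(feat)  # [0.1, -0.4, 0.7, 0.8, 1.0]
```

The intuition is that an explicit subjectivity feature lets the classifier separate "subjective, hostile mention of an identity group" from "objective, neutral mention", which a plain BERT fine-tune tends to conflate.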
Metadata
Supervisors: Zhang, Ziqi and Hopfgartner, Frank
Keywords: machine learning; hate speech; natural language processing; deep learning; neural networks; text processing; toxic comments; abusive language; model bias; model explanation; rationales; BERT; pre-trained language models; pre-training
Awarding institution: University of Sheffield
Academic Units: The University of Sheffield > Faculty of Social Sciences (Sheffield); The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Identification Number/EthosID: uk.bl.ethos.871090
Depositing User: Dr Zhixue Zhao
Date Deposited: 09 Jan 2023 15:46
Last Modified: 01 Mar 2023 10:54
Open Archives Initiative ID (OAI ID): oai:etheses.whiterose.ac.uk:31833
Download
Final eThesis - complete (pdf)
Filename: Thesis_Cass_final.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.