Poulston, Adam Reece Spencer ORCID: https://orcid.org/0000-0002-9364-6630 (2021) User profiling with geo-located social media and demographic data. PhD thesis, University of Sheffield.
Abstract
User profiling is the task of inferring attributes, such as gender or age, of social media users based on the content they produce or their behaviour online. Approaches to user profiling typically use machine learning to train systems that infer the attributes of unseen users, given a training set of users labelled with their attributes. Classic approaches to labelling such a training set may be manual or automated; examples include direct solicitation through surveys, manual assignment based on outward characteristics, and extraction of attribute key-phrases from user description fields.
Social media platforms, such as Twitter, often allow users to attach their geographic location to their posts, a feature known as geo-location. In addition, government organisations release demographic data aggregated at a variety of geographic scales. The combination of these two data sources is currently under-explored in the user profiling literature. To combine them, a method is proposed for geo-location-driven user attribute labelling in which a coordinate-level prediction is made for a user's 'home location', which in turn is used to 'look up' the corresponding demographic variables that are then assigned to the user.
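The labelling idea described above (predict a home coordinate, then look up the demographic value for the area containing it) can be illustrated with a minimal sketch. This is not the method used in the thesis: the per-axis median is a stand-in home location estimator, and the `Region` class, its rectangular bounds, the region names, and the labels are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular area carrying one demographic label."""
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float
    label: str

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def estimate_home(points: List[Point]) -> Point:
    """Stand-in estimator (an assumption): per-axis median of a user's posts."""
    if not points:
        raise ValueError("at least one geo-located post is required")
    return median(p[0] for p in points), median(p[1] for p in points)


def lookup_label(home: Point, regions: List[Region]) -> Optional[str]:
    """Return the label of the first region containing the home point, if any."""
    for region in regions:
        if region.contains(*home):
            return region.label
    return None


# Illustrative, made-up regions and labels.
REGIONS = [
    Region("region_a", 53.30, 53.45, -1.60, -1.30, "label_a"),
    Region("region_b", 53.45, 53.60, -1.60, -1.30, "label_b"),
]

posts = [(53.38, -1.47), (53.39, -1.46), (53.50, -1.40)]
home = estimate_home(posts)
label = lookup_label(home, REGIONS)
print(home, label)
```

In practice the lookup would use real boundary polygons and a published demographic table rather than bounding boxes, and a user whose home point falls outside every region would remain unlabelled.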
Strong baseline components for user profiling systems are investigated and validated in experiments on existing user profiling datasets, and a corpus of geo-located Tweets is used to derive a complementary resource. An evaluation of current methods for assigning fine-grained home location to social media users is performed, and two improved methods are proposed based on clustering and majority voting across arbitrary geographic regions. The proposed geo-location-driven user attribute labelling approach is applied across three demographic variables within the UK: Output Area Classification (OAC), Local Authority Classification (LAC), and National Statistics Socio-economic Classification (NS-SEC). User profiling systems are trained and evaluated on each of the derived datasets, and NS-SEC is additionally validated against a dataset derived through a different method. Promising results are achieved for LAC and NS-SEC; however, characteristics of the underlying geographic and demographic data can lead to poor-quality datasets, as shown for OAC.
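As a rough illustration of what majority voting across geographic regions could look like, the sketch below maps each of a user's posts to a region and picks the region with the most posts. It is an assumption-laden example rather than the algorithm proposed in the thesis: the `grid_cell` function (a fixed-size latitude/longitude grid) and the `majority_home_region` helper are hypothetical, and any other mapping from coordinates to regions could be passed in instead.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def grid_cell(lat: float, lon: float, size: float = 0.1) -> Tuple[int, int]:
    """Hypothetical region function: index of a square grid cell of `size` degrees."""
    return math.floor(lat / size), math.floor(lon / size)


def majority_home_region(
    points: List[Point],
    region_of: Callable[[float, float], Hashable] = grid_cell,
) -> Tuple[Hashable, float]:
    """Return the most frequent region among a user's posts and its vote share."""
    if not points:
        raise ValueError("at least one geo-located post is required")
    votes = Counter(region_of(lat, lon) for lat, lon in points)
    region, count = votes.most_common(1)[0]
    return region, count / len(points)


posts = [(53.38, -1.47), (53.39, -1.46), (53.52, -1.41)]
region, share = majority_home_region(posts)
print(region, round(share, 2))
```

The vote share gives a simple confidence signal: users whose posts are spread thinly over many regions could be filtered out before labelling. Ties are resolved here in favour of the region seen first, which is an arbitrary choice.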
Metadata
Supervisors: | Mark Stevenson and Kalina Bontcheva |
---|---|
Awarding institution: | University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield) |
Identification Number/EthosID: | uk.bl.ethos.834096 |
Depositing User: | Mr Adam Reece Spencer Poulston |
Date Deposited: | 18 Jul 2021 19:51 |
Last Modified: | 01 Sep 2021 09:53 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:29141 |
Download
Final eThesis - complete (pdf)
Filename: poulston_140127463_thesis_corrected.pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License