Distinguishing the Wood from the Trees: Contrasting Collection Methods to Understand Bias in a Longitudinal Brexit Twitter Dataset

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Various methods can be used for searching or streaming Twitter data to gather a sample on a specific topic. All of these methods introduce a bias into the resulting datasets. Here we examine, and try to define, the bias that the different strategies introduce. Understanding the bias means that we can extrapolate wider meaning from the data in a more precise manner. We use datasets collected on topics from the UK-EU Brexit referendum conducted in 2016. Each dataset discussed draws data from Twitter over a twelve-month period, from 1st September 2015 until 31st August 2016. Three data collection strategies are considered: collecting on human-defined, topic-specific hashtags; collecting using a semi-automated technique to identify topic terms, which are then used to collect tweets; and collecting from predefined users known to be tweeting on the topic. To investigate bias in the data, we examine group-level metadata attributes and find wide variation in them: the size of each dataset; the number of users in each set; average numbers of friends and followers; likely re-tweet status; and levels of inclusion of various add-ons such as hashtags, URLs and media. We also find that relevance to the topic differs between the sets, being far higher in the known users set. We investigate how the readability of tweets varies between sets, particularly between the known users and topic term sets. We also find a surprising lack of overlap in the data obtained using the different collection methods.
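The overlap comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes each collection strategy yields a set of tweet IDs and reports, for every pair of sets, the number of shared IDs and the Jaccard index. The function name `pairwise_overlap`, the set names, and the toy IDs are hypothetical and chosen for illustration only; the abstract does not say how the authors measured overlap.

```python
from itertools import combinations


def pairwise_overlap(collections):
    """Return {(name_a, name_b): (shared_count, jaccard)} for each pair of ID sets.

    `collections` maps a collection-strategy name to a set of tweet IDs.
    Pairs are keyed in alphabetical order of the strategy names.
    """
    result = {}
    for name_a, name_b in combinations(sorted(collections), 2):
        ids_a, ids_b = collections[name_a], collections[name_b]
        shared = ids_a & ids_b          # tweets captured by both strategies
        union = ids_a | ids_b           # tweets captured by either strategy
        jaccard = len(shared) / len(union) if union else 0.0
        result[(name_a, name_b)] = (len(shared), jaccard)
    return result


# Hypothetical toy data: the IDs below are invented, not taken from the paper.
collections = {
    "hashtags": {1, 2, 3, 4},
    "topic_terms": {3, 4, 5, 6, 7},
    "known_users": {4, 8},
}

overlap = pairwise_overlap(collections)
for pair, (shared, jaccard) in overlap.items():
    print(pair, shared, round(jaccard, 3))
```

A low Jaccard index for every pair would correspond to the kind of limited overlap the abstract reports, although other overlap measures could equally be used.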
Original language: English
Title of host publication: Proceedings of the Eleventh International AAAI Conference on Web and Social Media
Publisher: The AAAI Press
Number of pages: 4
Publication status: Published - 3 May 2017
Event: 11th International AAAI Conference on Web and Social Media - Montreal, Canada
Duration: 15 May 2017 to 18 May 2017
http://www.kdd.org/calls/view/the-11th-international-aaai-conference-on-web-and-social-media-icwsm-2017-c

Publication series

Name: International AAAI Conference on Web and Social Media
Publisher: The AAAI Press
Volume: 11
ISSN (Print): 2162-3449
ISSN (Electronic): 2334-0770

Conference

Conference: 11th International AAAI Conference on Web and Social Media
Abbreviated title: ICWSM 2017
Country: Canada
City: Montreal
Period: 15/05/17 to 18/05/17
