Evaluating the similarity of location-based corpora identified in Reddit comments

Cillian Berragan*, Alex Singleton, Alessia Calafiore, Jeremy Morley

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Social interaction is typically studied in the context of physical movement, where geographic distance and ease of connectivity influence the strength of interaction between regions. On social media networks, however, these limitations appear to persist even though interactions do not rely on physical movement, suggesting that non-physical geographic characteristics influence interaction between social communities. Unlike geotags, which provide explicit geographic information about social media users as coordinates, unstructured text offers an alternative perspective on social interaction between regions: it allows the language used when locations are mentioned in context to be compared. Our paper analyses the corpora associated with major cities across the UK, first vectorising Reddit comments through transformer-based embeddings, which capture semantic information, then using these to establish unsupervised clusters and measure the similarity between them. We find that distinct groups emerge which broadly conform with established regional identities of locations across the UK, but with interesting deviations.
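The abstract does not specify how similarity between corpora is computed, so the following is only a minimal sketch of one common approach: average the comment embeddings for each city into a centroid and compare centroids by cosine similarity. The city names, the three-dimensional vectors, and the helper names `mean_vector`, `cosine_similarity` and `pairwise_similarity` are all hypothetical placeholders; in practice the vectors would come from a transformer-based embedding model and have several hundred dimensions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length embedding vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_similarity(centroids):
    """Cosine similarity for every unordered pair of cities, keyed by (a, b)."""
    cities = sorted(centroids)
    return {
        (a, b): cosine_similarity(centroids[a], centroids[b])
        for i, a in enumerate(cities)
        for b in cities[i + 1:]
    }


# ASSUMPTION: toy values standing in for transformer embeddings of comments
# that mention each city. They are invented for illustration, not real data.
comment_embeddings = {
    "Liverpool": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Manchester": [[0.7, 0.3, 0.1], [0.9, 0.2, 0.0]],
    "Edinburgh": [[0.1, 0.9, 0.3], [0.0, 0.8, 0.4]],
}

# One centroid per city corpus, then similarity between every pair of corpora.
centroids = {city: mean_vector(vecs) for city, vecs in comment_embeddings.items()}
similarity = pairwise_similarity(centroids)

for (city_a, city_b), score in similarity.items():
    print(f"{city_a} vs {city_b}: {score:.3f}")
```

Mean pooling is chosen here for simplicity. A similarity matrix built this way could then be passed to any unsupervised clustering routine, which is one possible reading of the clustering step the abstract mentions.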
Original language: English
Title of host publication: Proceedings of the First Workshop on Geographic Information Extraction from Texts (GeoExT 2023) co-located with The 45th European Conference on Information Retrieval (ECIR 2023)
Editors: Xuke Hu, Yingjie Hu, Bernd Resch, Jens Kersten, Kristin Stock
Publisher: CEUR Workshop Proceedings
Number of pages: 6
Publication status: Published - 2 Apr 2023
Event: 1st Workshop on Geographic Information Extraction from Texts, GeoExT 2023 - Dublin, Ireland
Duration: 2 Apr 2023 → …

Publication series

Name: CEUR Workshop Proceedings
ISSN (Electronic): 1613-0073


Conference: 1st Workshop on Geographic Information Extraction from Texts, GeoExT 2023
Period: 2/04/23 → …

Keywords / Materials (for Non-textual outputs)

  • natural language processing
  • social interaction
  • social media

