Abstract
Observed regional variation in geotagged social media text is often attributed to dialects, where features of language are assumed to exhibit region-specific properties. While dialects are seen as a key component in defining the identity of regions, a multitude of other geographic properties may also be captured within natural language text. In our work, we consider location mentions embedded directly within comments on the social media website Reddit, which provide a range of associated semantic information and enable deeper representations between locations to be captured. Using a large corpus of geoparsed Reddit comments from UK-related local discussion subreddits, we first extract embedded semantic information using a large language model and aggregate it into local authority districts to represent the semantic footprint of each region. These footprints broadly exhibit spatial autocorrelation, with clusters that conform with the national borders of Wales and Scotland. London, Wales, and Scotland also demonstrate notably different semantic footprints compared with the rest of Great Britain.
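As a rough illustration of the aggregation and spatial autocorrelation steps summarised in the abstract, the sketch below mean-pools per-comment embedding vectors into districts and then computes global Moran's I on one embedding dimension. Everything here is an assumption made for illustration: the abstract does not say which pooling method or autocorrelation statistic the authors used, and the district names, vectors, and adjacency weights are toy values, not data from the paper.

```python
from collections import defaultdict


def aggregate_by_district(records):
    """Mean-pool embedding vectors per district.

    records: iterable of (district_id, vector) pairs (hypothetical input shape).
    """
    sums = {}
    counts = defaultdict(int)
    for district, vec in records:
        if district not in sums:
            sums[district] = [0.0] * len(vec)
        sums[district] = [s + v for s, v in zip(sums[district], vec)]
        counts[district] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}


def morans_i(values, weights):
    """Global Moran's I.

    values: {district_id: float}
    weights: {(i, j): w_ij} spatial weights, e.g. 1.0 for neighbouring districts.
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {d: v - mean for d, v in values.items()}
    w_sum = sum(weights.values())
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(x * x for x in dev.values())
    return (n / w_sum) * (num / den)


# Toy data: four districts in a line, A - B - C - D (assumed, not from the paper).
records = [
    ("A", [1.0, 0.0]), ("A", [1.0, 2.0]),
    ("B", [2.0, 1.0]),
    ("C", [3.0, 1.0]),
    ("D", [4.0, 1.0]),
]
footprints = aggregate_by_district(records)

# Symmetric binary adjacency weights between neighbouring districts.
neighbours = [("A", "B"), ("B", "C"), ("C", "D")]
weights = {}
for i, j in neighbours:
    weights[(i, j)] = 1.0
    weights[(j, i)] = 1.0

# Moran's I on the first embedding dimension only.
first_dim = {d: vec[0] for d, vec in footprints.items()}
score = morans_i(first_dim, weights)
```

A positive score indicates that neighbouring districts tend to have similar values on that dimension. A real analysis would more likely use district boundary data and an established spatial statistics library than hand-built weights.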
Original language | English |
---|---|
Article number | 102121 |
Pages (from-to) | 1-12 |
Number of pages | 12 |
Journal | Computers, Environment and Urban Systems |
Volume | 110 |
Early online date | 26 Apr 2024 |
Publication status | Published - Jun 2024 |
Keywords / Materials (for Non-textual outputs)
- Natural Language Processing
- semantics
- social media
- vernacular geography