Edinburgh Research Explorer

Inducing a lexicon of sociolinguistic variables from code-mixed text

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Original language: English
Title of host publication: 2018 The 4th Workshop on Noisy User-generated Text (W-NUT)
Subtitle of host publication: Nov 1, 2018, Brussels, Belgium (at EMNLP 2018)
Place of Publication: Brussels, Belgium
Publisher: Association for Computational Linguistics
Number of pages: 6
Publication status: Published - Nov 2018
Event: 4th Workshop on Noisy User-generated Text (W-NUT): At EMNLP 2018 - Brussels, Belgium
Duration: 1 Nov 2018 - 1 Nov 2018


Workshop: 4th Workshop on Noisy User-generated Text (W-NUT)
Abbreviated title: W-NUT 2018


Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin’) are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.




  • Discovering and analysing lexical variation in social media text

    Student thesis: Doctoral Thesis


ID: 74866554