Urban Dictionary Embeddings for Slang NLP Applications

Steve Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, Gareth Tyson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection where we expect to require some knowledge of colloquial language on social media data, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.
Original language: English
Title of host publication: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)
Publisher: European Language Resources Association (ELRA)
Pages: 4764–4773
Number of pages: 10
ISBN (Electronic): 979-10-95546-34-4
Publication status: Published - 16 May 2020
Event: 12th Language Resources and Evaluation Conference - Le Palais du Pharo, Marseille, France
Duration: 11 May 2020 – 16 May 2020
Conference number: 12
https://lrec2020.lrec-conf.org/en/

Conference

Conference: 12th Language Resources and Evaluation Conference
Abbreviated title: LREC 2020
Country: France
City: Marseille
Period: 11/05/20 – 16/05/20

Keywords

  • word embeddings
  • urban dictionary
  • slang
  • sentiment
  • sarcasm
