Abstract
Large diachronic text corpora enable a data-based approach to the evolution and dynamics of language. The development of unsupervised methods for inferring semantic similarity, semantic change and polysemy from such large datasets means that, in addition to measuring orthographic similarity or counting frequencies (e.g., Petersen et al. 2012, Bochkarev et al. 2015), it is also possible to measure meaning and therefore semantic evolution over time. This has given rise to a body of work (notably scattered across multiple disciplines, and thus at times carried out in parallel) dealing with the postulation and evaluation of trends or laws in language dynamics relating to semantic change (e.g., Dubossarsky et al. 2016, Hamilton et al. 2016, Xu and Kemp 2015).
However, concerns have been raised regarding these corpus-based approaches. These arise from the inherent sampling biases of corpora (Pechenick et al. 2015), the influence of world events on the composition of topics in corpora (Chesley and Baayen 2010, Lijffijt et al. 2012, Szmrecsanyi 2016), and, most recently, methodological problems in diachronic applications of distributed semantics methods, whose results have been shown to be more closely tied to frequency (and frequency change) than previously assumed (Dubossarsky et al. 2017). Additionally, there is a lack of gold standard datasets for evaluating the performance of automatic semantic change measures (with the exception of some small test sets, e.g., Gulordava et al. 2011, Schlechtweg et al. 2017).
We review these developments and propose solutions to two of the aforementioned issues. We demonstrate a simple model capable of controlling for topical fluctuations in a corpus, and show that it accounts for a considerable amount of variance in diachronic word frequency changes. Furthermore, we discuss a tentative approach to controlling for potentially frequency-biased results of semantic change measures, and demonstrate its utility using simulations of change on artificially composed corpora, which provide a controlled test of the technique.
Original language | English |
---|---|
Pages | 494 |
Publication status | Published - 31 Aug 2018 |
Event | Societas Linguistica Europaea 51st annual meeting - Tallinn, Estonia; Duration: 29 Aug 2018 → 1 Sept 2018 |
Conference
Conference | Societas Linguistica Europaea 51st annual meeting |
---|---|
Country/Territory | Estonia |
City | Tallinn |
Period | 29/08/18 → 1/09/18 |