Abstract
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
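
The two-stage idea in the abstract (a generator that proposes word types, plus an adaptor that decides how often each is reused) can be illustrated with a minimal sketch. It assumes the standard Chinese restaurant process representation of the Pitman-Yor process, with discount `a` and concentration `b`; the names `pitman_yor_adaptor` and `toy_generator`, the toy vocabulary, and the default parameter values are hypothetical choices for illustration, not the paper's implementation.

```python
import random
from collections import Counter


def toy_generator(rng):
    """Hypothetical generator: draws a word type uniformly from a toy vocabulary."""
    return f"w{rng.randrange(10_000)}"


def pitman_yor_adaptor(generator, n_tokens, a=0.5, b=1.0, rng=None):
    """Sample tokens from a generator wrapped in a Pitman-Yor adaptor.

    Assumes 0 <= a < 1 and b > -a. Each "table" holds one label drawn from
    the generator; each token is the label of the table it is seated at.
    """
    rng = rng or random.Random()
    table_counts = []  # number of tokens seated at each table
    table_labels = []  # word type assigned to each table
    tokens = []
    for n in range(n_tokens):
        k = len(table_counts)
        # Open a new table with probability (k*a + b) / (n + b).
        if k == 0 or rng.random() * (n + b) < k * a + b:
            table_counts.append(1)
            table_labels.append(generator(rng))
            tokens.append(table_labels[-1])
        else:
            # Otherwise join table i with probability proportional to n_i - a.
            weights = [count - a for count in table_counts]
            i = rng.choices(range(k), weights=weights)[0]
            table_counts[i] += 1
            tokens.append(table_labels[i])
    return tokens, table_counts, table_labels


if __name__ == "__main__":
    tokens, _, _ = pitman_yor_adaptor(toy_generator, 5_000, rng=random.Random(0))
    # Print the most frequent word types with their token counts.
    for word, freq in Counter(tokens).most_common(10):
        print(word, freq)
```

In this sketch the generator is uniform, so any skew in the token frequencies comes from the adaptor's reuse of existing tables. Raising `a` toward 1 opens new tables more often, which yields more word types and a heavier tail.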
Original language | English |
---|---|
Title of host publication | Advances in Neural Information Processing Systems 18 |
Editors | Y. Weiss, B. Schölkopf, J. Platt |
Place of Publication | Cambridge, MA |
Publisher | MIT Press |
Pages | 459-466 |
Number of pages | 8 |
Publication status | Published - 2006 |