Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
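The Pitman-Yor process mentioned above can be simulated via its Chinese restaurant seating scheme: each token either joins an existing type with probability proportional to that type's count minus a discount, or creates a new type. The following is a minimal illustrative sketch (not the paper's code); the function name and parameter defaults are hypothetical choices for demonstration.

```python
import random

def pitman_yor_sample(n, d=0.5, theta=1.0, seed=0):
    """Draw n tokens from a Pitman-Yor Chinese restaurant process.

    d: discount parameter (0 <= d < 1); theta: concentration (theta > -d).
    Returns a list of per-type token counts. For d > 0, the sorted counts
    exhibit power-law behavior, matching natural-language token frequencies.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of tokens assigned to type k
    total = 0     # total tokens seated so far
    for _ in range(n):
        k = len(counts)
        # New type with probability (theta + d*k) / (theta + total).
        if total == 0 or rng.random() < (theta + d * k) / (theta + total):
            counts.append(1)
        else:
            # Otherwise pick existing type i with prob proportional to counts[i] - d.
            r = rng.random() * (total - d * k)
            acc = 0.0
            for i, c in enumerate(counts):
                acc += c - d
                if r < acc:
                    counts[i] += 1
                    break
        total += 1
    return counts
```

With d = 0 this reduces to the Dirichlet process; positive discounts d yield the heavier-tailed, power-law type distributions the paper exploits.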
Title of host publication: Advances in Neural Information Processing Systems 18
Editors: Y. Weiss, B. Schölkopf, J. Platt
Place of publication: Cambridge, MA
Number of pages: 8
Publication status: Published - 2006