Interpolating between types and tokens by estimating power-law generators

Sharon Goldwater, Tom Griffiths, Mark Johnson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
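To illustrate the kind of adaptor the abstract describes, the sketch below samples tokens from a Pitman-Yor process via its Chinese-restaurant construction, which yields type frequencies that approximately follow a power law. This is an illustrative sketch only, not the paper's implementation; the function name and parameters (`pitman_yor_sample`, `discount`, `concentration`, `base_draw`) are assumptions chosen for clarity.

```python
import random
from collections import Counter

def pitman_yor_sample(n_tokens, discount=0.5, concentration=1.0,
                      base_draw=lambda: random.random()):
    """Draw n_tokens via the Chinese-restaurant construction of a Pitman-Yor process.

    Each new token either reuses an existing type (with probability proportional
    to that type's count minus the discount) or is a new type drawn from the base
    generator (with probability proportional to concentration + discount * #types).
    """
    counts = Counter()
    n = 0
    for _ in range(n_tokens):
        k = len(counts)
        p_new = (concentration + discount * k) / (concentration + n)
        if n == 0 or random.random() < p_new:
            word = base_draw()  # new type from the base distribution
        else:
            # Reuse an existing type w with probability (count_w - discount) / (concentration + n).
            types, weights = zip(*[(w, c - discount) for w, c in counts.items()])
            word = random.choices(types, weights=weights)[0]
        counts[word] += 1
        n += 1
    return counts

# The sorted type frequencies exhibit the heavy-tailed, power-law-like pattern
# that standard generative models fail to produce on their own.
freqs = sorted(pitman_yor_sample(10000).values(), reverse=True)
print(freqs[:10])
```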
Original language: English
Title of host publication: Advances in Neural Information Processing Systems 18
Editors: Y. Weiss, B. Schölkopf, J. Platt
Place of publication: Cambridge, MA
Publisher: MIT Press
Pages: 459-466
Number of pages: 8
Publication status: Published - 2006
