Unsupervised extraction of recurring words from infant-directed speech

Fergus R. McInnes, Sharon Goldwater

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

To date, most computational models of infant word segmentation have worked from phonemic or phonetic input, or have used toy datasets. In this paper, we present an algorithm for word extraction that works directly from naturalistic acoustic input: infant-directed speech from the CHILDES corpus. The algorithm identifies recurring acoustic patterns that are candidates for identification as words or phrases, and then clusters together the most similar patterns. The recurring patterns are found in a single pass through the corpus using an incremental method, where only a small number of utterances are considered at once. Despite this limitation, we show that the algorithm is able to extract a number of recurring words, including some that infants learn earliest, such as Mommy and the child's name. We also introduce a novel information-theoretic evaluation measure.
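
The incremental search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the function names (`dtw_distance`, `find_recurring`, `cluster`), the one-dimensional toy "features" standing in for real acoustic feature vectors, the fixed window width, the distance threshold, the use of dynamic time warping for matching, and the union-find grouping used in place of the paper's clustering step.

```python
from collections import deque


def dtw_distance(a, b):
    """Length-normalised dynamic time warping distance between two float sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def windows(seq, width, step):
    """Yield (start, segment) for fixed-width windows over seq."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]


def find_recurring(utterances, buffer_size=3, width=4, step=1, threshold=0.1):
    """Single pass: each utterance is compared only with the last buffer_size utterances."""
    recent = deque(maxlen=buffer_size)  # hypothetical short-term buffer
    matches = []
    for idx, utt in enumerate(utterances):
        for prev_idx, prev in recent:
            for s1, w1 in windows(utt, width, step):
                for s0, w0 in windows(prev, width, step):
                    if dtw_distance(w1, w0) <= threshold:
                        matches.append(((prev_idx, s0), (idx, s1)))
        recent.append((idx, utt))
    return matches


def cluster(matches):
    """Group matched segments with union-find (a stand-in for the clustering step)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Toy data: the pattern [1, 2, 3, 4] recurs in three consecutive "utterances".
utterances = [
    [9.0, 1.0, 2.0, 3.0, 4.0, 7.0],
    [5.0, 6.0, 1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0, 20.0, 30.0],
]
matches = find_recurring(utterances, buffer_size=1, width=4, threshold=0.05)
clusters = cluster(matches)
```

With `buffer_size=1`, the first and third utterances are never compared directly, yet their shared segment ends up in one cluster because each matches the segment in the second utterance. This shows in miniature how a limited comparison window can still recover patterns that recur across the corpus.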
Original language: English
Title of host publication: Proceedings of the 33rd Annual Conference of the Cognitive Science Society
Publication status: Published - 2011

