Bootstrapping word boundaries: A bottom-up corpus-based approach to speech segmentation

Paul Cairns, Richard Shillcock, Nick Chater, Joe Levy

Research output: Contribution to journal › Article › peer-review


Speech is continuous, and isolating meaningful chunks for lexical access is a nontrivial problem. In this paper we use neural network models and more conventional statistics to study the use of sequential phonological probabilities in the segmentation of an idealized phonological transcription of the London–Lund Corpus; these speech data are representative of genuine conversational English. We demonstrate, first, that the distribution of phonetic segments in English is an important cue to segmentation, and, second, that the distributional information is such that it might allow the infant, beginning with only a sensitivity to the statistics of subsegmental primitives, to bootstrap into a series of increasingly sophisticated segmentation competences, ending with an adult competence. We discuss the relation between the behavior of the models and existing psycholinguistic studies of speech segmentation. In particular, we confirm the utility of the Metrical Segmentation Strategy (Cutler & Norris, 1988) and demonstrate a route by which this utility might be recognized by the infant, without requiring the prior specification of categories like “syllable” or “strong syllable.”
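
As a rough illustration of how sequential probabilities can cue boundaries, the sketch below estimates the probability of each symbol given the one before it and posits a boundary wherever that probability dips to a local minimum. It is not the authors' model: the function names, the bigram estimate, the local-minimum rule, and the toy stream (letters standing in for phonetic segments) are all assumptions made for this example, and the paper itself works with neural networks and a transcription of the London–Lund Corpus.

```python
from collections import Counter


def transition_probs(stream):
    """Estimate P(next | current) from adjacent pairs in an unsegmented stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    # Count each symbol only where it has a successor.
    first_counts = Counter(stream[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(stream, probs):
    """Split the stream wherever the transition probability is a strict local minimum."""
    # tp[i] is the probability of moving from stream[i] to stream[i + 1].
    tp = [probs.get((a, b), 0.0) for a, b in zip(stream, stream[1:])]
    pieces, start = [], 0
    for i in range(1, len(tp) - 1):
        if tp[i] < tp[i - 1] and tp[i] < tp[i + 1]:
            pieces.append(stream[start:i + 1])
            start = i + 1
    pieces.append(stream[start:])
    return pieces


# Toy stream (hypothetical): three made-up "words" concatenated without gaps.
stream = "abcdefghidefabcghiabcdefghi"
print(segment(stream, transition_probs(stream)))
# → ['abc', 'def', 'ghi', 'def', 'abc', 'ghi', 'abc', 'def', 'ghi']
```

In this toy stream every within-word transition is fully predictable while transitions across word edges are not, so the dips line up exactly with the boundaries. Real conversational speech is far noisier, which is part of what the corpus study examines.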
Original language: English
Pages (from-to): 111-153
Number of pages: 43
Journal: Cognitive Psychology
Issue number: 2
Publication status: Published - Jul 1997

