A segmental framework for fully-unsupervised large-vocabulary speech recognition

Herman Kamper, Aren Jansen, Sharon Goldwater

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early systems focused on identifying isolated recurring terms in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. To our knowledge, this article presents the first such system evaluated on large-vocabulary multi-speaker data. The system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). We show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system's discovered clusters are still less pure than those of two multi-speaker term discovery systems, but provide far greater coverage.
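
The abstract describes mapping a variable-length sequence of feature frames to a single fixed-dimensional embedding, but does not say which mapping is used. As a minimal illustrative sketch only, the code below assumes one simple option, uniform downsampling: pick a fixed number of evenly spaced frames and concatenate them. The function name `downsample_embed`, the parameter `n_samples`, and the toy segments are hypothetical and are not taken from the article.

```python
from typing import List, Sequence


def downsample_embed(
    frames: Sequence[Sequence[float]], n_samples: int = 10
) -> List[float]:
    """Map a variable-length sequence of frames to one fixed-length vector.

    Assumption for illustration: uniform downsampling. The function picks
    n_samples frames at evenly spaced positions and concatenates them, so
    the output always has n_samples * frame_dim values.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    n_frames = len(frames)
    embedding: List[float] = []
    for k in range(n_samples):
        # Evenly spaced index in [0, n_frames - 1]; short segments repeat frames.
        idx = round(k * (n_frames - 1) / (n_samples - 1)) if n_samples > 1 else 0
        embedding.extend(frames[idx])
    return embedding


# Two hypothetical segments of different durations, with 3-dimensional frames.
short_segment = [[float(t), 0.0, 1.0] for t in range(7)]
long_segment = [[float(t), 0.0, 1.0] for t in range(53)]

short_embedding = downsample_embed(short_segment, n_samples=5)
long_embedding = downsample_embed(long_segment, n_samples=5)
print(len(short_embedding), len(long_embedding))  # prints "15 15"
```

Because both segments end up in the same 15-dimensional space regardless of duration, they can be compared or clustered directly as vectors, which is the property a segmental word representation relies on.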
Original language: English
Pages (from-to): 154-174
Number of pages: 21
Journal: Computer Speech and Language
Early online date: 18 May 2017
Publication status: Published - 1 Nov 2017

