Abstract
Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early systems focused on identifying isolated recurring terms in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. To our knowledge, this article presents the first such system evaluated on large-vocabulary multi-speaker data. The system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). We show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system's discovered clusters are still less pure than those of two multi-speaker term discovery systems, but provide far greater coverage.
Original language | English |
---|---|
Pages (from-to) | 154-174 |
Number of pages | 21 |
Journal | Computer Speech and Language |
Volume | 46 |
Early online date | 18 May 2017 |
Publication status | Published - 1 Nov 2017 |
Fingerprint
Dive into the research topics of 'A segmental framework for fully-unsupervised large-vocabulary speech recognition'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Word segmentation from noisy data with minimal supervision.
  Goldwater, S. (Principal Investigator)
  24/01/11 → 23/04/14
  Project: Research
Profiles
- Sharon Goldwater
  - School of Informatics - Personal Chair of Computational Language Learning
  - Institute of Language, Cognition and Computation
  - Language, Interaction, and Robotics
  Person: Academic: Research Active