Automatically-determined Unit inventories for automatic speech recognition

Project Details


Conventional automatic speech recognition (ASR) systems generally
model speech as a sequence of phones, using Hidden Markov Models
(HMMs) of those phones. The acoustic realisation of phonemes depends
heavily on surrounding context, so context-dependent models are
used. Phonemes are a useful linguistic device, but are they the best
unit for ASR?

An alternative view of speech is as a set of parallel feature streams,
such as phonetic features (vowel height, frication, voicing,
etc.). Contextual effects are more simply described in terms of such
features. Phonetic features (or indeed direct articulatory
measurements) do not change value synchronously at phone
boundaries. However, these effects are not easily described, or
accounted for, using conventional phone-sized units, and conventional
HMMs cannot generate asynchronous observation streams.
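
The idea of parallel feature streams that change value away from phone boundaries can be sketched in a few lines of Python. This is only an illustration, not the project's own representation: the word, the frame numbers, the feature names and their values are all invented for the example, and the helper names (`boundaries`, `asynchronous_boundaries`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# One segment of a stream: (start_frame, end_frame, value).
Segment = Tuple[int, int, str]

# Hypothetical phone segmentation of the word "man" (frame numbers invented).
phones: List[Segment] = [(0, 10, "m"), (10, 25, "a"), (25, 35, "n")]

# Hypothetical parallel feature streams covering the same 35 frames.
# Nasality spreads into the vowel from both neighbouring nasal consonants,
# so its change points fall inside the vowel rather than at phone boundaries.
streams: Dict[str, List[Segment]] = {
    "voicing": [(0, 35, "voiced")],
    "nasality": [(0, 12, "nasal"), (12, 20, "oral"), (20, 35, "nasal")],
    "lips": [(0, 9, "closed"), (9, 35, "open")],
}


def boundaries(segments: List[Segment]) -> Set[int]:
    """Return the internal change points (frame indices) of one stream."""
    return {end for (_, end, _) in segments[:-1]}


def asynchronous_boundaries(
    phones: List[Segment], streams: Dict[str, List[Segment]]
) -> Dict[str, List[int]]:
    """For each stream, list change points that do not fall on a phone boundary."""
    phone_bounds = boundaries(phones)
    return {
        name: sorted(boundaries(segs) - phone_bounds)
        for name, segs in streams.items()
    }


if __name__ == "__main__":
    for name, frames in asynchronous_boundaries(phones, streams).items():
        print(f"{name}: change points off the phone boundaries at frames {frames}")
```

In this toy example every change in the nasality and lip streams falls away from the phone boundaries at frames 10 and 25, which is the kind of asynchrony that a single sequence of phone-sized units cannot express directly.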

Two things are required: an acoustic model which can account for
asynchrony and an inventory of appropriate units to model. In this
project we build on previous work which investigated the first
requirement, and now address the second. The project will investigate
methods for optimising the inventory of units to be modelled in
conjunction with alternative acoustic models to the HMM. This will
allow such models to realise their true potential, which until now has
been restricted by the use of unit inventories best suited to the HMM
(e.g. triphones).
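
For readers unfamiliar with triphones, the following minimal sketch shows how a phone string is expanded into context-dependent units, and why such an inventory grows quickly. It assumes the widely used "left-centre+right" labelling and a `sil` padding symbol at utterance edges; the function names and the example phone string are hypothetical and not taken from the project.

```python
from typing import List


def to_triphones(phones: List[str], pad: str = "sil") -> List[str]:
    """Expand a phone string into triphone labels of the form left-centre+right.

    The utterance is padded with `pad` on both sides so that the first and
    last phones also receive a left and right context.
    """
    padded = [pad] + phones + [pad]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def max_inventory_size(n_phones: int) -> int:
    """Upper bound on distinct triphones: every phone in every left/right context."""
    return n_phones ** 3


if __name__ == "__main__":
    print(to_triphones(["k", "ae", "t"]))
    print(max_inventory_size(40))
```

Each phone becomes a separate unit for every pair of neighbours it can have, so the unit boundaries remain fixed at the phone boundaries even though the number of units grows cubically with the size of the phone set.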

Layman's description

In order to make computers recognise speech, it is usual to use statistical models. These models are learned from recorded speech data - a process called 'training'.
It is necessary to use models of sub-word units. This is because the training data cannot contain a recording of every possible word that might be encountered later, when the system is used. Although there is a conventional set of sub-word units that is most commonly used, called 'phonemes', these are not entirely satisfactory for automatic speech recognition because they are very context-sensitive: they are said differently depending on what sounds surround them. This project sought a better set of sub-word units, and tried to find this set automatically, rather than design it by hand.

Key findings

The direction the project took was to examine an alternative way to represent continuous speech - instead of a string of phonemes, we used a set of parallel, overlapping features known as acoustic-articulatory features. These have the power to represent the way that phonemes change depending on the context. The key findings in the project were improved methods for recognising this representation from speech signals, and the commencement of a new line of research on Dynamic Bayesian Networks for modelling this representation.
Effective start/end date: 1/01/03 to 31/12/05


  • EPSRC: £63,684.00

