The proposal for this project concerned novel acoustic models for
automatic speech recognition (ASR), and in particular a model that we
had been working on up to that point: the Linear Dynamical Model
(LDM). This is a continuous hidden state model, in contrast to the
single discrete hidden variable of the Hidden Markov Model (HMM) on
which conventional systems are based. A continuously-valued state may
be a better match to the way speech is produced, through the
continuous motions of the articulators (tongue, lips, etc.). This
project extended an earlier version of such a model.
Work on LDMs
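For reference, the LDM can be written in the standard state-space form
(the notation below is a conventional choice, not a quotation from our
papers):
\begin{align*}
x_t &= F x_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q)\\
y_t &= H x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R)
\end{align*}
where $x_t$ is the continuous hidden state, $y_t$ the observation,
$F$ the state evolution matrix, $H$ the observation matrix projecting
the state into the observation space, and $Q$ and $R$ the state and
observation noise covariances.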
In order to understand one of the primary research directions of this
project, we note two important points regarding the LDM observation
process as formulated above. Firstly, the observation distribution
does not have diagonal covariance: it is made up of a projection from
a lower-dimensional state via the matrix $H$, combined with the
observation noise covariance (which may be diagonal or full). In this
way, the LDM incorporates a model of the dependencies between
observation dimensions, in addition to the model of underlying
dynamics. Secondly, the output distribution is uni-modal. Knowledge of
speech parameters and experience from HMMs tell us that using a
multi-modal distribution should be of great benefit. However, simply
producing a mixture LDM (with mixture distributions on the noise, or
posing another parameter such as $H$ as a mixture) would lead to a
model in which exact inference is computationally infeasible, since
the number of mixture-component sequences to be summed over grows
exponentially with the length of the utterance.
In this project we proposed a switching structure in order to
approximate a multi-modal distribution whilst retaining a
computationally practical model. In order to do this, and to make
efficient use of parameters, it became apparent that two further
techniques would first be required.
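As a concrete illustration of the first point, the following sketch
(Python/NumPy, with purely illustrative dimensions and parameter
values) computes the observation covariance implied by projecting a
low-dimensional state through $H$; it is full even when the
observation noise itself is diagonal:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d_state, d_obs = 4, 13                          # illustrative sizes only
H = rng.standard_normal((d_obs, d_state))       # observation matrix
P = np.eye(d_state)                             # state covariance
R = np.diag(rng.uniform(0.5, 1.5, size=d_obs))  # diagonal noise covariance

# Implied observation covariance: the projection H P H^T couples the
# observation dimensions, so S is full even though R is diagonal.
S = H @ P @ H.T + R
off_diag = S - np.diag(np.diag(S))
print(np.abs(off_diag).max())                   # clearly non-zero
\end{verbatim}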
We had found considerable benefit from using a full-covariance
observation noise model, though in a switching setting this would
clearly lead to over-parameterisation and problems of data
sparsity. To work around this problem, we derived a flexible
parameter-tying method based on factorisations of precision matrices.
Factoring precision matrices for LDMs
The first technique we developed for LDMs was that of factoring
precision matrices. The motivation was to provide a flexible and
general way in which to tie the observation noise covariance
matrices.
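The precise factorisation we derived is given in the project's
publications; the sketch below (Python/NumPy) shows one representative
basis-superposition scheme, used here only as a hedged illustration of
the general idea: each class's precision matrix is a positive
combination of basis matrices shared across all classes, so a class
costs only as many parameters as there are weights:
\begin{verbatim}
import numpy as np

def tied_precision(weights, bases):
    # Precision matrix for one class as a weighted sum of shared,
    # symmetric basis matrices (shape (k, d, d)). Only the k weights
    # are class-specific, rather than d*(d+1)/2 free parameters.
    return np.einsum('k,kij->ij', weights, bases)

# Toy example: two shared bases (hypothetical values), one class.
d = 3
bases = np.stack([np.eye(d), np.eye(d) + 0.1 * np.ones((d, d))])
class_weights = np.array([0.7, 0.3])
P = tied_precision(class_weights, bases)   # full, non-diagonal precision
\end{verbatim}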
LDM parameter adaptation
We developed adaptation of the observation process with a view to
minimising differences in the hidden state trajectories between
speakers, so that the LDM can learn the underlying, speaker-independent
state dynamics of each class being modelled.
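The estimation formulae are given in our publications. Purely to
illustrate where such an adaptation acts (the MLLR-style affine
transform below is an assumption made for the sake of the sketch, not
a statement of our exact method), a per-speaker transform of the
observation process leaves the state dynamics untouched:
\begin{verbatim}
import numpy as np

def adapt_observation_process(H, A, b):
    # Hypothetical speaker adaptation: an affine transform (A, b),
    # estimated per speaker, is folded into the observation process,
    # so the model emits y_t = (A H) x_t + b + v_t while the state
    # dynamics (F, Q) remain speaker-independent.
    return A @ H, b

d_state, d_obs = 4, 13            # illustrative sizes only
H = np.ones((d_obs, d_state))
A = np.eye(d_obs)                 # per-speaker transform (identity here)
b = np.zeros(d_obs)               # per-speaker bias
H_spk, b_spk = adapt_observation_process(H, A, b)
\end{verbatim}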
Changing formalism: discrete-state Dynamic Bayesian Networks
After working on these techniques, we decided to move away from LDMs
in favour of dynamic Bayesian networks (DBNs) with discrete hidden
variables. It should be noted that the techniques described above will
still find applications, because LDMs are widely used in other areas
of machine learning and in control.
Work on DBNs
The common thread through this work is the incorporation of knowledge
of speech production into model structure, using data-driven methods
as far as possible (i.e. using data to learn all model parameters, and
also to learn some aspects of model structure). By factoring the state
and/or observation processes into a set of discrete-valued articulatory
features, we aim to build a system which can encode distinctions which
are difficult to express with a phone-based representation
(e.g. coarticulation effects such as nasalisation of vowels).
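A minimal sketch (Python; the feature bundles are hypothetical) of why
such distinctions are easy to encode in a factored representation: a
nasalised vowel differs from its oral counterpart in a single feature
value, whereas a phone-based representation would need a separate
phone symbol for it:
\begin{verbatim}
# Hypothetical feature bundles, for illustration only.
oral_vowel      = {'manner': 'vowel', 'voicing': '+', 'nasality': '-'}
nasalised_vowel = {**oral_vowel, 'nasality': '+'}

# The two differ in exactly one feature; a phone set would need an
# extra symbol to make the same distinction.
changed = {f for f in oral_vowel if oral_vowel[f] != nasalised_vowel[f]}
print(changed)   # {'nasality'}
\end{verbatim}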
Restricting ourselves to models where the only hidden variables are
discrete means that, if desired, the model could be rewritten as a
conventional HMM. This would obscure the true structure of the state
space and remove many possibilities for efficient parameterisation
(e.g. tying) of the model. However, if a fully trained DBN were
``flattened'' into an HMM in this way, it would then be usable with a
conventional HMM-based decoder.
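To make the flattening concrete, here is a minimal sketch (Python; the
feature inventory is hypothetical) of how a factored state space
expands into a flat HMM state set:
\begin{verbatim}
import itertools

# Hypothetical articulatory feature inventory, for illustration only.
features = {
    'voicing': ['voiced', 'voiceless'],
    'manner':  ['stop', 'fricative', 'nasal', 'approximant', 'vowel'],
    'place':   ['labial', 'alveolar', 'velar'],
}

# "Flattening": every combination of feature values becomes a single
# HMM state. The factored structure (and the tying it allows) is lost,
# but the result is usable with a conventional HMM-based decoder.
hmm_states = list(itertools.product(*features.values()))
print(len(hmm_states))   # 2 * 5 * 3 = 30 composite states
\end{verbatim}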
Our work has focused on the task of articulatory feature (AF)
recognition, where we have demonstrated substantial improvements over
the previous state-of-the-art. One advance we have made is to develop
an embedded training scheme which reduces the dependence on
phone-derived feature labels, and allows a set of asynchronous feature
changes to be learned from data.
We have also applied articulatory feature-based methods via hybrid and
tandem approaches, in a Johns Hopkins 2006 Summer Workshop project in
which King was a senior team member.