Streamed models for automatic speech recognition (EPSRC Advanced Research Fellowship)

Project Details

Description

Automatic transcription of careful speech (such as read text or most
broadcast news) recorded in a benign environment (e.g. without
background noise) is already accurate enough (say, 5-20% word error
rate) to support tasks such as summarisation, indexing or information
extraction. However, automatic recognition of other types of speech is
an unsolved problem: for example, speech from human-human interaction
(as opposed to human-machine interaction); speech found in situations
such as dialogues, conference calls, meetings and so on; and speech in
the presence of background noise or speech that has passed through a
noisy channel. These types of speech exhibit properties that cause
major problems for the current standard technique: hidden Markov
models (HMMs) with simple spectral-envelope features. These properties
all have something in common: the signal is no longer well modelled as
a single Markov process, because the observed signal contains
information from multiple processes. Continuing to push HMM research,
with its ever-decreasing incremental improvements in accuracy, is not
adding to our understanding of this problem.
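To make the single-process assumption concrete (this is the standard textbook formulation, not notation taken from the project itself): an HMM with hidden states q_t and acoustic observation vectors o_t assumes the joint factorisation

    p(o_{1:T}, q_{1:T}) = \prod_{t=1}^{T} p(q_t \mid q_{t-1}) \, p(o_t \mid q_t)

(with p(q_1 \mid q_0) read as the initial state distribution), so every observed frame o_t is explained by exactly one state of a single Markov chain. When the signal mixes several sources, such as overlapping talkers or speech plus background noise, no single chain can track them all simultaneously.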

I will view the acoustic environment as being generated by multiple asynchronous processes, and will model those processes explicitly. I therefore propose to investigate a family of models that describe the observed signal as having been generated by more than one process, using what I will call streams. A stream may be observed or hidden, and may comprise observations, hidden states or any other group of variables in a graphical model. Members of this family have the potential for significantly lower error rates than the techniques currently used for speech recognition.
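A well-known member of this family is the factorial HMM of Ghahramani & Jordan, given here purely as an illustrative sketch of the stream idea rather than as the particular model the project commits to. With M hidden Markov chains (streams) q_t^{(1)}, \ldots, q_t^{(M)}, each with its own dynamics, the joint distribution becomes

    p(o_{1:T}, q_{1:T}^{(1:M)}) = \prod_{t=1}^{T} \Big[ \prod_{m=1}^{M} p\big(q_t^{(m)} \mid q_{t-1}^{(m)}\big) \Big] \, p\big(o_t \mid q_t^{(1)}, \ldots, q_t^{(M)}\big)

Each stream evolves asynchronously under its own transition model, and the observed frame at time t depends jointly on the states of all streams.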

Layman's description

This was a long-term, basic research project on automatic speech recognition: the transcription of speech into a sequence of words by computer. Whilst advances continue to be made in conventional approaches to this problem, this research fellowship allowed me to examine a variety of novel approaches.

Key findings

This fellowship allowed me to significantly expand the scope of my research in speech recognition and begin to apply my ideas to the problem of speech synthesis. The large number of publications arising from the fellowship illustrates the range of problems I tackled. Key achievements include:
* improved extraction of articulatory features from speech
* successful application of phonetic features to Tandem-based ASR, within and across languages
* work on continuous-state models (linear dynamic models) for ASR (see the sketch after this list)
* work on HMM-based speech synthesis
* running the Blizzard Challenge series of speech synthesis evaluations
* work on spoken term detection
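
For the continuous-state work mentioned above, the linear dynamic model can be written in its standard state-space form (the generic formulation, not the project's specific parameterisation):

    x_t = A x_{t-1} + \eta_t,       \eta_t \sim \mathcal{N}(0, Q)
    o_t = C x_t + \epsilon_t,       \epsilon_t \sim \mathcal{N}(0, R)

Here the hidden state x_t is a continuous vector rather than a discrete symbol, so the model can represent smooth trajectories, for example of articulator or spectral-envelope parameters, instead of the piecewise-constant hidden state of an HMM.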
Status: Finished
Effective start/end date: 1/01/05 – 31/12/09

Funding

  • EPSRC: £254,954.00

Research outputs

  • Dines, J., Yamagishi, J. & King, S. (2009). Measuring the gap between HMM-based ASR and TTS. In Interspeech 2009: 10th Annual Conference of the International Speech Communication Association, pp. 1391-1394. Conference contribution; open access.
  • Yamagishi, J., Usabaev, B., King, S., Watts, O., Dines, J., Tian, J., Hu, R., Guan, Y., Oura, K., Tokuda, K., Karhila, R. & Kurimo, M. (2009). Thousands of voices for HMM-based speech synthesis. In Proceedings of Interspeech 2009: 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, pp. 420-423. Conference contribution; open access.
  • Wang, D., Tejedor, J., Frankel, J., King, S. & Colas, J. (2008). A comparison of grapheme and phoneme-based units for Spanish spoken term detection. Speech Communication, 50(11-12), pp. 980-991. Journal article; peer-reviewed.