Edinburgh Research Explorer

Data-driven articulatory modelling: foundations for a new generation of speech synthesis

Project: Research

Status: Finished
Effective start/end date: 1/11/06 to 30/04/10
Total award: £358,572.00
Funding organisation: EPSRC
Funder project reference: EP/E027741/1
Period: 1/11/06 to 30/04/10

Key findings

At the time this project began, the dominant method for text-to-speech synthesis, whereby a computer is made to convert text to audible "artificial" speech, was called "unit selection". This method relies on "gluing" together fragments of speech carefully chosen from several hours of recordings of a real human talking. The benefits of the approach are its simplicity and the fact that the result sounds exactly like the human who made the original recordings. The downsides, though, are the limited scope for changing the qualities of the synthesised speech, and the expense of making large numbers of high-quality audio recordings for each new synthetic voice.
This project has pursued a different approach to synthesising speech, which is generally termed "statistical parametric synthesis". Instead of merely gluing together pre-recorded snippets of speech, this approach applies powerful statistical models to examples of spoken sentences in order to learn how to produce new speech. For example, the model will learn how the underlying sounds of English combine to produce a word, and how words combine to produce a natural-sounding sentence. Over the course of this project, this approach has rapidly gained popularity, and research in this direction has intensified around the world.
However, though the new statistical models do offer a great deal more flexibility in theory, in practice they are hugely complex and can be unwieldy to control, so exploiting that flexibility can be difficult. The major aim of this project has been to address this problem and to find ways to incorporate extra information into the statistical model, which can in turn be used to control and manipulate synthetic speech in a straightforward, transparent way. Specifically, we have sought to incorporate information about the human speech production mechanism (i.e. the articulators, such as the tongue, lips and jaw, and the vocal cords).
The key findings of this project are the ways in which this has been shown to be achievable. First, it has been demonstrated that knowledge about the vibrations of the vocal cords can be incorporated in order to improve the voice quality of the synthesised speech, as well as offering explicit control over how the voice sounds. Second, it has been found possible to incorporate information about movements of the mouth, and then to control synthesis in terms of mouth movements. As examples of this control, we have demonstrated changing one sound to another, thus changing the identity of a word, for example changing "bed" to "bad", simply by changing the position of the model's "tongue". We have also shown it is possible to create speech sounds that are completely new and which match the general quality of the synthetic voice. This means the accent of the speech synthesiser may be modified, or foreign sounds may be incorporated seamlessly into the synthetic speech, allowing the synthesiser to speak in multiple languages and accents with the same voice.
Though this project has dealt primarily with articulatory data as the extra information, the general approach has since been expanded to work with other representations. For example, work is currently being undertaken to capture the noise environment as additional information for the synthesiser to use. When humans talk in a noisy environment, they are known to change the way they speak. Ideally, a computer synthesiser will do the same to make synthetic speech more intelligible in varying noise conditions.
Beyond speech synthesis alone, an additional key finding of the project is a method to accurately predict articulatory movements from text. This is especially useful, for example, in applications such as animated talking heads and computer animation for films and games.
As a final key finding, the research work conducted during this project has confirmed both how useful articulatory data is and how difficult it is to collect in large quantities. Because of this difficulty, only small amounts of articulatory data have previously been released to the research community. Therefore, data recorded as part of this project has been released at no cost to share with other researchers worldwide.

Research outputs