We propose to replace three components of a typical concatenative
speech synthesiser: the text selection algorithm (what to record for
the database), the target cost function (which units to select from
the database) and the backoff strategy (what to do when the database does
not contain the desired unit).
These components are currently designed independently, using human
intuition. This is difficult, requires expert knowledge, and means
that each component is unlikely to be optimised with respect to the
others. We propose to
base these three components on a single underlying model. The
model will learn, from data, which speech units are perceptually
interchangeable. This information will then be used by the target
cost function and the backoff strategy, and when selecting the text to be recorded. The
proposed techniques will be implemented in the Festival 2 speech
synthesis system and evaluated using formal listening tests.
Speech synthesis is the process of converting written language (i.e., text) into spoken language (i.e., speech) by computer. It is a hard problem because the text does not include all the information needed to create the corresponding speech. At the time this project was carried out, the standard method for creating speech was to concatenate small fragments of recorded natural speech and play them back as new sentences. Selecting which fragments to use is the hard part, and involves three considerations: 1) which recordings to start from, 2) which fragments to select, and 3) what to do when a required fragment is missing from the recordings.
The project aimed to solve all three problems at once, rather than using separate, sub-optimal solutions, as was normal at that time.
Aside from the problem that text does not contain all the information needed to produce speech, the other reason speech synthesis is hard is that we are trying to convince human listeners that the speech we produce is natural, even though there will be differences between the synthesised speech and a natural rendering of the same sentence. Human listeners are sensitive to some of these differences but cannot hear others. Standard synthesisers make surprisingly little use of this fact.
We devised a novel way to select the fragments (technical name: "units") from the database by training a classifier on judgements made by listeners. Our innovation was to incorporate information about listeners' abilities into the system.
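To illustrate the idea of driving unit selection from listener judgements, here is a minimal sketch in Python. It is not the project's classifier: the context features (`stress`, `syllable_position`, `phrase_final`), the toy judgements and the choice of logistic regression are all assumptions made for this example. The sketch learns how likely listeners are to hear a difference when a candidate unit's context mismatches the target's, then uses that probability as a target cost. When no exact match is available, the same cost serves as a backoff rule: choose the candidate whose mismatch listeners are least likely to notice.

```python
import math

# Hypothetical context features describing a unit; illustrative names only.
FEATURES = ("stress", "syllable_position", "phrase_final")

def ctx(stress, syllable_position, phrase_final):
    """Build a unit context as a plain dict."""
    return {"stress": stress,
            "syllable_position": syllable_position,
            "phrase_final": phrase_final}

def mismatch_vector(target, candidate):
    """1.0 for each feature on which the two contexts differ, else 0.0."""
    return [float(target[f] != candidate[f]) for f in FEATURES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(judgements, epochs=500, lr=0.5):
    """Fit logistic regression by stochastic gradient descent.

    Models P(listener hears a difference | which features mismatch).
    """
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for target, candidate, heard_difference in judgements:
            x = mismatch_vector(target, candidate)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - heard_difference
            bias -= lr * error
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias

def target_cost(model, target, candidate):
    """Predicted probability that a listener would notice the substitution."""
    weights, bias = model
    x = mismatch_vector(target, candidate)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def select_unit(model, target, candidates):
    """Backoff: pick the candidate least likely to be heard as different."""
    return min(candidates, key=lambda c: target_cost(model, target, c))

# Invented toy judgements: (target context, candidate context, heard a difference?)
# In this made-up data, stress mismatches are audible and the others are not.
judgements = [
    (ctx(1, "initial", 0), ctx(0, "initial", 0), 1),
    (ctx(1, "initial", 0), ctx(1, "initial", 1), 0),
    (ctx(0, "final", 0), ctx(1, "final", 0), 1),
    (ctx(0, "final", 1), ctx(0, "final", 0), 0),
    (ctx(1, "initial", 0), ctx(1, "initial", 0), 0),
    (ctx(0, "initial", 0), ctx(0, "final", 0), 0),
]

model = train(judgements)

# The database lacks an exact match for this target, so we back off.
target = ctx(1, "initial", 0)
candidates = [ctx(0, "initial", 0), ctx(1, "initial", 1)]
chosen = select_unit(model, target, candidates)
print(chosen)
```

In this toy setup the selector prefers the candidate that differs only in `phrase_final`, because the invented judgements say listeners do not hear that mismatch. The same learned costs could in principle also guide text selection, by showing which contexts have no perceptually interchangeable substitute and therefore need to be recorded.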
Effective start/end date: 1/04/07 → 30/06/10