This paper describes the use of a low-dimensional vector representation of sentence acoustics to control the output of a feed-forward deep neural network text-to-speech system on a sentence-by-sentence basis. Vector representations for sentences in the training corpus are learned during network training along with other parameters of the model. Although the network is trained on a frame-by-frame basis, the standard frame-level inputs representing linguistic features are supplemented by features from a projection layer which outputs a learned representation of sentence-level acoustic characteristics. The projection layer contains dedicated parameters for each sentence in the training data which are optimised jointly with the standard network weights. Sentence-specific parameters are optimised on all frames of the relevant sentence -- these parameters therefore allow the network to account for sentence-level variation in the data which is not predictable from the standard linguistic inputs. Results show that the global prosodic characteristics of synthetic speech can be controlled simply and robustly at run time by supplementing basic linguistic features with sentence-level control vectors which are novel but designed to be consistent with those observed in the training corpus.
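The arrangement described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the layer sizes, the single tanh hidden layer, the squared-error loss, the names (`projection`, `forward`, `control_gradient`, `update_sentence_vector`) and the interpolation used to build a new control vector are all assumptions made for illustration. It shows one dedicated row of projection-layer parameters per training sentence, the same vector appended to every frame of that sentence, and a gradient for that row accumulated over all of the sentence's frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
N_SENTENCES = 4    # sentences in a toy training corpus
LING_DIM = 6       # frame-level linguistic features
CONTROL_DIM = 2    # low-dimensional sentence-level control vector
HIDDEN_DIM = 8     # one hidden layer, for brevity
ACOUSTIC_DIM = 3   # acoustic features predicted per frame

# Projection layer: one dedicated parameter row per training sentence.
projection = rng.normal(scale=0.1, size=(N_SENTENCES, CONTROL_DIM))
# Standard network weights, shared by all sentences.
W1 = rng.normal(scale=0.1, size=(LING_DIM + CONTROL_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, ACOUSTIC_DIM))
b2 = np.zeros(ACOUSTIC_DIM)


def forward(linguistic, control):
    """Map (n_frames, LING_DIM) inputs plus one sentence vector to acoustics."""
    n_frames = linguistic.shape[0]
    # The same sentence-level vector supplements every frame of the sentence.
    tiled = np.tile(control, (n_frames, 1))
    x = np.concatenate([linguistic, tiled], axis=1)
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ W2 + b2


def control_gradient(linguistic, control, target):
    """Gradient of 0.5 * sum of squared errors w.r.t. the sentence vector."""
    n_frames = linguistic.shape[0]
    tiled = np.tile(control, (n_frames, 1))
    x = np.concatenate([linguistic, tiled], axis=1)
    hidden = np.tanh(x @ W1 + b1)
    error = (hidden @ W2 + b2) - target
    d_pre = (error @ W2.T) * (1.0 - hidden ** 2)
    d_x = d_pre @ W1.T
    # Summing over frames: every frame of the sentence contributes
    # to the update of that sentence's dedicated parameters.
    return d_x[:, LING_DIM:].sum(axis=0)


def update_sentence_vector(sentence_id, linguistic, target, lr=0.1):
    """One gradient step on a single sentence's projection row (in place)."""
    grad = control_gradient(linguistic, projection[sentence_id], target)
    projection[sentence_id] -= lr * grad


def interpolate_control(id_a, id_b, weight=0.5):
    """Assumed way to build a new vector that stays near the learned ones."""
    return (1.0 - weight) * projection[id_a] + weight * projection[id_b]
```

At synthesis time, under the same assumptions, a call such as `forward(linguistic, interpolate_control(0, 1))` would generate acoustic frames for a new sentence with a control vector that was never seen in training but lies between two learned ones. A full training loop would also update `W1`, `b1`, `W2` and `b2` jointly with the projection rows; that part is omitted here.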
|Title of host publication||INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association|
|Publisher||International Speech Communication Association|
|Number of pages||5|
|Publication status||Published - 30 Sep 2015|
Listening test materials for "Sentence-level control vectors for deep neural network speech synthesis"
Watts, O. (Creator), School of Informatics, 10 Jun 2015