Sentence-level control vectors for deep neural network speech synthesis

Oliver Watts, Zhizheng Wu, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

This paper describes the use of a low-dimensional vector representation of sentence acoustics to control the output of a feed-forward deep neural network text-to-speech system on a sentence-by-sentence basis. Vector representations for sentences in the training corpus are learned during network training along with other parameters of the model. Although the network is trained on a frame-by-frame basis, the standard frame-level inputs representing linguistic features are supplemented by features from a projection layer which outputs a learned representation of sentence-level acoustic characteristics. The projection layer contains dedicated parameters for each sentence in the training data which are optimised jointly with the standard network weights. Sentence-specific parameters are optimised on all frames of the relevant sentence -- these parameters therefore allow the network to account for sentence-level variation in the data which is not predictable from the standard linguistic inputs. Results show that the global prosodic characteristics of synthetic speech can be controlled simply and robustly at run time by supplementing basic linguistic features with sentence-level control vectors which are novel but designed to be consistent with those observed in the training corpus.
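The mechanism the abstract describes can be sketched as a small numerical example. The sketch below is not the paper's implementation; all dimensions, the single hidden layer, and the choice of a mean control vector at run time are illustrative assumptions. It shows the core idea: a dedicated per-sentence vector from a projection layer is tiled across all frames of a sentence and concatenated with the frame-level linguistic features, and at synthesis time a novel control vector can be supplied in its place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper): frame-level
# linguistic features, a low-dimensional sentence control vector,
# one hidden layer, and per-frame acoustic output.
LING_DIM, CTRL_DIM, HID_DIM, OUT_DIM = 100, 4, 64, 40
N_TRAIN_SENTENCES = 3

# Projection layer: one dedicated control vector per training sentence,
# optimised jointly with the network weights during training (here they
# are simply random placeholders).
sentence_vectors = rng.normal(scale=0.1, size=(N_TRAIN_SENTENCES, CTRL_DIM))

# Standard feed-forward weights (randomly initialised for illustration).
W1 = rng.normal(scale=0.1, size=(LING_DIM + CTRL_DIM, HID_DIM))
W2 = rng.normal(scale=0.1, size=(HID_DIM, OUT_DIM))

def forward(ling_frames, control_vector):
    """Predict acoustic frames: every frame's linguistic features are
    supplemented with the same sentence-level control vector."""
    n_frames = ling_frames.shape[0]
    ctrl = np.tile(control_vector, (n_frames, 1))  # same vector on each frame
    x = np.concatenate([ling_frames, ctrl], axis=1)
    h = np.tanh(x @ W1)
    return h @ W2

# Training time: all frames of one sentence use that sentence's own vector,
# so its gradient accumulates over the whole sentence.
frames = rng.normal(size=(200, LING_DIM))
y_train = forward(frames, sentence_vectors[1])

# Run time: a novel control vector, chosen to be consistent with those
# seen in training (here, their mean), steers the global prosodic
# character of the output for the same linguistic input.
novel_vector = sentence_vectors.mean(axis=0)
y_control = forward(frames, novel_vector)

print(y_train.shape, y_control.shape)  # → (200, 40) (200, 40)
```

Because the control vector enters every frame of the sentence identically, changing it at run time shifts sentence-level (global) characteristics of the output without altering the frame-level linguistic inputs.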
Original language: English
Title of host publication: INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association
Publisher: International Speech Communication Association
Pages: 2217-2221
Number of pages: 5
Publication status: Published - 30 Sept 2015
