Combining a Vector Space Representation of Linguistic Context with a Deep Neural Network for Text-To-Speech Synthesis

Heng Lu, Simon King, Oliver Watts

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Conventional statistical parametric speech synthesis relies on decision trees to cluster together similar contexts, resulting in tied-parameter context-dependent hidden Markov models (HMMs). However, decision tree clustering has a major weakness: it uses hard divisions and subdivides the model space based on one feature at a time, fragmenting the data and failing to exploit interactions between linguistic context features. These linguistic features themselves are also problematic, being noisy and of varied relevance to the acoustics. We propose to combine our previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, with Deep Neural Networks (DNNs), which can directly accept such continuous representations as input. The outputs of the network are probability distributions over speech features. Maximum Likelihood Parameter Generation is then used to create parameter trajectories, which in turn drive a vocoder to generate the waveform. Various configurations of the system are compared, using both conventional and vector-space context representations, and with the DNN making speech parameter predictions at two different temporal resolutions: frames, or states. Both objective and subjective results are presented.
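The Maximum Likelihood Parameter Generation (MLPG) step named in the abstract can be sketched for a single feature dimension: given per-frame Gaussian means and variances over static and delta features, it solves the normal equations (W^T Σ^{-1} W) c = W^T Σ^{-1} μ for the static trajectory c, where W stacks the static and delta windows. This is a minimal NumPy illustration, not the paper's implementation; the delta window coefficients, the diagonal covariance, and the dense solve are simplifying assumptions (practical systems use banded solvers and second-order deltas).

```python
import numpy as np

def mlpg(mean, var, delta_win=(-0.5, 0.0, 0.5)):
    """MLPG for one feature dimension (illustrative sketch).

    mean, var: (T, 2) arrays of per-frame means/variances for the
    static and delta features. Returns the (T,) static trajectory c
    that maximises the likelihood, i.e. the solution of
    (W^T Sigma^-1 W) c = W^T Sigma^-1 mu.
    """
    T = mean.shape[0]
    # W maps the static trajectory c (T,) to stacked
    # [static_0, delta_0, static_1, delta_1, ...] observations (2T,).
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row: identity
        for k, w in zip((-1, 0, 1), delta_win):  # delta row: window
            if 0 <= t + k < T:
                W[2 * t + 1, t + k] += w
    prec = 1.0 / var.reshape(-1)   # diagonal precision Sigma^-1
    mu = mean.reshape(-1)
    A = W.T @ (prec[:, None] * W)  # W^T Sigma^-1 W
    b = W.T @ (prec * mu)          # W^T Sigma^-1 mu
    return np.linalg.solve(A, b)
```

Because the delta constraints couple neighbouring frames, the recovered static trajectory is a smoothed version of the frame-wise static means rather than a simple copy of them.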
Original language: English
Title of host publication: 8th ISCA Speech Synthesis Workshop
Number of pages: 5
Publication status: Published - 1 Aug 2013
Event: 8th ISCA Speech Synthesis Workshop - Barcelona, Spain
Duration: 31 Aug 2013 → …


Conference: 8th ISCA Speech Synthesis Workshop
Country/Territory: Spain
Period: 31/08/13 → …

