Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System

Xin Wang, Shinji Takaki, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Word embeddings, which are dense, low-dimensional vector representations of words, have recently been used to replace the conventional prosodic context as an input feature to the acoustic model of a TTS system. However, word vectors trained on text data may not encode sufficient information related to speech. This paper presents a post-filtering approach that enhances the raw word vectors with prosodic information for the TTS task. A post-filter is trained on a publicly available speech corpus with manual prosodic annotation to transform the raw word vectors. Experiments show that using the enhanced word vectors as input to the neural network-based acoustic model improves the accuracy of the predicted F0 trajectory. We also show that the enhanced vectors provide better initial values than the raw vectors for error back-propagation of the network, which results in further improvement.
Original language: English
Title of host publication: Interspeech 2016
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 12 Sep 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sep 2016 → 12 Sep 2016

Publication series

Publisher: International Speech Communication Association
ISSN (Print): 1990-9772


Conference: Interspeech 2016
Country/Territory: United States
City: San Francisco

