Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System

Xin Wang, Shinji Takaki, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Word embeddings, dense and low-dimensional vector representations of words, have recently been used to replace the conventional prosodic context as input features to the acoustic model of a TTS system. However, word vectors trained on text data may encode insufficient speech-related information. This paper presents a post-filtering approach that enhances raw word vectors with prosodic information for the TTS task. Based on a publicly available speech corpus with manual prosodic annotation, a post-filter is trained to transform the raw word vectors. Experiments show that using the enhanced word vectors as input to the neural-network-based acoustic model improves the accuracy of the predicted F0 trajectory. We also show that the enhanced vectors provide better initial values than the raw vectors for error back-propagation through the network, which yields further improvement.
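The abstract describes training a post-filter that maps raw word vectors toward prosody-aware representations. As a minimal sketch of the general idea (not the paper's actual model), the example below fits a linear post-filter by ridge-regularized least squares so that transformed vectors predict synthetic prosodic labels; all data, dimensions, and the linear form are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: enhance raw word vectors with prosodic information.
# The paper's post-filter is trained on a prosodically annotated corpus;
# here we substitute synthetic data and the simplest possible post-filter,
# a linear map W fitted by regularized least squares.

rng = np.random.default_rng(0)
n_words, dim, n_prosody = 200, 16, 4

raw_vectors = rng.normal(size=(n_words, dim))            # word2vec-style inputs
# One-hot prosodic labels standing in for manual annotation (e.g. accent types).
prosody_targets = np.eye(n_prosody)[rng.integers(0, n_prosody, n_words)]

# Fit W minimizing ||raw_vectors @ W - prosody_targets||^2 + lam * ||W||^2.
lam = 1e-3
W = np.linalg.solve(raw_vectors.T @ raw_vectors + lam * np.eye(dim),
                    raw_vectors.T @ prosody_targets)

# "Enhanced" vectors: raw vectors concatenated with their prosodic projection,
# which could then serve as input (or initialization) for an acoustic model.
enhanced = np.concatenate([raw_vectors, raw_vectors @ W], axis=1)
print(enhanced.shape)  # (200, 20)
```

A learned nonlinear post-filter (as in the paper) would replace the closed-form linear map, but the data flow, from raw text-trained vectors to prosody-enhanced vectors feeding the acoustic model, is the same.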
Original language: English
Title of host publication: Interspeech 2016
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 12 Sep 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sep 2016 - 12 Sep 2016

Publication series

Publisher: International Speech Communication Association
ISSN (Print): 1990-9772


Conference: Interspeech 2016
Country: United States
City: San Francisco

