An RNN-based Quantized F0 Model with Multi-tier Feedback Links for Text-to-Speech Synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


A recurrent-neural-network-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours given textual features is proposed. In contrast to related F0 models, the proposed one is designed to learn the temporal correlation of F0 contours at multiple levels. The frame-level correlation is covered by feeding back the F0 output of the previous frame as the additional input of the current frame; meanwhile, the correlation over long-time spans is similarly modeled but by using F0 features aggregated over the phoneme and syllable. Another difference is that the output of the proposed model is not the interpolated continuous-valued F0 contour but rather a sequence of discrete symbols, including quantized F0 levels and a symbol for the unvoiced condition. By using the discrete F0 symbols, the proposed model avoids the influence of artificially interpolated F0 curves. Experiments demonstrated that the proposed F0 model, which was trained using a dropout strategy, generated smooth F0 contours with relatively better perceived quality than those from baseline RNN models.
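The discrete output representation and the multi-tier feedback described in the abstract can be illustrated with a small sketch. The abstract gives no numeric details, so everything here is an assumption made for illustration only: the number of quantization levels, the F0 range, the log-scale spacing, the use of symbol 0 for the unvoiced condition, the phoneme-level mean as the aggregated feature, and the function names `quantize_f0`, `dequantize_f0`, and `feedback_features`. It is not the authors' implementation.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
UNVOICED = 0                     # assumed symbol for the unvoiced condition
NUM_LEVELS = 255                 # assumed number of quantized F0 levels
F0_MIN, F0_MAX = 60.0, 500.0     # assumed F0 range in Hz
_LOG_SPAN = math.log(F0_MAX) - math.log(F0_MIN)


def quantize_f0(f0_hz):
    """Map an F0 value in Hz to a discrete symbol; f0_hz <= 0 means unvoiced."""
    if f0_hz <= 0:
        return UNVOICED
    clipped = min(max(f0_hz, F0_MIN), F0_MAX)
    pos = (math.log(clipped) - math.log(F0_MIN)) / _LOG_SPAN   # in [0, 1]
    return 1 + round(pos * (NUM_LEVELS - 1))                   # symbols 1..NUM_LEVELS


def dequantize_f0(symbol):
    """Map a symbol back to Hz; the unvoiced symbol maps to 0.0."""
    if symbol == UNVOICED:
        return 0.0
    pos = (symbol - 1) / (NUM_LEVELS - 1)
    return math.exp(math.log(F0_MIN) + pos * _LOG_SPAN)


def feedback_features(symbols, phoneme_ids):
    """Build per-frame feedback inputs from a symbol sequence.

    Returns one (previous-frame symbol, mean voiced symbol of the previous
    phoneme) pair per frame. phoneme_ids are assumed to be consecutive
    integers starting at 0; a missing or fully unvoiced phoneme yields 0.0.
    """
    prev_frame = [UNVOICED] + list(symbols[:-1])
    sums, counts = {}, {}
    for sym, pid in zip(symbols, phoneme_ids):
        if sym != UNVOICED:
            sums[pid] = sums.get(pid, 0) + sym
            counts[pid] = counts.get(pid, 0) + 1
    means = {pid: sums[pid] / counts[pid] for pid in sums}
    prev_phoneme = [means.get(pid - 1, 0.0) for pid in phoneme_ids]
    return list(zip(prev_frame, prev_phoneme))
```

Because unvoiced frames get their own symbol, no interpolated F0 values are needed to fill unvoiced regions. In this sketch the frame-level and phoneme-level features would be appended to the textual features at each frame; a syllable-level tier could be built the same way with syllable indices.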
Original language: English
Title of host publication: Proceedings Interspeech 2017
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 24 Aug 2017
Event: Interspeech 2017 - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017

Publication series

Publisher: International Speech Communication Association
ISSN (Electronic): 1990-9772


Conference: Interspeech 2017