The NII speech synthesis entry for Blizzard Challenge 2016

Lauri Juvela, Xin Wang, Shinji Takaki, SangJin Kim, Manu Airaksinen, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


This paper describes the NII speech synthesis entry for Blizzard Challenge 2016, where the task was to build a voice from audiobook data. The synthesis system is built using the NII parametric speech synthesis framework, which uses a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) for acoustic modeling. For this entry, we first built a voice from a large data set and then used the audiobook data to adapt the acoustic model to the target speaker. Additionally, the recent full-band glottal vocoder GlottDNN was used in the system, with a DNN-based excitation model generating the glottal waveforms. The vocoder estimates the vocal tract in a band-wise manner, using Quasi Closed Phase (QCP) inverse filtering in the low band. At the synthesis stage, the excitation model generates voiced excitation from acoustic features, after which a vocal tract filter is applied to produce synthetic speech.
The Blizzard Challenge listening test results show that the proposed system achieves comparable quality with the benchmark parametric synthesis systems.
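The source-filter pipeline described above (a voiced excitation signal passed through a vocal tract filter) can be illustrated with a minimal sketch. This is not the paper's system: the impulse-train excitation stands in for the DNN-generated glottal waveforms, and the single-resonance all-pole filter stands in for the QCP-estimated vocal tract; filter values are made up for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(excitation, ar_coeffs):
    # Apply an all-pole vocal tract filter 1/A(z) to the excitation.
    # ar_coeffs = [1, a1, ..., ap] are the filter's denominator coefficients.
    return lfilter([1.0], ar_coeffs, excitation)

fs = 16000          # sample rate (Hz)
f0 = 120.0          # fundamental frequency (Hz)
n = fs // 10        # 100 ms of signal

# Impulse-train stand-in for the voiced excitation; the paper's DNN-based
# excitation model would instead generate glottal waveform segments.
period = int(round(fs / f0))
excitation = np.zeros(n)
excitation[::period] = 1.0

# Toy all-pole filter with a single resonance near 500 Hz (hypothetical values).
r, theta = 0.97, 2 * np.pi * 500.0 / fs
ar = np.array([1.0, -2 * r * np.cos(theta), r * r])

speech = synthesize(excitation, ar)
```

In the actual system the filter coefficients and excitation are predicted per frame from acoustic features; here they are fixed over the whole segment for brevity.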
Original language: English
Title of host publication: Blizzard Challenge workshop 2016
Number of pages: 6
Publication status: Published - 16 Sep 2016
Event: Blizzard Challenge 2016 - Cupertino, United States
Duration: 16 Sep 2016 → 16 Sep 2016


Conference: Blizzard Challenge 2016
Country/Territory: United States

