Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis

Shinji Takaki, SangJin Kim, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


In this paper, we investigate the effectiveness of speaker adaptation for various essential components of deep neural network based speech synthesis, including acoustic models, acoustic feature extraction, and post-filters. In general, a speaker adaptation technique, e.g., maximum likelihood linear regression (MLLR) for HMMs or learning hidden unit contributions (LHUC) for DNNs, is applied to the acoustic modeling part to change voice characteristics or speaking styles. However, since we have previously proposed a multiple-DNN-based speech synthesis system, in which several components are represented by feed-forward DNNs, a speaker adaptation technique can be applied not only to the acoustic modeling part but also to the other DNN-based components. In experiments using a small amount of adaptation data, we performed adaptation based on LHUC and on simple additional fine-tuning for DNN-based acoustic models, deep auto-encoder based feature extraction, and DNN-based post-filter models, and compared them with HMM-based speech synthesis systems using MLLR.
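To illustrate the LHUC idea mentioned in the abstract, below is a minimal, hypothetical sketch of one feed-forward layer with LHUC adaptation: the base weights (trained on multi-speaker data) stay frozen, and only a small per-hidden-unit scaling vector is learned from the adaptation data. The layer class, sizes, and tanh nonlinearity are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LHUCLayer:
    """One hidden layer with LHUC speaker adaptation (illustrative sketch).

    The base parameters W, b are frozen; only the per-unit LHUC
    parameters r are updated on adaptation data. Each hidden unit's
    output is rescaled by 2*sigmoid(r), so the scale lies in (0, 2)
    and r = 0 recovers the unadapted network exactly.
    """
    def __init__(self, W, b):
        self.W, self.b = W, b            # frozen speaker-independent weights
        self.r = np.zeros(W.shape[1])    # LHUC parameters, one per hidden unit

    def forward(self, x):
        h = np.tanh(x @ self.W + self.b)   # base hidden activation
        return 2.0 * sigmoid(self.r) * h   # speaker-dependent per-unit rescaling

rng = np.random.default_rng(0)
layer = LHUCLayer(rng.standard_normal((4, 3)), np.zeros(3))
x = rng.standard_normal((2, 4))
# With r = 0, 2*sigmoid(0) = 1, so the layer matches the base network.
assert np.allclose(layer.forward(x), np.tanh(x @ layer.W + layer.b))
```

Because only `r` is trained during adaptation, the number of speaker-dependent parameters is tiny compared with fine-tuning all weights, which is why LHUC suits the small-adaptation-data setting described above.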
Original language: English
Title of host publication: 9th ISCA Speech Synthesis Workshop
Number of pages: 7
Publication status: Published - 15 Sep 2016
Event: 9th ISCA Speech Synthesis Workshop - Sunnyvale, United States
Duration: 13 Sep 2016 - 15 Sep 2016


Conference: 9th ISCA Speech Synthesis Workshop
Abbreviated title: ISCA 2016
Country/Territory: United States

