Investigating Very Deep Highway Networks for Parametric Speech Synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The depth of a neural network is a vital factor that affects its performance. Recently, a new architecture called the highway network was proposed. This network facilitates the training of a very deep neural network by using gate units to control an information highway over the conventional hidden layer. For the speech synthesis task, we investigate the performance of highway networks with up to 40 hidden layers. The results suggest that a highway network with 14 non-linear transformation layers is the best choice on our speech corpus and that this highway network achieves better performance than a feed-forward network with 14 hidden layers. On the basis of these results, we further investigate a multi-stream highway network in which separate highway networks are used to predict different kinds of acoustic features, such as the spectral and F0 features. Results of the experiments suggest that the multi-stream highway network can achieve better objective results than the single network that predicts all the acoustic features. Analysis of the output of the highway gate units also supports the assumption behind the multi-stream network that different hidden representations may be necessary to predict spectral and F0 features.
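The gating idea in the abstract can be illustrated with a minimal sketch. It uses the standard highway layer formulation, y = T(x) * H(x) + (1 - T(x)) * x, where H is the conventional non-linear hidden transformation and T is a sigmoid gate. This is not the authors' implementation: the function names, the tanh activation, the weight range, and the negative gate bias are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    # Computes W x + b, with W given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def highway_layer(x, W_h, b_h, W_t, b_t):
    # H(x): the conventional hidden transformation (tanh is an assumption).
    h = [math.tanh(v) for v in affine(W_h, b_h, x)]
    # T(x): the gate, one value in (0, 1) per hidden unit.
    t = [sigmoid(v) for v in affine(W_t, b_t, x)]
    # The gate mixes the transformed signal with the untouched input,
    # which is carried forward on the "information highway".
    return [t_i * h_i + (1.0 - t_i) * x_i for t_i, h_i, x_i in zip(t, h, x)]


def init_layer(dim, rng, gate_bias=-2.0):
    # Hypothetical initialisation: small random weights and a negative
    # gate bias so that each layer starts close to the identity mapping.
    def small_matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    return small_matrix(), [0.0] * dim, small_matrix(), [gate_bias] * dim


def highway_stack(x, layers):
    # Applies the layers in order; input and output widths are equal
    # because the highway path adds x to the transformed signal.
    for params in layers:
        x = highway_layer(x, *params)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [init_layer(3, rng) for _ in range(14)]
    print(highway_stack([0.5, -1.0, 2.0], layers))
```

When the gate is nearly closed, each layer passes its input through almost unchanged, which is why stacking many such layers remains trainable. The depth of 14 in the usage example mirrors the configuration the abstract reports as best on the authors' corpus; every other value is illustrative.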
Original language: English
Title of host publication: 9th ISCA Speech Synthesis Workshop
Pages: 166-171
Number of pages: 6
Publication status: Published - 15 Sept 2016
Event: 9th ISCA Speech Synthesis Workshop - Sunnyvale, United States
Duration: 13 Sept 2016 to 15 Sept 2016
http://ssw9.talp.cat/

Publication series

ISSN (Print): 1234-5678

Conference

Conference: 9th ISCA Speech Synthesis Workshop
Abbreviated title: ISCA 2016
Country/Territory: United States
City: Sunnyvale
Period: 13/09/16 to 15/09/16
