Investigating very deep highway networks for parametric speech synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi

Research output: Contribution to journal › Article › peer-review


Deep neural networks are powerful tools for classification and regression tasks. While a network with more than 100 hidden layers has been reported for image classification, how such a non-recurrent neural network with more than 10 hidden layers will perform for speech synthesis is as yet unknown. This work investigates the performance of deep networks on statistical parametric speech synthesis, particularly the question of whether different acoustic features can be better generated by a deeper network. To answer this question, this work examines a multi-stream highway network that separately generates spectral and F0 acoustic features based on the highway architecture. Experiments on the Blizzard Challenge 2011 corpus show that the accuracy of the generated spectral features consistently improves as the depth of the network increases from 2 to 40, but the F0 trajectory can be generated equally well by either a deep or a shallow network. Additional experiments on a single-stream highway and normal feedforward network, both of which generate spectral and F0 features from a single network, show that these networks must be deep enough to generate both kinds of acoustic features well. The difference in the performance of multi- and single-stream highway networks is further analyzed on the basis of the networks’ activation and sensitivity to input features. In general, the highway network with more than 10 hidden layers, either multi- or single-stream, performs better on the experimental corpus than does a shallow network.
Keywords: Text-to-Speech, Statistical parametric speech synthesis, Deep neural network, Highway neural network
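
As an illustration of the architecture named in the abstract, the following is a minimal NumPy sketch of a highway layer and a two-stream stack built from it. It uses the standard highway formulation y = T(x) * H(x) + (1 - T(x)) * x, where T is a sigmoid transform gate. Everything else is an assumption made for illustration and is not taken from the paper: the class and function names, the layer widths, the input and output dimensions, the random untrained weights, and the choice to feed both streams the same input. Only the depths (40 and 2) echo the range mentioned in the abstract.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class HighwayLayer:
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""

    def __init__(self, dim, rng, gate_bias=-1.0):
        self.W_h = rng.normal(0.0, 0.1, (dim, dim))
        self.b_h = np.zeros(dim)
        self.W_t = rng.normal(0.0, 0.1, (dim, dim))
        # A negative gate bias makes the layer start out mostly copying its input.
        self.b_t = np.full(dim, gate_bias)

    def __call__(self, x):
        h = np.tanh(x @ self.W_h + self.b_h)   # candidate transform H(x)
        t = sigmoid(x @ self.W_t + self.b_t)   # transform gate T(x) in (0, 1)
        return t * h + (1.0 - t) * x           # gated mix of transform and carry


class HighwayStream:
    """Input projection, `depth` highway layers, linear output (hypothetical layout)."""

    def __init__(self, in_dim, hidden_dim, out_dim, depth, rng):
        self.W_in = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.layers = [HighwayLayer(hidden_dim, rng) for _ in range(depth)]
        self.W_out = rng.normal(0.0, 0.1, (hidden_dim, out_dim))

    def __call__(self, x):
        h = np.tanh(x @ self.W_in)
        for layer in self.layers:
            h = layer(h)
        return h @ self.W_out


def multi_stream_forward(x, spectral_stream, f0_stream):
    """Generate spectral and F0 outputs from separate stacks (assumed shared input)."""
    return spectral_stream(x), f0_stream(x)


rng = np.random.default_rng(0)
linguistic = rng.normal(size=(5, 32))  # 5 frames, 32 hypothetical input features
spectral_stream = HighwayStream(32, 64, 60, depth=40, rng=rng)  # deep stream
f0_stream = HighwayStream(32, 64, 1, depth=2, rng=rng)          # shallow stream
spec, f0 = multi_stream_forward(linguistic, spectral_stream, f0_stream)
print(spec.shape, f0.shape)
```

Because each layer can carry its input through unchanged when the gate is near zero, stacking many such layers remains trainable in principle, which is what makes depths such as 40 practical to compare against shallow stacks. The sketch covers only the forward pass; training, feature extraction, and the single-stream variant are left out.
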
Original language: English
Pages (from-to): 1-9
Number of pages: 9
Journal: Speech Communication
Early online date: 8 Nov 2017
Publication status: E-pub ahead of print - 8 Nov 2017

