Parallel and cascaded deep neural networks for text-to-speech synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
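The two topologies described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: all layer sizes, weights, and the single-hidden-layer structure are hypothetical, chosen only to show where the suprasegmental embedding enters each topology.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # one fully connected layer with tanh activation (illustrative choice)
    return np.tanh(x @ w + b)

# hypothetical feature and layer dimensions, not from the paper
n_supra, n_seg, n_hidden, n_out = 20, 50, 32, 10

# suprasegmental sub-network: learns a compact embedding of
# syllable-level-and-above features, with no segmental influence
w_s = rng.standard_normal((n_supra, n_hidden))
b_s = np.zeros(n_hidden)

def supra_embed(supra):
    return dense(supra, w_s, b_s)

# cascaded topology: the suprasegmental embedding is appended to the
# segmental features and fed as input to the frame-level network
w_c = rng.standard_normal((n_hidden + n_seg, n_out))
b_c = np.zeros(n_out)

def cascaded(supra, seg):
    return dense(np.concatenate([supra_embed(supra), seg]), w_c, b_c)

# parallel topology: segmental features get their own sub-network; the two
# streams are concatenated at a later (hidden) stage, then mapped to output
w_g = rng.standard_normal((n_seg, n_hidden))
b_g = np.zeros(n_hidden)
w_p = rng.standard_normal((2 * n_hidden, n_out))
b_p = np.zeros(n_out)

def parallel(supra, seg):
    joint = np.concatenate([supra_embed(supra), dense(seg, w_g, b_g)])
    return dense(joint, w_p, b_p)

supra = rng.standard_normal(n_supra)
seg = rng.standard_normal(n_seg)
print(cascaded(supra, seg).shape)  # (10,)
print(parallel(supra, seg).shape)  # (10,)
```

The key contrast is where the streams merge: the cascaded network injects the suprasegmental embedding at the input of the frame-level network, while the parallel network keeps both streams separate until a later hidden layer.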
Original language: English
Title of host publication: 9th ISCA Speech Synthesis Workshop
Number of pages: 6
Publication status: Published - 15 Sept 2016
Event: 9th ISCA Speech Synthesis Workshop - Sunnyvale, United States
Duration: 13 Sept 2016 - 15 Sept 2016


Conference: 9th ISCA Speech Synthesis Workshop
Abbreviated title: ISCA 2016
Country/Territory: United States


