Parallel and cascaded deep neural networks for text-to-speech synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned set. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
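The two integration strategies described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: all dimensions, layer sizes, and the single tanh layer per sub-network are invented for clarity, and the suprasegmental features are assumed to have already been upsampled to the frame rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # one feed-forward layer with tanh activation
    return np.tanh(x @ w + b)

# hypothetical sizes, chosen only for illustration
n_frames = 4
d_seg, d_supra, d_repr, d_hidden = 10, 6, 3, 8

seg = rng.standard_normal((n_frames, d_seg))      # segmental (frame-level) inputs
supra = rng.standard_normal((n_frames, d_supra))  # suprasegmental inputs, upsampled per frame

# suprasegmental sub-network: learns a compact representation
# of high-level units, with no segmental influence
w_s = rng.standard_normal((d_supra, d_repr)); b_s = np.zeros(d_repr)
supra_repr = dense(supra, w_s, b_s)

# cascaded: the suprasegmental representation is appended to the
# segmental features and fed into the frame-level network
cascaded_in = np.concatenate([seg, supra_repr], axis=1)
w_c = rng.standard_normal((d_seg + d_repr, d_hidden)); b_c = np.zeros(d_hidden)
cascaded_out = dense(cascaded_in, w_c, b_c)

# parallel: segmental features get their own sub-network, and the two
# streams are concatenated at a later stage
w_f = rng.standard_normal((d_seg, d_hidden)); b_f = np.zeros(d_hidden)
seg_repr = dense(seg, w_f, b_f)
parallel_out = np.concatenate([seg_repr, supra_repr], axis=1)

print(cascaded_out.shape)  # (4, 8)
print(parallel_out.shape)  # (4, 11)
```

In a real system each stream would be several layers deep and the concatenated output would feed further layers predicting acoustic parameters; the sketch only shows where the two topologies differ, namely whether the suprasegmental representation enters at the input of the frame-level network or is merged after separate processing.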
Original language: English
Title of host publication: 9th ISCA Speech Synthesis Workshop
Pages: 107-112
Number of pages: 6
Publication status: Published - 15 Sep 2016
Event: 9th ISCA Speech Synthesis Workshop - Sunnyvale, United States
Duration: 13 Sep 2016 - 15 Sep 2016
http://ssw9.talp.cat/

Conference

Conference: 9th ISCA Speech Synthesis Workshop
Abbreviated title: ISCA 2016
Country/Territory: United States
City: Sunnyvale
Period: 13/09/16 - 15/09/16
Internet address: http://ssw9.talp.cat/
