Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed and, at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system that exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted, followed by a discussion of the results.
Original language: English
Title of host publication: Interspeech 2016
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 12 Sept 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sept 2016 to 12 Sept 2016

Publication series

Publisher: International Speech Communication Association
ISSN (Print): 1990-9772


Conference: Interspeech 2016
Country/Territory: United States
City: San Francisco