Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis

Jaime Lorenzo-Trueba, Gustav Eje Henter, Shinji Takaki, Junichi Yamagishi, Yosuke Morino, Yuta Ochiai

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we investigate the simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions – should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be represented as labels for supervised DNN training, e.g., should emotional class and emotional strength be factorized into separate inputs or not? We evaluate on a large-scale corpus of emotional speech from a professional voice actress, additionally annotated with perceived emotional labels from crowdsourced listeners. By comparing DNN-based speech synthesizers that utilize different emotional representations, we assess the impact of these representations and design decisions on human emotion recognition rates, perceived emotional strength, and subjective speech quality. Simultaneously, we also study which representations are most appropriate for controlling the emotional strength of synthetic speech.
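To illustrate the second question, here is a minimal sketch of the two input-representation options the abstract contrasts: emotion class and strength factorized into separate input dimensions, versus fused into a single joint one-hot code. The emotion names, strength levels, and scalar strength scale are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical label sets; the paper's actual categories may differ.
EMOTIONS = ["neutral", "happy", "sad", "angry"]
STRENGTHS = ["weak", "medium", "strong"]

def factorized_label(emotion: str, strength: str) -> list:
    """Factorized representation: one-hot emotion class
    concatenated with a single scalar strength dimension."""
    class_vec = [1.0 if e == emotion else 0.0 for e in EMOTIONS]
    # Map strength to a scalar in (0, 1]; an assumed encoding.
    strength_val = (STRENGTHS.index(strength) + 1) / len(STRENGTHS)
    return class_vec + [strength_val]

def joint_label(emotion: str, strength: str) -> list:
    """Non-factorized representation: one-hot over every
    (class, strength) combination, so class and strength share a code."""
    combos = [(e, s) for e in EMOTIONS for s in STRENGTHS]
    return [1.0 if c == (emotion, strength) else 0.0 for c in combos]

print(factorized_label("happy", "strong"))  # 4 class dims + 1 strength dim
print(joint_label("happy", "strong"))       # 12 joint one-hot dims
```

The factorized form keeps strength on a continuous axis, which is what makes gradual strength control at synthesis time straightforward; the joint form treats every class–strength pair as a distinct category.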
Keywords: Emotional speech synthesis, perception modeling, perceptual evaluation
Original language: English
Pages (from-to): 135-143
Number of pages: 9
Journal: Speech Communication
Early online date: 15 Mar 2018
Publication status: Published - May 2018
