TY - CONF
T1 - Investigating different representations for modeling multiple emotions in DNN-based speech synthesis
AU - Lorenzo-Trueba, Jaime
AU - Henter, Gustav Eje
AU - Takaki, Shinji
AU - Yamagishi, Junichi
AU - Morino, Yosuke
AU - Ochiai, Yuta
PY - 2017/7/18
Y1 - 2017/7/18
AB - This paper investigates simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions – should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be represented as labels for supervised DNN training, e.g., should emotional class and emotional strength be factorized into separate inputs or not? We evaluate on a large-scale corpus of emotional speech from a professional actress, additionally annotated with perceived emotional labels from crowdsourced listeners. By comparing DNN-based speech synthesizers that utilize different emotional representations, we assess the impact of these representations and design decisions on human emotion recognition rates and perceived emotional strength.
KW - Emotional speech synthesis
KW - Deep neural network
KW - Recurrent neural networks
M3 - Paper
T2 - The 3rd International Workshop on Affective Social Multimedia Computing 2017
Y2 - 25 August 2017
ER -