Abstract / Description of output
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or on whether they can generate meaningfully distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as ‘intonation codes’. Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than those from a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes carry. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including emotional, uncertain, surprised, sarcastic, passive-aggressive, and upset. Finally, we lay out several methodological issues for evaluating distinct prosodies.
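To make the modelling idea concrete, here is a minimal sketch (not the authors' implementation; all module names, layer sizes, and hyperparameters are illustrative assumptions) of a phrase-level VAE over fixed-length F0 contours with a Gaussian-mixture prior. The mixture means play the role of the ‘intonation codes’: decoding each mode centre yields one candidate prosodic rendition of a phrase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMVAE(nn.Module):
    """Sketch of a phrase-level VAE over fixed-length F0 contours with a
    Gaussian-mixture prior; the mixture means act as discrete
    'intonation codes'. Dimensions are illustrative, not from the paper."""

    def __init__(self, contour_len=100, latent_dim=8, n_codes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(contour_len, 128), nn.Tanh())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.Tanh(),
                                 nn.Linear(128, contour_len))
        # Learnable mode centres: one per intonation code.
        self.codes = nn.Parameter(torch.randn(n_codes, latent_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
        return self.dec(z), z, mu, logvar

    def kl_term(self, z, mu, logvar):
        # One-sample Monte-Carlo estimate of KL(q(z|x) || p(z)), where p(z)
        # is a uniform mixture of unit-variance Gaussians at the code means.
        # The shared -D/2 * log(2*pi) constants cancel between log q and log p.
        log_q = -0.5 * (logvar + (z - mu) ** 2 / logvar.exp()).sum(-1)
        d2 = ((z.unsqueeze(1) - self.codes) ** 2).sum(-1)     # [B, n_codes]
        log_p = torch.logsumexp(-0.5 * d2, dim=1) - torch.log(
            torch.tensor(float(self.codes.shape[0])))
        return (log_q - log_p).mean()

model = GMVAE()
x = torch.randn(4, 100)            # batch of z-scored phrase F0 contours
recon, z, mu, logvar = model(x)
loss = F.mse_loss(recon, x) + model.kl_term(z, mu, logvar)
loss.backward()

# After training, decoding each mode centre gives one distinct
# prosodic rendition of a phrase:
variants = model.dec(model.codes)  # [n_codes, contour_len]
```

The k-means baseline mentioned above can be approximated, under the same assumptions, by training the same model with a standard normal prior and then clustering the posterior means (e.g. with sklearn.cluster.KMeans), decoding the cluster centroids as the codes.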
Original language | English
---|---
Title of host publication | Proceedings of Speech Prosody 2020
Pages | 965-969
DOIs |
Publication status | Published - 24 May 2020
Event | Speech Prosody 2020, University of Tokyo, Tokyo, Japan, 24 May 2020 → 28 May 2020 (https://sp2020.jpn.org)
Publication series
Name |
---|---
ISSN (Electronic) | 2333-2042
Conference
Conference | Speech Prosody 2020
---|---
Country/Territory | Japan
City | Tokyo
Period | 24/05/20 → 28/05/20
Internet address | https://sp2020.jpn.org
Keywords / Materials (for Non-textual outputs)
- speech synthesis
- prosody
- speech perception
- intonation modelling
- machine learning
- prosodic variation
- discrete representation learning
- variational autoencoder
Profiles
- Catherine Lai
  - School of Philosophy, Psychology and Language Sciences - Lecturer in Speech and Language Processing
  - Institute of Language, Cognition and Computation
  - Centre for Speech Technology Research
  - Person: Academic: Research Active