Abstract
For isolated utterances, speech synthesis quality has improved immensely thanks to the use of sequence-to-sequence models. However, these models are generally trained on read speech and fail to generalise to unseen speaking styles. Recently, more research is focused on the synthesis of expressive and conversational speech. Conversational speech contains many prosodic phenomena that are not present in read speech. We would like to learn these prosodic patterns from data, but unfortunately, many large conversational corpora are unsuitable for speech synthesis due to low audio quality. We investigate whether a data mixing strategy can improve conversational prosody for a target voice based on monologue data from audiobooks by adding real conversational data from podcasts. We filter the podcast data to create a set of 26k question and answer pairs. We evaluate two FastPitch models: one trained on 20 hours of monologue speech from a single speaker, and another trained on 5 hours of monologue speech from that speaker plus 15 hours of questions and answers spoken by nearly 15k speakers. Results from three listening tests show that the second model generates more preferred question prosody.
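As a rough illustration of the data mixing described in the abstract, the sketch below assembles a training set from two pools under a per-source hour budget (5 hours of target-speaker monologue plus 15 hours of conversational speech). It is not the authors' pipeline: the `Utterance` record, the `take_hours` and `mix` helpers, and the random selection policy are all hypothetical, introduced only to make the idea concrete.

```python
import random
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical record; field names are assumptions, not from the paper.
    speaker: str
    seconds: float
    source: str  # e.g. "audiobook" or "podcast"


def take_hours(utterances, hours, seed=0):
    """Randomly pick utterances without exceeding a duration budget in hours."""
    budget = hours * 3600.0
    pool = list(utterances)
    random.Random(seed).shuffle(pool)
    picked, total = [], 0.0
    for utt in pool:
        # Skip any utterance that would push the total past the budget.
        if total + utt.seconds > budget:
            continue
        picked.append(utt)
        total += utt.seconds
    return picked


def mix(target, conversational, target_hours=5.0, conv_hours=15.0, seed=0):
    """Combine target-speaker monologue with multi-speaker conversational data."""
    return (take_hours(target, target_hours, seed)
            + take_hours(conversational, conv_hours, seed))


def total_hours(utterances):
    """Total duration of a list of utterances, in hours."""
    return sum(u.seconds for u in utterances) / 3600.0
```

Random selection is only one possible policy; filtering by audio quality or utterance type, as the abstract mentions for the podcast data, would take place before this step.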
Original language | English |
---|---|
Title of host publication | Proceedings of Interspeech 2022 |
Editors | H. Ko, J.H.L. Hansen |
Publisher | ISCA |
Pages | 3388-3392 |
DOIs | |
Publication status | Published - 22 Sept 2022 |
Publication series
Name | Interspeech - Annual Conference of the International Speech Communication Association |
---|---|
ISSN (Electronic) | 2308-457X |
Keywords / Materials (for Non-textual outputs)
- conversational speech synthesis
- speech synthesis
- expressive speech synthesis
Fingerprint
Research topics of 'Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis'.