Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis

Johannah O'Mahony, Catherine Lai, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

For isolated utterances, speech synthesis quality has improved immensely thanks to the use of sequence-to-sequence models. However, these models are generally trained on read speech and fail to generalise to unseen speaking styles. Recently, more research is focused on the synthesis of expressive and conversational speech. Conversational speech contains many prosodic phenomena that are not present in read speech. We would like to learn these prosodic patterns from data, but unfortunately, many large conversational corpora are unsuitable for speech synthesis due to low audio quality. We investigate whether a data mixing strategy can improve conversational prosody for a target voice based on monologue data from audiobooks by adding real conversational data from podcasts. We filter the podcast data to create a set of 26k question and answer pairs. We evaluate two FastPitch models: one trained on 20 hours of monologue speech from a single speaker, and another trained on 5 hours of monologue speech from that speaker plus 15 hours of questions and answers spoken by nearly 15k speakers. Results from three listening tests show that the second model generates more preferred question prosody.
Original language: English
Title of host publication: Proceedings of Interspeech 2022
Editors: H. Ko, J.H.L. Hansen
Publisher: ISCA
Pages: 3388-3392
Publication status: Published - 22 Sept 2022

Publication series

Name: Interspeech - Annual Conference of the International Speech Communication Association
ISSN (Electronic): 2308-457X

Keywords / Materials (for Non-textual outputs)

  • conversational speech synthesis
  • speech synthesis
  • expressive speech synthesis
