Synthesising turn-taking cues using natural conversational data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood, and many contextual factors can affect an utterance's prosodic realisation. Most studies incorporating context use rich acoustic or textual embeddings of the previous context, then demonstrate improvements in overall naturalness. Such studies are not informative about what the context embedding represents, or how it affects an utterance's realisation. We instead narrow the focus to a single, explicit contextual factor: in the current work, turn-taking. We condition a speech synthesis model on whether an utterance is turn-final. Objective measures and a targeted subjective evaluation demonstrate that the model can synthesise turn-taking cues which are perceived by listeners, with results being speaker-dependent.
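The conditioning described in the abstract can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function names, the embedding size, and the frame-wise concatenation strategy are all hypothetical, showing one common way a binary turn-final flag might be injected into a synthesis model's encoder output.

```python
# Hypothetical sketch (not the paper's code): condition encoder output
# on a binary turn-final flag via a learned class embedding.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for learned embeddings of the two turn positions:
# row 0 = turn-medial, row 1 = turn-final (2 classes x 4-dim, sizes assumed).
turn_embedding = rng.normal(size=(2, 4))

def condition_on_turn(encoder_out: np.ndarray, turn_final: bool) -> np.ndarray:
    """Broadcast the turn-position embedding across all frames and
    concatenate it to the encoder output (frames x channels)."""
    emb = turn_embedding[int(turn_final)]            # (4,)
    tiled = np.tile(emb, (encoder_out.shape[0], 1))  # (frames, 4)
    return np.concatenate([encoder_out, tiled], axis=1)

frames = rng.normal(size=(10, 16))  # e.g. 10 frames of 16-dim features
conditioned = condition_on_turn(frames, turn_final=True)
print(conditioned.shape)  # (10, 20)
```

In such a setup the rest of the model sees the same flag at every frame, so the decoder is free to realise the cue wherever it is prosodically relevant (e.g. utterance-final lengthening or pitch movement).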
Original language: English
Title of host publication: Proceedings of the 12th ISCA Speech Synthesis Workshop
Subtitle of host publication: (SSW2023)
Editors: Gérard Bailly, Thomas Hueber, Damien Lolive, Nicolas Obin, Olivier Perrotin
Place of publication: Grenoble
Publisher: ISCA
Pages: 75-80
Number of pages: 5
Publication status: Published - 28 Aug 2023
Event: 12th ISCA Speech Synthesis Workshop - Grenoble, France
Duration: 26 Aug 2023 - 28 Aug 2023
https://ssw2023.org

Publication series

Name: Proceedings of the ISCA Workshop
Publisher: ISCA
ISSN (Print): 1680-8908

Conference

Conference: 12th ISCA Speech Synthesis Workshop
Abbreviated title: SSW
Country/Territory: France
City: Grenoble
Period: 26/08/23 - 28/08/23
Internet address: https://ssw2023.org

Keywords

  • dialogue
  • context-aware TTS
  • turn-taking
