Factors Affecting the Evaluation of Synthetic Speech in Context

Johannah O'Mahony, Pilar Oplustil Gallegos, Catherine Lai, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Text-to-Speech synthesis is approaching the limit of naturalness that is possible from an isolated sentence. The focus of research is shifting to modelling contextual information, typically with the goal of producing better prosodic realisations by accounting for longer-range text dependencies from preceding sentences. But current evaluation methods were developed for single sentences and it is not yet clear how the evaluation of longer texts should be approached. Previous work suggests that evaluation of utterances in context can lead to an increase in Mean Opinion Score ratings, even when the synthesis technique is not context-aware. We investigated several factors that might explain this increase. Three experiments manipulated: the wording of instructions that participants received; the textual characteristics of context-stimulus pairs; and the prosodic realisation of the synthetic speech. We found that the wording of instructions has an impact on listeners’ ratings of stimuli presented in context. The between-sentence context dependency of stimulus text has no impact on ratings. Listeners are, however, sensitive to prosodic differences, both in context and in isolation.
Original language: English
Title of host publication: Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)
Publisher: International Speech Communication Association
Pages: 148-153
Number of pages: 6
Publication status: Published - 28 Aug 2021
Event: The 11th ISCA Speech Synthesis Workshop (SSW11) - Gárdony, Hungary
Duration: 26 Aug 2021 to 28 Aug 2021
Conference number: 11
https://ssw11.hte.hu

Conference

Conference: The 11th ISCA Speech Synthesis Workshop (SSW11)
Abbreviated title: SSW11
Country/Territory: Hungary
City: Gárdony
Period: 26/08/21 to 28/08/21

Keywords / Materials (for Non-textual outputs)

  • long-form Text-to-Speech
  • Text-to-Speech evaluation
  • context-aware Text-to-Speech
