Is there an uncanny valley for speech? Investigating listeners’ evaluations of realistic synthesised voices

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The exploration of uncanny valley effects (UVE) - a distaste for entities that appear almost, but not quite, human - has been a productive topic of research in human-robot interaction. Meanwhile, realistic text-to-speech (TTS) voices are increasingly encountered in various settings. In this work, we aim to describe the relationship between synthesised voices' perceived human-likeness and pleasantness and seek evidence of auditory UVE in listeners’ evaluations. In an online between-subjects experiment, listeners rated an array of manipulated TTS voices, trained using a single speaker’s data. The evidence obtained is compatible with a slight plateau in a generally positive correlation between realism and approval. All the TTS voices used received ratings of below 50% on average for ‘human-likeness’, and therefore conclusions about UVE, i.e. negative reactions to voices perceived as very human-like, cannot be drawn from these data. Our results suggest that, although a correlation exists, high realism may not be necessary for relatively high approval; on average, voices with decreased pitch variation were rated about twice as highly for being ‘pleasant’ and ‘friendly’ as they were ‘like a human’. The relationship between pitch variation and perceived realism is examined and identified as a direction for further research.
Original language: English
Title of host publication: Proceedings of Speech Prosody 2024
Editors: Yiya Chen, Aoju Chen, Amalia Arvaniti
Publisher: International Speech Communication Association (ISCA)
Pages: 1115-1119
Number of pages: 5
Publication status: Published - 2024
Event: Speech Prosody 2024 - Leiden, Netherlands
Duration: 2 Jul 2024 to 5 Jul 2024
https://www.universiteitleiden.nl/sp2024

Publication series

Name: Speech Prosody
Publisher: International Speech Communication Association (ISCA)
ISSN (Electronic): 2333-2042

Conference

Conference: Speech Prosody 2024
City: Leiden
Period: 2/07/24 to 5/07/24

Keywords / Materials (for Non-textual outputs)

  • speech synthesis
  • speech prosody
  • pitch variation
  • human-computer interaction
  • TTS evaluation
