Abstract
The exploration of uncanny valley effects (UVE), a distaste for entities that appear almost, but not quite, human, has been a productive topic of research in human-robot interaction. Meanwhile, realistic text-to-speech (TTS) voices are increasingly encountered in various settings. In this work, we aim to describe the relationship between synthesised voices’ perceived human-likeness and pleasantness and seek evidence of auditory UVE in listeners’ evaluations. In an online between-subjects experiment, listeners rated an array of manipulated TTS voices, trained using a single speaker’s data. The evidence obtained is compatible with a slight plateau in a generally positive correlation between realism and approval. All the TTS voices used received average ratings below 50% for ‘human-likeness’, and therefore conclusions about UVE, i.e. negative reactions to voices perceived as very human-like, cannot be drawn from these data. Our results suggest that, although a correlation exists, high realism may not be necessary for relatively high approval; on average, voices with decreased pitch variation were rated about twice as highly for being ‘pleasant’ and ‘friendly’ as they were for being ‘like a human’. The relationship between pitch variation and perceived realism is examined and identified as a direction for further research.
Original language | English |
---|---|
Title of host publication | Proceedings of Speech Prosody 2024 |
Editors | Yiya Chen, Aoju Chen, Amalia Arvaniti |
Publisher | International Speech Communication Association (ISCA) |
Pages | 1115-1119 |
Number of pages | 5 |
Publication status | Published - 2024 |
Event | Speech Prosody 2024, Leiden, Netherlands. Duration: 2 Jul 2024 → 5 Jul 2024. https://www.universiteitleiden.nl/sp2024 |
Publication series
Name | Speech Prosody |
---|---|
Publisher | International Speech Communication Association (ISCA) |
ISSN (Electronic) | 2333-2042 |
Conference
Conference | Speech Prosody 2024 |
---|---|
City | Leiden |
Period | 2 Jul 2024 → 5 Jul 2024 |
Keywords / Materials (for Non-textual outputs)
- speech synthesis
- speech prosody
- pitch variation
- human-computer interaction
- TTS evaluation