Vocal attractiveness of statistical speech synthesisers

S. Andraszewicz, J. Yamagishi, S. King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Our previous analysis of speaker-adaptive HMM-based speech synthesis methods suggested that there are two possible reasons why average voices can obtain higher subjective scores than any individual adapted voice: 1) model adaptation degrades speech quality proportionally to the distance 'moved' by the transforms, and 2) psychoacoustic effects relating to the attractiveness of the voice. This paper is a follow-on from that analysis and aims to separate these two effects. Our latest perceptual experiments focus on attractiveness, using average voices and speaker-dependent voices without model transformation, and show that using several speakers to create a voice improves smoothness (measured by Harmonics-to-Noise Ratio), reduces the distance of the final voice from the average voice in the log F0-F1 space, and hence makes it more attractive at the segmental level. However, this effect is weakened or overridden at the supra-segmental or sentence level.
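The "distance from the average voice in the log F0-F1 space" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which metric the authors used, so the Euclidean distance on natural-log values below is an assumption, as are the function name `log_f0_f1_distance` and the example frequencies, which are hypothetical and not taken from the paper.

```python
import math


def log_f0_f1_distance(voice, average):
    """Euclidean distance between two (F0, F1) points after taking natural logs.

    Both arguments are (F0, F1) pairs in Hz. The choice of Euclidean distance
    is an assumption for illustration, not the paper's stated method.
    """
    (f0_a, f1_a), (f0_b, f1_b) = voice, average
    return math.hypot(
        math.log(f0_a) - math.log(f0_b),
        math.log(f1_a) - math.log(f1_b),
    )


# Hypothetical values in Hz, for illustration only (not from the paper).
average_voice = (120.0, 500.0)
single_speaker = (150.0, 560.0)
print(log_f0_f1_distance(single_speaker, average_voice))
```

Working in the log domain means that equal frequency ratios give equal distances, so a voice one octave above the average in F0 is as far away as a voice one octave below it.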
Original language: English
Title of host publication: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
Number of pages: 4
Publication status: Published - 1 May 2011

