TY - GEN
T1 - ADEPT
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Torresquintero, Alexandra
AU - Teh, Tian Huey
AU - Wallis, Christopher G.R.
AU - Staib, Marlene
AU - Ram Mohan, Devang S.
AU - Hu, Vivian
AU - Foglianti, Lorenzo
AU - Gao, Jiameng
AU - King, Simon
PY - 2021/9/3
Y1 - 2021/9/3
N2 - Text-to-speech is now able to achieve near-human naturalness, and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it. We introduce a dataset of prosodically varied reference natural speech samples for evaluating prosody transfer. The samples include global variations reflecting emotion and interpersonal attitude, and local variations reflecting topical emphasis, propositional attitude, syntactic phrasing and marked tonicity. The corpus includes only prosodic variations that listeners are able to distinguish with reasonable accuracy, and we report these figures as a benchmark against which text-to-speech prosody transfer can be compared. We conclude the paper with a demonstration of our proposed evaluation methodology, using the corpus to evaluate two text-to-speech models that perform prosody transfer.
AB - Text-to-speech is now able to achieve near-human naturalness, and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it. We introduce a dataset of prosodically varied reference natural speech samples for evaluating prosody transfer. The samples include global variations reflecting emotion and interpersonal attitude, and local variations reflecting topical emphasis, propositional attitude, syntactic phrasing and marked tonicity. The corpus includes only prosodic variations that listeners are able to distinguish with reasonable accuracy, and we report these figures as a benchmark against which text-to-speech prosody transfer can be compared. We conclude the paper with a demonstration of our proposed evaluation methodology, using the corpus to evaluate two text-to-speech models that perform prosody transfer.
KW - evaluation
KW - TTS prosody transfer
U2 - 10.21437/Interspeech.2021-1610
DO - 10.21437/Interspeech.2021-1610
M3 - Conference contribution
AN - SCOPUS:85119173444
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3351
EP - 3355
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -