Measuring the contribution to cognitive load of each predicted vocoder speech parameter in DNN-based speech synthesis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Under noisy conditions, listening to even high-quality text-to-speech - such as that generated by a Deep Neural Network (DNN) driving a vocoder - still requires greater cognitive effort than listening to natural speech. Vocoding itself, plus errors in the DNN's predictions of the vocoder speech parameters, are assumed to be responsible. To better understand the contribution of each parameter, we construct a range of systems that vary from copy-synthesis (i.e., vocoding) to full text-to-speech generated using a Deep Neural Network system. Each system combines some speech parameters (e.g., spectral envelope) from copy-synthesis with other speech parameters (e.g., F0) predicted from text. Cognitive load was measured using a pupillometry paradigm described in our previous work. Our results reveal the differing contributions that each predicted speech parameter makes to increasing cognitive load.
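The abstract's intermediate conditions amount to swapping one vocoder parameter stream at a time before resynthesis. The sketch below illustrates the idea, assuming a WORLD-style vocoder via the `pyworld` library; the record does not state which vocoder or frame settings were used, and the flat-F0 stand-in for the DNN prediction is purely illustrative.

```python
import numpy as np
import soundfile as sf
import pyworld as pw

# Analyse a natural utterance to obtain copy-synthesis parameters.
x, fs = sf.read("natural.wav")          # assumed mono recording
x = x.astype(np.float64)

f0_nat, t = pw.harvest(x, fs)           # F0 contour and frame times
sp_nat = pw.cheaptrick(x, f0_nat, t, fs)  # spectral envelope
ap_nat = pw.d4c(x, f0_nat, t, fs)         # aperiodicity

# Stand-in for the DNN-predicted F0: flatten the natural contour to its
# voiced-frame mean. In the actual experiments this stream would come from
# the text-to-speech model's prediction, time-aligned to the same frames.
f0_pred = np.where(f0_nat > 0, f0_nat[f0_nat > 0].mean(), 0.0)

# Hybrid condition: natural spectral envelope and aperiodicity,
# "predicted" F0. Other conditions swap different parameter streams.
y = pw.synthesize(f0_pred, sp_nat, ap_nat, fs)
sf.write("hybrid_f0_predicted.wav", y, fs)
```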
Original language: English
Title of host publication: Proceedings of the 10th ISCA Speech Synthesis Workshop
Publisher: International Speech Communication Association
Pages: 121-126
Number of pages: 6
DOIs
Publication status: Published - 22 Sept 2019
Event: The 10th ISCA Speech Synthesis Workshop - Austrian Museum of Folk Life and Folk Art, Vienna, Austria
Duration: 20 Sept 2019 - 22 Sept 2019
Conference number: 10
http://ssw10.oeaw.ac.at/index.html

Publication series

Name
Publisher: International Speech Communication Association
ISSN (Electronic): 1990-9772

Conference

Conference: The 10th ISCA Speech Synthesis Workshop
Abbreviated title: SSW 2019
Country/Territory: Austria
City: Vienna
Period: 20/09/19 - 22/09/19
Internet address: http://ssw10.oeaw.ac.at/index.html

Keywords / Materials (for Non-textual outputs)

  • text-to-speech
  • deep neural networks
  • cognitive load
  • pupillometry
  • adverse conditions
