Towards minimum perceptual error training for DNN-based speech synthesis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


We propose to use a perceptually-oriented domain to improve the quality of text-to-speech generated by deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain we extract an approximated version of the Spectro-Temporal Excitation Pattern that was originally proposed as part of a model of hearing speech in noise. We train DNNs that predict band aperiodicity, fundamental frequency and Mel cepstral coefficients and compare generated speech when the spectral cost function is defined in the Mel cepstral, warped log spectrum or perceptual domains. Objective results indicate that the perceptual domain system achieves the highest quality.
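The idea of predicting parameters in one domain while computing the cost in another can be illustrated with a minimal sketch. It is not the method from the paper: it does not compute the Spectro-Temporal Excitation Pattern, and it stands in a plain log-spectral error for the perceptual one. It assumes the common convention in which the log magnitude spectrum on the warped frequency axis is a cosine series of the Mel cepstral coefficients. The function names `cosine_basis` and `spectral_domain_loss`, the cepstral order and the number of bins are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_basis(order, n_bins):
    """Matrix mapping cepstra (c_0..c_order) to a log spectrum on n_bins.

    Assumed convention: log|H(w)| = sum_m c_m * cos(m * w), with w sampled
    uniformly on [0, pi] of the (already warped) frequency axis.
    """
    omega = np.linspace(0.0, np.pi, n_bins)
    m = np.arange(order + 1)
    return np.cos(np.outer(omega, m))  # shape: (n_bins, order + 1)


def spectral_domain_loss(c_pred, c_target, basis):
    """Mean squared error measured in the log-spectral domain.

    c_pred, c_target: arrays of shape (frames, order + 1).
    Returns the loss and its gradient with respect to c_pred, so the
    error can be propagated back to a network that outputs cepstra.
    """
    diff = (c_pred - c_target) @ basis.T          # (frames, n_bins)
    loss = np.mean(diff ** 2)
    grad = 2.0 * (diff @ basis) / diff.size       # (frames, order + 1)
    return loss, grad


# Illustrative usage with random data standing in for real features.
rng = np.random.default_rng(0)
basis = cosine_basis(order=24, n_bins=128)
c_target = rng.normal(size=(10, 25))
c_pred = c_target + 0.1 * rng.normal(size=(10, 25))
loss, grad = spectral_domain_loss(c_pred, c_target, basis)
```

Because the mapping in this sketch is linear, the gradient has a closed form. A nonlinear perceptual representation would need the chain rule through each of its stages, or an automatic differentiation framework.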
Original language: English
Title of host publication: INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association
Place of Publication: Dresden
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - Sep 2015
Event: Interspeech 2015 - Dresden, Germany
Duration: 6 Sep 2015 to 9 Sep 2015


Conference: Interspeech 2015

