Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis

Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In statistical parametric speech synthesis (SPSS) systems that use a high-quality vocoder, acoustic features such as mel-cepstrum coefficients and F0 are predicted from linguistic features, and the vocoder then generates speech waveforms from them. However, the generated speech generally suffers from quality deterioration, such as buzziness, caused by the vocoder. Although several remedies, such as improved excitation models, have been investigated to alleviate the problem, it is difficult to avoid completely as long as the SPSS system is based on a vocoder. To overcome this problem, there have recently been attempts to model waveform samples directly. These have demonstrated superior performance, but computation time and latency remain issues. With the aim of constructing another type of DNN-based speech synthesizer with neither a vocoder nor a computational explosion, we investigated direct modeling of frequency spectra and waveform generation based on phase recovery. In this framework, STFT spectral amplitudes, which include harmonic information derived from F0, are predicted directly by a DNN-based acoustic model, and Griffin and Lim’s approach is used to recover phase and generate waveforms. The experimental results showed that the proposed system synthesized speech without buzziness and outperformed a conventional vocoder-based system.
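The waveform-generation step described in the abstract (recovering phase from STFT amplitudes with Griffin and Lim's approach) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the frame length, hop size, iteration count, random phase initialization, and helper names (`stft`, `istft`, `griffin_lim`, `spectral_error`) are all assumptions chosen for illustration, and the amplitudes here come from a synthetic two-tone signal rather than from a DNN-based acoustic model.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Windowed overlap-add inverse of stft(), normalized by the window energy."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    # Floor the normalizer so the poorly covered edge samples do not blow up.
    return signal / np.maximum(norm, 1e-3)

def griffin_lim(magnitude, n_fft=256, hop=64, n_iter=50, seed=0):
    """Estimate a waveform whose STFT amplitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: uniformly random phase.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        rebuilt = stft(signal, n_fft, hop)
        # Keep the given amplitudes, adopt the phase of the re-analyzed signal.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase, n_fft, hop)

def spectral_error(signal, magnitude, n_fft=256, hop=64):
    """Relative distance between the signal's STFT amplitude and the target."""
    diff = np.abs(stft(signal, n_fft, hop)) - magnitude
    return np.linalg.norm(diff) / np.linalg.norm(magnitude)

# Stand-in for DNN-predicted amplitudes: the STFT amplitude of a two-tone signal.
sample_rate = 16000
t = np.arange(sample_rate // 4) / sample_rate
reference = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
magnitude = np.abs(stft(reference))
recovered = griffin_lim(magnitude)
```

In a synthesizer of the kind the abstract describes, `magnitude` would be replaced by the amplitudes predicted by the acoustic model. The iteration count trades synthesis time against how closely the STFT amplitude of the output matches the target.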
Original language: English
Title of host publication: Proceedings Interspeech 2017
Publisher: International Speech Communication Association
Pages: 1128-1132
Number of pages: 5
Publication status: Published - 20 Aug 2017
Event: Interspeech 2017 - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017
http://www.interspeech2017.org/

Publication series

Name: Interspeech
Publisher: International Speech Communication Association
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech 2017
Country: Sweden
City: Stockholm
Period: 20/08/17 to 24/08/17
