Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis

Xin Wang, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Neural source-filter (NSF) models are deep neural networks that produce waveforms given input acoustic features. They use dilated-convolution-based neural filter modules to filter a sine-based excitation for waveform generation, which is different from WaveNet and flow-based models. One of the NSF models, the harmonic-plus-noise NSF (h-NSF) model, uses separate pairs of source and neural filter modules to generate harmonic and noise waveform components. It performed as well as WaveNet in terms of speech quality while being superior in generation speed. However, h-NSF may be further improved. While h-NSF merges the harmonic and noise components using pre-defined digital low- and high-pass filters, it is well known that the maximum voice frequency (MVF), which separates the periodic and aperiodic spectral bands, is time-variant. Therefore, we propose a new h-NSF model with a time-variant and trainable MVF. We parameterize the digital low- and high-pass filters as windowed-sinc filters and predict their cut-off frequency (i.e., the MVF) from input acoustic features. Our experiments demonstrated that the new model can predict a good trajectory of the MVF. The quality of the generated speech was slightly improved, and the fast generation speed was maintained.
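The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Hamming window, the 31-tap length, the use of spectral inversion to obtain the high-pass filter, and all function names are assumptions made for illustration. The sketch also uses a single cut-off value per call, whereas a time-variant version would redesign the filter pair for each frame from a predicted MVF.

```python
import math


def windowed_sinc_lowpass(cutoff_hz, sample_rate, num_taps=31):
    """Return low-pass FIR taps (Hamming-windowed sinc, unit DC gain)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd so the filter has a center tap")
    fc = cutoff_hz / sample_rate  # normalized cut-off in cycles per sample
    if not 0.0 < fc < 0.5:
        raise ValueError("cut-off must lie strictly between 0 and Nyquist")
    center = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - center
        # Ideal low-pass impulse response: 2*fc*sinc(2*fc*k)
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        # Hamming window (an assumed choice of window)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize so the DC gain is 1


def complementary_highpass(lowpass_taps):
    """Spectral inversion: unit impulse at the center tap minus the low-pass."""
    center = (len(lowpass_taps) - 1) // 2
    taps = [-t for t in lowpass_taps]
    taps[center] += 1.0
    return taps


def fir_filter(signal, taps):
    """Causal FIR filtering with zero initial state; output length = input length."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * signal[i - j]
        out.append(acc)
    return out


def merge_harmonic_noise(harmonic, noise, mvf_hz, sample_rate, num_taps=31):
    """Low-pass the harmonic part, high-pass the noise part, and sum them."""
    lp = windowed_sinc_lowpass(mvf_hz, sample_rate, num_taps)
    hp = complementary_highpass(lp)
    low_band = fir_filter(harmonic, lp)
    high_band = fir_filter(noise, hp)
    return [a + b for a, b in zip(low_band, high_band)]
```

Because the high-pass filter here is defined as an impulse minus the low-pass filter, the two filters sum to a pure delay, so the cut-off only decides how the two components share the spectrum. Whether the proposed model constrains its filter pair in the same way is not stated in the abstract.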
Original language: English
Title of host publication: Proceedings of the 10th ISCA Speech Synthesis Workshop
Publisher: International Speech Communication Association
Number of pages: 6
Publication status: Published - 22 Sep 2019
Event: The 10th ISCA Speech Synthesis Workshop - Austrian museum of folk life and folk art in Vienna, Vienna, Austria
Duration: 20 Sep 2019 to 22 Sep 2019
Conference number: 10

Publication series

Publisher: International Speech Communication Association
ISSN (Electronic): 2312-2846


Conference: The 10th ISCA Speech Synthesis Workshop
Abbreviated title: SSW 2019


  • speech synthesis
  • source-filter model
  • harmonic-plus-noise waveform model
  • neural network

