Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis

Xin Wang, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Neural source-filter (NSF) models are deep neural networks that produce waveforms given input acoustic features. Unlike WaveNet and flow-based models, they use dilated-convolution-based neural filter modules to filter a sine-based excitation for waveform generation. One of the NSF models, the harmonic-plus-noise NSF (h-NSF) model, uses separate pairs of source and neural filter modules to generate the harmonic and noise waveform components. It performed as well as WaveNet in terms of speech quality while being superior in generation speed. However, h-NSF can be further improved: it merges the harmonic and noise components using pre-defined digital low- and high-pass filters, whereas it is well known that the maximum voice frequency (MVF), which separates the periodic and aperiodic spectral bands, is time-variant. We therefore propose a new h-NSF model with a time-variant and trainable MVF. We parameterize the digital low- and high-pass filters as windowed-sinc filters and predict their cut-off frequency (i.e., the MVF) from the input acoustic features. Our experiments demonstrated that the new model can predict a good MVF trajectory, slightly improves the quality of the generated speech, and maintains the fast generation speed.
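The key technical step described above is parameterizing the low- and high-pass filters as windowed-sinc filters whose cut-off frequency (the MVF) is predicted from the acoustic features, so that the cut-off stays differentiable and trainable. Below is a minimal PyTorch sketch of such a differentiable windowed-sinc kernel; the function names, the Hamming window, and the kernel length are illustrative assumptions and not necessarily the paper's exact implementation.

```python
import math
import torch

def sinc_lowpass_kernel(fc, half_len=40):
    """Windowed-sinc low-pass kernel with a differentiable cut-off.

    fc: normalized cut-off frequency in (0, 0.5), e.g. predicted MVF (Hz)
        divided by the sampling rate; a tensor of shape (...,).
    Returns a kernel of length 2*half_len + 1 that is differentiable
    w.r.t. fc, so the cut-off (MVF) can be trained by back-propagation.
    (Window choice and kernel length are illustrative assumptions.)
    """
    n = torch.arange(-half_len, half_len + 1, dtype=fc.dtype, device=fc.device)
    # Ideal low-pass impulse response: 2*fc*sinc(2*fc*n)
    h = 2 * fc.unsqueeze(-1) * torch.sinc(2 * fc.unsqueeze(-1) * n)
    # Hamming window, centered on n = 0, truncates the ideal response
    win = 0.54 + 0.46 * torch.cos(math.pi * n / half_len)
    h = h * win
    # Normalize to unit DC gain
    return h / h.sum(dim=-1, keepdim=True)

def sinc_highpass_kernel(fc, half_len=40):
    """Complementary high-pass kernel: unit impulse minus the low-pass kernel."""
    h_lp = sinc_lowpass_kernel(fc, half_len)
    delta = torch.zeros_like(h_lp)
    delta[..., half_len] = 1.0
    return delta - h_lp
```

In this sketch, the harmonic component would be convolved with the low-pass kernel and the noise component with the high-pass kernel before summation, with a separate kernel per frame when the predicted MVF varies over time.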
Original language: English
Title of host publication: Proceedings of the 10th ISCA Speech Synthesis Workshop
Publisher: International Speech Communication Association
Pages: 1-6
Number of pages: 6
DOIs
Publication status: Published - 22 Sept 2019
Event: The 10th ISCA Speech Synthesis Workshop - Austrian Museum of Folk Life and Folk Art, Vienna, Austria
Duration: 20 Sept 2019 - 22 Sept 2019
Conference number: 10
http://ssw10.oeaw.ac.at/index.html

Publication series

Name
Publisher: International Speech Communication Association
ISSN (Electronic): 2312-2846

Conference

Conference: The 10th ISCA Speech Synthesis Workshop
Abbreviated title: SSW 2019
Country/Territory: Austria
City: Vienna
Period: 20/09/19 - 22/09/19
Internet address: http://ssw10.oeaw.ac.at/index.html

Keywords / Materials (for Non-textual outputs)

  • speech synthesis
  • source-filter model
  • harmonic-plus-noise waveform model
  • neural network
