Abstract
Neural source-filter (NSF) models are deep neural networks that produce waveforms from input acoustic features. Unlike WaveNet and flow-based models, they generate waveforms by filtering a sine-based excitation with neural filter modules built on dilated convolutions. One NSF model, the harmonic-plus-noise NSF (h-NSF) model, uses separate pairs of source and neural filter to generate the harmonic and noise waveform components. It performed as well as WaveNet in terms of speech quality while being superior in generation speed. However, h-NSF can be improved further. While h-NSF merges the harmonic and noise components using pre-defined digital low- and high-pass filters, it is well known that the maximum voice frequency (MVF), which separates the periodic and aperiodic spectral bands, is time variant. Therefore, we propose a new h-NSF model with a time-variant and trainable MVF. We parameterize the digital low- and high-pass filters as windowed-sinc filters and predict their cut-off frequency (i.e., the MVF) from the input acoustic features. Our experiments demonstrated that the new model can predict a good trajectory of the MVF. The quality of the generated speech was slightly improved, and the fast generation speed was maintained.
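The merging step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: in the paper the MVF is predicted by the network from acoustic features, whereas here `mvf_hz` is simply a given array with one value per frame. The function names, the Hamming window, the 31-tap filter length, and the 80-sample frame length are all assumptions chosen for illustration. The high-pass filter is derived from the low-pass one by spectral inversion, so the two filters sum to a unit impulse.

```python
import numpy as np

def windowed_sinc_lowpass(cutoff, num_taps=31):
    """Low-pass FIR filter; cutoff is normalized to (0, 1], where 1 is Nyquist."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * np.sinc(cutoff * n)      # truncated ideal low-pass response
    h = h * np.hamming(num_taps)          # window choice is an assumption
    return h / h.sum()                    # normalize to unit gain at DC

def highpass_from_lowpass(h_lp):
    """Complementary high-pass filter by spectral inversion (odd length only)."""
    h_hp = -h_lp
    h_hp[(len(h_lp) - 1) // 2] += 1.0     # unit impulse minus the low-pass filter
    return h_hp

def merge_harmonic_noise(harmonic, noise, mvf_hz, sample_rate,
                         frame_len=80, num_taps=31):
    """Low-pass the harmonic part and high-pass the noise part, frame by frame.

    mvf_hz holds one cut-off frequency (must be > 0) per frame of frame_len samples.
    """
    pad = num_taps // 2
    harm_pad = np.pad(harmonic, pad)      # zero-pad so every frame sees context
    noise_pad = np.pad(noise, pad)
    out = np.zeros(len(harmonic))
    for i, mvf in enumerate(mvf_hz):
        start = i * frame_len
        end = min(start + frame_len, len(harmonic))
        if start >= len(harmonic):
            break
        h_lp = windowed_sinc_lowpass(mvf / (sample_rate / 2), num_taps)
        h_hp = highpass_from_lowpass(h_lp)
        seg_h = harm_pad[start:end + 2 * pad]
        seg_n = noise_pad[start:end + 2 * pad]
        # "valid" convolution returns exactly end - start samples for this frame
        out[start:end] = (np.convolve(seg_h, h_lp, mode="valid")
                          + np.convolve(seg_n, h_hp, mode="valid"))
    return out
```

Because the two filters are complementary, feeding the same signal in as both `harmonic` and `noise` returns that signal unchanged, which gives a quick sanity check. A trainable version would need the cut-off to stay differentiable, which this sketch does not attempt.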
Original language | English |
---|---|
Title of host publication | Proceedings of the 10th ISCA Speech Synthesis Workshop |
Publisher | International Speech Communication Association |
Pages | 1-6 |
Number of pages | 6 |
DOIs | |
Publication status | Published - 22 Sept 2019 |
Event | The 10th ISCA Speech Synthesis Workshop - Austrian museum of folk life and folk art in Vienna, Vienna, Austria; Duration: 20 Sept 2019 → 22 Sept 2019; Conference number: 10; http://ssw10.oeaw.ac.at/index.html |
Publication series
Name | |
---|---|
Publisher | International Speech Communication Association |
ISSN (Electronic) | 2312-2846 |
Conference
Conference | The 10th ISCA Speech Synthesis Workshop |
---|---|
Abbreviated title | SSW 2019 |
Country/Territory | Austria |
City | Vienna |
Period | 20/09/19 → 22/09/19 |
Internet address |
Keywords / Materials (for Non-textual outputs)
- speech synthesis
- source-filter model
- harmonic-plus-noise waveform model
- neural network