Edinburgh Research Explorer

Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access permissions

Open

Documents

https://www.isca-speech.org/archive/SSW_2019/abstracts/SSW10_O_1-1.html
Original language: English
Title of host publication: Proceedings of the 10th ISCA Speech Synthesis Workshop
Publisher: International Speech Communication Association
Pages: 1-6
Number of pages: 6
DOIs
Publication status: Published - 22 Sep 2019
Event: The 10th ISCA Speech Synthesis Workshop - Austrian museum of folk life and folk art in Vienna, Vienna, Austria
Duration: 20 Sep 2019 to 22 Sep 2019
Conference number: 10
http://ssw10.oeaw.ac.at/index.html

Publication series

Name
Publisher: International Speech Communication Association
ISSN (Electronic): 2312-2846

Conference

Conference: The 10th ISCA Speech Synthesis Workshop
Abbreviated title: SSW10
Country: Austria
City: Vienna
Period: 20/09/19 to 22/09/19
Internet address

Abstract

Neural source-filter (NSF) models are deep neural networks that produce waveforms given input acoustic features. They use dilated-convolution-based neural filter modules to filter a sine-based excitation for waveform generation, which is different from WaveNet and flow-based models. One of the NSF models, called the harmonic-plus-noise NSF (h-NSF) model, uses separate pairs of source and neural filter to generate harmonic and noise waveform components. It performed as well as WaveNet in terms of speech quality while being superior in generation speed. However, h-NSF may be further improved. While h-NSF merges the harmonic and noise components using pre-defined digital low- and high-pass filters, it is well known that the maximum voice frequency (MVF) that separates the periodic and aperiodic spectral bands is time-variant. Therefore, we propose a new h-NSF model with time-variant and trainable MVF. We parameterize the digital low- and high-pass filters as windowed-sinc filters and predict their cut-off frequency (i.e., MVF) from input acoustic features. Our experiments demonstrated that the new model can predict a good trajectory of MVF. The quality of the generated speech was slightly improved, and the fast generation speed was maintained.
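The merging step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the Hamming window, the 31-tap length, the cut-off normalised so that 1.0 is the Nyquist frequency, and the high-pass built by spectral inversion (unit impulse minus the low-pass taps) are all assumptions made for illustration. In the paper the cut-off trajectory is predicted from acoustic features by the network; here it is simply passed in as an array.

```python
import numpy as np


def windowed_sinc_lowpass(cutoff, half_len=15):
    """Low-pass FIR taps for a normalised cut-off in (0, 1), where 1 is Nyquist.

    Hypothetical helper: Hamming window and tap count are illustrative choices.
    """
    n = np.arange(-half_len, half_len + 1)
    # Ideal low-pass impulse response (a sinc), truncated by a Hamming window.
    taps = cutoff * np.sinc(cutoff * n) * np.hamming(2 * half_len + 1)
    return taps / taps.sum()  # normalise to unit gain at 0 Hz


def complementary_highpass(lowpass_taps):
    """High-pass taps by spectral inversion (assumed here, not from the paper)."""
    taps = -lowpass_taps.copy()
    taps[len(taps) // 2] += 1.0  # unit impulse minus the low-pass filter
    return taps


def merge_harmonic_noise(harmonic, noise, cutoff_per_sample, half_len=15):
    """Low-pass the harmonic part, high-pass the noise part, and sum them.

    The cut-off (standing in for the MVF) may change at every sample, so the
    filter taps are recomputed per sample. Filtering is zero-phase: each output
    sample uses half_len input samples on either side, with zero padding.
    """
    width = 2 * half_len + 1
    padded_h = np.pad(harmonic, half_len)
    padded_n = np.pad(noise, half_len)
    out = np.zeros(len(harmonic))
    for t in range(len(harmonic)):
        lp = windowed_sinc_lowpass(cutoff_per_sample[t], half_len)
        hp = complementary_highpass(lp)
        # The taps are symmetric, so a dot product equals the convolution sum.
        out[t] = lp @ padded_h[t:t + width] + hp @ padded_n[t:t + width]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_samples = 400
    harmonic = np.sin(np.pi * 0.05 * np.arange(num_samples))  # toy harmonic part
    noise = 0.1 * rng.standard_normal(num_samples)            # toy noise part
    mvf = np.linspace(0.3, 0.7, num_samples)                  # toy cut-off trajectory
    merged = merge_harmonic_noise(harmonic, noise, mvf)
    print(merged.shape)
```

In a trainable setting the same tap formula would be written in an automatic-differentiation framework so that gradients can reach the cut-off predictor; the per-sample NumPy loop here only illustrates the signal path.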

    Research areas

  • speech synthesis, source-filter model, harmonic-plus-noise waveform model, neural network

Event

The 10th ISCA Speech Synthesis Workshop

20/09/19 to 22/09/19

Vienna, Austria

Event: Conference

ID: 117342931