Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning

Qiong Hu, Zhizheng Wu, Korin Richmond, Junichi Yamagishi, Yannis Stylianou, Ranniery Maia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

It has recently been shown that deep neural networks (DNN) can improve the quality of statistical parametric speech synthesis (SPSS) when using a source-filter vocoder. Our own previous work has furthermore shown that a dynamic sinusoidal model (DSM) is also highly suited to DNN-based SPSS, whereby sinusoids may either be used themselves as a "direct parameterisation" (DIR), or they may be encoded using an "intermediate spectral parameterisation" (INT). The approach in that work was effectively to replace a decision tree with a neural network. However, waveform parameterisation and synthesis steps that have been developed to suit HMMs may not fully exploit DNN capabilities. Here, in contrast, we investigate ways to combine INT and DIR at the levels of both DNN modelling and waveform generation. For DNN training, we propose to use multi-task learning to model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. Our results show that combining these improves modelling accuracy for both tasks. Next, during synthesis, instead of discarding parameters from the second task, a fusion method using harmonic amplitudes derived from both tasks is applied. Preference tests show the proposed method gives improved performance, and that this applies to synthesising both with and without global variance parameters.
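To make the multi-task setup described above concrete, a minimal sketch is given below, assuming a PyTorch-style implementation: shared hidden layers map linguistic input features to a primary cepstral output (INT) and a secondary log-amplitude output (DIR), with a weighted sum of per-task losses, and a simple linear fusion of the two harmonic log-amplitude streams at synthesis time. All layer sizes, feature dimensionalities, the secondary-task weight and the fusion rule are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MultiTaskDSM(nn.Module):
    """Sketch of a multi-task DNN for sinusoidal SPSS (dimensions are assumed)."""
    def __init__(self, ling_dim=600, cep_dim=60, amp_dim=50, hidden=1024):
        super().__init__()
        # Shared trunk over linguistic input features.
        self.shared = nn.Sequential(
            nn.Linear(ling_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.cep_head = nn.Linear(hidden, cep_dim)  # primary task: cepstra (INT)
        self.amp_head = nn.Linear(hidden, amp_dim)  # secondary task: log amplitudes (DIR)

    def forward(self, x):
        h = self.shared(x)
        return self.cep_head(h), self.amp_head(h)

def multitask_loss(cep_pred, amp_pred, cep_tgt, amp_tgt, w_secondary=0.5):
    """Weighted sum of per-task MSE losses; w_secondary is an assumed weight."""
    mse = nn.functional.mse_loss
    return mse(cep_pred, cep_tgt) + w_secondary * mse(amp_pred, amp_tgt)

def fuse_log_amplitudes(amp_from_cepstra, amp_direct, alpha=0.5):
    """Illustrative fusion: combine harmonic log amplitudes derived from the
    cepstral (INT) stream with those predicted directly (DIR) before
    sinusoidal waveform generation; alpha is an assumed mixing weight."""
    return alpha * amp_from_cepstra + (1.0 - alpha) * amp_direct
```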
Original language: English
Title of host publication: INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association
Publisher: International Speech Communication Association
Pages: 854-858
Number of pages: 5
Publication status: Published - Sept 2015
