Voice source modelling using deep neural networks for statistical parametric speech synthesis

Tuomo Raitio, Heng Lu, John Kane, Antti Suni, Martti Vainio, Simon King, Paavo Alku

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents a voice source modelling method employing a deep neural network (DNN) to map from acoustic features to the time-domain glottal flow waveform. First, acoustic features and the glottal flow signal are estimated from each frame of the speech database. Pitch-synchronous glottal flow time-domain waveforms are extracted, interpolated to a constant duration, and stored in a codebook. Then, a DNN is trained to map from acoustic features to these duration-normalised glottal waveforms. At synthesis time, acoustic features are generated from a statistical parametric model, and from these, the trained DNN predicts the glottal flow waveform. Illustrations are provided to demonstrate that the proposed method successfully synthesises the glottal flow waveform and enables easy modification of the waveform by adjusting the input values to the DNN. In a subjective listening test, the proposed method was rated as equal to a high-quality method employing a stored glottal flow waveform.
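The pipeline described in the abstract (duration-normalise pitch-synchronous glottal pulses, then train a DNN from acoustic features to those fixed-length waveforms) can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: the feature dimensions, pulse lengths, and the one-hidden-layer network are all assumptions, and synthetic Hann windows stand in for real glottal pulses.

```python
import numpy as np

def normalise_pulse(pulse, target_len=400):
    """Interpolate a variable-length glottal pulse to a constant duration."""
    x_old = np.linspace(0.0, 1.0, len(pulse))
    x_new = np.linspace(0.0, 1.0, target_len)
    return np.interp(x_new, x_old, pulse)

rng = np.random.default_rng(0)

# Toy codebook: synthetic "pulses" of varying length, paired with features
# (e.g. F0, energy, spectral parameters -- dimensions are assumptions).
features = rng.normal(size=(200, 12))
pulses = [np.hanning(rng.integers(80, 300)) for _ in range(200)]
targets = np.stack([normalise_pulse(p) for p in pulses])

# One-hidden-layer MLP trained by gradient descent on squared error --
# a minimal stand-in for the paper's DNN.
W1 = rng.normal(scale=0.1, size=(12, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, 400)); b2 = np.zeros(400)
lr = 0.01
for _ in range(500):
    h = np.tanh(features @ W1 + b1)        # hidden activations
    pred = h @ W2 + b2                     # predicted normalised waveforms
    err = pred - targets                   # gradient of squared-error loss
    gW2 = h.T @ err / len(features); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h**2)         # backprop through tanh
    gW1 = features.T @ dh / len(features); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# "Synthesis time": predict a duration-normalised glottal waveform
# from new acoustic features; adjusting the inputs modifies the waveform.
new_feat = rng.normal(size=(1, 12))
waveform = np.tanh(new_feat @ W1 + b1) @ W2 + b2
print(waveform.shape)  # (1, 400)
```

In the paper the predicted duration-normalised pulse would then be resampled back to the target pitch period before overlap-add synthesis; that step is omitted here.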

Original language: English
Title of host publication: European Signal Processing Conference
Publisher: European Signal Processing Conference, EUSIPCO
Pages: 2290-2294
Number of pages: 5
ISBN (Print): 9780992862619
Publication status: Published - 1 Sept 2014
Event: 22nd European Signal Processing Conference, EUSIPCO 2014 - Lisbon, Portugal
Duration: 1 Sept 2014 - 5 Sept 2014

Conference

Conference: 22nd European Signal Processing Conference, EUSIPCO 2014
Country/Territory: Portugal
City: Lisbon
Period: 1/09/14 - 5/09/14

Keywords

  • Deep neural network
  • DNN
  • glottal flow
  • statistical parametric speech synthesis
  • voice source modelling
