Robust TTS Duration Modelling Using DNNs

Gustav Henter, Srikanth Ronanki, Oliver Watts, Mirjam Wester, Zhizheng Wu, Simon King

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality-control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the β-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
Original languageEnglish
Title of host publication2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublisherInstitute of Electrical and Electronics Engineers
Pages5130-5134
Number of pages5
ISBN (Print)978-1-4799-9988-0
DOIs
Publication statusPublished - Mar 2016
Event41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - China, Shanghai, China
Duration: 20 Mar 201625 Mar 2016
https://www2.securecms.com/ICASSP2016/Default.asp

Conference

Conference41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
Abbreviated titleICASSP 2016
Country/TerritoryChina
CityShanghai
Period20/03/1625/03/16
Internet address

Fingerprint

Dive into the research topics of 'Robust TTS Duration Modelling Using DNNs'. Together they form a unique fingerprint.

Cite this