Abstract
Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality-control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the β-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
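To illustrate how a β-divergence criterion can disregard ill-fitting points, here is a minimal sketch on a toy problem. It is not the paper's DNN or MDN setup. It assumes a single Gaussian with fixed standard deviation, for which minimising the density power divergence over the mean reduces to a weighted-mean equation with weights proportional to the model density raised to the power β. The function name `robust_mean_beta`, the values of `sigma` and `beta`, and the toy data are all hypothetical.

```python
import numpy as np

def robust_mean_beta(x, sigma=1.0, beta=0.5, iters=50):
    """Gaussian mean estimate under the density power divergence (sigma fixed).

    With sigma fixed, a stationary point of the criterion satisfies
    mu = sum(w_i * x_i) / sum(w_i), where w_i is proportional to f(x_i; mu)^beta.
    Points far from the current fit receive weights close to zero.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)  # robust starting point for the fixed-point iteration
    for _ in range(iters):
        w = np.exp(-beta * (x - mu) ** 2 / (2.0 * sigma ** 2))
        mu = np.sum(w * x) / np.sum(w)
    return mu

# Hypothetical data: 95 well-behaved values plus 5 gross errors,
# standing in for durations corrupted by misalignment.
inliers = np.linspace(-2.0, 2.0, 95)
outliers = np.full(5, 20.0)
data = np.concatenate([inliers, outliers])

plain = data.mean()              # ordinary mean, pulled towards the outliers
robust = robust_mean_beta(data)  # beta-divergence estimate, stays with the bulk
print(plain, robust)
```

As β approaches zero the weights become uniform and the estimate approaches the ordinary mean, so β trades statistical efficiency for robustness. The iteration finds a stationary point that depends on the starting value, which is why the sketch starts from the median.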
Original language | English |
---|---|
Title of host publication | 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 5130-5134 |
Number of pages | 5 |
ISBN (Print) | 978-1-4799-9988-0 |
DOIs | |
Publication status | Published - Mar 2016 |
Event | 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China. Duration: 20 Mar 2016 → 25 Mar 2016. https://www2.securecms.com/ICASSP2016/Default.asp |
Conference
Conference | 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 |
---|---|
Abbreviated title | ICASSP 2016 |
Country/Territory | China |
City | Shanghai |
Period | 20/03/16 → 25/03/16 |
Internet address |
Publication title: Robust TTS Duration Modelling Using DNNs