Using Bayesian Networks to find relevant context features for HMM-based speech synthesis

Heng Lu, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Speech units are highly context-dependent, so taking contextual features into account is essential for speech modelling. Context is employed in HMM-based text-to-speech synthesis systems via context-dependent phone models. A very wide context is taken into account, represented by a large set of contextual factors. However, most of these factors probably have no significant influence on the speech, most of the time. To discover which combinations of features should be taken into account, decision tree-based context clustering is used. But the space of context-dependent models is vast, and the number of contexts seen in the training data is only a tiny fraction of this space, so the task of the decision tree is very hard: to generalise from observations of a tiny fraction of the space to the rest of the space, whilst ignoring uninformative or redundant context features. The structure of the context feature space has not been systematically studied for speech synthesis. In this paper we discover a dependency structure by learning a Bayesian Network over the joint distribution of the features and the speech. We demonstrate that it is possible to discard the majority of context features with minimal impact on quality, measured by a perceptual test.
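The idea of learning a Bayesian Network to expose which context features actually interact with the speech can be illustrated with a Chow-Liu tree, one classic structure-learning algorithm (a minimal sketch, not necessarily the method used in the paper; the variable names and synthetic data are invented for illustration). Features whose edge in the learned tree touches the acoustic variable are candidates to keep; the rest could be discarded:

```python
import itertools
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    n = len(x)
    joint = {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    px = {a: np.mean(x == a) for a in set(x)}
    py = {b: np.mean(y == b) for b in set(y)}
    return sum((c / n) * np.log((c / n) / (px[a] * py[b]))
               for (a, b), c in joint.items())

def chow_liu_tree(data):
    """Chow-Liu structure learning: a maximum spanning tree over pairwise
    mutual information, built with Kruskal's algorithm.
    data: dict mapping variable name -> 1-D integer array.
    Returns the tree as a list of (name, name) edges."""
    names = list(data)
    # All candidate edges, heaviest (highest-MI) first.
    edges = sorted(((mutual_information(data[u], data[v]), u, v)
                    for u, v in itertools.combinations(names, 2)),
                   reverse=True)
    # Union-find to reject cycle-forming edges.
    parent = {name: name for name in names}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    tree = []
    for _, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Synthetic example: the acoustic variable depends on "stress" and "phone"
# but is independent of "word_len" (all names are made up for this sketch).
rng = np.random.default_rng(0)
n = 2000
data = {
    "stress": rng.integers(0, 2, n),
    "phone": rng.integers(0, 5, n),
    "word_len": rng.integers(0, 4, n),
}
data["speech"] = data["stress"] * 5 + data["phone"]
print(chow_liu_tree(data))
```

On this toy data the learned tree links "speech" to "stress" and "phone" (the informative features), while "word_len" attaches only via a near-zero-MI edge, so it would be discarded as a context feature.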
Original language: English
Title of host publication: INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012
Publisher: ISCA (International Speech Communication Association)
Publication status: Published - 1 Sep 2012
