Factorized context modelling for Text-to-Speech synthesis

Heng Lu, S. King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Because speech units are highly context-dependent, HMM-based Text-to-Speech (TTS) synthesis systems generally use a large number of linguistic context features, via context-dependent models. Since it is impossible to train a separate model for every context, decision trees are used to discover the most important combinations of features that should be modelled. The task of the decision tree is very hard: it must generalize from a very small observed part of the context feature space to the rest. Decision trees also have a major weakness: because they subdivide the model space based on one feature at a time, they cannot directly take advantage of factorial properties. We propose a Dynamic Bayesian Network (DBN) based Mixed Memory Markov Model (MMMM) to provide factorization of the context space. The results of a listening test are provided as evidence that the model successfully learns the factorial nature of this space.
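
To make the idea of factorizing the context space concrete, the following is a minimal sketch of the general mixed-memory form, in which a distribution conditioned on several context features is approximated by a weighted mixture of distributions that each condition on a single feature: P(x | c_1, ..., c_K) = sum over k of w_k * P_k(x | c_k). It is an illustration under stated assumptions, not the model described in the paper: the discrete outcomes, the function names, the toy tables, and the feature cardinalities are all hypothetical, whereas an HMM-based synthesis system would use continuous output distributions.

```python
from math import prod


def full_table_size(context_cardinalities, n_outcomes):
    """Entries in an unfactorized table P(x | c_1, ..., c_K):
    one distribution over outcomes per joint context combination."""
    return prod(context_cardinalities) * n_outcomes


def factorized_size(context_cardinalities, n_outcomes):
    """Entries in the mixture form: one table P_k(x | c_k) per feature,
    plus one mixing weight per feature."""
    per_feature_tables = sum(card * n_outcomes for card in context_cardinalities)
    return per_feature_tables + len(context_cardinalities)


def mixed_memory_prob(x, context, tables, weights):
    """P(x | c_1, ..., c_K) = sum_k weights[k] * tables[k][c_k][x]."""
    return sum(w * table[c][x] for w, table, c in zip(weights, tables, context))


# Hypothetical toy setup: two binary context features, two outcomes.
tables = [
    [[0.9, 0.1], [0.2, 0.8]],  # P_1(x | c_1), rows indexed by c_1
    [[0.6, 0.4], [0.3, 0.7]],  # P_2(x | c_2), rows indexed by c_2
]
weights = [0.5, 0.5]  # mixing weights, assumed to sum to 1

# Probability of outcome 0 given c_1 = 0 and c_2 = 1.
p = mixed_memory_prob(0, (0, 1), tables, weights)

# Hypothetical cardinalities for three context features and three outcomes.
cards = [50, 10, 5]
print(full_table_size(cards, 3), factorized_size(cards, 3))
```

Because each per-feature table is estimated from every observation that shares that feature value, the mixture assigns a probability to context combinations that never occurred in the training data, and its size grows with the sum of the feature cardinalities rather than their product.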
Original language: English
Title of host publication: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 5
ISBN (Print): 978-1-4799-0356-6
Publication status: Published - 2013

Keywords / Materials (for Non-textual outputs)

  • belief networks
  • decision trees
  • hidden Markov models
  • linguistics
  • speech synthesis
  • DBN
  • HMM
  • MMMM
  • TTS system
  • context dependent model
  • context features space
  • decision tree
  • dynamic Bayesian network
  • factorized context modelling
  • linguistic context feature
  • mixed memory Markov model
  • text-to-speech synthesis
  • Bayes methods
  • Context
  • Context modeling
  • Hidden Markov models
  • Markov processes
  • Speech
  • Speech synthesis
  • Dynamic Bayesian Network
  • Mixed Memory Markov Model
  • Text-To-Speech synthesis
  • factorized model
  • maximum likelihood parameter generation

